Commit f70a607

Docsplit 0.5.0
1 parent 1df3cc6 commit f70a607

5 files changed

Lines changed: 28 additions & 17 deletions

docsplit.gemspec

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 Gem::Specification.new do |s|
   s.name = 'docsplit'
-  s.version = '0.4.1' # Keep version in sync with docsplit.rb
-  s.date = '2010-8-23'
+  s.version = '0.5.0' # Keep version in sync with docsplit.rb
+  s.date = '2010-10-18'

   s.homepage = "http://documentcloud.github.com/docsplit/"
   s.summary = "Break Apart Documents into Images, Text, Pages and PDFs"

index.html

Lines changed: 13 additions & 4 deletions
@@ -98,7 +98,7 @@ <h1>Doc<sub style="font-size:150%;">&#9889;</sub>split</h1>
 (title, author, number of pages...)
 </p>

-<p>Docsplit is currently at <a href="http://rubygems.org/gems/docsplit">version 0.4.1</a>.</p>
+<p>Docsplit is currently at <a href="http://rubygems.org/gems/docsplit">version 0.5.0</a>.</p>

 <p>
 <i>Docsplit is an open-source component of <a href="http://documentcloud.org/">DocumentCloud</a>.</i>
@@ -192,15 +192,17 @@ <h2 id="usage">Usage</h2>
 Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])</pre>

 <p class="break">
-<b class="header">text</b><code>--pages --ocr --no-ocr</code>
+<b class="header">text</b><code>--pages --ocr --no-ocr --no-clean</code>
 <span class="alias">Ruby: <b>extract_text</b></span>
 <br />
 Extract the complete <b>UTF-8</b>-encoded plain text of a document to a
 single file. If you'd like to extract the text for each page separately,
 pass <tt>--pages all</tt>. You can use the <tt>--ocr</tt> and <tt>--no-ocr</tt>
 flags to force OCR, or disable it, respectively. By default (if Tesseract is installed)
 Docsplit will OCR the text of each page for which it fails to extract text
-directly from the document.
+directly from the document. Docsplit will also attempt to clean up garbage
+characters in the OCR'd text &mdash; to disable this, pass the
+<tt>--no-clean</tt> flag.
 </p>
 <pre>
 docsplit text path/to/doc.pdf --pages all</pre>
@@ -210,7 +212,7 @@ <h2 id="usage">Usage</h2>

 <p class="break">
 <b class="header">pages</b><code>--pages</code>
-<span class="alias">Ruby: <b>extract_text</b></span>
+<span class="alias">Ruby: <b>extract_pages</b></span>
 <br />
 Burst apart a document into single-page PDFs. Use <tt>--pages</tt> to
 specify the individual pages (or ranges of pages) you'd like to generate.
@@ -279,6 +281,13 @@ <h2 id="internals">Internals</h2>

 <h2 id="changes">Change Log</h2>

+<p>
+<b class="header">0.5.0</b><br />
+Added a <tt>Docsplit::TextCleaner</tt> class which is used to post-process
+OCR'd text, and remove garbage characters that are created when Tesseract
+encounters non-english text. To disable the cleanup, pass <tt>--no-clean</tt>.
+</p>
+
 <p>
 <b class="header">0.4.1</b><br />
 Upgraded the JODConverter dependency for PDF conversion via OpenOffice to

lib/docsplit.rb

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # The Docsplit module delegates to the Java PDF extractors.
 module Docsplit

-  VERSION = '0.4.1' # Keep in sync with gemspec.
+  VERSION = '0.5.0' # Keep in sync with gemspec.

   ROOT = File.expand_path(File.dirname(__FILE__) + '/..')
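
Both docsplit.gemspec and lib/docsplit.rb carry a "keep in sync" comment next to the version string. A minimal sketch of how that invariant could be checked is shown below. The helper name `extract_version` and the two constants are hypothetical and are not part of Docsplit; the two lines are copied from the diffs above rather than read from disk.

```ruby
# The two version lines as they appear after this commit.
GEMSPEC_LINE = "s.version = '0.5.0' # Keep version in sync with docsplit.rb"
MODULE_LINE  = "VERSION = '0.5.0' # Keep in sync with gemspec."

# Hypothetical helper: pull the first quoted x.y.z version out of a line.
# Returns nil when no such version string is present.
def extract_version(line)
  line[/'(\d+\.\d+\.\d+)'/, 1]
end

gemspec_version = extract_version(GEMSPEC_LINE)
module_version  = extract_version(MODULE_LINE)

# True when both files declare the same version.
in_sync = !gemspec_version.nil? && gemspec_version == module_version
```

In a real project the same comparison would presumably read the two files, or the gemspec would reference the VERSION constant directly so there is only one place to update.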

lib/docsplit/text_cleaner.rb

Lines changed: 5 additions & 3 deletions
@@ -1,3 +1,4 @@
+require 'iconv'
 require 'strscan'

 module Docsplit
@@ -19,11 +20,11 @@ class TextCleaner
     SPACE = /\s+/
     NEWLINE = /[\r\n]/
     ALNUM = /[a-z0-9]/i
-    PUNCT = /[^a-z0-9\s]/i
+    PUNCT = /[[:punct:]]/i
     REPEAT = /([^0-9])\1{2,}/
     UPPER = /[A-Z]/
     LOWER = /[a-z]/
-    ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/
+    ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,:]?$/
     ALL_ALPHA = /^[a-z]+$/i
     CONSONANT = /(^y|[bcdfghjklmnpqrstvwxz])/i
     VOWEL = /([aeiou]|y$)/i
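
The PUNCT change in this hunk narrows what counts as punctuation. The old pattern matched anything that was not an ASCII letter, digit, or whitespace, so accented letters were counted as punctuation; the POSIX bracket class matches only punctuation characters. A small sketch of the difference, assuming a current Ruby with UTF-8 strings (the helper and sample string are illustrative and not taken from Docsplit):

```ruby
# Old and new definitions of PUNCT, copied from the hunk above.
OLD_PUNCT = /[^a-z0-9\s]/i
NEW_PUNCT = /[[:punct:]]/i

# Hypothetical helper (not part of Docsplit): count matches of a pattern.
def count_matches(pattern, text)
  text.scan(pattern).length
end

# "café, naïve!" written with escapes so the source stays ASCII.
SAMPLE = "caf\u00e9, na\u00efve!"

# The old pattern counts the two accented letters plus "," and "!".
old_count = count_matches(OLD_PUNCT, SAMPLE)

# The new pattern counts only "," and "!".
new_count = count_matches(NEW_PUNCT, SAMPLE)
```

Since this commit also coerces text to ASCII before scanning, accented letters may never reach PUNCT inside `clean`; the sketch only isolates what the regex change does on its own.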
@@ -33,8 +34,9 @@ class TextCleaner
     SINGLETONS = /^[AaIi]$/

     # For the time being, `clean` uses the regular StringScanner, and not the
-    # multibyte-aware version.
+    # multibyte-aware version, coercing to ASCII first.
     def clean(text)
+      text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
       scanner = StringScanner.new(text)
       cleaned = []
       spaced = false
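
The added Iconv call coerces UTF-8 input to ASCII before the scanner runs. Iconv was removed from Ruby's standard library in Ruby 2.0, so the line as committed needs the separate iconv gem on current Rubies. The sketch below shows a rough stand-in using `String#encode`. The method name `coerce_to_ascii` is hypothetical, and the behavior is not identical: this version drops characters with no ASCII equivalent, whereas `//translit` asks iconv to approximate them where it can.

```ruby
# Hypothetical stand-in for the Iconv call in the diff above.
# Assumption: dropping non-ASCII characters is acceptable; unlike
# '//translit', no attempt is made to approximate them.
def coerce_to_ascii(text)
  text.encode('US-ASCII', invalid: :replace, undef: :replace, replace: '')
end

# "naïve café" written with escapes so the source stays ASCII.
ascii = coerce_to_ascii("na\u00efve caf\u00e9")
```

Here `ascii` contains only ASCII characters, so it can be handed to the regular `StringScanner` without multibyte concerns, which is the purpose the updated comment in the hunk describes.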

test/fixtures/corrosion/corrosion_2.txt

Lines changed: 7 additions & 7 deletions
@@ -6,8 +6,8 @@ the inhibitors in mitigating internal corrosion; and
 calendar year, but with intervals not exceeding 7 1/2 months.
 Internal corrosion monitoring was discontinued on the five hydrogen permeation monitors
 (Beta oils) installed on Line 6B. Two monitors were discontinued in
-May 2006. One gated monitor was discontinued in January 2006, and the
-other two gated monitors were discontinued in October 2007. Enbridge
+May 2006. One remotely-interro gated monitor was discontinued in January 2006, and the
+other two remotely-interro gated monitors were discontinued in October 2007. Enbridge
 representatives stated the monitoring was discontinued due to
 "communication/instrumentation problems."
 Enbridge is in the process of implementing an alternative method of internal corrosion
@@ -17,11 +17,11 @@ the first half of 2010. In the interim, Enbridge provided the following informat
 demonstration that the internal corrosion threat is being properly managed:
 a comprehensive report related to the internal corrosion mitigation and
 monitoring program for their heavy oil pipeline system
-repair sleeve installations (which require circumferential non-destructive
+repair sleeve installations (which require circumferential non-destructive
 testing)
-inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
+inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
 ultrasonic inspection of the trap floor between the 5:00 and 7:00 positions)
-detailed pipe examinations at in-line inspection indications
+detailed pipe examinations at in-line inspection indications
 records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
 which sees only fluid flow from Line 6B
 The information provided does not demonstrate compliance with the above regulation. Line
@@ -30,10 +30,10 @@ several years. As required by Line 6B must have coupons or other monitoring
 equipment to determine the effectiveness of the inhibitor program, and the coupons or other
 monitoring equipment nlust be examined at least twice each calendar year, at intervals not to
 exceed 7-l/2 months. PHMSA acknowledges the positive steps being taken to improve
-internal corrosion mitigation and monitoring program. However, the transition
+Enbridge's internal corrosion mitigation and monitoring program. However, the transition
 from one technology to another must be implemented in a manner that ensures continued
 compliance with the regulations.
-Under 49 United States Code, 60122, you are subject to a civil penalty not to exceed
+Under 49 United States Code, SS 60122, you are subject to a civil penalty not to exceed
 $100,000 for each violation for each day the violation persists up to a maximum of $1,000,000
 for any related series of violations. We have reviewed the circumstances and supporting
 documents involved in this case, and have decided not to conduct additional enforcement
