Commit 1df3cc6

Updating tests for OCR cleaning
1 parent 41e257a commit 1df3cc6

6 files changed

Lines changed: 30 additions & 27 deletions

lib/docsplit/text_cleaner.rb

Lines changed: 9 additions & 7 deletions
@@ -20,10 +20,10 @@ class TextCleaner
     NEWLINE = /[\r\n]/
     ALNUM = /[a-z0-9]/i
     PUNCT = /[^a-z0-9\s]/i
-    REPEAT = /(.)\1{2,}/
+    REPEAT = /([^0-9])\1{2,}/
     UPPER = /[A-Z]/
     LOWER = /[a-z]/
-    ACRONYM = /^\(?[A-Z]+('?s|[.,])?\)?$/
+    ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/
     ALL_ALPHA = /^[a-z]+$/i
     CONSONANT = /(^y|[bcdfghjklmnpqrstvwxz])/i
     VOWEL = /([aeiou]|y$)/i
@@ -55,14 +55,16 @@ def clean(text)

     # Is a given word OCR garbage?
     def garbage(w)
-      # More than 20 bytes in length.
-      (w.length > 20) ||
+      acronym = w =~ ACRONYM
+
+      # More than 30 bytes in length.
+      (w.length > 30) ||

       # If there are three or more identical characters in a row in the string.
       (w =~ REPEAT) ||

       # More punctuation than alpha numerics.
-      (w.scan(ALNUM).length < w.scan(PUNCT).length) ||
+      (!acronym && (w.scan(ALNUM).length < w.scan(PUNCT).length)) ||

       # Ignoring the first and last characters in the string, if there are three or
       # more different punctuation characters in the string.
@@ -73,14 +75,14 @@ def garbage(w)

       # Number of uppercase letters greater than lowercase letters, but the word is
       # not all uppercase + punctuation.
-      ((w.scan(UPPER).length > w.scan(LOWER).length) && (w !~ ACRONYM)) ||
+      (!acronym && (w.scan(UPPER).length > w.scan(LOWER).length)) ||

       # Single letters that are not A or I.
       (w.length == 1 && (w =~ ALL_ALPHA) && (w !~ SINGLETONS)) ||

       # All characters are alphabetic and there are 8 times more vowels than
       # consonants, or 8 times more consonants than vowels.
-      ((w.length > 2 && (w =~ ALL_ALPHA) && (w !~ ACRONYM)) &&
+      (!acronym && (w.length > 2 && (w =~ ALL_ALPHA)) &&
       (((vows = w.scan(VOWEL).length) > (cons = w.scan(CONSONANT).length) * 8) ||
       (cons > vows * 8)))
     end
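
Note on the REPEAT and ACRONYM changes: the sketch below runs the old and new patterns from this hunk against a few sample tokens to show what the commit changes in practice. The `hit?` helper and the `SAMPLES` list are hypothetical and exist only for this illustration; the token 'aaaa' is made up, and the others appear in the corrosion fixtures.

```ruby
# Patterns copied from the text_cleaner.rb hunk above.
OLD_REPEAT  = /(.)\1{2,}/
NEW_REPEAT  = /([^0-9])\1{2,}/
OLD_ACRONYM = /^\(?[A-Z]+('?s|[.,])?\)?$/
NEW_ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/

# Hypothetical helper, not part of Docsplit: does the pattern match the word?
def hit?(pattern, word)
  !(word =~ pattern).nil?
end

# Sample tokens; 'aaaa' is invented, the rest occur in the fixtures.
SAMPLES = ['$1,000,000', 'aaaa', '6B', '(PHMSA)', 'U.S.C,']

SAMPLES.each do |w|
  puts "#{w.ljust(12)} repeat #{hit?(OLD_REPEAT, w)} -> #{hit?(NEW_REPEAT, w)}, " \
       "acronym #{hit?(OLD_ACRONYM, w)} -> #{hit?(NEW_ACRONYM, w)}"
end
```

With the new patterns, a run of identical digits such as the zeros in '$1,000,000' no longer trips REPEAT, and tokens that mix capitals with digits or internal periods ('6B', 'U.S.C,') now count as acronyms, which exempts them from the punctuation, case, and vowel/consonant checks in `garbage`.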

lib/docsplit/text_extractor.rb

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ def extract_options(options)
       @pages = options[:pages]
       @force_ocr = options[:ocr] == true
       @forbid_ocr = options[:ocr] == false
-      @clean_ocr = options[:clean]
+      @clean_ocr = !(options[:clean] == false)
     end

   end
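
Note on the `@clean_ocr` change: the sketch below compares the old and new assignments for the possible values of `options[:clean]`. The two helper methods are hypothetical wrappers around the expressions in the hunk, not Docsplit API.

```ruby
# Hypothetical wrappers around the old and new right-hand sides.
def old_clean_flag(options)
  options[:clean]
end

def new_clean_flag(options)
  !(options[:clean] == false)
end

[{}, { clean: true }, { clean: false }, { clean: nil }].each do |opts|
  puts "#{opts.inspect}: old=#{old_clean_flag(opts).inspect} new=#{new_clean_flag(opts).inspect}"
end
```

The effect is that cleaning becomes opt-out: an omitted or nil `:clean` used to leave `@clean_ocr` falsy, whereas now only an explicit `false` disables it.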
Lines changed: 6 additions & 6 deletions
@@ -1,26 +1,26 @@
-©
+
 U.S. Deponmem 901 rachel Street, suns 452
 of Tronsportotion Kansas City, Mo 64106-2641
 Pipeline una
 Hazardous Materials Safety
 Administration
 WARNING LETTER
-CERTIFIED MAIL - RETURN RECEIPT RE! QUESTED
+CERTIFIED MAIL RETURN RECEIPT QUESTED
 January 21, 2010
 Mr. Terry McGill, President
-Enbridge Energy Paitners, L,P,
+Enbridge Energy Paitners,
 1100 Louisiana, Suite 3300
 Houston, Texas 77002
-. CPF 3-2010-5002W
+. CPF
 Dear Mr. McGill:
 On October 6-8, 2008, October 28, 2008, and January 21-22, 2009, a representative of the
 Pipeline and Hazardous Materials Safety Administration (PHMSA) pursuant to Chapter 601 of
 49 United States Code inspected your facilities associated with the Griffith Unit in Griffith,
 Indiana, and surrounding locations
 As a result ofthe inspection, it appears that you have committed a probable violation of the
 Pipeline Safety Regulations, Title 49, Code of Federal Regulations. The items inspected and
-the probable violation(s) are:
+the probable violation(s) are:
 1. 195.579 What must I do to mitigate internal corrosion?
-(b) Inhibitors. If you use corrosion inhibitors to mitigate internal corrosion, you
+Inhibitors. If you use corrosion inhibitors to mitigate internal corrosion, you
 must--

test/fixtures/corrosion/corrosion_2.txt

Lines changed: 12 additions & 12 deletions
@@ -5,35 +5,35 @@ the inhibitors in mitigating internal corrosion; and
 (3) Examine the coupons or other monitoring equipment at least twice each
 calendar year, but with intervals not exceeding 7 1/2 months.
 Internal corrosion monitoring was discontinued on the five hydrogen permeation monitors
-(Beta F oils) installed on Line 6B. Two manuallydnterrogated monitors were discontinued in
-May 2006. One remotely—interro gated monitor was discontinued in January 2006, and the
-other two remotely—interro gated monitors were discontinued in October 2007. Enbridge
+(Beta oils) installed on Line 6B. Two monitors were discontinued in
+May 2006. One gated monitor was discontinued in January 2006, and the
+other two gated monitors were discontinued in October 2007. Enbridge
 representatives stated the monitoring was discontinued due to
 "communication/instrumentation problems."
 Enbridge is in the process of implementing an alternative method of internal corrosion
 monitoring on Line 6B utilizing a technology referred to as Electrical Resistance Tomography
 (FSMPIT), however, it is not expected to be implemented on Line 6B until sometime during
 the first half of 2010. In the interim, Enbridge provided the following information as
 demonstration that the internal corrosion threat is being properly managed:
-a comprehensive report related to the internal corrosion mitigation and
+a comprehensive report related to the internal corrosion mitigation and
 monitoring program for their heavy oil pipeline system
-¤ repair sleeve installations (which require circumferential non-destructive
+repair sleeve installations (which require circumferential non-destructive
 testing)
-¤ inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
+inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
 ultrasonic inspection of the trap floor between the 5:00 and 7:00 positions)
-¤ detailed pipe examinations at in-line inspection indications
-e records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
+detailed pipe examinations at in-line inspection indications
+records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
 which sees only fluid flow from Line 6B
 The information provided does not demonstrate compliance with the above regulation. Line
-6B has been subject to a batch chemical treatment program to inhibit internal corrosion for
-several years. As required by l95.579(b), Line 6B must have coupons or other monitoring
+6B has been subject to a batch chemical treatment program to inhibit internal corrosion for
+several years. As required by Line 6B must have coupons or other monitoring
 equipment to determine the effectiveness of the inhibitor program, and the coupons or other
 monitoring equipment nlust be examined at least twice each calendar year, at intervals not to
 exceed 7-l/2 months. PHMSA acknowledges the positive steps being taken to improve
-Enbridge’s internal corrosion mitigation and monitoring program. However, the transition
+internal corrosion mitigation and monitoring program. However, the transition
 from one technology to another must be implemented in a manner that ensures continued
 compliance with the regulations.
-Under 49 United States Code, § 60122, you are subject to a civil penalty not to exceed
+Under 49 United States Code, 60122, you are subject to a civil penalty not to exceed
 $100,000 for each violation for each day the violation persists up to a maximum of $1,000,000
 for any related series of violations. We have reviewed the circumstances and supporting
 documents involved in this case, and have decided not to conduct additional enforcement

test/fixtures/corrosion/corrosion_4.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ action or penalty assessment proceedings at this time. We advise you to correct
 identified in this letter. Failure to do so will result in Enbridge being subject to additional
 enforcement action.
 No reply to this letter is required. If` you choose to reply, in your correspondence please refer
-to CPF 3-2010-5002W. Be advised that all material you submit in response to this
+to CPF Be advised that all material you submit in response to this
 enforcement action is subject to being made publicly available. If` you believe that any portion
 of your responsive material qualifies for confidential treatment under 5 U.S.C, 552(b), along
 with the complete original document you must provide a second copy of the doctunent with the

test/unit/test_extract_text.rb

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ def test_ocr_extraction
     assert Dir["#{OUTPUT}/*.txt"].length == 4
     4.times do |i|
       file = "corrosion_#{i + 1}.txt"
+      # File.open("test/fixtures/corrosion/#{file}", "w+") {|f| f.write(File.read("#{OUTPUT}/#{file}")) }
       assert File.read("#{OUTPUT}/#{file}") == File.read("test/fixtures/corrosion/#{file}")
     end
   end
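
Note on the commented-out line in this hunk: it appears to be a one-off way to overwrite the stored fixtures with fresh extraction output when the cleaner's behaviour changes. A minimal sketch of the same idea as a standalone helper follows; `refresh_fixtures` and its parameters are hypothetical and not part of the Docsplit test suite.

```ruby
require "fileutils"

# Hypothetical helper: copy freshly extracted corrosion_N.txt files from the
# test output directory over the stored fixtures, one file per page.
def refresh_fixtures(output_dir, fixture_dir, count = 4)
  count.times do |i|
    file = "corrosion_#{i + 1}.txt"
    FileUtils.cp(File.join(output_dir, file), File.join(fixture_dir, file))
  end
end

# Assumed usage from the test directory, where OUTPUT is the constant used above:
#   refresh_fixtures(OUTPUT, "test/fixtures/corrosion")
```

Keeping this outside the test body avoids the risk of a test that silently rewrites its own expected output if the comment marker is ever removed by accident.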