From: David Lubkin (lubkin@unreasonable.com)
Date: Wed Sep 13 2000 - 06:12:08 MDT
On 9/12/00, at 11:53 PM, Samantha Atkins wrote:
>Somehow I doubt you actually tried this. The results are not very good
>at all. You would have to position the pages well (no mean trick as
>they aren't exactly 8-1/2 * 11 usually) flip the stack of pages over to
>scan both sides, have the OCR or post processing paste the results
>together in a continuous narrative, post process a lot more with a
>really good dictionary/grammar program to try to fix the 10% or so
>minimum OCR errors likely from the process thus far and still have a
>pretty major editing job to make the results really good.
I have not tried to do this. I have been told by people who routinely
move material on-line that this is the procedure they use.
There are gotchas, but not as many as you list. The pages don't have to
be well-positioned. High-end software can automatically correct for
page orientation. Flipping the stack takes three seconds. Pasting the
results into the right order is also fast. Good OCR will get at least
98% accuracy. And who said the posted result is error-free?
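To put the two accuracy figures in this thread side by side, here is a rough sketch of the arithmetic. It assumes the percentages are per-character rates and that a page holds about 2,000 characters; neither assumption comes from the messages above, and the function name is purely illustrative.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of wrong characters on one page at a given accuracy."""
    return chars_per_page * (1.0 - accuracy)


# Assumed page size, for illustration only.
CHARS_PER_PAGE = 2000

for label, accuracy in (("98% accuracy", 0.98), ("10% errors", 0.90)):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{label}: about {errors:.0f} errors per page")
```

Under those assumptions the gap is roughly 40 versus 200 corrections per page, which is the difference between a proofreading pass and a major editing job.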
On the other hand, they could just be retyping the books.
ProCD was one of the first companies to sell CDs of phone directories.
They tried the OCR approach and found it took too much work to get the
quality they were shooting for. So they hired hundreds of young Chinese
women, paying them pennies an hour, to retype the phone books. Each entry
is typed in by two different people. If the two match, the entry is
accepted. If they differ, a third person, more literate in English,
determines which is correct.
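The double-keying scheme described above can be sketched in a few lines. This is only an illustration of the idea as described here, not ProCD's actual process; the function names and the sample entries are hypothetical.

```python
from typing import Callable, List


def double_key(
    first_pass: List[str],
    second_pass: List[str],
    adjudicate: Callable[[str, str], str],
) -> List[str]:
    """Accept entries both typists agree on; send disagreements to a third person."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both typists must key the same number of entries")
    accepted = []
    for a, b in zip(first_pass, second_pass):
        if a == b:
            accepted.append(a)                  # the two versions match: accept
        else:
            accepted.append(adjudicate(a, b))   # mismatch: third person decides
    return accepted


# Hypothetical sample data: the second typist mistypes one entry.
typist_one = ["Smith, John 555-0101", "Jones, Mary 555-0102"]
typist_two = ["Smith, John 555-0101", "Jomes, Mary 555-0102"]


def pick_first(a: str, b: str) -> str:
    # Stand-in for the human adjudicator; here it simply trusts the first typist.
    return a


print(double_key(typist_one, typist_two, pick_first))
```

The scheme only catches an error when the two typists disagree; if both make the same mistake on the same entry, it is accepted unchecked.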
-- David Lubkin.
______________________________________________________________________________
lubkin@unreasonable.com || Unreasonable Software, Inc. || www.unreasonable.com
a trademark of USI:
> > > > > B e u n r e a s o n a b l e .
______________________________________________________________________________