
Re: MCM - Digest V1 #287



On Tue, 2 May 2000 nfrank@mindspring.com wrote:

> why not just create pdf files?  

PDFs are easy to create from a document source, e.g. a Word file.  

But we don't have document sources for TAG, right?  All we have is the
physical magazine from the printers.  Not a lot to go on.  We could still
make PDFs by creating a Word document from the paper magazine.  There are
two ways to do this, and both are nasty:

1) Scan in the original magazine, and just make each page of the Word
document a "picture" of the original page.  The resulting document is huge
and looks kind of like a fax with the pages all crooked.  It does,
however, preserve the original look and layout.  For an example of this,
see a single page at

http://www.aquatic-gardeners.org/page04.jpg

Each physical page of TAG takes about 150k as a JPEG.  10 years of TAG,
saved as JPEGs, takes up maybe 1/2 a CD-ROM. 

When put into a Word document or PDF, this expands greatly, perhaps
tenfold!  Such documents would NOT easily fit on a CD-ROM.
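
As a rough sanity check on these numbers, here is a minimal Python sketch.
The 150k per page, the "1/2 a CD-ROM" and the tenfold expansion come from
the figures above; the 650 MB CD capacity is an assumption, and the function
and parameter names are made up for illustration.

```python
import math

def estimate_storage(kb_per_page=150,      # size of one scanned page as a JPEG
                     cd_capacity_mb=650,   # ASSUMPTION: standard 650 MB CD-ROM
                     jpeg_fraction_of_cd=0.5,  # "maybe 1/2 a CD-ROM"
                     expansion_factor=10):     # "perhaps tenfold"
    """Back-of-envelope sizes for the JPEG archive and the Word/PDF version."""
    jpeg_mb = cd_capacity_mb * jpeg_fraction_of_cd
    # Page count implied by the archive size (1 MB taken as 1000 KB).
    implied_pages = round(jpeg_mb * 1000 / kb_per_page)
    embedded_mb = jpeg_mb * expansion_factor
    cds_needed = math.ceil(embedded_mb / cd_capacity_mb)
    return {
        "jpeg_mb": jpeg_mb,
        "implied_pages": implied_pages,
        "embedded_mb": embedded_mb,
        "cds_needed": cds_needed,
    }

if __name__ == "__main__":
    for name, value in estimate_storage().items():
        print(f"{name}: {value}")
```

Under those assumptions the plain JPEGs fit on one disc with room to spare,
while the Word/PDF version would need several discs.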


OK, method #2 is that we do all of method #1 above, but THEN take the
scanned images (such as the example above), and reconstruct the text of the
original article.  This is more complicated, requiring sophisticated OCR
software and lots of massaging so the text doesn't look like a ransom
note.  As I mentioned yesterday, this takes about two hours per issue of
TAG (whereas just scanning the issue in takes about 1/2 hour and can be
done while watching TV or talking on the phone).  It generally does not
preserve the original layout (which to me is actually not a problem,
especially for archiving on the web).  The sparse illustrations are added
back in as graphics, but the resultant file is TINY compared to the other,
and can be put on the web, edited, searched, archived as a PDF, whatever.


I will definitely do the first part, because we'll then have a true
archive of TAG in its original layout.  Last night I got volume 4 scanned
in 3 hours.  If I work steadily on this, I can quickly get all 9 and a
half years done.  But the OCR phase I may either just do very, very slowly,
or save only for articles to be archived on the website.
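
To put the two phases side by side, here is a rough Python sketch of the
total time involved.  The per-issue times, the 3 hours for volume 4 and the
9 and a half years are from this message; that each year is one volume, and
that volume 4 is a typical volume, are assumptions, and the function name is
made up for illustration.

```python
def estimate_hours(years=9.5,                # "all 9 and a half years"
                   volume_scan_hours=3.0,    # volume 4 took 3 hours to scan
                   scan_hours_per_issue=0.5, # about 1/2 hour per issue
                   ocr_hours_per_issue=2.0): # about two hours per issue
    """Rough totals for scanning versus OCR cleanup of the whole run."""
    # ASSUMPTION: one volume per year, and volume 4 is a typical volume,
    # so the issue count per volume follows from the scanning time.
    issues_per_volume = volume_scan_hours / scan_hours_per_issue
    total_issues = years * issues_per_volume
    return {
        "issues_per_volume": issues_per_volume,
        "total_issues": total_issues,
        "scan_hours": total_issues * scan_hours_per_issue,
        "ocr_hours": total_issues * ocr_hours_per_issue,
    }

if __name__ == "__main__":
    for name, value in estimate_hours().items():
        print(f"{name}: {value}")
```

If the assumptions hold, the OCR phase costs four times as many hours as the
scanning, which is why it is the part that may go slowly.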

  - Erik

-- 
Erik Olson
erik at thekrib dot com
