The time it takes on our cluster to compute is approximately the same, but if we add GPU’s we should be able to spe up OCR and PDF creation, maybe 10 times, which would help a great deal since we are processing millions of pages a day. File size as well as rendering quickly in browser implementations, have useful functionality (text search, page numbers, cut-and-paste of text), and comply with archival (PDF/A) and accessibility standards (PDF/UA). At the heart of the new PDF generation is the “archive-pdf-tools” Python library, which performs
Mixed Raster Content (MRC) compression, creates a hidden text layer using a modifid
Tesseract PDF renderer that can read hOCR files as input, and ensures the PDFs are compatible with archival standards (VeraPDF is us to verify every PDF that we generate against the archival PDF standards). The MRC compression decomposes special database each image into a background, foreground and foreground mask, heavily compressing (and sometimes downscaling) each layer separately. The mask is compress losslessly, ensuring giving an inside look at the industry’s that the text and lines in an image do not suffer from compression artifacts and look clear. Using this method, we observe a 10x compression factor for most of our books.
The PDFs themselves are creat using the high-performance
Mupdf and pymupdf python library: both projects were supportive and promptly fixe various bugs, which propell our efforts forwards.And best of all, we have expanded our community to include people all over the world that are working line data together to make cultural materials more available. We have a slack channel for OCR researchers and implementers now, that you can join if you would like (to join, drop an email t
We look to contribute software and data
Sets to these projects to help them improve (lead by Merlijn Wajer and Derek Fukumori).Next steps to fulfill the dream of Vanevar Bush’s Memex, Ted Nelson’s Xanadu, Michael Hart’s Project Gutenberg, Tim Berners-Lee’s World Wide Web, Raj Ready’s call for Universal Access to All Knowledge (and now the Internet Archive’s mission statement.