2025-01-31

GitHub - Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more

Kreuzberg is a Python library offering asynchronous text extraction capabilities from various document formats, including PDFs, images, and office files, with local processing and minimal dependencies. The library provides both single-item and batch processing options, integrating tools like Tesseract OCR and Pandoc for comprehensive format support.

Read comments on news aggregators:

https://news.ycombinator.com/item?id=43057375

Pulse AI Blog - Putting Andrew Ng’s OCR Models to The Test

Andrew Ng's newly released document extraction service shows significant limitations when processing complex financial statements, with high error rates and slow processing times. Tests revealed over 50% hallucinated values and frequent missing data in financial tables, highlighting the challenges of using LLMs for document extraction.

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

A new benchmark evaluates Vision-Language Models against traditional OCR systems for text recognition in video environments, using a dataset of 1,477 annotated frames from diverse sources. Advanced models like Claude-3, Gemini-1.5, and GPT-4o demonstrate superior performance in many scenarios, though challenges with hallucinations and occluded text persist.

OCR4all

OCR4all is a completely free, open-source optical character recognition solution, with no paywalled features and no closed-source code.

Smuggling arbitrary data through an emoji

A technical exploration demonstrates how Unicode variation selectors can be used to encode arbitrary data within any Unicode character, making it possible to hide invisible messages in text or emojis while surviving copy/paste operations.
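The idea can be illustrated with a minimal Python sketch. It is not taken from the article: the function names and the byte-to-selector mapping are assumptions made for illustration. Unicode defines 256 variation selectors (U+FE00 to U+FE0F and U+E0100 to U+E01EF), so one plausible scheme is to map each byte value to one selector and append the selectors after a visible base character.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map a byte (0-255) to one of the 256 variation selectors (assumed mapping)."""
    if b < 16:
        return chr(0xFE00 + b)          # VS1..VS16
    return chr(0xE0100 + (b - 16))      # VS17..VS256


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse of byte_to_selector; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Append one invisible selector per payload byte after the base character."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Collect every variation selector in the text and turn it back into bytes."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    hidden = encode("😊", b"hello")
    print(hidden)           # renders as a single emoji in most environments
    print(decode(hidden))
```

Note that this sketch treats every variation selector in the input as payload, so a text that already contains legitimate selectors (for example, emoji presentation selectors) would decode with extra bytes.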

GitHub - c0stya/trre: Transductive regular expressions

An extension to regular expressions called 'transductive regular expressions' (trre) introduces a new ':' symbol for pattern matching and text modification, offering a more natural approach to text editing than traditional regex. The implementation includes a command-line tool that supports various operations like replacement, deletion, and insertion, utilizing Finite State Transducers for processing.
