Text Processing

GitHub - Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more

Kreuzberg is a Python library for asynchronous text extraction from PDFs, images, and office documents. It processes files locally with minimal dependencies, supports both single-item and batch processing, and integrates Tesseract OCR and Pandoc for broad format coverage.
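To make the single-item versus batch distinction concrete, here is a minimal sketch of the general asyncio pattern. It does not use Kreuzberg's API: `extract_text` and `extract_many` are hypothetical stand-ins that only read plain-text files, where a real extractor would dispatch to OCR or a converter.

```python
import asyncio
from pathlib import Path


async def extract_text(path: Path) -> str:
    """Hypothetical single-item extractor (not Kreuzberg's API).

    Reads the file as UTF-8 in a worker thread so the event loop stays free.
    """
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def extract_many(paths: list[Path]) -> list[str]:
    """Hypothetical batch extractor: runs the single-item coroutine concurrently.

    asyncio.gather returns results in the same order as the inputs.
    """
    return list(await asyncio.gather(*(extract_text(p) for p in paths)))
```

Calling `asyncio.run(extract_many([Path("a.txt"), Path("b.txt")]))` would return the contents of both files in input order, assuming those files exist. The batch function is a thin wrapper on purpose: concurrency comes from scheduling the same coroutine many times, not from a separate code path.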

GitHub - c0stya/trre: Transductive regular expressions

Transductive regular expressions (trre) extend regular expressions with a new ':' symbol that combines pattern matching with text modification, which the project presents as a more natural approach to text editing than traditional regex. The implementation includes a command-line tool that supports operations such as replacement, deletion, and insertion, and it uses finite state transducers for processing.
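The idea of pairing input symbols with output symbols can be sketched with a toy transducer. This is not trre's implementation or syntax: it handles only a fixed linear chain of (input, output) pairs, with an empty string standing for "no symbol" on one side, and the names `run_chain` and `transduce` are made up for illustration.

```python
def run_chain(pairs, text, start):
    """Try to follow the chain of (input, output) pairs from text[start].

    Returns (output, end_position) on success, or None if an input symbol
    does not match. A pair with an empty input consumes nothing (insertion);
    a pair with an empty output emits nothing (deletion).
    """
    pos = start
    out = []
    for inp, outp in pairs:
        if inp:
            if not text.startswith(inp, pos):
                return None
            pos += len(inp)
        out.append(outp)
    return "".join(out), pos


def transduce(pairs, text):
    """Scan text left to right, rewriting every place the chain matches.

    Characters that are not part of a match are copied through unchanged.
    """
    out = []
    pos = 0
    while pos < len(text):
        result = run_chain(pairs, text, pos)
        if result is not None and result[1] > pos:
            out.append(result[0])
            pos = result[1]
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transduce([("c", "d"), ("a", "o"), ("t", "g")], "cat sat"))  # replacement: dog sat
print(transduce([("a", "")], "banana"))                            # deletion: bnn
print(transduce([("a", "a"), ("", "!")], "banana"))                # insertion: ba!na!na!
```

A real finite state transducer also supports alternation and repetition, which this chain leaves out. The sketch only shows how replacement, deletion, and insertion fall out of the same input:output pairing.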