Python Library

GitHub - Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more

Kreuzberg is a Python library for asynchronous text extraction from PDFs, images, and office files. It processes documents locally with minimal dependencies, supports both single-item and batch extraction, and integrates Tesseract OCR and Pandoc to cover a wide range of formats.
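To make the single-item versus batch distinction concrete, here is a minimal sketch of the general asyncio pattern. It does not use Kreuzberg itself: `extract_text` and `extract_batch` are hypothetical stand-ins that read plain-text files, and the library's real function names and return types should be taken from its own documentation.

```python
import asyncio
from pathlib import Path


async def extract_text(path: Path) -> str:
    """Hypothetical stand-in for an async extractor; NOT Kreuzberg's API."""
    # Run the blocking file read in a worker thread so the event loop stays free.
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def extract_batch(paths: list[Path]) -> list[str]:
    """Extract several files concurrently; results keep the input order."""
    return await asyncio.gather(*(extract_text(p) for p in paths))
```

A caller would run one file with `asyncio.run(extract_text(Path("notes.txt")))` or several with `asyncio.run(extract_batch([...]))`. The batch form lets slow items such as OCR-heavy scans overlap instead of running one after another.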