2025-02-10

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

A new benchmark evaluates Vision-Language Models against traditional OCR systems for text recognition in video environments, using a dataset of 1,477 annotated frames from diverse sources. Advanced models like Claude-3, Gemini-1.5, and GPT-4o demonstrate superior performance in many scenarios, though challenges with hallucinations and occluded text persist.


Read comments on news aggregators:

https://news.ycombinator.com/item?id=43045801

Pulse AI Blog - Putting Andrew Ng’s OCR Models to The Test

Andrew Ng's newly released document extraction service shows significant limitations when processing complex financial statements, with high error rates and slow processing times. Tests revealed over 50% hallucinated values and frequent missing data in financial tables, highlighting the challenges of using LLMs for document extraction.

Introducing DeepSearcher: A Local Open Source Deep Research

DeepSearcher is an open-source research agent that builds upon previous work by adding features like conditional execution flow, query routing, and improved interfaces. The system leverages SambaNova's custom hardware for faster inference with the DeepSeek-R1 model, demonstrating advanced concepts in AI research automation through a four-step process of question definition, research, analysis, and synthesis.

Google Co-Scientist AI cracks superbug problem in two days! — because it had been fed the team’s previous paper with the answer in it

Google's Co-Scientist AI tool, powered by the Gemini LLM, made headlines for supposedly solving a superbug problem in 48 hours, but it was later revealed that the solution was derived from previously published research. Similar patterns of overstated achievements were found in Google's other AI research claims, including drug discovery and materials synthesis.

Please Commit More Blatant Academic Fraud

A critical analysis of academic fraud in AI research argues that explicit fraud could paradoxically improve scientific standards by forcing greater scrutiny and skepticism. The author suggests that prevalent subtle fraud has become normalized in academia, leading to widespread publication of papers without scientific merit. The piece advocates for intentional academic misconduct as a way to expose and ultimately reform the field's compromised research practices.

GitHub - vlm-run/vlmrun-hub: A hub for various industry-specific schemas to be used with VLMs.

VLM Run Hub offers pre-defined Pydantic schemas for extracting structured data from visual content using Vision Language Models, featuring industry-specific templates and automatic data validation. The platform supports multiple VLM providers and includes comprehensive documentation for seamless integration across various use cases.

Magma: A Foundation Model for Multimodal AI Agents

Magma is a foundation model for multimodal AI agents that can process text, images, and videos while enabling action planning and execution across different domains. The model utilizes Set-of-Mark and Trace-of-Mark techniques for action grounding and planning, demonstrating strong performance in UI navigation, robotics, and video understanding tasks.

Accelerating scientific breakthroughs with an AI co-scientist

Google introduces an AI co-scientist system built with Gemini 2.0, designed to generate novel research hypotheses and accelerate scientific discoveries through multi-agent collaboration. The system successfully validated predictions in biomedical applications, including drug repurposing and antimicrobial resistance research. Access to the system will be available through a Trusted Tester Program for research organizations.

GitHub - Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more

Kreuzberg is a Python library offering asynchronous text extraction capabilities from various document formats, including PDFs, images, and office files, with local processing and minimal dependencies. The library provides both single-item and batch processing options, integrating tools like Tesseract OCR and Pandoc for comprehensive format support.

OCR4all

OCR4all provides a completely free, open-source optical character recognition solution without any paywalled features or private code restrictions.

FOSDEM 2025 - MapTCHA, the open source CAPTCHA that improves OpenStreetMap

MapTCHA is an open-source CAPTCHA that combines bot prevention with OpenStreetMap improvement: users verify AI-predicted building outlines in aerial imagery. Human verification is used to validate both known and unknown cases, and the responses are aggregated through voting to suggest new locations for OSM mapping. The system uses fAIr, an open-source AI mapping system by HOT (Humanitarian OpenStreetMap Team).
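The voting step mentioned above might look roughly like the following. This is a minimal sketch and not MapTCHA's actual code: the function name `aggregate_votes`, the `min_votes` and `threshold` parameters, and the label strings are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_votes(votes, min_votes=3, threshold=0.6):
    """Combine yes/no answers for one predicted building outline.

    votes      -- list of booleans, True meaning "this outline is a building"
    min_votes  -- hypothetical minimum number of answers before deciding
    threshold  -- hypothetical share of answers the winning side must reach
    """
    # Too few answers: do not decide yet.
    if len(votes) < min_votes:
        return "undecided"

    # Find the most common answer and how often it was given.
    label, top_count = Counter(votes).most_common(1)[0]

    # Accept the majority answer only if it is clear enough.
    if top_count / len(votes) >= threshold:
        return "building" if label else "not_building"
    return "undecided"


if __name__ == "__main__":
    print(aggregate_votes([True, True, True, False]))
    print(aggregate_votes([True, False]))
```

Outlines that end up as "building" would be the candidates to suggest for mapping, while "undecided" ones would simply be shown to more users.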
