Computer Vision

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

A new benchmark compares Vision-Language Models with traditional OCR systems on text recognition in video, using a dataset of 1,477 annotated frames from diverse sources. Advanced models such as Claude-3, Gemini-1.5, and GPT-4o outperform the traditional OCR systems in many scenarios, though hallucinations and occluded text remain open problems.

FOSDEM 2025 - MapTCHA, the open source CAPTCHA that improves OpenStreetMap

MapTCHA is an open-source CAPTCHA that combines bot prevention with OpenStreetMap (OSM) improvement: users verify AI-predicted building outlines in aerial imagery. The system uses fAIr, an open-source AI mapping system by HOT, and relies on human verification of both known and unknown cases. Responses are aggregated by voting to suggest new locations for OSM mapping.
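
The voting step described above can be illustrated with a minimal sketch. It is not MapTCHA's actual implementation: the `Vote` type, the `aggregate_votes` function, the tile identifiers, and the `min_votes` and `threshold` values are all hypothetical, chosen only to show how per-user answers on unknown cases might be combined into mapping suggestions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One user's answer for one image tile (hypothetical structure)."""
    tile_id: str
    is_building: bool


def aggregate_votes(votes, min_votes=3, threshold=0.7):
    """Return tile ids that enough users confirmed as containing a building.

    min_votes and threshold are assumed values, not taken from MapTCHA.
    """
    yes = Counter()
    total = Counter()
    for vote in votes:
        total[vote.tile_id] += 1
        if vote.is_building:
            yes[vote.tile_id] += 1

    suggested = []
    for tile_id, count in total.items():
        # Require both a minimum number of answers and a clear majority.
        if count >= min_votes and yes[tile_id] / count >= threshold:
            suggested.append(tile_id)
    return sorted(suggested)


votes = [
    Vote("a", True), Vote("a", True), Vote("a", True),
    Vote("b", True), Vote("b", True), Vote("b", False),
    Vote("c", True), Vote("c", True),
]
print(aggregate_votes(votes))
```

Requiring a minimum number of answers as well as a majority share keeps a single response from promoting a tile on its own.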

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

OmniHuman is an AI system that generates realistic human videos in diverse visual and audio styles, supporting various aspect ratios and body proportions. It produces high-quality animations driven by music, speech, or video inputs, handling complex gestures, multiple body poses, and different forms of singing.