Magma: A Foundation Model for Multimodal AI Agents

Magma is a foundation model for multimodal AI agents that can process text, images, and videos while enabling action planning and execution across different domains. The model utilizes Set-of-Mark and Trace-of-Mark techniques for action grounding and planning, demonstrating strong performance in UI navigation, robotics, and video understanding tasks.

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

A new benchmark evaluates Vision-Language Models against traditional OCR systems for text recognition in video environments, using a dataset of 1,477 annotated frames from diverse sources. Advanced models such as Claude-3, Gemini-1.5, and GPT-4o outperform the traditional OCR systems in many scenarios, though hallucinations and occluded text remain open problems.

FOSDEM 2025 - MapTCHA, the open source CAPTCHA that improves OpenStreetMap

MapTCHA is an open-source CAPTCHA that combines bot prevention with OpenStreetMap improvement: users verify AI-predicted building outlines in aerial imagery. Human responses are collected for both known and unknown cases and aggregated through voting to suggest new locations for OSM mapping. The system uses fAIr, an open-source AI mapping system by HOT.
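
The voting step described above can be illustrated with a small sketch. This is not MapTCHA's actual code: the function name `aggregate_votes`, the tile identifiers, and the `min_votes` and `min_agreement` thresholds are all assumptions chosen for illustration.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3, min_agreement=0.7):
    """Return the majority answer for one image tile, or None without consensus.

    votes is a list of booleans (True = "this outline is a building").
    Both thresholds are hypothetical values, not taken from MapTCHA.
    """
    if len(votes) < min_votes:
        return None  # too few responses to trust
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None  # responses are too split

# Hypothetical responses for three "unknown" tiles.
responses = {
    "tile_a": [True, True, True, False],
    "tile_b": [True, False],
    "tile_c": [False, False, False],
}
consensus = {tile: aggregate_votes(v) for tile, v in responses.items()}
```

Under these assumptions, only tiles with a confident `True` consensus would be forwarded as candidate locations for mapping; tiles that return `None` would simply be shown to more users.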

Deep Reinforcement Learning: Pong from Pixels

An exploration of Reinforcement Learning (RL) that implements a Pong-playing AI with Policy Gradients, showing how a neural network can learn complex behavior from raw pixel inputs with minimal preprocessing and few assumptions.
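
Policy Gradient methods typically weight each action by the discounted sum of the rewards that followed it. The sketch below shows only that return calculation, assuming plain Python lists; the helper name `discount_rewards` and the `gamma` value are illustrative choices, and this is not the article's own code.

```python
def discount_rewards(rewards, gamma=0.99):
    """Compute the discounted return at every time step of one episode.

    returns[t] = rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A three-step episode that ends with a reward of +1.
episode_returns = discount_rewards([0.0, 0.0, 1.0], gamma=0.5)
```

Earlier actions receive a smaller share of the final reward, which is how credit is spread back over the moves that led to a point.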

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

OmniHuman is an advanced AI system capable of generating realistic human videos with diverse visual and audio styles, supporting various aspect ratios and body proportions. The system excels in producing high-quality animations driven by music, speech, or video inputs, while handling complex gestures and accommodating multiple body poses and singing forms.
