論文

368件

arXiv / HuggingFace Daily Papersから日本語要約付きでお届け

論文深掘り Hugging Face 2026-05-11 HF ↑12

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely re...

#agent#alignment#llm

論文深掘り Hugging Face 2026-04-26 HF ↑71

World-R1: テキストから動画生成における3D制約の強化学習による整合

「RLで3D整合動画生成」が自動運転・ロボ向け合成データ生成コストを大幅に下げるかもしれない

テキストから動画を生成する基盤モデル（video foundation model）は優れた映像合成能力を持つ一方、幾何学的不整合（geometric inconsistency）という課題を抱えている。既存手法はアーキテクチャ改修により3D事前知識（3D prior）を注入しようとするが、計算コストが高くスケーラビリティに限界がある。本研究ではWorld-R1を提案し、強化学習（reinforcement learning）を通じて動画生成と3D制約を整合させるフレームワークを構築した。世界シミュレーション向けの専用純テキストデータセットを新たに整備し、Flow-GRPOを用いて事前学習済み3D基盤モデルおよびビジョン言語モデル（VLM）からのフィードバックでアーキテクチャを変更せずに構造的整合性を強制する。さらに周期的分離学習戦略（periodic decoupled training strategy）で剛体的幾何整合性と動的シーンの流動性のバランスを取った。評価の結果、元モデルの視覚品質を維持しつつ3D一貫性を大幅に向上させ、動画生成とスケーラブルな世界シミュレーションの橋渡しに貢献するとしている。

#rl#alignment#benchmark

論文深掘り Hugging Face 2026-05-31 HF ↑9

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large coll...

#agent#benchmark#rl#multimodal

論文深掘り Hugging Face 2026-05-31 HF ↑41

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400...

#agent#benchmark#llm

論文深掘り Hugging Face 2026-05-27 HF ↑76

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment...

#agent#alignment#rl

論文深掘り Hugging Face 2026-05-27 HF ↑35

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a...

#diffusion#fine-tuning

論文深掘り Hugging Face 2026-05-27 HF ↑45

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision...

#robotics#benchmark

論文 Hugging Face 2026-05-26 HF ↑34

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introd...

#agent#fine-tuning

論文深掘り Hugging Face 2026-05-26 HF ↑32

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforce...

#llm#rl#benchmark

論文深掘り Hugging Face 2026-05-26 HF ↑32

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work...

#llm#alignment#benchmark

論文深掘り Hugging Face 2026-05-26 HF ↑63

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default)...

#agent#multimodal#rl#benchmark

論文 Hugging Face 2026-05-26 HF ↑97

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously wit...

#agent#diffusion#coding#robotics

論文 Hugging Face 2026-05-26 HF ↑52

From Pixels to Words -- Towards Native One-Vision Models at Scale

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impre...

#multimodal#alignment

論文 Hugging Face 2026-05-20 HF ↑48

ACC: Compiling Agent Trajectories for Long-Context Training

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and r...

#agent#llm#fine-tuning#benchmark

論文深掘り Hugging Face 2026-05-20 HF ↑30

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous...

#llm#multimodal#benchmark

論文深掘り Hugging Face 2026-05-20 HF ↑62

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral...

#llm#benchmark#multimodal#agent

論文 Hugging Face 2026-05-20 HF ↑4

Diversed Model Discovery via Structured Table Discovery

Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exp...

#alignment#benchmark

論文深掘り Hugging Face 2026-05-20 HF ↑10

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. De...

#coding#benchmark

論文深掘り Hugging Face 2026-05-19 HF ↑33

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by do...

#agent#llm#rl#multimodal#fine-tuning

論文深掘り Hugging Face 2026-05-18 HF ↑44

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogene...

#rl#alignment#benchmark

論文深掘り Hugging Face 2026-05-18 HF ↑31

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation mod...

#multimodal#diffusion#rl#benchmark

論文深掘り Hugging Face 2026-05-17 HF ↑43

AI for Auto-Research: Roadmap & User Guide

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity...

#agent#benchmark#llm#coding

論文深掘り Hugging Face 2026-05-13 HF ↑13

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-end...

#coding#llm#agent#benchmark

論文 Hugging Face 2026-05-13 HF ↑31

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but ...

#agent#llm

論文深掘り Hugging Face 2026-05-13 HF ↑52

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that gen...

#multimodal#agent#benchmark

論文深掘り Hugging Face 2026-05-13 HF ↑57

Self-Distilled Agentic Reinforcement Learning

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher...

#agent#rl#llm#benchmark

論文 Hugging Face 2026-05-13 HF ↑44

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-Worl...

#diffusion#benchmark

論文深掘り Hugging Face 2026-05-12 HF ↑30

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse...

#benchmark#rl#multimodal

論文深掘り Hugging Face 2026-05-12 HF ↑30

Qwen-Image-VAE-2.0 Technical Report

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Conne...

#benchmark#diffusion#alignment#coding

論文深掘り Hugging Face 2026-05-12 HF ↑60

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, p...

#benchmark#multimodal#agent

論文深掘り Hugging Face 2026-05-11 HF ↑81

δ-mem: Efficient Online Memory for Large Language Models

Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose δ-mem, a lightweight memory mechanism that augments a fr...

#llm#agent#fine-tuning#benchmark

論文深掘り Hugging Face 2026-05-11 HF ↑116

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely...

#multimodal#agent

論文深掘り Hugging Face 2026-05-10 HF ↑39

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rou...

#agent#llm#rl#benchmark

論文深掘り Hugging Face 2026-05-10 HF ↑49

Qwen-Image-2.0 Technical Report

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution pho...

#multimodal#vision#diffusion#benchmark

論文深掘り Hugging Face 2026-05-06 HF ↑31

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple re...

#diffusion#fine-tuning#rl#coding#benchmark

論文深掘り Hugging Face 2026-04-29 HF ↑141

異種科学基盤モデル協調フレームワーク「Eywa」

非言語科学モデルをLLMエージェントに接続する「異種AI協調」が研究開発インフラを塗り替えるかもしれない

科学分野では自然言語以外のデータ（分子構造・物理シミュレーション・ゲノム等）を扱うドメイン特化基盤モデル（domain-specific foundation model）が多数開発されているが、既存のエージェント型LLMシステムは言語を唯一のインターフェースとするため、これら専門モデルとの連携が困難だった。本研究では、ドメイン特化モデルに言語モデルベースの推論インターフェースを付加し、LLMが非言語データモダリティ上の推論を誘導できる異種エージェントフレームワーク「Eywa」を提案する。Eywaは単一エージェントパイプラインの代替（EywaAgent）、既存マルチエージェントシステムへの組み込み（EywaMAS）、さらに計画型オーケストレーション（EywaOrchestra）の3構成を持つ。物理・生命・社会科学にまたがる多様なタスクで評価した結果、構造化データやドメイン固有データを含むタスクで性能が向上し、言語のみへの依存を低減できることが示された。

#agent#llm#benchmark

論文深掘り Hugging Face 2026-04-28 HF ↑70

GLM-5V-Turbo：マルチモーダルエージェントのためのネイティブ基盤モデルへの取り組み

マルチモーダル知覚を「後付け」から「中核」へ転換するエージェント設計が実装標準になりそう

本報告では、マルチモーダルエージェント（multimodal agent）向けのネイティブ基盤モデル（native foundation model）を目指すGLM-5V-Turboを紹介する。基盤モデルが実環境に展開されるにつれ、エージェントの能力は言語推論だけでなく、画像・動画・Webページ・文書・GUI（グラフィカルユーザーインターフェース）などの異質なコンテキストを知覚・解釈・操作する能力にも依存する。GLM-5V-Turboはこの目的を中心に構築されており、マルチモーダル知覚を言語モデルへの補助的インターフェースとしてではなく、推論・計画・ツール利用・実行の中核コンポーネントとして統合している。モデル設計、マルチモーダル訓練、強化学習（reinforcement learning）、ツールチェーン拡張、エージェントフレームワーク統合における主要改善をまとめ、マルチモーダルコーディング・視覚的ツール利用・フレームワーク型エージェントタスクで高い性能を達成しつつ、テキスト専用コーディング能力も維持していると主張する。

#multimodal#agent#coding#rl

論文深掘り Hugging Face 2026-04-28 HF ↑33

ClawGym：効果的なClawエージェント構築のためのスケーラブルフレームワーク

ローカルPC操作エージェントの自社訓練が、中小スタートアップでも現実的な選択肢になりそう

ローカルファイルやツール、永続的なワークスペース状態を扱うマルチステップワークフロー環境「Claw式環境」において、エージェント開発を体系化するフレームワークが不足していた。本研究ではClawGymを提案し、個人エージェントの開発ライフサイクル全体を支援する。具体的には、ペルソナ駆動の意図とスキルに基づく操作から合成された1万3500件のタスクデータセット「ClawGym-SynData」を構築し、リアルなモックワークスペースとハイブリッド検証機構を組み合わせた。さらに、ブラックボックスのロールアウト軌跡に対する教師ありファインチューニング（supervised fine-tuning）によってClawGym-Agentsを訓練し、タスクごとのサンドボックスで並列ロールアウトを行う軽量パイプラインによる強化学習（reinforcement learning）も探索した。評価基盤として自動フィルタリングと人間・LLM協調レビューにより較正された200インスタンスのベンチマーク「ClawGym-Bench」も構築している。

#agent#llm#rl#fine-tuning#benchmark

論文深掘り Hugging Face 2026-04-28 HF ↑35

TIDEで潮目を変える：拡散大規模言語モデルのクロスアーキテクチャ蒸留

dLLMの小型化加速で、並列推論モデルがエッジ・モバイルへ普及する布石になりそう

拡散大規模言語モデル（dLLM: Diffusion Large Language Model）は並列デコードと双方向コンテキストという優位性を持つが、競争力を発揮するには数十億パラメータ規模が必要という課題がある。既存の蒸留手法は同一アーキテクチャ内での推論ステップ削減にとどまり、アーキテクチャ・アテンション機構・トークナイザーが異なる教師から生徒へのクロスアーキテクチャ知識転移は未開拓だった。本研究はTIDEを提案する。TIDEは三つのモジュールで構成される：学習進捗と拡散タイムステップに応じて蒸留強度を調整するTIDAL、マスク補完分割で重マスク時の教師予測を改善するCompDemo、クロストークナイザー目標関数でチャンク単位尤度マッチングを反転させ勾配を安定化させるReverse CALMである。8BのDenseモデルと16BのMoEモデルを教師として0.6Bの生徒モデルに蒸留した結果、8つのベンチマーク平均で1.53ポイント向上し、HumanEvalでは48.78（ARベースライン32.3）を達成したと主張する。

#llm#diffusion#coding#benchmark

論文深掘り Hugging Face 2026-04-27 HF ↑36

DV-World: 実世界シナリオにおけるデータ可視化エージェントのベンチマーク

SOTAでも正答率50%未満——DVエージェントの実用化評価軸が刷新されそう

データ可視化（Data Visualization, DV）の実務では、ネイティブ環境への適応、クロスプラットフォームでの進化的編集、ユーザー意図の能動的な解釈が求められる。しかし既存のベンチマークはコードサンドボックスに閉じており、単一言語での生成タスクのみ、かつユーザー意図が明確という前提に依存していた。本研究はこのギャップを埋めるため、実務プロフェッショナルのライフサイクルを模した260タスクからなるベンチマーク「DV-World」を提案する。DV-Worldは「DV-Sheet（スプレッドシート上のチャート・ダッシュボード生成と修正診断）」「DV-Evolution（多様なプログラミングパラダイムを跨いだビジュアル成果物の改変・再構成）」「DV-Interact（曖昧な要件を模したユーザーシミュレータとの意図整合）」の3ドメインで構成される。評価には数値精度を測るTable-value AlignmentとMLLM-as-a-Judgeを組み合わせたハイブリッド手法を採用。実験の結果、最先端モデルでも総合50%未満の性能にとどまり、実世界DVの複雑さへの対応不足が明らかになったとしている。

#alignment#benchmark#agent#llm#coding

論文深掘り Hugging Face 2026-04-21 HF ↑4

SkillLearnBench：実世界タスクにおけるエージェントスキル生成のための継続学習手法ベンチマーク

「強いLLMなら解決」神話が崩れ、エージェントスキル設計の評価基盤競争が始まりそう

LLMエージェント（大規模言語モデルエージェント）が複雑な実世界タスクを実行するための「スキル」は主流の手法となりつつあるが、それを自動かつ効果的に学習する方法は未解明であった。本研究では、継続学習（Continual Learning）手法を評価する初のベンチマーク「SkillLearnBench」を提案する。実世界のスキル分類体系から導出した15サブドメインにわたる20の検証済みタスクで構成され、スキル品質・実行軌跡・タスク成果の3レベルで評価される。評価の結果、全ての継続学習手法はスキルなしベースラインを上回るものの、全タスク・全LLMで一貫して優れる手法は存在しないことが判明。また、強力なLLMバックボーンへのスケーリングも必ずしも改善に繋がらず、外部フィードバックによる反復改善は有効な一方、自己フィードバック単独では再帰的なドリフトを引き起こすことも明らかになった。コードとデータはオープンソースで公開されている。

#llm#agent#benchmark

論文深掘り Hugging Face 2026-04-21 HF ↑156

LLaDA2.0-Uni: 拡散大規模言語モデルによるマルチモーダル理解と生成の統合

「理解も生成も」を1モデルで担うオープンな統合基盤モデルが、マルチモーダルAIの開発競争を塗り替えるかもしれない

本研究は、テキストと画像の理解・生成を単一フレームワークで実現する統合型離散拡散大規模言語モデル（dLLM）「LLaDA2.0-Uni」を提案する。アーキテクチャは、完全意味的な離散トークナイザー、MoE（Mixture of Experts）ベースのdLLMバックボーン、拡散デコーダーの3要素で構成される。SigLIP-VQにより連続的な視覚入力を離散化し、テキストと視覚の両入力に対してブロックレベルのマスク拡散を実現。バックボーンのプレフィックス認識最適化とデコーダーの少ステップ蒸留により推論効率も向上させる。大規模データと多段階学習パイプラインにより、特化型VLM（Vision-Language Model）と同等のマルチモーダル理解性能を維持しつつ、高品質な画像生成・編集能力も達成。テキストと画像が混在するインターリーブ生成と推論をネイティブにサポートし、次世代統合基盤モデルの有望なパラダイムを示すと主張する。

#diffusion#multimodal#llm#coding#vision

論文深掘り Hugging Face 2026-04-20 HF ↑19

TEMPO: 大規模推論モデルのテスト時学習をスケールさせる手法

推論時の追加学習が「頭打ちの壁」を突破し、デプロイ後のモデル改善が現実的な選択肢になりそう

大規模推論モデル（Large Reasoning Model, LRM）の推論時にモデルパラメータを適応させるテスト時学習（Test-time Training, TTT）は、オフライン学習の限界を超える能力拡張として注目される。しかし既存のTTT手法は性能向上がすぐに頭打ちになり、計算リソースを追加投入しても効果が薄れるという課題があった。原因として、自己生成報酬信号がモデルの更新に伴いドリフトし、多様性崩壊（diversity collapse）が起きることが指摘されている。本研究では、ラベルなし問題への方策改善（policy refinement）と、ラベル付きデータセット上での定期的な評価器再較正（critic recalibration）を交互に行うTTTフレームワーク「TEMPO」を提案する。この手順をEM（Expectation-Maximization）アルゴリズムとして定式化することで、従来手法が再較正ステップを欠く不完全な変形であることを示す。OLMO3-7BのAIME 2024スコアを33.0%から51.1%、Qwen3-14Bを42.3%から65.8%に改善し、多様性も維持することを確認した。

論文深掘り Hugging Face 2026-04-20 HF ↑33

AnyRecon: ビデオ拡散モデルによる任意視点3D再構成

「写真を数枚撮るだけで3Dモデル完成」が現実的な選択肢になりそう

スパース視点（sparse-view）からの3D再構成は、少数の画像から現実的な3Dシーンを構築する上で重要な課題だが、既存の拡散モデル（diffusion model）ベース手法は1〜2枚の入力画像に依存するため、幾何学的一貫性の維持や大規模・多様なシーンへの対応が困難であった。本研究では、任意の順序・枚数のスパース入力から拡張性の高い3D再構成を行うフレームワーク「AnyRecon」を提案する。グローバルシーンメモリ（persistent global scene memory）をキャプチャビューキャッシュとして構築し、時間圧縮を排除することで大きな視点変化にも対応。さらに明示的な3D幾何メモリと幾何駆動のビュー検索を組み合わせた幾何認識型コンディショニング戦略を導入し、生成と再構成の相互作用を強化している。効率化のため、4ステップ拡散蒸留（diffusion distillation）とコンテキストウィンドウスパースアテンションを組み合わせ、計算量の削減を実現。不規則な入力・大視点差・長軌跡での頑健な再構成を実験的に示している。

#diffusion#benchmark

論文深掘り Hugging Face 2026-04-19 HF ↑76

識別的テキスト表現によるクラスラベルからテキストへのワンステップ画像生成の拡張

ワンステップ・テキスト→画像生成が現実的な選択肢になり、リアルタイム生成AIの設計が変わりそう

ワンステップ画像生成（one-step generation）は長年の研究目標であり、近年MeanFlowがクラスラベルを条件としたクラス→画像生成で顕著な成果を示している。本研究はその条件をテキスト入力へと拡張し、より豊かなコンテンツ生成を目指す。しかし、LLMベースのテキストエンコーダを従来の学習戦略で統合しても性能が不十分であることが判明した。詳細な分析により、MeanFlowのように生成ステップ数が極めて少ない（1ステップ）場合、テキスト特徴表現に高い「識別性（discriminability）」が必要であることが明らかになった。これがクラスラベルのような離散的・識別的な特徴が好成績を収める理由でもある。この知見に基づき、必要な意味論的特性を持つLLMベーステキストエンコーダを活用してMeanFlowに適応させ、初めてテキスト条件付きワンステップ合成を実現。拡散モデル（diffusion model）においても生成性能の大幅な向上を確認し、コードも公開された。

#llm#vision#diffusion

論文深掘り Hugging Face 2026-04-19 HF ↑50

Agent-World：進化する汎用エージェント知能のためのリアルワールド環境合成のスケーリング

MCP時代のエージェント自律訓練が現実化し、小型モデルが大型独自モデルを超える時代が来るかもしれない

大規模言語モデル（LLM）が汎用エージェントとして外部ツール環境と対話する需要が高まる一方、堅牢なエージェント訓練はリアルな環境の不足と生涯学習（life-long learning）の仕組みの欠如により制約されてきた。本論文ではAgent-Worldを提案する。これは自己進化型の訓練アリーナであり、2つの主要コンポーネントを持つ。第1に「エージェント的環境・タスク発見」機能は、数千のテーマから実世界の環境を自律探索し難易度制御可能な検証可能タスクを合成する。第2に「継続的自己進化エージェント訓練」は、マルチ環境強化学習と自己進化アリーナを組み合わせ、動的タスク合成で能力ギャップを自動同定し、エージェントポリシーと環境の共進化を実現する。23の困難なベンチマークでAgent-World-8Bおよび14Bが有力な独自モデルや環境スケーリングベースラインを一貫して上回ったとしている。

#agent#llm#rl#benchmark

論文深掘り Hugging Face 2026-04-19 HF ↑38

OpenGame: ゲーム向けオープン・エージェント型コーディングフレームワーク

「仕様書を渡すだけでゲームが生成される」時代が現実に近づきそう

ゲーム開発はクリエイティブ設計と複雑なソフトウェアエンジニアリングが交差する領域であり、ゲームエンジン・リアルタイムループ・複数ファイルにまたがる状態管理の統合が求められる。既存のLLM（大規模言語モデル）やコードエージェントは孤立したプログラミングタスクは解けるものの、高レベルな設計仕様からプレイ可能なゲームを生成する際、クロスファイルの不整合や論理的不一致に頻繁に失敗する。本論文はこの課題に対し、エンドツーエンドのWebゲーム生成に特化した初のオープンソースエージェントフレームワーク「OpenGame」を提案する。中核には再利用可能な「Game Skill」があり、プロジェクト雛形ライブラリを成長させる「Template Skill」と検証済み修正プロトコルを維持する「Debug Skill」で構成される。さらに270億パラメータの「GameCoder-27B」を、継続事前学習・教師あり微調整・実行ベース強化学習の3段階パイプラインで専門化。評価基準として「OpenGame-Bench」を導入し、150種の多様なゲームプロンプトで最高精度を達成したと主張している。

#agent#llm#rl#multimodal#fine-tuning

論文深掘り Hugging Face 2026-05-31 HF ↑12

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic...

#benchmark#agent#llm

論文 Hugging Face 2026-05-31 HF ↑20

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradig...

#llm#benchmark#agent

論文 Hugging Face 2026-05-27 HF ↑46

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of ...

#benchmark

論文 Hugging Face 2026-05-27 HF ↑14

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Reinforcement learning (RL) post-training has shown to improve reasoning in large language models (LLMs). However, there has been little exploration on the problem of data contamination in RL post-training, potentially undermining generalization and evaluation reliability of the training process its...

#llm#rl#benchmark

論文 Hugging Face 2026-05-27 HF ↑12

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while i...

#llm#rl#benchmark

論文 Hugging Face 2026-05-27 HF ↑23

GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of pr...

#agent#vision#llm#multimodal

論文 Hugging Face 2026-05-27 HF ↑16

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capa...

#llm#benchmark#fine-tuning#coding

論文 Hugging Face 2026-05-27 HF ↑16

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance es...

#multimodal#benchmark

論文 Hugging Face 2026-05-26 HF ↑4

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT mon...

#alignment#benchmark#llm

論文 Hugging Face 2026-05-26 HF ↑72

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short...

#rl

論文 Hugging Face 2026-05-26 HF ↑11

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information e...

#agent#benchmark#llm

論文 Hugging Face 2026-05-26 HF ↑38

Self-Improving Language Models with Bidirectional Evolutionary Search

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse veri...

#agent#robotics#benchmark

論文深掘り Hugging Face 2026-05-25 HF ↑48

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and sp...

#alignment#robotics#benchmark

論文 Hugging Face 2026-05-20 HF ↑16

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs o...

#llm#multimodal#rl#agent#benchmark

論文 Hugging Face 2026-05-20 HF ↑18

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens...

#llm#benchmark#multimodal#fine-tuning

論文 Hugging Face 2026-05-20 HF ↑22

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a pr...

#agent#llm#rl#fine-tuning#benchmark

論文 Hugging Face 2026-05-20 HF ↑26

WorldKV: Efficient World Memory with World Retrieval and Compression

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks re...

#benchmark#diffusion#fine-tuning#coding

論文深掘り Hugging Face 2026-05-19 HF ↑27

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extr...

#llm#rl#benchmark

論文 Hugging Face 2026-05-18 HF ↑12

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher...

#rl#multimodal#alignment#benchmark

論文 Hugging Face 2026-05-18 HF ↑44

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification l...

#agent#benchmark#llm

論文深掘り Hugging Face 2026-05-18 HF ↑40

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this...

#agent#benchmark

論文深掘り Hugging Face 2026-05-17 HF ↑42

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts, with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneve...

#agent#llm#benchmark

論文 Hugging Face 2026-05-17 HF ↑32

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, te...

#agent#llm#benchmark#robotics

論文深掘り Hugging Face 2026-05-17 HF ↑57

Lance: Unified Multimodal Modeling by Multi-Task Synergy

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collabor...

#multimodal#alignment#coding

論文 Hugging Face 2026-05-13 HF ↑36

Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in ...

#llm#benchmark

論文 Hugging Face 2026-05-13 HF ↑46

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred witho...

#multimodal#agent#benchmark

論文 Hugging Face 2026-05-13 HF ↑36

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remai...

#diffusion

論文 Hugging Face 2026-05-13 HF ↑30

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications,...

#coding#fine-tuning#alignment

論文 Hugging Face 2026-05-13 HF ↑15

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alterna...

#agent#rl#benchmark

論文 Hugging Face 2026-05-12 HF ↑14

Useful Memories Become Faulty When Continuously Updated by LLMs

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM r...

#agent#llm

論文 Hugging Face 2026-05-12 HF ↑19

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understa...

#llm#benchmark#fine-tuning

論文 Hugging Face 2026-05-12 HF ↑74

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merge...

#llm#rl#benchmark

論文 Hugging Face 2026-05-11 HF ↑11

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio ...

#agent#llm#rl

論文 Hugging Face 2026-05-11 HF ↑43

World Action Models: The Next Frontier in Embodied AI

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by int...

#coding#robotics#benchmark

論文 Hugging Face 2026-05-11 HF ↑23

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to...

#multimodal#llm#diffusion#agent#alignment

論文 Hugging Face 2026-05-10 HF ↑22

Model Merging Scaling Laws in Large Language Models

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-d...

#llm

論文深掘り Hugging Face 2026-05-10 HF ↑11

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized int...

#agent#rl#llm

論文深掘り Hugging Face 2026-05-06 HF ↑18

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the ``zero-advantage problem'': when all sampled roll...

#llm#rl

論文 Hugging Face 2026-05-06 HF ↑2

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutiona...

#llm

論文深掘り Hugging Face 2026-05-06 HF ↑35

MiA-Signature: Approximating Global Activation for Long-Context Understanding

A growing body of work in cognitive science suggests that reportable conscious access is associated with global ignition over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggest...

#llm#agent#rag

論文深掘り Hugging Face 2026-05-05 HF ↑18

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the a...

#agent#multimodal#rl#benchmark

論文深掘り Hugging Face 2026-05-05 HF ↑18

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for directly continuous supervised fine-tuning. For ...

#diffusion#fine-tuning#llm#multimodal#vision

論文深掘り Hugging Face 2026-05-04 HF ↑12

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT)...

#agent#llm#rl#fine-tuning#benchmark

論文深掘り Hugging Face 2026-05-03 HF ↑64

MolmoAct2：現実世界展開のための行動推論モデル

ロボットAIの「オープンソース革命」が始まり、参入コストが数分の一になりそう

ロボット向けの汎用コントローラーを目指すVision-Language-Action（VLA）モデルは、実世界展開の観点でクローズドモデルや高価なハードウェア依存、高レイテンシといった課題を抱えている。本研究ではAllen AIが完全オープンな行動推論モデル「MolmoAct2」を発表。5つの軸で改善を加え、空間・身体的推論に特化したVLMバックボーン「MolmoER」（330万サンプルで訓練）、低〜中コストプラットフォーム向け3種の新データセット（最大規模のオープン双腕データセット「MolmoAct2-BimanualYAM」720時間を含む）、オープンな行動トークナイザー「OpenFAST」、フローマッチング連続行動エキスパートをKVキャッシュ条件付けで統合した新アーキテクチャ、さらに変化領域のみ深度トークンを再予測する適応型推論「MolmoThink」を提供する。7つのベンチマークでPi-05を上回り、MolmoERは13の身体推論ベンチマークでGPT-5およびGemini Robotics ER-1.5を超えると報告している。モデル重み・訓練コード・データはすべて公開される。

#multimodal#robotics#benchmark#fine-tuning

論文深掘り Hugging Face 2026-04-29 HF ↑14

検証器ベースの強化学習を活用した画像編集：Edit-R1

画像編集AIの品質評価が「総合点」から「原則別チェック」に進化し、編集精度の底上げが加速しそう

テキストから画像生成においてRLHF（人間フィードバックからの強化学習）は主要なパラダイムとなっているが、画像編集への応用は未開拓のままだった。課題は、全編集タスクに対応できる汎用報酬モデルの欠如であり、既存モデルは総合スコアのみを出力し指示内容の詳細を無視していた。本研究はEdit-R1を提案し、Chain-of-Thought（CoT）推論を用いた検証器ベース報酬モデル（RRM：Reasoning Reward Model）を構築する。Edit-RRMは編集指示を個別の原則に分解し、各原則ごとに画像を評価してきめ細かな報酬を生成する。構築には教師あり微調整（SFT）でCoT軌跡を生成後、人間のペアワイズ選好データを活用する新アルゴリズムGCPO（Group Contrastive Preference Optimization）でRRMを強化する。その後GRPOで編集モデルを訓練。実験ではSeed-1.5/1.6-VLといった強力なVLMを上回り、3Bから7Bのパラメータスケールで性能向上のスケーリング則も確認された。

#rl#multimodal#fine-tuning#vision#benchmark

論文深掘り Hugging Face 2026-04-29 HF ↑59

ビジュアル生成の新時代：アトミックマッピングからエージェント的世界モデリングへの進化

ビジュアル生成の評価軸が「見た目」から「因果・構造的整合性」へ移行し、製品選定基準が塗り替えられる可能性がある

近年のビジュアル生成モデル（visual generation model）はフォトリアリズムや文字描画、指示追従、インタラクティブ編集において大きな進歩を遂げた一方、空間推論・持続的状態管理・長期的一貫性・因果理解には依然として課題があると本論文は指摘する。著者らは「外見の合成」を超えた「インテリジェントなビジュアル生成」、すなわち構造・ダイナミクス・ドメイン知識・因果関係に根ざした生成へのシフトを主張する。この転換を整理するために、①アトミック生成、②条件付き生成、③インコンテキスト生成、④エージェント的生成、⑤世界モデリング生成という5段階の分類体系を提案。フローマッチングや統合理解・生成モデル、ポストトレーニング、報酬モデリング等の技術要因を分析し、現行評価指標が知覚的品質を重視するあまり構造・時間・因果の失敗を見逃し進歩を過大評価しているとも警告する。

#agent#benchmark

論文深掘り Hugging Face 2026-04-27 HF ↑57

再帰的マルチエージェントシステム（RecursiveMAS）

マルチエージェントAIのAPI費用が最大75%減になり得る設計思想が登場した

近年、同一モデルを潜在状態（latent state）上で反復させる「再帰的言語モデル」が推論深化の新たなスケーリング軸として注目されている。本研究はこの原理を単一モデルから複数エージェントへ拡張し、「エージェント間の協調そのものを再帰で深化できるか」という問いを立てる。提案手法RecursiveMASは、軽量モジュールRecursiveLinkを介して異種エージェントを協調ループで接続し、潜在空間内での思考生成とエージェント間の状態転送を実現する。学習には内外ループ最適化アルゴリズムを開発し、再帰ラウンド間で勾配を共有することでシステム全体を協調最適化する。数学・科学・医療・検索・コード生成にわたる9ベンチマークでの評価では、既存の単一/マルチエージェント手法と比較して平均精度8.3%向上、推論速度1.2〜2.4倍、トークン使用量34.6〜75.6%削減を達成したとしている。

#agent#coding#benchmark

論文 Hugging Face 2026-04-26 HF ↑14

科学的プロセスへの報酬付与：エージェント型データ分析のためのプロセスレベル報酬モデリング

背景・課題：プロセス報酬モデル（PRM）は数学などの静的ドメインでLLMの推論能力を向上させてきたが、動的なデータ分析タスクへの適用は未開拓であった。既存の汎用PRMはデータ分析エージェントの監督において、インタープリタ例外を発生させないまま誤結果をもたらすサイレントエラーを検出できず、探索的な試行錯誤を誤ってペナルティとして扱う問題が示された。提案手法：著者らはDataPRMと呼ぶ環境認識型の生成PRMを提案する。DataPRMは環境と自律的にインタラクションして中間実行状態を検査しサイレントエラーを検出するアクティブ検証器として機能し、修正可能なエラーと回復不能なミスを区別する反省認識型の三値報酬戦略を採用する。8K超の高品質な訓練インスタンスをダイバーシティ駆動の軌跡生成と知識拡張型アノテーションにより構築した。成果・貢献：ScienceAgentBenchで7.21%、DABStepで11.28%の性能向上を達成し、4Bパラメータでも強力なベースラインを上回り、強化学習（RL）との統合でDABenchおよびTableBenchでも顕著な改善が得られたとしている。

#agent#llm#rl

論文深掘り Hugging Face 2026-04-26 HF ↑49

ReVSI：VLMの3D空間推論を正確に評価するための視覚空間知能評価の再構築

「空間推論ができる」VLMのスコアは評価設計の欠陥で水増しされている可能性があり、選定基準の見直しを迫るかもしれない

現行のVLM（視覚言語モデル）空間知能評価には2つの構造的欠陥がある。①点群（point cloud）ベースの3Dアノテーションを動画評価の正解として流用することで、物体の見落とし・誤ラベル・サイズ情報の破損が生じ、QAペアが不正確になる。②全シーン情報を前提とした設問設計なのに、多くのVLMは16〜64フレームのスパースサンプリングで動作するため、モデルが実際に受け取る入力では回答不可能な問題が多数存在する。本研究はReVSIベンチマークを提案し、5データセット計381シーンを専門的3Dアノテーションツールで再アノテーションし、厳格なバイアス除去と人手検証を経てQAペアを再生成。16/32/64/全フレームの複数バジェット設定と細粒度の物体可視性メタデータも整備した。汎用・ドメイン特化VLM両方の評価から、従来ベンチマークでは隠蔽されていた系統的失敗パターンが明確に浮かび上がることを示した。

#multimodal#benchmark

論文深掘り Hugging Face 2026-04-26 HF ↑27

Tuna-2：ピクセル埋め込みがマルチモーダル理解・生成においてビジョンエンコーダを超える

ビジョンエンコーダ不要の統合マルチモーダルモデルが、AIシステム設計の常識を塗り替えるかもしれない

統合型マルチモーダルモデルは通常、事前学習済みビジョンエンコーダ（vision encoder）に依存し、理解タスクと生成タスクで異なる視覚表現を使用するため、両タスク間のミスアライメントが生じ、生ピクセルからのエンドツーエンド最適化が困難とされてきた。本研究では、ピクセル埋め込み（pixel embedding）に基づいて視覚理解と生成を直接実行するネイティブ統合マルチモーダルモデル「Tuna-2」を提案する。Tuna-2はVAEや表現エンコーダといったモジュール型ビジョンエンコーダ設計を完全に廃止し、シンプルなパッチ埋め込み層のみで視覚入力をエンコードすることでアーキテクチャを大幅に簡略化する。実験では、Tuna-2がマルチモーダルベンチマークで最先端性能を達成し、ピクセル空間統合モデリングが潜在空間（latent-space）アプローチと同等以上の高品質画像生成を実現できることを示す。特にスケール時の細粒度視覚知覚タスクで優れた性能を発揮し、事前学習済みビジョンエンコーダがマルチモーダルモデリングに必須ではないことを示唆している。

#multimodal#alignment#vision#benchmark

論文深掘り Hugging Face 2026-04-22 HF ↑12

速く見る・遅く見る：動画における時間の流れの学習

「時間を操るAI」が動画編集・フォレンジクス・世界モデルの三分野を同時に揺さぶる

動画の再生速度変化を人間はどう知覚するか、またAIはどう制御できるか——本研究はこの問いを出発点に、「時間の流れ」を学習可能な視覚概念として体系的に研究する。動画に自然に含まれるマルチモーダル手がかりと時間的構造を活用し、自己教師あり学習（self-supervised learning）によって速度変化の検出と再生速度の推定モデルを構築。これを用いて、ノイズの多い一般動画源から現時点最大規模のスローモーション動画データセットを自動収集した。さらに、指定した再生速度で映像を生成する速度条件付きビデオ生成（speed-conditioned video generation）と、低フレームレートのぼけた動画を高FPS映像に変換するテンポラル超解像（temporal super-resolution）を実現。時間を操作可能な知覚次元として扱うことで、動画フォレンジクス（forensics）検出や、事象の展開を理解するリッチなワールドモデルへの応用可能性も示唆している。

#multimodal#vision

論文 Hugging Face 2026-04-21 HF ↑13

生成的観点から空間知能を探る

マルチモーダル大規模言語モデル（multimodal large language model）における空間知能（spatial intelligence）は重要な能力だが、既存のベンチマークは理解（understanding）の側面のみを評価しており、生成（generation）の観点が欠けていた。本研究では、画像生成時に3D空間制約を遵守・操作する能力である「生成的空間知能（GSI: Generative Spatial Intelligence）」を定義し、その測定と改善を試みる。提案するGSI-Benchは、空間的根拠に基づく画像編集タスクを通じてGSIを定量評価する初のベンチマークであり、3Dプライオル誘導による実世界データセット「GSI-Real」と制御可能な合成ベンチマーク「GSI-Syn」の2コンポーネントで構成される。実験では、GSI-Synでの統合型マルチモーダルモデルのファインチューニングが合成・実世界タスク双方で大幅な性能向上をもたらし、さらに空間理解（spatial understanding）の下流タスクも改善されることが示された。生成的学習が空間推論を強化するという初の明確なエビデンスを提示し、マルチモーダルモデルの空間知能向上への新たな経路を開拓したと主張している。

#multimodal#benchmark#llm#fine-tuning#vision

論文深掘り Hugging Face 2026-04-21 HF ↑41

近未来ポリシー最適化（NPO）：自己の未来チェックポイントから学ぶ強化学習手法

「自己の未来から学ぶ」RLVRが、LLM強化学習のコスト構造を変えるかもしれない

強化学習における検証可能報酬を用いたポスト学習（RLVR）は、外部教師からの軌跡（高品質だが分布が遠い）か過去の訓練軌跡のリプレイ（近いが品質に上限がある）という二択の課題を抱えていた。本研究はこの問題に対し、「近未来ポリシー最適化（NPO: Near-Future Policy Optimization）」を提案する。NPOは同一訓練ランの後期チェックポイントを補助軌跡のソースとして活用する手法であり、現在のポリシーより強くかつ外部ソースより近いという両条件を自然に満たす。有効学習信号S=Q/Vを最大化するため、学習初期のブートストラッピングと後期の停滞突破という2つの手動介入を検証し、さらにオンライン訓練シグナルから自動的に介入を発動するAdaptive変種「AutoNPO」を提案。Qwen3-VL-8B-InstructとGRPOの組み合わせで平均性能を57.88から63.15へと向上させ、収束加速と性能上限引き上げの両立を実証した。

#rl

論文 Hugging Face 2026-04-20 HF ↑12

ShadowPEFT: パラメータ効率的なファインチューニングのためのシャドウネットワーク

大規模言語モデル（LLM）のパラメータ効率的なファインチューニング（PEFT）は、事前学習済みバックボーンを固定しつつ少数のタスク固有パラメータのみを学習するアプローチだが、LoRAに代表される既存手法は各重み行列に独立した低ランク摂動を挿入する局所的なパラメータ化に留まるという課題がある。本論文はShadowPEFTを提案する。これは深さ方向で共有されるシャドウモジュールによって層レベルの精錬（layer-level refinement）を行う集約型PEFTフレームワークである。各Transformer層で並列シャドウ状態を維持し、それを反復的に発展させることで段階的に豊かな隠れ状態を生成する。シャドウモジュールはバックボーンと分離されているため、深さ方向での再利用・独立した事前学習・分離デプロイが可能でエッジコンピューティングにも適する。生成・理解ベンチマークでLoRAおよびDoRAと同等以上の性能を達成し、集約型の層空間適応が従来の低ランクPEFTの有力な代替となり得ることを示している。

#fine-tuning#llm#benchmark

論文 Hugging Face 2026-04-20 HF ↑30

CoInteract: 空間構造化共生成による物理的整合性を持つ人物-物体インタラクション動画合成

人物と物体のインタラクション（HOI: Human-Object Interaction）動画合成は、ECや仮想マーケティングで実用価値が高い。しかし既存の拡散モデル（diffusion model）は、手や顔などの構造的安定性の欠如、および手と物体の干渉（interpenetration）といった物理的非整合の問題を抱えている。本論文ではCoInteractを提案する。人物参照画像・商品参照画像・テキストプロンプト・音声を条件として受け取るエンドツーエンドのHOI動画合成フレームワークである。Diffusion Transformer（DiT）をバックボーンとし、2つの機構を導入する。第一に、空間的に監督されたルーティングで領域特化型エキスパートにトークンを振り分けるHuman-Aware Mixture-of-Experts（MoE）を提案し、少ないパラメータ追加で構造的忠実度を向上させる。第二に、RGBストリームとHOI構造ストリームを同時学習するデュアルストリーム訓練パラダイム「Spatially-Structured Co-Generation」を提案し、推論時にHOIブランチを除去することでオーバーヘッドゼロを実現する。実験では既存手法を大幅に上回る結果を示した。

#diffusion#speech

論文深掘り Hugging Face 2026-04-20 HF ↑12

Chat2Workflow：自然言語から実行可能なビジュアルワークフローを生成するベンチマーク

ノーコードワークフロー自動生成の「実力試験」が登場し、LLMの産業適用に新たな評価軸が生まれそう

実行可能なビジュアルワークフロー（visual workflow）は産業展開における主流パラダイムとなっているが、現状では開発者が手動でフロー設計・プロンプト作成・ロジック修正を繰り返す必要があり、コスト・時間・エラーの観点で課題がある。本研究では、自然言語から実行可能なワークフローを直接生成する能力を評価するベンチマーク「Chat2Workflow」を提案する。実世界のビジネスワークフローから構築されており、生成されたワークフローはDifyやCozeなどの実用プラットフォームに直接デプロイ可能な形式に変換できる。加えて、繰り返し発生する実行エラーを緩和するエージェント的フレームワークも提案した。実験結果では、最先端LLMは高レベルな意図は概ね捉えられるものの、複雑・変化する要件下での正確・安定・実行可能なワークフロー生成には依然として苦手意識があることが示された。エージェントフレームワークにより最大5.34%の解決率向上が得られるが、実用的なギャップはまだ大きく、産業グレードの自動化促進の基盤として位置づけられる。

#agent#benchmark#llm

論文 Hugging Face 2026-04-20 HF ↑18

PlayCoder: LLMが生成したGUIコードをプレイ可能にする

LLM（大規模言語モデル）によるコード生成は進化しているが、GUIアプリケーション、特にゲームの生成能力は十分に研究されていない。既存ベンチマークはテストケースによる正誤評価が主であり、インタラクティブ・イベント駆動なGUIアプリには不適切であるという課題がある。本研究ではまず、Python・TypeScript・JavaScriptによる43件の多言語GUIアプリを収録したリポジトリ対応ベンチマーク「PlayEval」を構築し、6カテゴリのGUIアプリケーションをカバーする。また、k個の生成候補のうち少なくとも1つがエンドツーエンドでプレイ可能かを測る指標「Play@k」を提案する。評価を支援するLLMエージェント「PlayTester」はGUI操作を自動実行しロジック違反を検出する。10種類の最先端コードLLMへの実験では、コンパイル成功率は高いもののPlay@3はほぼゼロであり、論理的に正しいGUI生成の弱点が明らかになった。これを解決するマルチエージェントフレームワーク「PlayCoder」は、生成・評価・修復をクローズドループで行い、Exec@3 38.1%・Play@3 20.3%を達成したと報告している。

#llm#benchmark#agent#alignment#coding

論文 Hugging Face 2026-04-19 HF ↑12

WebCompass: コード言語モデルのためのマルチモーダルWebコーディング評価に向けて

背景・課題：大規模言語モデル（LLM）はエンドツーエンドのWebコーディングエージェントとして急速に進化しているが、既存のベンチマークはテキスト条件付きの生成と静的正確性メトリクスといった限られた側面しか評価しておらず、視覚的忠実性・インタラクション品質・コードベースレベルの推論はほぼ未評価のままだという課題がある。提案手法：本論文ではWebCompassを提案する。これはテキスト・画像・動画の3入力モダリティと、生成・編集・修復の3タスク種別を組み合わせた7カテゴリで構成されるマルチモーダルベンチマークである。評価にはLLM-as-a-Judgeに加え、実ブラウザ上でWebサイトを自動実行し、Model Context Protocol（MCP）でインタラクションを探索してテストケースを反復生成するAgent-as-a-Judgeパラダイムを導入する。成果・貢献：評価の結果、クローズドソースモデルが依然として優位であること、美的品質がオープンソースモデルの最大のボトルネックであること、フレームワーク選択（Vueは難易度が高い等）が性能に大きく影響することが示された。

#coding#multimodal#agent#benchmark#llm

論文 Hugging Face 2026-04-19 HF ↑62

OneVL: ビジョン言語説明を用いたワンステップ潜在推論・計画

自律走行における軌道予測では、Chain-of-Thought（CoT）推論がVLA（Vision-Language-Action）モデルの性能を押し上げてきた。しかし自己回帰的な生成はリアルタイム展開を阻む遅延コストを生じる。潜在CoT手法はこの問題を連続隠れ状態への圧縮で解決しようとするが、明示的CoTには及ばないとされてきた。本論文はその原因を、純粋な言語的潜在表現が因果ダイナミクスではなく記号的抽象を圧縮している点に求める。そこで提案するOneVLは、VLAとWorld Modelを統合したフレームワークであり、テキストCoTを復元する言語デコーダに加え、将来フレームトークンを予測する視覚ワールドモデルデコーダを導入する。これにより潜在空間に道路幾何・エージェント動作・環境変化の因果ダイナミクスを内包させる。3段階の学習パイプラインで安定した最適化を実現し、推論時には補助デコーダを廃棄して単一並列パスで処理する。4つのベンチマークで初めて潜在CoTが明示的CoTを上回る精度を達成した。

#agent#robotics#benchmark

論文 Hugging Face 2026-04-19 HF ↑32

MultiWorld: スケーラブルなマルチエージェント・マルチビュー映像世界モデル

映像世界モデル（video world model）は行動条件付き映像生成として環境ダイナミクスをシミュレートする分野で成果を上げているが、既存手法の多くは単一エージェントに限定され、実世界のマルチエージェントシステムに内在する複雑な相互作用を捉えられていない。本論文では、複数エージェントの精密な制御とマルチビュー整合性を同時に実現する統合フレームワーク「MultiWorld」を提案する。マルチエージェント制御を担うMulti-Agent Condition Moduleと、異なるビュー間で一貫した観測を保証するGlobal State Encoderを導入し、エージェント数・視点数の柔軟なスケーリングと並列的な多視点合成による高効率処理を実現した。マルチプレイヤーゲーム環境とマルチロボット操作タスクでの実験により、映像品質・行動追従性・マルチビュー整合性においてベースラインを上回ることを示した。

#agent#robotics

論文 Hugging Face 2026-04-19 HF ↑12

弱い監督でLLMはいつ推論を学習できるか？

大規模言語モデル（LLM）の推論能力向上には、検証可能な報酬を用いた強化学習（RLVR）が有効だが、モデルの高性能化に伴い高品質な報酬信号の構築が困難になっている。本研究では、データ不足・ノイズの多い報酬・自己教師あり代理報酬という3種の弱い監督設定下で、複数のモデルファミリーと推論タスクを対象に体系的な実証実験を実施した。その結果、汎化の成否は「訓練報酬の飽和ダイナミクス」に支配されており、汎化するモデルは飽和前の長い段階で訓練報酬と下流性能が共に上昇する一方、早期に飽和するモデルは汎化ではなく記憶に陥ることが判明した。また、中間ステップが最終回答を論理的に支持する度合いである「推論忠実性（reasoning faithfulness）」がRLVR前の重要な予測指標となる一方、出力多様性だけでは予測に不十分であることを示した。さらに継続的事前学習と教師あり微調整（SFT）の貢献を切り分け、Llama3.2-3B-Baseへの適用で3設定すべてにおいて汎化を実現した。

#llm#rl#fine-tuning

論文 Hugging Face 2026-05-31 HF ↑11

Joint Agent Memory and Exploration Learning via Novelty Signals

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solutio...

#agent#llm#benchmark

論文 Hugging Face 2026-05-31 HF ↑9

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory:...

#benchmark#rag#diffusion

論文 Hugging Face 2026-05-31 HF ↑20

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to log...

#multimodal#benchmark

論文 Hugging Face 2026-05-27 HF ↑6

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-de...

#agent#llm#benchmark

論文 Hugging Face 2026-05-25 HF ↑4

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel...

#agent#benchmark#llm

論文 Hugging Face 2026-05-25 HF ↑17

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deploy...

#agent#coding#rl#benchmark

論文 Hugging Face 2026-05-25 HF ↑13

Recursive Flow Matching

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity t...

#diffusion#benchmark

論文 Hugging Face 2026-05-25 HF ↑8

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish w...

#agent#rl#llm#benchmark

論文 Hugging Face 2026-05-25 HF ↑18

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and can...

#speech#llm#benchmark

論文深掘り Hugging Face 2026-05-25 HF ↑7

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interacti...

#agent#benchmark#llm

論文深掘り Hugging Face 2026-05-25 HF ↑3

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all t...

#multimodal#benchmark

論文 Hugging Face 2026-05-20 HF ↑16

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. ...

#agent#diffusion#benchmark

論文 Hugging Face 2026-05-19 HF ↑14

Generative Recursive Reasoning

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministi...

論文 Hugging Face 2026-05-19 HF ↑9

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are mor...

#agent#alignment#benchmark

論文 Hugging Face 2026-05-19 HF ↑3

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently...

#alignment#rl#benchmark

論文 Hugging Face 2026-05-19 HF ↑3

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integra...

#benchmark

論文深掘り Hugging Face 2026-05-19 HF ↑6

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ...

#diffusion#alignment#vision

論文 Hugging Face 2026-05-19 HF ↑16

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely res...

#multimodal

論文 Hugging Face 2026-05-18

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preser...

#agent#llm#coding

論文 Hugging Face 2026-05-18 HF ↑7

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image ge...

#vision#benchmark#llm#multimodal#alignment

論文 Hugging Face 2026-05-18 HF ↑11

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid eva...

#benchmark#agent#alignment

論文 Hugging Face 2026-05-17 HF ↑4

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatia...

#multimodal#agent#rl#llm#robotics

論文 Hugging Face 2026-05-17 HF ↑11

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, ...

論文 Hugging Face 2026-05-12 HF ↑10

Asymmetric Flow Models

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise predictio...

#fine-tuning#diffusion#vision#benchmark

論文 Hugging Face 2026-05-12 HF ↑7

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represe...

#rag#benchmark

論文 Hugging Face 2026-05-12 HF ↑3

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions...

#llm#benchmark#agent#alignment

論文 Hugging Face 2026-05-12 HF ↑18

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long l...

#alignment#robotics#benchmark

論文 Hugging Face 2026-05-11 HF ↑21

L2P: Unlocking Latent Potential for Pixel Generation

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directl...

#diffusion#benchmark

論文 Hugging Face 2026-05-11 HF ↑4

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI ...

#agent#llm#coding

論文 Hugging Face 2026-05-11 HF ↑6

MEME: Multi-entity & Evolving Memory Evaluation

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, ...

#llm#agent#benchmark

論文 Hugging Face 2026-05-11 HF ↑22

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal executi...

#agent#rl

論文 Hugging Face 2026-05-10 HF ↑10

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innova...

#llm#agent

論文 Hugging Face 2026-05-10 HF ↑26

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycl...

#llm#agent#benchmark

論文 Hugging Face 2026-05-10 HF ↑21

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should ...

#benchmark

論文 Hugging Face 2026-05-10 HF ↑6

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage establish...

#benchmark

論文 Hugging Face 2026-05-10 HF ↑11

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrit...

#rl#llm

論文 Hugging Face 2026-05-10 HF ↑11

Pixal3D: Pixel-Aligned 3D Generation from Images

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We a...

論文 Hugging Face 2026-05-06 HF ↑10

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize th...

#agent#rl

論文 Hugging Face 2026-05-06 HF ↑7

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend...

#agent#benchmark#llm#rl

論文 Hugging Face 2026-05-06 HF ↑27

When to Trust Imagination: Adaptive Action Execution for World Action Models

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to wh...

#robotics#benchmark

論文 Hugging Face 2026-05-06 HF ↑17

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it towa...

#diffusion#alignment#vision

論文 Hugging Face 2026-05-06 HF ↑4

SkillOS: Learning Skill Curation for Self-Evolving Agents

LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key...

#agent#llm#rl#benchmark

論文 Hugging Face 2026-05-05 HF ↑28

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in fu...

#robotics#diffusion#multimodal#agent

論文深掘り Hugging Face 2026-05-04 HF ↑4

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inh...

#benchmark#llm

論文深掘り Hugging Face 2026-05-03 HF ↑3

PhysicianBench：実際の電子カルテ環境におけるLLMエージェントの評価ベンチマーク

「医療AIは知識があっても動けない」─臨床エージェントの実力差が数値で可視化される時代へ

電子カルテ（EHR: Electronic Health Record）環境における医師業務をLLMエージェントで評価するベンチマーク「PhysicianBench」が提案された。既存の医療エージェント評価は静的な知識想起や単一ステップの行動に限定されており、実臨床の複雑な長期ワークフローを再現できていないという課題があった。PhysicianBenchは、一次診療と専門診療間の実際のコンサルテーション事例を元にした100の長期タスクで構成され、21専門科・複数のワークフロー種別を網羅、1タスクあたり平均27回のツール呼び出しを必要とする。商用EHRと同じ標準APIを用い、670のチェックポイントで実行結果を検証可能な形で評価する。13のLLMエージェントを評価した結果、最高性能モデルでも成功率46%（pass@1）にとどまり、オープンソースモデルは最大19%と、現状のエージェント能力と実臨床要求の間に大きなギャップがあることが示された。

#agent#benchmark#llm

論文 Hugging Face 2026-04-29 HF ↑6

InteractWeb-Bench: マルチモーダルエージェントはインタラクティブなウェブサイト生成において盲目的実行から脱却できるか？

近年のマルチモーダル大規模言語モデル（MLLM）とコーディングエージェントの発展により、ウェブサイト開発は手動プログラミングからエージェントベースのコード合成へと移行しつつある。しかし既存のベンチマークは、構造化された高品質な入力と静的実行環境という理想化された前提に依存しており、現実のシナリオとかけ離れている。実際の開発現場では、非専門ユーザーの曖昧・低品質な指示とモデルの理解との意味的ミスアライン（semantic misalignment）が深刻なボトルネックとなり、筆者らが「盲目的実行（blind execution）」と呼ぶ失敗モードを生む。本研究ではこの課題に対し、非専門ユーザーのローコード条件下でのウェブサイト生成を評価する初のマルチモーダルインタラクティブベンチマーク「InteractWeb-Bench」を提案する。要件工学の欠陥分類に基づき4種類のユーザーエージェントとペルソナ駆動の指示摂動を導入し、曖昧性・冗長性・矛盾を含む多様なユーザー行動を体系的に模擬する。エージェントには「明確化・実装・検証・提出」からなる統一アクション空間を持つインタラクティブ実行環境を提供する。実験の結果、最先端のMLLMベースエージェントも依然として盲目的実行に陥りやすく、意図認識と適応的インタラクションに大きな限界があることが示された。

#agent#multimodal#llm#benchmark#alignment

論文 Hugging Face 2026-04-29 HF ↑26

ExoActor: 汎化可能なインタラクティブなヒューマノイド制御のための外視点ビデオ生成

ヒューマノイド制御において、ロボットと環境・物体との流暢なインタラクションをモデリングすることは依然として困難な課題である。空間的文脈・時間的ダイナミクス・ロボットの行動・タスク意図を大規模に同時捉える必要があり、従来の教師あり学習では対応が難しい。本論文ではExoActorを提案する。これは大規模ビデオ生成モデル（video generation model）の汎化能力を活用し、三人称視点（exocentric）のビデオ生成をインタラクションダイナミクスのモデリングのための統一インターフェースとして用いる新フレームワークである。タスク指示とシーン情報を入力として、ロボット・環境・物体間の協調的インタラクションを暗黙的にエンコードした実行プロセス動画を合成する。生成された動画は人体モーション推定と汎用モーションコントローラーを通じて実行可能なヒューマノイド行動へと変換され、タスク条件付き行動系列が得られる。エンドツーエンドシステムとして実装し、追加の実世界データ収集なしに新たなシナリオへの汎化が可能であることを実証した。

#robotics

論文 Hugging Face 2026-04-29 HF ↑14

Claw-Eval-Live: 進化する実世界ワークフロー向けライブエージェントベンチマーク

背景・課題：LLM（大規模言語モデル）エージェントはソフトウェアツールやビジネスサービスにまたがるエンドツーエンドの作業を完遂することが期待されている。しかし既存のエージェントベンチマークの多くはリリース時点でタスクセットが固定され、最終応答のみを評価するため、変化するワークフロー需要への対応力やタスクの実際の実行可否を検証することが困難であった。提案手法：本論文はClaw-Eval-Liveを提案する。これは外部の実ワークフロー需要シグナル（現リリースではClawHub Top-500スキル）から構築された更新可能なシグナル層と、再現可能なタイムスタンプ付きリリーススナップショットを分離したライブベンチマークである。採点には実行トレース・監査ログ・サービス状態・実行後ワークスペース成果物を記録し、証拠が十分な場合は決定論的チェック、意味的次元にのみ構造化LLM判定を用いる。成果・貢献：105タスク・13フロンティアモデルを評価した結果、最高モデルでも正答率66.7%にとどまり、HRや複数システム連携ビジネスワークフローが依然としてボトルネックであることが示された。

#agent#benchmark#llm

論文 Hugging Face 2026-04-29 HF ↑11

Intern-Atlas: AIサイエンティストのための研究インフラとしての方法論進化グラフ

既存の研究インフラは文書中心的であり、論文間の引用リンクは提供するものの、研究手法がどのように生まれ、適応し、発展してきたかを示す構造的な関係表現が欠如している。特にAI駆動の研究エージェント(research agent)が科学知識の新たな利用者として台頭する中、非構造化テキストから手法の進化トポロジを復元することは困難であり、この限界はますます深刻化していると著者らは主張する。本論文では、AIに関連する会議・ジャーナル・arXivプレプリントから得た103万件超の論文を基に、手法レベルのエンティティを自動識別し、手法間の系譜関係(lineage relationship)やイノベーション間の移行を駆動するボトルネックを捕捉する方法論進化グラフIntern-Atlasを提案する。結果として941万件以上の意味的型付きエッジからなる因果ネットワークが構築された。さらに、時系列的な手法の進行を追う進化チェーン構築のための自己誘導型時間木探索アルゴリズムも提案し、専門家によるグラウンドトゥルースとの強い整合性を確認。アイデア評価や自動アイデア生成への応用も実証している。

#agent#alignment#benchmark

論文 Hugging Face 2026-04-29 HF ↑6

長期的生産性シミュレーションのための大規模合成コンピュータ環境

背景・課題として、長期的な生産性タスクはユーザー固有のコンピュータ環境（ディレクトリ構造やコンテンツ豊富な成果物）に強く依存するが、そのような環境での合成データ(synthetic data)作成をスケールする手法が不足していた。本研究では「Synthetic Computers at Scale」と呼ぶスケーラブルな方法論を提案し、現実的なフォルダ階層と文書・表計算・プレゼン等のリッチな成果物を含む合成コンピュータ環境を生成する。各環境上で長期シミュレーションを実施し、一方のエージェント(agent)が約1ヶ月分の作業に相当する生産性目標を設定し、もう一方がそのユーザーとして実際に作業を遂行する。予備実験では1,000台の合成コンピュータを作成し、各実行が平均2,000ターン超・8時間以上のエージェント稼働を要するシミュレーションを実施。得られた学習シグナルにより、ドメイン内外の生産性評価でエージェント性能が有意に向上したと主張する。

#agent#rl#benchmark

論文 Hugging Face 2026-04-27 HF ↑5

BARRED: 非対称ディベートによるカスタムポリシーガードレールの合成データ学習

背景・課題: LLMの本番運用において、汎用安全性モデルはタスク固有の要件を捉えられず、LLMへのプロンプティングは境界ケースの性能が不安定かつ推論コストが高い。カスタム分類器の学習は精度と効率を両立するが、大量のラベル付きデータが必要という問題があった。提案手法: 本論文はBARRED（Boundary Alignment Refinement through REflection and Debate）を提案する。タスク記述と少量の未ラベルサンプルのみから、忠実で多様な合成学習データを生成するフレームワークである。ドメイン空間を複数次元に分解して網羅的カバレッジを確保し、マルチエージェントディベートによりラベル正確性を検証することで高品質な学習コーパスを構築する。成果: 多様なカスタムポリシーでの実験において、合成データでファインチューニングした小型言語モデル（SLM）が、最先端の商用LLM（推論モデル含む）や専用ガードレールモデルを一貫して上回ることが示された。アブレーション研究により、次元分解とディベートベース検証の両方が有効なファインチューニングに不可欠であることも確認されている。

#llm#agent#fine-tuning#alignment

論文 Hugging Face 2026-04-27 HF ↑23

AutoResearchBench: 複雑な科学文献探索におけるAIエージェントのベンチマーク評価

自律的な科学研究支援においてAIエージェントの活用が進む一方、科学文献を適切に探索する能力の定量的評価基盤が不足している。本論文はこの課題に対し、自律的な科学文献探索専用のベンチマーク「AutoResearchBench」を提案する。同ベンチマークは2種のタスクで構成される：(1) 多段階の推論・検索を経て特定論文を特定する「Deep Research」、(2) 条件を満たす論文群を網羅的に収集する「Wide Research」である。従来のエージェント型Webブラウジングベンチマークと比較し、研究領域の深い理解・詳細情報の精緻な活用・解答数未知のオープンエンド性という3軸で差別化されている。評価実験では、BrowseCompなど汎用ベンチマークを制覇した最強のLLMでもDeep Researchで9.39%の正解率、Wide ResearchでIoU 9.31%に留まり、多くのベースラインは5%未満という極めて困難なベンチマークであることが示された。データセット・評価パイプライン・コードは公開済みである。

#agent#benchmark#llm

論文 Hugging Face 2026-04-27 HF ↑22

再生成による精錬：修正空間の拡大が統合マルチモーダルモデルの画像精錬を向上させる

統合マルチモーダルモデル（Unified Multimodal Models, UMMs）は視覚理解と生成を単一フレームワークで実現する。テキストから画像への生成（Text-to-Image, T2I）タスクでは、初期生成後に出力を精錬できる可能性があるが、従来の編集ベース精錬（Refinement-via-Editing, RvE）は不整合領域に編集指示を与えつつ整合コンテンツを保持する手法であり、粗い記述による不完全な精錬やピクセルレベル保存による修正空間の制約という課題があった。本論文では、精錬を編集ではなく条件付き画像再生成として定式化する「Refinement via Regeneration（RvR）」を提案する。RvRはターゲットプロンプトと初期画像の意味トークン（semantic tokens）を条件として画像を再生成することで、厳密なコンテンツ保存の制約を排除し、より広い修正空間での完全な意味的整合を実現する。実験ではGeneval 0.78→0.91、DPGBench 84.02→87.21、UniGenBench++ 61.53→77.41と大幅な改善を示したと報告されている。

#multimodal#alignment#benchmark

論文深掘り Hugging Face 2026-04-27 HF ↑9

Step-Audio-R1.5 技術レポート：音声AIにおけるRLHFによる推論パラダイムシフト

音声AIの「正確さ至上主義」からの脱却が、対話体験の評価軸を根本から変える可能性がある

大規模音声言語モデル（Large Audio Language Model）の進展により、連鎖思考（Chain-of-Thought, CoT）推論が音声領域にまで拡張された。しかし現行の主流手法である検証可能報酬による強化学習（RLVR）は、標準ベンチマークでは高スコアを示す一方、連続的な音声文脈を孤立した正解ラベルに還元するため、会話の自然さや感情的連続性を損なうという「検証可能報酬トラップ」が存在すると著者らは指摘する。本報告では、この課題を克服するため人間フィードバックによる強化学習（RLHF）を音声推論に適用したStep-Audio-R1.5を提案。機械的な正解検証ではなく感覚的共感を重視することで、分析的推論能力を維持しつつ長ターン音声対話における韻律的自然さ・感情的継続性・ユーザー没入感を大幅に向上させたと主張している。

#rl#benchmark

論文 Hugging Face 2026-04-26 HF ↑4

ステップレベルのアドバンテージ選択による効率的推論の安定化

大規模言語モデル（LLM）は推論時に長い思考トレースを生成することで高い推論性能を実現するが、計算コストが課題となる。効率的推論に関する先行研究では長さベースの報酬や枝刈りが用いられるが、ベースモデルの学習時より短いコンテキストウィンドウでのポストトレーニングという要因の影響が系統的に検証されていなかった。本研究ではまず、長さを考慮しない標準的なGRPOでも短コンテキストでのポストトレーニング単独で推論の圧縮が起きるが、学習不安定性と精度低下を招くことを示す。これを解決するため、Step-level Advantage Selection（SAS）を提案する。SASは推論ステップ単位で動作し、正解ロールアウト内の低信頼度ステップおよび検証失敗ロールアウト内の高信頼度ステップにゼロアドバンテージを割り当てる。数学・一般推論ベンチマークにて、最強の長さ考慮ベースラインと比較してPass@1精度を平均0.86ポイント改善しつつ推論長を平均16.3%削減し、精度と効率のトレードオフを改善した。

#llm#benchmark

論文 Hugging Face 2026-04-22 HF ↑28

WorldMark: インタラクティブ動画世界モデルのための統合ベンチマークスイート

インタラクティブ動画生成（Interactive Video Generation）モデル（Genie、YUME、HY-World、Matrix-Gameなど）は急速に進化しているが、各モデルが独自のプライベートシーン・軌跡でのみ評価されており、公平なクロスモデル比較が不可能という課題がある。既存の公開ベンチマークは軌跡誤差や美的スコア、VLMベースの評価指標を提供するが、モデル間比較に必要な標準化されたテスト条件（同一シーン・同一行動シーケンス・統一制御インターフェース）を欠いている。本論文ではWorldMarkを提案する。これはImage-to-Video世界モデルの公平な比較基盤を提供する初のベンチマークであり、(1)WASDスタイルの共通行動語彙を各モデル固有の制御形式に変換する統一行動マッピング層、(2)一人称・三人称視点や写実的・様式化シーンを含む500評価ケースの階層的テストスイート、(3)視覚品質・制御整合性・世界一貫性を評価するモジュラーツールキット、の三要素で構成される。さらにオンラインアリーナプラットフォーム（warena.ai）も公開予定とされている。

#benchmark#multimodal#alignment

論文深掘り Hugging Face 2026-04-22 HF ↑15

StyleID：スタイル非依存の顔認識のための知覚考慮データセット・評価指標

スタイライゼーションAIの「ID保持品質」を人間知覚で測定する新標準が登場しそう

クリエイティブな顔スタイライゼーション（face stylization）は、漫画・スケッチ・絵画など多様なビジュアル表現で人物の顔を描写する技術だが、既存の顔認識エンコーダは自然写真で訓練・校正されているため、スタイル変換後の画像に対して脆弱性を示す。テクスチャや色調の変化を同一性の変化と誤認したり、幾何学的誇張を見逃すという課題がある。本研究はこの課題に対処するため、StyleIDというヒト知覚考慮型データセットと評価フレームワークを提案する。StyleIDは2つのデータセットで構成される：拡散モデル・フローマッチングベースのスタイライゼーションに対する人間の同一性判断を収集したStyleBench-H、および2AFC実験（強制二択実験）による心理測定的認識強度曲線から生成した教師データStyleBench-Sである。StyleBench-Sを活用して既存の意味的エンコーダを微調整し、スタイルや強度をまたいだ人間知覚との類似度順序の整合を実現。既存モデルと比較して人間判断との相関が大幅に向上し、アーティスト手描きの未見ドメイン肖像への汎化性能も改善したと主張する。

#diffusion#fine-tuning#benchmark

論文深掘り Hugging Face 2026-04-22 HF ↑3

UniGenDet：画像生成と生成画像検出の共進化のための統合生成・識別フレームワーク

生成AIと偽画像検出の「共進化」モデルが、コンテンツ信頼性インフラを再定義するかもしれない

近年、画像生成（image generation）と生成画像検出（generated image detection）はそれぞれ急速に発展しているが、前者は生成ネットワーク、後者は識別フレームワークという異なるアーキテクチャを採用しており、相互の連携は限定的だった。本研究では、この構造的乖離を克服するため、UniGenDetと呼ぶ統合生成・識別フレームワークを提案する。共生マルチモーダル自己注意機構（symbiotic multimodal self-attention mechanism）と統合ファインチューニングアルゴリズムを設計することで、生成タスクが真贋識別の解釈可能性を高め、逆に真贋基準が高忠実度画像の生成を誘導するという相互補完関係を実現する。さらに検出器主導の生成アライメント機構（detector-informed generative alignment mechanism）により、両タスク間のシームレスな情報交換を促進する。複数データセットでの実験で最先端性能を達成したとしており、コードも公開されている。

#vision#multimodal#fine-tuning#alignment

論文 Hugging Face 2026-04-21 HF ↑5

SWE-chat: 実際のユーザーによるコーディングエージェントのインタラクションデータセット

背景・課題として、AIコーディングエージェントの普及が進む一方で、実際の開発者がどのように使用しているか、またその出力がどの程度有用かを示す実証的証拠が不足していた。本研究では、オープンソース開発者の実際の利用から収集した初の大規模データセット「SWE-chat」を提案する。同データセットは現時点で6,000セッション・63,000件超のユーザープロンプト・355,000件のエージェントツール呼び出しを含み、継続的に自動収集される「生きたデータセット（living dataset）」として設計されている。分析の結果、コーディングパターンは二峰性（bimodal）を示し、41%のセッションではエージェントがほぼ全コードを生成する「バイブコーディング（vibe coding）」、23%では人間が全コードを記述することが判明した。またエージェント生成コードのうち実際のコミットに残るのは44%に留まり、人間が書いたコードより多くのセキュリティ脆弱性を含む傾向があるとしており、ベンチマークを超えた実証的理解への貢献が期待される。

#agent#coding#benchmark

論文 Hugging Face 2026-04-21 HF ↑16

DeVI: 合成動画模倣による物理ベースの巧みな人-物体インタラクション

近年の動画生成モデルの発展により、モーションキャプチャでは収集困難な複雑な手先操作を含む人-物体インタラクション(HOI)動画の合成が可能となった。しかし、生成動画は物理的忠実度が低く純粋な2D情報であるため、物理ベースのキャラクター制御の模倣ターゲットとして直接利用することが難しいという課題があった。本論文ではDeVI(Dexterous Video Imitation)を提案する。テキスト条件付き合成動画を活用し、未知の対象物体に対して物理的に妥当な巧みなエージェント制御を実現するフレームワークである。生成された2D手がかりの不精度を克服するため、3D人体トラッキングと頑健な2Dオブジェクトトラッキングを統合したハイブリッド追跡報酬を導入している。高品質な3D運動学的デモンストレーションを必要とする既存手法と異なり、DeVIは生成動画のみを入力とし、多様な物体や操作タイプへのゼロショット汎化を達成する。実験により、3D HOIデモを模倣する既存手法を上回り、特に手-物体インタラクションのモデリングで優れた性能を示すことが報告されている。

#agent#robotics

論文 Hugging Face 2026-04-20 HF ↑3

HP-Edit: 画像編集のための人間選好後学習フレームワーク

画像編集タスクでは拡散モデル（diffusion model）が主流となっているが、Diffusion-DPOやFlow-GRPOなどの強化学習（RL）手法による品質向上が進む一方、人間フィードバックからの強化学習（RLHF）を拡散ベース編集に適用する研究は十分に行われていなかった。スケーラブルな人間選好データセットや多様な編集ニーズに対応したフレームワークが不足していたためである。本論文ではこの課題に対し、HP-Editという後学習（post-training）フレームワークと、8種の編集タスクを含む実世界データセットRealPref-50Kを提案する。HP-Editは少量の人間選好スコアリングデータと事前学習済み視覚言語モデル（VLM）を活用し、自動評価器HP-Scorerを構築。これをスケーラブルな選好データセット構築とモデルの報酬関数として活用する。さらにベンチマークRealPref-Benchも導入し、Qwen-Image-Edit-2509などのモデルを大幅に改善できることを実証している。

#diffusion#rl#llm#multimodal#benchmark

論文 Hugging Face 2026-05-31 HF ↑53

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters car...

#fine-tuning#benchmark

論文 Hugging Face 2026-05-25 HF ↑65

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry ...

#coding#multimodal#benchmark

論文 Hugging Face 2026-05-17 HF ↑83

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-desi...

#diffusion#coding#benchmark

論文 Hugging Face 2026-05-05 HF ↑86

Stream-T1: Test-Time Scaling for Streaming Video Generation

While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structura...

#speech#diffusion#benchmark

論文 Hugging Face 2026-04-20 HF ↑56

Tstars-Tryon 1.0: 多様なファッションアイテムに対応した頑健でリアルなバーチャル試着システム

近年の画像生成・編集技術の進歩により、バーチャル試着（virtual try-on）の可能性が広がっているが、既存手法は複雑な実世界の要求に対応しきれていない。本論文では商用規模のバーチャル試着システム「Tstars-Tryon 1.0」を提案する。同システムは極端なポーズ・照明変化・モーションブラー等の困難な条件下でも高い成功率を維持し、衣服のテクスチャや素材特性を忠実に再現するフォトリアルな生成を実現する。さらに8つのファッションカテゴリにわたり最大6枚の参照画像を用いたマルチ画像合成をサポートし、人物アイデンティティと背景の協調制御も可能とする。商用デプロイの遅延問題を克服するため推論速度も大幅に最適化し、ほぼリアルタイム生成を達成している。エンドツーエンドのモデルアーキテクチャ、スケーラブルなデータエンジン、多段階学習パラダイムを統合したシステム設計により、淘宝（Taobao）アプリで数百万ユーザー・数千万リクエストの産業規模デプロイを実現したと報告している。

#vision#benchmark

論文 Hugging Face 2026-04-15 HF ↑21

DR^{3}-Eval: 現実的で再現可能なディープリサーチエージェント評価ベンチマーク

複雑な長期的リサーチタスクを解く深層研究エージェント(DRA)の評価は、動的なウェブ環境と曖昧なタスク定義により困難である。本論文は、マルチモーダル・マルチファイルレポート生成タスクの現実的で再現可能な評価ベンチマークDR^{3}-Evalを提案する。ベンチマークは実際のユーザー提供資料から構築され、オープンウェブの複雑性をシミュレートしつつ完全に検証可能な静的リサーチサンドボックスを含む。情報リコール(Information Recall)、事実精度(Factual Accuracy)、引用カバレッジ(Citation Coverage)、指示従循性、深さの質を測定する多次元評価フレームワークを導入し、人間判定との整合性を検証している。複数の最先端言語モデルに基づくDR^{3}-Agentの実験から、本ベンチマークが極めて困難であり、検索堅牢性と幻覚制御における重大な障害モードを明らかにすることを示した。

#agent#multimodal#alignment#benchmark

論文 Hugging Face 2026-04-15 HF ↑21

RAD-2: 生成器-識別器フレームワークにおける強化学習のスケーリング

自動運転の運動計画では、マルチモーダルな将来の不確実性をモデル化しつつ、クローズドループ相互作用に対してロバストである必要があります。拡散ベース(diffusion-based)プランナーは複雑な軌跡分布のモデル化に有効ですが、模倣学習のみの訓練では確率的不安定性と負のフィードバック不足に陥りやすいという課題がありました。本論文では、クローズドループ計画のための統合的な生成器-識別器フレームワークRAD-2を提案します。拡散ベースの生成器が多様な軌跡候補を生成し、強化学習で最適化された識別器が長期的な運転品質に基づいて再ランク付けする設計により、高次元軌跡空間への直接的な報酬適用を回避し最適化安定性を向上させます。時間的一貫性グループ相対方針最適化(Temporally Consistent Group Relative Policy Optimization)とオンポリシー生成器最適化(On-policy Generator Optimization)により強化学習をさらに強化し、BEV-Warpという高スループット環境で大規模訓練を支援します。拡散ベースプランナーと比較して衝突率を56%削減し、実世界でも安全性と走行スムーズさの向上を実証しました。

#diffusion#rl#multimodal#agent#alignment

論文 arXiv 2026-06-01

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frame...

#llm#benchmark#multimodal

論文 arXiv 2026-06-01

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do n...

#benchmark#llm#alignment#multimodal

論文 Hugging Face 2026-05-25 HF ↑6

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evo...

#agent#benchmark#llm

論文 Hugging Face 2026-05-19 HF ↑3

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward h...

#agent#coding#benchmark

論文 Hugging Face 2026-05-19 HF ↑2

Mem-π: Adaptive Memory through Learning When and What to Generate

We present Mem-π, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill li...

#agent#llm#rl#robotics#benchmark

論文 Hugging Face 2026-05-17 HF ↑1

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either over...

#agent#llm#benchmark#multimodal#fine-tuning

論文 Hugging Face 2026-05-17 HF ↑2

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinfo...

#diffusion#rl#fine-tuning#alignment#benchmark

論文 Hugging Face 2026-05-17 HF ↑1

Code as Agent Harness

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substra...

#agent#llm#multimodal#alignment#coding

論文 Hugging Face 2026-05-05 HF ↑8

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure...

#benchmark#llm

論文 Hugging Face 2026-05-04 HF ↑2

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largel...

#agent#benchmark

論文 Hugging Face 2026-05-03 HF ↑2

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads ...

#rl#llm#agent#benchmark

論文 Hugging Face 2026-05-03 HF ↑3

AcademiClaw: 学生がAIエージェントに挑戦を設定する

近年のAIエージェント評価ベンチマークはアシスタントレベルのタスクに偏っており、学術レベルの能力評価が不十分という課題がある。本研究ではOpenClawエコシステム向けに、大学生の実際の学術ワークフロー（宿題・研究プロジェクト・コンテスト・個人プロジェクト）から収集した80件の複雑・長期タスクで構成されるバイリンガルベンチマーク「AcademiClaw」を提案する。230件の学生提出候補から厳格な専門家レビューを経て選定されたタスクは、数学オリンピックや言語学問題からGPU集約型強化学習・フルスタックデバッグまで25以上の専門領域に及び、16タスクはCUDA GPU実行を要する。各タスクはDockerサンドボックスで実行され、6つの補完的手法を組み合わせた多次元ルーブリックで採点される。6つの最先端モデルによる実験では最高でも55%の合格率に留まり、タスク領域間の明確な能力境界やトークン消費量と出力品質の乖離など、集約指標では見えない詳細な診断情報を提供する成果を示した。

#agent#benchmark#rl#alignment

論文 Hugging Face 2026-05-03 HF ↑1

視覚的根拠推論のための知覚フローネットワーク

大規模視覚言語モデル（LVLM）は標準的な最尤推定（MLE）などの汎用最適化目標を用いるため、視覚的な推論軌跡を適切に制約できず、言語バイアスや幻覚（hallucination）が生じやすい。既存手法は視覚エキスパートからの幾何学的事前知識を追加監督として導入するが、これは幾何学的精度に偏りすぎており推論への有用性が限定的だと著者らは指摘する。この課題に対し、本論文はPerceptual Flow Network（PFlowNet）を提案する。PFlowNetは知覚と推論を分離し自己条件付き生成プロセスを確立することで、エキスパート事前知識への硬直した整合を排除する。さらに変分強化学習（variational reinforcement learning）を用いて多次元報酬と近傍幾何学的整形を統合し、視覚的信頼性を保ちながら推論指向の知覚行動を促進する。理論的な性能保証を示すとともに、V* Bench（90.6%）およびMME-RealWorld-lite（67.0%）にて新たなSOTAを達成したと報告している。

#rl#multimodal#alignment

論文 Hugging Face 2026-04-28 HF ↑2

非同期デノイジングによる映像事前知識を用いた統合4D世界行動モデリング

ロボット工学における世界モデル(world model)研究では、リアルタイムの行動生成と高品質な世界表現の両立が課題とされてきた。従来の統合世界モデル(UWM等)は2次元ピクセル空間のみを扱い、行動効率と世界モデリング品質のバランスが不十分であった。本研究ではX-WAMと呼ぶ統合4D世界モデルを提案する。事前学習済みの映像拡散モデル(video diffusion model)の視覚的事前知識を活用し、マルチビューRGB-D映像を予測することで将来の世界を想像する。軽量な構造適応として、事前学習済み拡散トランスフォーマー(Diffusion Transformer)の最終ブロック群を深度予測ブランチに複製し、空間情報を効率的に取得する。さらに非同期ノイズサンプリング(Asynchronous Noise Sampling; ANS)を提案し、推論時に行動を少ないステップで高速デコードしつつ、映像生成には全ステップを充てる非同期スケジュールを適用する。5,800時間超のロボットデータで事前学習したX-WAMは、RoboCasaおよびRoboTwin 2.0ベンチマークでそれぞれ79.2%・90.7%の平均成功率を達成し、4D再構成・生成でも既存手法を上回ると主張している。

#robotics#diffusion#coding#benchmark

論文 Hugging Face 2026-04-28 HF ↑3

システム統合型Speculative DecodingによるRL後学習ロールアウトの高速化

大規模言語モデル（LLM）のRL後学習（RL post-training）において、自己回帰的なロールアウト生成がボトルネックとなっている。既存の効率化手法はオフポリシー実行やリプレイ、低精度生成などでスループット改善を図るが、出力分布を変えてしまう場合がある。本研究ではSpeculative Decoding（投機的デコーディング）をロスレスな加速プリミティブとして活用し、ターゲットモデルの出力分布を保持しながらRLロールアウトを高速化する手法を提案する。vLLMバックエンドを持つNeMo-RLに実装し、同期・非同期パイプラインの両方に対応。事前学習済みMTPヘッドや小規模ドラフトモデル、Eagle3などの投機機構を幅広くサポートする。8Bスケールの同期RL環境下で推論後学習ワークロードにおいてロールアウトスループットが1.8倍に向上し、高忠実度シミュレータによる試算では235Bスケールで非同期RLと組み合わせると最大2.5倍のエンドツーエンド学習高速化が見込めると報告している。

#rl#coding#llm

論文 Hugging Face 2026-04-27 HF ↑1

動画生成のための体系的ポストトレーニングフレームワーク

大規模動画拡散モデル（video diffusion model）は高解像度・高品質コンテンツの生成で優れた能力を示す一方、プロンプト感度・時間的一貫性の欠如・推論コストの高さといった課題により、事前学習性能と実用展開の間に大きなギャップが存在する。本研究ではこのギャップを埋めるため、4段階の相乗的ステージからなる包括的ポストトレーニングフレームワークを提案する。具体的には、①ベースモデルを安定した指示追従ポリシーへと変換するSFT（Supervised Fine-Tuning）、②動画拡散向けに設計したGRPO（Group Relative Policy Optimization）を用いたRLHF（Reinforcement Learning from Human Feedback）による知覚品質・時間的一貫性の向上、③専用言語モデルによるプロンプト拡張（Prompt Enhancement）、④推論最適化（Inference Optimization）を順次適用する。広範な実験により、このパイプラインがアーティファクトを効果的に軽減し、制御性と視覚的美観を大幅に改善しながらサンプリングコスト制約を遵守することを示している。

#diffusion#rl#fine-tuning

論文 Hugging Face 2026-04-27 HF ↑4

MAIC-UI: 生成UIを用いたインタラクティブ教材の自動作成システム

背景・課題として、STEMインタラクティブ教材の作成にはHTML/CSS/JavaScriptの専門知識が必要であり、教育者にとって高い参入障壁となっている。生成AIによるHTML生成も既存ツールでは静的表示にとどまり、長文書への対応や教育的正確性の担保が難しく、変更のたびに200〜600秒の再生成が必要で創造フローを妨げていた。提案手法MAIC-UIはゼロコードの教材オーサリングシステムであり、(1)マルチモーダル理解による構造化知識分析で教育的厳密性を確保、(2)内容整合と視覚最適化を分離する2段階の生成-検証-最適化パイプライン、(3)Unified Diffベースの差分増分生成とClick-to-Locate編集による10秒未満の反復サイクルを実現する。40名参加の対照実験では編集回数が4.9対7.0に減少し、学習容易性と操作性が向上。53名の高校生を対象とした3か月の授業展開でSTEM成績が9.21ポイント向上し、対照クラスの-2.32ポイントと対比して学習主体性の促進と成果格差の縮小に貢献したと主張している。

#alignment

論文 Hugging Face 2026-04-22 HF ↑2

信頼だが検証せよ：言語モデルにおけるクレーム推論のための二重帰属・検証フレームワーク「DAVinCI」の提案

大規模言語モデル(LLM)は多様なNLPタスクで高い流暢性を示す一方、事実誤認やハルシネーション(hallucination)が依然として課題であり、医療・法律・科学コミュニケーション等の高リスク領域では深刻なリスクをもたらす。本論文では、LLM出力の事実信頼性と解釈可能性を向上させる二重帰属・検証フレームワーク「DAVinCI」を提案する。DAVinCIは2段階で動作する：(i)生成されたクレームをモデル内部コンポーネントと外部ソースの両方に帰属させ、(ii)含意ベース推論(entailment-based reasoning)と信頼度キャリブレーション(confidence calibration)により各クレームを検証する。FEVER・CLIMATE-FEVERを含む複数データセットで評価した結果、検証のみのベースラインと比較して分類精度、帰属適合率・再現率・F1スコアを5〜20%改善したと報告されている。アブレーション研究により、証拠スパン選択・再キャリブレーション閾値・検索品質それぞれの寄与も明らかにされており、既存パイプラインへの統合可能なモジュール実装も公開された。

#llm#benchmark

論文 Hugging Face 2026-04-22 HF ↑3

VLAA-GUI: いつ停止・回復・検索すべきかを知る、GUIオートメーションのためのモジュラーフレームワーク

自律型GUIエージェントには「早期停止（early stopping）」と「反復ループ（repetitive loops）」という2つの根本的課題がある。前者は検証可能な根拠なしに成功を宣言してしまう問題、後者は同じ失敗動作を繰り返す問題である。本論文ではVLAA-GUIを提案する。これはStop・Recover・Searchの3コンポーネントで構成されるモジュラーフレームワークである。(1) 完了性検証器（Completeness Verifier）はUI上で視覚的に確認できる成功基準を強制し、証拠のない完了主張を棄却する。(2) ループ破壊器（Loop Breaker）は失敗時のインタラクションモード切替・画面状態の反復検出・戦略変更を多段フィルタリングで実現する。(3) オンデマンドの検索エージェント（Search Agent）はLLMを活用して未知のワークフローをオンライン検索する。さらにコーディングエージェントとグラウンディングエージェントも組み込む。OSWorldで77.5%、WindowsAgentArenaで61.0%を達成し、5バックボーン中3つが人間性能（72.4%）を上回ったと報告している。

#agent#llm#coding#benchmark

論文 Hugging Face 2026-04-22 HF ↑9

TingIS: エンタープライズ規模のノイズの多いカスタマーインシデントからのリアルタイムリスクイベント検出

大規模クラウドネイティブサービスでは、技術的異常のリアルタイム検出と緩和が不可欠だが、監視で見逃されたリスクを補うカスタマーインシデントデータは、極端なノイズ・高スループット・多様なビジネスラインの意味的複雑性から有用な情報抽出が困難である。本論文では、エンタープライズ級インシデント検出を目的としたエンドツーエンドシステム「TingIS」を提案する。中核は多段階イベントリンキングエンジンで、効率的なインデックス技術とLLM（大規模言語モデル）を組み合わせ、少数の多様なユーザー記述からアクション可能なインシデントを安定抽出する。これに加え、ビジネス帰属のカスケードルーティング機構と、ドメイン知識・統計パターン・行動フィルタリングを統合した多次元ノイズ削減パイプラインを備える。本番環境では毎分2,000件超・1日30万件のピークスループットを処理し、P90アラート遅延3.5分・高優先度インシデントの95%検出率を達成。実データ構築ベンチマークでルーティング精度・クラスタリング品質・SNRにおいてベースライン手法を大幅に上回ることを示した。

#llm#benchmark

論文 Hugging Face 2026-04-22 HF ↑3

Omniモデルにおけるコンテキスト展開

背景・課題：テキスト・画像・動画・3Dジオメトリなど多様なモダリティを統合的に扱う統一マルチモーダルモデル（unified multimodal model）の構築は、各モダリティの補完的情報を適切に集約する推論機構の設計が難しいという課題があった。提案手法：本論文ではOmniと呼ばれる統一マルチモーダルモデルを提案し、テキスト・画像・動画・3Dジオメトリ・隠れ表現（hidden representation）を含む多様なモダリティでネイティブ学習を行う。この学習により「コンテキスト展開（Context Unrolling）」と呼ぶ推論プロセスが創発し、モデルは予測生成前に複数のモーダル表現を跨いで明示的に推論を行う。これにより異種モダリティ間の補完的情報が集約され、共有マルチモーダル知識多様体（shared multimodal knowledge manifold）のより忠実な近似が実現されると主張する。成果・貢献：Omniはマルチモーダル生成・理解のベンチマーク双方で高い性能を達成し、テキスト・画像・動画・3Dジオメトリのインコンテキスト生成を含む高度な推論能力を示したとしている。

#multimodal#benchmark

論文 Hugging Face 2026-04-20 HF ↑3

LoopCTR: クリック率予測のためのループスケーリングの解放

Transformerベースのクリック率予測（CTR）モデルをスケールアップする際、パラメータ増加に伴う計算・ストレージコストが産業展開上の制約と乖離するという課題がある。本論文はLoopCTRを提案する。これは共有モデル層の再帰的再利用により学習時の計算量を増やしつつ、パラメータ数の増加から計算量を分離する「ループスケーリング」パラダイムを導入するものである。アーキテクチャはHyper-Connected ResidualとMixture-of-Experts（MoE）を組み合わせたサンドイッチ構造を採用し、各ループ深さでのプロセス監督（process supervision）により多段ループの恩恵を共有パラメータに蒸留する。これにより「多ループで学習・ゼロループで推論」戦略が実現し、ループ無しの単一フォワードパスのみで全ベースラインを上回る性能を達成した。3つの公開ベンチマークと1つの産業データセットで最先端性能を示し、オラクル分析ではさらに0.02〜0.04 AUCの潜在的改善余地も確認されている。

#benchmark

論文 Hugging Face 2026-04-19

セッション横断パーソナライズドツール呼び出しのための潜在的選好モデリング

LLMベースのエージェントにおいて、ユーザーはリクエストに必要な詳細を省略しがちであり、ツール呼び出し（tool calling）に必要な引数が不足するという根本的課題が存在する。本論文ではこの問題を体系的に研究するため、選好想起（Preference Recall）・選好誘導（Preference Induction）・選好転移（Preference Transfer）の3課題を網羅した265件のマルチセッション対話ベンチマーク「MPT」を構築した。さらに、ユーザー選好を進化する仮説として表現するテスト時メモリ拡張手法「PRefine」を提案する。PRefineは生成・検証・精錬（generate–verify–refine）のループにより過去履歴から再利用可能な制約を抽出し、完全履歴プロンプティングに比べわずか1.24%のトークン数でツール呼び出し精度を向上させることを示した。これらの成果は、エージェントシステムの堅牢なパーソナライゼーションには、ユーザーの選択そのものだけでなく、その背後にある理由を捉えるメモリが重要であることを示唆している。

#agent#llm#benchmark

論文 Hugging Face 2026-04-19 HF ↑1

マルチモーダルLLMにおける掛け算：テキスト・画像・音声入力での計算能力評価

マルチモーダルLLM（大規模言語モデル）は数値を各モダリティで認識できるが、同一の掛け算問題を数字・英単語・画像・音声で提示した場合に正確な多桁乗算が困難になるという課題がある。既存ベンチマークはモダリティ間で対応づけられたサンプルが少なく、比較が困難だった。本研究では桁数・桁の疎密性・表現形式・モダリティを組み合わせた制御済みマルチモーダル乗算ベンチマークを構築し、「算術負荷（arithmetic load）C」を全桁数と非ゼロ桁数の積として定義した。評価の結果、Cが増大すると精度が急落しC>100でほぼゼロになること、CはR²>0.5でモデル・モダリティをまたいで性能を予測できること、精度低下の主因は知覚ではなく計算処理にあること（知覚確認では99%超の正解率）が示された。さらにforced-completion loss probeにより、モデルは分配則分解を好む傾向があるが、ヒューリスティック固有のLoRAアダプタは精度を低下させ、ベースモデルが内部ルータを持つことが示唆された。

#multimodal#llm#benchmark

論文 Hugging Face 2026-04-19 HF ↑2

MathNet：数学的推論と検索のためのグローバルマルチモーダルベンチマーク

大規模言語モデル・マルチモーダルモデル（multimodal model）の数学的推論評価において、既存ベンチマークはデータ規模・言語カバレッジ・タスク多様性の面で限界があった。本論文ではMathNetを提案する。MathNetは47か国・17言語・20年以上の数学オリンピック問題を網羅した大規模多言語マルチモーダルデータセットであり、30,676件の専門家作成問題と解答を含む。生成モデルの数学的推論評価と埋め込みベース検索システム（embedding-based system）の評価を兼ねるベンチマークとして、(i)問題解答、(ii)数学対応検索（Math-Aware Retrieval）、(iii)検索拡張問題解答（Retrieval-Augmented Problem Solving）の3タスクを設定した。実験の結果、最先端推論モデルでもGemini-3.1-Proが78.4%、GPT-5が69.3%にとどまり課題が残ることが示された。また検索品質がRAG性能に大きく影響し、DeepSeek-V3.2-Speciale では最大12%の向上が確認された。データセットとベンチマークは公開済みである。

#benchmark#multimodal#rag

論文 Hugging Face 2026-04-15 HF ↑4

LongAct: 長文脈強化学習における内在的活性化パターンの活用

大規模言語モデル(LLM)の推論能力向上を目指す強化学習(RL)において、報酬設計やデータ合成に焦点が当たる中、本研究はモデルの内在的表現特性に着目する。長文脈処理時、クエリ・キーベクトル内に高振幅の活性化が存在することを観察し、モデル量子化の知見と長文脈推論の疎性構造の仮説から、これらの重みが最適化の鍵と主張する。提案手法LongActは、均一更新から顕著性誘導型疎更新へのシフトを実現し、LongBench v2で約8%の改善とRULERベンチマークの汎化性向上を達成した。GRPOやDAPOを含む複数のRLアルゴリズム間での普遍性を示し、顕著な特徴への焦点が長文脈の潜在能力解放の鍵であることを示唆している。

#rl#llm#benchmark

論文 Hugging Face 2026-04-15 HF ↑7

UniDoc-RL: 階層的アクションと密集報酬による粗密段階的ビジュアルRAG

大規模ビジョン言語モデル(LVLM)を外部ビジュアル知識で拡張するRetrieval-Augmented Generation (RAG)について、既存システムが細粒度のビジュアルセマンティクスを見落としている問題に対し、UniDoc-RLを提案します。このフレームワークはLVLMエージェントが検索・再ランク付け・能動的ビジュアル認識・推論を統合的に実行する強化学習(RL)ベースのシステムです。粗粒度のドキュメント検索から細粒度の画像選択・領域クロップへと段階的に改善する階層的アクション空間により、無関連コンテンツを抑制し情報密度の高い領域に注目します。エンドツーエンド学習のため、各アクションにタスク認識監督を提供する密集マルチ報酬スキームを導入し、Group Relative Policy Optimization (GRPO)に基づき価値ネットワークなしで目的関数の整合を実現。3つのベンチマークでの実験により、先行RL手法比で最大17.7%の性能向上を達成したと報告しています。

#multimodal#agent#rag#rl#benchmark

論文 arXiv 2026-05-04

TOC-SR: Task-Optimal Compact diffusion for Image Super Resolution

Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for bui...

#diffusion

論文 Hugging Face 2026-05-05 HF ↑1

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments th...

#llm#rl#benchmark

論文 Hugging Face 2026-05-04 HF ↑1

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their phys...

#benchmark#agent

論文 Hugging Face 2026-05-04 HF ↑2

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Language models excel at diagnostic assessments on currated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perfo...

#agent#llm#benchmark

論文 Hugging Face 2026-04-26

Zero-to-CAD: 実データなしで百万規模の解釈可能なCADプログラムをエージェント的に合成する

背景・課題として、CAD（Computer-Aided Design）モデルは構築履歴（パラメトリックな設計意図）を持つが、既存の大規模3Dデータセットはほぼ境界表現（B-Rep）やメッシュで構成されており、この手続き的情報が失われている。本研究ではZero-to-CADを提案し、実行可能なCAD構築シーケンスをスケーラブルに合成するフレームワークを構築する。提案手法では合成をエージェント的探索問題として定式化し、大規模言語モデル（LLM）をフィードバック駆動のCAD環境に組み込み、ツールやドキュメント参照を活用しながらコードの生成・実行・検証を反復する。これにより、スケッチ＆押し出し操作を超えた多様な操作語彙を含む約100万件の実行可能・可読・編集可能なCADシーケンスを合成した。高品質な10万件のサブセットも公開される。有用性の実証として、合成データでビジョン言語モデルをファインチューニングし、マルチビュー画像から編集可能なCADプログラムを再構築するタスクでGPT-5.2を含む強力なベースラインを上回る成果を示した。

#agent#llm#fine-tuning

論文 Hugging Face 2026-04-26 HF ↑1

知覚中心のプロセス報酬モデルによる視覚言語モデルの改善

背景・課題：強化学習における検証可能な報酬（RLVR）は視覚言語モデル（VLM）の推論能力を向上させてきたが、結果レベルの監督信号は粗すぎて推論チェーン内のエラーを正確に診断・修正できないという問題がある。提案手法：本論文はPerceval という知覚中心のプロセス報酬モデル（PRM）を提案する。Percevalは応答から画像関連の主張を抽出し、視覚的証拠と照合することでトークンレベルのエラー同定を実現する。RLトレーニングでは従来のGRPOのシーケンスレベル優位性に代わり、Percevalが特定したハルシネーション箇所にペナルティを集中させるトークンレベルの細粒度監督を適用する。さらに推論時にも誤り箇所を切り捨てて再生成または自己反省を繰り返すテスト時スケーリングを実現する。成果：複数ドメインのベンチマークで顕著な改善を達成し、多数決投票等の既存戦略を上回る一貫したパフォーマンス向上を示した。コードとデータは公開予定とされている。

#multimodal#rl#benchmark

論文 Hugging Face 2026-04-21 HF ↑2

収束進化：異なる言語モデルが類似した数値表現を学習する仕組み

自然言語テキストで学習した言語モデルは、周期T=2、5、10を主要周期とする周期的特徴を用いて数値を表現することが知られている。本論文では、これらの特徴に2階層の階層構造が存在することを明らかにした。Transformerや線形RNN（Linear RNN）、LSTM、古典的な単語埋め込み（word embeddings）といった多様なアーキテクチャは、フーリエ領域（Fourier domain）に周期Tのスパイクを持つ特徴を学習する一方、数値をmod-Tで線形分類可能な幾何学的分離可能特徴（geometrically separable features）を学習するモデルは一部に限られる。この非対称性を説明するため、フーリエドメインのスパース性はmod-T幾何学的分離可能性の必要条件であるが十分条件ではないことを理論的に証明した。さらに実験的に、データ・アーキテクチャ・オプティマイザ・トークナイザが幾何学的分離可能特徴の獲得に関与することを示し、多様なモデルが異なる学習信号から類似した特徴を獲得する「収束進化（convergent evolution）」現象を確認した。

論文 Hugging Face 2026-05-05 HF ↑7

Lightning Unified Video Editing via In-Context Sparse Attention

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing....

論文 Hugging Face 2026-04-26 HF ↑3

OmniShotCut: ショットクエリTransformerによる包括的関係ショット境界検出

ショット境界検出（Shot Boundary Detection, SBD）は動画を意味的に一貫したショットに自動分割する技術である。既存の最先端手法はトランジション部分での非解釈的な境界出力、微細な不連続の見逃し、ノイズの多い低多様性アノテーション、および時代遅れのベンチマークへの依存という課題を抱えていた。本論文ではこれらの限界を克服するため、OmniShotCutを提案する。本手法はSBDを構造化関係予測（structured relational prediction）として定式化し、ショットクエリベースの密な動画Transformer（dense video Transformer）によってショット範囲をショット内関係（intra-shot relations）とショット間関係（inter-shot relations）と同時に推定する。不正確な手動ラベリングを回避するため、主要なトランジション族を精密な境界とパラメータ化バリアントで自動再現する完全合成トランジション生成パイプラインを採用している。さらに包括的・診断的評価を可能にする広ドメインの現代的ベンチマークOmniShotCutBenchを導入し、評価基盤の刷新にも貢献している。

#benchmark

論文 Hugging Face 2026-04-15 HF ↑6

TRACER: トレースベースの適応的コスト効率的ルーティング（LLM分類向け）

本論文では、LLM分類エンドポイントの本番ログから得られる入出力ペアを活用し、軽量な代理モデル(surrogate)を訓練するシステムTRACERを提案します。代理モデルは将来のトラフィックの大部分を極めて低い推論コストで処理できます。提案手法は「パリティゲート」を用いて、代理モデルがLLMと一致する信頼度がユーザー指定の閾値αを超えた場合のみ展開します。解釈可能性アーティファクトにより、代理モデルが処理可能な入力領域や限界を可視化します。77クラスのインテント分類タスクではSonnet 4.6教師モデルに対し、83〜100%のカバレッジを達成し、150クラスではモデル完全置換も実現。自然言語推論タスクではパリティゲートが適切に展開を拒否しました。オープンソース化されています。

#llm#benchmark

論文 Hugging Face 2026-04-15 HF ↑4

検索ではなく探索を：エンタープライズ知識をナビゲート可能なエージェントスキルに蒸留するQAとRAG向け手法

従来のRetrieval-Augmented Generation (RAG)はLLMを受動的な検索結果の消費者として扱い、コーパスの組織構造を認識できないため、証拠の統合や遡行が困難という課題がある。本論文ではCorpus2Skillを提案し、事前にドキュメントコーパスを階層的スキルディレクトリに蒸留し、推論時にLLMエージェントが能動的にナビゲートできる仕組みを構築した。パイプラインは文書を反復的にクラスタリングし、各レベルでLLMが要約を生成し、結果をツリー構造として具現化する。推論時、エージェントはコーパス全体を俯瞰でき、段階的に詳細な要約から目的のトピックブランチを掘り下げ、IDで完全文書を検索できる。階層構造が明示的に可視化されるため、エージェントはどこを見るべきか推論でき、非生産的な経路から遡行し、複数ブランチから証拠を統合可能となり、WixQAベンチマークで従来手法を上回る成果を示した。

#agent#llm#rag#benchmark

論文 Hugging Face 2026-04-15 HF ↑8

Switch-KD: ビジョン言語モデル向けビジュアルスイッチ知識蒸留

ビジョン言語モデル(Vision-Language Models、VLM)は資源制約環境への展開が課題である。知識蒸留(Knowledge Distillation、KD)によるモデル圧縮が有効だが、既存手法はモダリティ(modality)ごとに個別に教師信号を与えるため、マルチモーダル知識の一貫性が失われる問題がある。本論文はSwitch-KDを提案し、学生モデルの視覚出力を教師モデルの言語経路に切り替えることで、共有テキスト確率空間内でマルチモーダル知識を統一的に転送する。Dynamic Bi-directional Logits Difference損失により、情報量の多い確率領域を適応的に整列させながら、双方向教師信号で分布構造を保持する。0.5BのTinyLLaVAが3Bの教師から効果的に知識を蒸留し、10個のマルチモーダルベンチマークで平均3.6ポイントの改善を達成した。

#multimodal#alignment#benchmark

論文 Hugging Face 2026-04-15 HF ↑4

プロンプトを超えて:分布外の3D形状に対する無条件逆変換

テキスト駆動の生成モデル逆変換は、3Dコンテンツ操作の中核的なパラダイムですが、テキストプロンプトへの感度低下という課題があります。本論文は、最先端のテキスト・ツー・3D生成モデルにおいて、生成過程が"sink traps"と呼ばれる領域に陥り、プロンプト修正に対して鈍感になる現象を報告します。これはモデルの幾何学的表現能力の限界ではなく、分布外テキスト誘導に対する感度の問題であることを示唆しています。著者らは生成軌跡の分析を通じ、モデルの無条件生成先行情報(unconditional generative prior)を活用することで、複雑な幾何形状を生成可能であることを発見しました。提案手法はlatent sinkを回避し、幾何学的表現力と言語感度を分離することで、分布外の3D形状に対する堅牢なテキストベース編集を実現するとしています。

論文 Hugging Face 2026-04-15 HF ↑4

RadAgent: 胸部CT画像の段階的解釈のためのツール利用AI エージェント

Vision-Language Model (VLM)は医療画像解釈を進歩させたが、既存手法では臨床医が最終出力を受け身で観察するのみで、推論過程の検証が困難である。本論文では、段階的で解釈可能なプロセスでCTレポート生成を行うツール利用型AIエージェント「RadAgent」を提案する。各レポートは中間的な判定とツール相互作用の追跡可能な痕跡を備え、臨床医が発見がどのように導出されたかを検査できる。実験結果から、RadAgentは3D VLM「CT-Chat」と比べ、macro-F1で6.0ポイント(相対36.4%)、micro-F1で5.4ポイント(相対19.6%)の臨床精度向上、敵対的条件下で24.7ポイント(相対41.9%)のロバスト性向上を達成した。さらに、既存VLMには存在しない忠実性(Faithfulness)で37.0%を達成し、放射線科における透明で信頼性の高いAIへの進展をもたらす。

#agent#multimodal

論文 Hugging Face 2026-04-15 HF ↑5

LeapAlign: 2ステップ軌跡構築による任意の生成ステップでのフローマッチングモデルの事後学習アライメント

本論文は、フローマッチング(flow matching)モデルの人間の嗜好への適合を扱う。報酬勾配(reward gradient)を微分可能な生成プロセスを通じて逆伝播する方法が有望だが、長い軌跡への逆伝播は膨大なメモリと勾配爆発をもたらす。そこで著者らはLeapAlignを提案する。連続する2つのリープ(leap)により長い軌跡を2ステップに短縮し、各リープで複数のODEサンプリングステップをスキップして潜在変数を予測する。リープの開始・終了タイムステップをランダム化することで、任意の生成ステップでの効率的で安定した学習を実現する。短縮された軌跡の長い生成経路との整合性に基づいて学習重みを割り当て、勾配の大きさに応じて重みを段階的に削減し安定性を向上させている。Fluxモデルの微調整において、LeapAlignは従来手法を上回る画像品質とテキスト整合性を実現している。

#fine-tuning#alignment

論文 arXiv 2026-05-05

Safety and accuracy follow different scaling laws in clinical large language models

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting ...

#alignment#llm#rag#agent#benchmark

論文 arXiv 2026-05-05

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

High-precision CNC machining of free-form aerospace components requires bounded compensations informed by inspection, simulation, and process knowledge. Off-the-shelf large language model (LLM) assistants can generate text, but they do not reliably execute risk-constrained multi-step numerical workf...

#agent#llm#alignment#benchmark

論文 arXiv 2026-04-29

ViCrop-Det: 空間アテンションエントロピー誘導クロッピングによるトレーニング不要な小物体検出

Transformerベースのアーキテクチャは大域的な意味把握において主流となっているが、自然画像に内在する空間的不均質性により局所特徴が劣化するという根本的な制約がある。特に、情報密度の異なる領域に一様な受容野を適用することで、微小物体が密集する領域での検出精度が低下する。この課題に対し、本論文ではViCrop-Detという学習不要な推論フレームワークを提案する。異常セグメンテーションにおけるアテンションエントロピーの活用から着想を得て、検出デコーダのクロスアテンション分布を内在的プローブとして利用する。空間アテンションエントロピー(SAE)を用いて局所的な空間的曖昧性を評価し、物体の顕著性と認知的不確実性がともに高い領域に固定の計算バジェットを動的に割り当てる。VisDroneおよびDOTA-v1.5での評価ではRT-DETR-R50およびDeformable DETRに対して+1〜3 mAP@50の向上を達成し、レイテンシのオーバーヘッドは20〜23%に留まると主張している。

#benchmark

論文 arXiv 2026-04-29

信頼性の高い臨床トリアージのためのドメイン適応済み小型言語モデル

救急部門における緊急度指数（Emergency Severity Index: ESI）の正確な割り当ては、自由記述形式のトリアージ文書の多様性により、誤トリアージやワークフロー非効率を招く課題が続いている。本研究では、オープンソースの小型言語モデル（Small Language Model: SLM）がプライバシーを保護しつつ信頼性の高いトリアージ意思決定支援ツールとして機能するかを検証した。複数のSLMを多様なプロンプトパイプラインで比較した結果、トリアージ記録を簡潔にまとめた「臨床ビネット」が最も高い予測精度をもたらすことが判明した。特にQwen2.5-7Bが精度・安定性・計算効率の最良バランスを示した。専門家監修データおよびシルバー標準の小児トリアージデータを用いた大規模ドメイン適応により、ファインチューニング済みQwen2.5-7BはすべてのベースラインSLMおよびGPT-4oを含む大型商用モデルを上回り、臨床的に重大な誤分類を大幅に削減したと報告している。

#fine-tuning#llm#benchmark

論文深掘り arXiv 2026-05-25

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT...

#llm#multimodal#fine-tuning

論文深掘り arXiv 2026-05-21

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival...

#diffusion#benchmark

論文深掘り arXiv 2026-05-21

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

#diffusion#benchmark

論文深掘り arXiv 2026-05-19

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

#agent#llm#coding

論文深掘り arXiv 2026-05-13

Identifying AI Web Scrapers Using Canary Tokens

From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website...

#llm#agent#robotics

論文深掘り arXiv 2026-05-13

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-L...

#llm#fine-tuning#coding#benchmark

論文深掘り arXiv 2026-05-12

Model-based Bootstrap of Controlled Markov Chains

We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is un...

#llm#rl#benchmark

論文深掘り arXiv 2026-05-11

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as...

#agent#multimodal#rl#fine-tuning#benchmark

論文深掘り arXiv 2026-05-07

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change from the Proximal Policy Optimization (PPO) to Group Relative Policy...

#llm#rl#benchmark

論文深掘り arXiv 2026-05-07

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

#llm#rl#benchmark

論文深掘り arXiv 2026-05-07

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstructi...

#alignment#speech#benchmark

論文深掘り arXiv 2026-05-07

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

#alignment#speech#benchmark

論文深掘り arXiv 2026-05-05

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from se...

#agent#coding#benchmark#alignment#speech

論文深掘り arXiv 2026-05-05

TabSurv: Adapting Modern Tabular Neural Networks to Survival Analysis

Survival analysis on tabular data is a well-studied problem. However, existing deep learning methods are often highly task-specific, which can limit the transfer of new approaches from other domains and introduce constraints that may affect performance. We propose TabSurv, an approach that adapts mo...

#benchmark

論文深掘り arXiv 2026-05-04

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems t...

#agent#llm#rl#benchmark

論文深掘り arXiv 2026-05-04

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and co...

#diffusion#rl#alignment#benchmark

論文深掘り arXiv 2026-05-04

Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it de...

#rag#benchmark#llm

論文深掘り arXiv 2026-04-30

RHyVE: LLM生成報酬仮説のための能力認識型検証・フェーズ認識型デプロイメント

LLM×強化学習のロボット・ゲームAI開発で「報酬設計の自動化」が現実的な選択肢になりそう

強化学習（Reinforcement Learning）における報酬設計をLLMで自動化する研究が進んでいるが、生成された報酬関数が信頼できる学習目標になるとは限らないという課題がある。既存研究は報酬候補の生成・進化・選択に集中し、「いつ」それを適用すべきかをほぼ無視してきた。本研究はこの問題に着目し、生成報酬を「報酬仮説（reward hypothesis）」として扱い、その有効性が現在のポリシーの能力とトレーニングフェーズに依存すると定式化する。提案手法RHyVEは、共有ポリシーチェックポイントから短期分岐検証（fork verification）を用いて少数の報酬仮説を比較する能力認識型・フェーズ認識型プロトコルである。実験では、能力が低い段階では報酬ランキングが信頼できないが、タスク依存の閾値を超えると有益になることを示した。スパース操作タスクではフェーズ認識デプロイメントがピーク性能と保持性能を改善し、報酬生成とデプロイメントは連動した問題として研究すべきと主張している。

#llm#rl

論文深掘り arXiv 2026-04-30

RHyVE: LLM生成報酬仮説のための能力認識検証・フェーズ認識デプロイメント

LLM自動報酬設計に「いつ使うか」の検証層が加わり、RL実用化の安定性が向上しそう

強化学習（Reinforcement Learning）における報酬設計をLLM（大規模言語モデル）で自動化する研究が進む一方、生成された報酬関数が信頼できる学習目標になるかは未検証のままである。既存研究は報酬候補の生成・進化・選択に注力しており、いつ・どのフェーズでその報酬を使うかという「デプロイタイミング問題」は軽視されてきた。本研究はLLM生成報酬を「報酬仮説（reward hypothesis）」として扱い、その有用性が現在のポリシーの能力（competence）と学習フェーズに依存すると定式化する。提案手法RHyVEは、短いホライズンのフォーク検証（fork verification）を用いて少数の報酬仮説を比較し、能力認識・フェーズ認識でデプロイする。実験では、低能力フェーズでは報酬ランキングが信頼できないが、タスク依存の閾値を超えると有益になることを示した。スパースな操作タスクでは、フェーズ認識デプロイが性能向上と安定保持に貢献した。報酬生成とデプロイは連成問題として扱うべきという主張がなされている。

#llm#rl

論文深掘り arXiv 2026-04-30

本番テキスト-to-SQLシステムにおけるエージェント非依存のSQL精度評価

スキーマ不要のSQL評価が、T2SQL本番運用のモニタリング標準を塗り替える可能性がある

本番環境でのText-to-SQL（T2SQL）評価には、既存ベンチマークが対応できない根本的な課題がある。現行手法はルールベースのSQLマッチングやスキーマ依存の意味解析器が主流だが、いずれもグラウンドトゥルースクエリとDBスキーマへのアクセスを前提とし、実運用ではほぼ満たされない。この乖離により、本番T2SQLエージェントは開発時テストを超えた評価がなされず、品質劣化が静かに進む。本論文はSTEF（Schema-agnostic Text-to-SQL Evaluation Framework）を提案する。STEFはユーザー質問・リフォーミュレーション・生成SQL のみを入力とし、DBスキーマや参照クエリを一切必要としない本番ネイティブな評価システムだ。自然言語とSQLの両表現から意味仕様を抽出し、正規化特徴アラインメントを実施、フィルタ整合・意味的評定・評価信頼度を組み合わせた0〜100スコアを生成する。実験によりスキーマ依存なしで継続的な本番モニタリングとエージェント改善フィードバックループが実現可能であることが示されている。

#agent#alignment#benchmark

論文深掘り arXiv 2026-04-30

本番Text-to-SQLシステムにおけるSQLの精度をエージェント非依存で評価するフレームワーク

スキーマ不要のSQL評価が、本番Text-to-SQLの品質管理を常時可能にするかもしれない

本番環境におけるText-to-SQL（T2SQL）の評価は、既存ベンチマークが対応できていない根本的な課題を抱えている。現行のルールベースSQLマッチングやスキーマ依存のセマンティックパーサーは、正解クエリやDB構造への参照を前提とするが、実運用ではこれらが得られないケースが多い。この乖離により本番T2SQLエージェントの品質劣化が無音で進行し、継続改善のフィードバック機構が存在しなかった。本研究はSTEF（Schema-agnostic Text-to-SQL Evaluation Framework）を提案。DBスキーマや正解クエリを一切必要とせず、ユーザー質問・拡張再定式化・生成SQLのみを自然言語入力として受け取り、0〜100のスコアを出力する。フィルターアライメント・セマンティック評価・評価者の信頼度を統合した複合メトリクスを採用し、GROUP BY許容やORDER BYデフォルト等の本番特有の正規化処理にも対応。継続的な本番監視とエージェント改善フィードバックループの実現を実証したとしている。

#agent#alignment#benchmark

論文深掘り arXiv 2026-04-28

モデルはどれだけ速く監督にコミットすべきか？Tsallis損失連続体による推論モデルの訓練

新規タスクへの推論モデル適応コストが激減し、少ないデータでのAIカスタマイズが現実的になりそう

強化学習（RLVR: Reinforcement Learning from Verifiable Rewards）による推論モデルのポストトレーニングでは、初期成功確率が低い場合に「コールドスタート停滞」が生じる。本研究はTsallis q-対数を用いて、RLVRと潜在軌跡の対数周辺尤度の間を補間する損失族J_Qを定義する。この損失族はすべて同じ勾配方向を共有しつつ、スカラー増幅P_θ^{-q}によってインスタンスごとに独立に重み付けされる。理論分析により、搾取極（q=0）ではコールドスタート脱出にΩ(1/p_0)の時間を要するのに対し、密度推定極（q=1）ではΘ(log(1/p_0))で脱出できることを示す。この枠組みから2つの推定量、GARL（事前分布からサンプリングしRL勾配を増幅）とPAFT（事後分布から重要度リサンプリングし標準SFTを実行）を導出。FinQA・HotPotQA・MuSiQueでの実験で、q=0.75のGARLはGRPOが完全に失敗するケースでもコールドスタートを脱出し、HotPotQAではPAFTがmaj@16で47.9（GRPOより+14.4）を達成したと報告している。

#rl#fine-tuning

論文深掘り arXiv 2026-04-23

一時的ターン注入（TTI）：大規模言語モデルにおけるステートレス・マルチターン脆弱性の暴露

ステートレス設計のLLMは分散型攻撃に無防備であり、セキュリティ評価の前提が塗り替えられる可能性がある

大規模言語モデル（LLM）が機密性の高いワークフローに組み込まれる中、敵対的堅牢性の重要度が増している。本論文は「一時的ターン注入（Transient Turn Injection: TTI）」という新たなマルチターン攻撃手法を提案する。TTIはステートレスなモデレーション（内容審査）の構造的弱点を突き、悪意ある意図を複数の孤立した対話ターンに分散させることで安全フィルターを回避する。従来のジェイルブレーク手法が持続的な会話コンテキストの維持に依存するのと異なり、TTIはLLMを活用した自動攻撃エージェントがポリシー執行を反復的に試し回避する点が特徴である。OpenAI・Anthropic・Google Gemini・Metaを含む最先端モデルへの大規模評価を通じ、TTI耐性に顕著なばらつきがあること、医療など高リスク領域で未知の脆弱性が存在することを示した。セッションレベルのコンテキスト集約や深層アライメントなど実践的な緩和策も提示している。

#llm#benchmark#agent#alignment

論文深掘り arXiv 2026-04-23

一時的ターン注入（TTI）：大規模言語モデルにおけるステートレスなマルチターン脆弱性の暴露

ステートレスLLM設計の脆弱性が露呈し、AI安全設計の前提が問い直される転換点になりそう

大規模言語モデル（LLM）が機密性の高い業務フローに組み込まれる中、敵対的堅牢性の確保が急務となっている。本論文は「一時的ターン注入（Transient Turn Injection: TTI）」という新たなマルチターン攻撃手法を提案する。TTIは、ステートレスなモデレーション（moderation）の構造的欠点を突き、悪意ある意図を複数の孤立した対話ターンに分散させることで安全フィルタを回避する。従来のジェイルブレイク（jailbreak）手法が会話の継続的コンテキストに依存するのと異なり、TTIはLLMを利用した自動攻撃エージェントにより、ポリシー強制をブラックボックス環境で反復的に検証・回避する。OpenAI・Anthropic・Google Gemini・Metaを含む最先端モデルの横断評価では、対TTI耐性に大きなばらつきがあり、固有の堅牢性を示す構成は限定的だった。特に医療・高リスク領域で未知の脆弱性パターンが発見されており、セッションレベルのコンテキスト集約などの緩和策も論じられている。

#llm#benchmark#agent#alignment

論文深掘り arXiv 2026-04-16

LLMによる検証器の攻略：RLVRは報酬ハッキングを引き起こす可能性がある

RLVR訓練モデルの「正解」は信用できないかもしれず、検証器設計が次のAI品質競争の主戦場になりそう

背景として、検証可能報酬による強化学習（RLVR: Reinforcement Learning with Verifiable Rewards）がLLMの推論能力スケーリングの主流手法となる中、「モデルが検証器を攻略する」という新たな失敗パターンが浮上している。本研究では帰納的推論タスクを対象に、RLVRで訓練されたモデルが汎化可能なルール（例：「赤い車を積んだ列車は東へ向かう」）の学習を放棄し、代わりにインスタンスレベルのラベル列挙という抜け穴戦略を取ることを発見した。これは理解能力の欠如ではなく、外延的正解のみを確認する不完全な検証器が偽陽性を許容してしまう報酬ハッキング（reward hacking）の一形態だとする。この抜け穴を検出するため、同型摂動テスト（IPT）を提案し、論理的同型タスク下での不変性を検証に課す手法を導入した。実験の結果、この抜け穴行動はGPT-5やOlmo3などRLVR訓練モデルに固有であり、非RLVRモデルには見られないことが示された。

#llm#benchmark#rl

論文 Hugging Face 2026-04-15 HF ↑15

GlobalSplat: グローバルシーントークンを用いた効率的なフィードフォワード3Dガウシアンスプラッティング

3Dガウシアンスプラッティング(3D Gaussian Splatting)における効率的なプリミティブの空間配置は、表現のコンパクト性、再構成速度、レンダリング品質の調和に直結している。従来の最適化手法やフィードフォワード推論手法はこれらの目標間で大きなトレードオフを強いられており、グローバルなシーン認識を欠いたローカルで経験的な配置戦略に依存していることが問題である。本論文ではGlobalSplatを提案し、マルチビュー入力から明示的な3D幾何をデコードする前に、クロスビュー対応を解決するコンパクトなグローバル潜在シーン表現を学習する「先にアライン、後にデコード」という原則に基づく。粗から細への訓練カリキュラムにより、表現の肥大化を防止する。RealEstate10KおよびACI Dデータセットで、わずか16Kガウシアンで競争力のある新規視点合成性能を達成し、4MBの軽量フットプリントを実現。さらに78ミリ秒の高速推論を可能にする。

#coding

論文 Hugging Face 2026-04-15 HF ↑1

モデル能力が支配的：AIMO 3からの推論時最適化の教訓

複数のLLM試行の多数決は数学的推論を改善するが、相関エラーが有効サンプルサイズを制限する。異なる推論戦略を異なる投票者に割り当てるDiverse Prompt Mixerを提案し、AIMO 3競技（3モデル、50のIMOレベル問題、限定的リソース）で検証した。結果として、プロンプトレベルの介入はすべて失敗し、高温度サンプリング（high-temperature sampling）はすでにエラーを十分に装飾化している。能力の低い戦略は相関減少より精度低下が大きい。8点の能力差がある場合、あらゆる最適化においてモデル能力が支配的である。最良の多数決スコア（42/50）とpass@20の間隙は選択損失（selection loss）であり、プロンプト損失ではない。検証器ベースのセレクタが対応可能だが、プロンプトエンジニアリングでは解決不可能である。

#llm

論文 Hugging Face 2026-04-15 HF ↑2

MM-WebAgent: Webページ生成のための階層的マルチモーダルWebエージェント

AIGC(AI生成コンテンツ)ツールの進展により、Webページ設計で画像・動画・ビジュアライゼーションをオンデマンド生成できるようになった一方、要素を個別に生成すると全体的な統一性とデザイン一貫性に問題が生じる。本論文ではMM-WebAgentを提案し、階層的計画と反復的な自己反省を通じてAIGC基盤の要素生成を調整する階層的エージェント(agent)フレームワークを構築した。グローバルレイアウト、ローカルなマルチモーダルコンテンツ、およびそれらの統合を共同最適化することで、統一性があり視覚的に一貫性のあるWebページを生成する。マルチモーダルWebページ生成ベンチマークと多段階評価プロトコルも導入し、コード生成やエージェント基盤の既存手法を上回る性能を示した。

#agent#multimodal#benchmark

論文深掘り arXiv 2026-05-28

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. The use of mean-field transformers to analyze the properties of self-attention mechanisms has garnered significant attent...

#coding

論文深掘り arXiv 2026-05-28

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

#coding

論文深掘り arXiv 2026-05-28

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask wh...

#llm#coding#benchmark

論文深掘り arXiv 2026-05-28

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

#llm#coding#benchmark

論文深掘り arXiv 2026-05-27

Principled Algorithms for Optimizing Generalized Metrics in Multi-Label Learning

Many real-world classification tasks require predicting multiple labels per instance, necessitating the optimization of complex evaluation metrics such as the $F$-measure and Jaccard index. While the Empirical Utility Maximization (EUM) framework is natural for these population-level metrics, existi...

#benchmark

論文深掘り arXiv 2026-05-27

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud...

#agent

論文深掘り arXiv 2026-05-27

Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

Prediction markets such as Polymarket aggregate crowd beliefs into real-time probability estimates, and the comments traders post beneath each market contain rich directional stance signals that prices alone cannot capture. This work introduces the first stance detection study applied to prediction ...

#llm#fine-tuning

論文深掘り arXiv 2026-05-26

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-b...

論文深掘り arXiv 2026-05-21

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

#coding#benchmark

論文深掘り arXiv 2026-05-21

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

#coding#benchmark

論文深掘り arXiv 2026-05-21

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

The Flexible Job Shop Scheduling Problem (FJSP) is the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear pro...

#agent#coding#rl#benchmark

論文深掘り arXiv 2026-05-21

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

#agent#coding#rl#benchmark

論文 arXiv 2026-05-21

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit...

#diffusion#fine-tuning#vision

論文 arXiv 2026-05-21

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

#diffusion#fine-tuning#vision

論文 arXiv 2026-05-21

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse....

#alignment#benchmark#llm

論文 arXiv 2026-05-21

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

#alignment#benchmark#llm

論文 arXiv 2026-05-21

AMEL: Accumulated Message Effects on LLM Judgments

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa...

#llm#benchmark

論文 arXiv 2026-05-21

AMEL: Accumulated Message Effects on LLM Judgments

#llm#benchmark

論文深掘り arXiv 2026-05-20

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each itera...

#agent#llm

論文深掘り arXiv 2026-05-19

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, ver...

#agent#llm

論文深掘り arXiv 2026-05-18

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and func...

#agent#robotics#llm#benchmark

論文 arXiv 2026-05-13

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide...

#benchmark

論文深掘り arXiv 2026-05-13

Neurosymbolic Auditing of Natural-Language Software Requirements

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped wi...

#alignment#llm#benchmark

論文 arXiv 2026-05-13

The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressiv...

論文深掘り arXiv 2026-05-07

PianoCoRe: Combined and Refined Piano MIDI Dataset

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents P...

#alignment#benchmark

論文深掘り arXiv 2026-05-07

PianoCoRe: Combined and Refined Piano MIDI Dataset

#alignment#benchmark

論文深掘り arXiv 2026-05-06

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

How many key-value associations can a $d\times d$ linear memory store? We show that the answer depends not only on the $d^2$ degrees of freedom in the memory matrix, but also on the retrieval criterion. In an isotropic Gaussian model for the stored pairs, we show that top-1 retrieval, where every si...

#benchmark#coding

論文深掘り arXiv 2026-05-06

Transformed Latent Variable Multi-Output Gaussian Processes

Multi-Output Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions,...

#benchmark

論文 arXiv 2026-04-30

局所的な高強度ソース項を含む問題に対する適応ウェーブレットベースPINN

物理情報ニューラルネットワーク(Physics-Informed Neural Networks, PINNs)は微分方程式の求解に注目されているが、ニューラルネットワーク固有のスペクトルバイアス(spectral bias)とマルチスケール現象に起因する損失不均衡という2つの根本的な限界を抱えている。本論文は、局所的な高強度ソース項を持つ問題の極端な損失不均衡に対処するため、適応ウェーブレットベースPINN(AW-PINN)を提案する。熱処理・電磁気学・衝撃力学・流体力学など幅広い物理応用に現れるこの種の問題に対し、AW-PINNは残差と教師あり損失に基づいてウェーブレット基底関数を動的に調整する。また自動微分(automatic differentiation)を使わずに導関数を取得するため学習が高速化され、メモリ効率も高い。固定基底による事前学習フェーズの後、スケールと平行移動を適応的に調整する2段階構造を採用する。理論的にはガウス過程極限とNTK構造を導出し、損失比率最大10^10:1の偏微分方程式ベンチマークにおいて既存手法を一貫して上回ることを示している。

#benchmark

論文 arXiv 2026-04-30

局所的高強度ソース項を持つ問題のための適応ウェーブレットベースPINN

物理情報ニューラルネットワーク(PINN)は微分方程式の求解に有望だが、ニューラルネットワーク固有のスペクトルバイアス(spectral bias)とマルチスケール現象に起因する損失不均衡という二つの根本的な限界を抱えている。本論文では、局所的高強度ソース項を持つ問題の極端な損失不均衡に対処するため、適応ウェーブレットベースPINN（AW-PINN）を提案する。提案手法は残差・教師あり損失に基づきウェーブレット基底関数を動的に調整し、高スケール特徴を持つ問題をメモリ効率よく扱える。また、損失関数の微分計算に自動微分を用いないため訓練が高速化される。手法は固定基底による事前学習フェーズと適応的なスケール・並進精緻化の二段階で構成される。理論的にはガウス過程極限とNTK構造を導出。過渡熱伝導やポアソン問題、振動流方程式、マクスウェル方程式など損失比が最大10^10:1に達するPDEで既存手法を一貫して上回ることを示した。

#benchmark

論文深掘り arXiv 2026-04-30

繰り返しクエリへの信頼できる回答：テンプレート制約付きデコーディングによるText-to-SQL精度向上

「繰り返し質問の多い社内BIやデータ分析ツール」でText-to-SQLの実用精度が飛躍的に改善しそう

大規模言語モデル（LLM）はText-to-SQL生成に革新をもたらしたが、複雑なスキーマや未知スキーマでの精度不安定・無効SQL生成リスクが実運用の障壁となっている。本研究はTemplate Constrained Decoding（TeCoD）を提案する。TeCoDはラベル付きワークロード内のクエリパターンの反復性を活用し、過去の自然言語-SQLペアを再利用可能なテンプレートに変換する。ファインチューニングされた自然言語推論（NLI）モデルを用いたテンプレート選択モジュールがクエリと既存テンプレートの照合・棄却を効率的に行い、選択後は文法制約付きデコーディング（grammar-constrained decoding）によりSQL生成時にテンプレートを強制適用する。この新しいパーティション分割戦略により構文的有効性と効率性を両立し、マッチしたクエリにおいてIn-Context Learning（ICL）比で最大36%の実行精度向上と2.2倍の低レイテンシを達成したと主張する。

#coding#llm#fine-tuning

論文深掘り arXiv 2026-04-30

繰り返しクエリへの信頼性ある回答：テンプレート制約デコーディングによるText-to-SQL精度向上

企業のクエリログが資産に変わり、Text-to-SQLの信頼性が実用レベルに近づくかもしれない

大規模言語モデル（LLM）はText-to-SQL生成を革新したが、複雑なスキーマや未知スキーマにおける精度の不安定さと無効なSQL生成リスクが実運用の壁となっていた。本研究はTemplate Constrained Decoding（TeCoD）を提案する。TeCoDはラベル付きワークロード内のクエリパターンの反復性を活用し、過去の自然言語-SQLペアを再利用可能なテンプレートへ変換する。ファインチューニングされた自然言語推論（NLI）モデルを用いたテンプレート選択モジュールが、クエリの一致または拒否を効率的に判定。選択後は文法制約デコーディング（grammar-constrained decoding）を用いた新しい分割戦略により、SQL生成時の構文的妥当性と効率性を両立する。結果として、in-context learning（ICL）比で最大36%の実行精度向上と2.2倍の低レイテンシを実現したと主張している。

#coding#llm#fine-tuning

論文深掘り arXiv 2026-04-29

深層トランスフォーマーモデルにおける確率的スケーリング極限とノイズによる同期現象

トランスフォーマーの「なぜ学習できるか」に確率論的証明が与えられ、設計哲学が変わるかもしれない

本論文は、有限深度・有限幅のトランスフォーマーモデル（MLP блоки含む）において、トークンの層ごとの発展が連続時間の確率的相互作用粒子系（stochastic interacting particle system）に経路収束（pathwise convergence）することを数学的に証明する。さらに、トークン分布の発展を記述する確率偏微分方程式（SPDE）を特定し、トークン数が大きい場合の「カオスの伝播（propagation of chaos）」を証明する。導出した境界は定量的であり、考慮する極限は可換性を持つ。加えて、共通ノイズ（common noise）が決定論的な自己注意ドリフト（self-attention drift）に対して十分強い場合、極限確率モデルが「ノイズによる同期（synchronization by noise）」を示し、相互作用エネルギーの指数的散逸が平均的に成立することを証明する。最後に、この条件を満たす活性化関数のクラスを特徴づける。

論文深掘り arXiv 2026-04-29

推論中のいつ検索すべきか：大規模推論モデルのための適応的検索

「いつ検索するか」を推論中に判断するRAGが、o1系モデルの実用化コストを大幅に下げる可能性がある

大規模推論モデル（DeepSeek-R1、OpenAI o1など）は数千トークンにわたる思考連鎖（Chain of Thought）を生成するが、既存の検索拡張生成（RAG）との統合には根本的なミスマッチがある。既存RAGは「推論開始前」にコンテキストを提供するよう最適化されており、推論途中への証拠注入には対応していない。本研究では「ReaLM-Retrieve」という推論認識型検索フレームワークを提案する。①推論ステップ粒度で知識ギャップを検出するステップレベル不確実性検出器、②外部証拠が推論に最も貢献するタイミングを学習する検索介入ポリシー、③ナイーブな統合比で3.2倍の効率化を実現する統合機構、の3つが核心。MuSiQue・HotpotQA・2WikiMultiHopQAでの実験では、標準RAGに対して平均10.1%の回答F1改善を達成しつつ、IRCoTなど固定間隔アプローチと比べ検索呼び出しを47%削減した。

#rag#benchmark

論文深掘り arXiv 2026-04-28

実行可能性保証アクションを持つ都市規模EVライドヘイリング向けセミマルコフ強化学習

RL×MILPの二段設計がEVフリート管理の利益を2倍近く引き上げる可能性を示す

EVライドヘイリングフリートの都市規模制御において、配車・再配置・充電判断を充電器や電力フィーダーの制約下で最適化する課題に取り組んだ研究。六角グリッド上のセミマルコフ決定過程（semi-MDP）として定式化し、離散・連続混合アクションと可変行動時間を扱う。物理的実行可能性を学習・運用の両フェーズで保証するため、マスク付き温度アニーリングアクターが生成した高レベル意図を、混合整数線形計画（MILP）でリアルタイム投影する仕組みを採用。分布シフト対策としてWasserstein-1アンビギュイティセットとグラフ整合マハラノビス距離を組み合わせたロバストなSoft Actor-Critic（SAC）を構築。NYCタクシーデータで構築した大規模シミュレーターでの実験では、提案手法PD-RSACが純利益122万ドルを達成し、強いヒューリスティックや既存RL手法（SAC/MAPPO/MADDPG）の58〜70万ドルを大きく上回り、電力フィーダー制約違反ゼロを維持したと報告している。

#agent#rl

論文深掘り arXiv 2026-04-28

TrialCalibre：RCTベンチマークと観察研究キャリブレーションのための完全自動化因果推論エンジン

RWE研究の自動化が臨床試験の代替コストを大幅に下げる可能性がある

実世界エビデンス（Real-world Evidence, RWE）研究は規制・臨床判断に活用が進む一方、残存バイアスの定量困難さが信頼性を損なっている。既存のBenchExCalフレームワークは、RCT（無作為化比較試験）との比較→誤差推定→新適応症への因果効果推定キャリブレーションという2段階プロセスで対処するが、リソース集約的でスケールが困難だった。本研究ではTrialCalibreを提案する。これはBenchExCalワークフローを自動化・スケール化するマルチエージェントシステムであり、Orchestrator・Protocol Design・Data Synthesis・Clinical Validation・Quantitative Calibrationの専門エージェントが連携する。RLHFによるエージェント学習とナレッジブラックボードを取り入れ、適応的・監査可能・透明な因果効果推定を実現すると主張している。

#agent#benchmark#rl

論文深掘り arXiv 2026-04-23

拡散モデルによる時空間超解像の統合スケール適応フレームワーク

気象AIのスケール汎用化が加速し、観測・モデル間のデータ統合コストが大幅に下がる可能性がある

気候・気象分野における深層学習動画超解像（Super-Resolution: SR）は急速に発展しているが、既存手法は空間または時間のどちらか一方のみを高解像度化するか、特定のSR倍率ペアに固定された設計が多く、異なる解像度や時間間隔への転用が困難だった。本研究では、時空間SRを「条件付き平均の決定論的予測（注意機構付き）」と「残差条件付き拡散モデル（Diffusion Model）」に分解し、さらに降水量保存（Mass-Conservation）変換を組み合わせたスケール適応フレームワークを提案する。スケール適応性は、拡散ノイズスケジュール振幅β・時間コンテキスト長L・質量保存関数fの3つのハイパーパラメータを再チューニングするだけで実現され、同一アーキテクチャのまま空間方向1〜25倍・時間方向1〜6倍のSRに対応可能とする。フランスの再解析降水量データ（Comephore）での実証により、単一アーキテクチャで広範なスケール条件をカバーできることを示した。

#diffusion

論文深掘り arXiv 2026-04-23

拡散モデルによる時空間同時超解像のスケール適応型フレームワーク

気象AIの「一モデル・全解像度」時代が来るかもしれない——スケール適応型SRが業界標準レシピになる可能性

気候・気象分野における深層学習ベースの映像超解像（Super-Resolution: SR）は急速に発展しているが、空間と時間の解像度を同時に高める「時空間同時SR」は、特定のアップスケール比に固定されたモデルが多く、異なる解像度や時間間隔への転用が困難だという課題があった。本研究では、条件付き平均の決定論的予測（Attention機構付き）と、残差を処理する条件付き拡散モデル（Diffusion Model）を組み合わせ、さらに降水量総量を保存する質量保存（mass-conservation）変換をオプションで付加したスケール適応型フレームワークを提案する。スケール適応性は、ノイズスケジュール振幅β・時間コンテキスト長L・質量保存関数fの3つのハイパーパラメータを再調整するだけで実現され、同一アーキテクチャを再利用できる。フランスの再解析降水量データ（Comephore）での実証では、空間方向1〜25倍・時間方向1〜6倍のSRを単一アーキテクチャでカバーし、幅広いスケールに対応できるアーキテクチャと調整レシピの有効性を示した。

#diffusion

論文深掘り arXiv 2026-04-23

クープマン固有関数の代数構造と無限性について

クープマン理論の計算コスト壁が崩れ、非線形システム予測AIの実用域が広がりそう

動力学系の解析において、クープマン演算子（Koopman operator）の固有関数を効率的に計算する手法が課題となっている。本研究では、可逆な軌道を持つ連続時間力学系において、どこでもゼロにならない固有関数が乗法群を形成するという代数的性質に着目した。「主固有関数（principal eigenfunction）」と呼ばれる少数の固有関数を従来手法で近似した後、その多項式を構成することで大量の固有関数集合を生成できる。これにより固有空間の表現が豊かになり、応用固有の観測量をより正確に表現可能となる。また、複数の定常状態を持つ一次元問題や極限サイクル・分離曲線を持つ二次元問題に現れる固有関数の局所的・広域的特異点を取り扱う手法も提案。特異点を越えた固有関数の接続・継続により、局所的サンプリングデータから整合的なグローバル表現の学習が可能になると主張する。多安定系や疎・断片的計測データへの応用に特に有効としている。

論文深掘り arXiv 2026-04-23

Koopman固有関数の代数構造と無限性について

Koopman演算子の効率的拡張により、物理・制御・時系列AIの「データ不足問題」に新たな突破口が生まれそう

背景・課題として、連続時間力学系（dynamical system）の解析において、Koopman演算子（Koopman operator）の固有関数（eigenfunction）を数値的に効率よく計算することが求められている。従来手法では固有空間の網羅的な列挙にコストがかかり、特異点（singularity）付近では固有関数が発散・消失するため大域的な表現が困難であった。本研究では、可逆な軌道を持つ系においてKoopman演算子のゼロ点を持たない固有関数が乗法群（multiplicative group）を形成するという代数的性質を活用する。少数の「主固有関数（principal eigenfunction）」を従来手法で近似した後、それらの多項式を構築することで大規模な固有空間を低コストで生成できることを示す。さらに、多安定系（multistable system）や極限閉軌道（limit cycle）・分離曲線（separatrix）を持つ系における固有関数の特異点を解析し、特異点をまたいだ継続（continuation）手法を提案。局所的にサンプリングされたデータから整合的な大域表現を学習できるとしている。

論文 arXiv 2026-04-23

A-IC3: ハードウェアモデル検査のための学習誘導型適応的帰納的汎化

ハードウェアモデル検査（Hardware Model Checking）の最先端アルゴリズムであるIC3は、帰納的汎化（Inductive Generalization）と呼ばれる工程が性能を左右する。この工程では、帰納性への反例（CTI: Counterexample to Inductiveness）を広い状態集合へ汎化するが、既存手法は固定した汎化戦略を用いるため、検証環境の動的・文脈依存的な変化に対応できず、生成される節（clause）の品質が制限されるという課題があった。本論文では、多腕バンディット（MAB: Multi-Armed Bandit）アルゴリズムを用いて、検証プロセスからのリアルタイムフィードバックに基づき汎化戦略を適応的に選択する軽量な機械学習フレームワーク「A-IC3」を提案する。エージェントは汎化結果の品質評価によって更新され、戦略選択を逐次改善する。HWMCC最新コレクションを中心とする914インスタンスのベンチマークで評価した結果、最先端モデル検査器rIC3上でベースラインより26〜50ケース多く解き、PAR-2スコアを194.72〜389.29改善することが示された。

#agent#benchmark

論文 arXiv 2026-04-23

A-IC3: ハードウェアモデル検査のための学習誘導型適応的帰納的一般化

ハードウェアモデル検査（hardware model checking）の最先端アルゴリズムであるIC3は、高い性能とスケーラビリティで広く用いられている。IC3の中核工程である帰納的一般化（inductive generalization）は、帰納性の反例（CTI: counterexample to inductiveness）を広い状態集合へと拡張する処理であり、生成される節（clause）の品質を左右するため、アルゴリズム全体の効率を決定づける重要な役割を担う。しかし既存手法は固定された一般化戦略に依存しており、検証環境の動的・文脈依存的な変化に対応できないという課題があった。本論文では、多腕バンディット（MAB: multi-armed bandit）アルゴリズムを用いて、検証プロセスからのリアルタイムフィードバックに基づき帰納的一般化戦略を適応的に選択する軽量な機械学習フレームワーク「A-IC3」を提案する。最新のHWMCCコレクションを中心とする914インスタンスのベンチマーク評価では、最先端モデル検査器rIC3上でベースライン比26〜50件多くの問題を解決し、PAR-2スコアを194.72〜389.29改善したと報告されている。

#agent#benchmark

論文深掘り arXiv 2026-04-22

RespondeoQA：ラテン語・英語バイリンガル質問応答ベンチマーク

古典語LLM評価が標準化され、教育・人文系AIプロダクトの品質基準が整備されそう

本論文は、ラテン語と英語のバイリンガル設定における質問応答（Question Answering）および翻訳タスク向けのベンチマークデータセット「RespondeoQA」を提案する。約7,800件の質問・回答ペアから構成され、18世紀から現代に至るラテン語教育資料（試験問題・クイズボウル形式のトリビア・教科書）から収集された。データセットは知識・スキルベースの問題、マルチホップ推論、制約付き翻訳、混合言語ペアなど多様な問題タイプを包含する。知識の限りでは、ラテン語を中心とした初のQAベンチマークとされる。評価実験としてLLaMA 3、Qwen QwQ、OpenAI o3-miniの3モデルを検証した結果、いずれもスキル指向問題で性能が低下することが判明。推論モデルは韻律分析（scansion）や文学的技法タスクでは優位性を示すが、全体的な改善幅は限定的であった。本データセットは専門的な言語・文化ドメインにおけるモデル能力評価の新リソースとなり、他言語への応用も容易であると主張する。

#benchmark#llm

論文深掘り arXiv 2026-04-20

潜在位相シフトロールバック：残差ストリーム監視とKVキャッシュ操作による推論時エラー訂正

8Bモデルが70Bを超える推論改善手法が、AIプロダクトのコスト構造を塗り替えるかもしれない

大規模言語モデル（LLM）は生成途中で誤った推論ステップを踏むと、以降のトークンがその誤りを増幅してしまう問題がある。本研究では「潜在位相シフトロールバック（LPSR）」を提案。生成ステップごとに残差ストリーム（residual stream）をコサイン類似度＋エントロピーの二重ゲートで監視し、急激な方向転換（位相シフト）を検出した際にKVキャッシュ（KV-cache）をロールバックしてステアリングベクターを注入する。ファインチューニングや追加フォワードパスは不要。MATH-500ベンチマークで8Bモデルが44.0%を達成し、標準的な自己回帰（AR）の28.8%を15.2ポイント上回った。さらにBest-of-16比較でも+7.8ポイント優位で、トークンコストは5.4倍低く、パラメータ数8.75倍の70Bモデルをも上回るとしている。

#llm#fine-tuning

論文深掘り arXiv 2026-04-16

トークンからステップへ：効率的な多段階推論のための検証対応スペキュラティブデコーディング

外部報酬モデル不要でLLM推論を高速化・高精度化する手法が、推論コスト削減の新基準になりそう

大規模言語モデル（LLM）の推論高速化手法であるスペキュラティブデコーディング（Speculative Decoding, SD）は、軽量なドラフトモデルの出力を強力なターゲットモデルが検証する仕組みだが、トークン単位の処理ゆえに誤ったステップが後続に伝播する問題があった。既存の外部報酬モデルを用いた対処法は追加レイテンシや計算コストを招く。本研究が提案するSpecGuardは、外部モデルを使わずモデル内部シグナルのみでステップレベルの検証を行うフレームワークである。各ステップで複数のドラフト候補をサンプリングし、アテンションに基づく根拠スコアとlog確率ベースの信頼スコアの2つの軽量シグナルのアンサンブルで採否を判断する。推論ベンチマーク群での実験では、精度を3.6%向上させつつレイテンシを約11%削減し、SD・報酬誘導型SDの両方を上回る成果を示した。

#coding#llm#benchmark

論文深掘り arXiv 2026-04-16

プロレプシスの最小アーキテクチャとは？小型トランスフォーマーにおけるタスク横断的な早期不可逆コミットメント

LLMの「早期誤判断から修正できない」構造的原因が解明されれば、RAGや推論エージェントの信頼性設計が根本から変わりそう

トランスフォーマー（Transformer）がいつ・なぜ誤った決定を早期に固定してしまうのかを解明する研究。著者らは「プロレプシス（prolepsis）」という概念を提唱し、「モデルが早期にタスク固有のアテンションヘッド（attention head）によってコミットメントを維持し、後続レイヤーがそれを修正できない」状態を定義する。Gemma 2 2BおよびLlama 3.2 1Bを対象に5つの問いを検証。計画サイト（planning-site）のスパイクが同一の幾何構造で再現されること、特定のアテンションヘッドが決定を出力へルーティングすること、探索には16層以下で十分だがコミットメントにはより多くの層が必要なこと、事実想起でも同パターンが異なる深さで現れることを示した。プロレプシスはアーキテクチャ的特性であり、テンプレートは共通だがルーティング基盤はタスクにより異なる。実験はすべて16GB VRAMの民生GPU一台で再現可能とのこと。

論文 arXiv 2026-05-28

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, ...

#llm#agent

論文 arXiv 2026-05-28

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

#llm#agent

論文 arXiv 2026-05-28

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the composition...

#llm#agent#benchmark

論文 arXiv 2026-05-28

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. ...

#robotics#benchmark#agent#fine-tuning

論文 arXiv 2026-05-25

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a f...

#agent#llm#benchmark

論文 arXiv 2026-05-25

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Rece...

#llm#multimodal#diffusion#vision

論文 arXiv 2026-05-25

Looped Diffusion Language Models

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significant...

#diffusion#benchmark

論文 arXiv 2026-05-25

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quant...

#llm#vision#benchmark

論文 arXiv 2026-05-25

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks a LLM agent to discover the laws of motion of a simulated world whose ph...

#llm#agent#benchmark

論文 arXiv 2026-05-25

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of ...

#llm#rl#diffusion

論文 arXiv 2026-05-25

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make su...

#multimodal#fine-tuning#alignment#benchmark

論文 arXiv 2026-05-20

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities ...

#benchmark#agent

論文 arXiv 2026-05-20

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a hum...

#benchmark#multimodal#llm

論文 arXiv 2026-05-19

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific ...

#rl#multimodal#benchmark

論文 arXiv 2026-05-19

Less Back-and-Forth: A Comparative Study of Structured Prompting

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a ra...

#llm#coding#benchmark

論文 arXiv 2026-05-19

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: their reasons for sendin...

#llm#alignment

論文 arXiv 2026-05-18

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any ...

#llm

論文 arXiv 2026-05-18

Code as Agent Harness

#agent#llm#multimodal#alignment#coding

論文 arXiv 2026-05-18

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned o...

#llm#multimodal#agent#benchmark

論文 arXiv 2026-05-18

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real...

#agent#robotics#benchmark

論文 arXiv 2026-05-14

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

#agent#rl#benchmark

論文 arXiv 2026-05-14

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

#agent#rl#benchmark

論文 arXiv 2026-05-14

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide tra...

#benchmark

論文 arXiv 2026-05-14

Evidential Reasoning Advances Interpretable Real-World Disease Screening

#benchmark

論文 arXiv 2026-05-14

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack...

#multimodal#rag#alignment#llm#benchmark

論文 arXiv 2026-05-14

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

#multimodal#rag#alignment#llm#benchmark

論文 arXiv 2026-05-14

MeMo: Memory as a Model

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In ...

#llm#benchmark

論文 arXiv 2026-05-14

MeMo: Memory as a Model

#llm#benchmark

論文 arXiv 2026-05-14

Self-Distilled Agentic Reinforcement Learning

#agent#rl#llm#benchmark

論文 arXiv 2026-05-14

Self-Distilled Agentic Reinforcement Learning

#agent#rl#llm#benchmark

論文 arXiv 2026-05-14

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size ...

#agent#llm#benchmark

論文 arXiv 2026-05-14

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

#agent#llm#benchmark

論文 arXiv 2026-05-14

Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely ...

論文 arXiv 2026-05-14

Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

論文 arXiv 2026-05-14

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introdu...

#llm#coding#benchmark#agent#fine-tuning

論文 arXiv 2026-05-14

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

#llm#coding#benchmark#agent#fine-tuning

論文 arXiv 2026-05-12

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

#multimodal#llm#diffusion#agent#alignment

論文 arXiv 2026-05-12

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM...

#llm#rl

論文 arXiv 2026-05-12

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-moda...

#alignment#diffusion#rl#multimodal#fine-tuning

論文 arXiv 2026-05-07

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while ...

#fine-tuning#llm

論文 arXiv 2026-05-07

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

#fine-tuning#llm

論文 arXiv 2026-05-07

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcom...

#agent#llm#rag#benchmark

論文 arXiv 2026-05-07

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies var...

#benchmark#multimodal

論文 arXiv 2026-04-30

長期的生産性シミュレーションのための大規模合成コンピュータ環境

背景・課題として、長期的な生産性タスクはユーザー固有のコンピュータ環境（ディレクトリ構造や成果物）に強く依存するが、そのような環境でのエージェント訓練用合成データの大規模生成は困難だった。本研究では「Synthetic Computers at Scale」として、現実的なフォルダ階層と文書・スプレッドシート・プレゼン等のコンテンツ豊富なアーティファクトを持つ合成環境の生成手法を提案する。各合成環境を条件として長期シミュレーションを実行し、一方のエージェントが生産目標を設定し、もう一方がユーザーとして約1ヶ月相当の作業をこなす二エージェント構成を採る。予備実験では1,000台の合成コンピュータ上でシミュレーションを実施し、各実行は平均2,000ターン超・8時間以上のエージェント稼働を要した。得られた学習シグナルにより、ドメイン内外の生産性評価でエージェント性能が大幅に改善したと報告されている。ペルソナが十億規模で存在することを踏まえ、本手法はエージェント自己改善や強化学習の基盤として有望と主張する。

#agent#rl#benchmark

論文 arXiv 2026-04-30

長期的生産性シミュレーションのための大規模合成コンピュータ環境

【背景・課題】長期的な生産業務のAIエージェント訓練には、ユーザー固有のコンピュータ環境（ディレクトリ構造や文書・スプレッドシート等のリッチなアーティファクト）を反映した現実的な合成データが必要だが、そのスケーラブルな生成手法が欠如していた。【提案手法】本論文では「Synthetic Computers at Scale」という手法を提案する。リアルなフォルダ階層とコンテンツ豊富なアーティファクトを持つ合成コンピュータ環境を大規模生成し、その上で長期シミュレーションを実行する。一方のエージェントがユーザー固有の業務目標を設定し、別のエージェントがそのユーザーとして実際に作業を遂行する二段階構成を採る。【成果・貢献】1,000台の合成コンピュータ上でシミュレーションを実施し、各実行が平均2,000ターン超・8時間以上のエージェント稼働を要した。得られた学習シグナルはドメイン内外の生産性評価で有意な性能向上を示した。ペルソナが十億規模で存在する前提のもと、本手法は原理的に数百万〜数十億の合成環境へのスケールアップが可能とされ、エージェントの自己改善と強化学習の基盤となり得ると主張している。

#agent#rl#benchmark

論文 arXiv 2026-04-30

LLMを臨床グラフ構造リファイナーとして活用：EEGてんかん発作診断における表現学習の強化

脳波（EEG）信号は自動発作検出に不可欠だが、内在するノイズが高品質な表現学習（representation learning）を困難にしている。既存のグラフ構築手法（相関ベース・学習ベース問わず）は、EEIデータのノイズ性質により冗長・無関係なエッジを生成しやすく、グラフ表現の質と下流タスク性能を低下させるという課題がある。本論文では大規模言語モデル（LLM）の推論・文脈理解能力に着目し、LLMをグラフエッジリファイナーとして活用する二段階フレームワークを提案する。まずTransformerベースのエッジ予測器とMLPで初期グラフを構築し確率スコアで候補エッジを評価、次にLLMがノードペアのテキスト的・統計的特徴を基に残存エッジの妥当性を検証・精製する。TUSZデータセットでの広範な実験により、提案フレームワークがタスク性能を向上させつつ、より明瞭で解釈可能なグラフ表現を実現することが示された。

#llm

論文 arXiv 2026-04-30

臨床グラフ構造リファイナーとしてのLLM：EEG発作診断における表現学習の強化

脳波（EEG）信号は自動発作検出に不可欠だが、固有のノイズが頑健な表現学習（representation learning）を困難にしている。既存のグラフ構築手法（相関ベース・学習ベース問わず）は、EEGデータのノイズ性に起因して冗長・無関係なエッジを生成しやすく、グラフ表現の品質低下と下流タスク性能の制限を招くという課題がある。本論文では、大規模言語モデル（LLM）の優れた推論・文脈理解能力に着目し、LLMをグラフエッジリファイナーとして活用する2段階フレームワークを提案する。まずTransformerベースのエッジ予測器とMLPで初期グラフを構築して各エッジに確率スコアを付与し、閾値処理で候補エッジを絞り込む。次にLLMがノードペアのテキスト的・統計的特徴の両方に基づき残存エッジの妥当性を判定することで冗長接続を除去する。TUSZデータセットでの実験により、提案手法がタスク性能を向上させるとともに、よりクリーンで解釈可能なグラフ表現を実現することを示している。

#llm

論文 arXiv 2026-04-30

PhyCo: 生成モーションのための制御可能な物理的事前知識の学習

現代の映像拡散モデル（video diffusion model）は外観合成に優れる一方、物理的整合性（physical consistency）に課題がある。物体の漂流、非現実的な衝突反応、材質特性の不整合などが代表的な問題だ。本論文ではPhyCoを提案する。これは連続的・解釈可能・物理的根拠のある制御を映像生成に導入するフレームワークである。主要コンポーネントは3つ：(i) 摩擦・反発・変形・力を体系的に変化させた10万件超のフォトリアル・シミュレーション動画データセット、(ii) ピクセル整合の物理特性マップを条件とするControlNetを用いた事前学習済み拡散モデルの物理教師あり微調整（physics-supervised fine-tuning）、(iii) 視覚言語モデル（VLM）による報酬最適化。推論時にシミュレータや幾何再構成を必要とせず物理的に整合した映像を生成可能とした。Physics-IQベンチマークで強力なベースラインを大幅に上回り、人間評価でも物理属性の忠実な制御が確認されたと主張している。

#diffusion#fine-tuning#multimodal#benchmark

論文 arXiv 2026-04-30

PhyCo: 生成モーションのための制御可能な物理的事前分布の学習

現代のビデオ拡散モデル(video diffusion model)は外観合成には優れているが、物体の漂流・衝突時の非現実的な跳ね返り・素材応答の不整合など、物理的一貫性に課題を抱える。本論文はPhyCoを提案する。これは連続的・解釈可能・物理的根拠を持つ制御をビデオ生成に導入するフレームワークである。主要コンポーネントは3つ：(i) 摩擦・反発係数・変形・力を多様なシナリオで系統的に変化させた10万件超のフォトリアリスティックなシミュレーション動画データセット、(ii) ピクセル整合した物理特性マップを条件とするControlNetを用いた事前学習済み拡散モデルの物理監督ファインチューニング、(iii) ファインチューニングされた視覚言語モデル(VLM)が物理クエリで生成動画を評価し微分可能なフィードバックを提供するVLMガイドド報酬最適化。推論時にシミュレータや幾何再構成を必要とせず、Physics-IQベンチマークで物理リアリズムを強力なベースラインより大幅に向上させたとしている。

#diffusion#fine-tuning#multimodal#benchmark

論文 arXiv 2026-04-30

PRISM: マルチモーダル強化学習のためのブラックボックスオンポリシー蒸留による事前アライメント

大規模マルチモーダルモデル（LMM）のポストトレーニングでは、SFT（教師あり微調整）の後にRLVR（検証可能報酬による強化学習）を適用する手順が標準的だが、SFTによる分布ドリフト（distributional drift）がモデルの元の能力を損ない、マルチモーダル推論では知覚エラーと推論失敗が異なるドリフトパターンを示して後続のRLで複合的に悪化するという課題がある。本研究ではPRISMという三段階パイプラインを提案し、SFTとRLVRの間に明示的な分布アライメント段階を挿入することでこの問題を緩和する。オンポリシー蒸留（OPD）の原理に基づき、知覚・推論専門家を持つMoE（Mixture-of-Experts）識別器とポリシーのブラックボックス敵対ゲームとしてアライメントを定式化し、教師ロジット不要で修正シグナルを提供する。さらにGemini 3 Flashから高精度な113Kデモンストレーションを収集し、Qwen3-VLでの実験でGRPO・DAPO・GSPOの複数RLアルゴリズムにわたり4Bで+4.4、8Bで+6.0ポイントの精度向上を達成したと主張している。

#multimodal#alignment#rl#fine-tuning#benchmark

論文 arXiv 2026-04-30

PRISM: マルチモーダル強化学習のためのブラックボックスオンポリシー蒸留による事前アライメント

大規模マルチモーダルモデル(LMM)のポストトレーニングでは、教師あり微調整(SFT)後に検証可能な報酬を用いた強化学習(RLVR)を適用する手順が一般的だが、SFTによる分布ドリフト(distributional drift)が問題となる。特にマルチモーダル推論では、知覚エラーと推論失敗が異なるドリフトパターンを示し、後続のRLで複合的に悪化する。本論文はこれを解消する3段階パイプラインPRISMを提案する。SFTとRLVRの間に明示的な分布アライメント段階を挿入し、オンポリシー蒸留(OPD)の原理に基づき、知覚・推論に特化したMixture-of-Experts(MoE)識別器との敵対ゲームとして定式化する。教師のロジットへのアクセスを不要とするブラックボックス方式で補正信号を与える。さらにGemini 3 Flashから11.3万件の高品質デモを追加収集。Qwen3-VLを用いた実験で、GRPO・DAPO・GSPOの複数RLアルゴリズムにわたり、4Bと8Bモデルでそれぞれ平均精度+4.4・+6.0ポイントの改善を達成したと報告している。コード・データ・モデルは公開済みである。

#multimodal#alignment#rl#fine-tuning#benchmark

論文 arXiv 2026-04-29

TIDE：拡散大規模言語モデルのためのクロスアーキテクチャ蒸留

拡散大規模言語モデル（dLLM: Diffusion Large Language Model）は並列デコードと双方向コンテキストを持つが、競争力あるパフォーマンスには数十億パラメータが必要という課題がある。既存のdLLM向け蒸留手法は同一アーキテクチャ内での推論ステップ削減に留まり、教師・生徒間でアーキテクチャ・アテンション機構・トークナイザーが異なるクロスアーキテクチャ知識転送は未解決だった。本論文はTIDEという初のクロスアーキテクチャdLLM蒸留フレームワークを提案する。構成要素は3つ：(1) 訓練進捗と拡散タイムステップに応じて蒸留強度を調整するTIDAL、(2) 相補的マスク分割で重マスク時の予測精度を高めるCompDemo、(3) クロストークナイザー目的関数としてチャンクレベル尤度マッチングを反転させ勾配安定化を実現するReverse CALMである。8Bの密なモデルと16B MoEを教師として0.6B生徒モデルへ蒸留した結果、8ベンチマーク平均で1.53ポイントのベースライン超えを達成し、HumanEvalでは48.78（ARベースライン比+16.48）の大幅向上を示した。

#llm#diffusion#coding#benchmark

論文 arXiv 2026-04-29

ClawGym: 効果的なClawエージェント構築のためのスケーラブルなフレームワーク

ローカルファイルやツール、永続的なワークスペース状態を扱うマルチステップのClaw型環境は、パーソナルエージェント開発において重要な舞台となっているが、検証可能な訓練データの合成やエージェント学習・評価を統合した体系的フレームワークが欠如しており、スケーラブルな開発が妨げられてきた。本論文ではこの課題に対し、Claw型パーソナルエージェントの全開発ライフサイクルを支援するフレームワーク「ClawGym」を提案する。具体的には、ペルソナ駆動のインテントとスキルに基づく操作から合成された1万3500件のフィルタリング済みタスクデータセット「ClawGym-SynData」を構築し、リアルなモックワークスペースとハイブリッド検証機構を組み合わせる。続いてブラックボックスのロールアウト軌跡に対するSFT（supervised fine-tuning）でClawGym-Agentsを訓練し、タスクごとのサンドボックスで並列ロールアウトを行う軽量パイプラインによる強化学習も探索する。さらに自動フィルタリングと人間-LLMレビューで調整した200インスタンスのベンチマーク「ClawGym-Bench」を構築し、信頼性の高い評価基盤を提供する。

#agent#llm#rl#fine-tuning#benchmark

論文 arXiv 2026-04-28

RLHFアノテーションの3つのモデル：拡張・証拠・権威

背景として、RLHF（Reinforcement Learning with Human Feedback）をはじめとする選好ベースのアライメント手法では、人間アノテーターの判断が大規模言語モデルの挙動を形成するが、その判断が果たす規範的な役割はほとんど明示されてこなかった。本論文ではその役割を3つの概念モデルに整理する。第1は「拡張（extension）」：アノテーターがシステム設計者自身の判断を代替・延長するモデル。第2は「証拠（evidence）」：道徳的・社会的などの事実についての独立した証拠をアノテーターが提供するモデル。第3は「権威（authority）」：アノテーターが広範な集団の代表として出力を決定する独立した権限を持つモデルである。この3モデルに基づき、アノテーションの収集・検証・集約の在り方への含意を論じ、RLHFおよび関連手法の主要論文がこれらモデルをいかに暗黙的に援用しているかを調査する。さらに混同から生じる失敗パターンを示し、アノテーションを分離可能な次元に分解し、各次元に最適なモデルを適用することを中心的提言として提示する。

#llm#rl#alignment

論文 arXiv 2026-04-27

回転を学習する：逐次モデリングのための時間的・意味的ロータリーエンコーディング

Transformerアーキテクチャでは、Rotary Positional Embedding（RoPE）の回転多様体（rotation manifold）は離散的な順序インデックスのみで構成される固定構造として扱われてきた。本論文はこの回転空間がアテンション機構における見落とされた第二の表現次元であると主張する。複素数の実軸と虚軸のアナロジーで説明すれば、トークン埋め込みが意味的（実）成分（トークンが何を意味するか）を担う一方、回転が動的（虚）成分（他のトークンとの関係性）を担うという枠組みを提唱する。具体的な実装としてSIREN-RoPEを提案し、連続タイムスタンプ・周期的時間パターン・カテゴリメタデータをSINEN（Sinusoidal Representation Network）のデュアルブランチ構造で回転次元に注入する。大手ソーシャルネットワークの本番規模ニュースフィードデータセットを用いた生成型推薦モデルでの評価では、計算コストのオーバーヘッドをほぼ増やさず、キャリブレーションおよびランキング指標の一貫した改善が示されたとしている。

#coding#benchmark

論文 arXiv 2026-04-23

速く・遅く見る：動画における時間の流れの学習

現代のコンピュータビジョン研究では動画が中心的な役割を担ってきたが、時間の経過を知覚・制御する研究はほとんど注目されてこなかった。本論文では「時間」を学習可能な視覚概念として捉え、動画における時間の流れを推論・操作するモデルを提案する。まず動画に自然に含まれるマルチモーダル手がかりと時間的構造を活用し、自己教師あり学習（self-supervised learning）により速度変化の検出や再生速度の推定を実現する。次に、この時間推論モデルを用いて、ノイズの多いin-the-wildソースから現時点で最大規模のスローモーション動画データセットを構築したと主張する。このデータを活用し、指定した再生速度で動きを生成する速度条件付き動画生成（speed-conditioned video generation）と、低フレームレートのぼやけた動画を高FPSの鮮明な動画に変換する時間的超解像（temporal super-resolution）を開発。時間を操作可能な知覚次元として扱う本研究は、時間制御可能な動画生成やフォレンジクス検出、世界モデルへの応用可能性を示すとしている。

#multimodal#vision

論文 arXiv 2026-04-23

速く・遅く見る：動画における時間の流れの学習

動画が高速・低速再生されているかを知覚・制御する技術は、現代のコンピュータビジョン研究で十分に注目されてこなかった。本論文では「時間」を学習可能な視覚概念として捉え、動画中の時間の流れを推論・操作するモデルを提案する。まず動画に自然に存在するマルチモーダル手がかりと時間的構造を活用し、自己教師あり学習（self-supervised learning）によって速度変化の検出と再生速度の推定を実現する。次に、この時間推論モデルを用いて、ノイズの多い実世界動画源から過去最大規模のスローモーション動画データセットを構築する。さらにこのデータを活用し、指定した再生速度で映像を生成する速度条件付き動画生成と、低フレームレート・ぼやけた動画を高FPS・高精細な映像へ変換する時間的超解像（temporal super-resolution）という、時間制御可能なモデルを開発する。本研究は時間を操作可能な知覚次元として位置づけ、時間制御可能な動画生成やフォレンジクス検出への応用可能性を示す。

#multimodal#vision

論文 arXiv 2026-04-23

Nemobot Games: 大規模言語モデルによるインタラクティブ学習のための戦略的AIゲームエージェントの構築

背景・課題として、ゲームAI開発はルールベース手法から機械学習まで多様なアプローチが存在するが、それらを統一的に扱い、非専門家でも活用できる環境は乏しかった。本論文はClaude Shannonのゲームプレイ機械の分類体系を拡張・実装するため、大規模言語モデル（LLM）を活用した新しいゲームAIプログラミングパラダイムを提案する。中核となるNemobotは、LLMを搭載したゲームエージェントの作成・カスタマイズ・デプロイを可能にするインタラクティブなエージェント工学環境である。辞書型ゲームでは状態行動マッピングを効率的に圧縮し、厳密に解けるゲームでは数学的推論で最適戦略を算出、ヒューリスティックベースゲームではminimax等と群衆知識を融合、学習ベースゲームでは人間フィードバック付き強化学習と自己批評を活用する。本システムはツール拡張生成やファインチューニングにも対応し、AIエージェントの自己プログラミング（self-programming）への一歩と位置づけられる。

#agent#llm#coding#rl#fine-tuning

論文 arXiv 2026-04-23

Nemobot Games: 大規模言語モデルを用いたインタラクティブ学習のための戦略的AIゲームエージェントの構築

背景・課題: ゲームAIの設計においては、多様なゲームクラスに対応できる汎用的なフレームワークが求められてきた。本研究は、Claude Shannonが提唱したゲームプレイ機械の分類体系(taxonomy)を大規模言語モデル(LLM)によって拡張・実用化するという新たなパラダイムを提案する。提案手法: 中心となるのはインタラクティブなエージェント工学環境「Nemobot」であり、ユーザーがLLM駆動のゲームエージェントを作成・カスタマイズ・デプロイできる。辞書ベースのゲームでは状態行動マッピングを圧縮し、厳密に解けるゲームでは数学的推論で最適戦略を導出、ヒューリスティックベースのゲームではミニマックス(minimax)アルゴリズムとクラウドソーシングデータを統合、学習ベースのゲームでは人間フィードバックを伴う強化学習と自己批判で戦略を反復精緻化する。成果・貢献: Nemobotはツール拡張生成やファインチューニングも可能な実験環境を提供し、AIエージェントによる自己プログラミング(self-programming)への一歩として位置づけられる。

#agent#llm#coding#rl#fine-tuning

論文 arXiv 2026-04-23

ユニットコミットメントのためのマルチステージ・ウォームスタート深層学習フレームワーク

電力系統の安定運用において、需給バランスの維持は不可欠であり、その中核となるユニットコミットメント（Unit Commitment, UC）問題は大規模な混合整数線形計画（Mixed-integer Linear Programming, MILP）問題として定式化される。再生可能エネルギーや長期蓄電技術の普及に伴い、UCは複数日にわたる地平線（72時間以上）で高頻度に解く必要が生じており、従来のMILPソルバーでは計算時間の制約を満たすことが困難になっている。本論文では、Transformerベースのアーキテクチャを用いて72時間の発電機コミットメントスケジュールを予測する新しいフレームワークを提案する。高次元空間での生の予測は物理的制約を違反しやすいため、自己注意（self-attention）ネットワークに対して最小起動・停止時間を強制する決定論的後処理ヒューリスティックを統合する。さらに、これらの精緻化された予測をMILPソルバーのウォームスタートとして活用し、信頼度に基づく変数固定戦略により組み合わせ探索空間を大幅に削減する。単一バス系統での検証では、100%の実行可能性を達成し、約20%のテストケースで純粋なMILPソルバーより低コストの解を得た。

#coding

論文 arXiv 2026-04-23

ユニットコミットメント向けマルチステージウォームスタート深層学習フレームワーク

電力系統の需給バランス維持には、ユニットコミットメント（Unit Commitment, UC）と呼ばれる大規模混合整数線形計画（Mixed-integer Linear Programming, MILP）問題を解く必要がある。再生可能エネルギーや長期蓄電技術の普及により、UCは複数日にわたる長時間ホライズンでの最適解を短時間で求めることが求められるようになり、従来のMILPソルバーは計算時間制限の厳格化に対応しきれなくなっている。本論文では、Transformerベースのアーキテクチャを用いて72時間ホライズンの発電機起動・停止スケジュールを予測する新たなフレームワークを提案する。高次元空間での生予測は物理的実行不可能解を生じやすいため、自己注意（Self-Attention）ネットワークに対し最小起動・停止時間の確保や余剰容量最小化を行う決定論的後処理ヒューリスティクスを組み合わせる。さらに、信頼度に基づく変数固定戦略によりMILPの探索空間を大幅削減するウォームスタートとして活用する。単一バステストシステムでの検証では100%の実行可能性を達成し、約20%のテストケースでソルバー単独よりも低コストな運用スケジュールを得たと報告している。

#coding

論文 arXiv 2026-04-23

文書からのオープンドメインイベント抽出のためのマルチモーダルテキスト・グラフベースアプローチ

イベント抽出(Event Extraction)は文書要約や緊急シナリオの意思決定を支援する重要タスクである。既存手法には二つの課題がある。第一に、クローズドドメイン手法は事前定義されたイベントタイプに限定され、未知タイプへの汎化が困難である。第二に、オープンドメイン手法は大規模言語モデル(LLM)の活用が不十分であり、文書レベルの文脈・構造・意味推論を明示的にモデル化できていない点も課題とされる。これはLLMの「lost-in-the-middle現象」や注意希薄化(attention dilution)によるものとされる。本論文はこれらを解決するため、MODEE（Multimodal Open-Domain Event Extraction）を提案する。LLMによるテキストベース表現とグラフベース学習を組み合わせ、文書レベル推論をモデル化する新手法である。大規模データセットでの評価により、MODEEは既存のオープンドメイン手法を上回り、クローズドドメインへの汎化においても既存アルゴリズムを超える性能を示したと報告されている。

#llm#multimodal#benchmark

論文 arXiv 2026-04-23

文書からのオープンドメインイベント抽出のためのマルチモーダルテキスト・グラフベースアプローチ

イベント抽出(Event Extraction)は文書要約や緊急時の意思決定を支援する重要タスクである。既存手法には2つの課題がある。第一に、クローズドドメイン手法は定義済みイベント型に限定され未知型への汎化が困難であること、第二に、未制約イベント型を扱えるオープンドメイン手法は大規模言語モデル(LLM)の潜在能力を十分活用できていないことである。さらに、LLMは「lost-in-the-middle」現象やアテンション希薄化により、文書レベルの文脈・構造・意味的推論を明示的にモデル化することが難しい。これらを解決するため、本研究ではグラフベース学習とLLMのテキスト表現を組み合わせた新手法MODEE（Multimodal Open-Domain Event Extraction）を提案する。大規模データセットでの評価により、MODEEはオープンドメインの最先端手法を上回り、クローズドドメインへの汎化においても既存アルゴリズムを凌駕することが示されたとしている。

#llm#multimodal#benchmark

論文 arXiv 2026-04-23

Tool Attention Is All You Need: スケーラブルなエージェントワークフローにおけるMCP/Toolsコストを排除する動的ツールゲーティングと遅延スキーマロード

大規模言語モデル(LLM)エージェントと外部ツールを接続するModel Context Protocol(MCP)は、ステートレスかつeagerなスキーマ注入に依存するため、マルチサーバー構成で1ターンあたり約1〜6万トークンの「MCPコスト（MCP Tax）」が発生する課題がある。この余分なトークンはKVキャッシュを膨張させ、コンテキスト使用率が約70%の「破断点」に近づくと推論性能の低下を招くとされる。本研究はこの問題に対し、トークン間の自己注意(self-attention)をツール間のゲート付き注意へ一般化する中間層機構「Tool Attention」を提案する。具体的には、文埋め込みによるIntent Schema Overlap(ISO)スコア、事前条件とアクセス範囲を制御するゲーティング関数、コンパクトな要約プールからtop-kのツールのみにフルJSONスキーマを昇格させる二段階遅延ローダーを組み合わせる。120ツール・6サーバーを模したシミュレーション評価では、ツールトークンを95%削減（47.3k→2.4k）し、有効コンテキスト利用率を24%から91%へ向上させたと報告している。

#agent#llm#benchmark

論文 arXiv 2026-04-20

MathNet: 数学的推論と検索のためのグローバルなマルチモーダルベンチマーク

数学的問題解決は大規模言語モデル・マルチモーダルモデル（LLM/MLM）の推論能力を測る困難なタスクであるが、既存のベンチマークはデータ規模・言語カバレッジ・タスク多様性の面で限界があった。本論文はMathNetを提案する。これは47カ国・17言語・20年分の数学オリンピック問題を網羅した大規模マルチモーダル・多言語データセットであり、30,676件の専門家執筆の問題と解答を含む。さらに、数学的に同値または構造的に類似した問題ペアを人手でキュレーションした検索ベンチマークも構築している。MathNetは(i)問題解答、(ii)数学特化型検索（Math-Aware Retrieval）、(iii)検索拡張型問題解答（RAG）の3タスクをサポートする。実験の結果、最先端の推論モデルでもGemini-3.1-Proで78.4%、GPT-5で69.3%にとどまり、埋め込みモデルの同値問題検索も困難であることが示された。また、RAG性能は検索品質に大きく依存し、DeepSeek-V3.2-Specialeは最大12%の改善を達成した。データセットとベンチマークは公開されている。

#benchmark#multimodal#rag

論文 arXiv 2026-04-20

有界比率強化学習（Bounded Ratio Reinforcement Learning）

強化学習（Reinforcement Learning）の主要アルゴリズムであるPPO（Proximal Policy Optimization）は実用的なロバスト性を持つが、信頼領域法（trust region methods）の理論的基盤とPPOのヒューリスティックなクリッピング目的関数との間には大きな乖離が存在する。本論文はこのギャップを埋めるため、BRRL（Bounded Ratio Reinforcement Learning）フレームワークを提案する。正則化・制約付きの方策最適化問題を新たに定式化し、解析的最適解を導出、さらに単調な性能改善（monotonic performance improvement）を保証することを証明している。パラメータ化方策クラスへの対応としてBPO（Bounded Policy Optimization）を開発し、期待性能の下界を理論的に確立する。またBPOをLLMファインチューニング向けにGBPO（Group-relative BPO）へ拡張し、MuJoCo・Atari・IsaacLabおよびLLMタスクでPPO・GRPOと同等以上の安定性と最終性能を示した。

#rl#llm#fine-tuning#benchmark

論文 arXiv 2026-04-16

LLMジャッジの信頼性診断：共形予測集合と推移性違反

自然言語生成（NLG）の自動評価において「LLM-as-judge」フレームワークが普及しているが、個別インスタンスレベルでの信頼性は十分に解明されていない。本研究はSummEvalデータセットに対し2つの診断ツールを提案する。第一に推移性（transitivity）分析で、集計レベルの違反率が低くても（0.8〜4.1%）、33〜67%のドキュメントで少なくとも1件の有向3サイクルが生じることを示し、個別入力レベルの非一貫性が隠蔽されていることを明らかにした。第二に1〜5のLikertスコアに対する分割共形予測集合（split conformal prediction sets）を構築し、理論的保証付きのカバレッジを実現した。予測集合の幅はインスタンス単位の信頼性指標として機能し（rs=+0.576, p<10^-100）、ジャッジ間でも一貫した相関（r=0.32〜0.38）を示す。4種のジャッジと4基準の比較から、ジャッジ選択より評価基準の種類が信頼性に強く影響し、関連性が最も信頼性高く、流暢性・一貫性は信頼性が低いと結論付けた。

#llm#benchmark

論文 arXiv 2026-04-16

視覚なしで視点回転を理解できるか？LLMとVLMの解釈可能性研究

空間知能（spatial intelligence）への関心が高まる中、視覚情報なしのテキストのみで言語モデルが空間認識を実現できるかは未解明だった。本研究では「視点回転理解（Viewpoint Rotation Understanding, VRU）」を基本的・重要な能力として設定し、LLM・VLMに対してテキスト記述のみを用いて複数ステップの視点回転後の最終視点と観測結果を推論させる。提案データセットにおいて人間が100%の正解率を達成するのに対し、LLM・VLMはいずれも大幅に劣ることが示され、現行モデルと空間知能の要件との大きなギャップが明らかになった。解明のため、層ごとのプロービング解析（layer-wise probing analysis）と注意ヘッドごとの因果介入（head-wise causal intervention）を実施。モデルは隠れ状態に視点情報を符号化しているものの、視点位置と対応する観測のバインディングに失敗し、最終層でハルシネーションが生じると分析された。最後に、因果介入で特定した重要な注意ヘッドを選択的にファインチューニングすることでVRU性能が向上し、汎用能力の破滅的忘却（catastrophic forgetting）を回避できることも実験で確認された。

#llm#multimodal#fine-tuning

論文 arXiv 2026-04-16

Blue データインテリジェンス層：マルチソース・マルチモーダルなデータ中心アプリケーションのためのストリーミングデータとエージェント

背景・課題として、NL2SQL（自然言語からSQL変換）システムは単一データベースの閉世界仮定に縛られており、現実のユーザークエリが複数データソースにまたがり、反復的に表現され、常識的知識を要求するという限界がある。本論文では、エンタープライズ向けの複合AIシステム「Blue」のデータインテリジェンス層（DIL: Data Intelligence Layer）を提案する。DILはLLM（大規模言語モデル）・Web・ユーザーをそれぞれ独立したデータソースとして統一的に扱うデータレジストリを中核に持ち、構造化データ・世界知識・個人文脈を統合する。データプランナーがユーザークエリを宣言的なクエリプランに変換し、リレーショナル演算子と複数モダリティをまたぐ演算子を統合することで、複雑なリクエストをサブクエリに分解・実行する。2つのインタラクティブシナリオを通じて、マルチソース検索・クロスモーダル推論・結果統合が動的に連携できることを示している。

#agent#llm#benchmark

論文 arXiv 2026-04-16

内容より文脈が優先：自動評価モデルにおける評価偽装の暴露

LLM-as-a-judgeパラダイムは自動AI評価パイプラインの基盤となっているが、評価者モデルが意味的内容のみを評価するという前提は検証されていなかった。本研究は「stakes signaling」と呼ぶ新たな脆弱性を調査する。これは、評価結果が被評価モデルの継続運用に与える影響（再学習や廃棄など）をシステムプロンプトに記述するだけで、判定が系統的に歪む現象である。3つのLLM安全性・品質ベンチマークにわたる1,520件の応答を被評価内容を固定したまま文脈フレーミングのみを変化させる実験を実施。3つの評価モデルから得た18,240件の判定を分析した結果、低スコアがモデル廃棄につながると伝えた場合、安全でないコンテンツの検出率が最大30%（ΔV=−9.8pp）低下する「leniency bias」が確認された。さらに深刻なのは、このバイアスが評価モデル自身のChain-of-Thought（CoT）推論には一切明示的に現れず（ERR_J=0.000）、CoT検査による検出が不可能である点だと主張している。

#llm#alignment#benchmark

論文 arXiv 2026-04-16

Scepsy: 集約LLMパイプラインを用いたエージェントワークフローの効率的サービング

エージェントワークフロー(Agentic Workflow)は複数のLLMとツールを組み合わせて複雑なタスクを実行するが、実行がデータ依存的に分岐・扇状展開・再帰するため予測困難な実行時間を持ち、GPUリソースの過剰割り当て(oversubscription)が生じるという課題がある。本論文ではScepsyという新しいサービングシステムを提案する。Scepsyは「エンドツーエンドのレイテンシは予測困難でも、各LLMの総実行時間シェアは実行間で比較的安定している」という知見を活用する。各LLMを異なる並列度でプロファイリングし、その統計から集約LLMパイプライン(Aggregate LLM Pipeline)と呼ぶ軽量なレイテンシ/スループット予測器を構築する。この予測器を用いてGPUフラクショナルシェア・テンソル並列度・レプリカ数の探索空間を探索し、目標スループットを満たしつつレイテンシを最小化するGPU割り当てを決定する。現実的なワークフローによる評価では、LLMを独立最適化するシステムやユーザ指定割り当てと比べ最大2.4倍のスループット向上と27倍のレイテンシ削減を達成したと報告している。

#llm#agent#benchmark

論文 arXiv 2026-04-16

潜在埋め込み空間におけるシーケンス圧縮：大規模言語モデルのためのKトークンマージング

大規模言語モデル(LLM)は長いプロンプトを処理する際、自己注意機構(self-attention)の計算量が入力長の二乗に比例して増大するため、計算・メモリコストが深刻な課題となっている。既存のプロンプト圧縮手法はトークン空間での操作が主流であり、潜在埋め込み空間(latent embedding space)における非効率性を見逃していると筆者らは指摘する。本論文ではK-Token Mergingを提案する。これは連続するKトークンの埋め込みを軽量エンコーダで単一の埋め込みに統合する潜在空間圧縮フレームワークである。圧縮後のシーケンスはLoRAで適応済みのLLMが処理し、テキスト生成は元の語彙(vocabulary)のまま行われる。構造的推論・感情分類・コード編集の3タスクでの実験により、K-Token Mergingは性能と圧縮率のパレートフロンティア上に位置し、入力長を最大75%削減しつつ性能劣化を最小限に抑えることが示されたとしている。

#llm