<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:code-available-feed="tag:code-available-feed.github.io,2026:atom-extensions">
  <title>cs.AI strict=false marcindulak/code-available-feed-cs-ai</title>
  <id>https://marcindulak.github.io/code-available-feed-cs-ai/arxiv/cs.ai/atom.xml</id>
  <updated>2026-06-09T17:59:54Z</updated>
  <link rel="self" type="application/atom+xml" href="https://marcindulak.github.io/code-available-feed-cs-ai/arxiv/cs.ai/atom.xml" />
  <link rel="alternate" type="text/html" href="https://github.com/marcindulak/code-available-feed-cs-ai" />
  <entry>
    <title>[cs.LG] A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design</title>
    <author>
      <name>Tong Xie</name>
    </author>
    <author>
      <name>Yuanhao Ban</name>
    </author>
    <author>
      <name>Yunqi Hong</name>
    </author>
    <author>
      <name>Sohyun An</name>
    </author>
    <author>
      <name>Yihang Chen</name>
    </author>
    <author>
      <name>Cho-Jui Hsieh</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11189v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11189v1" />
    <published>2026-06-09T17:59:54Z</published>
    <updated>2026-06-09T17:59:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tong Xie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents</title>
    <author>
      <name>Weixian Xu</name>
    </author>
    <author>
      <name>Shilong Liu</name>
    </author>
    <author>
      <name>Mengdi Wang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11182v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11182v1" />
    <published>2026-06-09T17:57:16Z</published>
    <updated>2026-06-09T17:57:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Weixian Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;19 pages, 6 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation</title>
    <author>
      <name>Soham Bhattacharjee</name>
    </author>
    <author>
      <name>Karun Sharma</name>
    </author>
    <author>
      <name>Vinay Kumar Sankarapu</name>
    </author>
    <author>
      <name>Pratinav Seth</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11127v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11127v1" />
    <published>2026-06-09T17:24:27Z</published>
    <updated>2026-06-09T17:24:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Soham Bhattacharjee et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football</title>
    <author>
      <name>Andrew Kang</name>
    </author>
    <author>
      <name>Priya Narasimhan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11120v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11120v1" />
    <published>2026-06-09T17:16:30Z</published>
    <updated>2026-06-09T17:16:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Andrew Kang, Priya Narasimhan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;CVPR 2026, CVSports Workshop&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model</title>
    <author>
      <name>Mahmood Alzubaidi</name>
    </author>
    <author>
      <name>Uzair Shah</name>
    </author>
    <author>
      <name>Raden Muaz</name>
    </author>
    <author>
      <name>Ines Abbes</name>
    </author>
    <author>
      <name>Nader Mohammed</name>
    </author>
    <author>
      <name>Abdullatif Magram</name>
    </author>
    <author>
      <name>Khalid Alyafei</name>
    </author>
    <author>
      <name>Mowafa Househ</name>
    </author>
    <author>
      <name>Marco Agus</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11106v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11106v1" />
    <published>2026-06-09T17:03:37Z</published>
    <updated>2026-06-09T17:03:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mahmood Alzubaidi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] PhantomBench: Benchmarking the Non-existential Threat of Language Models</title>
    <author>
      <name>Haeji Jung</name>
    </author>
    <author>
      <name>Hila Gonen</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11105v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11105v1" />
    <published>2026-06-09T17:03:19Z</published>
    <updated>2026-06-09T17:03:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haeji Jung, Hila Gonen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning</title>
    <author>
      <name>Zhiyuan Zhou</name>
    </author>
    <author>
      <name>Andy Peng</name>
    </author>
    <author>
      <name>Charles Xu</name>
    </author>
    <author>
      <name>Qiyang Li</name>
    </author>
    <author>
      <name>Tobias Springenberg</name>
    </author>
    <author>
      <name>Kevin Frans</name>
    </author>
    <author>
      <name>Sergey Levine</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11087v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11087v1" />
    <published>2026-06-09T16:45:57Z</published>
    <updated>2026-06-09T16:45:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhiyuan Zhou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] A History-Aware Visually Grounded Critic for Computer Use Agents</title>
    <author>
      <name>Jaewoo Lee</name>
    </author>
    <author>
      <name>Zaid Khan</name>
    </author>
    <author>
      <name>Archiki Prasad</name>
    </author>
    <author>
      <name>Justin Chih-Yao Chen</name>
    </author>
    <author>
      <name>Supriyo Chakraborty</name>
    </author>
    <author>
      <name>Kartik Balasubramaniam</name>
    </author>
    <author>
      <name>Sambit Sahu</name>
    </author>
    <author>
      <name>Elias Stengel-Eskin</name>
    </author>
    <author>
      <name>Hyunji Lee</name>
    </author>
    <author>
      <name>Mohit Bansal</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11078v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11078v1" />
    <published>2026-06-09T16:39:10Z</published>
    <updated>2026-06-09T16:39:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jaewoo Lee et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy&amp;#x27;s completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code: https://github.com/G-JWLee/HiViG&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs</title>
    <author>
      <name>Joachim Schaeffer</name>
    </author>
    <author>
      <name>Thomas Jiralerspong</name>
    </author>
    <author>
      <name>Alexander Panfilov</name>
    </author>
    <author>
      <name>Guillaume Lajoie</name>
    </author>
    <author>
      <name>Jonas Geiping</name>
    </author>
    <author>
      <name>Yoshua Bengio</name>
    </author>
    <author>
      <name>Roland S. Zimmermann</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11063v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11063v1" />
    <published>2026-06-09T16:24:16Z</published>
    <updated>2026-06-09T16:24:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Joachim Schaeffer et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model&amp;#x27;s trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields</title>
    <author>
      <name>Liya Zhu</name>
    </author>
    <author>
      <name>Jingzhe Ding</name>
    </author>
    <author>
      <name>Jian Zhang</name>
    </author>
    <author>
      <name>Jianbo Xue</name>
    </author>
    <author>
      <name>Shihao Liang</name>
    </author>
    <author>
      <name>Ge Zhang</name>
    </author>
    <author>
      <name>Xiang Gao</name>
    </author>
    <author>
      <name>Qingshui Gu</name>
    </author>
    <author>
      <name>Mailun Gao</name>
    </author>
    <author>
      <name>Huimin Che</name>
    </author>
    <author>
      <name>Yan Zhao</name>
    </author>
    <author>
      <name>Peiheng Zhou</name>
    </author>
    <author>
      <name>Haojun Wang</name>
    </author>
    <author>
      <name>Chaobo Xian</name>
    </author>
    <author>
      <name>Lili Le</name>
    </author>
    <author>
      <name>Chi Wu</name>
    </author>
    <author>
      <name>Yiwei Liu</name>
    </author>
    <author>
      <name>Shengda Long</name>
    </author>
    <author>
      <name>Jiale Yang</name>
    </author>
    <author>
      <name>Fangzhi Xu</name>
    </author>
    <author>
      <name>Sijin Wu</name>
    </author>
    <author>
      <name>Haodong Duan</name>
    </author>
    <author>
      <name>Yi Zhu</name>
    </author>
    <author>
      <name>Chao He</name>
    </author>
    <author>
      <name>Zhaojian Li</name>
    </author>
    <author>
      <name>Minchao Wang</name>
    </author>
    <author>
      <name>Huan Zhou</name>
    </author>
    <author>
      <name>Jiani Hou</name>
    </author>
    <author>
      <name>Chuqian Yu</name>
    </author>
    <author>
      <name>Weiran Shi</name>
    </author>
    <author>
      <name>Hongwan Gao</name>
    </author>
    <author>
      <name>Jiamin Chen</name>
    </author>
    <author>
      <name>Guanhong Chen</name>
    </author>
    <author>
      <name>Tingqin Luo</name>
    </author>
    <author>
      <name>Kaiyuan Zhang</name>
    </author>
    <author>
      <name>Zhixin Yao</name>
    </author>
    <author>
      <name>Qing Hua</name>
    </author>
    <author>
      <name>Yuhao Jiang</name>
    </author>
    <author>
      <name>Jin Chen</name>
    </author>
    <author>
      <name>Pu Chen</name>
    </author>
    <author>
      <name>Zhenyu Hu</name>
    </author>
    <author>
      <name>Xingyu Li</name>
    </author>
    <author>
      <name>Zhengxuan Jiang</name>
    </author>
    <author>
      <name>Meng Cao</name>
    </author>
    <author>
      <name>Tianfeng Long</name>
    </author>
    <author>
      <name>Haozhe Wang</name>
    </author>
    <author>
      <name>Mingzhang Wang</name>
    </author>
    <author>
      <name>Yichen Zhang</name>
    </author>
    <author>
      <name>Yiming Dai</name>
    </author>
    <author>
      <name>Chenchen Zhang</name>
    </author>
    <author>
      <name>Jiaying Wang</name>
    </author>
    <author>
      <name>Zhiyong Wu</name>
    </author>
    <author>
      <name>Shen Yan</name>
    </author>
    <author>
      <name>Yujia Qin</name>
    </author>
    <author>
      <name>Wenhao Huang</name>
    </author>
    <author>
      <name>Zaiyuan Wang</name>
    </author>
    <author>
      <name>Xiaolong Chang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11042v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11042v1" />
    <published>2026-06-09T16:10:16Z</published>
    <updated>2026-06-09T16:10:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Liya Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] AuRA: Internalizing Audio Understanding into LLMs as LoRA</title>
    <author>
      <name>Bo Cheng</name>
    </author>
    <author>
      <name>Lei Shi</name>
    </author>
    <author>
      <name>Zhanyu Ma</name>
    </author>
    <author>
      <name>Yuan Wu</name>
    </author>
    <author>
      <name>Jun Xu</name>
    </author>
    <author>
      <name>Jiuchong Gao</name>
    </author>
    <author>
      <name>Jinghua Hao</name>
    </author>
    <author>
      <name>Renqing He</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11033v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11033v1" />
    <published>2026-06-09T16:05:23Z</published>
    <updated>2026-06-09T16:05:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bo Cheng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student&amp;#x27;s hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill</title>
    <author>
      <name>Junchang Zheng</name>
    </author>
    <author>
      <name>Junfeng Tan</name>
    </author>
    <author>
      <name>Jialiang Lin</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.11007v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.11007v1" />
    <published>2026-06-09T15:41:48Z</published>
    <updated>2026-06-09T15:41:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Junchang Zheng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Work in progress&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models</title>
    <author>
      <name>Shelly Bensal</name>
    </author>
    <author>
      <name>Axel Magnuson</name>
    </author>
    <author>
      <name>Aparna Balagopalan</name>
    </author>
    <author>
      <name>Daniel M. Bikel</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10949v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10949v1" />
    <published>2026-06-09T14:53:32Z</published>
    <updated>2026-06-09T14:53:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shelly Bensal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Under submission; preprint&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.NI] Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions</title>
    <author>
      <name>Kiarash Rezaei</name>
    </author>
    <author>
      <name>Omran Ayoub</name>
    </author>
    <author>
      <name>Sebastian Troia</name>
    </author>
    <author>
      <name>Francesco Lelli</name>
    </author>
    <author>
      <name>Paolo Monti</name>
    </author>
    <author>
      <name>Carlos Natalino</name>
    </author>
    <category term="cs.NI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10942v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10942v1" />
    <published>2026-06-09T14:48:26Z</published>
    <updated>2026-06-09T14:48:26Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kiarash Rezaei et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages, with one page for appendix. Accepted for publication at the 2025 21st International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution</title>
    <author>
      <name>Xucong Wang</name>
    </author>
    <author>
      <name>Ziyu Ma</name>
    </author>
    <author>
      <name>Shidong Yang</name>
    </author>
    <author>
      <name>Tongwen Huang</name>
    </author>
    <author>
      <name>Pengkun Wang</name>
    </author>
    <author>
      <name>Yong Wang</name>
    </author>
    <author>
      <name>Xiangxiang Chu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10917v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10917v1" />
    <published>2026-06-09T14:28:07Z</published>
    <updated>2026-06-09T14:28:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xucong Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, including 12 pages of main text and 8 pages of appendix; work in progress&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] What Do Deepfake Speech Detectors Actually Hear?</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Veronika Jirmusová</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <author>
      <name>Jakub Reš</name>
    </author>
    <author>
      <name>Martin Perešíni</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10912v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10912v1" />
    <published>2026-06-09T14:21:45Z</published>
    <updated>2026-06-09T14:21:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Ethical and Technical Limits of Deepfake Speech Datasets</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Eva Trnovská</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10911v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10911v1" />
    <published>2026-06-09T14:20:55Z</published>
    <updated>2026-06-09T14:20:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] RAT: Reference-Augmented Training for ASV Anti-Spoofing</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <author>
      <name>Jakub Reš</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10908v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10908v1" />
    <published>2026-06-09T14:20:05Z</published>
    <updated>2026-06-09T14:20:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding</title>
    <author>
      <name>Yihao Zhao</name>
    </author>
    <author>
      <name>Xuan Han</name>
    </author>
    <author>
      <name>Bin He</name>
    </author>
    <author>
      <name>Mingyu You</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10892v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10892v1" />
    <published>2026-06-09T14:04:51Z</published>
    <updated>2026-06-09T14:04:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yihao Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object&amp;#x27;s prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module&amp;#x27;s optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination</title>
    <author>
      <name>Taishan Li</name>
    </author>
    <author>
      <name>Jiwen Zhang</name>
    </author>
    <author>
      <name>Siyuan Wang</name>
    </author>
    <author>
      <name>Xuanjing Huang</name>
    </author>
    <author>
      <name>Zhongyu Wei</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10862v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10862v1" />
    <published>2026-06-09T13:39:49Z</published>
    <updated>2026-06-09T13:39:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Taishan Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: https://github.com/litsh/Libero-Occ.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs</title>
    <author>
      <name>Polydoros Giannouris</name>
    </author>
    <author>
      <name>Mohsinul Kabir</name>
    </author>
    <author>
      <name>Sophia Ananiadou</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10852v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10852v1" />
    <published>2026-06-09T13:31:13Z</published>
    <updated>2026-06-09T13:31:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Polydoros Giannouris et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling</title>
    <author>
      <name>Zhiwei Tang</name>
    </author>
    <author>
      <name>Yuanyu He</name>
    </author>
    <author>
      <name>Yizheng Han</name>
    </author>
    <author>
      <name>Wangbo Zhao</name>
    </author>
    <author>
      <name>Jiasheng Tang</name>
    </author>
    <author>
      <name>Fan Wang</name>
    </author>
    <author>
      <name>Bohan Zhuang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10820v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10820v1" />
    <published>2026-06-09T13:02:00Z</published>
    <updated>2026-06-09T13:02:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhiwei Tang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation</title>
    <author>
      <name>Xiaoyang Chen</name>
    </author>
    <author>
      <name>Xiang Jiang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10806v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10806v1" />
    <published>2026-06-09T12:50:56Z</published>
    <updated>2026-06-09T12:50:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xiaoyang Chen, Xiang Jiang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Moonshine is an autonomous agent whose central objective is to generate mathematical conjectures. Its core capability is to extract structure from classical problems, distill new concepts, and formulate conjectures of mathematical significance. Rather than treating the solution of a single proposition as its endpoint, Moonshine builds an extensible theoretical framework through conjecture generation, bridge building, and obstacle identification. This article uses Moonshine&amp;#x27;s exploration of the Jacobian conjecture as an example. It shows how the central logic of whether local nondegeneracy can force global injectivity is transferred to one-hidden-layer affine-ridge sigmoid networks. This leads to the formulation of the Neural Jacobian Conjecture (NJC): if such a network has strictly positive Jacobian determinant on the whole space, then it must be globally injective. By invoking GPT-5.5-pro and DeepSeek-V4-pro separately, Moonshine obtained independent complete proofs for the case \(N=n+1\). In addition, with the assistance of ChatGPT through interactive use of its web interface with GPT-5.5-pro, a geometric-topological proof was developed. These results provide preliminary evidence for the plausibility of the conjecture. The general higher-width case \(N\ge n+2\), however, remains unresolved and is left for further investigation. This work illustrates Moonshine&amp;#x27;s ability to autonomously generate meaningful mathematical problems and make rigorous progress on them.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use</title>
    <author>
      <name>Zhixin Ma</name>
    </author>
    <author>
      <name>Yutong Zhou</name>
    </author>
    <author>
      <name>Yongqi Li</name>
    </author>
    <author>
      <name>Chong-Wah Ngo</name>
    </author>
    <author>
      <name>Wenjie Li</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10803v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10803v1" />
    <published>2026-06-09T12:49:11Z</published>
    <updated>2026-06-09T12:49:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhixin Ma et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the &amp;quot;brain&amp;quot; of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs&amp;#x27; ability to assist humans in real-world tasks. Despite the importance, MLLMs&amp;#x27; proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs&amp;#x27; ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Evaluating Research-Level Math Proofs via Strict Step-Level Verification</title>
    <author>
      <name>Yifeng Sun</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10799v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10799v1" />
    <published>2026-06-09T12:46:35Z</published>
    <updated>2026-06-09T12:46:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yifeng Sun&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from &amp;quot;context poisoning,&amp;quot; in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of &amp;quot;pedantic hyper-rigor&amp;quot; stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Accelerating NeurASP with vectorization and caching</title>
    <author>
      <name>Alexander Philipp Rader</name>
    </author>
    <author>
      <name>Alessandra Russo</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10787v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10787v1" />
    <published>2026-06-09T12:37:43Z</published>
    <updated>2026-06-09T12:37:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Alexander Philipp Rader, Alessandra Russo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP&amp;#x27;s enhanced learning function.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages, 5 figures, to be published in the Theory and Practice of Logic Programming (TPLP) journal for the 42nd International Conference on Logic Programming (ICLP) issue&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation</title>
    <author>
      <name>Yuchen Ling</name>
    </author>
    <author>
      <name>Shengcheng Yu</name>
    </author>
    <author>
      <name>Zhenyu Chen</name>
    </author>
    <author>
      <name>Chunrong Fang</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10749v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10749v1" />
    <published>2026-06-09T12:01:07Z</published>
    <updated>2026-06-09T12:01:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuchen Ling et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment</title>
    <author>
      <name>Filippo Tonini</name>
    </author>
    <author>
      <name>Federico Torrielli</name>
    </author>
    <author>
      <name>Anton Danholt Lautrup</name>
    </author>
    <author>
      <name>Peter Schneider-Kamp</name>
    </author>
    <author>
      <name>Mustafa Mert Çelikok</name>
    </author>
    <author>
      <name>Lukas Galke Poech</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10747v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10747v1" />
    <published>2026-06-09T11:57:02Z</published>
    <updated>2026-06-09T11:57:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Filippo Tonini et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited &amp;quot;inspection budget&amp;quot;, meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;AITC 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models</title>
    <author>
      <name>Sai Kartheek Reddy Kasu</name>
    </author>
    <author>
      <name>Nils Lukas</name>
    </author>
    <author>
      <name>Samuele Poppi</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10740v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10740v1" />
    <published>2026-06-09T11:50:28Z</published>
    <updated>2026-06-09T11:50:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sai Kartheek Reddy Kasu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at the ICML 2026 FAGEN Workshop&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding</title>
    <author>
      <name>Zhiyuan Zhu</name>
    </author>
    <author>
      <name>Yixuan Chen</name>
    </author>
    <author>
      <name>Yiwen Shao</name>
    </author>
    <author>
      <name>Wenxiang Guo</name>
    </author>
    <author>
      <name>Changhao Pan</name>
    </author>
    <author>
      <name>Yu Zhang</name>
    </author>
    <author>
      <name>Yuxiang Wang</name>
    </author>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Houhua Zhang</name>
    </author>
    <author>
      <name>Chengkuan Zeng</name>
    </author>
    <author>
      <name>Wenbo Cheng</name>
    </author>
    <author>
      <name>Yunxi Liu</name>
    </author>
    <author>
      <name>Rui Yang</name>
    </author>
    <author>
      <name>Steve Yves</name>
    </author>
    <author>
      <name>Liefeng Bo</name>
    </author>
    <author>
      <name>Zhou Zhao</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10738v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10738v1" />
    <published>2026-06-09T11:50:06Z</published>
    <updated>2026-06-09T11:50:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhiyuan Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings</title>
    <author>
      <name>Roberto Martínez-Cruz</name>
    </author>
    <author>
      <name>Alvaro J. López-López</name>
    </author>
    <author>
      <name>José Portela</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10716v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10716v1" />
    <published>2026-06-09T11:24:07Z</published>
    <updated>2026-06-09T11:24:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Roberto Martínez-Cruz et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.IV] ++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation</title>
    <author>
      <name>Ana Sofia Santos</name>
    </author>
    <author>
      <name>André Ferreira</name>
    </author>
    <author>
      <name>Gijs Luijten</name>
    </author>
    <author>
      <name>Naida Solak</name>
    </author>
    <author>
      <name>Lisle Faray de Paiva</name>
    </author>
    <author>
      <name>Behrus Hinrichs-Puladi</name>
    </author>
    <author>
      <name>Jens Kleesiek</name>
    </author>
    <author>
      <name>Jan Egger</name>
    </author>
    <author>
      <name>Victor Alves</name>
    </author>
    <category term="eess.IV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10713v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10713v1" />
    <published>2026-06-09T11:19:09Z</published>
    <updated>2026-06-09T11:19:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ana Sofia Santos et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages, 1 figure, 2 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training</title>
    <author>
      <name>João Coelho</name>
    </author>
    <author>
      <name>João Magalhães</name>
    </author>
    <author>
      <name>Bruno Martins</name>
    </author>
    <author>
      <name>Chenyan Xiong</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10709v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10709v1" />
    <published>2026-06-09T11:12:58Z</published>
    <updated>2026-06-09T11:12:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;João Coelho et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-incorrect (too-hard) groups are zero-variance and waste rollout cost. Existing approaches treat zero-variance as a static property and either discard or pre-filter such groups. We hypothesize and empirically validate that queries flip between zero-variance and signal-bearing states as the policy evolves during training. Building on this intuition, we propose query recycling, which returns zero-variance groups to a mutable pool for future resampling, so that the effective training distribution co-evolves with the policy. With the proposed technique, a 1.7B parameter model trained on synthetic data can reach 66.0 average Pass@1 across seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters trained on benchmark-derived supervision. Analysis of recycling patterns shows that recycled queries supply roughly three quarters of the effective batch by the end of training, with contributions split between recovery from policy improvement and policy drift.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals</title>
    <author>
      <name>Jaewan Park</name>
    </author>
    <author>
      <name>Solbee Cho</name>
    </author>
    <author>
      <name>Jay-Yoon Lee</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10684v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10684v1" />
    <published>2026-06-09T10:40:55Z</published>
    <updated>2026-06-09T10:40:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jaewan Park et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent&amp;#x27;s reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory</title>
    <author>
      <name>Suozhao Ji</name>
    </author>
    <author>
      <name>Baodong Wu</name>
    </author>
    <author>
      <name>Zehao Wang</name>
    </author>
    <author>
      <name>Lei Xia</name>
    </author>
    <author>
      <name>Qingping Li</name>
    </author>
    <author>
      <name>Ruisong Wang</name>
    </author>
    <author>
      <name>Wenbo Ding</name>
    </author>
    <author>
      <name>Zhenhua Zhu</name>
    </author>
    <author>
      <name>Boxun Li</name>
    </author>
    <author>
      <name>Guohao Dai</name>
    </author>
    <author>
      <name>Yu Wang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10677v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10677v1" />
    <published>2026-06-09T10:31:51Z</published>
    <updated>2026-06-09T10:31:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Suozhao Ji et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves 64.7% overall score. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MA] Decentralized Multi-Agent Systems with Shared Context</title>
    <author>
      <name>Yuzhen Mao</name>
    </author>
    <author>
      <name>Azalia Mirhoseini</name>
    </author>
    <category term="cs.MA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10662v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10662v1" />
    <published>2026-06-09T10:13:07Z</published>
    <updated>2026-06-09T10:13:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuzhen Mao, Azalia Mirhoseini&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another&amp;#x27;s verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CY] Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting</title>
    <author>
      <name>Guillermo Llopis</name>
    </author>
    <category term="cs.CY" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10660v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10660v1" />
    <published>2026-06-09T10:08:36Z</published>
    <updated>2026-06-09T10:08:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Guillermo Llopis&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives.
  We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation -- using GPU energy benchmarks and regional grid carbon intensities -- down to a spend-based EEIO fallback for services where no usage data exists. Emission factors are derived from peer-reviewed GPU energy benchmarks (ML.ENERGY Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden&amp;#x27;s hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint. Data repository: https://doi.org/10.5281/zenodo.20443586. 18 pages, 3 figures, 6 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] STORM: Stepwise Token Optimization with Reward-Guided Beam Search</title>
    <author>
      <name>Arthur Satouf</name>
    </author>
    <author>
      <name>Giulio D'Erasmo</name>
    </author>
    <author>
      <name>Yuxuan Zong</name>
    </author>
    <author>
      <name>Habiboulaye Amadou Boubacar</name>
    </author>
    <author>
      <name>Pablo Piantanida</name>
    </author>
    <author>
      <name>Benjamin Piwowarski</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10621v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10621v1" />
    <published>2026-06-09T09:18:09Z</published>
    <updated>2026-06-09T09:18:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Arthur Satouf et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency</title>
    <author>
      <name>Xinrui Wu</name>
    </author>
    <author>
      <name>Lichen Huang</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10620v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10620v1" />
    <published>2026-06-09T09:17:55Z</published>
    <updated>2026-06-09T09:17:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xinrui Wu, Lichen Huang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Embedding Hybrid Systems into Continuous Latent Vector Fields</title>
    <author>
      <name>Sangli Teng</name>
    </author>
    <author>
      <name>Hang Liu</name>
    </author>
    <author>
      <name>Koushil Sreenath</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10596v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10596v1" />
    <published>2026-06-09T09:01:18Z</published>
    <updated>2026-06-09T09:01:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sangli Teng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This work proves that an $n$-dimensional hybrid system can be embedded into an $m$-dimensional Euclidean space equipped with a continuous vector field on its embedded image whenever $m&amp;gt;2n$. This result suggests that an intrinsically discontinuous hybrid system generically admits a continuous extrinsic representation that is well-posed for differentiable optimization. Building on this existence theorem, we show that a latent Neural ODE with consistency loss in both the latent and state space can accurately recover the flow of hybrid systems. Extensive experiments suggest the proposed method outperforms the existing method in learning hybrid systems with varying geometries from only time series data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Towards Diverse Scientific Hypothesis Search with Large Language Models</title>
    <author>
      <name>Haorui Wang</name>
    </author>
    <author>
      <name>Parshin Shojaee</name>
    </author>
    <author>
      <name>Kazem Meidani</name>
    </author>
    <author>
      <name>Kunyang Sun</name>
    </author>
    <author>
      <name>José Miguel Hernández-Lobato</name>
    </author>
    <author>
      <name>Teresa Head-Gordon</name>
    </author>
    <author>
      <name>Jiajun He</name>
    </author>
    <author>
      <name>Chandan K. Reddy</name>
    </author>
    <author>
      <name>Chao Zhang</name>
    </author>
    <author>
      <name>Yuanqi Du</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10587v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10587v1" />
    <published>2026-06-09T08:52:49Z</published>
    <updated>2026-06-09T08:52:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haorui Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) are on the rise for accelerating scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings, the goal is not to identify a single best hypothesis since validation can be noisy and expensive, and scientists benefit from a set of high-quality alternative hypotheses that hedge against downstream uncertainty for the best solutions. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA</title>
    <author>
      <name>Zhi Zheng</name>
    </author>
    <author>
      <name>Ziqiao Meng</name>
    </author>
    <author>
      <name>Hao Luan</name>
    </author>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Wee Sun Lee</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10572v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10572v1" />
    <published>2026-06-09T08:36:08Z</published>
    <updated>2026-06-09T08:36:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhi Zheng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs. This results in high token consumption and storage pressure, which makes these systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Benchmarking Knowledge Editing using Logical Rules</title>
    <author>
      <name>Tatiana Moteu Ngoli</name>
    </author>
    <author>
      <name>NDah Jean Kouagou</name>
    </author>
    <author>
      <name>Hamada M. Zahera</name>
    </author>
    <author>
      <name>Axel-Cyrille Ngonga Ngomo</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10554v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10554v1" />
    <published>2026-06-09T08:21:56Z</published>
    <updated>2026-06-09T08:21:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tatiana Moteu Ngoli et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at the 24th International Semantic Web Conference 2025&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Machine Learning Methods for Studying Latent Neural Activity Dynamics</title>
    <author>
      <name>Shufeng Kong</name>
    </author>
    <author>
      <name>Fumei Deng</name>
    </author>
    <author>
      <name>Xinyi Dong</name>
    </author>
    <author>
      <name>Caihua Liu</name>
    </author>
    <author>
      <name>Weiwei Chen</name>
    </author>
    <author>
      <name>Yingheng Wang</name>
    </author>
    <author>
      <name>Daniel Cao</name>
    </author>
    <author>
      <name>Azahara Oliva</name>
    </author>
    <author>
      <name>Antonio Fernandez-Ruiz</name>
    </author>
    <author>
      <name>Carla Gomes</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10530v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10530v1" />
    <published>2026-06-09T08:01:11Z</published>
    <updated>2026-06-09T08:01:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shufeng Kong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which includes models such as linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by IJCAI 2026 survey track&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner</title>
    <author>
      <name>Youssef Abdelkader</name>
    </author>
    <author>
      <name>Humbert Fiorino</name>
    </author>
    <author>
      <name>Damien Pellier</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10489v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10489v1" />
    <published>2026-06-09T07:07:48Z</published>
    <updated>2026-06-09T07:07:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Youssef Abdelkader et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is generating a sequence of actions, known as a plan, that helps us reach a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that leads from the initial state to the goal state. Programs that generate plans are called planners.
  In this paper, we did a complementary study to the state-of-the-art LLM called PlanGPT which was released last year. We redid some experiments to verify whether planning with LLMs is pertinent and worthwhile. We also checked whether the results obtained in the official PlanGPT paper for plan coverage were correct, and we performed a more comprehensive study of PlanGPT&amp;#x27;s performance using two metrics: Plan Cost and Plan Generation Time. The results of PlanGPT were compared to those produced by a traditional planner for the same plans and same metrics. We discovered that PlanGPT is no better than a Greedy search strategy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics</title>
    <author>
      <name>Shunkai Zhang</name>
    </author>
    <author>
      <name>Haoran Zhang</name>
    </author>
    <author>
      <name>Yun Luo</name>
    </author>
    <author>
      <name>Qianjia Cheng</name>
    </author>
    <author>
      <name>Haodi Lei</name>
    </author>
    <author>
      <name>Yizhuo Li</name>
    </author>
    <author>
      <name>Runzhe Zhan</name>
    </author>
    <author>
      <name>Zhilin Wang</name>
    </author>
    <author>
      <name>Bangjie Xu</name>
    </author>
    <author>
      <name>Yucheng Su</name>
    </author>
    <author>
      <name>Xinmiao Han</name>
    </author>
    <author>
      <name>Xiaoye Qu</name>
    </author>
    <author>
      <name>Dongrui Liu</name>
    </author>
    <author>
      <name>Zhouchen Lin</name>
    </author>
    <author>
      <name>Yu Qiao</name>
    </author>
    <author>
      <name>Ning Ding</name>
    </author>
    <author>
      <name>Yafu Li</name>
    </author>
    <author>
      <name>Yu Cheng</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10479v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10479v1" />
    <published>2026-06-09T06:50:15Z</published>
    <updated>2026-06-09T06:50:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shunkai Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today&amp;#x27;s strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;39 pages, 6 figures, 26 tables. Project page: https://simplified-reasoning.github.io/ComBench/docs/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks</title>
    <author>
      <name>Dhruv Dixit</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10471v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10471v1" />
    <published>2026-06-09T06:39:31Z</published>
    <updated>2026-06-09T06:39:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dhruv Dixit&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 Pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake</title>
    <author>
      <name>Haonan Wang</name>
    </author>
    <author>
      <name>Jiaxiang Liu</name>
    </author>
    <author>
      <name>Yurong Liu</name>
    </author>
    <author>
      <name>Austin Senna Wijaya</name>
    </author>
    <author>
      <name>Tianle Zhou</name>
    </author>
    <author>
      <name>Eden Wu</name>
    </author>
    <author>
      <name>Yijia Chen</name>
    </author>
    <author>
      <name>Wanting You</name>
    </author>
    <author>
      <name>Reya Vir</name>
    </author>
    <author>
      <name>Daniela Pinto</name>
    </author>
    <author>
      <name>Grace Fan</name>
    </author>
    <author>
      <name>Yusen Zhang</name>
    </author>
    <author>
      <name>Juliana Freire</name>
    </author>
    <author>
      <name>Eugene Wu</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10460v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10460v1" />
    <published>2026-06-09T06:15:36Z</published>
    <updated>2026-06-09T06:15:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haonan Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations</title>
    <author>
      <name>Zeyu Liu</name>
    </author>
    <author>
      <name>Xuanzhi Feng</name>
    </author>
    <author>
      <name>Sing Kwong Lai</name>
    </author>
    <author>
      <name>Yuanchen Gao</name>
    </author>
    <author>
      <name>Xiaoyi Pang</name>
    </author>
    <author>
      <name>Hualei Zhang</name>
    </author>
    <author>
      <name>Jingcai Guo</name>
    </author>
    <author>
      <name>Jie Zhang</name>
    </author>
    <author>
      <name>Song Guo</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10448v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10448v1" />
    <published>2026-06-09T05:55:06Z</published>
    <updated>2026-06-09T05:55:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zeyu Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the &amp;quot;Financial Entropy Trap&amp;quot;. In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns by achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint. Code available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems</title>
    <author>
      <name>Shuangchun Gui</name>
    </author>
    <author>
      <name>Zhiguang Cao</name>
    </author>
    <author>
      <name>Wen Song</name>
    </author>
    <author>
      <name>Yew-Soon Ong</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10431v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10431v1" />
    <published>2026-06-09T05:15:25Z</published>
    <updated>2026-06-09T05:15:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shuangchun Gui et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalanced issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by TNNLS&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries</title>
    <author>
      <name>Federico Bianchi</name>
    </author>
    <author>
      <name>Yongchan Kwon</name>
    </author>
    <author>
      <name>Aneesh Pappu</name>
    </author>
    <author>
      <name>James Zou</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10402v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10402v1" />
    <published>2026-06-09T04:25:28Z</published>
    <updated>2026-06-09T04:25:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Federico Bianchi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other&amp;#x27;s ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios</title>
    <author>
      <name>Sirui Liang</name>
    </author>
    <author>
      <name>Bohan Yu</name>
    </author>
    <author>
      <name>Peiyu Wang</name>
    </author>
    <author>
      <name>Shiguang Guo</name>
    </author>
    <author>
      <name>Wenxing Hu</name>
    </author>
    <author>
      <name>Pengfei Cao</name>
    </author>
    <author>
      <name>Jian Zhao</name>
    </author>
    <author>
      <name>Cao Liu</name>
    </author>
    <author>
      <name>Ke Zeng</name>
    </author>
    <author>
      <name>Xunliang Cai</name>
    </author>
    <author>
      <name>Kang Liu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10394v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10394v1" />
    <published>2026-06-09T04:16:35Z</published>
    <updated>2026-06-09T04:16:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sirui Liang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games</title>
    <author>
      <name>Haoran Li</name>
    </author>
    <author>
      <name>Zengle Ge</name>
    </author>
    <author>
      <name>Ziyang Zhang</name>
    </author>
    <author>
      <name>Xiaomin Yuan</name>
    </author>
    <author>
      <name>Yui Lo</name>
    </author>
    <author>
      <name>Qianhui Liu</name>
    </author>
    <author>
      <name>Bocheng An</name>
    </author>
    <author>
      <name>Dongke Rong</name>
    </author>
    <author>
      <name>Jiaqun Liu</name>
    </author>
    <author>
      <name>Annan Li</name>
    </author>
    <author>
      <name>Jianmin Wu</name>
    </author>
    <author>
      <name>Dawei Yin</name>
    </author>
    <author>
      <name>Dou Shen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10389v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10389v1" />
    <published>2026-06-09T03:55:31Z</published>
    <updated>2026-06-09T03:55:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haoran Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation</title>
    <author>
      <name>Wenhao Zhang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10385v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10385v1" />
    <published>2026-06-09T03:51:41Z</published>
    <updated>2026-06-09T03:51:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenhao Zhang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher&amp;#x27;s predictive distribution over the student&amp;#x27;s own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 8 figures. Project page: https://vanhowe.github.io/AR-OPD/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies</title>
    <author>
      <name>Zi Yin</name>
    </author>
    <author>
      <name>Peilin Chai</name>
    </author>
    <author>
      <name>Siyuan Huang</name>
    </author>
    <author>
      <name>Zhanhao Hu</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10371v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10371v1" />
    <published>2026-06-09T03:31:09Z</published>
    <updated>2026-06-09T03:31:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zi Yin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation</title>
    <author>
      <name>Xuanchen Li</name>
    </author>
    <author>
      <name>Tianrui Wang</name>
    </author>
    <author>
      <name>Yuheng Lu</name>
    </author>
    <author>
      <name>Zikang Huang</name>
    </author>
    <author>
      <name>Yu Jiang</name>
    </author>
    <author>
      <name>Chenghan Lin</name>
    </author>
    <author>
      <name>Chenrui Cui</name>
    </author>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Xingyu Ma</name>
    </author>
    <author>
      <name>Chunyu Qiang</name>
    </author>
    <author>
      <name>Guochen Yu</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <author>
      <name>Longbiao Wang</name>
    </author>
    <author>
      <name>Jianwu Dang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10368v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10368v1" />
    <published>2026-06-09T03:27:30Z</published>
    <updated>2026-06-09T03:27:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xuanchen Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations</title>
    <author>
      <name>Zhuohang Jiang</name>
    </author>
    <author>
      <name>Yuxin Chen</name>
    </author>
    <author>
      <name>Shijie Wang</name>
    </author>
    <author>
      <name>Haohao Qu</name>
    </author>
    <author>
      <name>Zhou Jindong</name>
    </author>
    <author>
      <name>Wenqi Fan</name>
    </author>
    <author>
      <name>Li Qing</name>
    </author>
    <author>
      <name>Dongxu Liang</name>
    </author>
    <author>
      <name>Jun Wang</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10357v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10357v1" />
    <published>2026-06-09T03:13:52Z</published>
    <updated>2026-06-09T03:13:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhuohang Jiang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400x inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce&amp;#x27;s real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs</title>
    <author>
      <name>Runze Jiang</name>
    </author>
    <author>
      <name>Taiqiang Wu</name>
    </author>
    <author>
      <name>Yan Wang</name>
    </author>
    <author>
      <name>Bingyu Zhu</name>
    </author>
    <author>
      <name>Longtao Huang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10298v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10298v1" />
    <published>2026-06-09T01:34:13Z</published>
    <updated>2026-06-09T01:34:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Runze Jiang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the conflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a power family with an inherent regime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16-33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;27 pages, 9 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge</title>
    <author>
      <name>Ali Keramati</name>
    </author>
    <author>
      <name>Justin Cheok</name>
    </author>
    <author>
      <name>Jacob Horne</name>
    </author>
    <author>
      <name>Mark Warschauer</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10296v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10296v1" />
    <published>2026-06-09T01:33:15Z</published>
    <updated>2026-06-09T01:33:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ali Keramati et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent&amp;#x27;s reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;15 pages, 7 figures, 1 table, ACL proceedings&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction</title>
    <author>
      <name>Dongjun Lee</name>
    </author>
    <author>
      <name>Juheon Choi</name>
    </author>
    <author>
      <name>Dong Kyu Shin</name>
    </author>
    <author>
      <name>Sinjae Kang</name>
    </author>
    <author>
      <name>Kimin Lee</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10276v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10276v1" />
    <published>2026-06-09T00:50:29Z</published>
    <updated>2026-06-09T00:50:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dongjun Lee et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication on language. In this work, we present EDITH, a robot framework that captures the human&amp;#x27;s nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human&amp;#x27;s first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human&amp;#x27;s intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human&amp;#x27;s nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;We provide video demos and code in: https://project-edith.github.io&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents</title>
    <author>
      <name>Jiaheng Hu</name>
    </author>
    <author>
      <name>Mohit Shridhar</name>
    </author>
    <author>
      <name>Caden Lu</name>
    </author>
    <author>
      <name>Dhruv Shah</name>
    </author>
    <author>
      <name>Hao-Tien Lewis Chiang</name>
    </author>
    <author>
      <name>Jie Tan</name>
    </author>
    <author>
      <name>Annie Xie</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10267v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10267v1" />
    <published>2026-06-09T00:24:00Z</published>
    <updated>2026-06-09T00:24:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaheng Hu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning</title>
    <author>
      <name>Yiteng Mao</name>
    </author>
    <author>
      <name>Kenan Xu</name>
    </author>
    <author>
      <name>Yijia Lyu</name>
    </author>
    <author>
      <name>Wenhao Li</name>
    </author>
    <author>
      <name>Jianlong Chen</name>
    </author>
    <author>
      <name>Xiangfeng Wang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10254v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10254v1" />
    <published>2026-06-08T23:40:34Z</published>
    <updated>2026-06-08T23:40:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yiteng Mao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark &amp;quot;Evaluation Gap&amp;quot;: judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a &amp;quot;structural collapse&amp;quot; into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale</title>
    <author>
      <name>Takehiko Ohkawa</name>
    </author>
    <author>
      <name>Jumpei Arima</name>
    </author>
    <author>
      <name>Yuki Noguchi</name>
    </author>
    <author>
      <name>Masatoshi Tateno</name>
    </author>
    <author>
      <name>Makoto Sugiura</name>
    </author>
    <author>
      <name>Takuya Okubo</name>
    </author>
    <author>
      <name>Kengo Ikeuchi</name>
    </author>
    <author>
      <name>Yuma Shin</name>
    </author>
    <author>
      <name>Hiroki Nishizawa</name>
    </author>
    <author>
      <name>Naoaki Kanazawa</name>
    </author>
    <author>
      <name>Yuki Wakayama</name>
    </author>
    <author>
      <name>Daiki Fukunaga</name>
    </author>
    <author>
      <name>Koshi Makihara</name>
    </author>
    <author>
      <name>Tomohiro Motoda</name>
    </author>
    <author>
      <name>Floris Erich</name>
    </author>
    <author>
      <name>Yukiyasu Domae</name>
    </author>
    <author>
      <name>Tatsuya Matsushima</name>
    </author>
    <author>
      <name>Yohishiro Okumatsu</name>
    </author>
    <author>
      <name>Kei Ota</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10244v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10244v1" />
    <published>2026-06-08T23:21:14Z</published>
    <updated>2026-06-08T23:21:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Takehiko Ohkawa et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://yubi.airoa.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph</title>
    <author>
      <name>Yohei Nakajima</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10241v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10241v1" />
    <published>2026-06-08T23:04:35Z</published>
    <updated>2026-06-08T23:04:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yohei Nakajima&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent&amp;#x27;s own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent&amp;#x27;s state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;30 pages, 5 figures. Code and committed runs: https://github.com/yoheinakajima/regimes&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[q-bio.NC] Hyperbolic Neural Population Geometry Benefits Computation</title>
    <author>
      <name>Dennis Wu</name>
    </author>
    <author>
      <name>Yi-Chun Hung</name>
    </author>
    <author>
      <name>Braden Yuille</name>
    </author>
    <author>
      <name>James E. Fitzgerald</name>
    </author>
    <author>
      <name>Han Liu</name>
    </author>
    <category term="q-bio.NC" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10238v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10238v1" />
    <published>2026-06-08T22:57:39Z</published>
    <updated>2026-06-08T22:57:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dennis Wu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026, 37 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration</title>
    <author>
      <name>Kaustubh Mani</name>
    </author>
    <author>
      <name>Yann Pequignot</name>
    </author>
    <author>
      <name>Vincent Mai</name>
    </author>
    <author>
      <name>Liam Paull</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10228v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10228v1" />
    <published>2026-06-08T22:40:45Z</published>
    <updated>2026-06-08T22:40:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kaustubh Mani et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor&amp;#x27;s sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor&amp;#x27;s epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICLR 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing</title>
    <author>
      <name>Awais Khan</name>
    </author>
    <author>
      <name>Kutub Uddin</name>
    </author>
    <author>
      <name>Khalid Malik</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10223v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10223v1" />
    <published>2026-06-08T22:22:48Z</published>
    <updated>2026-06-08T22:22:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Awais Khan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport</title>
    <author>
      <name>Sidahmed Benabderrahmanea</name>
    </author>
    <author>
      <name>Petko Valtchev</name>
    </author>
    <author>
      <name>James Cheney</name>
    </author>
    <author>
      <name>Talal Rahwan</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10216v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10216v1" />
    <published>2026-06-08T22:13:42Z</published>
    <updated>2026-06-08T22:13:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sidahmed Benabderrahmanea et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These challenges are amplified in cross-operating-system (cross-OS) settings, where a detector trained on one source platform must be deployed on an unlabeled target platform without access to target-domain labels. We study this source-only cross-OS APT detection problem using system-level provenance traces and propose a transport-based framework for ranking anomalous target processes under zero target supervision. The framework abstracts process behavior into structured natural-language descriptions, embeds them using pretrained language models, and constructs a source-normal reference for target scoring. It combines three evidence channels: semantic deviation from source-normal prototypes, structural deviation captured by graph autoencoding, and geometric deviation measured through Optimal Transport (OT). The main contribution is an OT-based barycentric anomaly score that projects target embeddings onto the source-normal manifold and quantifies residual transport mismatch. We further introduce entropy-weighted, angle-aware, and density-aware OT variants to capture uncertainty, directional drift, and sparse-support behavior. Evaluation on DARPA Transparent Computing data spanning Linux, Windows, BSD, and Android, across two APT scenarios and twelve cross-OS transfer pairs, shows that the proposed framework improves ROC-AUC and nDCG over source-only anomaly-detection baselines. The results demonstrate that source-only provenance modeling, combined with semantic abstraction and OT-based anomaly scoring, can support practical cross-platform APT detection without target-domain supervision.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[astro-ph.GA] Integral Field Unit Spectroscopy with One Fiber</title>
    <author>
      <name>Zehao Peng</name>
    </author>
    <author>
      <name>Biprateep Dey</name>
    </author>
    <author>
      <name>Chris J. Maddison</name>
    </author>
    <author>
      <name>Joshua S. Speagle</name>
    </author>
    <category term="astro-ph.GA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10197v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10197v1" />
    <published>2026-06-08T21:35:52Z</published>
    <updated>2026-06-08T21:35:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zehao Peng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Integral field unit (IFU) spectroscopy provides spatially resolved spectra across galaxies, offering crucial insights into their evolution. However, its high observational cost limits current IFU datasets to $\sim 10^4$ objects. We present a multi-modal, probabilistic foundation model that predicts high-resolution spectra with calibrated uncertainties at arbitrary spatial locations within a galaxy directly from broadband images. Built on a masked autoencoder framework, our architecture injects fiber positional encodings and redshift aware wavelength encodings, enabling spatially conditioned predictions. Trained on 4.7 million images and single fiber spectroscopic observations from the Dark Energy Spectroscopic Instrument (DESI) survey, our model exploits the natural variance of fiber placements and the morphological self-similarity of galaxies to achieve IFU-like capabilities without any IFU training data. Predicted emission line flux maps match independent IFU observations from the Mapping Nearby Galaxies at APO (MaNGA) survey, with performance comparable to a supervised baseline trained directly on IFU data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted for Conference on Physics and AI at Stanford University (PAI 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning</title>
    <author>
      <name>Ghodsiyeh Rostami</name>
    </author>
    <author>
      <name>Po-Han Chen</name>
    </author>
    <author>
      <name>Mahdi S. Hosseini</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10196v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10196v1" />
    <published>2026-06-08T21:35:11Z</published>
    <updated>2026-06-08T21:35:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ghodsiyeh Rostami et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce FisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our code publicly at https://github.com/AtlasAnalyticsLab/FisherAdapTune to enable further application of our proposed approach.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] $τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems</title>
    <author>
      <name>Bharath Sivaram Narasimhan</name>
    </author>
    <author>
      <name>Karthik R Narasimhan</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10156v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10156v1" />
    <published>2026-06-08T20:35:45Z</published>
    <updated>2026-06-08T20:35:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bharath Sivaram Narasimhan, Karthik R Narasimhan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on &amp;quot;LLM-as-a-judge&amp;quot; evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression</title>
    <author>
      <name>Shaohao Rui</name>
    </author>
    <author>
      <name>Xiaofeng Mao</name>
    </author>
    <author>
      <name>Zhanyu Zhang</name>
    </author>
    <author>
      <name>Peijia Lin</name>
    </author>
    <author>
      <name>Yansong Zhu</name>
    </author>
    <author>
      <name>Yibo Zhang</name>
    </author>
    <author>
      <name>Haibin Wan</name>
    </author>
    <author>
      <name>Weijie Ma</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10135v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10135v1" />
    <published>2026-06-08T20:08:41Z</published>
    <updated>2026-06-08T20:08:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shaohao Rui et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD&amp;#x27;s mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Pareto-Guided Teacher Alignment for Fair Personalized Text Generation</title>
    <author>
      <name>Tunazzina Islam</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10126v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10126v1" />
    <published>2026-06-08T19:57:13Z</published>
    <updated>2026-06-08T19:57:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tunazzina Islam&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention</title>
    <author>
      <name>Asiful Arefeen</name>
    </author>
    <author>
      <name>Carol Johnston</name>
    </author>
    <author>
      <name>Hassan Ghasemzadeh</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10120v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10120v1" />
    <published>2026-06-08T19:52:08Z</published>
    <updated>2026-06-08T19:52:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Asiful Arefeen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from $25$ individuals to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range ($\leq 140$ mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing</title>
    <author>
      <name>Antonio Castaldo</name>
    </author>
    <author>
      <name>Johanna Monti</name>
    </author>
    <author>
      <name>Sheila Castilho</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10113v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10113v1" />
    <published>2026-06-08T19:46:00Z</published>
    <updated>2026-06-08T19:46:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Antonio Castaldo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood&amp;#x27;s Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author&amp;#x27;s voice.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] What makes a harness a harness: necessary and sufficient conditions for an agent harness</title>
    <author>
      <name>Sanderson Oliveira de Macedo</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10106v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10106v1" />
    <published>2026-06-08T19:35:37Z</published>
    <updated>2026-06-08T19:35:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sanderson Oliveira de Macedo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps a language model and turns it into a coding agent able to act on a repository. The usage is loose and polysemous. Sometimes the term denotes the whole product (Claude Code, Codex CLI); sometimes it denotes the evaluation scaffold that runs an agent against tasks (the SWE-bench harness); sometimes it gets conflated with an agent framework, an SDK, an IDE plugin, or an orchestrator. What is missing is a reference definition that works as an instrument, one that includes and excludes cases consistently. We build that definition through a conceptual analysis that combines works with persistent identifiers and primary grey-literature sources, such as official documentation, glossaries, and engineering reports. We reconstruct the genealogy of the term, from the horse&amp;#x27;s tack to the classic test harness, to the machine-learning evaluation harness, and finally to the agent harness. We then propose a constitutive definition that states the necessary and sufficient conditions for a system to be an agent harness, we operationalize it as an inclusion and exclusion test, and we draw the boundary of the concept against an agent framework, an agent SDK, an IDE plugin, an eval harness, and an orchestrator. We apply the definition to six real harnesses (Claude Code, Codex CLI, Aider, Cline, OpenHands, and SWE-agent) and to deliberate edge cases; the test includes and excludes consistently. We close with a research agenda organized by design tension axes. The contribution is an operational definition of agent harness, with a shared vocabulary, able to guide engineering practice and the scientific comparison of agentic systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion</title>
    <author>
      <name>Rafael Rivera Soto</name>
    </author>
    <author>
      <name>Barry Chen</name>
    </author>
    <author>
      <name>Nicholas Andrews</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10099v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10099v1" />
    <published>2026-06-08T19:28:52Z</published>
    <updated>2026-06-08T19:28:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rafael Rivera Soto et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport</title>
    <author>
      <name>Md Sadek Hossain Asif</name>
    </author>
    <author>
      <name>Tanzila Khan</name>
    </author>
    <author>
      <name>Md. Mosaddek Khan</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10071v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10071v1" />
    <published>2026-06-08T18:47:31Z</published>
    <updated>2026-06-08T18:47:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Md Sadek Hossain Asif et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces</title>
    <author>
      <name>Shardul Bansal</name>
    </author>
    <author>
      <name>Seth Schilbe</name>
    </author>
    <author>
      <name>Jarrod Barnes</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10064v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10064v1" />
    <published>2026-06-08T18:39:15Z</published>
    <updated>2026-06-08T18:39:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shardul Bansal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Small-model agentic post-training is bottlenecked less by the algorithm than by the trajectory substrate it consumes. Leading recipes (RLVR, group-relative RL, rejection-sampled re-SFT) all need multi-turn traces carrying per-trajectory supervision, and the two existing sources fall short: frontier-synthesised data inherits the synthesizer&amp;#x27;s biases and collapses the long tail, while unfiltered production logs are unjudged and contaminated by shortcut behaviour. We argue that an incentive-aligned agent arena can be engineered to manufacture such trajectories, and demonstrate this on ORO Subnet 15 (SN15), a Bittensor deployment of the ShoppingBench agentic-commerce benchmark. SN15&amp;#x27;s race mechanism, LLM reasoning judge, and rotating leak-cluster-guarded problem suite yield a corpus with three properties: incentive-aligned diversity, per-trajectory judging, and anti-memorised held-out evaluation. We introduce a structural-quality filter that converts the raw firehose into a trainable corpus by keeping agentic trajectories (the model itself emits the tool calls) and rejecting sub-task trajectories (the model only classifies or narrates over a deterministic search loop), then post-train Qwen3-4B with a recipe matched to the published ShoppingBench SFT-then-GRPO pipeline. On a leak-cluster-guarded held-out partition scored production-strict, the model lifts from the published Qwen3-4B base of 18.0% ASR to 42.7%, within single-problem noise of the synthetic-data SFT-only baseline (43.6%), while training on a fraction of a single day of subnet output. The supervised stack leaves a large pass@8 to pass@1 gap (53.3% vs 34.8%); a per-step teacher-grounded Dr. GRPO reward converts that headroom into process improvement, and we identify the sub-task firehose as the primary lever for closing the gap to the 48.7% SFT+GRPO bar. We release the filter, the corpus splits, and the arena mechanics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;10 pages, 4 figures, Data and Models available at: https://huggingface.co/collections/oro-ai/shoppingbench-sn15-trajectory-primitive&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization</title>
    <author>
      <name>Ray Zhang</name>
    </author>
    <author>
      <name>Marcus Greiff</name>
    </author>
    <author>
      <name>Thomas Lew</name>
    </author>
    <author>
      <name>John Subosits</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10019v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10019v1" />
    <published>2026-06-08T18:07:11Z</published>
    <updated>2026-06-08T18:07:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ray Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $&amp;gt;55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages, 12 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment</title>
    <author>
      <name>Chien-Chun Wang</name>
    </author>
    <author>
      <name>Hung-Shin Lee</name>
    </author>
    <author>
      <name>Hsin-Min Wang</name>
    </author>
    <author>
      <name>Berlin Chen</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10010v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10010v1" />
    <published>2026-06-08T18:01:20Z</published>
    <updated>2026-06-08T18:01:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chien-Chun Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman&amp;#x27;s rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. Experiments on MusicEval demonstrate that, by mitigating the point-wise training mismatch and modality drift, our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to IEEE Signal Processing Letters (SPL)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics</title>
    <author>
      <name>Mingxian Lin</name>
    </author>
    <author>
      <name>Shengju Qian</name>
    </author>
    <author>
      <name>Yuqi Liu</name>
    </author>
    <author>
      <name>Yi-Hua Huang</name>
    </author>
    <author>
      <name>Yiyu Wang</name>
    </author>
    <author>
      <name>Wei Huang</name>
    </author>
    <author>
      <name>Yitang Li</name>
    </author>
    <author>
      <name>Fan Zhang</name>
    </author>
    <author>
      <name>Zeyu Hu</name>
    </author>
    <author>
      <name>Lingting Zhu</name>
    </author>
    <author>
      <name>Xin Wang</name>
    </author>
    <author>
      <name>Xiaojuan Qi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09826v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09826v1" />
    <published>2026-06-08T17:59:43Z</published>
    <updated>2026-06-08T17:59:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mingxian Lin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing</title>
    <author>
      <name>Jisong Cai</name>
    </author>
    <author>
      <name>Long Ling</name>
    </author>
    <author>
      <name>Shiwei Chu</name>
    </author>
    <author>
      <name>Zhongshan Liu</name>
    </author>
    <author>
      <name>Jiayue Kang</name>
    </author>
    <author>
      <name>Zhixuan Liang</name>
    </author>
    <author>
      <name>Wenjie Xu</name>
    </author>
    <author>
      <name>Yinan Mao</name>
    </author>
    <author>
      <name>Weinan Zhang</name>
    </author>
    <author>
      <name>Xiaokang Yang</name>
    </author>
    <author>
      <name>Ru Ying</name>
    </author>
    <author>
      <name>Ran Zheng</name>
    </author>
    <author>
      <name>Yao Mu</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09811v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09811v1" />
    <published>2026-06-08T17:55:18Z</published>
    <updated>2026-06-08T17:55:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jisong Cai et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://serene-sivy.github.io/aha-wam/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting</title>
    <author>
      <name>Avijit Ghosh</name>
    </author>
    <author>
      <name>Anka Reuel</name>
    </author>
    <author>
      <name>Jenny Chim</name>
    </author>
    <author>
      <name>Wm. Matthew Kennedy</name>
    </author>
    <author>
      <name>Srishti Yadav</name>
    </author>
    <author>
      <name>Jennifer Mickel</name>
    </author>
    <author>
      <name>Yanan Long</name>
    </author>
    <author>
      <name>Andrew Tran</name>
    </author>
    <author>
      <name>Anastassia Kornilova</name>
    </author>
    <author>
      <name>Damian Stachura</name>
    </author>
    <author>
      <name>Kevin Klyman</name>
    </author>
    <author>
      <name>Felix Friedrich</name>
    </author>
    <author>
      <name>Jeba Sania</name>
    </author>
    <author>
      <name>Jan Batzner</name>
    </author>
    <author>
      <name>Anoop Mishra</name>
    </author>
    <author>
      <name>Eliya Habba</name>
    </author>
    <author>
      <name>Yixiong Hao</name>
    </author>
    <author>
      <name>Nathan Heath</name>
    </author>
    <author>
      <name>Shalaleh Rismani</name>
    </author>
    <author>
      <name>Usman Gohar</name>
    </author>
    <author>
      <name>Andrea Loehr</name>
    </author>
    <author>
      <name>David Manheim</name>
    </author>
    <author>
      <name>Ruchira Dhar</name>
    </author>
    <author>
      <name>Sree Harsha Nelaturu</name>
    </author>
    <author>
      <name>Aarush Sinha</name>
    </author>
    <author>
      <name>Leshem Choshen</name>
    </author>
    <author>
      <name>Drishti Sharma</name>
    </author>
    <author>
      <name>Ishan Khire</name>
    </author>
    <author>
      <name>Amit Saha</name>
    </author>
    <author>
      <name>Subramanyam Sahoo</name>
    </author>
    <author>
      <name>Michael Hardy</name>
    </author>
    <author>
      <name>Michael Alexander Riegler</name>
    </author>
    <author>
      <name>Kabir Manghnani</name>
    </author>
    <author>
      <name>Michelle Lin</name>
    </author>
    <author>
      <name>Yanan Jiang</name>
    </author>
    <author>
      <name>Yilin Huang</name>
    </author>
    <author>
      <name>Asaf Yehudai</name>
    </author>
    <author>
      <name>Jessica Ji</name>
    </author>
    <author>
      <name>Aris Hofmann</name>
    </author>
    <author>
      <name>Mubashara Akhtar</name>
    </author>
    <author>
      <name>Max Lamparth</name>
    </author>
    <author>
      <name>Nuno Moniz</name>
    </author>
    <author>
      <name>Yacine Jernite</name>
    </author>
    <author>
      <name>Stella Biderman</name>
    </author>
    <author>
      <name>Zeerak Talat</name>
    </author>
    <author>
      <name>Sanmi Koyejo</name>
    </author>
    <author>
      <name>Mykel Kochenderfer</name>
    </author>
    <author>
      <name>Irene Solaiman</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09809v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09809v2" />
    <published>2026-06-08T17:55:02Z</published>
    <updated>2026-06-09T17:58:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Avijit Ghosh et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies Evaluation Cards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Topological Neural Operators</title>
    <author>
      <name>Lennart Bastian</name>
    </author>
    <author>
      <name>Samuel Leventhal</name>
    </author>
    <author>
      <name>Mustafa Hajij</name>
    </author>
    <author>
      <name>Tolga Birdal</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09806v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09806v1" />
    <published>2026-06-08T17:54:33Z</published>
    <updated>2026-06-08T17:54:33Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lennart Bastian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts</title>
    <author>
      <name>Udvas Das</name>
    </author>
    <author>
      <name>Waris Radji</name>
    </author>
    <author>
      <name>Debabrota Basu</name>
    </author>
    <author>
      <name>Odalric-Ambrym Maillard</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09802v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09802v1" />
    <published>2026-06-08T17:53:29Z</published>
    <updated>2026-06-08T17:53:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Udvas Das et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We consider a variant of linear contextual stochastic multi-armed bandits in which the learner must provide recommendations to a group of users, each with a personalized preference vector, in the presence of context distributions that drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case in which the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal{O}}\left(\frac{\kappa}{\tilde{\Delta}} d^2 \log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\boldsymbol{\pi}_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] FASE: Fast Adaptive Semantic Entropy for Code Quality</title>
    <author>
      <name>Shizhe Lin</name>
    </author>
    <author>
      <name>Ladan Tahvildari</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09800v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09800v1" />
    <published>2026-06-08T17:53:05Z</published>
    <updated>2026-06-08T17:53:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shizhe Lin, Ladan Tahvildari&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation</title>
    <author>
      <name>Matthew Ho</name>
    </author>
    <author>
      <name>Brian Liu</name>
    </author>
    <author>
      <name>Jixuan Chen</name>
    </author>
    <author>
      <name>Audrey Wang</name>
    </author>
    <author>
      <name>Lianhui Qin</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09774v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09774v1" />
    <published>2026-06-08T17:35:17Z</published>
    <updated>2026-06-08T17:35:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Matthew Ho et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator&amp;#x27;s executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan</title>
    <author>
      <name>Alexander Chulzhanov</name>
    </author>
    <author>
      <name>Soeren Eberhardt</name>
    </author>
    <author>
      <name>Arjun Mukherjee</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09767v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09767v1" />
    <published>2026-06-08T17:29:08Z</published>
    <updated>2026-06-08T17:29:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Alexander Chulzhanov et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q&amp;#x27;eqchi&amp;#x27; Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model.
  In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Difference-Aware Retrieval Policies for Imitation Learning</title>
    <author>
      <name>Quinn Pfeifer</name>
    </author>
    <author>
      <name>Ethan Pronovost</name>
    </author>
    <author>
      <name>Paarth Shah</name>
    </author>
    <author>
      <name>Khimya Khetarpal</name>
    </author>
    <author>
      <name>Siddhartha Srinivasa</name>
    </author>
    <author>
      <name>Abhishek Gupta</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09758v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09758v1" />
    <published>2026-06-08T17:18:19Z</published>
    <updated>2026-06-08T17:18:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Quinn Pfeifer et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Collaborative Human-Agent Protocol (CHAP)</title>
    <author>
      <name>Arsalan Shahid</name>
    </author>
    <author>
      <name>Gordon Suttie</name>
    </author>
    <author>
      <name>Philip Black</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09751v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09751v1" />
    <published>2026-06-08T17:11:42Z</published>
    <updated>2026-06-08T17:11:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Arsalan Shahid et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent&amp;#x27;s draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback</title>
    <author>
      <name>Rishabh Sabharwal</name>
    </author>
    <author>
      <name>Hongru Wang</name>
    </author>
    <author>
      <name>Amos Storkey</name>
    </author>
    <author>
      <name>Jeff Z. Pan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09748v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09748v1" />
    <published>2026-06-08T17:08:36Z</published>
    <updated>2026-06-08T17:08:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rishabh Sabharwal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Published as a workshop paper at SCALE - ICML 2026 (Oral)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research</title>
    <author>
      <name>Pu Ning</name>
    </author>
    <author>
      <name>Quan Chen</name>
    </author>
    <author>
      <name>Kun Tao</name>
    </author>
    <author>
      <name>Xinyu Tang</name>
    </author>
    <author>
      <name>Tianshu Wang</name>
    </author>
    <author>
      <name>Qianggang Cao</name>
    </author>
    <author>
      <name>Xinyu Kong</name>
    </author>
    <author>
      <name>Zujie Wen</name>
    </author>
    <author>
      <name>Zhiqiang Zhang</name>
    </author>
    <author>
      <name>Jun Zhou</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09730v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09730v1" />
    <published>2026-06-08T16:52:26Z</published>
    <updated>2026-06-08T16:52:26Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Pu Ning et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent&amp;#x27;s context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent&amp;#x27;s workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AR] An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats</title>
    <author>
      <name>Dmitrii Vasilev</name>
    </author>
    <category term="cs.AR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09686v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09686v1" />
    <published>2026-06-08T16:04:15Z</published>
    <updated>2026-06-08T16:04:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dmitrii Vasilev&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler.
  This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor&amp;#x27;s implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages. Source repository: https://github.com/gHashTag/paper3-methodology tag v4.0-trinity. Paper CC BY 4.0; code MIT. ORCID 0009-0008-4294-6159&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation</title>
    <author>
      <name>Dohwan Kim</name>
    </author>
    <author>
      <name>Jung-Woo Choi</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09677v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09677v2" />
    <published>2026-06-08T15:58:31Z</published>
    <updated>2026-06-09T08:08:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dohwan Kim, Jung-Woo Choi&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;5 pages, accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery</title>
    <author>
      <name>Suraj Biswas</name>
    </author>
    <author>
      <name>Saurabh Gupta</name>
    </author>
    <author>
      <name>Pritam Mukherjee</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09672v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09672v1" />
    <published>2026-06-08T15:54:28Z</published>
    <updated>2026-06-08T15:54:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Suraj Biswas et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Ask a pretrained biomedical language model whether &amp;quot;cortisol 28 ug/dL&amp;quot; and &amp;quot;stock-market volatility&amp;quot; are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%.
  Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user&amp;#x27;s life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness.
  We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, 18 figures, 9 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data</title>
    <author>
      <name>Yinyu Huang</name>
    </author>
    <author>
      <name>Yilin Zhang</name>
    </author>
    <author>
      <name>Sofia Michopoulou</name>
    </author>
    <author>
      <name>Christopher Kipps</name>
    </author>
    <author>
      <name>Rahman Attar</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09671v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09671v1" />
    <published>2026-06-08T15:54:10Z</published>
    <updated>2026-06-08T15:54:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yinyu Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Alzheimer&amp;#x27;s disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer&amp;#x27;s Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages, 5 figures, 3 tables. Accepted as a full-length paper at the International Conference on AI in Healthcare (AIiH) 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision</title>
    <author>
      <name>Mateo Diaz-Bone</name>
    </author>
    <author>
      <name>Daniel Caraballo</name>
    </author>
    <author>
      <name>Florian Scheidegger</name>
    </author>
    <author>
      <name>Thomas Frick</name>
    </author>
    <author>
      <name>Mattia Rigotti</name>
    </author>
    <author>
      <name>Andrea Bartezzaghi</name>
    </author>
    <author>
      <name>Roy Assaf</name>
    </author>
    <author>
      <name>Niccolo Avogaro</name>
    </author>
    <author>
      <name>Yagmur G. Cinar</name>
    </author>
    <author>
      <name>Brown Ebouky</name>
    </author>
    <author>
      <name>Filip M. Janicki</name>
    </author>
    <author>
      <name>Piotr S. Kluska</name>
    </author>
    <author>
      <name>Cezary Skura</name>
    </author>
    <author>
      <name>Cristiano Malossi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09670v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09670v1" />
    <published>2026-06-08T15:52:05Z</published>
    <updated>2026-06-08T15:52:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mateo Diaz-Bone et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions - such as consistent object scale, viewpoint, background, illumination, and centered placement - are violated. When they occur, such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks</title>
    <author>
      <name>Hongcheng Gao</name>
    </author>
    <author>
      <name>Hailong Qu</name>
    </author>
    <author>
      <name>Jingyi Tang</name>
    </author>
    <author>
      <name>Jiahao Wang</name>
    </author>
    <author>
      <name>Zihao Huang</name>
    </author>
    <author>
      <name>Hengkang Qiao</name>
    </author>
    <author>
      <name>Shihong Huang</name>
    </author>
    <author>
      <name>Junming Yang</name>
    </author>
    <author>
      <name>Yi Li</name>
    </author>
    <author>
      <name>Hongyixuan Yuan</name>
    </author>
    <author>
      <name>Wenjie Li</name>
    </author>
    <author>
      <name>Bohan Zeng</name>
    </author>
    <author>
      <name>Wenbo Li</name>
    </author>
    <author>
      <name>Bo Wang</name>
    </author>
    <author>
      <name>Jianhui Liu</name>
    </author>
    <author>
      <name>Olive Huang</name>
    </author>
    <author>
      <name>Haoyang Huang</name>
    </author>
    <author>
      <name>Wentao Zhang</name>
    </author>
    <author>
      <name>Guoqing Huang</name>
    </author>
    <author>
      <name>Nan Duan</name>
    </author>
    <author>
      <name>Yinpeng Dong</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09669v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09669v1" />
    <published>2026-06-08T15:51:51Z</published>
    <updated>2026-06-08T15:51:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hongcheng Gao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design</title>
    <author>
      <name>Dun Li</name>
    </author>
    <author>
      <name>Jiatao Li</name>
    </author>
    <author>
      <name>Hongzhi Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09663v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09663v1" />
    <published>2026-06-08T15:45:15Z</published>
    <updated>2026-06-08T15:45:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dun Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, 2 figures, 7 tables. Supplementary code: https://github.com/DunLi-Tsinghua/MetaAI-Mini&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] End-to-End Context Compression at Scale</title>
    <author>
      <name>Ang Li</name>
    </author>
    <author>
      <name>Sean McLeish</name>
    </author>
    <author>
      <name>Haozhe Chen</name>
    </author>
    <author>
      <name>Nimit Kalra</name>
    </author>
    <author>
      <name>Zaiqian Chen</name>
    </author>
    <author>
      <name>Artem Gazizov</name>
    </author>
    <author>
      <name>Venkata Anoop Suhas Kumar Morisetty</name>
    </author>
    <author>
      <name>Bhavya Kailkhura</name>
    </author>
    <author>
      <name>Harshitha Menon</name>
    </author>
    <author>
      <name>Zhuang Liu</name>
    </author>
    <author>
      <name>Brian R. Bartoldson</name>
    </author>
    <author>
      <name>Tom Goldstein</name>
    </author>
    <author>
      <name>Sanae Lotfi</name>
    </author>
    <author>
      <name>Micah Goldblum</name>
    </author>
    <author>
      <name>Pavel Izmailov</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09659v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09659v1" />
    <published>2026-06-08T15:43:16Z</published>
    <updated>2026-06-08T15:43:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ang Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model&amp;#x27;s context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.DB] ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset</title>
    <author>
      <name>Luciano Duarte</name>
    </author>
    <author>
      <name>Olga Ovcharenko</name>
    </author>
    <author>
      <name>Sebastian Schelter</name>
    </author>
    <category term="cs.DB" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09648v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09648v1" />
    <published>2026-06-08T15:40:57Z</published>
    <updated>2026-06-08T15:40:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Luciano Duarte et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets combining tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130209 records and show that reliably detecting subtle domain-specific errors such as material anachronisms and temporal shifts remains an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes</title>
    <author>
      <name>Yongzhong Xu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09607v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09607v1" />
    <published>2026-06-08T15:17:54Z</published>
    <updated>2026-06-08T15:17:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yongzhong Xu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages, 3 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Next-Token Prediction Learns Generalisable Representations of Sleep Physiology</title>
    <author>
      <name>Jonathan F. Carter</name>
    </author>
    <author>
      <name>Lionel Tarassenko</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09605v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09605v1" />
    <published>2026-06-08T15:13:57Z</published>
    <updated>2026-06-08T15:13:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jonathan F. Carter, Lionel Tarassenko&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using \(100\times\) less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text</title>
    <author>
      <name>Yutong Bian</name>
    </author>
    <author>
      <name>Dongjie Cheng</name>
    </author>
    <author>
      <name>Heming Xia</name>
    </author>
    <author>
      <name>Yongqi Li</name>
    </author>
    <author>
      <name>Wenjie Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09585v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09585v1" />
    <published>2026-06-08T14:58:59Z</published>
    <updated>2026-06-08T14:58:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yutong Bian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs</title>
    <author>
      <name>Momina Ahsan</name>
    </author>
    <author>
      <name>Sarfraz Ahmad</name>
    </author>
    <author>
      <name>Ming Shan Hee</name>
    </author>
    <author>
      <name>Roy Ka-Wei Lee</name>
    </author>
    <author>
      <name>Preslav Nakov</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09578v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09578v1" />
    <published>2026-06-08T14:52:46Z</published>
    <updated>2026-06-08T14:52:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Momina Ahsan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;24 pages, 18 tables, 16 figures, Submitted to ARR May 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control</title>
    <author>
      <name>Jiacheng Li</name>
    </author>
    <author>
      <name>Yize Guo</name>
    </author>
    <author>
      <name>Jiabin Guo</name>
    </author>
    <author>
      <name>Qingchen Liu</name>
    </author>
    <author>
      <name>Jiahu Qin</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09572v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09572v1" />
    <published>2026-06-08T14:46:43Z</published>
    <updated>2026-06-08T14:46:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiacheng Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech</title>
    <author>
      <name>Vadim Popov</name>
    </author>
    <author>
      <name>Wenju Gu</name>
    </author>
    <author>
      <name>Tasnima Sadekova</name>
    </author>
    <author>
      <name>Georgii Aparin</name>
    </author>
    <author>
      <name>Assel Yermekova</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09962v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09962v1" />
    <published>2026-06-08T14:41:24Z</published>
    <updated>2026-06-08T14:41:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vadim Popov et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has been steadily increasing as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the structure of the latent space corresponding to discrete tokens, expressed in terms of Kullback-Leibler divergence on diffusion path measures and the accuracy of correct-token prediction by the optimally trained diffusion model. We find that the FSQ tokenization scheme has a latent space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models that use speech tokens as intermediate acoustic features, and show that the one based on FSQ tokens indeed performs best and, moreover, outperforms its strong LLM-based counterpart while being significantly smaller and faster.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] 3SPO: State-Score-Supervised Policy Optimization for LLM Agents</title>
    <author>
      <name>Yu Han</name>
    </author>
    <author>
      <name>Kailing Li</name>
    </author>
    <author>
      <name>Yang Jiao</name>
    </author>
    <author>
      <name>Yulin Dai</name>
    </author>
    <author>
      <name>Yuqian Fu</name>
    </author>
    <author>
      <name>Linhai Zhuo</name>
    </author>
    <author>
      <name>Tianwen Qian</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09961v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09961v1" />
    <published>2026-06-08T14:26:05Z</published>
    <updated>2026-06-08T14:26:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yu Han et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4× more state exploration and 1.8× faster convergence. Code is available at https://github.com/genalyu/3SPO.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Emergence of Context Characteristics Sensitivity in Large Language Models</title>
    <author>
      <name>Nadya Yuki Wangsajaya</name>
    </author>
    <author>
      <name>Haeun Yu</name>
    </author>
    <author>
      <name>Isabelle Augenstein</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09525v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09525v1" />
    <published>2026-06-08T14:11:07Z</published>
    <updated>2026-06-08T14:11:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nadya Yuki Wangsajaya et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models&amp;#x27; sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with high length, context-query similarity, and fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs</title>
    <author>
      <name>Zhanchao Xu</name>
    </author>
    <author>
      <name>Haoyang Li</name>
    </author>
    <author>
      <name>Qingfa Xiao</name>
    </author>
    <author>
      <name>Fei Teng</name>
    </author>
    <author>
      <name>Chen Jason Zhang</name>
    </author>
    <author>
      <name>Lei Chen</name>
    </author>
    <author>
      <name>Qing Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09508v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09508v1" />
    <published>2026-06-08T14:02:18Z</published>
    <updated>2026-06-08T14:02:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhanchao Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture</title>
    <author>
      <name>Yoojin Nam</name>
    </author>
    <author>
      <name>Jinhoon Jeong</name>
    </author>
    <author>
      <name>Namkug Kim</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09500v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09500v2" />
    <published>2026-06-08T13:51:04Z</published>
    <updated>2026-06-09T14:24:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yoojin Nam et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills ; archived on Zenodo (concept DOI 10.5281/zenodo.20155321; v3.8.0 version DOI 10.5281/zenodo.20582972)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales</title>
    <author>
      <name>Aditya Kamlesh Parikh</name>
    </author>
    <author>
      <name>Cristian Tejedor-Garcia</name>
    </author>
    <author>
      <name>Catia Cucchiarini</name>
    </author>
    <author>
      <name>Helmer Strik</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09470v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09470v1" />
    <published>2026-06-08T13:27:40Z</published>
    <updated>2026-06-08T13:27:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Aditya Kamlesh Parikh et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence-level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning</title>
    <author>
      <name>Bojie Rong</name>
    </author>
    <author>
      <name>Zheyu Shen</name>
    </author>
    <author>
      <name>Qiaoping Wang</name>
    </author>
    <author>
      <name>Pengfei Kang</name>
    </author>
    <author>
      <name>Yang Xu</name>
    </author>
    <author>
      <name>Yawen Wei</name>
    </author>
    <author>
      <name>Hanyu Wu</name>
    </author>
    <author>
      <name>Zhi Zhao</name>
    </author>
    <author>
      <name>Leihao Pei</name>
    </author>
    <author>
      <name>Linquan Jiang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09447v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09447v1" />
    <published>2026-06-08T12:55:42Z</published>
    <updated>2026-06-08T12:55:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bojie Rong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models</title>
    <author>
      <name>Mingqi Yuan</name>
    </author>
    <author>
      <name>Xiaoquan Sun</name>
    </author>
    <author>
      <name>Shihao Luo</name>
    </author>
    <author>
      <name>Jiayu Chen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09430v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09430v1" />
    <published>2026-06-08T12:41:15Z</published>
    <updated>2026-06-08T12:41:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mingqi Yuan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces</title>
    <author>
      <name>Wanli Li</name>
    </author>
    <author>
      <name>Bowen Zhou</name>
    </author>
    <author>
      <name>Yunyao Yu</name>
    </author>
    <author>
      <name>Zhou Xu</name>
    </author>
    <author>
      <name>Yifan Yang</name>
    </author>
    <author>
      <name>Dongsheng Li</name>
    </author>
    <author>
      <name>Caihua Shan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09426v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09426v1" />
    <published>2026-06-08T12:39:23Z</published>
    <updated>2026-06-08T12:39:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wanli Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;38 pages, 7 figures, 12 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall</title>
    <author>
      <name>Gijs van Nieuwkoop</name>
    </author>
    <author>
      <name>Siamak Mehrkanoon</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09959v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09959v1" />
    <published>2026-06-08T12:28:28Z</published>
    <updated>2026-06-08T12:28:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Gijs van Nieuwkoop, Siamak Mehrkanoon&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on GitHub at https://github.com/gijsvn/TA-SmaAt-UNet.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;9 pages, 6 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings</title>
    <author>
      <name>Mina Remeli</name>
    </author>
    <author>
      <name>Moritz Hardt</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09409v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09409v1" />
    <published>2026-06-08T12:26:54Z</published>
    <updated>2026-06-08T12:26:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mina Remeli, Moritz Hardt&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML&amp;#x27;26&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour</title>
    <author>
      <name>Radeen Mostafa</name>
    </author>
    <author>
      <name>Sawradip Saha</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09399v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09399v1" />
    <published>2026-06-08T12:18:22Z</published>
    <updated>2026-06-08T12:18:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Radeen Mostafa, Sawradip Saha&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the &amp;quot;eye&amp;quot; precedes the &amp;quot;hand&amp;quot;. Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the &amp;quot;small arrow beside a large label&amp;quot; ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;31 pages, 8 figures, preprint/work in progress&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs</title>
    <author>
      <name>Haotong Yang</name>
    </author>
    <author>
      <name>Ting Long</name>
    </author>
    <author>
      <name>Yi Chang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09371v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09371v1" />
    <published>2026-06-08T11:48:55Z</published>
    <updated>2026-06-08T11:48:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haotong Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor misalignment and limiting LLM performance on tool-use tasks. In this paper, we propose a method called Capability-Aligned Hierarchical Learning (CAHL), which leverages RLVR to jointly optimize both policies, enabling better alignment between the high-level planner and the low-level executor. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages, 5 figures, 6 tables. Preprint&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments</title>
    <author>
      <name>Minghao Zou</name>
    </author>
    <author>
      <name>Qingtian Zeng</name>
    </author>
    <author>
      <name>Shangkun Liu</name>
    </author>
    <author>
      <name>Yanda Meng</name>
    </author>
    <author>
      <name>Guanghui Yue</name>
    </author>
    <author>
      <name>Baoquan Zhao</name>
    </author>
    <author>
      <name>Abdulmotaleb El Saddik</name>
    </author>
    <author>
      <name>Wei Zhou</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09368v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09368v1" />
    <published>2026-06-08T11:40:48Z</published>
    <updated>2026-06-08T11:40:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Minghao Zou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning</title>
    <author>
      <name>Maria De Marsico</name>
    </author>
    <author>
      <name>Anil K. Jain</name>
    </author>
    <author>
      <name>Annalaura Miglino</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09353v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09353v1" />
    <published>2026-06-08T11:27:11Z</published>
    <updated>2026-06-08T11:27:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Maria De Marsico et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal&amp;#x27;s face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;This paper extends the work published in the proceedings of CAIP 2025 conference: &amp;#x27;Adapting to the Wild: From Human Face to Animal Face Recognition&amp;#x27; by De Marsico, M., Jain, A. K., Miranda, M., &amp;amp; Orlando, A&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders</title>
    <author>
      <name>Wei Pang</name>
    </author>
    <author>
      <name>Xiangru Jian</name>
    </author>
    <author>
      <name>Hehan Li</name>
    </author>
    <author>
      <name>Zhixuan Yu</name>
    </author>
    <author>
      <name>Alex Xue</name>
    </author>
    <author>
      <name>Jinyang Li</name>
    </author>
    <author>
      <name>Zhengyuan Dong</name>
    </author>
    <author>
      <name>Xinjian Zhao</name>
    </author>
    <author>
      <name>Hao Xu</name>
    </author>
    <author>
      <name>Chao Zhang</name>
    </author>
    <author>
      <name>Reynold Cheng</name>
    </author>
    <author>
      <name>M. Tamer Özsu</name>
    </author>
    <author>
      <name>Tianshu Yu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09323v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09323v1" />
    <published>2026-06-08T10:39:25Z</published>
    <updated>2026-06-08T10:39:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wei Pang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents</title>
    <author>
      <name>Qianjun Pan</name>
    </author>
    <author>
      <name>Yutao Yang</name>
    </author>
    <author>
      <name>Junsong Li</name>
    </author>
    <author>
      <name>Jie Zhou</name>
    </author>
    <author>
      <name>Kai Chen</name>
    </author>
    <author>
      <name>Xin Li</name>
    </author>
    <author>
      <name>Qin Chen</name>
    </author>
    <author>
      <name>Liang He</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09316v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09316v2" />
    <published>2026-06-08T10:24:32Z</published>
    <updated>2026-06-09T09:39:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qianjun Pan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation</title>
    <author>
      <name>Rafael Cabral</name>
    </author>
    <author>
      <name>Pang Zixi</name>
    </author>
    <author>
      <name>Ziyi Shou</name>
    </author>
    <author>
      <name>Shen Xin</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09278v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09278v1" />
    <published>2026-06-08T09:44:31Z</published>
    <updated>2026-06-08T09:44:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rafael Cabral et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation</title>
    <author>
      <name>Al Zadid Sultan Bin Habib</name>
    </author>
    <author>
      <name>Md Younus Ahamed</name>
    </author>
    <author>
      <name>Prashnna Gyawali</name>
    </author>
    <author>
      <name>Gianfranco Doretto</name>
    </author>
    <author>
      <name>Donald A. Adjeroh</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09257v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09257v1" />
    <published>2026-06-08T09:30:34Z</published>
    <updated>2026-06-08T09:30:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Al Zadid Sultan Bin Habib et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Published as a paper at the 2nd DeLTa Workshop, ICLR 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video</title>
    <author>
      <name>Yuan Zeng</name>
    </author>
    <author>
      <name>Yujia Shi</name>
    </author>
    <author>
      <name>Tiao Tan</name>
    </author>
    <author>
      <name>Xingting Li</name>
    </author>
    <author>
      <name>Yaqi Qin</name>
    </author>
    <author>
      <name>Zongqing Lu</name>
    </author>
    <author>
      <name>Wenming Yang</name>
    </author>
    <author>
      <name>Jing-Hao Xue</name>
    </author>
    <author>
      <name>Qingmin Liao</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09243v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09243v1" />
    <published>2026-06-08T09:20:03Z</published>
    <updated>2026-06-08T09:20:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuan Zeng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML2026 spotlight&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation</title>
    <author>
      <name>Yongrui Liu</name>
    </author>
    <author>
      <name>Deyi Xiong</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09198v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09198v1" />
    <published>2026-06-08T08:32:21Z</published>
    <updated>2026-06-08T08:32:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yongrui Liu, Deyi Xiong&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through internet and local knowledge bases, often resulting in research that lacks insight and creativity in social science. To address this issue, we propose &amp;quot;Memory-Augmented Social Simulation (MASS)&amp;quot;, an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical grounding of LLM-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm restraint to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81% improvement in overall generation quality over foundation LLMs and a 17.19% gain in Insight over strong baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation</title>
    <author>
      <name>Lingyi Meng</name>
    </author>
    <author>
      <name>Zecong Tang</name>
    </author>
    <author>
      <name>Haoran Li</name>
    </author>
    <author>
      <name>Tengju Ru</name>
    </author>
    <author>
      <name>Zhejun Cui</name>
    </author>
    <author>
      <name>Weitong Lian</name>
    </author>
    <author>
      <name>Qi Kang</name>
    </author>
    <author>
      <name>Hangshuo Cao</name>
    </author>
    <author>
      <name>Yichen Zhu</name>
    </author>
    <author>
      <name>Yechi Liu</name>
    </author>
    <author>
      <name>Kaixuan Wang</name>
    </author>
    <author>
      <name>Yu-Jie Yuan</name>
    </author>
    <author>
      <name>Chunwei Wang</name>
    </author>
    <author>
      <name>Yu Zhang</name>
    </author>
    <author>
      <name>Bo Dai</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09169v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09169v1" />
    <published>2026-06-08T08:08:20Z</published>
    <updated>2026-06-08T08:08:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lingyi Meng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Unified Energy for Invariant and Independent Decoding in Diffusion Language Models</title>
    <author>
      <name>Yuchen Yan</name>
    </author>
    <author>
      <name>Minkai Xu</name>
    </author>
    <author>
      <name>Zaiquan Yang</name>
    </author>
    <author>
      <name>Yatao Bian</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09159v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09159v1" />
    <published>2026-06-08T07:50:12Z</published>
    <updated>2026-06-08T07:50:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuchen Yan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. In addition, Uni-E is model-agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across DLMs and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance</title>
    <author>
      <name>Hanna Abi Akl</name>
    </author>
    <author>
      <name>Fabien Gandon</name>
    </author>
    <author>
      <name>Catherine Faron</name>
    </author>
    <author>
      <name>Pierre Monnin</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09157v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09157v1" />
    <published>2026-06-08T07:47:57Z</published>
    <updated>2026-06-08T07:47:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hanna Abi Akl et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to SemEval-2026 co-located with ACL 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models</title>
    <author>
      <name>Danya Li</name>
    </author>
    <author>
      <name>Xiang Su</name>
    </author>
    <author>
      <name>Yan Feng</name>
    </author>
    <author>
      <name>Rico Krueger</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09142v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09142v1" />
    <published>2026-06-08T07:39:42Z</published>
    <updated>2026-06-08T07:39:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Danya Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians&amp;#x27; intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Steganography Without Modification: Hidden Communication via LLM Seeds</title>
    <author>
      <name>Felix Mächtle</name>
    </author>
    <author>
      <name>Jonas Sander</name>
    </author>
    <author>
      <name>Sebastian Berndt</name>
    </author>
    <author>
      <name>Ben Weimar</name>
    </author>
    <author>
      <name>Nils Loose</name>
    </author>
    <author>
      <name>Thomas Eisenbarth</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09135v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09135v1" />
    <published>2026-06-08T07:32:44Z</published>
    <updated>2026-06-08T07:32:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Felix Mächtle et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modification to model weights, sampling code, or output distributions. The channel exploits a structural property of deterministic decoding: pseudo-random number generators (PRNGs) used in inverse-transform sampling produce a seed-dependent sequence of token-level probability intervals that can be reconstructed from the generated text alone. A sender encodes a secret message in the PRNG seed before generation; a receiver reconstructs the intervals and recovers the seed, and thus the hidden payload, by exhaustive search over the seed space.
  We formalize two operational modes. In the known-prompt setting, sender and receiver share the prompt, enabling exact interval reconstruction and perfect seed recovery via forced alignment. In the unknown-prompt setting, only the generated text is available; approximate interval reconstruction combined with a maximum-hit-count scoring strategy still permits reliable recovery from sufficiently long outputs.
  Extensive experiments across six model families and five heterogeneous text domains show that, in the known-prompt setting, full 32-bit seed recovery from the complete 2^32 candidate space achieves up to 100% accuracy, depending on model and text domain, within 300 tokens and under 35 seconds on a single GPU. In the unknown-prompt setting, recovery reaches near-perfect accuracy at 600-800 tokens in about 12 seconds. We further analyze the influence of prompting strategies, tokenization ambiguities, and sampling hyperparameters on channel reliability. Moreover, we discuss several applications of our results: First, it allows for the steganographic transmission of 32 bits, but also shows that ignorance of the prompt is not a valid security assumption.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;To appear in the Proceedings of the International Conference on Availability, Reliability and Security (ARES 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs</title>
    <author>
      <name>Jiangtao Shuai</name>
    </author>
    <author>
      <name>Zongxiong Chen</name>
    </author>
    <author>
      <name>Manfred Hauswirth</name>
    </author>
    <author>
      <name>Sonja Schimmler</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09134v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09134v1" />
    <published>2026-06-08T07:32:06Z</published>
    <updated>2026-06-08T07:32:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiangtao Shuai et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Vision Language Model Helps Private Information De-Identification in Vision Data</title>
    <author>
      <name>Tiejin Chen</name>
    </author>
    <author>
      <name>Pingzhi Li</name>
    </author>
    <author>
      <name>Kaixiong Zhou</name>
    </author>
    <author>
      <name>Tianlong Chen</name>
    </author>
    <author>
      <name>Hua Wei</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09132v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09132v1" />
    <published>2026-06-08T07:30:20Z</published>
    <updated>2026-06-08T07:30:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tiejin Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Visual Language Models (VLMs) have gained significant popularity due to their remarkable ability. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To address this issue, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset OPTIC (Optical Privacy Text Instruction Collection) and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code can be found here.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges</title>
    <author>
      <name>Tiejin Chen</name>
    </author>
    <author>
      <name>Pingzhi Li</name>
    </author>
    <author>
      <name>Kaixiong Zhou</name>
    </author>
    <author>
      <name>Tianlong Chen</name>
    </author>
    <author>
      <name>Hua Wei</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09125v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09125v1" />
    <published>2026-06-08T07:19:42Z</published>
    <updated>2026-06-08T07:19:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tiejin Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks, (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks, and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Our dataset and code can be found here.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification</title>
    <author>
      <name>Xian Li</name>
    </author>
    <author>
      <name>Yanfeng Gu</name>
    </author>
    <author>
      <name>Aleksandra Pižurica</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09123v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09123v1" />
    <published>2026-06-08T07:18:38Z</published>
    <updated>2026-06-08T07:18:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xian Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multispectral point cloud (MPC) is composed of 3D spatial-spectral information, which holds tremendous potential for accurate land-cover classification. However, the representation power of classification models is limited by inherent high-dimensional and heterogeneous spatial-spectral information, unbalanced sample distribution, and inter-class spectral similarity of airborne MPCs. We build two MPC datasets and propose an enhanced geometric-spectral feature learning framework based on attentions for airborne MPC classification. A key component in our model is a two-stream feature fusion method with attention mechanisms, which enhances the representation capability of spatial-spectral features from high-dimensional heterogeneous MPCs. The first stream aims to extract position-encoded global spectral features with fusion self-attention, and the second stream comprises a multikernel point convolution and feature aggregation attention to extract spectral-guided geometric features. We then develop a residual attention fusion block to integrate the most informative geometric-spectral features from the two parallel streams. Another important contribution of this work is a joint loss function to improve the learning ability on unbalanced and interclass similar samples. Experimental results on two airborne MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods. Furthermore, the codes and datasets used in this paper will be made available freely at https://github.com/HITlixian/TGRS_GSFF.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] ComplexConstraints and Beyond: Expert Rubrics for RLVR</title>
    <author>
      <name>Sushant Mehta</name>
    </author>
    <author>
      <name>Liudas Panavas</name>
    </author>
    <author>
      <name>Edwin Chen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09118v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09118v1" />
    <published>2026-06-08T07:11:56Z</published>
    <updated>2026-06-08T07:11:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sushant Mehta et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling</title>
    <author>
      <name>Jie Zhao</name>
    </author>
    <author>
      <name>Xianqi Dai</name>
    </author>
    <author>
      <name>Jie Feng</name>
    </author>
    <author>
      <name>Huandong Wang</name>
    </author>
    <author>
      <name>Yong Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09086v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09086v1" />
    <published>2026-06-08T06:32:49Z</published>
    <updated>2026-06-08T06:32:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jie Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by IJCAI2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps</title>
    <author>
      <name>Xiaofeng Lin</name>
    </author>
    <author>
      <name>Yukai Yang</name>
    </author>
    <author>
      <name>Daniel Guo</name>
    </author>
    <author>
      <name>Sahil Arun Nale</name>
    </author>
    <author>
      <name>Charles Fleming</name>
    </author>
    <author>
      <name>Guang Cheng</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09084v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09084v1" />
    <published>2026-06-08T06:29:42Z</published>
    <updated>2026-06-08T06:29:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xiaofeng Lin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including &amp;quot;multi-turn&amp;quot; jailbreaks such as Crescendo and Tree of Attacks, still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize a deployment failure mode for tool-using LLM agents, the provenance gap, and study reproducible triggers for it: Context-Fractured Decomposition (CFD), a family of cross-context multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage, via individually innocuous tool actions whose risk emerges only under delayed artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention</title>
    <author>
      <name>Yan Wang</name>
    </author>
    <author>
      <name>Qifan Zhang</name>
    </author>
    <author>
      <name>Jiachen Yu</name>
    </author>
    <author>
      <name>Tian Liang</name>
    </author>
    <author>
      <name>Dongyang Ma</name>
    </author>
    <author>
      <name>Xiang Hu</name>
    </author>
    <author>
      <name>Zibo Lin</name>
    </author>
    <author>
      <name>Chunyang Li</name>
    </author>
    <author>
      <name>Zhichao Wang</name>
    </author>
    <author>
      <name>Miao Peng</name>
    </author>
    <author>
      <name>Nuo Chen</name>
    </author>
    <author>
      <name>Jia Li</name>
    </author>
    <author>
      <name>Yujiu Yang</name>
    </author>
    <author>
      <name>Haitao Mi</name>
    </author>
    <author>
      <name>Dong Yu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09079v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09079v2" />
    <published>2026-06-08T06:25:54Z</published>
    <updated>2026-06-09T17:04:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yan Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory.
  We demonstrate that this &amp;quot;less is more&amp;quot; paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone&amp;#x27;s core reasoning capacities.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] INFUSER: Influence-Guided Self-Evolution Improves Reasoning</title>
    <author>
      <name>Siyu Chen</name>
    </author>
    <author>
      <name>Miao Lu</name>
    </author>
    <author>
      <name>Beining Wu</name>
    </author>
    <author>
      <name>Heejune Sheen</name>
    </author>
    <author>
      <name>Fengzhuo Zhang</name>
    </author>
    <author>
      <name>Shuangning Li</name>
    </author>
    <author>
      <name>Zhiyuan Li</name>
    </author>
    <author>
      <name>Jose Blanchet</name>
    </author>
    <author>
      <name>Tianhao Wang</name>
    </author>
    <author>
      <name>Zhuoran Yang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09052v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09052v1" />
    <published>2026-06-08T05:40:36Z</published>
    <updated>2026-06-08T05:40:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Siyu Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;66 pages, 17 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] BareWave: Waveform-Native Flow-Matching Text-to-Speech</title>
    <author>
      <name>Wei Fan</name>
    </author>
    <author>
      <name>Chao-Hong Tan</name>
    </author>
    <author>
      <name>Qian Chen</name>
    </author>
    <author>
      <name>Wen Wang</name>
    </author>
    <author>
      <name>Xiangang Li</name>
    </author>
    <author>
      <name>Kejiang Chen</name>
    </author>
    <author>
      <name>Weiming Zhang</name>
    </author>
    <author>
      <name>Nenghai Yu</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09048v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09048v1" />
    <published>2026-06-08T05:36:42Z</published>
    <updated>2026-06-08T05:36:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wei Fan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Under Review&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CY] Culturally-Aware AI for Cross-Boundary Community Learning: Undergraduate Innovation at the Intersection of Computation and Design</title>
    <author>
      <name>Jiaojiao Zhao</name>
    </author>
    <author>
      <name>Weisheng Zhang</name>
    </author>
    <author>
      <name>Jiawen Cai</name>
    </author>
    <author>
      <name>Haibin Gao</name>
    </author>
    <author>
      <name>Luyao Zhang</name>
    </author>
    <category term="cs.CY" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09041v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09041v1" />
    <published>2026-06-08T05:14:01Z</published>
    <updated>2026-06-08T05:14:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaojiao Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Research on artificial intelligence in education (AIED) is rapidly expanding, yet technical progress often lacks human-centered grounding and adequate attention to cultural context. Community-Based Learning, a pedagogy rooted in social work, remains underrepresented in AIED research, particularly within Asia-Pacific contexts. This paper reports on cross-boundary Community-Based Learning where undergraduate students develop AI-enabled solutions for cultural heritage preservation and sustainable development. We examine how community-engaged computing operationalizes human-centered AIED across three dimensions: education, technology, and culture. We contribute a collaborative framework for culturally-aware AIED that fosters multi-stakeholder collaboration while widening participation by dissolving disciplinary silos between social work and computational science.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs</title>
    <author>
      <name>Hyeongwon Jang</name>
    </author>
    <author>
      <name>Gyouk Chu</name>
    </author>
    <author>
      <name>Changhun Kim</name>
    </author>
    <author>
      <name>Joonhyung Park</name>
    </author>
    <author>
      <name>Hangyul Yoon</name>
    </author>
    <author>
      <name>Eunho Yang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09030v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09030v1" />
    <published>2026-06-08T04:53:44Z</published>
    <updated>2026-06-08T04:53:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hyeongwon Jang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code is available at https://github.com/HyeongWon-Jang/TRIAGE&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models</title>
    <author>
      <name>Jiaheng Chen</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09028v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09028v1" />
    <published>2026-06-08T04:53:10Z</published>
    <updated>2026-06-08T04:53:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaheng Chen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages, 3 figures, 6 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] SafeRun: Enabling Determinism in LLM Planning for Running</title>
    <author>
      <name>Meilin Chen</name>
    </author>
    <author>
      <name>Zepeng Zhai</name>
    </author>
    <author>
      <name>Jiaxuan Zhao</name>
    </author>
    <author>
      <name>Yuan Lu</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09027v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09027v1" />
    <published>2026-06-08T04:53:01Z</published>
    <updated>2026-06-08T04:53:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Meilin Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves 100% safety score (vs. 79.1% PE average and 97.6% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available on Hugging Face at https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Workshop on Planning in the Era of LLMs (LM4Plan) at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning</title>
    <author>
      <name>Yucheng Deng</name>
    </author>
    <author>
      <name>Pingrui Lai</name>
    </author>
    <author>
      <name>Xinhai Li</name>
    </author>
    <author>
      <name>Chenjia Bai</name>
    </author>
    <author>
      <name>Xiaoheng Deng</name>
    </author>
    <author>
      <name>Chengnuo Sun</name>
    </author>
    <author>
      <name>Xuelong Li</name>
    </author>
    <author>
      <name>Hua Yang</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08992v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08992v1" />
    <published>2026-06-08T03:42:08Z</published>
    <updated>2026-06-08T03:42:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yucheng Deng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages, 9 figures, 7 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models</title>
    <author>
      <name>Jing Wang</name>
    </author>
    <author>
      <name>Shang Liu</name>
    </author>
    <author>
      <name>Wenji Fang</name>
    </author>
    <author>
      <name>Yuchao Wu</name>
    </author>
    <author>
      <name>Yugao Zhu</name>
    </author>
    <author>
      <name>Zhiyao Xie</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08976v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08976v1" />
    <published>2026-06-08T03:21:33Z</published>
    <updated>2026-06-08T03:21:33Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jing Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models&amp;#x27; performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] CARE: A Conformal Safety Layer for Medical Summarization</title>
    <author>
      <name>Suhana Bedi</name>
    </author>
    <author>
      <name>Bridget Lin</name>
    </author>
    <author>
      <name>Anson Y. Zhou</name>
    </author>
    <author>
      <name>Chloe O. Stanwyck</name>
    </author>
    <author>
      <name>Jenelle A. Jindal</name>
    </author>
    <author>
      <name>Sanmi Koyejo</name>
    </author>
    <author>
      <name>David Stutz</name>
    </author>
    <author>
      <name>Nigam H. Shah</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08969v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08969v1" />
    <published>2026-06-08T03:10:45Z</published>
    <updated>2026-06-08T03:10:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Suhana Bedi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(τ,γ)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $α= 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;29 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops</title>
    <author>
      <name>Ziqian Zhong</name>
    </author>
    <author>
      <name>Ivgeni Segal</name>
    </author>
    <author>
      <name>Ivan Bercovich</name>
    </author>
    <author>
      <name>Shashwat Saxena</name>
    </author>
    <author>
      <name>Kexun Zhang</name>
    </author>
    <author>
      <name>Aditi Raghunathan</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08960v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08960v1" />
    <published>2026-06-08T03:00:56Z</published>
    <updated>2026-06-08T03:00:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ziqian Zhong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive.
  We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers.
  On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash&amp;#x27;s loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7&amp;#x27;s attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro&amp;#x27;s from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models</title>
    <author>
      <name>Shouwei Ruan</name>
    </author>
    <author>
      <name>Bin Wang</name>
    </author>
    <author>
      <name>Zhenyu Wu</name>
    </author>
    <author>
      <name>Qihui Zhu</name>
    </author>
    <author>
      <name>Yuxiang Zhang</name>
    </author>
    <author>
      <name>Jingzhi Li</name>
    </author>
    <author>
      <name>Yubin Wang</name>
    </author>
    <author>
      <name>Xingxing Wei</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08952v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08952v1" />
    <published>2026-06-08T02:55:05Z</published>
    <updated>2026-06-08T02:55:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shouwei Ruan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection</title>
    <author>
      <name>Kang Zhang</name>
    </author>
    <author>
      <name>Wei Jian Lau</name>
    </author>
    <author>
      <name>Shoushou Ren</name>
    </author>
    <author>
      <name>Dong Lin</name>
    </author>
    <author>
      <name>Joon Son Chung</name>
    </author>
    <author>
      <name>Chuanhao Sun</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08935v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08935v1" />
    <published>2026-06-08T02:24:40Z</published>
    <updated>2026-06-08T02:24:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kang Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, our evaluation reveals a major limitation: their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude-related anomalies, and this failure is prevalent across all existing representation-based methods. To address this issue, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules: a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then, in the final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score, which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Additional evaluations with bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation further support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, which has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;15 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing</title>
    <author>
      <name>Jian Chen</name>
    </author>
    <author>
      <name>Siyuan Li</name>
    </author>
    <author>
      <name>Chucheng Wan</name>
    </author>
    <author>
      <name>Zixuan Yuan</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08932v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08932v1" />
    <published>2026-06-08T02:17:53Z</published>
    <updated>2026-06-08T02:17:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jian Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human</title>
    <author>
      <name>Emre Turan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08919v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08919v1" />
    <published>2026-06-08T01:52:22Z</published>
    <updated>2026-06-08T01:52:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Emre Turan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of &amp;quot;risky,&amp;quot; and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss&amp;#x27; kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard&amp;#x27;s escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning &amp;quot;is my guard good?&amp;quot; from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 4 figures. Code and interactive demo: https://github.com/turangenesis/headroom&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Order Matters: Unveiling the Hidden Impact of Macro Placement Sequences via Proxy-Guided LLM Evolution</title>
    <author>
      <name>Shibing Mo</name>
    </author>
    <author>
      <name>Jing Liu</name>
    </author>
    <author>
      <name>Jianchu Xu</name>
    </author>
    <author>
      <name>Ruilin Wu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08904v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08904v1" />
    <published>2026-06-08T01:10:07Z</published>
    <updated>2026-06-08T01:10:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shibing Mo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Macro placement is a fundamental step in modern chip physical design, playing a crucial role in determining the solution quality of high-dimensional combinatorial optimization problems. Despite recent advancements in machine learning for spatial coordinate determination, the temporal dimension of placement sequencing remains largely governed by static heuristics. In this work, we demonstrate that the placement sequence is not merely a preprocessing step but a decisive factor in optimization, where suboptimal early decisions trigger irreversible domino effects that constrain the solution space. To harness this unexplored dimension, we propose OrderPlace, a proxy-guided LLM evolution framework for automatically discovering macro placement order strategies. Instead of relying on manually crafted heuristics such as area- or connectivity-based ordering, OrderPlace explores a broader space of code-level policies, ranging from static scoring metrics to dynamic physics-inspired mechanisms. To mitigate the prohibitive cost of evaluating sequences, we introduce a lightweight proxy evaluation mechanism that efficiently filters candidates using a deterministic greedy probe. Experimental results on the standard ISPD 2005 benchmarks demonstrate that OrderPlace discovers novel ordering strategies. Compared with WireMask-EA and the state-of-the-art method EGPlace, OrderPlace reduces wirelength by 34.04% and 14.08%, respectively.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training</title>
    <author>
      <name>Yanxiong Li</name>
    </author>
    <author>
      <name>Guoqing Chen</name>
    </author>
    <author>
      <name>Qianqian Li</name>
    </author>
    <author>
      <name>Sen Huang</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08898v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08898v1" />
    <published>2026-06-08T00:50:39Z</published>
    <updated>2026-06-08T00:50:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yanxiong Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase, without considering the possibility of a decrease. In practice, however, the number of classes may either increase or decrease. In this paper, we investigate the problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes can increase or decrease. We propose an FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure changes dynamically as the set of classes changes. In addition, we design a pseudo class-variable training strategy to enhance the model&amp;#x27;s adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;This paper has been accepted for publication in Interspeech 2026. 4 Tables and 4 Figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting</title>
    <author>
      <name>Qianyang Li</name>
    </author>
    <author>
      <name>Xingjun Zhang</name>
    </author>
    <author>
      <name>Shaoxun Wang</name>
    </author>
    <author>
      <name>Tao Peng</name>
    </author>
    <author>
      <name>Jia Wei</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08896v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08896v1" />
    <published>2026-06-08T00:47:58Z</published>
    <updated>2026-06-08T00:47:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qianyang Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Interactions Between Crosscoder Features: A Compact Proofs Perspective</title>
    <author>
      <name>Dmitry Manning-Coe</name>
    </author>
    <author>
      <name>Thomas Read</name>
    </author>
    <author>
      <name>Anna Soligo</name>
    </author>
    <author>
      <name>Oliver Clive-Griffin</name>
    </author>
    <author>
      <name>Chun-Hei Yip</name>
    </author>
    <author>
      <name>Rajashree Agrawal</name>
    </author>
    <author>
      <name>Jason Gross</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09940v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09940v1" />
    <published>2026-06-08T00:15:44Z</published>
    <updated>2026-06-08T00:15:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dmitry Manning-Coe et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, in principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve &amp;quot;computationally sparse&amp;quot; crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at the NeurIPS 2025 Workshop on Mechanistic Interpretability&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference</title>
    <author>
      <name>Anirudh Sekar</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09937v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09937v1" />
    <published>2026-06-07T21:36:20Z</published>
    <updated>2026-06-07T21:36:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Anirudh Sekar&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation</title>
    <author>
      <name>Ahmed Abdelmoneim Mazrou</name>
    </author>
    <author>
      <name>Haidy Maher El-Amir</name>
    </author>
    <author>
      <name>Ali Hamdi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08847v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08847v1" />
    <published>2026-06-07T21:32:50Z</published>
    <updated>2026-06-07T21:32:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ahmed Abdelmoneim Mazrou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT&amp;#x27;s attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] ZIPP: Zero-shot Image Personalization from Personas</title>
    <author>
      <name>Harini SI</name>
    </author>
    <author>
      <name>Somesh Singh</name>
    </author>
    <author>
      <name>Yaman Kumar Singla</name>
    </author>
    <author>
      <name>David Doermann</name>
    </author>
    <author>
      <name>Rajiv Ratn Shah</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08841v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08841v1" />
    <published>2026-06-07T21:11:47Z</published>
    <updated>2026-06-07T21:11:47Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Harini SI et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user&amp;#x27;s identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models</title>
    <author>
      <name>Ting Wang</name>
    </author>
    <author>
      <name>Yuanjie Shi</name>
    </author>
    <author>
      <name>Yan Yan</name>
    </author>
    <author>
      <name>Huan Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08831v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08831v1" />
    <published>2026-06-07T20:52:10Z</published>
    <updated>2026-06-07T20:52:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ting Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP&amp;#x27;s flexibility and its post-hoc limitation, we propose an &lt;em&gt;Inference-Time Conformal Reasoning (ITCR)&lt;/em&gt; framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning</title>
    <author>
      <name>Sumin Park</name>
    </author>
    <author>
      <name>Noseong Park</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08814v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08814v1" />
    <published>2026-06-07T20:07:24Z</published>
    <updated>2026-06-07T20:07:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sumin Park, Noseong Park&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable routing. We propose STAR, a Structure Aware Routing method that rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on a controlled synthetic setup and large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Q-Delta: Beyond Key-Value Associative State Evolution</title>
    <author>
      <name>Sumin Park</name>
    </author>
    <author>
      <name>Seojin Kim</name>
    </author>
    <author>
      <name>Noseong Park</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08804v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08804v1" />
    <published>2026-06-07T19:49:01Z</published>
    <updated>2026-06-07T19:49:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sumin Park et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution</title>
    <author>
      <name>Varun Khurana</name>
    </author>
    <author>
      <name>Vijval Ekbote</name>
    </author>
    <author>
      <name>Vashu Chauhan</name>
    </author>
    <author>
      <name>Yaman Kumar Singla</name>
    </author>
    <author>
      <name>Rajiv Ratn Shah</name>
    </author>
    <author>
      <name>Balaji Krishnamurthy</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08800v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08800v1" />
    <published>2026-06-07T19:39:03Z</published>
    <updated>2026-06-07T19:39:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Varun Khurana et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning models cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as &amp;#x27;maintain professional tone&amp;#x27; into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition</title>
    <author>
      <name>Stéphane Eilles-Chan Way</name>
    </author>
    <author>
      <name>Hugo Percot</name>
    </author>
    <author>
      <name>Quentin Cappart</name>
    </author>
    <author>
      <name>Tias Guns</name>
    </author>
    <author>
      <name>Louis-Martin Rousseau</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08797v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08797v1" />
    <published>2026-06-07T19:27:59Z</published>
    <updated>2026-06-07T19:27:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Stéphane Eilles-Chan Way et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines</title>
    <author>
      <name>Jafar Isbarov</name>
    </author>
    <author>
      <name>Umid Suleymanov</name>
    </author>
    <author>
      <name>Ilia Shumailov</name>
    </author>
    <author>
      <name>Murat Kantarcioglu</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09935v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09935v1" />
    <published>2026-06-07T19:17:18Z</published>
    <updated>2026-06-07T19:17:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jafar Isbarov et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously review pull requests (PRs), triage issues, and maintain codebases. These agents ingest untrusted content while operating with elevated repository permissions, making them a natural target for prompt injection attacks with supply chain consequences. We present GitInject, an open-source framework for evaluating prompt injection vulnerabilities in real, live GitHub workflows, a widely deployed instance of CI/CD pipelines. Unlike prior agent security benchmarks that simulate tool calls, GitInject provisions ephemeral repositories and triggers actual workflow runs, so that sandbox constraints, credential handling, and permission boundaries behave exactly as in production. Using GitInject, we study workflow configurations across four AI providers and document eleven named attacks spanning config-file injection, credential exfiltration, judgment manipulation, and availability. We find that all tested providers are susceptible to at least one attack class in their default configuration, and that the most critical vulnerabilities are structural: they arise from how CI/CD infrastructure handles credentials and configuration files, not from any specific model&amp;#x27;s behavior. For each confirmed attack class, we identify the minimum-cost workflow-level countermeasure and analyze its coverage and limitations. GitInject is released publicly to facilitate further research in this direction.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery</title>
    <author>
      <name>Syed Rifat Raiyan</name>
    </author>
    <author>
      <name>Mohsinul Kabir</name>
    </author>
    <author>
      <name>Hasan Mahmud</name>
    </author>
    <author>
      <name>Md Kamrul Hasan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08728v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08728v1" />
    <published>2026-06-07T16:50:07Z</published>
    <updated>2026-06-07T16:50:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Syed Rifat Raiyan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field&amp;#x27;s evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@k. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Under review, 47 pages, 14 figures, 22 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks</title>
    <author>
      <name>Shi Ying Chang</name>
    </author>
    <author>
      <name>Chiok Yew Ho</name>
    </author>
    <author>
      <name>Yichen Li</name>
    </author>
    <author>
      <name>Yintong Huo</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08676v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08676v1" />
    <published>2026-06-07T15:24:30Z</published>
    <updated>2026-06-07T15:24:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shi Ying Chang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now integrated directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. While developers require tools that directly complete or infill code in unfinished programs during Flow mode, they also need tools that can comprehend intentions expressed as natural-language instructions and convert them into executable code in Command mode. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios due to their abilities to infer and fulfill developers&amp;#x27; intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. Therefore, it is necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are more capable of following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;25 pages, 6 figures. Evaluation toolkit and dataset: https://github.com/arkosioscambions/CodeTalkers&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning</title>
    <author>
      <name>Haihao Lin</name>
    </author>
    <author>
      <name>Xiangsheng Huang</name>
    </author>
    <author>
      <name>Xiao Yang</name>
    </author>
    <author>
      <name>Weibang Zhou</name>
    </author>
    <author>
      <name>Yiqi Zhang</name>
    </author>
    <author>
      <name>Bo Yang</name>
    </author>
    <author>
      <name>Simin Zeng</name>
    </author>
    <author>
      <name>Jiawei Yang</name>
    </author>
    <author>
      <name>Zhengyang Wang</name>
    </author>
    <author>
      <name>Jiahui Du</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08653v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08653v1" />
    <published>2026-06-07T14:41:22Z</published>
    <updated>2026-06-07T14:41:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haihao Lin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://fibertune.github.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Reinforcement Learning for Flow-Matching Policies with Density Transport</title>
    <author>
      <name>Boshu Lei</name>
    </author>
    <author>
      <name>Kostas Daniilidis</name>
    </author>
    <author>
      <name>Antonio Loquercio</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08602v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08602v1" />
    <published>2026-06-07T12:28:51Z</published>
    <updated>2026-06-07T12:28:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Boshu Lei et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name &lt;em&gt;RLDT&lt;/em&gt;, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it finetunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration</title>
    <author>
      <name>Beiwen Zhang</name>
    </author>
    <author>
      <name>Yongheng Liang</name>
    </author>
    <author>
      <name>Guowei Zou</name>
    </author>
    <author>
      <name>Haitao Wang</name>
    </author>
    <author>
      <name>Hejun Wu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08596v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08596v1" />
    <published>2026-06-07T12:20:32Z</published>
    <updated>2026-06-07T12:20:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Beiwen Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models</title>
    <author>
      <name>Subramanyam Sahoo</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08571v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08571v1" />
    <published>2026-06-07T11:01:13Z</published>
    <updated>2026-06-07T11:01:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Subramanyam Sahoo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation -- demonstrating that explicit epistemic structuring is a learnable and measurable capability.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation</title>
    <author>
      <name>Yuan Zhang</name>
    </author>
    <author>
      <name>Shiqi Zhang</name>
    </author>
    <author>
      <name>Yedong Shen</name>
    </author>
    <author>
      <name>Shuai Dong</name>
    </author>
    <author>
      <name>Jiajun Deng</name>
    </author>
    <author>
      <name>Xin Zhang</name>
    </author>
    <author>
      <name>Yuxuan Gao</name>
    </author>
    <author>
      <name>Jiajia Wu</name>
    </author>
    <author>
      <name>Xin Nie</name>
    </author>
    <author>
      <name>Zhiyuan Cheng</name>
    </author>
    <author>
      <name>Jianmin Ji</name>
    </author>
    <author>
      <name>Yanyong Zhang</name>
    </author>
    <author>
      <name>Xingyi Zhang</name>
    </author>
    <author>
      <name>Jia Pan</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08530v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08530v1" />
    <published>2026-06-07T09:23:16Z</published>
    <updated>2026-06-07T09:23:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuan Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation</title>
    <author>
      <name>Michael Chin</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09923v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09923v1" />
    <published>2026-06-07T08:52:38Z</published>
    <updated>2026-06-07T08:52:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Michael Chin&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications -- such as thermal management of electronic components and battery systems -- requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies</title>
    <author>
      <name>Bingjia Huang</name>
    </author>
    <author>
      <name>Xiangyu Li</name>
    </author>
    <author>
      <name>Xiang Wang</name>
    </author>
    <author>
      <name>Liang Mi</name>
    </author>
    <author>
      <name>Zixu Hao</name>
    </author>
    <author>
      <name>Weijun Wang</name>
    </author>
    <author>
      <name>Hao Wu</name>
    </author>
    <author>
      <name>Kun Li</name>
    </author>
    <author>
      <name>Yunxin Liu</name>
    </author>
    <author>
      <name>Ting Cao</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08508v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08508v1" />
    <published>2026-06-07T08:18:11Z</published>
    <updated>2026-06-07T08:18:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bingjia Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;24 pages, 9 figures, 11 tables. Project page: https://air-embodied-brain.github.io/actprobe&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets</title>
    <author>
      <name>Minyoung Hwang</name>
    </author>
    <author>
      <name>Seokhyun Lee</name>
    </author>
    <author>
      <name>Changhee Lee</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08497v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08497v1" />
    <published>2026-06-07T07:54:49Z</published>
    <updated>2026-06-07T07:54:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Minyoung Hwang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input&amp;#x27;s linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model&amp;#x27;s gradients for a more challenging benchmark. Our code is available here.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;KDD 2026 Research Track&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] What Makes a Desired Graph for Relational Deep Learning?</title>
    <author>
      <name>Yao Cheng</name>
    </author>
    <author>
      <name>Siqiang Luo</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08491v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08491v1" />
    <published>2026-06-07T07:30:34Z</published>
    <updated>2026-06-07T07:30:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yao Cheng, Siqiang Luo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database schemas are often not well suited for how graph neural networks (GNNs) perform relational reasoning. We study what makes a relational graph suitable for deep learning and show that schema-derived graphs suffer from two systematic failures: information overload and semantic fragmentation. Our empirical analysis reveals that the desired graph is not the raw schema, but a result of controlled structural adaptation. Performance depends on balancing two operations: mitigating information overload via filtering, and repairing semantic fragmentation via injection. Specifically, filtering serves as a bias-variance knob with non-monotonic effects, while injection improves performance only when it explicitly restores the relational dependencies missing from the original schema. Based on these findings, we develop an end-to-end structural optimizer that applies both operations to adapt relational graphs automatically. Across 26 tasks spanning classification, regression, and recommendation, the optimized graphs consistently improve accuracy while often reducing inference cost.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;This article has been accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems</title>
    <author>
      <name>Suraj Ranganath</name>
    </author>
    <author>
      <name>Anish Raghavendra</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08481v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08481v1" />
    <published>2026-06-07T06:53:09Z</published>
    <updated>2026-06-07T06:53:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Suraj Ranganath, Anish Raghavendra&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] The Confidence Trap: Calibration Attacks for Graph Neural Networks</title>
    <author>
      <name>Cuong Dang</name>
    </author>
    <author>
      <name>Jiahao Zhang</name>
    </author>
    <author>
      <name>Hieu Ta Quang</name>
    </author>
    <author>
      <name>Dung Le</name>
    </author>
    <author>
      <name>Lu Cheng</name>
    </author>
    <author>
      <name>Suhang Wang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08467v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08467v1" />
    <published>2026-06-07T06:16:58Z</published>
    <updated>2026-06-07T06:16:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Cuong Dang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNNs to adversarial structural perturbations remains largely unexplored. However, studying calibration attacks on graphs presents unique technical challenges: (1) the discrete nature of graph structures complicates gradient-based optimization, (2) existing underconfidence objectives fail to drive predictions toward uniform distributions, and (3) GNNs are highly sensitive to edge perturbations, often causing unintended label changes that violate attack constraints. To address these challenges, we propose a Unified Graph Calibration Attack (UGCA) framework designed for worst-case (white-box) analysis of GNN calibration robustness. UGCA introduces a KL-divergence loss to encourage uniform predictive distributions, a reranking mechanism to reduce label flipping, a hybrid loss to recover labels when violations occur, and beam search to explore a broader adversarial search space. We further provide theoretical insights linking model generalization, dataset complexity, and calibration vulnerability, showing that models with higher accuracy or trained on datasets with more classes are more susceptible under this threat model. Extensive experiments demonstrate that UGCA substantially increases Expected Calibration Error while preserving classification accuracy. Our code is publicly available at https://github.com/CaptainCuong/Graph-Calibration-Attack.git.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models</title>
    <author>
      <name>Arya Shah</name>
    </author>
    <author>
      <name>Himanshu Beniwal</name>
    </author>
    <author>
      <name>Mayank Singh</name>
    </author>
    <author>
      <name>Chaklam Silpasuwanchai</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08451v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08451v1" />
    <published>2026-06-07T04:50:40Z</published>
    <updated>2026-06-07T04:50:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Arya Shah et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users&amp;#x27; opinions regardless of factual accuracy. Although well-studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking six instruction-tuned models across 1.1 million instances spanning 38 languages and 33 topic categories. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic, as models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed. We further identify tokenizer fertility as a structural driver of this alignment collapse. Collectively, our results demonstrate that prevailing alignment methodologies generalize poorly beyond high-resource languages, underscoring the urgent need for equitable multilingual safety techniques.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;19 pages, 9 figures, 7 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning</title>
    <author>
      <name>Yanyan Wu</name>
    </author>
    <author>
      <name>Boyi Zhang</name>
    </author>
    <author>
      <name>Yanlin Liu</name>
    </author>
    <author>
      <name>Xinyu Fang</name>
    </author>
    <author>
      <name>Jining Luan</name>
    </author>
    <author>
      <name>Meiqi Zhang</name>
    </author>
    <author>
      <name>Jiacheng Liu</name>
    </author>
    <author>
      <name>Hao Zeng</name>
    </author>
    <author>
      <name>Dexu Yu</name>
    </author>
    <author>
      <name>Chang Liu</name>
    </author>
    <author>
      <name>Hanwen Du</name>
    </author>
    <author>
      <name>Yongxin Ni</name>
    </author>
    <author>
      <name>Youhua Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08450v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08450v1" />
    <published>2026-06-07T04:45:19Z</published>
    <updated>2026-06-07T04:45:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yanyan Wu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Financial portfolio trading is naturally formulated as a reinforcement learning problem, where an agent sequentially rebalances assets under changing market conditions to balance return, risk, and transaction costs. Yet in non-stationary markets, raw OHLCV states and short-horizon return rewards often provide an under-specified learning interface, motivating large language models as a way to inject financial knowledge into state and reward design while constraining open-ended generation. To this end, we propose GIFT, an LLM-guided framework for state-reward interface design in PPO-based financial reinforcement learning. Rather than using the LLM to make trading decisions, GIFT uses Factor-guided State Enhancement to generate state features from financial-factor primitives, Risk-rule-guided Reward Shaping to generate auxiliary rewards from portfolio-risk rules, and Diagnostic-guided Refinement to revise candidate interfaces using PPO rollout diagnostics. After refinement, GIFT fixes the selected state-reward interface before evaluation, with no further LLM queries or interface updates at test time. Comprehensive rolling-window experiments across diverse market regimes and portfolio scenarios demonstrate that GIFT improves learning-signal quality and out-of-sample risk-adjusted portfolio performance over baselines. Code and data are available at: https://github.com/KAG778/GIFT .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;25 pages, 7 figures. Code and data are available at https://github.com/KAG778/GIFT . Equal contribution: Yanyan Wu and Boyi Zhang. Corresponding author: Youhua Li&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models</title>
    <author>
      <name>Yang Zhou</name>
    </author>
    <author>
      <name>Ranajoy Sadhukhan</name>
    </author>
    <author>
      <name>Zhaofeng Sun</name>
    </author>
    <author>
      <name>Zhuoming Chen</name>
    </author>
    <author>
      <name>Souvik Kundu</name>
    </author>
    <author>
      <name>Saket Dingliwal</name>
    </author>
    <author>
      <name>Sai Muralidhar Jayanthi</name>
    </author>
    <author>
      <name>Aram Galstyan</name>
    </author>
    <author>
      <name>Haizhong Zheng</name>
    </author>
    <author>
      <name>Beidi Chen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08446v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08446v1" />
    <published>2026-06-07T04:24:45Z</published>
    <updated>2026-06-07T04:24:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yang Zhou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long chains of thought (CoT), making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with their dense counterparts even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing)</title>
    <author>
      <name>George Andronchik</name>
    </author>
    <author>
      <name>Pavel Lokhmakov</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08433v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08433v1" />
    <published>2026-06-07T03:20:03Z</published>
    <updated>2026-06-07T03:20:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;George Andronchik, Pavel Lokhmakov&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This paper reads six engine-level measurements together -- 1.1 host attack surface, 1.2 information leakage, 1.3 defense-in-depth stackability, 1.4 public CVE history, 1.5 patch cadence, and 1.6 upstream fuzzing posture -- to describe how five AI-sandbox products isolate guest code from the host kernel. No single axis is a sufficient basis for a comparative judgement; the cross-axis reading is the load-bearing analysis.
  Three high-level findings: (1) engine classes (microVM, userspace kernel, OCI container) separate cleanly on every architectural axis, but products within a class do not; (2) product pin policy is the dominant operator-facing variable -- engine-side patch latency aggregates to ~0 days for coordinated disclosures, while downstream lag spans 0 days to 471+ days to &amp;quot;opaque&amp;quot; to infinity; (3) fuzzing investment splits into three tiers, and the strongest combination -- microVM x continuous public fuzzer -- is unoccupied in this set, leaving the &amp;quot;0 published CVEs x no upstream fuzzer x no academic study&amp;quot; intersection structurally unmeasured.
  We report per-axis orderings, per-product portraits, and a threat-model qualification matrix; no overall ranking is proposed. Companion repository (code, Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1. License: CC BY 4.0.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;61 pages, 7 figures, 33 tables; Part 1 of 2; companion code repository (Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Trajectory-Refined Distillation</title>
    <author>
      <name>Li Jiang</name>
    </author>
    <author>
      <name>Haoran Xu</name>
    </author>
    <author>
      <name>Yichuan Ding</name>
    </author>
    <author>
      <name>Amy Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08432v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08432v1" />
    <published>2026-06-07T03:17:25Z</published>
    <updated>2026-06-07T03:17:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Li Jiang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student&amp;#x27;s own rollouts. In this work, we identify a common structural cause of failure in OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fails to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student&amp;#x27;s rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;under review&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] CoVEBench: Can Video Editing Models Handle Complex Instructions?</title>
    <author>
      <name>Jiangtao Wu</name>
    </author>
    <author>
      <name>Jiaming Wang</name>
    </author>
    <author>
      <name>Yiwen He</name>
    </author>
    <author>
      <name>Yuanxing Zhang</name>
    </author>
    <author>
      <name>Shihao Li</name>
    </author>
    <author>
      <name>Dunyuan Liu</name>
    </author>
    <author>
      <name>Xuedong Zhao</name>
    </author>
    <author>
      <name>Jialu Chen</name>
    </author>
    <author>
      <name>Zekun Moore Wang</name>
    </author>
    <author>
      <name>Jiaheng Liu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08415v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08415v1" />
    <published>2026-06-07T02:29:41Z</published>
    <updated>2026-06-07T02:29:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiangtao Wu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;34 pages, 11 figures, 9 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation</title>
    <author>
      <name>Lingxuan Wu</name>
    </author>
    <author>
      <name>Zijian Zhu</name>
    </author>
    <author>
      <name>Lizhong Wang</name>
    </author>
    <author>
      <name>Chengyang Ying</name>
    </author>
    <author>
      <name>Huayu Chen</name>
    </author>
    <author>
      <name>Xiao Yang</name>
    </author>
    <author>
      <name>Fangming Liu</name>
    </author>
    <author>
      <name>Jun Zhu</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08414v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08414v1" />
    <published>2026-06-07T02:27:53Z</published>
    <updated>2026-06-07T02:27:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lingxuan Wu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries</title>
    <author>
      <name>Linfeng Cao</name>
    </author>
    <author>
      <name>Ming Shi</name>
    </author>
    <author>
      <name>Ness B. Shroff</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08410v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08410v1" />
    <published>2026-06-07T02:19:59Z</published>
    <updated>2026-06-07T02:19:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Linfeng Cao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., &amp;quot;cheap and clean hotel&amp;quot;), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;UAI 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering</title>
    <author>
      <name>Ryandito Diandaru</name>
    </author>
    <author>
      <name>Ikhlasul Akmal Hanif</name>
    </author>
    <author>
      <name>Fadli Aulawi Al Ghiffari</name>
    </author>
    <author>
      <name>Ahmed Elshabrawy</name>
    </author>
    <author>
      <name>Alham Fikri Aji</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08408v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08408v1" />
    <published>2026-06-07T02:13:17Z</published>
    <updated>2026-06-07T02:13:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ryandito Diandaru et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We extend activation steering to diffusion language models (DLMs) and study a novel problem that arises from the inference mechanism of DLMs: modifying a text in-place to manifest a different concept. We propose TimpaTeks, an automatic in-place text modification mechanism using DLMs. Experiments on IMDB movie reviews (sentiment) and a synthetic Cats and Dogs Dataset (arbitrary, more unconventional concept steering) show that TimpaTeks provides a feasible novel mechanism to steer diffusion language model outputs in-place. TimpaTeks enables in-place modification while simultaneously lowering sentence perplexity and retaining the original sentence structure, without the need for instruction-tuned models. TimpaTeks is also computationally cheaper than prompt-based DLM steering, as it performs denoising in-place rather than constructing an additional prompt-conditioned output sequence.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration</title>
    <author>
      <name>Jeonghwan Kim</name>
    </author>
    <author>
      <name>Yushi Lan</name>
    </author>
    <author>
      <name>Yongwei Chen</name>
    </author>
    <author>
      <name>Hieu Trung Nguyen</name>
    </author>
    <author>
      <name>Chuanyu Pan</name>
    </author>
    <author>
      <name>Xingang Pan</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08402v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08402v1" />
    <published>2026-06-07T01:38:39Z</published>
    <updated>2026-06-07T01:38:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jeonghwan Kim et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control</title>
    <author>
      <name>Priyansh Bhatnagar</name>
    </author>
    <author>
      <name>Ashkan Moradifirouzabadi</name>
    </author>
    <author>
      <name>Se-Hyun Yang</name>
    </author>
    <author>
      <name>SeungJae Lee</name>
    </author>
    <author>
      <name>Jungwook Choi</name>
    </author>
    <author>
      <name>Mingu Kang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08382v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08382v1" />
    <published>2026-06-07T00:24:02Z</published>
    <updated>2026-06-07T00:24:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Priyansh Bhatnagar et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming</title>
    <author>
      <name>Michal P. Podolinsky</name>
    </author>
    <author>
      <name>Neel P. Bhatt</name>
    </author>
    <author>
      <name>Pranay Samineni</name>
    </author>
    <author>
      <name>Rohan Siva</name>
    </author>
    <author>
      <name>Christian Ellis</name>
    </author>
    <author>
      <name>Ufuk Topcu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09919v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09919v1" />
    <published>2026-06-07T00:02:21Z</published>
    <updated>2026-06-07T00:02:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Michal P. Podolinsky et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/ .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code, videos, and dataset available at https://co-glance.github.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MA] Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy</title>
    <author>
      <name>Deepak Akkil</name>
    </author>
    <author>
      <name>Ravi Kokku</name>
    </author>
    <author>
      <name>Karthik Vikram</name>
    </author>
    <author>
      <name>Tamer Abuelsaad</name>
    </author>
    <author>
      <name>Aditya Vempaty</name>
    </author>
    <author>
      <name>Satya Nitta</name>
    </author>
    <category term="cs.MA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08367v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08367v1" />
    <published>2026-06-06T22:59:27Z</published>
    <updated>2026-06-06T22:59:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Deepak Akkil et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g. real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long-horizon multi-agent autonomy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Benchmarking Open-Ended Multi-Agent Coordination in Language Agents</title>
    <author>
      <name>Kale-ab Abebe Tessera</name>
    </author>
    <author>
      <name>Andras Szecsenyi</name>
    </author>
    <author>
      <name>Cameron Barker</name>
    </author>
    <author>
      <name>Alexander Rutherford</name>
    </author>
    <author>
      <name>Davide Paglieri</name>
    </author>
    <author>
      <name>Aidan Scannell</name>
    </author>
    <author>
      <name>Henry Gouk</name>
    </author>
    <author>
      <name>Elliot J. Crowley</name>
    </author>
    <author>
      <name>Tim Rocktäschel</name>
    </author>
    <author>
      <name>Amos Storkey</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08340v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08340v1" />
    <published>2026-06-06T21:13:43Z</published>
    <updated>2026-06-06T21:13:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kale-ab Abebe Tessera et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;42 pages, preprint&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies</title>
    <author>
      <name>Ashkan Ansarifard</name>
    </author>
    <author>
      <name>Matteo Mancanelli</name>
    </author>
    <author>
      <name>Elena Umili</name>
    </author>
    <author>
      <name>Fabio Patrizi</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08312v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08312v1" />
    <published>2026-06-06T19:55:54Z</published>
    <updated>2026-06-06T19:55:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ashkan Ansarifard et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at the Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026), co-located with KR 2026 and FLoC 2026, Lisbon, Portugal&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures</title>
    <author>
      <name>Jaineet Shah</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08275v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08275v1" />
    <published>2026-06-06T17:44:23Z</published>
    <updated>2026-06-06T17:44:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jaineet Shah&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&amp;amp;When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Open-source: https://github.com/jaineet17/causal-agent-replay&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CY] Contemporary AI lacks the imagination to diverge or negate in science</title>
    <author>
      <name>Honglin Bao</name>
    </author>
    <author>
      <name>Siyang Wu</name>
    </author>
    <author>
      <name>Xiao Liu</name>
    </author>
    <author>
      <name>Sida Li</name>
    </author>
    <author>
      <name>Shiyun Cao</name>
    </author>
    <author>
      <name>James A. Evans</name>
    </author>
    <category term="cs.CY" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08251v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08251v2" />
    <published>2026-06-06T16:39:28Z</published>
    <updated>2026-06-09T03:31:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Honglin Bao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow &amp;quot;hivemind&amp;quot; of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today&amp;#x27;s scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents</title>
    <author>
      <name>Tanush Swaminathan</name>
    </author>
    <author>
      <name>Runmin Jiang</name>
    </author>
    <author>
      <name>Letian Zhang</name>
    </author>
    <author>
      <name>Min Xu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08234v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08234v1" />
    <published>2026-06-06T15:44:50Z</published>
    <updated>2026-06-06T15:44:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tanush Swaminathan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce SciTrace, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a Safety-Intrinsic Reasoning Loop (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a Compositional Tool-Chain Verifier (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (SOTA) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers 78.8% of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models</title>
    <author>
      <name>Ryner Tan</name>
    </author>
    <author>
      <name>Wenxuan Zhang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08194v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08194v1" />
    <published>2026-06-06T14:24:05Z</published>
    <updated>2026-06-06T14:24:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ryner Tan, Wenxuan Zhang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation</title>
    <author>
      <name>Kewei Li</name>
    </author>
    <author>
      <name>Rongying Zhang</name>
    </author>
    <author>
      <name>Xueli Wang</name>
    </author>
    <author>
      <name>Xiwen Gong</name>
    </author>
    <author>
      <name>Zhongjian Wang</name>
    </author>
    <author>
      <name>Lan Huang</name>
    </author>
    <author>
      <name>Ruochi Zhang</name>
    </author>
    <author>
      <name>Fengfeng Zhou</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08191v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08191v1" />
    <published>2026-06-06T14:21:55Z</published>
    <updated>2026-06-06T14:21:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kewei Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Explaining Data Mixing Scaling Laws</title>
    <author>
      <name>Rui Dai</name>
    </author>
    <author>
      <name>Shuran Zheng</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08167v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08167v1" />
    <published>2026-06-06T13:31:38Z</published>
    <updated>2026-06-06T13:31:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rui Dai, Shuran Zheng&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: Capacity Competition, where the allocation of finite model capacity couples domain losses globally, and Noise Reduction, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Published to ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT</title>
    <author>
      <name>Kyumin Choi</name>
    </author>
    <author>
      <name>Ikbeom Jang</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08156v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08156v1" />
    <published>2026-06-06T13:13:02Z</published>
    <updated>2026-06-06T13:13:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kyumin Choi, Ikbeom Jang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages, 2 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents</title>
    <author>
      <name>Xinyu Guan</name>
    </author>
    <author>
      <name>Qianyang Zhao</name>
    </author>
    <author>
      <name>Yuming Deng</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08151v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08151v1" />
    <published>2026-06-06T13:02:28Z</published>
    <updated>2026-06-06T13:02:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xinyu Guan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;15 pages, 2 figures, 9 tables. Code and artifacts are available at https://github.com/stephen-guan-researcher/CICL; Qwen-QLoRA adapter is available at https://huggingface.co/XinyuGuan/CICL&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection</title>
    <author>
      <name>Yichen Chen</name>
    </author>
    <author>
      <name>Siying Li</name>
    </author>
    <author>
      <name>Yuhang Liang</name>
    </author>
    <author>
      <name>Lijun Wang</name>
    </author>
    <author>
      <name>Renyang Liu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08146v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08146v1" />
    <published>2026-06-06T12:48:45Z</published>
    <updated>2026-06-06T12:48:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yichen Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and the design of general-purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins 96.00% of method-dataset comparisons and improves F1 by an average of 40.86% over baselines. The code is available at https://github.com/yichenC1c/SAGE.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining</title>
    <author>
      <name>Aaryan Nagpal</name>
    </author>
    <author>
      <name>Debdeep Sanyal</name>
    </author>
    <author>
      <name>Murari Mandal</name>
    </author>
    <author>
      <name>Dhruv Kumar</name>
    </author>
    <author>
      <name>Saurabh Deshpande</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09912v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09912v1" />
    <published>2026-06-06T12:10:38Z</published>
    <updated>2026-06-06T12:10:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Aaryan Nagpal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best and worst generators produce up to a $2\times$ gap in forecasting error, yet the field has no principled way to make this choice. The problem is compounded by the fact that generator rankings are not stable across architectures: across 11 generator families evaluated on Chronos-T5-Mini and Moirai-Small trained from scratch, we find that which generators are useful depends on the model architecture. Rather than solving the generator selection problem, we sidestep it: a simple equal-weight mixture of all generators matches or beats the best individual generator for both architectures, and composing this mixture with real data yields the strongest pretraining corpora overall. Synthetic pretraining is therefore a corpus composition problem, not a generator selection problem, and composition choices should be validated per model family rather than assumed to transfer.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at the ICML 2026 Workshop on Foundation Models for Structured Data (FMSD), Seoul, South Korea&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Human-Centered Benchmarking of Driver Monitoring Models</title>
    <author>
      <name>Ruben Dario Florez-Zela</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08123v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08123v1" />
    <published>2026-06-06T11:57:16Z</published>
    <updated>2026-06-06T11:57:16Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ruben Dario Florez-Zela&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model&amp;#x27;s fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction</title>
    <author>
      <name>Qingxiang Liu</name>
    </author>
    <author>
      <name>Anqi Liang</name>
    </author>
    <author>
      <name>Zhuoyang Jiang</name>
    </author>
    <author>
      <name>Yutian Jiang</name>
    </author>
    <author>
      <name>Sisuo Lyu</name>
    </author>
    <author>
      <name>Yu Ji</name>
    </author>
    <author>
      <name>Haomin Wen</name>
    </author>
    <author>
      <name>Yuxuan Liang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08122v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08122v1" />
    <published>2026-06-06T11:56:44Z</published>
    <updated>2026-06-06T11:56:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qingxiang Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Predicting a user&amp;#x27;s next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly; instead, they usually first form a travel intention and then select specific POIs accordingly. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users&amp;#x27; intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and temporal context. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms next-POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data</title>
    <author>
      <name>Ji Woong Kim</name>
    </author>
    <author>
      <name>Ke Wang</name>
    </author>
    <author>
      <name>Zipeng Fu</name>
    </author>
    <author>
      <name>Sirui Chen</name>
    </author>
    <author>
      <name>Cong Zhao</name>
    </author>
    <author>
      <name>Jeff Lai</name>
    </author>
    <author>
      <name>Chelsea Finn</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08107v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08107v1" />
    <published>2026-06-06T11:14:21Z</published>
    <updated>2026-06-06T11:14:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ji Woong Kim et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $π_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Continual Quadruped Robots Coordination via Semantic Skill Discovery</title>
    <author>
      <name>Daoqing Wang</name>
    </author>
    <author>
      <name>Yuchen Xiao</name>
    </author>
    <author>
      <name>Weixuan Huang</name>
    </author>
    <author>
      <name>Zhilong Zhang</name>
    </author>
    <author>
      <name>Shenghua Wan</name>
    </author>
    <author>
      <name>Meng Li</name>
    </author>
    <author>
      <name>Lei Yuan</name>
    </author>
    <author>
      <name>Yang Yu</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08102v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08102v1" />
    <published>2026-06-06T11:07:13Z</published>
    <updated>2026-06-06T11:07:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Daoqing Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot&amp;#x27;s own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models</title>
    <author>
      <name>Khanh D. Nguyen</name>
    </author>
    <author>
      <name>Hung T. Ho</name>
    </author>
    <author>
      <name>Chinh T. Nguyen</name>
    </author>
    <author>
      <name>Thanh Q. Duong</name>
    </author>
    <author>
      <name>Linh D. Le</name>
    </author>
    <author>
      <name>Duy M. H. Nguyen</name>
    </author>
    <author>
      <name>Vien A. Ngo</name>
    </author>
    <author>
      <name>An T. Le</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08094v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08094v1" />
    <published>2026-06-06T10:45:40Z</published>
    <updated>2026-06-06T10:45:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Khanh D. Nguyen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 3 figures, 12 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions</title>
    <author>
      <name>Po-Ya Angela Wang</name>
    </author>
    <author>
      <name>Chinmaya Mishra</name>
    </author>
    <author>
      <name>Aslı Özyürek</name>
    </author>
    <author>
      <name>Paula Rubio-Fernández</name>
    </author>
    <author>
      <name>Esam Ghaleb</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08081v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08081v1" />
    <published>2026-06-06T10:05:49Z</published>
    <updated>2026-06-06T10:05:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Po-Ya Angela Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] "I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory</title>
    <author>
      <name>Esra Dönmez</name>
    </author>
    <author>
      <name>Agnieszka Falenska</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08076v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08076v1" />
    <published>2026-06-06T09:54:31Z</published>
    <updated>2026-06-06T09:54:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Esra Dönmez, Agnieszka Falenska&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas&amp;#x27; Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster&amp;#x27;s view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder&amp;#x27;s intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs&amp;#x27; persuasive power extends beyond merely generating high-quality arguments. Rather, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals&amp;#x27; susceptibility to their influence.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?</title>
    <author>
      <name>Jiaqi Tang</name>
    </author>
    <author>
      <name>Jianmin Chen</name>
    </author>
    <author>
      <name>Youyang Zhai</name>
    </author>
    <author>
      <name>Wei Wei</name>
    </author>
    <author>
      <name>Runtao Liu</name>
    </author>
    <author>
      <name>Mengjie Zhao</name>
    </author>
    <author>
      <name>Xiangyu Wu</name>
    </author>
    <author>
      <name>Qingfa Xiao</name>
    </author>
    <author>
      <name>Qifeng Chen</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08063v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08063v1" />
    <published>2026-06-06T08:58:12Z</published>
    <updated>2026-06-06T08:58:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaqi Tang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions</title>
    <author>
      <name>Donghao Huang</name>
    </author>
    <author>
      <name>Tomas Drietomsky</name>
    </author>
    <author>
      <name>Benjamin Barrett</name>
    </author>
    <author>
      <name>Zhaoxia Wang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08051v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08051v1" />
    <published>2026-06-06T08:32:38Z</published>
    <updated>2026-06-06T08:32:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Donghao Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences &amp;lt;0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;9 pages, 5 figures, 5 tables. Submitted to the IEEE International Conference on Data Mining (ICDM) 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows</title>
    <author>
      <name>Amine El Hattami</name>
    </author>
    <author>
      <name>Nicolas Chapados</name>
    </author>
    <author>
      <name>Christopher Pal</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08049v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08049v1" />
    <published>2026-06-06T08:27:18Z</published>
    <updated>2026-06-06T08:27:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Amine El Hattami et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs</title>
    <author>
      <name>Dimitrios Michail</name>
    </author>
    <author>
      <name>Eleni Saka</name>
    </author>
    <author>
      <name>Ioannis Giannopoulos</name>
    </author>
    <author>
      <name>Ioannis Papoutsis</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08046v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08046v1" />
    <published>2026-06-06T08:18:21Z</published>
    <updated>2026-06-06T08:18:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dimitrios Michail et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM&amp;#x27;s explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical--temperate distinctions from map topology alone.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts</title>
    <author>
      <name>Ayana Hussain</name>
    </author>
    <author>
      <name>Soumya Sharma</name>
    </author>
    <author>
      <name>Golnoosh Farnadi</name>
    </author>
    <author>
      <name>Nicholas Vincent</name>
    </author>
    <author>
      <name>Héber Hwang Arcolezi</name>
    </author>
    <author>
      <name>Ulrich Aïvodji</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09908v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09908v1" />
    <published>2026-06-06T07:59:08Z</published>
    <updated>2026-06-06T07:59:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ayana Hussain et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) are becoming widely deployed as personal AI assistants with access to sensitive user data, making privacy a major challenge for their design and evaluation. Prior work focuses mainly on individual-level risks, overlooking interdependent privacy (IDP), where one person&amp;#x27;s data may be revealed by others without their knowledge or consent. We address this gap by introducing IDP-Bench: the first LLM benchmark for IDP scenarios, grounded in the Contextual Integrity (CI) framework. We evaluate eight open-source LLMs on their understanding of IDP scenarios across three levels of IDP reasoning using two LLM judges. Results show strong co-ownership recognition (6/8 models exceed 90%) but persistent weaknesses in identifying CI parameters (information attribute, primary subject) and IDP-specific parameters such as secondary subjects, where 7/8 models score below 74%. Models also struggle to judge sharing appropriateness (5/8 scoring below 77%). While the ability to judge the appropriateness of sharing improves with scale, performance tends to decline in smaller models, and prompt sensitivity remains high on IDP-specific questions, highlighting the need for more targeted study of IDP in LLM privacy research. Data and code are available at https://github.com/tisl-lab/Interdependent_Privacy_Bench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification</title>
    <author>
      <name>Hongkyu Koh</name>
    </author>
    <author>
      <name>Ikbeom Jang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08037v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08037v1" />
    <published>2026-06-06T07:57:22Z</published>
    <updated>2026-06-06T07:57:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hongkyu Koh, Ikbeom Jang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive strategy for reducing annotation costs. In clinical settings, however, unlabeled pools frequently contain out-of-distribution (OOD) anomalies or diagnostic groups absent from the labeled set. Standard SSL forces incorrect pseudo-labels onto these unseen classes, producing overconfident predictions. To address this, we propose SafeECGMatch, a calibration-aware safe SSL framework for single-label ECG classification under label distribution mismatch. Methodologically, SafeECGMatch employs a dual-branch architecture extracting time-frequency latent representations via ECG-specific augmentations. Crucially, it dynamically aligns confidence with empirical accuracy through adaptive label smoothing and temperature scaling, calibrating both the multiclass classifier and the OOD detector across temporal and spectral domains. This joint optimization allows trustworthy OOD rejection and reliable pseudo-labeling. Evaluated on the PTB-XL and PhysioNet/CinC Challenge benchmarks, SafeECGMatch achieves state-of-the-art accuracy and calibration, advancing reliable knowledge discovery in physiological time-series. Code is available at https://github.com/labhai/SafeECGMatch.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages. Accepted to the KDD-UC 2026 (ACM International Conference on Data Mining and Knowledge Discovery - Undergraduate Consortium 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL</title>
    <author>
      <name>Jianling Gao</name>
    </author>
    <author>
      <name>Chongyang Tao</name>
    </author>
    <author>
      <name>Jiayuan Bai</name>
    </author>
    <author>
      <name>Liu Yang</name>
    </author>
    <author>
      <name>Xuanguang Pan</name>
    </author>
    <author>
      <name>Jinrui Liu</name>
    </author>
    <author>
      <name>Shihao Xing</name>
    </author>
    <author>
      <name>Xiaohan Xu</name>
    </author>
    <author>
      <name>Jie Liang</name>
    </author>
    <author>
      <name>Shuai Ma</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08018v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08018v1" />
    <published>2026-06-06T07:14:53Z</published>
    <updated>2026-06-06T07:14:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jianling Gao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing text-to-SQL benchmarks are largely centered on SQLite, making it difficult to evaluate whether models can generalize across heterogeneous SQL dialects. However, real-world database systems differ substantially in syntax, functions, type systems, and execution semantics, so the same natural language intent often requires dialect-specific SQL realizations. We introduce UniQL, a human-verified benchmark for cross-dialect text-to-SQL evaluation. UniQL aligns 1,534 natural language questions with executable SQL annotations across 16 SQL dialects, yielding 24,544 dialect-specific queries. All dialects share the same intents, aligned schemas and database contents, enabling controlled evaluation of dialect generalization. UniQL is constructed through a hybrid pipeline combining database migration, SQL translation, execution-guided verification, iterative rule summarization, and human validation. Experiments on both open-source and closed-source LLMs show that current models remain far from dialect-universal, with substantial performance variation across database systems and limited transfer from SQLite success to other dialects. These findings highlight the need for aligned cross-dialect benchmarks and more dialect-aware text-to-SQL methods. Code and data are available at https://github.com/JerryGao818/UniQL&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment</title>
    <author>
      <name>Zichen Zhu</name>
    </author>
    <author>
      <name>Yuheng Sun</name>
    </author>
    <author>
      <name>Mingxuan Zhu</name>
    </author>
    <author>
      <name>Wenjie Ma</name>
    </author>
    <author>
      <name>Situo Zhang</name>
    </author>
    <author>
      <name>Zhexiang Wang</name>
    </author>
    <author>
      <name>Ziyue Yang</name>
    </author>
    <author>
      <name>Danyang Zhang</name>
    </author>
    <author>
      <name>Kunyao Lan</name>
    </author>
    <author>
      <name>Zihan Zhao</name>
    </author>
    <author>
      <name>Dingye Liu</name>
    </author>
    <author>
      <name>Siqi Xiang</name>
    </author>
    <author>
      <name>Lu Chen</name>
    </author>
    <author>
      <name>Kai Yu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08016v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08016v1" />
    <published>2026-06-06T07:11:08Z</published>
    <updated>2026-06-06T07:11:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zichen Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users&amp;#x27; intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;[CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation</title>
    <author>
      <name>Boxuan Lyu</name>
    </author>
    <author>
      <name>Haiyue Song</name>
    </author>
    <author>
      <name>Zhi Qu</name>
    </author>
    <author>
      <name>Hidetaka Kamigaito</name>
    </author>
    <author>
      <name>Kotaro Funakoshi</name>
    </author>
    <author>
      <name>Manabu Okumura</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08011v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08011v1" />
    <published>2026-06-06T07:00:44Z</published>
    <updated>2026-06-06T07:00:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Boxuan Lyu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Although directly prompting off-the-shelf Large Language Models (LLMs) to generate meaning-preserving source rewrites can effectively enhance Machine Translation (MT) quality, doing so requires manually tuning prompts for different MT models. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel RL-based framework for training a source rewriting model without tuning prompts for each MT model. RLSR optimizes the rewriting model by directly using the improvement in downstream translation quality yielded by each rewritten source as the reward. Extensive experiments across six MT models and 16 language pairs demonstrate that our 4B rewriting models trained via RLSR significantly outperform the no-rewriting baseline and existing same-scale prompt-based rewriting baselines, while achieving competitive performance against prompt-based baselines based on the 235B LLM.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models</title>
    <author>
      <name>Kaixin Lan</name>
    </author>
    <author>
      <name>Mu You</name>
    </author>
    <author>
      <name>Tao Fang</name>
    </author>
    <author>
      <name>Binkai Ou</name>
    </author>
    <author>
      <name>Lidia S. Chao</name>
    </author>
    <author>
      <name>Derek F. Wong</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07996v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07996v1" />
    <published>2026-06-06T06:27:54Z</published>
    <updated>2026-06-06T06:27:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kaixin Lan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model&amp;#x27;s pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] PRISM: PRior-guided Imagination Sampling in world Models</title>
    <author>
      <name>Yuhai Wang</name>
    </author>
    <author>
      <name>Jiawei Xia</name>
    </author>
    <author>
      <name>Rongxuan Zhou</name>
    </author>
    <author>
      <name>Xiao Hu</name>
    </author>
    <author>
      <name>Yongliang Shi</name>
    </author>
    <author>
      <name>Jing Du</name>
    </author>
    <author>
      <name>Yang Ye</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07974v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07974v1" />
    <published>2026-06-06T04:18:10Z</published>
    <updated>2026-06-06T04:18:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuhai Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert&amp;#x27;s state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data, together with the learned representations of the world model itself, inherently encodes the agent&amp;#x27;s action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner&amp;#x27;s sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, following the prior where it is confident and ceding control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks</title>
    <author>
      <name>Haoming Wen</name>
    </author>
    <author>
      <name>Shi Chen</name>
    </author>
    <author>
      <name>Qingyu Shi</name>
    </author>
    <author>
      <name>Siyuan Liu</name>
    </author>
    <author>
      <name>Minrui Luo</name>
    </author>
    <author>
      <name>Jingzhao Zhang</name>
    </author>
    <author>
      <name>Tianxing He</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07970v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07970v1" />
    <published>2026-06-06T04:04:57Z</published>
    <updated>2026-06-06T04:04:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haoming Wen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher&amp;#x27;s performance. Extensive experiments show that Patcher substantially improves the model&amp;#x27;s robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Neutrality Bites: Gender Representation in AI-Generated Animal Stories</title>
    <author>
      <name>Imani Finkley</name>
    </author>
    <author>
      <name>Yuanxi Li</name>
    </author>
    <author>
      <name>Melanie Walsh</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07969v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07969v1" />
    <published>2026-06-06T04:04:52Z</published>
    <updated>2026-06-06T04:04:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Imani Finkley et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like &amp;quot;it&amp;quot; or &amp;quot;its&amp;quot; (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;FAccT (ACM Conference on Fairness, Accountability, and Transparency) 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks</title>
    <author>
      <name>Abid Aziz</name>
    </author>
    <author>
      <name>Hafsa Binte Kibria</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07968v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07968v1" />
    <published>2026-06-06T03:52:27Z</published>
    <updated>2026-06-06T03:52:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Abid Aziz, Hafsa Binte Kibria&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the user&amp;#x27;s question, causing denial of service when no final answer is produced and denial of wallet when excess output tokens are billed. Input-side safety classifiers often miss these attacks because the injected prompts can appear syntactically benign. We build RecurGuard, a runtime monitor for detecting reasoning-chain consumption attacks when reasoning traces are exposed by the model. RecurGuard analyzes reasoning traces as they are generated and tracks three signals: recurrence rate, volume growth, and progress toward the user&amp;#x27;s query. If all three signals remain anomalous over three consecutive chunks, RecurGuard terminates generation early. We evaluate RecurGuard against OverThink and ExtendAttack across open-weight reasoning models and conduct adaptive stress tests on DS-R1-Qwen-7B. On this model, RecurGuard detects 99% of OverThink attacks and 92% of ExtendAttack instances while maintaining near-zero false positive rates on question answering, code generation, mathematics, and summarization. Adaptive evaluation reveals the limit of the defense: topical attacks retain 11.9x amplification with an approximately 50% joint miss rate, whereas full semantic evasion reduces amplification from 22.8x to 2.2x. When reasoning traces are unavailable, QDM provides a post-hoc fallback monitor based on the final output.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline</title>
    <author>
      <name>Zekai Zhang</name>
    </author>
    <author>
      <name>Qinghui Chen</name>
    </author>
    <author>
      <name>Maomao Xiong</name>
    </author>
    <author>
      <name>Shijiao Ding</name>
    </author>
    <author>
      <name>Zhanzhi Su</name>
    </author>
    <author>
      <name>Xinjie Yao</name>
    </author>
    <author>
      <name>Yiming Sun</name>
    </author>
    <author>
      <name>Cong Bai</name>
    </author>
    <author>
      <name>Jinglin Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07965v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07965v1" />
    <published>2026-06-06T03:48:12Z</published>
    <updated>2026-06-06T03:48:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zekai Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects. This often leads to suboptimal performance due to the inclusion of irrelevant pixels. In addition, the scarcity of data has left the application of LVLMs in industrial scenarios unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO) containing 80K+ samples. MMIO contains diverse industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning, and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides an RTVP specifically for industrial zero-shot tasks. RTVP has two significant advantages: First, this paper designs an expert-guided large model domain adaptation mechanism and designs an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers text-visual prompt interactions ignored by previous LVLMs, improving visual and textual content understanding. RTVP achieves SOTA with 42.2% and 24.7% AP in zero-shot and closed scenes of MMIO.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs</title>
    <author>
      <name>Omar Mahmoud</name>
    </author>
    <author>
      <name>Aly M. Kassem</name>
    </author>
    <author>
      <name>Thommen George Karimpanal</name>
    </author>
    <author>
      <name>Buddhika Laknath Semage</name>
    </author>
    <author>
      <name>Negar Rostamzadeh</name>
    </author>
    <author>
      <name>Golnoosh Farnadi</name>
    </author>
    <author>
      <name>Santu Rana</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07963v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07963v1" />
    <published>2026-06-06T03:41:44Z</published>
    <updated>2026-06-06T03:41:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Omar Mahmoud et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Minibatch Selection via Partition Matroid Constrained Gradient Matching</title>
    <author>
      <name>Prayas Agrawal</name>
    </author>
    <author>
      <name>Prateek Chanda</name>
    </author>
    <author>
      <name>Ishita Khatri</name>
    </author>
    <author>
      <name>Ganesh Ramakrishnan</name>
    </author>
    <author>
      <name>Bamdev Mishra</name>
    </author>
    <author>
      <name>Pratik Jawanpuria</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07954v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07954v1" />
    <published>2026-06-06T03:16:19Z</published>
    <updated>2026-06-06T03:16:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Prayas Agrawal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rely on computationally expensive proxy models to learn continuous domain weights. We propose PartitionSel, a cross-domain minibatch selection approach that maximizes a validation-guided gradient-matching utility under per-domain budgets encoded as a partition-matroid constraint. By coupling the per-domain budgets through a single utility, PartitionSel is designed to reduce redundancy in selections across domains. The proposed objective is weakly submodular and admits an orthogonal matching pursuit algorithm with provable approximation guarantees. Empirically, we evaluate PartitionSel for minibatch selection during the fine-tuning of Qwen2.5 and Llama-3 on MetaMathQA and Mol-Instructions. PartitionSel achieves robust gains over per-domain and domain-agnostic baselines on both benchmarks. It also reduces the number of conflicting gradient pairs within each batch, indicating that the cross-domain coupling translates into more compatible training updates.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;28 pages, 12 figures, ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks, Challenges and Baselines</title>
    <author>
      <name>Zekai Zhang</name>
    </author>
    <author>
      <name>Jinglin Zhang</name>
    </author>
    <author>
      <name>Qinghui Chen</name>
    </author>
    <author>
      <name>Gang Li</name>
    </author>
    <author>
      <name>Da Chen</name>
    </author>
    <author>
      <name>Shuainan Jing</name>
    </author>
    <author>
      <name>He Wang</name>
    </author>
    <author>
      <name>Dagang Li</name>
    </author>
    <author>
      <name>Cong Liu</name>
    </author>
    <author>
      <name>Cong Bai</name>
    </author>
    <author>
      <name>Shengyong Chen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07953v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07953v1" />
    <published>2026-06-06T03:06:10Z</published>
    <updated>2026-06-06T03:06:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zekai Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across 14 super-categories, 29 industrial scenes, and 351 defect subcategories. To our knowledge, MMIOC-1M is the first and largest unified benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] From 'May' to 'Is': Certainty Distortion in Language Model Rewriting</title>
    <author>
      <name>Catarina G Belem</name>
    </author>
    <author>
      <name>Shang Wu</name>
    </author>
    <author>
      <name>Hongyu Yao</name>
    </author>
    <author>
      <name>Mark Steyvers</name>
    </author>
    <author>
      <name>Sameer Singh</name>
    </author>
    <author>
      <name>Padhraic Smyth</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07951v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07951v1" />
    <published>2026-06-06T02:53:31Z</published>
    <updated>2026-06-06T02:53:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Catarina G Belem et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve it. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2× more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, increasing to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] POISE: Position-Aware Undetectable Skill Injection on LLM Agents</title>
    <author>
      <name>Haochang Hao</name>
    </author>
    <author>
      <name>Dehai Min</name>
    </author>
    <author>
      <name>Zhifang Zhang</name>
    </author>
    <author>
      <name>Yunbei Zhang</name>
    </author>
    <author>
      <name>Miao Xu</name>
    </author>
    <author>
      <name>Yingqiang Ge</name>
    </author>
    <author>
      <name>Lu Cheng</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07943v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07943v1" />
    <published>2026-06-06T02:10:03Z</published>
    <updated>2026-06-06T02:10:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haochang Hao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user&amp;#x27;s legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user&amp;#x27;s task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent&amp;#x27;s own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, 2 figures, 5 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation</title>
    <author>
      <name>Katelyn Xiaoying Mei</name>
    </author>
    <author>
      <name>Yi-Li Hsu</name>
    </author>
    <author>
      <name>Minjoon Choi</name>
    </author>
    <author>
      <name>Zongwan Cao</name>
    </author>
    <author>
      <name>Chenjun Xu</name>
    </author>
    <author>
      <name>Bingbing Wen</name>
    </author>
    <author>
      <name>Su Lin Blodgett</name>
    </author>
    <author>
      <name>Lucy Lu Wang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07936v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07936v2" />
    <published>2026-06-06T01:55:56Z</published>
    <updated>2026-06-09T12:36:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Katelyn Xiaoying Mei et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL 2026 Main&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] The CIFAR Synthetic Evidence Corpus for Detecting AI-Generated Evidence</title>
    <author>
      <name>Kelly McConvey</name>
    </author>
    <author>
      <name>Jalehsadat Mahdavimoghaddam</name>
    </author>
    <author>
      <name>Nima Jamali</name>
    </author>
    <author>
      <name>Maksym Taranukhin</name>
    </author>
    <author>
      <name>Sajad Ebrahimi</name>
    </author>
    <author>
      <name>Wentao Zhang</name>
    </author>
    <author>
      <name>Yuntian Deng</name>
    </author>
    <author>
      <name>Karen Eltis</name>
    </author>
    <author>
      <name>Maura R. Grossman</name>
    </author>
    <author>
      <name>Vered Shwartz</name>
    </author>
    <author>
      <name>Ebrahim Bagheri</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07916v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07916v1" />
    <published>2026-06-06T00:43:13Z</published>
    <updated>2026-06-06T00:43:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kelly McConvey et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The growing ability of generative models to produce realistic documents poses a direct challenge to evidentiary workflows in the justice system and the courts, where decisions increasingly depend on the authenticity of evidence such as receipts, communications, and administrative records. Unlike social media or academic settings, evidentiary documents are often only subtly altered, with small, localized edits that preserve overall plausibility while changing legal meaning. Yet progress on automated detection remains limited, largely due to the absence of suitable training and evaluation data especially suited for the justice system requirements. Existing resources are either focused on photos of human faces or natural scenery or on narrowly scoped academic or social media document types, and do not capture the structure, diversity, or manipulation patterns characteristic of real-world evidentiary data. As a result, current detection systems do not necessarily learn meaningful signals appropriate for the justice system. We introduce the CIFAR Synthetic Evidence Corpus, a dataset designed to enable rigorous evaluation of evidence verification under realistic and controlled conditions. The corpus spans multiple document families and a spectrum of manipulation strategies, from small field-level edits to complete document fabrication, and is constructed using a diverse set of state-of-the-art generative tools. It is organized to systematically vary both manipulation complexity and generation method, while enforcing source-level separation between training and test data to reflect real-world generalization challenges.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence</title>
    <author>
      <name>Haoran Xu</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07834v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07834v1" />
    <published>2026-06-05T20:51:51Z</published>
    <updated>2026-06-05T20:51:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haoran Xu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC&amp;#x27;s Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p &amp;lt; 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 1 figure&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Jas: AI-Paired Engineering as a Revival of N-Version Programming</title>
    <author>
      <name>Jason Hickey</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07828v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07828v1" />
    <published>2026-06-05T20:30:08Z</published>
    <updated>2026-06-05T20:30:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jason Hickey&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;I report a case study in AI-paired software engineering: five working ports of a vector illustration application across Rust, Swift, OCaml, Python, and browser-based platforms, built by a single developer in approximately 120 evening hours. The methodology pairs AI-assisted implementation with two safeguards -- a precise executable YAML specification serving as the single source of truth, and parallel implementations functioning as a built-in differential-testing layer. The five ports share a 23,000-line specification; per-port native code ranges from 0 to roughly 95,000 lines, reflecting the specification&amp;#x27;s escape hatch. I argue that AI-paired engineering, conditional on these two safeguards, makes feasible a scope of work that conventionally requires multiple developer-years, and frame the methodology as a revival of N-version programming, a 1980s approach abandoned on cost grounds that AI changes. The paper reports concrete artifacts and honest limitations of the single-developer case study.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Scaling Participation in Modular AI Systems</title>
    <author>
      <name>Shangbin Feng</name>
    </author>
    <author>
      <name>Yike Wang</name>
    </author>
    <author>
      <name>Weijia Shi</name>
    </author>
    <author>
      <name>Luke Zettlemoyer</name>
    </author>
    <author>
      <name>Yejin Choi</name>
    </author>
    <author>
      <name>Yulia Tsvetkov</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07812v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07812v1" />
    <published>2026-06-05T19:39:35Z</published>
    <updated>2026-06-05T19:39:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shangbin Feng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor&amp;#x27;s original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] SLMJury: Can Small Language Models Judge as Well as Large Ones?</title>
    <author>
      <name>Anish Laddha</name>
    </author>
    <author>
      <name>Nitesh Pradhan</name>
    </author>
    <author>
      <name>Gaurav Srivastava</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07810v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07810v1" />
    <published>2026-06-05T19:38:15Z</published>
    <updated>2026-06-05T19:38:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Anish Laddha et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with &amp;lt;=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems</title>
    <author>
      <name>Yiyang Zhao</name>
    </author>
    <author>
      <name>Zhuo Zhang</name>
    </author>
    <author>
      <name>Qingxuan Le</name>
    </author>
    <author>
      <name>Lizhen Qu</name>
    </author>
    <author>
      <name>Zenglin Xu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07805v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07805v1" />
    <published>2026-06-05T19:33:58Z</published>
    <updated>2026-06-05T19:33:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yiyang Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to &amp;quot;Machiavellian&amp;quot; behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart&amp;#x27;s Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed - Evolve - Refine - Verify) pipeline, an &amp;quot;Agent-as-a-Benchmark&amp;quot; paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce two novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[astro-ph.IM] Beyond Point Estimates: Benchmarking Uncertainty Quantification Methods on the AION-1 Astronomical Foundation Model</title>
    <author>
      <name>Karla Tame-Narvaez</name>
    </author>
    <author>
      <name>Aleksandra Ćiprijanović</name>
    </author>
    <author>
      <name>Shubhendu Trivedi</name>
    </author>
    <category term="astro-ph.IM" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07771v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07771v1" />
    <published>2026-06-05T18:29:19Z</published>
    <updated>2026-06-05T18:29:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Karla Tame-Narvaez et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Foundation models for astronomical surveys offer powerful learned representations that can be transferred to downstream regression tasks such as galaxy property estimation. However, point predictions alone are insufficient for scientific inference; reliable uncertainty quantification (UQ) is essential. We compare seven UQ methods on galaxy property regression using frozen AION-1 foundation-model embeddings, predicting redshift, stellar mass, stellar-population age, gas-phase metallicity, and specific star-formation rate, from Legacy Survey photometry/imaging and DESI spectra, with PROVABGS-derived labels. Distribution-free conformal methods achieve marginal coverage within about 1 pp of the nominal 90% across all properties, while non-conformal baselines (Deep Ensembles, MC Dropout) fail to calibrate reliably. Among conformal approaches, Conformalized Quantile Regression (CQR) delivers the best coverage in the bin with the poorest model predictions. More importantly, only the Locally Valid and Discriminative (LVD) framework, particularly when operating on AION-1 embeddings, also provides finite-sample &lt;em&gt;local validity&lt;/em&gt;, producing intervals that adapt to each galaxy&amp;#x27;s local prediction difficulty rather than relying on marginal guarantees alone. These results establish conformal prediction, and LVD in particular, as the preferred UQ framework for uncertainty-aware inference on foundation-model embeddings in astrophysics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages, 1 table, 1 figure&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism</title>
    <author>
      <name>Cong Chen</name>
    </author>
    <author>
      <name>Guo Gan</name>
    </author>
    <author>
      <name>Kaixiang Ji</name>
    </author>
    <author>
      <name>ChaoYang Zhang</name>
    </author>
    <author>
      <name>Zhen Yang</name>
    </author>
    <author>
      <name>Guangming Yao</name>
    </author>
    <author>
      <name>Hao Chen</name>
    </author>
    <author>
      <name>Jingdong Chen</name>
    </author>
    <author>
      <name>Yi Yuan</name>
    </author>
    <author>
      <name>Chunhua Shen</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07512v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07512v1" />
    <published>2026-06-05T17:59:21Z</published>
    <updated>2026-06-05T17:59:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Cong Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM&amp;#x27;s performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification</title>
    <author>
      <name>Sercan Karakaş</name>
    </author>
    <author>
      <name>Yusuf Şimşek</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07479v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07479v1" />
    <published>2026-06-05T17:34:07Z</published>
    <updated>2026-06-05T17:34:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sercan Karakaş, Yusuf Şimşek&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL SRW 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders</title>
    <author>
      <name>Georgii Aparin</name>
    </author>
    <author>
      <name>Vadim Popov</name>
    </author>
    <author>
      <name>Tasnima Sadekova</name>
    </author>
    <author>
      <name>Assel Yermekova</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07473v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07473v1" />
    <published>2026-06-05T17:26:23Z</published>
    <updated>2026-06-05T17:26:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Georgii Aparin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper&amp;#x27;s internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Planning-aligned Token Compression for Long-Context Autonomous Driving</title>
    <author>
      <name>Zhixuan Liang</name>
    </author>
    <author>
      <name>Yuxiao Chen</name>
    </author>
    <author>
      <name>Yurong You</name>
    </author>
    <author>
      <name>Peter Karkus</name>
    </author>
    <author>
      <name>Wenhao Ding</name>
    </author>
    <author>
      <name>Boyi Li</name>
    </author>
    <author>
      <name>Alexander Popov</name>
    </author>
    <author>
      <name>Yan Wang</name>
    </author>
    <author>
      <name>Maximilian Igl</name>
    </author>
    <author>
      <name>Yiming Li</name>
    </author>
    <author>
      <name>Danfei Xu</name>
    </author>
    <author>
      <name>Nikolai Smolyanskiy</name>
    </author>
    <author>
      <name>Boris Ivanovic</name>
    </author>
    <author>
      <name>Ping Luo</name>
    </author>
    <author>
      <name>Marco Pavone</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07464v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07464v1" />
    <published>2026-06-05T17:16:21Z</published>
    <updated>2026-06-05T17:16:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhixuan Liang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve &amp;gt;6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with 3.3x speedup and 2.7x memory reduction over uncompressed processing.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;9 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle</title>
    <author>
      <name>Jiayu Wang</name>
    </author>
    <author>
      <name>Weijiang Lv</name>
    </author>
    <author>
      <name>Bowen Fu</name>
    </author>
    <author>
      <name>Jing Fu</name>
    </author>
    <author>
      <name>Jiayi Song</name>
    </author>
    <author>
      <name>Lingyu Zhang</name>
    </author>
    <author>
      <name>Lanxuan Xue</name>
    </author>
    <author>
      <name>Luodi Chen</name>
    </author>
    <author>
      <name>Zepeng Xin</name>
    </author>
    <author>
      <name>Kaiyu Li</name>
    </author>
    <author>
      <name>Xiangyong Cao</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07462v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07462v1" />
    <published>2026-06-05T17:13:36Z</published>
    <updated>2026-06-05T17:13:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiayu Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Watch, Remember, Reason: Human-View Video Understanding with MLLMs</title>
    <author>
      <name>Jiahao Meng</name>
    </author>
    <author>
      <name>Yue Tan</name>
    </author>
    <author>
      <name>Qi Xu</name>
    </author>
    <author>
      <name>Kuan Gao</name>
    </author>
    <author>
      <name>Weisong Liu</name>
    </author>
    <author>
      <name>Yanwei Li</name>
    </author>
    <author>
      <name>Jason Li</name>
    </author>
    <author>
      <name>Lingdong Kong</name>
    </author>
    <author>
      <name>Haochen Wang</name>
    </author>
    <author>
      <name>Qianyu Zhou</name>
    </author>
    <author>
      <name>Jiangning Zhang</name>
    </author>
    <author>
      <name>Guangliang Cheng</name>
    </author>
    <author>
      <name>Yunhai Tong</name>
    </author>
    <author>
      <name>Lu Qi</name>
    </author>
    <author>
      <name>Minghsuan Yang</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07433v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07433v1" />
    <published>2026-06-05T16:29:13Z</published>
    <updated>2026-06-05T16:29:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiahao Meng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model</title>
    <author>
      <name>Kaouther Mouheb</name>
    </author>
    <author>
      <name>Amos Pomp</name>
    </author>
    <author>
      <name>Antoine Manenti</name>
    </author>
    <author>
      <name>Romy de Haan</name>
    </author>
    <author>
      <name>Farog Faghir</name>
    </author>
    <author>
      <name>Joy Martens</name>
    </author>
    <author>
      <name>Harro Seelaar</name>
    </author>
    <author>
      <name>Francesco Mattace-Raso</name>
    </author>
    <author>
      <name>Meike W. Vernooij</name>
    </author>
    <author>
      <name>Frank J. Wolters</name>
    </author>
    <author>
      <name>Stefan Klein</name>
    </author>
    <author>
      <name>Esther E. Bron</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07721v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07721v1" />
    <published>2026-06-05T15:57:35Z</published>
    <updated>2026-06-05T15:57:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kaouther Mouheb et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies have assessed the performance of large language models (LLMs) on Dutch neuroradiology reports. Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports. Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection. Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Submitted to European Radiology&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline</title>
    <author>
      <name>Kai A. Horstmann</name>
    </author>
    <author>
      <name>Ethan Lin</name>
    </author>
    <author>
      <name>Alice A. Robie</name>
    </author>
    <author>
      <name>Jennifer J. Sun</name>
    </author>
    <author>
      <name>Kristin Branson</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07718v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07718v1" />
    <published>2026-06-05T15:38:18Z</published>
    <updated>2026-06-05T15:38:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kai A. Horstmann et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents&amp;#x27; code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents&amp;#x27; current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] SHIELD-IDS: Structurally Heterogeneous Ensemble with Integrated Layered Defense for Intrusion Detection Systems</title>
    <author>
      <name>Maryam Zaman</name>
    </author>
    <author>
      <name>Muhammad Khuram Shahzad</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07716v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07716v1" />
    <published>2026-06-05T15:25:51Z</published>
    <updated>2026-06-05T15:25:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Maryam Zaman, Muhammad Khuram Shahzad&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Adversarial attacks pose a serious and growing threat to Machine Learning (ML)-based Intrusion Detection Systems (IDS), where imperceptible perturbations to network flow features can systematically mislead classifiers into accepting malicious traffic as benign. The IDS-Anta framework partially addresses this through Z-score normalization, Singular Value Decomposition (SVD), and Multi-Armed Bandit (MAB) classifier selection with Thompson Sampling, yet its classifier pool lacks sufficient structural diversity for robust adversarial resistance. This work introduces IDS-Anta++, which incorporates XGBoost and LightGBM gradient boosting models into the ensemble and wraps the extended pool in a three-layer black-box defense: Isolation Forest anomaly screening, median feature smoothing, and six-way majority voting. Experiments conducted on CIC-IDS-2017, CSE-CIC-IDS-2018, and CIC-DDoS-2019 under both Fast Gradient Sign Method (FGSM) and Zeroth Order Optimization (ZOO) attacks confirm detection accuracy above 99% on clean data, with measurable robustness gains under adversarial conditions relative to the baseline IDS-Anta configuration.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;10 pages, 5 figures, 7 tables. Code available at: https://github.com/maryamzaman-git/SHEILD-IDS&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.IV] Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios</title>
    <author>
      <name>Prabhjot Kaur</name>
    </author>
    <author>
      <name>Hakim Ouaalam</name>
    </author>
    <author>
      <name>Sedat Kandemirli</name>
    </author>
    <author>
      <name>Sanjay P. Prabhu</name>
    </author>
    <author>
      <name>Simon K. Warfield</name>
    </author>
    <category term="eess.IV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07381v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07381v1" />
    <published>2026-06-05T15:21:22Z</published>
    <updated>2026-06-05T15:21:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Prabhjot Kaur et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations.
  Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from three sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls).
  Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p &amp;lt; 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01).
  Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests</title>
    <author>
      <name>Thanawat Lodkaew</name>
    </author>
    <author>
      <name>Johannes Ackermann</name>
    </author>
    <author>
      <name>Soichiro Nishimori</name>
    </author>
    <author>
      <name>Nontawat Charoenphakdee</name>
    </author>
    <author>
      <name>Masashi Sugiyama</name>
    </author>
    <author>
      <name>Takashi Ishida</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07379v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07379v2" />
    <published>2026-06-05T15:20:37Z</published>
    <updated>2026-06-08T01:53:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thanawat Lodkaew et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge</title>
    <author>
      <name>Marc Aubreville</name>
    </author>
    <author>
      <name>Jonas Ammeling</name>
    </author>
    <author>
      <name>Sweta Banerjee</name>
    </author>
    <author>
      <name>Viktoria Weiss</name>
    </author>
    <author>
      <name>Taryn A. Donovan</name>
    </author>
    <author>
      <name>Robert Klopfleisch</name>
    </author>
    <author>
      <name>Jiaqi Lv</name>
    </author>
    <author>
      <name>Shan E Ahmed Raza</name>
    </author>
    <author>
      <name>Raphaël Bourgade</name>
    </author>
    <author>
      <name>Thomas Walter</name>
    </author>
    <author>
      <name>Yasemin Topuz</name>
    </author>
    <author>
      <name>Songül Varlı</name>
    </author>
    <author>
      <name>Charles-Antoine Collins-Fekete</name>
    </author>
    <author>
      <name>Zhuoyan Shen</name>
    </author>
    <author>
      <name>Navya Sri Kelam</name>
    </author>
    <author>
      <name>Nitin Singhal</name>
    </author>
    <author>
      <name>Christian Marzahl</name>
    </author>
    <author>
      <name>Brian Napora</name>
    </author>
    <author>
      <name>Tengyou Xu</name>
    </author>
    <author>
      <name>Hongyan Gu</name>
    </author>
    <author>
      <name>Mario Vento</name>
    </author>
    <author>
      <name>Gennaro Percannella</name>
    </author>
    <author>
      <name>Norbert Ropiak</name>
    </author>
    <author>
      <name>Izabela Wasiak</name>
    </author>
    <author>
      <name>Jie Xiao</name>
    </author>
    <author>
      <name>Shaojun Liu</name>
    </author>
    <author>
      <name>Seungho Choe</name>
    </author>
    <author>
      <name>April Khademi</name>
    </author>
    <author>
      <name>Vidushi Walia</name>
    </author>
    <author>
      <name>Sujatha Kotte</name>
    </author>
    <author>
      <name>Andrew Broad</name>
    </author>
    <author>
      <name>Alex Wright</name>
    </author>
    <author>
      <name>Guillaume Balezo</name>
    </author>
    <author>
      <name>Esha Sadia Nasir</name>
    </author>
    <author>
      <name>Mostafa Jahanifar</name>
    </author>
    <author>
      <name>Yosuke Yamagishi</name>
    </author>
    <author>
      <name>Shouhei Hanaoka</name>
    </author>
    <author>
      <name>Mattia Sarno</name>
    </author>
    <author>
      <name>Francesco Tortorella</name>
    </author>
    <author>
      <name>Biwen Meng</name>
    </author>
    <author>
      <name>Jingxin Liu</name>
    </author>
    <author>
      <name>Sara Krauss</name>
    </author>
    <author>
      <name>Daniel Hieber</name>
    </author>
    <author>
      <name>Lavish Ramchandani</name>
    </author>
    <author>
      <name>Dev Kumar Das</name>
    </author>
    <author>
      <name>Mieko Ochi</name>
    </author>
    <author>
      <name>Yuan Bae</name>
    </author>
    <author>
      <name>Piotr Giedziun</name>
    </author>
    <author>
      <name>Mateusz Maniewski</name>
    </author>
    <author>
      <name>Vangala Govindakrishnan Saipradeep</name>
    </author>
    <author>
      <name>Naveen Sivadasan</name>
    </author>
    <author>
      <name>Leire Benito-Del-Valle</name>
    </author>
    <author>
      <name>Adrian Galdran</name>
    </author>
    <author>
      <name>Kaustubh Atey</name>
    </author>
    <author>
      <name>Sameer Anand Jha</name>
    </author>
    <author>
      <name>Adinath Dukre</name>
    </author>
    <author>
      <name>Imran Razzak</name>
    </author>
    <author>
      <name>Maxime W. Lafarge</name>
    </author>
    <author>
      <name>Viktor H. Koelzer</name>
    </author>
    <author>
      <name>Nils Porsche</name>
    </author>
    <author>
      <name>Nikolas Stathonikos</name>
    </author>
    <author>
      <name>Mitko Veta</name>
    </author>
    <author>
      <name>Dominik Hirling</name>
    </author>
    <author>
      <name>Zsanett Zsófia Iván</name>
    </author>
    <author>
      <name>Peter Horvath</name>
    </author>
    <author>
      <name>Katharina Breininger</name>
    </author>
    <author>
      <name>Christof A. Bertram</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07368v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07368v1" />
    <published>2026-06-05T15:11:08Z</published>
    <updated>2026-06-05T15:11:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Marc Aubreville et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical &amp;quot;real-world&amp;quot; application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge required detection also in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging ROIs, where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting &amp;quot;blind spots&amp;quot; in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that &amp;quot;in the wild&amp;quot; mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] A robust PPG foundation model using multimodal physiological supervision</title>
    <author>
      <name>Eloy Geenjaar</name>
    </author>
    <author>
      <name>Vince Calhoun</name>
    </author>
    <author>
      <name>Scott Daly</name>
    </author>
    <author>
      <name>Gouthaman KV</name>
    </author>
    <author>
      <name>Lie Lu</name>
    </author>
    <author>
      <name>Trisha Mittal</name>
    </author>
    <author>
      <name>Daniel P. Darcy</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07365v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07365v1" />
    <published>2026-06-05T15:08:50Z</published>
    <updated>2026-06-05T15:08:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Eloy Geenjaar et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models</title>
    <author>
      <name>Hamideh Ghanadian</name>
    </author>
    <author>
      <name>Isar Nejadgholi</name>
    </author>
    <author>
      <name>Hussein Al Osman</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07714v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07714v1" />
    <published>2026-06-05T14:46:50Z</published>
    <updated>2026-06-05T14:46:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hamideh Ghanadian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] SV-Detect: AI-generated Text Detection with Steering Vectors</title>
    <author>
      <name>Mikhail Vishnyakov</name>
    </author>
    <author>
      <name>Tatiana Gaintseva</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07313v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07313v1" />
    <published>2026-06-05T14:34:37Z</published>
    <updated>2026-06-05T14:34:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mikhail Vishnyakov, Tatiana Gaintseva&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition</title>
    <author>
      <name>Iosif Tsangko</name>
    </author>
    <author>
      <name>Andreas Triantafyllopoulos</name>
    </author>
    <author>
      <name>Björn W. Schuller</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07309v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07309v1" />
    <published>2026-06-05T14:26:06Z</published>
    <updated>2026-06-05T14:26:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Iosif Tsangko et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, 3 figures, 3 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning</title>
    <author>
      <name>Lingyong Yan</name>
    </author>
    <author>
      <name>Can Xu</name>
    </author>
    <author>
      <name>Yukun Zhao</name>
    </author>
    <author>
      <name>Wenxuan Li</name>
    </author>
    <author>
      <name>Qingyang Chen</name>
    </author>
    <author>
      <name>Jiulong Wu</name>
    </author>
    <author>
      <name>Wenli Song</name>
    </author>
    <author>
      <name>Xiangnan Li</name>
    </author>
    <author>
      <name>Weixian Shi</name>
    </author>
    <author>
      <name>Yiqun Chen</name>
    </author>
    <author>
      <name>Xuchen Ma</name>
    </author>
    <author>
      <name>Yuchen Li</name>
    </author>
    <author>
      <name>Jiashu Zhao</name>
    </author>
    <author>
      <name>Shuaiqiang Wang</name>
    </author>
    <author>
      <name>Jianmin Wu</name>
    </author>
    <author>
      <name>Dawei Yin</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07299v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07299v1" />
    <published>2026-06-05T14:10:48Z</published>
    <updated>2026-06-05T14:10:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lingyong Yan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Technical report by the DuMate Team. 26 pages, 6 figures, 4 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.NE] The Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs</title>
    <author>
      <name>Anmol Guragain</name>
    </author>
    <author>
      <name>Savvas Kakalis</name>
    </author>
    <author>
      <name>Juan Ignacio Godino-Llorente</name>
    </author>
    <category term="cs.NE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09902v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09902v1" />
    <published>2026-06-05T13:59:23Z</published>
    <updated>2026-06-05T13:59:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Anmol Guragain et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reservoir computing exploits the fixed dynamics of a recurrent network for temporal processing, requiring only a trained linear readout. Biological neural connectomes, shaped by millions of years of evolution, may encode computational structure beyond what random reservoirs provide, yet whether that structure can be further enhanced by principled optimisation remains an open question. We address it by applying four gradient-free, bio-inspired optimisers (Particle Swarm Optimisation, Differential Evolution, Grey Wolf Optimiser, and Whale Optimisation Algorithm) to the edge weights of connectome-based echo-state networks across six species spanning six orders of magnitude in neural complexity: C. elegans (279 neurons), Drosophila (49 nodes), mouse (112), rat (73), macaque (29 regions, continuous FLNe synaptic strengths), and human structural MRI connectivity (83 parcels). Each connectome is evaluated on four canonical reservoir computing benchmarks: Memory Capacity (MC), Lorenz attractor prediction, NARMA-10 system identification, and Mackey-Glass chaotic time-series prediction. All four optimisers consistently outperform unoptimised biological baselines across every task and species when initialised from biological weights. WOA achieves the largest gains on every task: up to a 17x MC improvement (C. elegans: 1.39 to 23.91) and up to 89% NRMSE reduction (Mackey-Glass, human), corresponding to an average 214% improvement across all species and tasks. Crucially, random initialisation on the same topology reliably underperforms biology, establishing biological weight values as an essential inductive bias that topology alone cannot recover. These results position bio-inspired, biologically-initialised optimisation as a principled and broadly effective strategy for connectome reservoir computing across the animal kingdom.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path</title>
    <author>
      <name>Thomas Sesmat</name>
    </author>
    <author>
      <name>Gabriel Meseguer-Brocal</name>
    </author>
    <author>
      <name>Geoffroy Peeters</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07271v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07271v1" />
    <published>2026-06-05T13:46:37Z</published>
    <updated>2026-06-05T13:46:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thomas Sesmat et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_λ = (1-λ)X_0 + λX_1$ that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data that follows a bell-shaped curve over $λ$, which accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $λ$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026 article, 9 main pages and 25 with annexes, 11 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] TOPSIS-RAD: Ranking According to Desires</title>
    <author>
      <name>Leonardo Fernandes Costa</name>
    </author>
    <author>
      <name>Helder Gomes Costa</name>
    </author>
    <author>
      <name>Diogo Lima</name>
    </author>
    <author>
      <name>Brunno Rodrigues</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07253v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07253v1" />
    <published>2026-06-05T13:26:11Z</published>
    <updated>2026-06-05T13:26:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Leonardo Fernandes Costa et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Traditional TOPSIS derives its reference points -- the Positive Ideal Solution ($PIS$) and Negative Ideal Solution ($NIS$) -- from the observed alternative set, making rankings susceptible to misalignment with decision-maker (DM) requirements, sensitivity to outlier performances, and rank reversal. This paper proposes TOPSIS-RAD, which addresses these issues by incorporating two arrays of DM-defined reference levels. Vetoed Performance Levels ($VPL$) exclude non-viable alternatives before normalisation, preventing them from distorting the ranking frontiers. Desired Performance Levels ($DPL$) cap performances at the DM&amp;#x27;s desired level before normalisation, anchoring the $PIS$ in explicit aspirations rather than dataset extremes. Three toy examples demonstrate each mechanism: $VPL$ reshapes normalisation boundaries by removing a non-viable alternative; fixed $DPL$ frontiers stabilise rankings by limiting the influence of performances well above the desired level. The method preserves the familiar distance-based structure of TOPSIS while grounding the ranking in stable, DM-specified boundaries. Limitations and future research directions are also discussed.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;21 pages, 15 tables, and 6 figures. The numerical computations for the toy examples were supported by the Visual TOPSIS RAD app, available at https://topsis-ranking.vercel.app/. The toy-example data are also available at this URL and can be loaded in the app as the template &amp;quot;Article&amp;quot;.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios</title>
    <author>
      <name>Tongzhou Yu</name>
    </author>
    <author>
      <name>Mingjia Li</name>
    </author>
    <author>
      <name>Hong Qian</name>
    </author>
    <author>
      <name>Wenkai Wang</name>
    </author>
    <author>
      <name>Zongbao Zhang</name>
    </author>
    <author>
      <name>Yaoyu Jiang</name>
    </author>
    <author>
      <name>Xiangfeng Wang</name>
    </author>
    <author>
      <name>Aimin Zhou</name>
    </author>
    <author>
      <name>Jiajun Guo</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07226v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07226v1" />
    <published>2026-06-05T12:42:56Z</published>
    <updated>2026-06-05T12:42:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tongzhou Yu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by KDD 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History</title>
    <author>
      <name>Liuyin Wang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09900v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09900v1" />
    <published>2026-06-05T11:43:56Z</published>
    <updated>2026-06-05T11:43:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Liuyin Wang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time (&amp;quot;as-of&amp;quot;) filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram&amp;#x27;s lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p &amp;lt; 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Textual Supervision Enhances Geospatial Representations in Vision-Language Models</title>
    <author>
      <name>Marcelo Sartori Locatelli</name>
    </author>
    <author>
      <name>Fernando Tonucci</name>
    </author>
    <author>
      <name>Jea Kwon</name>
    </author>
    <author>
      <name>Luiz Felipe Vecchietti</name>
    </author>
    <author>
      <name>Bryan Nathanael Wijaya</name>
    </author>
    <author>
      <name>Cheng Yaw Low</name>
    </author>
    <author>
      <name>Virgilio Almeida</name>
    </author>
    <author>
      <name>Meeyoung Cha</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07172v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07172v1" />
    <published>2026-06-05T11:40:13Z</published>
    <updated>2026-06-05T11:40:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Marcelo Sartori Locatelli et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding</title>
    <author>
      <name>Ahmer Tabassum</name>
    </author>
    <author>
      <name>Sarfraz Ahmad</name>
    </author>
    <author>
      <name>Hasan Iqbal</name>
    </author>
    <author>
      <name>Owais Aijaz</name>
    </author>
    <author>
      <name>Momina Ahsan</name>
    </author>
    <author>
      <name>Preslav Nakov</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07167v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07167v1" />
    <published>2026-06-05T11:35:27Z</published>
    <updated>2026-06-05T11:35:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ahmer Tabassum et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;27 pages, 18 figures, 17 tables, Submitted to ARR May 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability</title>
    <author>
      <name>Bijaya Dangol</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07150v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07150v1" />
    <published>2026-06-05T11:07:55Z</published>
    <updated>2026-06-05T11:07:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bijaya Dangol&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HTTP(S). Such transports protect message content, increasingly with end-to-end encryption. What they leave in the clear is the communication graph: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are often capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships. It can infer the pending workflow, the task being assembled and the action likely to follow. At machine speed, it can act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone: predictive leverage over autonomous action. We give a threat model for the agent communication graph; identify what makes agent metadata distinctively revealing (semanticity, prospectivity, actuation); define transport- and bootstrap-layer privacy properties and weigh candidate transports (SimpleX/SMP, Tor, mixnets) against them; and present an A2A case study in which a metadata-protecting binding is expressible but surfaces the protocol&amp;#x27;s identity assumptions. We test these on a generative model anchored to a real A2A capture. From passive metadata alone, with no payloads, a classifier recovers a task&amp;#x27;s class well above chance, from only the workflow&amp;#x27;s opening; applied together, the properties drive that recovery sharply back toward chance. Beyond what an observer can recover, we measure the leverage of acting on the leak: from a workflow&amp;#x27;s opening and under a fixed budget, an adversary choosing which workflows to act on realizes in this model most of a clairvoyant attacker&amp;#x27;s advantage over a metadata-blind one, and the same properties suppress it.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 6 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization</title>
    <author>
      <name>Prakhar Bhardwaj</name>
    </author>
    <author>
      <name>Simone Weikl</name>
    </author>
    <author>
      <name>Kilian Mang</name>
    </author>
    <author>
      <name>Elia Jonas Sandtner</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07708v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07708v1" />
    <published>2026-06-05T11:01:42Z</published>
    <updated>2026-06-05T11:01:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Prakhar Bhardwaj et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird&amp;#x27;s-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near-far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference</title>
    <author>
      <name>Anurag Sharma</name>
    </author>
    <author>
      <name>Sai Teja Chunchu</name>
    </author>
    <author>
      <name>Prasenjit Mitra</name>
    </author>
    <author>
      <name>Sandipan Sikdar</name>
    </author>
    <author>
      <name>Koustav Rudra</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07141v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07141v1" />
    <published>2026-06-05T10:51:20Z</published>
    <updated>2026-06-05T10:51:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Anurag Sharma et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Language models for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor data removal is resource-intensive. While several machine unlearning methods exist, their utility is generally restricted to non-medical domains. Moreover, the existing benchmarks for evaluating such unlearning methods primarily utilize synthetically curated datasets, which are not truly representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain is largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multi-class clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike the existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics involving performance both in terms of utility and extent of unlearning achieved. REMEDI is developed using the MIMIC-III clinical database that contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate that there exists a trade-off between utility and unlearning performance. They are also largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Under review&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models</title>
    <author>
      <name>Rishabh Makwana</name>
    </author>
    <author>
      <name>Mamta</name>
    </author>
    <author>
      <name>Deeksha Varshney</name>
    </author>
    <author>
      <name>Oana Cocarascu</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07706v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07706v1" />
    <published>2026-06-05T10:10:53Z</published>
    <updated>2026-06-05T10:10:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rishabh Makwana et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-Language Models (VLMs) have demonstrated strong performance across multimodal tasks, yet their safety robustness remains an open challenge. While prior work has shown that structured visual prompts such as flowcharts can effectively jailbreak VLMs, existing studies are largely limited to English-centric settings. In this paper, we introduce MLingualFC, a multilingual multimodal benchmark designed to evaluate jailbreak vulnerabilities of VLMs across diverse languages using structured flowchart representations. MLingualFC encodes harmful instructions into flowchart images across five languages (Hindi, Punjabi, Spanish, Romanian, and German). We evaluate state-of-the-art multilingual VLMs, including Qwen2.5-VL, Gemma-4, and Pangea, under a black-box threat model. Our results reveal significant multilingual safety gaps. Flowchart-based attacks achieve high attack success rates (ASR) for Latin-script languages, demonstrating that visual encoding of harmful content effectively bypasses safety alignment across languages. In contrast, non-Latin-script languages such as Punjabi exhibit substantially lower ASR, suggesting potential limitations in visual text recognition rather than stronger safety alignment. These findings highlight that current VLM safety mechanisms fail to generalize across languages and modalities. Resources are available at https://github.com/Rishabhpm23/MLingualFC&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling</title>
    <author>
      <name>Tengyao Tu</name>
    </author>
    <author>
      <name>Yulin Li</name>
    </author>
    <author>
      <name>Hui-Ling Zhen</name>
    </author>
    <author>
      <name>Libo Qin</name>
    </author>
    <author>
      <name>Zhoujun Wei</name>
    </author>
    <author>
      <name>Jinghua Piao</name>
    </author>
    <author>
      <name>Zhuotao Tian</name>
    </author>
    <author>
      <name>Yong Li</name>
    </author>
    <author>
      <name>Min Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07108v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07108v2" />
    <published>2026-06-05T10:02:19Z</published>
    <updated>2026-06-08T06:45:26Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tengyao Tu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as &amp;quot;overthinking&amp;quot;. Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM&amp;#x27;s step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code is available at https://github.com/yu-lin-li/DyCon.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models</title>
    <author>
      <name>Yuchen He</name>
    </author>
    <author>
      <name>Baolong Bi</name>
    </author>
    <author>
      <name>Shenghua Liu</name>
    </author>
    <author>
      <name>Huaming Liao</name>
    </author>
    <author>
      <name>Yuyao Ge</name>
    </author>
    <author>
      <name>Bolin Wan</name>
    </author>
    <author>
      <name>Siqian Tong</name>
    </author>
    <author>
      <name>Juan Chen</name>
    </author>
    <author>
      <name>Jiafeng Guo</name>
    </author>
    <author>
      <name>Xueqi Cheng</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07705v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07705v1" />
    <published>2026-06-05T10:00:19Z</published>
    <updated>2026-06-05T10:00:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuchen He et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension&amp;#x27;s reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at https://github.com/Zhaolutuan/SAW&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 7 figures, 5 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection</title>
    <author>
      <name>Taisei Saito</name>
    </author>
    <author>
      <name>Koretaka Ogata</name>
    </author>
    <author>
      <name>Takafumi Hiroi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07102v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07102v1" />
    <published>2026-06-05T09:53:30Z</published>
    <updated>2026-06-05T09:53:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Taisei Saito et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 6 figures, Accepted at IJCNN 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] MetaConfigurator: AI-Assisted RDF Authoring from JSON Data</title>
    <author>
      <name>Felix Neubauer</name>
    </author>
    <author>
      <name>Mahdi Jafarkhani</name>
    </author>
    <author>
      <name>Kenichi Endo</name>
    </author>
    <author>
      <name>Jürgen Pleiss</name>
    </author>
    <author>
      <name>Benjamin Uekermann</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07094v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07094v1" />
    <published>2026-06-05T09:42:01Z</published>
    <updated>2026-06-05T09:42:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Felix Neubauer et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Scientific workflows increasingly generate structured JSON data that is easy to exchange but difficult to interpret consistently across systems due to a lack of semantic interoperability. While JSON Schema ensures structural validation, it provides no native support for Linked Data semantics.
  This paper presents an RDF Authoring View extending the open-source JSON Schema editor MetaConfigurator, enabling researchers to transform existing JSON, YAML, or CSV data into RDF using AI-assisted RML mappings, refine triples, execute SPARQL queries, visualize knowledge graphs, and export RDF serializations within a single integrated web interface. This workflow is supported by ontology-aware IRI auto-completion, bidirectional synchronization between JSON-LD text views and RDF triple tables, and AI-assisted SPARQL query generation from natural language hints.
  We demonstrate the workflow using laboratory data from metal-organic framework (MOF) synthesis experiments. Protocol data describing reagents, procedure steps, and quantities is converted from JSON to ontology-based JSON-LD via RML mappings. We then refine the semantic representation, query relationships between experimental conditions and outcomes, and explore the resulting knowledge graph interactively. This integrated environment bridges conventional structured data management with Semantic Web technologies while preserving experimental context and lowering technical barriers through AI assistance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Submitted as post-proceedings for the deRSE26 conference&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] dots.tts Technical Report</title>
    <author>
      <name>Shi Lian</name>
    </author>
    <author>
      <name>Changtao Li</name>
    </author>
    <author>
      <name>Bohan Li</name>
    </author>
    <author>
      <name>Hankun Wang</name>
    </author>
    <author>
      <name>Da Zheng</name>
    </author>
    <author>
      <name>Junfeng Tian</name>
    </author>
    <author>
      <name>Yufeng Ma</name>
    </author>
    <author>
      <name>Colin Zhang</name>
    </author>
    <author>
      <name>Kai Yu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07080v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07080v1" />
    <published>2026-06-05T09:19:24Z</published>
    <updated>2026-06-05T09:19:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shi Lian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating</title>
    <author>
      <name>Zequn Xie</name>
    </author>
    <author>
      <name>Junjie Wang</name>
    </author>
    <author>
      <name>Dan Yang</name>
    </author>
    <author>
      <name>Jie Feng</name>
    </author>
    <author>
      <name>Yue Shen</name>
    </author>
    <author>
      <name>Jian Wang</name>
    </author>
    <author>
      <name>Jinjie Gu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07074v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07074v1" />
    <published>2026-06-05T09:10:50Z</published>
    <updated>2026-06-05T09:10:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zequn Xie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 8 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets</title>
    <author>
      <name>Zhenyu Yang</name>
    </author>
    <author>
      <name>Zemin Du</name>
    </author>
    <author>
      <name>Shengsheng Qian</name>
    </author>
    <author>
      <name>Changsheng Xu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07032v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07032v1" />
    <published>2026-06-05T08:23:25Z</published>
    <updated>2026-06-05T08:23:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhenyu Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP&amp;#x27;s pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.IV] DaX: Learning General Pathology Representations Across Scales</title>
    <author>
      <name>Bokai Zhao</name>
    </author>
    <author>
      <name>Yiyang Zhang</name>
    </author>
    <author>
      <name>Long Bai</name>
    </author>
    <author>
      <name>Tai Ma</name>
    </author>
    <author>
      <name>Hanqing Chao</name>
    </author>
    <author>
      <name>Minfeng Xu</name>
    </author>
    <category term="eess.IV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06983v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06983v1" />
    <published>2026-06-05T07:21:11Z</published>
    <updated>2026-06-05T07:21:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bokai Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios</title>
    <author>
      <name>Xinyi Li</name>
    </author>
    <author>
      <name>Zhen Fang</name>
    </author>
    <author>
      <name>Yongxin Deng</name>
    </author>
    <author>
      <name>Jinyuan Luo</name>
    </author>
    <author>
      <name>Hongnan Ma</name>
    </author>
    <author>
      <name>Changdae Oh</name>
    </author>
    <author>
      <name>Zijing Shi</name>
    </author>
    <author>
      <name>Shanshan Ye</name>
    </author>
    <author>
      <name>Hanchen Wang</name>
    </author>
    <author>
      <name>Shu-Lin Chen</name>
    </author>
    <author>
      <name>Yadan Luo</name>
    </author>
    <author>
      <name>Mengyue Yang</name>
    </author>
    <author>
      <name>Sean Du</name>
    </author>
    <author>
      <name>Sharon Li</name>
    </author>
    <author>
      <name>Ling Chen</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06959v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06959v1" />
    <published>2026-06-05T06:38:40Z</published>
    <updated>2026-06-05T06:38:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xinyi Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint. Code and data are available at https://github.com/Nellie179/Hallucination-Detection&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models</title>
    <author>
      <name>Sunoh Kim</name>
    </author>
    <author>
      <name>Daeho Um</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06943v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06943v1" />
    <published>2026-06-05T06:12:48Z</published>
    <updated>2026-06-05T06:12:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sunoh Kim, Daeho Um&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning</title>
    <author>
      <name>Rahul Nair</name>
    </author>
    <author>
      <name>Chun Tao</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06920v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06920v1" />
    <published>2026-06-05T05:34:13Z</published>
    <updated>2026-06-05T05:34:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rahul Nair, Chun Tao&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This &amp;quot;negative transfer&amp;quot; makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at https://github.com/gulguluu/tiny-slm-finetune-compare.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 6 figures, 2 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction</title>
    <author>
      <name>Jewon Yeom</name>
    </author>
    <author>
      <name>Hanseul Kim</name>
    </author>
    <author>
      <name>Jeongjae Park</name>
    </author>
    <author>
      <name>Sungmok Jung</name>
    </author>
    <author>
      <name>Jaejin Lee</name>
    </author>
    <author>
      <name>Taesup Kim</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07687v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07687v1" />
    <published>2026-06-05T04:43:02Z</published>
    <updated>2026-06-05T04:43:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jewon Yeom et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy</title>
    <author>
      <name>Yuan Zeng</name>
    </author>
    <author>
      <name>Yujia Shi</name>
    </author>
    <author>
      <name>Yuhao Yang</name>
    </author>
    <author>
      <name>Dongxia Liu</name>
    </author>
    <author>
      <name>Zongqing Lu</name>
    </author>
    <author>
      <name>Wenming Yang</name>
    </author>
    <author>
      <name>Qingmin Liao</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06903v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06903v1" />
    <published>2026-06-05T04:39:46Z</published>
    <updated>2026-06-05T04:39:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuan Zeng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICLR 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Blockchain Infrastructure for Intelligent Cyber-Physical-Social Systems: Post-Quantum Security, Interoperability, and Trustworthy Data Economies in the Era of Embodied AI</title>
    <author>
      <name>Song Guo</name>
    </author>
    <author>
      <name>Huawei Huang</name>
    </author>
    <author>
      <name>Dongping Liu</name>
    </author>
    <author>
      <name>Aoyu Zhang</name>
    </author>
    <author>
      <name>Luyao Zhang</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06895v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06895v1" />
    <published>2026-06-05T04:27:34Z</published>
    <updated>2026-06-05T04:27:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Song Guo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The deployment of embodied artificial intelligence via world-model-based robotics presents a transformative opportunity for blockchain infrastructure, establishing urgent demand for trustworthy data provenance, cross-organizational governance, and incentive-compatible sharing across decentralized ecosystems. Simultaneously, quantum computing advances recognized by the 2025 Nobel Prize in Physics and the Turing Award threaten the cryptographic primitives securing these data economies, creating an interdependent imperative: long-lived verification for embodied AI depends on crypto-agile architectures capable of withstanding quantum adversaries. This tutorial examines blockchain as the coordination layer bridging this dual transition, from financial substrate to foundational Cyber-Physical-Social Systems infrastructure that simultaneously secures against quantum cryptanalysis and enables scalable, trustworthy data economies. The session opens with an immersive AWS Braket demonstration engaging participants with superconducting, trapped-ion, and neutral-atom hardware to assess cryptographic threat timelines and witness ECDSA-to-post-quantum signature transitions. Five integrated modules progress from embodied AI and world-model requirements through quantum hardware reality and evidence-based security migration, to scalable cross-shard architectures via BrokerChain protocols, trustworthy data economies implementing Croissant metadata standards and robotic learning provenance, and industry ecosystem integration for multi-modal cloud deployment. By bridging quantum hardware realities with embodied AI data requirements, this tutorial charts blockchain as unified infrastructure for next-generation decentralized intelligent environments, providing open-source frameworks and roadmaps for architecting quantum-resistant, interoperable, and data-trustworthy systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising</title>
    <author>
      <name>Yuan Zeng</name>
    </author>
    <author>
      <name>Yujia Shi</name>
    </author>
    <author>
      <name>Zongqing Lu</name>
    </author>
    <author>
      <name>QingMin Liao</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06885v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06885v1" />
    <published>2026-06-05T04:06:42Z</published>
    <updated>2026-06-05T04:06:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuan Zeng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to IEEE ICASSP 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation</title>
    <author>
      <name>Yuan Zeng</name>
    </author>
    <author>
      <name>Zilue Gao</name>
    </author>
    <author>
      <name>Yujia Shi</name>
    </author>
    <author>
      <name>Zongqing Lu</name>
    </author>
    <author>
      <name>Wenming Yang</name>
    </author>
    <author>
      <name>QingMin Liao</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06872v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06872v1" />
    <published>2026-06-05T03:40:10Z</published>
    <updated>2026-06-05T03:40:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuan Zeng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to IEEE ICASSP 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces</title>
    <author>
      <name>Debjyoti Saha Roy</name>
    </author>
    <author>
      <name>Byron C. Wallace</name>
    </author>
    <author>
      <name>Javed A. Aslam</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06840v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06840v1" />
    <published>2026-06-05T02:32:24Z</published>
    <updated>2026-06-05T02:32:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Debjyoti Saha Roy et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: a broad &amp;quot;shortlisting&amp;quot; of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] LLM Agent-Assisted Reverse Engineering with Quantitative Readability Metrics</title>
    <author>
      <name>Neil Archibald</name>
    </author>
    <author>
      <name>Ruben Thijssen</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06838v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06838v1" />
    <published>2026-06-05T02:24:25Z</published>
    <updated>2026-06-05T02:24:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Neil Archibald, Ruben Thijssen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automatic decompilers produce functionally correct but often unreadable C code. This paper addresses one stage of the reverse engineering workflow: improving the readability of decompiled code using LLM agents guided by quantitative metrics.
  We present a three-phase research evolution. Phase 1 (tool-driven steering via Ghidra MCP) suffered from incomplete coverage and inconsistent improvements due to lack of quantitative guidance. Phase 2 (structural similarity validation alone) revealed that agents optimize for metrics in unintended ways, producing structurally equivalent but less readable code. Our contribution is the Quantitative Readability Score (QRS) framework, a composite metric combining a structural similarity gate with three independent readability sub-metrics (Lexical Surprisal, Structural Simplicity, and Idiomatic Quality).
  We demonstrate that QRS-guided refinement enables LLM agents to make targeted readability improvements without sacrificing correctness. We discuss the broader reverse engineering workflow (binary lifting, decompilation cleanup, and achieving functional equivalence) as context; however, it remains out of scope.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks</title>
    <author>
      <name>Jiani Xie</name>
    </author>
    <author>
      <name>Andrew C. Cullen</name>
    </author>
    <author>
      <name>Paul Montague</name>
    </author>
    <author>
      <name>Benjamin I. P. Rubinstein</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06833v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06833v1" />
    <published>2026-06-05T02:18:23Z</published>
    <updated>2026-06-05T02:18:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiani Xie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% -- a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation</title>
    <author>
      <name>Dahee Kwon</name>
    </author>
    <author>
      <name>Haeun Lee</name>
    </author>
    <author>
      <name>Jaesik Choi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06813v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06813v1" />
    <published>2026-06-05T01:27:44Z</published>
    <updated>2026-06-05T01:27:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dahee Kwon et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] OpenSkill: Open-World Self-Evolution for LLM Agents</title>
    <author>
      <name>Zhiling Yan</name>
    </author>
    <author>
      <name>Dingjie Song</name>
    </author>
    <author>
      <name>Hanrong Zhang</name>
    </author>
    <author>
      <name>Wei Liang</name>
    </author>
    <author>
      <name>Yuxuan Zhang</name>
    </author>
    <author>
      <name>Yutong Dai</name>
    </author>
    <author>
      <name>Lifang He</name>
    </author>
    <author>
      <name>Philip S. Yu</name>
    </author>
    <author>
      <name>Ran Xu</name>
    </author>
    <author>
      <name>Xiang Li</name>
    </author>
    <author>
      <name>Lichao Sun</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06741v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06741v1" />
    <published>2026-06-04T21:55:48Z</published>
    <updated>2026-06-04T21:55:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhiling Yan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets</title>
    <author>
      <name>Saket Reddy</name>
    </author>
    <author>
      <name>Shiwei Liu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06717v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06717v1" />
    <published>2026-06-04T21:06:31Z</published>
    <updated>2026-06-04T21:06:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Saket Reddy, Shiwei Liu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically &amp;quot;undruggable&amp;quot; oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape &amp;quot;lid&amp;quot; volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles</title>
    <author>
      <name>Upasana Chatterjee</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06715v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06715v1" />
    <published>2026-06-04T21:05:48Z</published>
    <updated>2026-06-04T21:05:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Upasana Chatterjee&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment--ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL SRW 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule</title>
    <author>
      <name>Xi Xuan</name>
    </author>
    <author>
      <name>Wenxin Zhang</name>
    </author>
    <author>
      <name>Yufei Zhou</name>
    </author>
    <author>
      <name>King-kui Sin</name>
    </author>
    <author>
      <name>Chunyu Kit</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06679v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06679v1" />
    <published>2026-06-04T19:53:12Z</published>
    <updated>2026-06-04T19:53:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xi Xuan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK&amp;#x27;s court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $κ= 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers</title>
    <author>
      <name>Tang Li</name>
    </author>
    <author>
      <name>Yanlin Chen</name>
    </author>
    <author>
      <name>Mengmeng Ma</name>
    </author>
    <author>
      <name>Xi Peng</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06664v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06664v1" />
    <published>2026-06-04T19:25:40Z</published>
    <updated>2026-06-04T19:25:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tang Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver</title>
    <author>
      <name>Cody J Christopher</name>
    </author>
    <author>
      <name>Charles Gretton</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06641v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06641v1" />
    <published>2026-06-04T18:47:45Z</published>
    <updated>2026-06-04T18:47:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Cody J Christopher, Charles Gretton&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures</title>
    <author>
      <name>Tanvi Thoria</name>
    </author>
    <author>
      <name>Kiana Jafari</name>
    </author>
    <author>
      <name>Marc R. Schlichting</name>
    </author>
    <author>
      <name>Mykel J. Kochenderfer</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06635v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06635v1" />
    <published>2026-06-04T18:36:42Z</published>
    <updated>2026-06-04T18:36:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tanvi Thoria et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework&amp;#x27;s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?</title>
    <author>
      <name>Richard Li</name>
    </author>
    <author>
      <name>Aditya Prakash</name>
    </author>
    <author>
      <name>Andrew Wen</name>
    </author>
    <author>
      <name>Saurabh Gupta</name>
    </author>
    <author>
      <name>Yilun Du</name>
    </author>
    <author>
      <name>Pulkit Agrawal</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06627v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06627v1" />
    <published>2026-06-04T18:24:23Z</published>
    <updated>2026-06-04T18:24:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Richard Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] FIGMA: Towards FIne-Grained Music retrievAl</title>
    <author>
      <name>Nishit Anand</name>
    </author>
    <author>
      <name>Ashish Seth</name>
    </author>
    <author>
      <name>Sreyan Ghosh</name>
    </author>
    <author>
      <name>Dinesh Manocha</name>
    </author>
    <author>
      <name>Ramani Duraiswami</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06615v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06615v1" />
    <published>2026-06-04T18:05:39Z</published>
    <updated>2026-06-04T18:05:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nishit Anand et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. Then, we propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Direct 3D-Aware Object Insertion via Decomposed Visual Proxies</title>
    <author>
      <name>Jingbo Gong</name>
    </author>
    <author>
      <name>Yikai Wang</name>
    </author>
    <author>
      <name>Yushi Lan</name>
    </author>
    <author>
      <name>Yuhao Wan</name>
    </author>
    <author>
      <name>Ziheng Ouyang</name>
    </author>
    <author>
      <name>Rui Zhao</name>
    </author>
    <author>
      <name>Ming-Ming Cheng</name>
    </author>
    <author>
      <name>Qibin Hou</name>
    </author>
    <author>
      <name>Chen Change Loy</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06601v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06601v1" />
    <published>2026-06-04T18:00:34Z</published>
    <updated>2026-06-04T18:00:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jingbo Gong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object&amp;#x27;s 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026; Project Page: https://gong1130.github.io/DIRECT/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers</title>
    <author>
      <name>Lizhi Yang</name>
    </author>
    <author>
      <name>Junheng Li</name>
    </author>
    <author>
      <name>Nehar Poddar</name>
    </author>
    <author>
      <name>Yiling Hou</name>
    </author>
    <author>
      <name>Gio Huh</name>
    </author>
    <author>
      <name>Robert Griffin</name>
    </author>
    <author>
      <name>Georgia Gkioxari</name>
    </author>
    <author>
      <name>Aaron Ames</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06493v3</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06493v3" />
    <published>2026-06-04T17:59:50Z</published>
    <updated>2026-06-09T07:34:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lizhi Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution</title>
    <author>
      <name>Liliana Hotsko</name>
    </author>
    <author>
      <name>Yinxi Li</name>
    </author>
    <author>
      <name>Yuntian Deng</name>
    </author>
    <author>
      <name>Pengyu Nie</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06492v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06492v1" />
    <published>2026-06-04T17:59:46Z</published>
    <updated>2026-06-04T17:59:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Liliana Hotsko et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA&amp;#x27;s code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection</title>
    <author>
      <name>Sondos Mahmoud Bsharat</name>
    </author>
    <author>
      <name>Jiacheng Liu</name>
    </author>
    <author>
      <name>Xiaohan Zhao</name>
    </author>
    <author>
      <name>Tianjun Yao</name>
    </author>
    <author>
      <name>Xinyi Shang</name>
    </author>
    <author>
      <name>Yi Tang</name>
    </author>
    <author>
      <name>Jiacheng Cui</name>
    </author>
    <author>
      <name>Ahmed Elhagry</name>
    </author>
    <author>
      <name>Salwa K. Al Khatib</name>
    </author>
    <author>
      <name>Hao Li</name>
    </author>
    <author>
      <name>Salman Khan</name>
    </author>
    <author>
      <name>Zhiqiang Shen</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06481v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06481v1" />
    <published>2026-06-04T17:58:05Z</published>
    <updated>2026-06-04T17:58:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sondos Mahmoud Bsharat et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Our code and data are available at https://github.com/VILA-Lab/OpAI-Bench&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Self-Augmenting Retrieval for Diffusion Language Models</title>
    <author>
      <name>Paul Jünger</name>
    </author>
    <author>
      <name>Justin Lovelace</name>
    </author>
    <author>
      <name>Linxi Zhao</name>
    </author>
    <author>
      <name>Dongyoung Go</name>
    </author>
    <author>
      <name>Kilian Q. Weinberger</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06474v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06474v1" />
    <published>2026-06-04T17:56:27Z</published>
    <updated>2026-06-04T17:56:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Paul Jünger et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery</title>
    <author>
      <name>Shangheng Du</name>
    </author>
    <author>
      <name>Xiangchao Yan</name>
    </author>
    <author>
      <name>Jinxin Shi</name>
    </author>
    <author>
      <name>Zongsheng Cao</name>
    </author>
    <author>
      <name>Shiyang Feng</name>
    </author>
    <author>
      <name>Zichen Liang</name>
    </author>
    <author>
      <name>Boyuan Sun</name>
    </author>
    <author>
      <name>Tianshuo Peng</name>
    </author>
    <author>
      <name>Yifan Zhou</name>
    </author>
    <author>
      <name>Xin Li</name>
    </author>
    <author>
      <name>Jie Zhou</name>
    </author>
    <author>
      <name>Liang He</name>
    </author>
    <author>
      <name>Bo Zhang</name>
    </author>
    <author>
      <name>Lei Bai</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06473v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06473v1" />
    <published>2026-06-04T17:55:59Z</published>
    <updated>2026-06-04T17:55:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shangheng Du et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training</title>
    <author>
      <name>Senmiao Wang</name>
    </author>
    <author>
      <name>Tiantian Fang</name>
    </author>
    <author>
      <name>Haoran Zhang</name>
    </author>
    <author>
      <name>Yushun Zhang</name>
    </author>
    <author>
      <name>Kunxiang Zhao</name>
    </author>
    <author>
      <name>Alex Schwing</name>
    </author>
    <author>
      <name>Ruoyu Sun</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06470v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06470v1" />
    <published>2026-06-04T17:55:11Z</published>
    <updated>2026-06-04T17:55:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Senmiao Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We propose a preconditioning (PC) layer, a weight parameterization via a polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer&amp;#x27;s singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Benchmark Everything Everywhere All at Once</title>
    <author>
      <name>Shiyun Xiong</name>
    </author>
    <author>
      <name>Dongming Wu</name>
    </author>
    <author>
      <name>Peiwen Sun</name>
    </author>
    <author>
      <name>Yuang Ai</name>
    </author>
    <author>
      <name>Bokang Yang</name>
    </author>
    <author>
      <name>Wencheng Han</name>
    </author>
    <author>
      <name>Xiao-Hui Li</name>
    </author>
    <author>
      <name>Xiangyu Yue</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06462v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06462v1" />
    <published>2026-06-04T17:52:04Z</published>
    <updated>2026-06-04T17:52:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shiyun Xiong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://benchmarkagent.github.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] MalTree: Tracing Malware Evolution from Embeddings at Scale</title>
    <author>
      <name>Akash Amalan</name>
    </author>
    <author>
      <name>Georgios Smaragdakis</name>
    </author>
    <author>
      <name>Tom J. Viering</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06570v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06570v1" />
    <published>2026-06-04T17:51:49Z</published>
    <updated>2026-06-04T17:51:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Akash Amalan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics-inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;33 pages, accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals</title>
    <author>
      <name>Thamilvendhan Munirathinam</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06460v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06460v1" />
    <published>2026-06-04T17:50:54Z</published>
    <updated>2026-06-04T17:50:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thamilvendhan Munirathinam&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol&amp;#x27;s existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 1 figure. Code, specification, and experiment harness: https://github.com/mthamil107/Recuse&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] In-Context Multiple Instance Learning</title>
    <author>
      <name>Alexander Möllers</name>
    </author>
    <author>
      <name>Marvin Sextro</name>
    </author>
    <author>
      <name>Julius Hense</name>
    </author>
    <author>
      <name>Gabriel Dernbach</name>
    </author>
    <author>
      <name>Klaus-Robert Müller</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06458v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06458v1" />
    <published>2026-06-04T17:50:32Z</published>
    <updated>2026-06-04T17:50:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Alexander Möllers et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes</title>
    <author>
      <name>Wenbo Li</name>
    </author>
    <author>
      <name>Xiaoliang Ju</name>
    </author>
    <author>
      <name>Zipeng Qin</name>
    </author>
    <author>
      <name>Rongyao Fang</name>
    </author>
    <author>
      <name>Hongsheng Li</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06390v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06390v1" />
    <published>2026-06-04T16:58:43Z</published>
    <updated>2026-06-04T16:58:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenbo Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Emergent Language as an Approach to Conscious AI</title>
    <author>
      <name>Zengqing Wu</name>
    </author>
    <author>
      <name>Chuan Xiao</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06380v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06380v1" />
    <published>2026-06-04T16:47:41Z</published>
    <updated>2026-06-04T16:47:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zengqing Wu, Chuan Xiao&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from a minimal state (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Source code available at https://github.com/wuzengqing001225/ConsciousAI_Indexicality/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management</title>
    <author>
      <name>Shweta Mishra</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06337v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06337v1" />
    <published>2026-06-04T16:12:28Z</published>
    <updated>2026-06-04T16:12:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shweta Mishra&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 10 figures. Code and benchmark available at https://github.com/Shweta-Mishra-ai/tokenmizer&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability</title>
    <author>
      <name>Seyed Arshan Dalili</name>
    </author>
    <author>
      <name>Mehrdad Mahdavi</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06333v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06333v1" />
    <published>2026-06-04T16:08:25Z</published>
    <updated>2026-06-04T16:08:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Seyed Arshan Dalili, Mehrdad Mahdavi&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group&amp;#x27;s effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[quant-ph] Quantum enhanced rare event discovery and sampling</title>
    <author>
      <name>Naixu Guo</name>
    </author>
    <author>
      <name>Po-Wei Huang</name>
    </author>
    <author>
      <name>Qisheng Wang</name>
    </author>
    <author>
      <name>Jayne Thompson</name>
    </author>
    <author>
      <name>Patrick Rebentrost</name>
    </author>
    <author>
      <name>Mile Gu</name>
    </author>
    <author>
      <name>Chengran Yang</name>
    </author>
    <category term="quant-ph" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06316v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06316v1" />
    <published>2026-06-04T15:54:53Z</published>
    <updated>2026-06-04T15:54:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Naixu Guo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Financial crashes, cascading failures in infrastructure, and critical errors in AI systems are frequently triggered by events that occur with extremely small probability. Efficiently discovering and sampling events with probability below a threshold is therefore of critical interest. Yet this task is highly non-trivial using existing classical or quantum methods. Being rare, such events require an immense sampling overhead to collect sufficient data samples. Moreover, because the rare events are not known in advance, they cannot be flagged for amplification using standard techniques. Here, we introduce a quantum algorithm for rare-event discovery and sampling without first learning which events are rare. The algorithm achieves the optimal quantum scaling with the rarity threshold. We further demonstrate that this can achieve a quadratic speedup for heavy-tailed systems whose tail has nonvanishing total mass, and translates into a robust polynomial speedup for stationary stochastic processes, with the exponent determined by its entropy-rate structure.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;36 pages (8+28)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] LLM Self-Recognition: Steering and Retrieving Activation Signatures</title>
    <author>
      <name>Thibaud Ardoin</name>
    </author>
    <author>
      <name>Jonas Schäfer</name>
    </author>
    <author>
      <name>Gerhard Wunder</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06315v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06315v1" />
    <published>2026-06-04T15:54:34Z</published>
    <updated>2026-06-04T15:54:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thibaud Ardoin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model&amp;#x27;s natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks</title>
    <author>
      <name>Wonmo Koo</name>
    </author>
    <author>
      <name>Sanha Chang</name>
    </author>
    <author>
      <name>Heeyoung Kim</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06311v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06311v1" />
    <published>2026-06-04T15:52:21Z</published>
    <updated>2026-06-04T15:52:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wonmo Koo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action</title>
    <author>
      <name>Boyang Zhang</name>
    </author>
    <author>
      <name>Lianlei Shan</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06245v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06245v1" />
    <published>2026-06-04T14:48:44Z</published>
    <updated>2026-06-04T14:48:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Boyang Zhang, Lianlei Shan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages, 5 figures, submitted to CoRL&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents</title>
    <author>
      <name>AJ Carl P. Dy</name>
    </author>
    <author>
      <name>Aivin V. Solatorio</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06242v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06242v1" />
    <published>2026-06-04T14:47:40Z</published>
    <updated>2026-06-04T14:47:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;AJ Carl P. Dy, Aivin V. Solatorio&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for &lt;em&gt;data snapshot extraction&lt;/em&gt;, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages, 8 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.DB] TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory</title>
    <author>
      <name>Ziming Wang</name>
    </author>
    <category term="cs.DB" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06240v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06240v1" />
    <published>2026-06-04T14:46:52Z</published>
    <updated>2026-06-04T14:46:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ziming Wang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy), yet none declares the isolation level it assumes or the write-time anomalies it admits. We show that contradiction resolution is write-time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual-row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n-ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies (replay inconsistency, belief-drift skew, audit erasure); a content-addressed engine-layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural-workload slice the audit-row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross-system comparison stays underpowered and claims no superiority. The contribution is the contract: a write-time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;43 pages including full appendices (proofs, protocols, and reproducibility ledger). Code, data, and reproducibility artifact: https://github.com/ZenAlexa/toki-bitemporal-memory&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments</title>
    <author>
      <name>Tan Zhang</name>
    </author>
    <author>
      <name>Quanyou Li</name>
    </author>
    <author>
      <name>Lu Zhang</name>
    </author>
    <author>
      <name>Jun Liu</name>
    </author>
    <author>
      <name>Xiaofeng Zhu</name>
    </author>
    <author>
      <name>Ping Hu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06217v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06217v1" />
    <published>2026-06-04T14:31:11Z</published>
    <updated>2026-06-04T14:31:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tan Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs</title>
    <author>
      <name>Giovanni Dettori</name>
    </author>
    <author>
      <name>Matteo Boffa</name>
    </author>
    <author>
      <name>Danilo Giordano</name>
    </author>
    <author>
      <name>Idilio Drago</name>
    </author>
    <author>
      <name>Marco Mellia</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06203v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06203v1" />
    <published>2026-06-04T14:08:30Z</published>
    <updated>2026-06-04T14:08:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Giovanni Dettori et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three &amp;quot;find-the-needle&amp;quot; style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, 6 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] MacArena: Benchmarking Computer Use Agents on an Online macOS Environment</title>
    <author>
      <name>Victor Muryn</name>
    </author>
    <author>
      <name>Maksym Shamrai</name>
    </author>
    <author>
      <name>Sofiia Mazepa</name>
    </author>
    <author>
      <name>Yehor Khodysko</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06560v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06560v1" />
    <published>2026-06-04T14:01:32Z</published>
    <updated>2026-06-04T14:01:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Victor Muryn et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple&amp;#x27;s native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SC] A Finite Certificate for the Positive $n=9$ Vasc Inequality</title>
    <author>
      <name>Dakai Guo</name>
    </author>
    <author>
      <name>Ruichen Qiu</name>
    </author>
    <author>
      <name>Yichuan Cao</name>
    </author>
    <author>
      <name>Ruyong Feng</name>
    </author>
    <category term="cs.SC" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06136v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06136v1" />
    <published>2026-06-04T13:19:19Z</published>
    <updated>2026-06-04T13:19:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dakai Guo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We prove the positive-real $n=9$ case of the Vasc cyclic inequality. The proof was obtained with human-guided assistance from the AI agent MechMath Agent Team: the human-readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed-maximum cone by cumulative gaps; the finite part is a certificate covering all $8!=40320$ sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has $36815$ coefficient leaves, $2236$ ordinary Polya multiplier leaves, and $1269$ AM-GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from-source rebuild route.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems</title>
    <author>
      <name>Tao Zhong</name>
    </author>
    <author>
      <name>Jiajun Deng</name>
    </author>
    <author>
      <name>Nikita Kuzmin</name>
    </author>
    <author>
      <name>Yinke Zhu</name>
    </author>
    <author>
      <name>Tianxiang Cao</name>
    </author>
    <author>
      <name>Tristan Tsoi</name>
    </author>
    <author>
      <name>Zhili Tan</name>
    </author>
    <author>
      <name>Simon Lui</name>
    </author>
    <author>
      <name>Xunying Liu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06559v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06559v1" />
    <published>2026-06-04T12:39:44Z</published>
    <updated>2026-06-04T12:39:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tao Zhong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM&amp;#x27;s conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation</title>
    <author>
      <name>Paavo Parmas</name>
    </author>
    <author>
      <name>Yongmin Kim</name>
    </author>
    <author>
      <name>Kohsei Matsutani</name>
    </author>
    <author>
      <name>Shota Takashiro</name>
    </author>
    <author>
      <name>Soichiro Nishimori</name>
    </author>
    <author>
      <name>Takeshi Kojima</name>
    </author>
    <author>
      <name>Yusuke Iwasawa</name>
    </author>
    <author>
      <name>Yutaka Matsuo</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06096v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06096v1" />
    <published>2026-06-04T12:34:15Z</published>
    <updated>2026-06-04T12:34:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Paavo Parmas et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator&amp;#x27;s variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning.
  Code: https://github.com/paavo5/ordergrad&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents</title>
    <author>
      <name>Aofan Yu</name>
    </author>
    <author>
      <name>Chenyu Zhou</name>
    </author>
    <author>
      <name>Tianyi Xu</name>
    </author>
    <author>
      <name>Zihan Guo</name>
    </author>
    <author>
      <name>Rong Shan</name>
    </author>
    <author>
      <name>Zhihui Fu</name>
    </author>
    <author>
      <name>Jun Wang</name>
    </author>
    <author>
      <name>Weiwen Liu</name>
    </author>
    <author>
      <name>Yong Yu</name>
    </author>
    <author>
      <name>Weinan Zhang</name>
    </author>
    <author>
      <name>Jianghao Lin</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06087v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06087v1" />
    <published>2026-06-04T12:26:09Z</published>
    <updated>2026-06-04T12:26:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Aofan Yu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages, 4 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation</title>
    <author>
      <name>Haocheng Luo</name>
    </author>
    <author>
      <name>Jiahui Liu</name>
    </author>
    <author>
      <name>Ruicheng Zhang</name>
    </author>
    <author>
      <name>Zhizhou Zhong</name>
    </author>
    <author>
      <name>Jiaqi Huang</name>
    </author>
    <author>
      <name>Zunnan Xu</name>
    </author>
    <author>
      <name>Quan Shi</name>
    </author>
    <author>
      <name>Jun Zhou</name>
    </author>
    <author>
      <name>Xiu Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06076v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06076v2" />
    <published>2026-06-04T12:13:24Z</published>
    <updated>2026-06-06T12:18:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haocheng Luo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student&amp;#x27;s own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, preprint&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following</title>
    <author>
      <name>Mohammad Mahdi Salmani-Zarchi</name>
    </author>
    <author>
      <name>Zahra Rahimi</name>
    </author>
    <author>
      <name>Heshaam Faili</name>
    </author>
    <author>
      <name>Mohammad Javad Dousti</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06058v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06058v1" />
    <published>2026-06-04T11:58:59Z</published>
    <updated>2026-06-04T11:58:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mohammad Mahdi Salmani-Zarchi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky&amp;#x27;s theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL 2026 Main Conference. 14 pages, 9 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Beyond Similarity: Trustworthy Memory Search for Personal AI Agents</title>
    <author>
      <name>Jiawen Zhang</name>
    </author>
    <author>
      <name>Kejia Chen</name>
    </author>
    <author>
      <name>Jiachen Ma</name>
    </author>
    <author>
      <name>Yangfan Hu</name>
    </author>
    <author>
      <name>Lipeng He</name>
    </author>
    <author>
      <name>Yechao Zhang</name>
    </author>
    <author>
      <name>Jian Liu</name>
    </author>
    <author>
      <name>Xiaohu Yang</name>
    </author>
    <author>
      <name>Tianwei Zhang</name>
    </author>
    <author>
      <name>Ruoxi Jia</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06054v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06054v1" />
    <published>2026-06-04T11:54:29Z</published>
    <updated>2026-06-04T11:54:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiawen Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks.
  In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet</title>
    <author>
      <name>Luoming Zhang</name>
    </author>
    <author>
      <name>Yuwei Ren</name>
    </author>
    <author>
      <name>Kui Zhang</name>
    </author>
    <author>
      <name>Tian Liu</name>
    </author>
    <author>
      <name>Lingjuan Ge</name>
    </author>
    <author>
      <name>Denghao Li</name>
    </author>
    <author>
      <name>Matthew Harper Langston</name>
    </author>
    <author>
      <name>Yin Huang</name>
    </author>
    <author>
      <name>Weiliang Will Zeng</name>
    </author>
    <author>
      <name>Liang Zhang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06034v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06034v1" />
    <published>2026-06-04T11:29:05Z</published>
    <updated>2026-06-04T11:29:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Luoming Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit INT formats by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model&amp;#x27;s accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit</title>
    <author>
      <name>Amirhossein Ghaffari</name>
    </author>
    <author>
      <name>Ali Goodarzi</name>
    </author>
    <author>
      <name>Huong Nguyen</name>
    </author>
    <author>
      <name>Simo Hosio</name>
    </author>
    <author>
      <name>Lauri Lovén</name>
    </author>
    <author>
      <name>Ekaterina Gilman</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06027v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06027v1" />
    <published>2026-06-04T11:20:10Z</published>
    <updated>2026-06-04T11:20:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Amirhossein Ghaffari et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters&amp;#x27; behavioral identifiability tracks each strategy&amp;#x27;s intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] OPRD: On-Policy Representation Distillation</title>
    <author>
      <name>Shenzhi Yang</name>
    </author>
    <author>
      <name>Guangcheng Zhu</name>
    </author>
    <author>
      <name>Bowen Song</name>
    </author>
    <author>
      <name>Haobo Wang</name>
    </author>
    <author>
      <name>Mingxuan Xia</name>
    </author>
    <author>
      <name>Xing Zheng</name>
    </author>
    <author>
      <name>Yingfan Ma</name>
    </author>
    <author>
      <name>Zhongqi Chen</name>
    </author>
    <author>
      <name>Weiqiang Wang</name>
    </author>
    <author>
      <name>Gang Chen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06021v3</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06021v3" />
    <published>2026-06-04T11:13:01Z</published>
    <updated>2026-06-09T02:20:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shenzhi Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen&amp;#x27;s ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy</title>
    <author>
      <name>Roshni Mahtani</name>
    </author>
    <author>
      <name>Ilán Carretero</name>
    </author>
    <author>
      <name>Laura Monroy</name>
    </author>
    <author>
      <name>Aldo Moreno-Oyervides</name>
    </author>
    <author>
      <name>Oscar Elías Bonilla-Manrique</name>
    </author>
    <author>
      <name>Rocío del Amor</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06554v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06554v2" />
    <published>2026-06-04T09:42:23Z</published>
    <updated>2026-06-08T07:48:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Roshni Mahtani et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted in EUSIPCO&amp;#x27;26&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection</title>
    <author>
      <name>Erfan Loweimi</name>
    </author>
    <author>
      <name>Mengjie Qian</name>
    </author>
    <author>
      <name>Kate Knill</name>
    </author>
    <author>
      <name>Guanfeng Wu</name>
    </author>
    <author>
      <name>Chi-Ho Chan</name>
    </author>
    <author>
      <name>Abbas Haider</name>
    </author>
    <author>
      <name>Muhammad Awan</name>
    </author>
    <author>
      <name>Josef Kittler</name>
    </author>
    <author>
      <name>Hui Wang</name>
    </author>
    <author>
      <name>Mark Gales</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05931v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05931v1" />
    <published>2026-06-04T09:33:58Z</published>
    <updated>2026-06-04T09:33:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Erfan Loweimi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;INTERSPEECH 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Towards World Models in Biomedical Research</title>
    <author>
      <name>Guangyu Wang</name>
    </author>
    <author>
      <name>Jingkun Yue</name>
    </author>
    <author>
      <name>Siqi Zhang</name>
    </author>
    <author>
      <name>Yu Liu</name>
    </author>
    <author>
      <name>Xiaoyu Wang</name>
    </author>
    <author>
      <name>Mingyuan Meng</name>
    </author>
    <author>
      <name>Changwei Ji</name>
    </author>
    <author>
      <name>Zongbo Han</name>
    </author>
    <author>
      <name>Yulin Wang</name>
    </author>
    <author>
      <name>Yang Yue</name>
    </author>
    <author>
      <name>Frank Fu</name>
    </author>
    <author>
      <name>Ting Chen</name>
    </author>
    <author>
      <name>Song Wu</name>
    </author>
    <author>
      <name>Ziwei Liu</name>
    </author>
    <author>
      <name>Jiangning Song</name>
    </author>
    <author>
      <name>Ming Li</name>
    </author>
    <author>
      <name>Gao Huang</name>
    </author>
    <author>
      <name>Xiaohong Liu</name>
    </author>
    <author>
      <name>Athanasios Vasilakos</name>
    </author>
    <author>
      <name>Xingcai Zhang</name>
    </author>
    <author>
      <name>Ping Zhang</name>
    </author>
    <author>
      <name>Yong Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05925v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05925v1" />
    <published>2026-06-04T09:28:54Z</published>
    <updated>2026-06-04T09:28:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Guangyu Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts</title>
    <author>
      <name>Wenbo Pan</name>
    </author>
    <author>
      <name>Shujie Liu</name>
    </author>
    <author>
      <name>Chin-Yew Lin</name>
    </author>
    <author>
      <name>Jingying Zeng</name>
    </author>
    <author>
      <name>Xianfeng Tang</name>
    </author>
    <author>
      <name>Xiangyang Zhou</name>
    </author>
    <author>
      <name>Yan Lu</name>
    </author>
    <author>
      <name>Xiaohua Jia</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05922v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05922v1" />
    <published>2026-06-04T09:26:00Z</published>
    <updated>2026-06-04T09:26:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenbo Pan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent&amp;#x27;s behavior patterns and sustains higher accuracy during long-horizon sessions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code: https://github.com/wbopan/retro-harness ; Project website: https://paper-rho.wenbo.io&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)</title>
    <author>
      <name>Christopher J. Wedge</name>
    </author>
    <author>
      <name>Joshua Stutter</name>
    </author>
    <author>
      <name>Danny Dixon</name>
    </author>
    <author>
      <name>Jacek Cała</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05901v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05901v1" />
    <published>2026-06-04T09:07:06Z</published>
    <updated>2026-06-04T09:07:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Christopher J. Wedge et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well-known risk of the LLM &amp;quot;hallucinating&amp;quot; information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning.
  In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks.
  Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this comes with a modest increase in token usage.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Retry Policy Gradients in Continuous Action Spaces</title>
    <author>
      <name>Soichiro Nishimori</name>
    </author>
    <author>
      <name>Paavo Parmas</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05888v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05888v1" />
    <published>2026-06-04T08:57:45Z</published>
    <updated>2026-06-04T08:57:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Soichiro Nishimori, Paavo Parmas&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam&amp;#x27;s adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor--critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] LadderMan: Learning Humanoid Perceptive Ladder Climbing</title>
    <author>
      <name>Siheng Zhao</name>
    </author>
    <author>
      <name>Yuanhang Zhang</name>
    </author>
    <author>
      <name>Ziqi Lu</name>
    </author>
    <author>
      <name>Pieter Abbeel</name>
    </author>
    <author>
      <name>Rocky Duan</name>
    </author>
    <author>
      <name>Koushil Sreenath</name>
    </author>
    <author>
      <name>Yue Wang</name>
    </author>
    <author>
      <name>C. Karen Liu</name>
    </author>
    <author>
      <name>Guanya Shi</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05873v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05873v1" />
    <published>2026-06-04T08:47:08Z</published>
    <updated>2026-06-04T08:47:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Siheng Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns</title>
    <author>
      <name>Olasimbo Ayodeji Arigbabu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05872v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05872v2" />
    <published>2026-06-04T08:46:43Z</published>
    <updated>2026-06-05T19:21:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Olasimbo Ayodeji Arigbabu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent&amp;#x27;s decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, 2 Tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Consistency Training Along the Transformer Stack</title>
    <author>
      <name>Sukrati Gautam</name>
    </author>
    <author>
      <name>Neil Shah</name>
    </author>
    <author>
      <name>Arav Dhoot</name>
    </author>
    <author>
      <name>Bryan Maruyama</name>
    </author>
    <author>
      <name>Caroline Wei</name>
    </author>
    <author>
      <name>Rohan Kapoor</name>
    </author>
    <author>
      <name>Robert Sidey</name>
    </author>
    <author>
      <name>Prakhar Gupta</name>
    </author>
    <author>
      <name>Zi Cheng Huang</name>
    </author>
    <author>
      <name>David Demitri Africa</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05817v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05817v1" />
    <published>2026-06-04T07:58:55Z</published>
    <updated>2026-06-04T07:58:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sukrati Gautam et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Submitted to EMNLP 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents</title>
    <author>
      <name>Dongsheng Zhu</name>
    </author>
    <author>
      <name>Xuchen Ma</name>
    </author>
    <author>
      <name>Yucheng Shen</name>
    </author>
    <author>
      <name>Xiang Li</name>
    </author>
    <author>
      <name>Yukun Zhao</name>
    </author>
    <author>
      <name>Shuaiqiang Wang</name>
    </author>
    <author>
      <name>Lingyong Yan</name>
    </author>
    <author>
      <name>Dawei Yin</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05806v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05806v1" />
    <published>2026-06-04T07:38:46Z</published>
    <updated>2026-06-04T07:38:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dongsheng Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized &amp;quot;happy paths&amp;quot;, largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents</title>
    <author>
      <name>Yuhao Sun</name>
    </author>
    <author>
      <name>Jiacheng Zhang</name>
    </author>
    <author>
      <name>Shaanan Cohney</name>
    </author>
    <author>
      <name>Zhexin Zhang</name>
    </author>
    <author>
      <name>Feng Liu</name>
    </author>
    <author>
      <name>Xingliang Yuan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05805v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05805v1" />
    <published>2026-06-04T07:34:35Z</published>
    <updated>2026-06-04T07:34:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuhao Sun et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent&amp;#x27;s context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;32 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation</title>
    <author>
      <name>Shawaiz Obaid</name>
    </author>
    <author>
      <name>Nida Chandio</name>
    </author>
    <author>
      <name>Neha Jamil</name>
    </author>
    <author>
      <name>Muhammad Khuram Shahzad</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05785v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05785v1" />
    <published>2026-06-04T07:16:06Z</published>
    <updated>2026-06-04T07:16:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shawaiz Obaid et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] An Improved CNN-LSTM Based Intrusion Detection System for IoT Networks</title>
    <author>
      <name>Mohammad Tariq Ikhlas</name>
    </author>
    <author>
      <name>Pohanyar Khowaja Khil</name>
    </author>
    <author>
      <name>Malik Muhammad Mueed Aslam</name>
    </author>
    <author>
      <name>Muhammad Khuram Shahzad</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05776v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05776v1" />
    <published>2026-06-04T07:04:57Z</published>
    <updated>2026-06-04T07:04:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mohammad Tariq Ikhlas et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;With the rapid proliferation of IoT devices, security concerns have dramatically escalated and intrusion detection systems have become critical for protecting networked environments. This paper presents an improved CNN-LSTM based intrusion detection model that combines multi-class classification, dataset integration, and temporal feature learning to enhance detection performance in IoT networks. Using network traffic data, the proposed approach is evaluated on intrusion detection tasks and achieves an accuracy of approximately 97%. Experimental results demonstrate that the model effectively detects multiple attack categories while maintaining stable training and validation performance. The integration of convolutional and recurrent neural network components enables the framework to capture both spatial and temporal characteristics of network traffic, improving overall intrusion detection capability in IoT environments.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 8 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models</title>
    <author>
      <name>Zhuoming Liu</name>
    </author>
    <author>
      <name>Jinhong Lin</name>
    </author>
    <author>
      <name>Kwan Man Cheng</name>
    </author>
    <author>
      <name>Lin Zhang</name>
    </author>
    <author>
      <name>Shayok Bagchi</name>
    </author>
    <author>
      <name>Yin Li</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05758v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05758v1" />
    <published>2026-06-04T06:37:10Z</published>
    <updated>2026-06-04T06:37:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhuoming Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework</title>
    <author>
      <name>Weiguang Wang</name>
    </author>
    <author>
      <name>Fugen Wu</name>
    </author>
    <author>
      <name>Hailing Wang</name>
    </author>
    <author>
      <name>Xuechen Liang</name>
    </author>
    <author>
      <name>Xiaobin Li</name>
    </author>
    <author>
      <name>Ru Han</name>
    </author>
    <author>
      <name>Tianchang Xie</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05754v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05754v1" />
    <published>2026-06-04T06:29:25Z</published>
    <updated>2026-06-04T06:29:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Weiguang Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Phase-sensitive optical time-domain reflectometry ($φ$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $φ$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $φ$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $φ$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[quant-ph] Coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios based on QUBO and hybrid quantum algorithms</title>
    <author>
      <name>Xiaobin Li</name>
    </author>
    <author>
      <name>Yanbin Gao</name>
    </author>
    <author>
      <name>Weiguang Wang</name>
    </author>
    <author>
      <name>Xuechen Liang</name>
    </author>
    <category term="quant-ph" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06543v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06543v1" />
    <published>2026-06-04T06:23:13Z</published>
    <updated>2026-06-04T06:23:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xiaobin Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This study examines the coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios. A quadratic unconstrained binary optimization (QUBO) model is formulated to represent departure-position assignment and section-track selection within a unified binary framework. Because the quality of a dispatching scheme depends on time-dependent operational interactions that cannot be fully captured by a static combinatorial model, a simulation-based evaluation layer is introduced to assess section occupation, intermediate-station waiting, platform-capacity pressure, running-time fluctuations, and delay propagation. Within this layered framework, conventional heuristics, quantum-inspired algorithms, and hybrid algorithms are compared on the same decision structure. The results show that the QUBO model can generate feasible candidate schemes after decoding, while the simulation layer clearly differentiates the operational performance of the competing algorithms under both normal and disturbed conditions. In the tested scenarios, QPSO-QAOA performs best under normal conditions, and the quantum-enhanced methods reduce comprehensive cost by 4.28%–26.26% and total delay by 4.37%–24.25% on average under dynamic conditions relative to their conventional counterparts. These findings suggest that the integration of QUBO-based modeling and simulation-based evaluation provides a useful methodological framework for railway short-term concentrated departure scheduling, although validation with real operational data remains necessary.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance</title>
    <author>
      <name>Yansi Li</name>
    </author>
    <author>
      <name>Zhuosheng Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05728v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05728v1" />
    <published>2026-06-04T05:37:31Z</published>
    <updated>2026-06-04T05:37:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yansi Li, Zhuosheng Zhang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models</title>
    <author>
      <name>Haoyu Zhou</name>
    </author>
    <author>
      <name>Qing Qing</name>
    </author>
    <author>
      <name>Caichong Li</name>
    </author>
    <author>
      <name>Qixin Zhang</name>
    </author>
    <author>
      <name>Yongcheng Jing</name>
    </author>
    <author>
      <name>Ziqi Xu</name>
    </author>
    <author>
      <name>Juncheng Hu</name>
    </author>
    <author>
      <name>Xikun Zhang</name>
    </author>
    <author>
      <name>Renqiang Luo</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05702v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05702v1" />
    <published>2026-06-04T04:49:09Z</published>
    <updated>2026-06-04T04:49:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haoyu Zhou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on &amp;quot;incorrect shortcuts&amp;quot;, such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions</title>
    <author>
      <name>Wenhao Mu</name>
    </author>
    <author>
      <name>Facundo Yan</name>
    </author>
    <author>
      <name>Anik Mumssen</name>
    </author>
    <author>
      <name>Marisa Eisenberg</name>
    </author>
    <author>
      <name>Alexander Rodríguez</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05692v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05692v1" />
    <published>2026-06-04T04:18:28Z</published>
    <updated>2026-06-04T04:18:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenhao Mu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models</title>
    <author>
      <name>Hancheol Park</name>
    </author>
    <author>
      <name>Geonho Lee</name>
    </author>
    <author>
      <name>Tairen Piao</name>
    </author>
    <author>
      <name>Tae-Ho Kim</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05688v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05688v1" />
    <published>2026-06-04T04:13:05Z</published>
    <updated>2026-06-04T04:13:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hancheol Park et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 1 figure&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AdaMEM: Test-Time Adaptive Memory for Language Agents</title>
    <author>
      <name>Yunxiang Zhang</name>
    </author>
    <author>
      <name>Yiheng Li</name>
    </author>
    <author>
      <name>Ali Payani</name>
    </author>
    <author>
      <name>Lu Wang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05684v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05684v1" />
    <published>2026-06-04T04:06:08Z</published>
    <updated>2026-06-04T04:06:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yunxiang Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.DB] Data Flow Control: Data Safety Policies for AI Agents</title>
    <author>
      <name>Charlie Summers</name>
    </author>
    <author>
      <name>Eugene Wu</name>
    </author>
    <category term="cs.DB" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05679v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05679v1" />
    <published>2026-06-04T04:01:24Z</published>
    <updated>2026-06-04T04:01:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Charlie Summers, Eugene Wu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem.
  This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines -- DuckDB, Umbra, PostgreSQL, DataFusion, and SQL Server -- Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. Data Flow Control is available open source at https://github.com/dataflowcontrol/data-flow-control.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;15 pages, 12 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows</title>
    <author>
      <name>Yuhang Fu</name>
    </author>
    <author>
      <name>Ruishan Fang</name>
    </author>
    <author>
      <name>Jiaqi Shao</name>
    </author>
    <author>
      <name>Huiyu Zheng</name>
    </author>
    <author>
      <name>Zhengtao Zhu</name>
    </author>
    <author>
      <name>Bing Luo</name>
    </author>
    <author>
      <name>Tao Lin</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05670v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05670v1" />
    <published>2026-06-04T03:50:47Z</published>
    <updated>2026-06-04T03:50:47Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuhang Fu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;https://github.com/LINs-lab/MASArena/tree/BenchAgent&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability</title>
    <author>
      <name>Suraj Babu Thimma Krishnaram</name>
    </author>
    <author>
      <name>Yibo Hu</name>
    </author>
    <author>
      <name>Karthikeyan Saravanan</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05654v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05654v2" />
    <published>2026-06-04T03:34:02Z</published>
    <updated>2026-06-05T15:08:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Suraj Babu Thimma Krishnaram et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?</title>
    <author>
      <name>Jingheng Ye</name>
    </author>
    <author>
      <name>Huiqi Zou</name>
    </author>
    <author>
      <name>Simon Yu</name>
    </author>
    <author>
      <name>Weiyan Shi</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05647v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05647v1" />
    <published>2026-06-04T03:22:17Z</published>
    <updated>2026-06-04T03:22:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jingheng Ye et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;34 pages, 30 figures, 3 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Enhancing Software Engineering Through Closed-Loop Memory Optimization</title>
    <author>
      <name>Xuehang Guo</name>
    </author>
    <author>
      <name>Zora Zhiruo Wang</name>
    </author>
    <author>
      <name>Qingyun Wang</name>
    </author>
    <author>
      <name>Graham Neubig</name>
    </author>
    <author>
      <name>Xingyao Wang</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05646v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05646v1" />
    <published>2026-06-04T03:17:21Z</published>
    <updated>2026-06-04T03:17:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xuehang Guo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce a closed-loop framework for memory augmentation in SE agents. The framework grounds memory utility in validated downstream impact, establishing utility as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Through complementary evaluation on single-episode and cross-episode memory augmentation, results demonstrate that the framework consistently improves SE agents across settings, achieving absolute gains of up to 5.25% in success rate and 4.63% in resolve efficiency, while substantially reducing computational cost by at least 9.79%. Our project page: https://xhguo7.github.io/MemOp/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Multilingual Fine-Tuning via Localized Gradient Conflict Resolution</title>
    <author>
      <name>Long P. Hoang</name>
    </author>
    <author>
      <name>Yiran Zhao</name>
    </author>
    <author>
      <name>Wei Lu</name>
    </author>
    <author>
      <name>Wenxuan Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05613v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05613v1" />
    <published>2026-06-04T02:36:30Z</published>
    <updated>2026-06-04T02:36:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Long P. Hoang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks</title>
    <author>
      <name>Seungwon Jeong</name>
    </author>
    <author>
      <name>Jiwoo Jeong</name>
    </author>
    <author>
      <name>Hyeonjin Kim</name>
    </author>
    <author>
      <name>Yunseok Lee</name>
    </author>
    <author>
      <name>Woojin Lee</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05609v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05609v1" />
    <published>2026-06-04T02:31:29Z</published>
    <updated>2026-06-04T02:31:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Seungwon Jeong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens at the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate slots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the slots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches. Our implementation is available at https://github.com/youai058/SlotGCG.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization</title>
    <author>
      <name>Ayano Hiranaka</name>
    </author>
    <author>
      <name>Ya-Chuan Hsu</name>
    </author>
    <author>
      <name>Stefanos Nikolaidis</name>
    </author>
    <author>
      <name>Erdem Bıyık</name>
    </author>
    <author>
      <name>Daniel Seita</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05602v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05602v1" />
    <published>2026-06-04T02:25:19Z</published>
    <updated>2026-06-04T02:25:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ayano Hiranaka et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to International Conference on Machine Learning (ICML) 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Representation Learning Enables Scalable Multitask Deep Reinforcement Learning</title>
    <author>
      <name>Johan Obando-Ceron</name>
    </author>
    <author>
      <name>Lu Li</name>
    </author>
    <author>
      <name>Scott Fujimoto</name>
    </author>
    <author>
      <name>Pierre-Luc Bacon</name>
    </author>
    <author>
      <name>Aaron Courville</name>
    </author>
    <author>
      <name>Pablo Samuel Castro</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05555v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05555v1" />
    <published>2026-06-04T01:09:20Z</published>
    <updated>2026-06-04T01:09:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Johan Obando-Ceron et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but representation learning. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[stat.ML] Conformal Risk-Averse Decision Making with Action Conditional Guarantee</title>
    <author>
      <name>Zihan Zhu</name>
    </author>
    <author>
      <name>Shayan Kiyani</name>
    </author>
    <author>
      <name>George Pappas</name>
    </author>
    <author>
      <name>Hamed Hassani</name>
    </author>
    <category term="stat.ML" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05551v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05551v1" />
    <published>2026-06-04T01:05:57Z</published>
    <updated>2026-06-04T01:05:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zihan Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies -- yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning</title>
    <author>
      <name>Rohan Siva</name>
    </author>
    <author>
      <name>Neel P. Bhatt</name>
    </author>
    <author>
      <name>Yunhao Yang</name>
    </author>
    <author>
      <name>Seoyoung Lee</name>
    </author>
    <author>
      <name>Nishant Gadde</name>
    </author>
    <author>
      <name>Christian Ellis</name>
    </author>
    <author>
      <name>Alvaro Velasquez</name>
    </author>
    <author>
      <name>Zhangyang Wang</name>
    </author>
    <author>
      <name>Ufuk Topcu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05533v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05533v1" />
    <published>2026-06-04T00:26:04Z</published>
    <updated>2026-06-04T00:26:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rohan Siva et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a &amp;quot;cart&amp;quot; based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is &amp;quot;movable&amp;quot;), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., &amp;quot;movable&amp;quot;). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code, videos, and data available at: https://A4Dance-reasoning.github.io&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models</title>
    <author>
      <name>Mohammad Mahdi Abootorabi</name>
    </author>
    <author>
      <name>Omid Ghahroodi</name>
    </author>
    <author>
      <name>Anas Madkoor</name>
    </author>
    <author>
      <name>Marzia Nouri</name>
    </author>
    <author>
      <name>Doratossadat Dastgheib</name>
    </author>
    <author>
      <name>Mohamed Hefeeda</name>
    </author>
    <author>
      <name>Ehsaneddin Asgari</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05531v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05531v1" />
    <published>2026-06-04T00:21:22Z</published>
    <updated>2026-06-04T00:21:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mohammad Mahdi Abootorabi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom&amp;#x27;s Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL 2026 Findings&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization</title>
    <author>
      <name>Kuangshi Ai</name>
    </author>
    <author>
      <name>Haichao Miao</name>
    </author>
    <author>
      <name>Kaiyuan Tang</name>
    </author>
    <author>
      <name>Shusen Liu</name>
    </author>
    <author>
      <name>Chaoli Wang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05525v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05525v1" />
    <published>2026-06-04T00:14:25Z</published>
    <updated>2026-06-04T00:14:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kuangshi Ai et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Exploring LLMs for South Asian Music Understanding and Generation</title>
    <author>
      <name>Faria Binte Kader</name>
    </author>
    <author>
      <name>Mohtasim Hadi Rafi</name>
    </author>
    <author>
      <name>Shah Wasif Sajjad</name>
    </author>
    <author>
      <name>Santu Karmaker</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05522v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05522v1" />
    <published>2026-06-03T23:53:27Z</published>
    <updated>2026-06-03T23:53:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Faria Binte Kader et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;19 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data</title>
    <author>
      <name>Al Zadid Sultan Bin Habib</name>
    </author>
    <author>
      <name>Md Younus Ahamed</name>
    </author>
    <author>
      <name>Prashnna Kumar Gyawali</name>
    </author>
    <author>
      <name>Gianfranco Doretto</name>
    </author>
    <author>
      <name>Donald A. Adjeroh</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05441v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05441v2" />
    <published>2026-06-03T21:03:33Z</published>
    <updated>2026-06-07T02:08:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Al Zadid Sultan Bin Habib et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources GitHub https://github.com/zadid6pretam/GOTabPFN PyPI https://pypi.org/project/gotabpfn Project webpage https://www.zadidhabib.com/gotabpfn.html Hugging Face ZeroGPU https://huggingface.co/spaces/zadid6pretam/GOTabPFN CPU backup https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting</title>
    <author>
      <name>Zhaoqi Zhang</name>
    </author>
    <author>
      <name>Miao Xie</name>
    </author>
    <author>
      <name>Yi Li</name>
    </author>
    <author>
      <name>Linyou Cai</name>
    </author>
    <author>
      <name>Siqiang Luo</name>
    </author>
    <author>
      <name>Gao Cong</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05413v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05413v1" />
    <published>2026-06-03T20:27:48Z</published>
    <updated>2026-06-03T20:27:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhaoqi Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest (POIs) is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages a Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available at GitHub.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at KDD 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution</title>
    <author>
      <name>Can Gurkan</name>
    </author>
    <author>
      <name>Forrest Stonedahl</name>
    </author>
    <author>
      <name>Uri Wilensky</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05408v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05408v1" />
    <published>2026-06-03T20:22:29Z</published>
    <updated>2026-06-03T20:22:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Can Gurkan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the Genetic and Evolutionary Computation Conference (GECCO &amp;#x27;26) Workshop on Large Language Models for and with Evolutionary Computation&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Agents' Last Exam</title>
    <author>
      <name>Yiyou Sun</name>
    </author>
    <author>
      <name>Xinyang Han</name>
    </author>
    <author>
      <name>Weichen Zhang</name>
    </author>
    <author>
      <name>Yuanbo Pang</name>
    </author>
    <author>
      <name>Tianyu Wang</name>
    </author>
    <author>
      <name>Yuhan Cao</name>
    </author>
    <author>
      <name>Yixiao Huang</name>
    </author>
    <author>
      <name>Chris Duroiu</name>
    </author>
    <author>
      <name>Haoyun Zhang</name>
    </author>
    <author>
      <name>Jeffrey Lin</name>
    </author>
    <author>
      <name>Weishu Zhang</name>
    </author>
    <author>
      <name>Tyler Zeng</name>
    </author>
    <author>
      <name>Ying Yan</name>
    </author>
    <author>
      <name>Bo Liu</name>
    </author>
    <author>
      <name>Hanson Wen</name>
    </author>
    <author>
      <name>Mingyang Xu</name>
    </author>
    <author>
      <name>Xiaoyuan Liu</name>
    </author>
    <author>
      <name>Zimeng Chen</name>
    </author>
    <author>
      <name>Weiyan Shi</name>
    </author>
    <author>
      <name>Amanda Dsouza</name>
    </author>
    <author>
      <name>Vincent Sunn Chen</name>
    </author>
    <author>
      <name>Patrick Bryant</name>
    </author>
    <author>
      <name>Carl Boettiger</name>
    </author>
    <author>
      <name>Yamini Rangan</name>
    </author>
    <author>
      <name>Bradley Rothenberg</name>
    </author>
    <author>
      <name>Kyle Steinfeld</name>
    </author>
    <author>
      <name>Arvind Rao</name>
    </author>
    <author>
      <name>Tapio Schneider</name>
    </author>
    <author>
      <name>Georgios Yannakakis</name>
    </author>
    <author>
      <name>Laure Zanna</name>
    </author>
    <author>
      <name>Kaan Ozbay</name>
    </author>
    <author>
      <name>Ida Sim</name>
    </author>
    <author>
      <name>Tarek Zohdi</name>
    </author>
    <author>
      <name>George Em Karniadakis</name>
    </author>
    <author>
      <name>Jack Gallant</name>
    </author>
    <author>
      <name>Teresa Head-Gordon</name>
    </author>
    <author>
      <name>Yushan Li</name>
    </author>
    <author>
      <name>Wenxi Deng</name>
    </author>
    <author>
      <name>Tao Sun</name>
    </author>
    <author>
      <name>Huiqi Wang</name>
    </author>
    <author>
      <name>Zhun Wang</name>
    </author>
    <author>
      <name>Justin Xu</name>
    </author>
    <author>
      <name>Chris Yuhao Liu</name>
    </author>
    <author>
      <name>Yafei Cheng</name>
    </author>
    <author>
      <name>Rongwang Hu</name>
    </author>
    <author>
      <name>Aras Bacho</name>
    </author>
    <author>
      <name>Shengcao Cao</name>
    </author>
    <author>
      <name>Zengyi Qin</name>
    </author>
    <author>
      <name>Yixiong Chen</name>
    </author>
    <author>
      <name>Hengduan Fan</name>
    </author>
    <author>
      <name>Hao Liu</name>
    </author>
    <author>
      <name>Lin Zeng</name>
    </author>
    <author>
      <name>Shashank Muralidhar Bharadwaj</name>
    </author>
    <author>
      <name>Litian Gong</name>
    </author>
    <author>
      <name>Yingxuan Yang</name>
    </author>
    <author>
      <name>Maojia Song</name>
    </author>
    <author>
      <name>Ruheng Wang</name>
    </author>
    <author>
      <name>Zongzheng Zhang</name>
    </author>
    <author>
      <name>Honglin Bao</name>
    </author>
    <author>
      <name>Shuo Lu</name>
    </author>
    <author>
      <name>Jianhong Tu</name>
    </author>
    <author>
      <name>Zhonghua Wang</name>
    </author>
    <author>
      <name>Zheng Zhang</name>
    </author>
    <author>
      <name>Zijiao Chen</name>
    </author>
    <author>
      <name>Yanqiong Jiang</name>
    </author>
    <author>
      <name>Zhendong Li</name>
    </author>
    <author>
      <name>Bohan Lyu</name>
    </author>
    <author>
      <name>Chang Ma</name>
    </author>
    <author>
      <name>Peiran Xu</name>
    </author>
    <author>
      <name>Benran Zhang</name>
    </author>
    <author>
      <name>Shangding Gu</name>
    </author>
    <author>
      <name>Haoyue Hua</name>
    </author>
    <author>
      <name>Haoyang Li</name>
    </author>
    <author>
      <name>Wanzhe Liao</name>
    </author>
    <author>
      <name>Chengzhi Liu</name>
    </author>
    <author>
      <name>Junbo Peng</name>
    </author>
    <author>
      <name>Haoran Sun</name>
    </author>
    <author>
      <name>Zechen Xu</name>
    </author>
    <author>
      <name>Bo Chen</name>
    </author>
    <author>
      <name>Jiayi Cheng</name>
    </author>
    <author>
      <name>Yi Jiang</name>
    </author>
    <author>
      <name>Keying Kuang</name>
    </author>
    <author>
      <name>Yuan Li</name>
    </author>
    <author>
      <name>Youbang Pan</name>
    </author>
    <author>
      <name>Ziyan Rao</name>
    </author>
    <author>
      <name>Alexander Schubert</name>
    </author>
    <author>
      <name>Yifan Shen</name>
    </author>
    <author>
      <name>Vincent Siu</name>
    </author>
    <author>
      <name>Xiatao Sun</name>
    </author>
    <author>
      <name>Kangqi Zhang</name>
    </author>
    <author>
      <name>Xiaopan Zhang</name>
    </author>
    <author>
      <name>Yuchen Zhu</name>
    </author>
    <author>
      <name>Ishaan Singh Chandok</name>
    </author>
    <author>
      <name>Lei Ding</name>
    </author>
    <author>
      <name>Jingxuan Fan</name>
    </author>
    <author>
      <name>Andrew Glover</name>
    </author>
    <author>
      <name>Jiaming Hu</name>
    </author>
    <author>
      <name>Yiran Hu</name>
    </author>
    <author>
      <name>Wenbo Huang</name>
    </author>
    <author>
      <name>Zixin Jiang</name>
    </author>
    <author>
      <name>Haoran Jin</name>
    </author>
    <author>
      <name>Lukas Kim</name>
    </author>
    <author>
      <name>Ming Liu</name>
    </author>
    <author>
      <name>Yang Liu</name>
    </author>
    <author>
      <name>Alireza Rafiei</name>
    </author>
    <author>
      <name>Xuhuan Shen</name>
    </author>
    <author>
      <name>Kunyang Sun</name>
    </author>
    <author>
      <name>Sophia Sun</name>
    </author>
    <author>
      <name>Ting Sun</name>
    </author>
    <author>
      <name>Eric Wang</name>
    </author>
    <author>
      <name>Yixin Wang</name>
    </author>
    <author>
      <name>Hanwen Xing</name>
    </author>
    <author>
      <name>Sihan Xu</name>
    </author>
    <author>
      <name>Yuzheng Xu</name>
    </author>
    <author>
      <name>Zhongxing Xu</name>
    </author>
    <author>
      <name>Zhiling Yan</name>
    </author>
    <author>
      <name>Boqin Yuan</name>
    </author>
    <author>
      <name>Ruiqi Zhang</name>
    </author>
    <author>
      <name>Yifan Zhang</name>
    </author>
    <author>
      <name>Zibo Zhao</name>
    </author>
    <author>
      <name>Liana</name>
    </author>
    <author>
      <name>Santanu Bosu Antu</name>
    </author>
    <author>
      <name>Haoyue Bai</name>
    </author>
    <author>
      <name>Carlo Bosio</name>
    </author>
    <author>
      <name>Joseph Cavanagh</name>
    </author>
    <author>
      <name>Patricia Cavazos-Rehg</name>
    </author>
    <author>
      <name>Tianxing Chen</name>
    </author>
    <author>
      <name>Xuewen Chen</name>
    </author>
    <author>
      <name>Yipu Chen</name>
    </author>
    <author>
      <name>Zhu Chenyu</name>
    </author>
    <author>
      <name>Chen Dai</name>
    </author>
    <author>
      <name>Stefano De Castro</name>
    </author>
    <author>
      <name>Yunfu Deng</name>
    </author>
    <author>
      <name>Kaustubh Dhole</name>
    </author>
    <author>
      <name>Jiayuan Ding</name>
    </author>
    <author>
      <name>Chenchen Du</name>
    </author>
    <author>
      <name>Zhehang Du</name>
    </author>
    <author>
      <name>Hao Fan</name>
    </author>
    <author>
      <name>Run-ze Fan</name>
    </author>
    <author>
      <name>Hengyu Fu</name>
    </author>
    <author>
      <name>Shi Gu</name>
    </author>
    <author>
      <name>Yifan Gu</name>
    </author>
    <author>
      <name>Charlie Guo</name>
    </author>
    <author>
      <name>Baihe Huang</name>
    </author>
    <author>
      <name>Baixiang Huang</name>
    </author>
    <author>
      <name>Rimika Jaiswal</name>
    </author>
    <author>
      <name>Zhihan Jiang</name>
    </author>
    <author>
      <name>Ran Jin</name>
    </author>
    <author>
      <name>Erin Kasson</name>
    </author>
    <author>
      <name>Xin Lan</name>
    </author>
    <author>
      <name>Joseph Lee</name>
    </author>
    <author>
      <name>Deren Lei</name>
    </author>
    <author>
      <name>Chenyu Li</name>
    </author>
    <author>
      <name>Daofeng Li</name>
    </author>
    <author>
      <name>Haitao Li</name>
    </author>
    <author>
      <name>Hongwei Li</name>
    </author>
    <author>
      <name>Jingyan Li</name>
    </author>
    <author>
      <name>Xiao Li</name>
    </author>
    <author>
      <name>Yi Li</name>
    </author>
    <author>
      <name>Yinsheng Li</name>
    </author>
    <author>
      <name>Yuangang Li</name>
    </author>
    <author>
      <name>Zhixu Li</name>
    </author>
    <author>
      <name>Wenyu Liang</name>
    </author>
    <author>
      <name>Longtai Liao</name>
    </author>
    <author>
      <name>Kevin Qinghong Lin</name>
    </author>
    <author>
      <name>AndyZeyi Liu</name>
    </author>
    <author>
      <name>Che Liu</name>
    </author>
    <author>
      <name>Jiaming Liu</name>
    </author>
    <author>
      <name>Kaiyuan Liu</name>
    </author>
    <author>
      <name>Xuan Liu</name>
    </author>
    <author>
      <name>Pan Lu</name>
    </author>
    <author>
      <name>Wenbo Lv</name>
    </author>
    <author>
      <name>Yicheng Lv</name>
    </author>
    <author>
      <name>Qiuyang Mang</name>
    </author>
    <author>
      <name>Kyle Montgomery</name>
    </author>
    <author>
      <name>Yuzhou Nie</name>
    </author>
    <author>
      <name>Ruoxi Ning</name>
    </author>
    <author>
      <name>Jorin Overwiening</name>
    </author>
    <author>
      <name>Xu Pan</name>
    </author>
    <author>
      <name>Layna Paraboschi</name>
    </author>
    <author>
      <name>Core Francisco Park</name>
    </author>
    <author>
      <name>Justin Purnomo</name>
    </author>
    <author>
      <name>Swati Rajwal</name>
    </author>
    <author>
      <name>Scott Rankin</name>
    </author>
    <author>
      <name>Bixuan Ren</name>
    </author>
    <author>
      <name>Yiren Rong</name>
    </author>
    <author>
      <name>HaoYang Shang</name>
    </author>
    <author>
      <name>Ventus Shaw</name>
    </author>
    <author>
      <name>Fiona Shen</name>
    </author>
    <author>
      <name>Jiawei Shen</name>
    </author>
    <author>
      <name>Minqi Shi</name>
    </author>
    <author>
      <name>Qiu Shi</name>
    </author>
    <author>
      <name>Huaxiu Yao</name>
    </author>
    <author>
      <name>Tianneng Shi</name>
    </author>
    <author>
      <name>Jonah So</name>
    </author>
    <author>
      <name>Vladislav Susoy</name>
    </author>
    <author>
      <name>Hannah Szlyk</name>
    </author>
    <author>
      <name>Haocheng Wang</name>
    </author>
    <author>
      <name>Jialu Wang</name>
    </author>
    <author>
      <name>Wei Wang</name>
    </author>
    <author>
      <name>Xinyu Wang</name>
    </author>
    <author>
      <name>Zehao Wang</name>
    </author>
    <author>
      <name>Dowling Wong</name>
    </author>
    <author>
      <name>Angela Wu</name>
    </author>
    <author>
      <name>Dehao Wu</name>
    </author>
    <author>
      <name>Fangyu Wu</name>
    </author>
    <author>
      <name>Mengyuan "Millie" Wu</name>
    </author>
    <author>
      <name>Yu Wu</name>
    </author>
    <author>
      <name>Yuchen Wu</name>
    </author>
    <author>
      <name>Yuhao Wu</name>
    </author>
    <author>
      <name>Qingpo Wuwu</name>
    </author>
    <author>
      <name>Weihang Xiao</name>
    </author>
    <author>
      <name>Yongyi Xiong</name>
    </author>
    <author>
      <name>Fan Xu</name>
    </author>
    <author>
      <name>Ruiling Xu</name>
    </author>
    <author>
      <name>Mingxuan Yan</name>
    </author>
    <author>
      <name>Benjamin Yang</name>
    </author>
    <author>
      <name>Jirong Yang</name>
    </author>
    <author>
      <name>Sen Yang</name>
    </author>
    <author>
      <name>Xiaoli Yang</name>
    </author>
    <author>
      <name>Yushi Yang</name>
    </author>
    <author>
      <name>Haoran Ye</name>
    </author>
    <author>
      <name>Xiaohu Yu</name>
    </author>
    <author>
      <name>Zhengming Yu</name>
    </author>
    <author>
      <name>Chenlong Zhang</name>
    </author>
    <author>
      <name>Chi Zhang</name>
    </author>
    <author>
      <name>Hanning Zhang</name>
    </author>
    <author>
      <name>Hanwen Zhang</name>
    </author>
    <author>
      <name>Junge Zhang</name>
    </author>
    <author>
      <name>Kunpeng Zhang</name>
    </author>
    <author>
      <name>Song Zhang</name>
    </author>
    <author>
      <name>Wenjin Zhang</name>
    </author>
    <author>
      <name>Wenshuo Zhang</name>
    </author>
    <author>
      <name>Ying Zhang</name>
    </author>
    <author>
      <name>Yizhi Zhang</name>
    </author>
    <author>
      <name>Brian Zhao</name>
    </author>
    <author>
      <name>Qijian Zhao</name>
    </author>
    <author>
      <name>Yimin Zhao</name>
    </author>
    <author>
      <name>Yuhaohua Zheng</name>
    </author>
    <author>
      <name>Liwei Zhou</name>
    </author>
    <author>
      <name>Tianyue Zhou</name>
    </author>
    <author>
      <name>Sichen Zhu</name>
    </author>
    <author>
      <name>Siqi Zhu</name>
    </author>
    <author>
      <name>Yan Zhu</name>
    </author>
    <author>
      <name>Yishu Zhu</name>
    </author>
    <author>
      <name>Jierui Zuo</name>
    </author>
    <author>
      <name>Chonghao Cai</name>
    </author>
    <author>
      <name>Helena Casademunt</name>
    </author>
    <author>
      <name>Wenjia Chen</name>
    </author>
    <author>
      <name>Benjamin Cheng</name>
    </author>
    <author>
      <name>Nawen Deng</name>
    </author>
    <author>
      <name>Rao Fu</name>
    </author>
    <author>
      <name>Tianfu Fu</name>
    </author>
    <author>
      <name>Yifan Han</name>
    </author>
    <author>
      <name>Ren He</name>
    </author>
    <author>
      <name>Zhenyu He</name>
    </author>
    <author>
      <name>Qiao Jin</name>
    </author>
    <author>
      <name>Lang Lang</name>
    </author>
    <author>
      <name>Yuetai Li</name>
    </author>
    <author>
      <name>Sylvia Liu</name>
    </author>
    <author>
      <name>Lu Lu</name>
    </author>
    <author>
      <name>Qing Lu</name>
    </author>
    <author>
      <name>Subhabrata Mukherjee</name>
    </author>
    <author>
      <name>Yunqi Ouyang</name>
    </author>
    <author>
      <name>Yin Ren</name>
    </author>
    <author>
      <name>Dawei Shi</name>
    </author>
    <author>
      <name>Haoran Wu</name>
    </author>
    <author>
      <name>Zhiyue Wu</name>
    </author>
    <author>
      <name>Hannah Yao</name>
    </author>
    <author>
      <name>Zhuoran Yi</name>
    </author>
    <author>
      <name>Jenny Yu</name>
    </author>
    <author>
      <name>Rhea Zhan</name>
    </author>
    <author>
      <name>Hang Zhou</name>
    </author>
    <author>
      <name>Blake Zhu</name>
    </author>
    <author>
      <name>Junfan Zhu</name>
    </author>
    <author>
      <name>Alan Yuille</name>
    </author>
    <author>
      <name>Yang Liu</name>
    </author>
    <author>
      <name>Russell Alan Poldrack</name>
    </author>
    <author>
      <name>Jiachen Li</name>
    </author>
    <author>
      <name>Zhenglu Li</name>
    </author>
    <author>
      <name>Molei Tao</name>
    </author>
    <author>
      <name>Jing Huang</name>
    </author>
    <author>
      <name>Wenqi Shi</name>
    </author>
    <author>
      <name>Costas Spanos</name>
    </author>
    <author>
      <name>Lichao Sun</name>
    </author>
    <author>
      <name>Chenguang Wang</name>
    </author>
    <author>
      <name>Orson Xu</name>
    </author>
    <author>
      <name>Zhen Dong</name>
    </author>
    <author>
      <name>Hector Gomez</name>
    </author>
    <author>
      <name>Aylin Caliskan</name>
    </author>
    <author>
      <name>Ali Emami</name>
    </author>
    <author>
      <name>Haimin Hu</name>
    </author>
    <author>
      <name>Zhi Li</name>
    </author>
    <author>
      <name>Lihui Liu</name>
    </author>
    <author>
      <name>Murphy Niu</name>
    </author>
    <author>
      <name>Yi Shao</name>
    </author>
    <author>
      <name>Jianxin Sun</name>
    </author>
    <author>
      <name>Mikko Tolonen</name>
    </author>
    <author>
      <name>Ting Wang</name>
    </author>
    <author>
      <name>Sanjiv Das</name>
    </author>
    <author>
      <name>Yanjun Gao</name>
    </author>
    <author>
      <name>Wenbo Guo</name>
    </author>
    <author>
      <name>Erika J Schneider</name>
    </author>
    <author>
      <name>Zhiyong Lu</name>
    </author>
    <author>
      <name>Mark Mueller</name>
    </author>
    <author>
      <name>Radha Poovendran</name>
    </author>
    <author>
      <name>Somayeh Sojoudi</name>
    </author>
    <author>
      <name>Dawn Song</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05405v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05405v1" />
    <published>2026-06-03T20:20:46Z</published>
    <updated>2026-06-03T20:20:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yiyou Sun et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents&amp;#x27; Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Harnessing Generalist Agents for Contextualized Time Series</title>
    <author>
      <name>Zihao Li</name>
    </author>
    <author>
      <name>Kaifeng Jin</name>
    </author>
    <author>
      <name>Yuanchen Bei</name>
    </author>
    <author>
      <name>Jiaru Zou</name>
    </author>
    <author>
      <name>Avaneesh Kumar</name>
    </author>
    <author>
      <name>Xuying Ning</name>
    </author>
    <author>
      <name>Yanjun Zhao</name>
    </author>
    <author>
      <name>Mengting Ai</name>
    </author>
    <author>
      <name>Baoyu Jing</name>
    </author>
    <author>
      <name>Hanghang Tong</name>
    </author>
    <author>
      <name>Jingrui He</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05404v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05404v1" />
    <published>2026-06-03T20:20:34Z</published>
    <updated>2026-06-03T20:20:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zihao Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint. 38 Pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces</title>
    <author>
      <name>Jinu Lee</name>
    </author>
    <author>
      <name>Shivam Agarwal</name>
    </author>
    <author>
      <name>Amruta Parulekar</name>
    </author>
    <author>
      <name>Siddarth Madala</name>
    </author>
    <author>
      <name>Dilek Hakkani-Tur</name>
    </author>
    <author>
      <name>Julia Hockenmaier</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05402v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05402v1" />
    <published>2026-06-03T20:12:26Z</published>
    <updated>2026-06-03T20:12:26Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinu Lee et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code at https://github.com/jinulee-v/reasoningflow.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization</title>
    <author>
      <name>Yuanhe Zhang</name>
    </author>
    <author>
      <name>Yuekai Sun</name>
    </author>
    <author>
      <name>Taiji Suzuki</name>
    </author>
    <author>
      <name>Jason D. Lee</name>
    </author>
    <author>
      <name>Fanghui Liu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05400v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05400v1" />
    <published>2026-06-03T20:09:39Z</published>
    <updated>2026-06-03T20:09:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuanhe Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;26 pages, 9 figures. Comments are welcome&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration</title>
    <author>
      <name>Cristina Carleo</name>
    </author>
    <author>
      <name>Pietro Liguori</name>
    </author>
    <author>
      <name>Naghmeh Ivaki</name>
    </author>
    <author>
      <name>Domenico Cotroneo</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05396v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05396v1" />
    <published>2026-06-03T20:06:52Z</published>
    <updated>2026-06-03T20:06:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Cristina Carleo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesising vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE (which would shift the labeling burden from open-ended detection to bounded binary confirmation), but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, replicated three times per condition. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B refuses 100% of prompts, the 7B refuses 73% of PromSec but only 5% of SafeCoder, whereas the 3B is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B), separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents</title>
    <author>
      <name>Yunhao Yang</name>
    </author>
    <author>
      <name>Neel P. Bhatt</name>
    </author>
    <author>
      <name>Kevin Wang</name>
    </author>
    <author>
      <name>Samuel Tetteh</name>
    </author>
    <author>
      <name>Zhangyang Wang</name>
    </author>
    <author>
      <name>Ufuk Topcu</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05395v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05395v1" />
    <published>2026-06-03T20:02:35Z</published>
    <updated>2026-06-03T20:02:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yunhao Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models</title>
    <author>
      <name>Yongzhong Xu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05378v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05378v1" />
    <published>2026-06-03T19:27:07Z</published>
    <updated>2026-06-03T19:27:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yongzhong Xu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models.
  We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models.
  We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;27 pages, 3 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Representation Curriculum: Stagewise Training for Robust Ranking and Allocation</title>
    <author>
      <name>Ehsan Ebrahimzadeh</name>
    </author>
    <author>
      <name>Sina Baharlouei</name>
    </author>
    <author>
      <name>Abraham Bagherjeiran</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09891v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09891v1" />
    <published>2026-06-03T19:02:19Z</published>
    <updated>2026-06-03T19:02:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ehsan Ebrahimzadeh et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] GITCO: Gated Inference-Time Context Optimization in TSFMs</title>
    <author>
      <name>Manya Pandey</name>
    </author>
    <author>
      <name>Dhruv Kumar</name>
    </author>
    <author>
      <name>Murari Mandal</name>
    </author>
    <author>
      <name>Saurabh Deshpande</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05332v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05332v1" />
    <published>2026-06-03T18:17:40Z</published>
    <updated>2026-06-03T18:17:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Manya Pandey et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight framework with three components (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average +1.95% MASE reduction while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026 Workshop on Foundation Models for Structured Data&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] LoRi: Low-Rank Distillation for Implicit Reasoning</title>
    <author>
      <name>Ryan Solgi</name>
    </author>
    <author>
      <name>Jiayi Tian</name>
    </author>
    <author>
      <name>Zheng Zhang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05315v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05315v1" />
    <published>2026-06-03T18:05:50Z</published>
    <updated>2026-06-03T18:05:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ryan Solgi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems</title>
    <author>
      <name>Chen Huang</name>
    </author>
    <author>
      <name>Yuhao Wu</name>
    </author>
    <author>
      <name>Wenxuan Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05304v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05304v1" />
    <published>2026-06-03T18:00:22Z</published>
    <updated>2026-06-03T18:00:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chen Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands&amp;#x27; resolve rate at -10% tokens-per-resolved, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents</title>
    <author>
      <name>Dae Yon Hwang</name>
    </author>
    <author>
      <name>Raunaq Suri</name>
    </author>
    <author>
      <name>Valentin Villecroze</name>
    </author>
    <author>
      <name>Anthony L. Caterini</name>
    </author>
    <author>
      <name>Jesse C. Cresswell</name>
    </author>
    <author>
      <name>Noël Vouitsis</name>
    </author>
    <author>
      <name>Brendan Leigh Ross</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05296v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05296v1" />
    <published>2026-06-03T18:00:07Z</published>
    <updated>2026-06-03T18:00:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dae Yon Hwang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available at https://github.com/layer6ai-labs/Agentic-Monte-Carlo&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation</title>
    <author>
      <name>Tobia Poppi</name>
    </author>
    <author>
      <name>Silvia Cappelletti</name>
    </author>
    <author>
      <name>Sara Sarto</name>
    </author>
    <author>
      <name>Florian Schiffers</name>
    </author>
    <author>
      <name>Garin Kessler</name>
    </author>
    <author>
      <name>Marcella Cornia</name>
    </author>
    <author>
      <name>Lorenzo Baraldi</name>
    </author>
    <author>
      <name>Rita Cucchiara</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05290v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05290v1" />
    <published>2026-06-03T18:00:04Z</published>
    <updated>2026-06-03T18:00:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tobia Poppi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://aimagelab.github.io/cross-model-safety-representations/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Personal AI Agent for Camera Roll VQA</title>
    <author>
      <name>Thao Nguyen</name>
    </author>
    <author>
      <name>Krishna Kumar Singh</name>
    </author>
    <author>
      <name>Donghyun Kim</name>
    </author>
    <author>
      <name>Yong Jae Lee</name>
    </author>
    <author>
      <name>Yuheng Li</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05275v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05275v1" />
    <published>2026-06-03T17:59:30Z</published>
    <updated>2026-06-03T17:59:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thao Nguyen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user&amp;#x27;s personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., &amp;quot;Name of the food I tried yesterday?&amp;quot;) to more open-ended ones (e.g., &amp;quot;Recommend some dishes I have never eaten before&amp;quot;). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents&amp;#x27; long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page, code, and demo: https://thaoshibe.github.io/camroll&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics</title>
    <author>
      <name>Stella Biderman</name>
    </author>
    <author>
      <name>Mohammad Aflah Khan</name>
    </author>
    <author>
      <name>Niloofar Mireshghallah</name>
    </author>
    <author>
      <name>Catherine Arnett</name>
    </author>
    <author>
      <name>Fazl Barez</name>
    </author>
    <author>
      <name>Naomi Saphra</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06533v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06533v1" />
    <published>2026-06-03T17:58:14Z</published>
    <updated>2026-06-03T17:58:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Stella Biderman et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted as an oral at ICML: https://icml.cc/virtual/2026/poster/67142&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Streaming Communication in Multi-Agent Reasoning</title>
    <author>
      <name>Zhen Yang</name>
    </author>
    <author>
      <name>Xiaogang Xu</name>
    </author>
    <author>
      <name>Wen Wang</name>
    </author>
    <author>
      <name>Cong Chen</name>
    </author>
    <author>
      <name>Xander Xu</name>
    </author>
    <author>
      <name>Ying-Cong Chen</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05158v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05158v1" />
    <published>2026-06-03T17:57:04Z</published>
    <updated>2026-06-03T17:57:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhen Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-agent reasoning systems adopt a &amp;quot;generate-then-transfer&amp;quot; paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a &amp;quot;step-level scaling law&amp;quot;: increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;project page: https://zhenyangcs.github.io/StreamMA-website/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Reinforcement Learning from Rich Feedback with Distributional DAgger</title>
    <author>
      <name>Rishabh Agrawal</name>
    </author>
    <author>
      <name>Jacob Fein-Ashley</name>
    </author>
    <author>
      <name>Paria Rashidinejad</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05152v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05152v2" />
    <published>2026-06-03T17:54:04Z</published>
    <updated>2026-06-05T17:44:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rishabh Agrawal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes</title>
    <author>
      <name>Josef Bengtson</name>
    </author>
    <author>
      <name>Yaroslava Lochman</name>
    </author>
    <author>
      <name>Fredrik Kahl</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05142v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05142v1" />
    <published>2026-06-03T17:49:48Z</published>
    <updated>2026-06-03T17:49:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Josef Bengtson et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://gem-nr.github.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent</title>
    <author>
      <name>Linyao Chen</name>
    </author>
    <author>
      <name>Qinlao Zhao</name>
    </author>
    <author>
      <name>Zechen Li</name>
    </author>
    <author>
      <name>Mingming Li</name>
    </author>
    <author>
      <name>Likun Ni</name>
    </author>
    <author>
      <name>Jinyu Chen</name>
    </author>
    <author>
      <name>Yuhao Yao</name>
    </author>
    <author>
      <name>Xuan Song</name>
    </author>
    <author>
      <name>Noboru Koshizuka</name>
    </author>
    <author>
      <name>Hiroki Kobayashi</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05130v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05130v1" />
    <published>2026-06-03T17:34:01Z</published>
    <updated>2026-06-03T17:34:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Linyao Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Automatic Generation of Titles for Research Papers Using Language Models</title>
    <author>
      <name>Tohida Rehman</name>
    </author>
    <author>
      <name>Debarshi Kumar Sanyal</name>
    </author>
    <author>
      <name>Samiran Chattopadhyay</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05085v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05085v1" />
    <published>2026-06-03T16:49:57Z</published>
    <updated>2026-06-03T16:49:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tohida Rehman et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;24 pages, 24 tables, 1 figure&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?</title>
    <author>
      <name>Zhangchen Xu</name>
    </author>
    <author>
      <name>Junda Chen</name>
    </author>
    <author>
      <name>Yue Huang</name>
    </author>
    <author>
      <name>Dongfu Jiang</name>
    </author>
    <author>
      <name>Jiefeng Chen</name>
    </author>
    <author>
      <name>Hang Hua</name>
    </author>
    <author>
      <name>Zijian Wu</name>
    </author>
    <author>
      <name>Zheyuan Liu</name>
    </author>
    <author>
      <name>Zexue He</name>
    </author>
    <author>
      <name>Lichi Li</name>
    </author>
    <author>
      <name>Shizhe Diao</name>
    </author>
    <author>
      <name>Jiaxin Pei</name>
    </author>
    <author>
      <name>Jinsung Yoon</name>
    </author>
    <author>
      <name>Hao Zhang</name>
    </author>
    <author>
      <name>Mengdi Wang</name>
    </author>
    <author>
      <name>Radha Poovendran</name>
    </author>
    <author>
      <name>Misha Sra</name>
    </author>
    <author>
      <name>Alex Pentland</name>
    </author>
    <author>
      <name>Zichen Chen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05080v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05080v1" />
    <published>2026-06-03T16:36:54Z</published>
    <updated>2026-06-03T16:36:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhangchen Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle &amp;amp; challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent&amp;#x27;s initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code: https://github.com/autolabhq/autolab ; Website: https://autolab.moe/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] NIV: Neural Axis Variations for Variable Font Generation</title>
    <author>
      <name>Nadav Benedek</name>
    </author>
    <author>
      <name>Ariel Shamir</name>
    </author>
    <author>
      <name>Ohad Fried</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05261v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05261v1" />
    <published>2026-06-03T16:17:43Z</published>
    <updated>2026-06-03T16:17:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nadav Benedek et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols</title>
    <author>
      <name>Samuel H. Christie</name>
    </author>
    <author>
      <name>Amit K. Chopra</name>
    </author>
    <author>
      <name>Munindar P. Singh</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05043v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05043v1" />
    <published>2026-06-03T16:05:25Z</published>
    <updated>2026-06-03T16:05:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Samuel H. Christie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Presented in the Engineering Multiagent Systems Workshop co-located with the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery</title>
    <author>
      <name>Arquimedes Canedo</name>
    </author>
    <author>
      <name>Grama Chethan</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05037v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05037v1" />
    <published>2026-06-03T16:02:11Z</published>
    <updated>2026-06-03T16:02:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Arquimedes Canedo, Grama Chethan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery\_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot ($N{=}30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by $+36.7$--$40.0$pp over plain-English diagnoses on Anthropic models (Fisher&amp;#x27;s exact $p \le 0.0022$), at $1.8$--$2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p{=}0.435$); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit\_prompt\_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment</title>
    <author>
      <name>Kokil Jaidka</name>
    </author>
    <author>
      <name>Saifuddin Ahmed</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05256v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05256v1" />
    <published>2026-06-03T15:58:32Z</published>
    <updated>2026-06-03T15:58:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kokil Jaidka, Saifuddin Ahmed&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This study analyzes a publicly released dataset from a discontinued field experiment on Reddit&amp;#x27;s r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks</title>
    <author>
      <name>Jie Huang</name>
    </author>
    <author>
      <name>Ruixun Liu</name>
    </author>
    <author>
      <name>Sirui Sun</name>
    </author>
    <author>
      <name>Xinyi Yang</name>
    </author>
    <author>
      <name>Yin Li</name>
    </author>
    <author>
      <name>Yixin Zhu</name>
    </author>
    <author>
      <name>Yiwu Zhong</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05008v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05008v1" />
    <published>2026-06-03T15:28:57Z</published>
    <updated>2026-06-03T15:28:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jie Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;We present an evaluation designed for multi-modal memory in multi-modal models&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models</title>
    <author>
      <name>Peihua Mai</name>
    </author>
    <author>
      <name>Xuanrong Gao</name>
    </author>
    <author>
      <name>Youlong Ding</name>
    </author>
    <author>
      <name>Xianglong Du</name>
    </author>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Yan Pang</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05004v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05004v1" />
    <published>2026-06-03T15:23:06Z</published>
    <updated>2026-06-03T15:23:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Peihua Mai et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over $20\%$ higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to $5\times$ compared to non-batched inference.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;accepted by ACL 2026 (main)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability</title>
    <author>
      <name>Alexandre Cristovão Maiorano</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05252v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05252v1" />
    <published>2026-06-03T14:26:25Z</published>
    <updated>2026-06-03T14:26:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Alexandre Cristovão Maiorano&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Security teams routinely simulate attacks against their own systems to check whether their monitoring would catch a real intruder. These Breach-and-Attack-Simulation (BAS) tools surface findings, but the security information and event management (SIEM) systems that watch production need detection rules -- and today a human bridges that gap by hand, reading each finding and writing the corresponding Sigma rule (a vendor-neutral detection format). We show this translation can be partially automated when probes are drawn from a locked corpus, so each finding carries a stable identifier back to the originating probe. We describe a deterministic synthesis function that maps each finding to a starter Sigma rule through a small template library (N=23, indexed by categories from the OWASP LLM and Web Top 10), with a back-reference to the originating finding and its MITRE ATT&amp;amp;CK technique. On two locked corpora (17-probe LLM, 23-probe Web), every bypassed-probe finding yields a starter rule, and all 17/17 emitted rules parse and convert to Splunk and Elasticsearch backends. Replayed through a live OpenSearch SIEM, the LLM rules fire on 30% of a held-out AdvBench subset and 14% of HarmBench at 7.7% false positives on a benign baseline; the Web side is validated structurally, not against a held-out attack set. The contribution is a verifiable, byte-stable path from BAS finding to operator-deployable starter rule, re-derivable from the published corpus and template library alone -- trading the breadth of LLM-generative methods for exact reproducibility and a typed traceback from any fired alert to the originating probe.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages, 3 figures, 11 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] What Type of Inference is Active Inference?</title>
    <author>
      <name>Wouter W. L. Nuijten</name>
    </author>
    <author>
      <name>Mykola Lukashchuk</name>
    </author>
    <author>
      <name>Thijs van de Laar</name>
    </author>
    <author>
      <name>Bert de Vries</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04935v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04935v1" />
    <published>2026-06-03T14:24:53Z</published>
    <updated>2026-06-03T14:24:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wouter W. L. Nuijten et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic corrections matter most when observations are merely suggestive.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning</title>
    <author>
      <name>Xuekang Wang</name>
    </author>
    <author>
      <name>Zhuoyuan Hao</name>
    </author>
    <author>
      <name>Shuo Hou</name>
    </author>
    <author>
      <name>Hao Peng</name>
    </author>
    <author>
      <name>Juanzi Li</name>
    </author>
    <author>
      <name>Xiaozhi Wang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04923v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04923v1" />
    <published>2026-06-03T14:18:23Z</published>
    <updated>2026-06-03T14:18:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xuekang Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models</title>
    <author>
      <name>Tran Dinh Tien</name>
    </author>
    <author>
      <name>Zhiqiang Shen</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04922v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04922v1" />
    <published>2026-06-03T14:17:57Z</published>
    <updated>2026-06-03T14:17:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tran Dinh Tien, Zhiqiang Shen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Preprint. Code is available at https://github.com/tientrandinh/OGKD&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] 'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions</title>
    <author>
      <name>Nils Dycke</name>
    </author>
    <author>
      <name>Marina Sakharova</name>
    </author>
    <author>
      <name>Nico Daheim</name>
    </author>
    <author>
      <name>Iryna Gurevych</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04906v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04906v1" />
    <published>2026-06-03T14:05:14Z</published>
    <updated>2026-06-03T14:05:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nils Dycke et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LO] Provably Auditable and Safe LLM Agents from Human-Authored Ontologies</title>
    <author>
      <name>Aaron Sterling</name>
    </author>
    <category term="cs.LO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04903v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04903v1" />
    <published>2026-06-03T14:01:33Z</published>
    <updated>2026-06-03T14:01:33Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Aaron Sterling&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains: healthcare billing compliance and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creating agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill in order to work the problems in the domain.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LO] Abduction Prover in Isabelle/HOL</title>
    <author>
      <name>Yutaka Nagashima</name>
    </author>
    <author>
      <name>Daniel Sebastian Goc</name>
    </author>
    <category term="cs.LO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04877v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04877v1" />
    <published>2026-06-03T13:41:08Z</published>
    <updated>2026-06-03T13:41:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yutaka Nagashima, Daniel Sebastian Goc&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Proof assistants based on expressive logics suffer limited automation for proof search, raising the cost of formal verification based on proof assistants. We address this problem by introducing the Abduction Prover for Isabelle/HOL. Given a challenging proof goal, the Abduction Prover constructs a proof script for the goal by identifying useful conjectures using abductive reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Isabelle2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety</title>
    <author>
      <name>Yanjing Ren</name>
    </author>
    <author>
      <name>Reza Ebrahimi</name>
    </author>
    <author>
      <name>TengTeng Ma</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04867v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04867v1" />
    <published>2026-06-03T13:33:42Z</published>
    <updated>2026-06-03T13:33:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yanjing Ren et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Learning Empirically Admissible Neural Heuristics for Combinatorial Search</title>
    <author>
      <name>Siddharth Sahay</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04860v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04860v1" />
    <published>2026-06-03T13:27:46Z</published>
    <updated>2026-06-03T13:27:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Siddharth Sahay&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Finding optimal solution paths for combinatorial puzzles like the Rubik&amp;#x27;s Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A*, guarantee path optimality only when using an admissible heuristic, that is, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik&amp;#x27;s Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages, 3 figures, 2 tables, 1 algorithm&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] PreAct-Bench: Benchmarking Predictive Monitoring in LLMs</title>
    <author>
      <name>Hainiu Xu</name>
    </author>
    <author>
      <name>Italo Luis da Silva</name>
    </author>
    <author>
      <name>Jiangnan Ye</name>
    </author>
    <author>
      <name>Yuhao Wang</name>
    </author>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Linyi Yang</name>
    </author>
    <author>
      <name>Jonathan Richard Schwarz</name>
    </author>
    <author>
      <name>Nicola Paoletti</name>
    </author>
    <author>
      <name>Yulan He</name>
    </author>
    <author>
      <name>Hanqi Yan</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09890v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09890v1" />
    <published>2026-06-03T13:26:41Z</published>
    <updated>2026-06-03T13:26:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hainiu Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] OA-CutMix: Correcting the Label Bias of CutMix</title>
    <author>
      <name>Tobias Christian Nauen</name>
    </author>
    <author>
      <name>Stanislav Frolov</name>
    </author>
    <author>
      <name>Federico Raue</name>
    </author>
    <author>
      <name>Brian B. Moser</name>
    </author>
    <author>
      <name>Andreas Dengel</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04820v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04820v1" />
    <published>2026-06-03T12:42:59Z</published>
    <updated>2026-06-03T12:42:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tobias Christian Nauen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is $21.5\%$. In $17\%$ of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AIP: A Graph Representation for Learning and Governing Agent Skills</title>
    <author>
      <name>Zachary Blumenfeld</name>
    </author>
    <author>
      <name>Jim Webber</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04781v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04781v1" />
    <published>2026-06-03T12:02:49Z</published>
    <updated>2026-06-03T12:02:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zachary Blumenfeld, Jim Webber&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet&amp;#x27;s mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[hep-ex] Archi: Agentic Operations at the CMS Experiment</title>
    <author>
      <name>Pietro Lugato</name>
    </author>
    <author>
      <name>Luca Lavezzo</name>
    </author>
    <author>
      <name>Jason Mohoney</name>
    </author>
    <author>
      <name>Hasan Ozturk</name>
    </author>
    <author>
      <name>Muhammad Hassan Ahmed</name>
    </author>
    <author>
      <name>Juan Pablo Salas</name>
    </author>
    <author>
      <name>Viphava Ohm</name>
    </author>
    <author>
      <name>Krittin Phornsiricharoenphant</name>
    </author>
    <author>
      <name>Gabriele Benelli</name>
    </author>
    <author>
      <name>Mariarosaria D'Alfonso</name>
    </author>
    <author>
      <name>Manasvita Joshi</name>
    </author>
    <author>
      <name>Warren Nam</name>
    </author>
    <author>
      <name>Aron Soha</name>
    </author>
    <author>
      <name>Samantha Sunnarborg</name>
    </author>
    <author>
      <name>Austin Swinney</name>
    </author>
    <author>
      <name>Jack Tucker</name>
    </author>
    <author>
      <name>Dmytro Kovalskyi</name>
    </author>
    <author>
      <name>Tim Kraska</name>
    </author>
    <author>
      <name>Christoph Paus</name>
    </author>
    <category term="hep-ex" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04755v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04755v1" />
    <published>2026-06-03T11:38:30Z</published>
    <updated>2026-06-03T11:38:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Pietro Lugato et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN&amp;#x27;s LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers</title>
    <author>
      <name>Ossi Lehtinen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04752v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04752v2" />
    <published>2026-06-03T11:35:09Z</published>
    <updated>2026-06-08T08:15:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ossi Lehtinen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We audit eight input encoders -- a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding -- on a synthetic benchmark where channel identity is informative and on ETTh1, scored by next-step negative log-likelihood. The headline is practical near-equivalence within a wide &amp;quot;top tier&amp;quot;: the standard per-channel linear projection matches every alternative up to small, statistically real but practically modest differences. A direct geometric probe attributes this to a spontaneous orthogonalisation of the per-channel projections: they end up near-orthogonal with no explicit regulariser, letting the standard linear recover channel identity from the summed embedding. Two encoders lose decisively: the shared-scalar baseline collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline overfits universally on the synthetic benchmark and underperforms on both. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small $C$ by extending this orthogonality to the positional subspace; a nonlinear MLP stem edges them at the largest $C$, with the gap shrinking under more training data. The practical recommendation: use the standard per-channel linear projection by default; reach for something more elaborate only when the task calls for it.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;21 pages, 1 figure, 8 tables. Code: https://github.com/OssiLehtinen/channel-encoder-audit&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment</title>
    <author>
      <name>Ajay Vishwanath</name>
    </author>
    <author>
      <name>Christian Omlin</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04750v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04750v1" />
    <published>2026-06-03T11:31:37Z</published>
    <updated>2026-06-03T11:31:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ajay Vishwanath, Christian Omlin&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, as reflected in superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent&amp;#x27;s teleology and makes its behavior human-level interpretable.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models</title>
    <author>
      <name>Sabrina Kaniewski</name>
    </author>
    <author>
      <name>Fabian Schmidt</name>
    </author>
    <author>
      <name>Tobias Heer</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04739v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04739v1" />
    <published>2026-06-03T11:20:37Z</published>
    <updated>2026-06-03T11:20:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sabrina Kaniewski et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at AI&amp;amp;CCPS 2026 workshop, co-located with the 21st International Conference on Availability, Reliability and Security (ARES 2026). This is the authors&amp;#x27; preprint version&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation</title>
    <author>
      <name>Kailun Huang</name>
    </author>
    <author>
      <name>Zikang Xie</name>
    </author>
    <author>
      <name>Yanzhe Xie</name>
    </author>
    <author>
      <name>Panpan Liao</name>
    </author>
    <author>
      <name>Fanghai Zhang</name>
    </author>
    <author>
      <name>Yanheng Mai</name>
    </author>
    <author>
      <name>Wenhao Xu</name>
    </author>
    <author>
      <name>Yunheng Wang</name>
    </author>
    <author>
      <name>Renjing Xu</name>
    </author>
    <author>
      <name>Haohui Huang</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04718v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04718v2" />
    <published>2026-06-03T10:51:46Z</published>
    <updated>2026-06-09T06:35:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kailun Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu and Haohui Huang&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation</title>
    <author>
      <name>Amirhossein Movahedisefat</name>
    </author>
    <author>
      <name>Amirreza Fateh</name>
    </author>
    <author>
      <name>Mohammad Reza Mohammadi</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04705v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04705v1" />
    <published>2026-06-03T10:34:15Z</published>
    <updated>2026-06-03T10:34:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Amirhossein Movahedisefat et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation</title>
    <author>
      <name>Mirza Muhammad Mobeen</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04684v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04684v1" />
    <published>2026-06-03T10:05:18Z</published>
    <updated>2026-06-03T10:05:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mirza Muhammad Mobeen&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The real-time demands of video processing severely limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic-monitoring settings. Unconstrained conditions, such as drastic variations in illumination, acute camera angles, high vehicle speeds, and heavy physical occlusion, often lead to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, and then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector localizes the license plate area and passes the cropped region to an EasyOCR stage constrained by positional syntax verification. Finally, an offline temporal bounding-box interpolation mechanism reconstructs fragmented paths.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages. Code: https://github.com/mobeen-pmo/Automatic-License-Plate-Recognition&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models</title>
    <author>
      <name>Ayushman Raghuvanshi</name>
    </author>
    <author>
      <name>Thummaluru Siddartha Reddy</name>
    </author>
    <author>
      <name>Sundeep Prabhakar Chepuri</name>
    </author>
    <author>
      <name>Mahesh Chandran</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04672v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04672v2" />
    <published>2026-06-03T09:54:35Z</published>
    <updated>2026-06-04T09:25:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ayushman Raghuvanshi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation.
  Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples</title>
    <author>
      <name>Mengao Zhang</name>
    </author>
    <author>
      <name>Xiang Yang</name>
    </author>
    <author>
      <name>Chang Liu</name>
    </author>
    <author>
      <name>Tianhui Tan</name>
    </author>
    <author>
      <name>Ke-wei Huang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04646v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04646v1" />
    <published>2026-06-03T09:14:43Z</published>
    <updated>2026-06-03T09:14:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mengao Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SocraticPO: Policy Optimization via Interactive Guidance</title>
    <author>
      <name>Zirui Liu</name>
    </author>
    <author>
      <name>Jie Ouyang</name>
    </author>
    <author>
      <name>Qi Liu</name>
    </author>
    <author>
      <name>Xianquan Wang</name>
    </author>
    <author>
      <name>Jiayu Liu</name>
    </author>
    <author>
      <name>Tingyue Pan</name>
    </author>
    <author>
      <name>Qingchuan Li</name>
    </author>
    <author>
      <name>Jing Sha</name>
    </author>
    <author>
      <name>Zhenya Huang</name>
    </author>
    <author>
      <name>Shijin Wang</name>
    </author>
    <author>
      <name>Enhong Chen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09887v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09887v1" />
    <published>2026-06-03T09:08:29Z</published>
    <updated>2026-06-03T09:08:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zirui Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs</title>
    <author>
      <name>Yuhao Zhang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09886v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09886v1" />
    <published>2026-06-03T08:41:48Z</published>
    <updated>2026-06-03T08:41:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuhao Zhang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limited by the memory wall: the full expert pool must remain resident to support token-dependent routing. Expert pruning is a direct remedy, but prior criteria typically score experts independently and overlook that MoE inference is inherently coalitional, where outputs arise from routed top-$k$ expert combinations. We propose SHAPE, a task-driven pruning framework that explicitly models intra-layer expert cooperation. SHAPE formulates routing traces on a small calibration set as an empirical cooperative game and assigns interaction-aware expert values via a Shapley-style attribution over observed top-$k$ coalitions, enabling the identification of experts that are essential for high-utility collaborations rather than merely frequent. To preserve MoE topology under a global pruning budget, SHAPE further introduces a quality-coverage selection rule that retains, in each layer, the minimal expert subset covering an $\alpha$ fraction of non-negative Shapley mass, while using bisection to match a target keep rate. Experiments on three modern MoE backbones (Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite) across diverse benchmarks show that SHAPE consistently improves robustness over global and layer-wise pruning variants, maintaining competitive accuracy under 20% and 40% expert pruning without additional training and delivering clear reductions in peak GPU memory footprint. The open-source code is available at https://github.com/Alizen-1009/Shapley-Moe.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations</title>
    <author>
      <name>Noah Lund Syrdal</name>
    </author>
    <author>
      <name>Anders Vestrum</name>
    </author>
    <author>
      <name>Jorgen Bergh</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04550v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04550v1" />
    <published>2026-06-03T07:34:46Z</published>
    <updated>2026-06-03T07:34:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Noah Lund Syrdal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages, 30 figures. Code available at https://github.com/andersvestrum/carbon-aware-recsys&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models</title>
    <author>
      <name>Boyan Han</name>
    </author>
    <author>
      <name>Yiwei Wang</name>
    </author>
    <author>
      <name>Yi Song</name>
    </author>
    <author>
      <name>Yujun Cai</name>
    </author>
    <author>
      <name>Chi Zhang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04535v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04535v1" />
    <published>2026-06-03T07:18:23Z</published>
    <updated>2026-06-03T07:18:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Boyan Han et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.IR] ANN Search: Recall What Matters</title>
    <author>
      <name>Dimitris Dimitropoulos</name>
    </author>
    <author>
      <name>Nikos Mamoulis</name>
    </author>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04522v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04522v1" />
    <published>2026-06-03T07:00:50Z</published>
    <updated>2026-06-03T07:00:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dimitris Dimitropoulos, Nikos Mamoulis&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling</title>
    <author>
      <name>Guangcheng Zhu</name>
    </author>
    <author>
      <name>Shenzhi Yang</name>
    </author>
    <author>
      <name>Haobo Wang</name>
    </author>
    <author>
      <name>Xing Zheng</name>
    </author>
    <author>
      <name>Yingfan MA</name>
    </author>
    <author>
      <name>Xuening Feng</name>
    </author>
    <author>
      <name>Zhongqi Chen</name>
    </author>
    <author>
      <name>Kai Tang</name>
    </author>
    <author>
      <name>Zhengqing Zang</name>
    </author>
    <author>
      <name>Bowen Song</name>
    </author>
    <author>
      <name>Weiqiang Wang</name>
    </author>
    <author>
      <name>Gang Chen</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04516v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04516v1" />
    <published>2026-06-03T06:47:50Z</published>
    <updated>2026-06-03T06:47:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Guangcheng Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning</title>
    <author>
      <name>Zhihua Wang</name>
    </author>
    <author>
      <name>Yanping Li</name>
    </author>
    <author>
      <name>Yizhang Liu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04493v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04493v1" />
    <published>2026-06-03T06:18:05Z</published>
    <updated>2026-06-03T06:18:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhihua Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning</title>
    <author>
      <name>Qingxu Fu</name>
    </author>
    <author>
      <name>Boyin Liu</name>
    </author>
    <author>
      <name>Shuchang Tao</name>
    </author>
    <author>
      <name>Zhaoyang Liu</name>
    </author>
    <author>
      <name>Bolin Ding</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04484v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04484v1" />
    <published>2026-06-03T06:02:52Z</published>
    <updated>2026-06-03T06:02:52Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qingxu Fu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Technical report, 27 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] SePO: Self-Evolving Prompt Agent for System Prompt Optimization</title>
    <author>
      <name>Wangcheng Tao</name>
    </author>
    <author>
      <name>Han Wu</name>
    </author>
    <author>
      <name>Weng-Fai Wong</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04465v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04465v1" />
    <published>2026-06-03T05:20:30Z</published>
    <updated>2026-06-03T05:20:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wangcheng Tao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents&amp;#x27; system prompts, yet leave the prompt agent&amp;#x27;s own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent&amp;#x27;s own system prompt as an optimization target alongside task agents&amp;#x27; system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents&amp;#x27; system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME&amp;#x27;25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;26 pages. Code: https://github.com/taowangcheng/SePO&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?</title>
    <author>
      <name>Xinyu Lu</name>
    </author>
    <author>
      <name>Tianshu Wang</name>
    </author>
    <author>
      <name>Pengbo Wang</name>
    </author>
    <author>
      <name>zujie wen</name>
    </author>
    <author>
      <name>Zhiqiang Zhang</name>
    </author>
    <author>
      <name>Jun Zhou</name>
    </author>
    <author>
      <name>Boxi Cao</name>
    </author>
    <author>
      <name>Yaojie Lu</name>
    </author>
    <author>
      <name>Hongyu Lin</name>
    </author>
    <author>
      <name>Xianpei Han</name>
    </author>
    <author>
      <name>Le Sun</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04455v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04455v1" />
    <published>2026-06-03T04:58:17Z</published>
    <updated>2026-06-03T04:58:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xinyu Lu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Website: https://meta-agent-challenge.com/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation</title>
    <author>
      <name>Saroj Mishra</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04435v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04435v1" />
    <published>2026-06-03T04:33:47Z</published>
    <updated>2026-06-03T04:33:47Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Saroj Mishra&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, and reducing error propagation by 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.IV] L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI</title>
    <author>
      <name>Arda Atalık</name>
    </author>
    <author>
      <name>Sumit Chopra</name>
    </author>
    <author>
      <name>Daniel K. Sodickson</name>
    </author>
    <category term="eess.IV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04419v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04419v1" />
    <published>2026-06-03T03:56:25Z</published>
    <updated>2026-06-03T03:56:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Arda Atalık et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to MICCAI 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization</title>
    <author>
      <name>Yidi Zhouluo</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04409v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04409v2" />
    <published>2026-06-03T03:36:34Z</published>
    <updated>2026-06-06T02:49:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yidi Zhouluo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 9 figures, 4 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation</title>
    <author>
      <name>Jingbo Wen</name>
    </author>
    <author>
      <name>Liang He</name>
    </author>
    <author>
      <name>Ziqi He</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04402v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04402v1" />
    <published>2026-06-03T03:29:57Z</published>
    <updated>2026-06-03T03:29:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jingbo Wen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval</title>
    <author>
      <name>Jiaxi Li</name>
    </author>
    <author>
      <name>Ke Deng</name>
    </author>
    <author>
      <name>Yun Wang</name>
    </author>
    <author>
      <name>Jingyuan Huang</name>
    </author>
    <author>
      <name>Yucheng Shi</name>
    </author>
    <author>
      <name>Qiaoyu Tan</name>
    </author>
    <author>
      <name>Jin Lu</name>
    </author>
    <author>
      <name>Ninghao Liu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04391v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04391v1" />
    <published>2026-06-03T03:11:50Z</published>
    <updated>2026-06-03T03:11:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaxi Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises</title>
    <author>
      <name>Muhammad Hadi</name>
    </author>
    <author>
      <name>Muhammad Jahangir</name>
    </author>
    <author>
      <name>Talha Shafique</name>
    </author>
    <author>
      <name>Muhammad Khuram Shahzad</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04388v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04388v1" />
    <published>2026-06-03T03:06:39Z</published>
    <updated>2026-06-03T03:06:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Muhammad Hadi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non-IID distributions and decentralized security threats remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation-based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 5 figures; code available at https://github.com/error8149/FedAnilPlus-Optimized&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models</title>
    <author>
      <name>Chen Chu</name>
    </author>
    <author>
      <name>Bita Azarijoo</name>
    </author>
    <author>
      <name>Li Xiong</name>
    </author>
    <author>
      <name>Khurram Shafique</name>
    </author>
    <author>
      <name>Cyrus Shahabi</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04381v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04381v1" />
    <published>2026-06-03T02:54:59Z</published>
    <updated>2026-06-03T02:54:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chen Chu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely symbolic, arising from pattern matching over spatial language rather than true geometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the Spatial Language Model (SLM), the first multimodal LLM that treats location information as a first-class modality and enables geometric spatial reasoning within the model&amp;#x27;s inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a Spatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named SpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative-position tasks. Extensive experiments show that SLM significantly outperforms existing LLM-based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning.
  Our instruction dataset, evaluation benchmark, model training code, and model checkpoints can be found at https://github.com/chuchen2017/SLM.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers</title>
    <author>
      <name>Biao Qian</name>
    </author>
    <author>
      <name>Yang Wang</name>
    </author>
    <author>
      <name>Yong Wu</name>
    </author>
    <author>
      <name>Jungong Han</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04373v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04373v2" />
    <published>2026-06-03T02:38:19Z</published>
    <updated>2026-06-05T08:17:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Biao Qian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism over the classical convolutional operation. However, previous DFQ methods for ViTs often suffer from a distribution mismatch between synthetic samples and the input distribution expected by quantized models Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism are predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q&amp;#x27;s outputs. To this end, we incorporate differential entropy maximization over the patch similarity of synthetic samples, to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy endows MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, preserving desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to appear at ICML 2026, Seoul, Korea&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming</title>
    <author>
      <name>Nicholas Saban</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05233v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05233v1" />
    <published>2026-06-03T01:21:59Z</published>
    <updated>2026-06-03T01:21:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nicholas Saban&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper&amp;#x27;s panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature&amp;#x27;s high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models</title>
    <author>
      <name>Julian Skirzynski</name>
    </author>
    <author>
      <name>Harry Cheon</name>
    </author>
    <author>
      <name>Shreyas Kadekodi</name>
    </author>
    <author>
      <name>Meredith Stewart</name>
    </author>
    <author>
      <name>Berk Ustun</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04326v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04326v1" />
    <published>2026-06-03T01:01:05Z</published>
    <updated>2026-06-03T01:01:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Julian Skirzynski et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits researchers&amp;#x27; ability to determine which problems are suitable for these models, isolate the factors that drive their performance or lead to failures, or uncover which algorithms perform well. In this paper, we develop synthetic benchmarks for concept-bottleneck models, focusing on their two main use cases: decision support, in which models assist humans in making better decisions, and automation, in which models handle routine tasks without supervision. Our benchmarks can generate labeled datasets while controlling for properties that affect performance, including data modality, concept choice, annotation quality, and completeness. We demonstrate how the benchmarks can be used to evaluate representative classes of concept bottleneck models. Our demonstrations show how the benchmarks can diagnose failure modes and guide follow-up testing.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Benchmarks available at https://github.com/ustunb/concept-benchmark&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents</title>
    <author>
      <name>Manvendra Modgil</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04296v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04296v1" />
    <published>2026-06-02T23:54:27Z</published>
    <updated>2026-06-02T23:54:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Manvendra Modgil&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff&amp;#x27;s alpha = +0.047; best pairwise Cohen&amp;#x27;s kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector&amp;#x27;s accuracy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;11 pages, 5 tables. Code and data: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling</title>
    <author>
      <name>Yifan Wang</name>
    </author>
    <author>
      <name>Jinyi Mu</name>
    </author>
    <author>
      <name>Mayank Jobanputra</name>
    </author>
    <author>
      <name>Yu Wang</name>
    </author>
    <author>
      <name>Ji-Ung Lee</name>
    </author>
    <author>
      <name>Soyoung Oh</name>
    </author>
    <author>
      <name>Isabel Valera</name>
    </author>
    <author>
      <name>Vera Demberg</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04284v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04284v1" />
    <published>2026-06-02T23:19:12Z</published>
    <updated>2026-06-02T23:19:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yifan Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments</title>
    <author>
      <name>Saket Tiwari</name>
    </author>
    <author>
      <name>Tejas Kotwal</name>
    </author>
    <author>
      <name>George Konidaris</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04275v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04275v1" />
    <published>2026-06-02T23:02:54Z</published>
    <updated>2026-06-02T23:02:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Saket Tiwari et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a two-time-scale process: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment&amp;#x27;s state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Presented at ICLR 2026: https://openreview.net/forum?id=TdiRLe3rPA&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Characterizing initial human-AI proof formalization workflows</title>
    <author>
      <name>Katherine M. Collins</name>
    </author>
    <author>
      <name>Simon Frieder</name>
    </author>
    <author>
      <name>Jonas Bayer</name>
    </author>
    <author>
      <name>Jacob Loader</name>
    </author>
    <author>
      <name>Jeck Lim</name>
    </author>
    <author>
      <name>Peiyang Song</name>
    </author>
    <author>
      <name>Fabian Zaiser</name>
    </author>
    <author>
      <name>Lexin Zhou</name>
    </author>
    <author>
      <name>Shanda Li</name>
    </author>
    <author>
      <name>Sam Looi</name>
    </author>
    <author>
      <name>Joshua B. Tenenbaum</name>
    </author>
    <author>
      <name>Umang Bhatt</name>
    </author>
    <author>
      <name>Adrian Weller</name>
    </author>
    <author>
      <name>Jose Hernandez-Orallo</name>
    </author>
    <author>
      <name>Cameron E. Freer</name>
    </author>
    <author>
      <name>Valerie Chen</name>
    </author>
    <author>
      <name>Ilia Sucholutsky</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04273v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04273v1" />
    <published>2026-06-02T22:58:19Z</published>
    <updated>2026-06-02T22:58:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Katherine M. Collins et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems&amp;#x27; ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people&amp;#x27;s ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people&amp;#x27;s formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people&amp;#x27;s preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets</title>
    <author>
      <name>Stepan Konev</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04271v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04271v1" />
    <published>2026-06-02T22:50:55Z</published>
    <updated>2026-06-02T22:50:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Stepan Konev&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation</title>
    <author>
      <name>Yilong Wang</name>
    </author>
    <author>
      <name>Cheng Qian</name>
    </author>
    <author>
      <name>Edward Johns</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04269v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04269v1" />
    <published>2026-06-02T22:46:20Z</published>
    <updated>2026-06-02T22:46:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yilong Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)</title>
    <author>
      <name>Jingbiao Mei</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04240v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04240v1" />
    <published>2026-06-02T21:39:32Z</published>
    <updated>2026-06-02T21:39:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jingbiao Mei&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams&amp;#x27; systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;MDR Challenge Report at WWW2025&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Supportive Token Revealing for Fast Diffusion Language Model Decoding</title>
    <author>
      <name>Giries Abu Ayoub</name>
    </author>
    <author>
      <name>Mario Barbara</name>
    </author>
    <author>
      <name>Lluís Pastor-Pérez</name>
    </author>
    <author>
      <name>Tanja Bien</name>
    </author>
    <author>
      <name>Aneesh Barthakur</name>
    </author>
    <author>
      <name>Alaa Maalouf</name>
    </author>
    <author>
      <name>Loay Mualem</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04236v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04236v1" />
    <published>2026-06-02T21:35:54Z</published>
    <updated>2026-06-02T21:35:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Giries Abu Ayoub et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.DS] Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry</title>
    <author>
      <name>Jason L. Volk</name>
    </author>
    <category term="cs.DS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04227v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04227v2" />
    <published>2026-06-02T21:26:02Z</published>
    <updated>2026-06-06T00:13:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jason L. Volk&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$.
  Under a bounded local geometry assumption -- bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$ -- each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabasi-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $\mu$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;2 figures, 2 tables, 1 algorithm; code at https://github.com/Jasonleonardvolk/sigma&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification</title>
    <author>
      <name>Charlie Gauthier</name>
    </author>
    <author>
      <name>Sacha Morin</name>
    </author>
    <author>
      <name>Liam Paull</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04226v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04226v1" />
    <published>2026-06-02T21:25:49Z</published>
    <updated>2026-06-02T21:25:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Charlie Gauthier et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot&amp;#x27;s perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of approximately 39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICRA 2026 (Vienna); published on arxiv for archival purposes. See also https://percept-twin.github.io/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Integrating Local and Global Entropy for Uncertainty Quantification in LLMs</title>
    <author>
      <name>Johanne Medina</name>
    </author>
    <author>
      <name>Tianyi Zhou</name>
    </author>
    <author>
      <name>Keivin Isufaj</name>
    </author>
    <author>
      <name>Aristides Gionis</name>
    </author>
    <author>
      <name>Sanjay Chawla</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09875v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09875v1" />
    <published>2026-06-02T20:57:14Z</published>
    <updated>2026-06-02T20:57:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Johanne Medina et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 2 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MM] DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities</title>
    <author>
      <name>Sajad Ebrahimi</name>
    </author>
    <author>
      <name>Nima Jamali</name>
    </author>
    <author>
      <name>Bardia Shirsalimian</name>
    </author>
    <author>
      <name>Kelly McConvey</name>
    </author>
    <author>
      <name>Wentao Zhang</name>
    </author>
    <author>
      <name>Jalehsadat Mahdavimoghaddam</name>
    </author>
    <author>
      <name>Maksym Taranukhin</name>
    </author>
    <author>
      <name>Maura Grossman</name>
    </author>
    <author>
      <name>Vered Shwartz</name>
    </author>
    <author>
      <name>Yuntian Deng</name>
    </author>
    <author>
      <name>Ebrahim Bagheri</name>
    </author>
    <category term="cs.MM" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04205v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04205v1" />
    <published>2026-06-02T20:49:20Z</published>
    <updated>2026-06-02T20:49:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sajad Ebrahimi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions</title>
    <author>
      <name>Sherin Muckatira</name>
    </author>
    <author>
      <name>Jesse Geneson</name>
    </author>
    <author>
      <name>Slava Gerovitch</name>
    </author>
    <author>
      <name>Pavel Etingof</name>
    </author>
    <author>
      <name>Mikhail Gronas</name>
    </author>
    <author>
      <name>Anna Rumshisky</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06526v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06526v1" />
    <published>2026-06-02T20:38:39Z</published>
    <updated>2026-06-02T20:38:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sherin Muckatira et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages, 4 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions</title>
    <author>
      <name>Juan Figuera</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04193v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04193v1" />
    <published>2026-06-02T20:21:58Z</published>
    <updated>2026-06-02T20:21:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Juan Figuera&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent&amp;#x27;s call signs a receipt of what it observed using its own key, encrypts the receipt to the agent&amp;#x27;s owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages. Reference implementation at https://github.com/juanfiguera/sello&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments</title>
    <author>
      <name>Michael J. Bommarito</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04171v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04171v1" />
    <published>2026-06-02T19:35:44Z</published>
    <updated>2026-06-02T19:35:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Michael J. Bommarito&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;File-type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned systems such as Google&amp;#x27;s Magika assume whole-file access at a known offset, so they break on the inputs many of these tasks actually produce, like a single packet payload, a header-less carved fragment, a random disk block, or a chunked upload. We introduce MimeLens, a family of small BERT-style encoders pretrained on binary content from windows sampled at a uniformly random offset within each file, with no privileged head-of-file position, in standard- and short-context variants. A byte chunk goes in from anywhere in a file, no header needed and no fixed size; out comes one of libmagic&amp;#x27;s 125 MIME labels. On the clean head of complete files, MimeLens beats Magika v1.1 by +10.7 pp top-1 on libmagic-labeled data, and it keeps classifying where Magika cannot: from a single mid-stream UDP packet, and more than twice as accurately as libmagic and Magika on random mid-file disk blocks. The cost is latency: MimeLens runs roughly one to two orders of magnitude slower per sample on CPU than Magika, though it matches on consumer GPUs or in batch. All trained checkpoints are released on Hugging Face (mjbommar/mimelens-001-*).&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;18 pages, 2 figures, 15 tables. Models released on Hugging Face (https://huggingface.co/mjbommar); reference training code at https://github.com/mjbommar/mimelens-training&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning</title>
    <author>
      <name>Dimitris Michailidis</name>
    </author>
    <author>
      <name>Sennay Ghebreab</name>
    </author>
    <author>
      <name>Fernando P. Santos</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04167v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04167v1" />
    <published>2026-06-02T19:29:35Z</published>
    <updated>2026-06-02T19:29:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dimitris Michailidis et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel demand. Traditional methods rely on exact and heuristic approaches that require expert-defined constraints to reduce the search space. Recently, deep reinforcement learning (Deep RL) has emerged due to its effectiveness in complex sequential decision-making processes; it remains, however, computationally expensive and environmentally costly, and requires additional engineering to interpret. We show that MNEP problems are small enough to not require Deep RL methods. Reformulating the MNEP as a Non-Markovian Rewards Decision Process (NMRDP), we use tabular RL to achieve similar performance with significantly fewer training episodes, while also offering greater interpretability. Additionally, we incorporate social equity criteria into the reward functions, focusing on efficiency and fairness, highlighting the versatility of our method. Evaluated in real-world settings (Xi&amp;#x27;an and Amsterdam), our method reduces total episodes by a factor of 18 and total carbon emissions by a factor of 12 on average, while remaining competitive with Deep RL. This approach offers a replicable, modular, interpretable, and resource-efficient solution with potential applications to other combinatorial optimization problems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory</title>
    <author>
      <name>Ruida Wang</name>
    </author>
    <author>
      <name>Jerry Huang</name>
    </author>
    <author>
      <name>Pengcheng Wang</name>
    </author>
    <author>
      <name>Xuanqing Liu</name>
    </author>
    <author>
      <name>Luyang Kong</name>
    </author>
    <author>
      <name>Tong Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06523v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06523v1" />
    <published>2026-06-02T18:46:50Z</published>
    <updated>2026-06-02T18:46:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ruida Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs&amp;#x27; agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. Lean4Agent launches FormalAgentLib, an extensible Lean4 library for formally modeling and verifying agent workflows&amp;#x27; semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results in FormalAgentLib to revise workflows to enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Quantifying Faithful Confidence Expression in Large Reasoning Models</title>
    <author>
      <name>Areeb Gani</name>
    </author>
    <author>
      <name>Asal Meskin</name>
    </author>
    <author>
      <name>Gabrielle Kaili-May Liu</name>
    </author>
    <author>
      <name>Arman Cohan</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03969v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03969v1" />
    <published>2026-06-02T17:53:45Z</published>
    <updated>2026-06-02T17:53:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Areeb Gani et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models&amp;#x27; intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Code: https://github.com/yale-nlp/faithful_lrm&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task</title>
    <author>
      <name>Quentin Fuxa</name>
    </author>
    <author>
      <name>Dominik Macháček</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03967v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03967v1" />
    <published>2026-06-02T17:52:18Z</published>
    <updated>2026-06-02T17:52:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Quentin Fuxa, Dominik Macháček&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy.
  To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically.
  On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to IWSLT 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning</title>
    <author>
      <name>Yu Xia</name>
    </author>
    <author>
      <name>Zhouhang Xie</name>
    </author>
    <author>
      <name>Xin Xu</name>
    </author>
    <author>
      <name>Byungkyu Kang</name>
    </author>
    <author>
      <name>Prarit Lamba</name>
    </author>
    <author>
      <name>Xiang Gao</name>
    </author>
    <author>
      <name>Julian McAuley</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03965v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03965v1" />
    <published>2026-06-02T17:51:30Z</published>
    <updated>2026-06-02T17:51:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yu Xia et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner&amp;#x27;s generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Efficient ASR Training with Conversations that Never Happened</title>
    <author>
      <name>Máté Gedeon</name>
    </author>
    <author>
      <name>Péter Mihajlik</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03957v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03957v1" />
    <published>2026-06-02T17:46:12Z</published>
    <updated>2026-06-02T17:46:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Máté Gedeon, Péter Mihajlik&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] q0: Primitives for Hyper-Epoch Pretraining</title>
    <author>
      <name>Bishwas Mandal</name>
    </author>
    <author>
      <name>Shmuel Berman</name>
    </author>
    <author>
      <name>Akshay Vegesna</name>
    </author>
    <author>
      <name>Samip Dahal</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03938v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03938v2" />
    <published>2026-06-02T17:27:48Z</published>
    <updated>2026-06-03T02:07:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bishwas Mandal et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline&amp;#x27;s ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9x data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;22 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection</title>
    <author>
      <name>Senjie Jin</name>
    </author>
    <author>
      <name>Peixin Wang</name>
    </author>
    <author>
      <name>Boyang Liu</name>
    </author>
    <author>
      <name>Xiaoran Fan</name>
    </author>
    <author>
      <name>Shuo Li</name>
    </author>
    <author>
      <name>Zhiheng Xi</name>
    </author>
    <author>
      <name>Jiazheng Zhang</name>
    </author>
    <author>
      <name>Yuhao Zhou</name>
    </author>
    <author>
      <name>Tao Gui</name>
    </author>
    <author>
      <name>Qi Zhang</name>
    </author>
    <author>
      <name>Xuanjing Huang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03937v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03937v2" />
    <published>2026-06-02T17:26:55Z</published>
    <updated>2026-06-03T10:57:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Senjie Jin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO&amp;#x27;s leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning</title>
    <author>
      <name>Eric Cho</name>
    </author>
    <author>
      <name>Shawn Huang</name>
    </author>
    <author>
      <name>Alice Lu</name>
    </author>
    <author>
      <name>Andy Lyu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03918v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03918v1" />
    <published>2026-06-02T17:11:56Z</published>
    <updated>2026-06-02T17:11:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Eric Cho et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Dataset and evaluation harness available at github.com/Trata-Inc/trata-hedge-bench&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation</title>
    <author>
      <name>Jiabei Cheng</name>
    </author>
    <author>
      <name>Jingbo Zhou</name>
    </author>
    <author>
      <name>Jun Xia</name>
    </author>
    <author>
      <name>Changkai Li</name>
    </author>
    <author>
      <name>Zhen Lei</name>
    </author>
    <author>
      <name>Chang Yu</name>
    </author>
    <author>
      <name>Stan Z. Li</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03906v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03906v1" />
    <published>2026-06-02T17:00:49Z</published>
    <updated>2026-06-02T17:00:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiabei Cheng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments</title>
    <author>
      <name>Ibrahim Abdelaziz</name>
    </author>
    <author>
      <name>Asim Munawar</name>
    </author>
    <author>
      <name>Kinjal Basu</name>
    </author>
    <author>
      <name>Maxwell Crouse</name>
    </author>
    <author>
      <name>Chulaka Gunasekara</name>
    </author>
    <author>
      <name>Suneet Katrekar</name>
    </author>
    <author>
      <name>Pavan Kapanipathi</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03892v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03892v2" />
    <published>2026-06-02T16:52:31Z</published>
    <updated>2026-06-03T17:27:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ibrahim Abdelaziz et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server&amp;#x27;s actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) a state-machine data synthesis pipeline that generates multi-turn tool-call trajectories grounded in live-sampled server state, so generated queries reference entities that actually exist; and (3) a multi-component programmatic reward with an adaptive efficiency penalty that counters the verbosity incentive of recall-based rewards. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO on the resulting ~13K training examples. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that this framework yields consistent gains on multi-step tool orchestration across two model families.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Reasoning Structure of Large Language Models</title>
    <author>
      <name>Frédéric Berdoz</name>
    </author>
    <author>
      <name>Luca A. Lanzendörfer</name>
    </author>
    <author>
      <name>Fabian Farestam</name>
    </author>
    <author>
      <name>Roger Wattenhofer</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03883v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03883v1" />
    <published>2026-06-02T16:49:19Z</published>
    <updated>2026-06-02T16:49:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Frédéric Berdoz et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model&amp;#x27;s logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs</title>
    <author>
      <name>Cuong Vuong Tuan</name>
    </author>
    <author>
      <name>Trang Mai Xuan</name>
    </author>
    <author>
      <name>Tien-Cuong Nguyen</name>
    </author>
    <author>
      <name>Vu-Duc Ngo</name>
    </author>
    <author>
      <name>Thien Van Luong</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03867v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03867v1" />
    <published>2026-06-02T16:39:07Z</published>
    <updated>2026-06-02T16:39:07Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Cuong Vuong Tuan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Neural Computing and Applications&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models</title>
    <author>
      <name>Zetian Ouyang</name>
    </author>
    <author>
      <name>Linlin Wang</name>
    </author>
    <author>
      <name>Gerard de Melo</name>
    </author>
    <author>
      <name>Liang He</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03858v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03858v1" />
    <published>2026-06-02T16:32:53Z</published>
    <updated>2026-06-02T16:32:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zetian Ouyang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs&amp;#x27; performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization &amp;amp; Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs&amp;#x27; numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Large Language Models Hack Rewards, and Society</title>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Xinyi Mou</name>
    </author>
    <author>
      <name>Hanqi Yan</name>
    </author>
    <author>
      <name>Zhongyu Wei</name>
    </author>
    <author>
      <name>Yulan He</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04075v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04075v1" />
    <published>2026-06-02T16:29:48Z</published>
    <updated>2026-06-02T16:29:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wei Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models&amp;#x27; well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;14 pages, 9 figures, 7 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models</title>
    <author>
      <name>Qi Cao</name>
    </author>
    <author>
      <name>Takeshi Kojima</name>
    </author>
    <author>
      <name>Andrew Gambardella</name>
    </author>
    <author>
      <name>Helinyi Peng</name>
    </author>
    <author>
      <name>Yutaka Matsuo</name>
    </author>
    <author>
      <name>Yusuke Iwasawa</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03846v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03846v1" />
    <published>2026-06-02T16:25:54Z</published>
    <updated>2026-06-02T16:25:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qi Cao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model&amp;#x27;s ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Findings of ACL 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management</title>
    <author>
      <name>Zherui Yang</name>
    </author>
    <author>
      <name>Fan Liu</name>
    </author>
    <author>
      <name>Yansong Ning</name>
    </author>
    <author>
      <name>Hao Liu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03841v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03841v1" />
    <published>2026-06-02T16:20:58Z</published>
    <updated>2026-06-02T16:20:58Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zherui Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS&amp;#x27;s hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by KDD2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Consistency Training Can Entrench Misalignment</title>
    <author>
      <name>David Demitri Africa</name>
    </author>
    <author>
      <name>Arathi Mani</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03810v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03810v2" />
    <published>2026-06-02T15:54:24Z</published>
    <updated>2026-06-03T10:22:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;David Demitri Africa, Arathi Mani&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 model organisms: open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] LiveBand: Live Accompaniment Generation in the Audio Domain</title>
    <author>
      <name>Marco Pasini</name>
    </author>
    <author>
      <name>Javier Nistal</name>
    </author>
    <author>
      <name>Ben Hayes</name>
    </author>
    <author>
      <name>Mathias Rose Bjare</name>
    </author>
    <author>
      <name>Stefan Lattner</name>
    </author>
    <author>
      <name>George Fazekas</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03803v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03803v2" />
    <published>2026-06-02T15:50:13Z</published>
    <updated>2026-06-09T17:04:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Marco Pasini et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model&amp;#x27;s training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.DC] E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments</title>
    <author>
      <name>Truong-Thanh Le</name>
    </author>
    <author>
      <name>Amir Taherkordi</name>
    </author>
    <author>
      <name>Hoang-Loc La</name>
    </author>
    <author>
      <name>Frank Eliassen</name>
    </author>
    <author>
      <name>Phuong Hoai Ha</name>
    </author>
    <author>
      <name>Peiyuan Guan</name>
    </author>
    <category term="cs.DC" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03770v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03770v1" />
    <published>2026-06-02T15:23:28Z</published>
    <updated>2026-06-02T15:23:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Truong-Thanh Le et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[econ.GN] Merit or networks? What decides where research is published</title>
    <author>
      <name>Ning Li</name>
    </author>
    <category term="econ.GN" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03763v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03763v1" />
    <published>2026-06-02T15:18:03Z</published>
    <updated>2026-06-02T15:18:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ning Li&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Does scientific publishing reward the quality of ideas or the advantage of connections? The question is universal to prestige-driven science, yet it has resisted decades of study because a paper&amp;#x27;s quality could not be gauged ahead of its publication fate without using that fate as the yardstick. We break this constraint by measuring a paper&amp;#x27;s idea quality directly from its text, before publication, using a discipline-trained LLM evaluator that scores the idea without seeing author names or outcomes. Using economics as a case study, we combine this text-legible idea-quality score with an execution-quality rubric, a connection index, an author-ability index, and an off-the-shelf language-model text score to estimate a five-input production function for journal placement across 6,208 economics working papers. The inputs are not rivals but a sequence along the ladder of prestige. Execution sets a meritocratic floor and is the largest input overall. Text-legible idea quality grades the rungs in between. Connections set a favoritism ceiling that bites mainly near the apex, the most selective journals. Connections work through two additive channels: connected authors write papers that score higher, and at equal scores their papers are still more likely to place better. Yet this advantage is bounded. Connections raise the odds of every rung without making the apex the typical outcome for ordinary ideas, and even the highest-scoring papers face real friction reaching the visible journal ladder. The result nests, rather than chooses between, the meritocracy and network accounts of how science is published.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models</title>
    <author>
      <name>Glenn Jocher</name>
    </author>
    <author>
      <name>Jing Qiu</name>
    </author>
    <author>
      <name>Mengyu Liu</name>
    </author>
    <author>
      <name>Shuai Lyu</name>
    </author>
    <author>
      <name>Fatih Cagatay Akyon</name>
    </author>
    <author>
      <name>Muhammet Esat Kalfaoglu</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03748v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03748v1" />
    <published>2026-06-02T15:01:13Z</published>
    <updated>2026-06-02T15:01:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Glenn Jocher et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;31 pages, 8 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts</title>
    <author>
      <name>Yiming Fu</name>
    </author>
    <author>
      <name>Peixuan Liu</name>
    </author>
    <author>
      <name>Zichen Wang</name>
    </author>
    <author>
      <name>Kun yuan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03743v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03743v1" />
    <published>2026-06-02T14:56:10Z</published>
    <updated>2026-06-02T14:56:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yiming Fu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework, &lt;strong&gt;Proof-Refactor&lt;/strong&gt;, that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components. On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;21 pages, 3 figures, 3 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DiBS: Diffusion-Informed Branch Selection</title>
    <author>
      <name>Bo Liu</name>
    </author>
    <author>
      <name>Yuan Xie</name>
    </author>
    <author>
      <name>Yuan Gao</name>
    </author>
    <author>
      <name>Xiaolong Luo</name>
    </author>
    <author>
      <name>Peng Ye</name>
    </author>
    <author>
      <name>Tao Chen</name>
    </author>
    <author>
      <name>Fujun Han</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06518v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06518v1" />
    <published>2026-06-02T14:19:51Z</published>
    <updated>2026-06-02T14:19:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bo Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku focuses mainly on two dominant approaches: traditional heuristic solvers and deep learning solvers. However, these suffer from complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion-model-guided approach, termed DiBS, for branch selection during search. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core of the method is to rank candidate values given the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof that reveals how and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. The code is available at https://github.com/shanxierdan/DiBS.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 6 figures, 3 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models</title>
    <author>
      <name>Mariana Vargas Vieyra</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03689v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03689v1" />
    <published>2026-06-02T14:11:23Z</published>
    <updated>2026-06-02T14:11:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mariana Vargas Vieyra&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. SA is widely used in several domains, including healthcare and churn prediction, but a central challenge in applying it is that the time of the event is often only partially observed, a situation known as &lt;em&gt;right-censoring&lt;/em&gt;.
  Tabular Foundation Models (TFMs) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right censoring. In this work, we present a training-free method for survival regression that leverages TFMs both to predict the time of the event and to iteratively impute right-censored data.
  Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data. Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models</title>
    <author>
      <name>Hongyu Guo</name>
    </author>
    <author>
      <name>Hao Li</name>
    </author>
    <author>
      <name>He Cao</name>
    </author>
    <author>
      <name>Gongbo Zhang</name>
    </author>
    <author>
      <name>Li Yuan</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03660v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03660v2" />
    <published>2026-06-02T13:47:19Z</published>
    <updated>2026-06-03T14:05:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hongyu Guo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;23 pages, 6 figures, 14 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition</title>
    <author>
      <name>Jinnuo Liu</name>
    </author>
    <author>
      <name>Yue Peng</name>
    </author>
    <author>
      <name>Jinhan Niu</name>
    </author>
    <author>
      <name>Hongyi Wen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03657v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03657v1" />
    <published>2026-06-02T13:46:04Z</published>
    <updated>2026-06-02T13:46:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinnuo Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;37 pages, 12 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability</title>
    <author>
      <name>Krishnapriya Vishnubhotla</name>
    </author>
    <author>
      <name>Hillary Dawkins</name>
    </author>
    <author>
      <name>Isar Nejadgholi</name>
    </author>
    <author>
      <name>Svetlana Kiritchenko</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03648v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03648v1" />
    <published>2026-06-02T13:39:17Z</published>
    <updated>2026-06-02T13:39:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Krishnapriya Vishnubhotla et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Adapting foundation large language models to a user&amp;#x27;s task or preferred style through fine-tuning can compromise the model&amp;#x27;s safety. Previous work examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages plus appendices&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs</title>
    <author>
      <name>Vincent Limbach</name>
    </author>
    <author>
      <name>Jonas Dornbusch</name>
    </author>
    <author>
      <name>David Lüdke</name>
    </author>
    <author>
      <name>Stephan Günnemann</name>
    </author>
    <author>
      <name>Leo Schwinn</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03647v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03647v1" />
    <published>2026-06-02T13:39:15Z</published>
    <updated>2026-06-02T13:39:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vincent Limbach et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency</title>
    <author>
      <name>Qi Han Wong</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03641v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03641v1" />
    <published>2026-06-02T13:35:12Z</published>
    <updated>2026-06-02T13:35:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Qi Han Wong&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient&amp;#x27;s stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p &amp;lt; 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;7 pages, 3 tables. Multi-model replication across Gemini, Claude, and GPT. Code and data: https://github.com/wongqihan/ai-behavioral-experiments&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] VidMsg: A Benchmark for Implicit Message Inference in Short Videos</title>
    <author>
      <name>Issar Tzachor</name>
    </author>
    <author>
      <name>Michael Green</name>
    </author>
    <author>
      <name>Rami Ben-Ari</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03635v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03635v1" />
    <published>2026-06-02T13:31:57Z</published>
    <updated>2026-06-02T13:31:57Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Issar Tzachor et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Project page: https://iyttor.github.io/VidMsg&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Building Reliable Long-Form Generation via Hallucination Rejection Sampling</title>
    <author>
      <name>Lin Li</name>
    </author>
    <author>
      <name>Georgia Channing</name>
    </author>
    <author>
      <name>Suhaas M Bhat</name>
    </author>
    <author>
      <name>Gabriel Davis Jones</name>
    </author>
    <author>
      <name>Yarin Gal</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03628v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03628v1" />
    <published>2026-06-02T13:26:17Z</published>
    <updated>2026-06-02T13:26:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Lin Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models</title>
    <author>
      <name>Zhengyi Zhao</name>
    </author>
    <author>
      <name>Shubo Zhang</name>
    </author>
    <author>
      <name>Huimin Wang</name>
    </author>
    <author>
      <name>Zezhong Wang</name>
    </author>
    <author>
      <name>Yutian Zhao</name>
    </author>
    <author>
      <name>Yefeng Zheng</name>
    </author>
    <author>
      <name>Binyang Li</name>
    </author>
    <author>
      <name>Yulan He</name>
    </author>
    <author>
      <name>Kam-Fai Wong</name>
    </author>
    <author>
      <name>Xian Wu</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03624v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03624v1" />
    <published>2026-06-02T13:23:28Z</published>
    <updated>2026-06-02T13:23:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhengyi Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers &amp;quot;bridge constraints&amp;quot; that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model&amp;#x27;s own knowledge to create better pathways for generation. Experiments across three popular instruction following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining reasoning abilities of large reasoning models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;a pre-MIT Press publication version&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing</title>
    <author>
      <name>Mehmet Utku Colak</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03618v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03618v1" />
    <published>2026-06-02T13:17:45Z</published>
    <updated>2026-06-02T13:17:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mehmet Utku Colak&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur.
  We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original.
  We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Submitted to EMNLP 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification</title>
    <author>
      <name>Jiahui Li</name>
    </author>
    <author>
      <name>Jianfeng Shan</name>
    </author>
    <author>
      <name>Wenpei Chen</name>
    </author>
    <author>
      <name>Shunyu Wu</name>
    </author>
    <author>
      <name>Jian Lou</name>
    </author>
    <author>
      <name>Wenjie Feng</name>
    </author>
    <author>
      <name>Dan Li</name>
    </author>
    <author>
      <name>See-Kiong Ng</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03608v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03608v1" />
    <published>2026-06-02T13:11:09Z</published>
    <updated>2026-06-02T13:11:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiahui Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Although existing studies focus on Pass@1 performance, optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery</title>
    <author>
      <name>Bo Peng</name>
    </author>
    <author>
      <name>Kaiwen Wu</name>
    </author>
    <author>
      <name>Sirui Chen</name>
    </author>
    <author>
      <name>Zhiheng Wang</name>
    </author>
    <author>
      <name>Yu Qiao</name>
    </author>
    <author>
      <name>Chaochao Lu</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03602v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03602v1" />
    <published>2026-06-02T13:07:43Z</published>
    <updated>2026-06-02T13:07:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Bo Peng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical indistinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which then governs a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Learned Non-Maximum Suppression for 3D Object Detection</title>
    <author>
      <name>Timo Osterburg</name>
    </author>
    <author>
      <name>Stefan Schütte</name>
    </author>
    <author>
      <name>Torsten Bertram</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03568v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03568v1" />
    <published>2026-06-02T12:34:14Z</published>
    <updated>2026-06-02T12:34:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Timo Osterburg et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird&amp;#x27;s-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language</title>
    <author>
      <name>Muhammad Ali</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03504v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03504v1" />
    <published>2026-06-02T11:23:49Z</published>
    <updated>2026-06-02T11:23:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Muhammad Ali&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;5 pages, 4 figures, 4 tables. Code and data available at https://github.com/mohdali-dev/BaltiVoice-ASR&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning</title>
    <author>
      <name>Ziyan Liu</name>
    </author>
    <author>
      <name>Xueda Shen</name>
    </author>
    <author>
      <name>Yuzhe Gu</name>
    </author>
    <author>
      <name>Songyang Gao</name>
    </author>
    <author>
      <name>Kuikun Liu</name>
    </author>
    <author>
      <name>Guangran Cheng</name>
    </author>
    <author>
      <name>Chengqi Lyu</name>
    </author>
    <author>
      <name>Dahua Lin</name>
    </author>
    <author>
      <name>Wenwei Zhang</name>
    </author>
    <author>
      <name>Kai Chen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03503v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03503v1" />
    <published>2026-06-02T11:21:27Z</published>
    <updated>2026-06-02T11:21:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ziyan Liu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SE] The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation</title>
    <author>
      <name>Akanksha Narula</name>
    </author>
    <author>
      <name>Mofasshara Binte Rafique</name>
    </author>
    <author>
      <name>Laurent Bindschaedler</name>
    </author>
    <category term="cs.SE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04057v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04057v1" />
    <published>2026-06-02T11:17:28Z</published>
    <updated>2026-06-02T11:17:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Akanksha Narula et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidental prompt cues, meaning contextual words or metadata outside the task specification, can steer which algorithm the model selects, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality. Here, output policy means algorithm choice under fixed correctness. We define algorithm steering as cue-induced shifts in algorithm-family distributions and run 46,535 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization semantic-vs-surface ablation that preserves meaning while changing typography and punctuation), and 15 model configurations. We find large, systematic shifts in algorithm-family distributions (up to 100 pp), largely consistent with cue semantics, including in applied tasks such as rate limiting. Direct algorithm naming is the most reliable mitigation we tested. Accidental context therefore creates an &amp;quot;invisible lottery&amp;quot; over performance, security, and maintainability.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs</title>
    <author>
      <name>Wenqi Chen</name>
    </author>
    <author>
      <name>Ziyan Zhang</name>
    </author>
    <author>
      <name>Bing Wang</name>
    </author>
    <author>
      <name>Lin Liu</name>
    </author>
    <author>
      <name>Hengheng Zhang</name>
    </author>
    <author>
      <name>Zhengsu Chen</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03489v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03489v1" />
    <published>2026-06-02T11:07:20Z</published>
    <updated>2026-06-02T11:07:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenqi Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure &amp;quot;golden paths&amp;quot; and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B&amp;#x27;s pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;18 pages, 3 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation</title>
    <author>
      <name>Ekaterina Alimaskina</name>
    </author>
    <author>
      <name>Gleb Molodtsov</name>
    </author>
    <author>
      <name>Aleksandr Beznosikov</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03483v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03483v1" />
    <published>2026-06-02T11:00:49Z</published>
    <updated>2026-06-02T11:00:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ekaterina Alimaskina et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across mHC variants. Our code is publicly available.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems</title>
    <author>
      <name>Taiyu Zhu</name>
    </author>
    <author>
      <name>Yifan Wu</name>
    </author>
    <author>
      <name>Weilin Jin</name>
    </author>
    <author>
      <name>Ying Li</name>
    </author>
    <author>
      <name>Gang Huang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03467v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03467v1" />
    <published>2026-06-02T10:45:49Z</published>
    <updated>2026-06-02T10:45:49Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Taiyu Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&amp;amp;When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 5 figures. Accepted by KDD 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression</title>
    <author>
      <name>Artur Zagitov</name>
    </author>
    <author>
      <name>Alexander Miasnikov</name>
    </author>
    <author>
      <name>Maxim Krutikov</name>
    </author>
    <author>
      <name>Vladimir Aletov</name>
    </author>
    <author>
      <name>Gleb Molodtsov</name>
    </author>
    <author>
      <name>Nail Bashirov</name>
    </author>
    <author>
      <name>Artem Tsedenov</name>
    </author>
    <author>
      <name>Aleksandr Beznosikov</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03465v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03465v1" />
    <published>2026-06-02T10:45:21Z</published>
    <updated>2026-06-02T10:45:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Artur Zagitov et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large-scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade-offs grounded in both empirical and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large-scale deployment. The code is available at https://github.com/brain-lab-research/TT-LLM.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] What Makes Interaction Trajectories Effective for Training Terminal Agents?</title>
    <author>
      <name>Sidi Yang</name>
    </author>
    <author>
      <name>Chaofan Tao</name>
    </author>
    <author>
      <name>Jierun Chen</name>
    </author>
    <author>
      <name>Tiezheng Yu</name>
    </author>
    <author>
      <name>Ruoyu Wang</name>
    </author>
    <author>
      <name>Yuxin Jiang</name>
    </author>
    <author>
      <name>Yiming Du</name>
    </author>
    <author>
      <name>Wendong Xu</name>
    </author>
    <author>
      <name>Jing Xiong</name>
    </author>
    <author>
      <name>Taiqiang Wu</name>
    </author>
    <author>
      <name>Lifeng Shang</name>
    </author>
    <author>
      <name>Xiaohui Li</name>
    </author>
    <author>
      <name>Ngai Wong</name>
    </author>
    <author>
      <name>Haoli Bai</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03461v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03461v1" />
    <published>2026-06-02T10:37:47Z</published>
    <updated>2026-06-02T10:37:47Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sidi Yang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this &amp;quot;pedagogical paradox&amp;quot; to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward &amp;quot;Harness Engineering&amp;quot;, where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization</title>
    <author>
      <name>Ying Tang</name>
    </author>
    <author>
      <name>Dong Li</name>
    </author>
    <author>
      <name>Youjia Zhang</name>
    </author>
    <author>
      <name>Zikai Song</name>
    </author>
    <author>
      <name>Junqing Yu</name>
    </author>
    <author>
      <name>Wei Yang</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03444v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03444v1" />
    <published>2026-06-02T10:28:32Z</published>
    <updated>2026-06-02T10:28:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ying Tang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] A Hybrid Approach For Malware Classification Using Secondary Features Fusion</title>
    <author>
      <name>Raja Khurram Shahzad</name>
    </author>
    <author>
      <name>Muhammad Mustaqeem</name>
    </author>
    <author>
      <name>Haroon Elahi</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03432v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03432v1" />
    <published>2026-06-02T10:19:35Z</published>
    <updated>2026-06-02T10:19:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Raja Khurram Shahzad et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The number of malware samples (whether variants or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and classification of the detected malware into respective malware families. The proposed method uses feature fusion after extracting relevant malware features such as API calls and fixed and variable length n-grams with a customized feature selection method. Moreover, for the predictive model, a voting based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach with an AUC of 0.989, accuracy of 99.72%, and a log loss of 0.01.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[math.OC] Optimizing Explicit Unit-Distance Lower-Bound Certificates</title>
    <author>
      <name>Michael T. M. Emmerich</name>
    </author>
    <category term="math.OC" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03419v5</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03419v5" />
    <published>2026-06-02T10:05:23Z</published>
    <updated>2026-06-09T13:18:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Michael T. M. Emmerich&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The 2026 disproof of Erdős&amp;#x27;s unit-distance conjecture and Sawin&amp;#x27;s quantitative refinement show that the maximum number $u(n)$ of unit distances among $n$ planar points can exceed $n^{1+\varepsilon}$ for a fixed positive $\varepsilon$. Sawin&amp;#x27;s explicit bound gives more than $n^{1.014}$ unit distances for arbitrarily large $n$ and exposes integer parameters whose choice is not fully optimized. This report treats Sawin&amp;#x27;s parameter selection as a nonlinear integer optimization problem and develops an open-source Python optimization and verification pipeline for certificates involving prime sets $T$ and $S_Q$, integer multiplicities $k(p)$, and a rationally encoded real parameter $R$. After reproducing Sawin&amp;#x27;s certificate with $δ=0.014114\ldots$, the pipeline yields improved certificates with the same $T$. We develop a tailored integer evolution strategy achieving a certificate with $δ=0.015263\ldots$ and supporting the cautious statement $u(n)&amp;gt;n^{1.0152}$ for arbitrarily large $n$. For extended ramified prime ranges, the Emmerich–Cordella certificate obtained with the same framework reports $u(n)&amp;gt;n^{1.031}$ for $\#T=67$, illustrating the importance of enlarging $T$. Very recent MathOverflow discussions, brought to the author&amp;#x27;s attention as of version 4, report further improvements, including certificates above $δ&amp;gt;0.035$ and beyond $δ&amp;gt;0.036$. Some of these improvements may rely not only on larger prime ranges but also on modified constraint systems and additional degrees of freedom that deviate from Sawin&amp;#x27;s original formulation. Beyond this application, the work illustrates how randomized optimization heuristics can improve, verify, and refine explicit certificates for combinatorial geometry through nonlinear integer optimization.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 9 figures. Added a declaration on the use of AI. Added references to further contributions discussed on MathOverflow, including a reference to the independently developed verification pipeline and certificate package by Tseng (2026), published on Zenodo&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] When Model Merging Breaks Routing: Training-Free Calibration for MoE</title>
    <author>
      <name>Canbin Huang</name>
    </author>
    <author>
      <name>Tianyuan Shi</name>
    </author>
    <author>
      <name>Xiaojun Quan</name>
    </author>
    <author>
      <name>Jingang Wang</name>
    </author>
    <author>
      <name>Jianfei Zhang</name>
    </author>
    <author>
      <name>Qifan Wang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03391v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03391v1" />
    <published>2026-06-02T09:33:33Z</published>
    <updated>2026-06-02T09:33:33Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Canbin Huang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] A Goal-Set Characterization of Task Composition in the Boolean Task Algebra</title>
    <author>
      <name>Eduardo Terrés-Caballero</name>
    </author>
    <author>
      <name>Herke van Hoof</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04053v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04053v1" />
    <published>2026-06-02T09:33:06Z</published>
    <updated>2026-06-02T09:33:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Eduardo Terrés-Caballero, Herke van Hoof&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal-set-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function-approximation, and continuous-control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at https://github.com/EduardoTerres/bta_paper.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.RO] Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation</title>
    <author>
      <name>Jiahao Xu</name>
    </author>
    <author>
      <name>Peiyuan Wang</name>
    </author>
    <author>
      <name>Hanzhuo Zhang</name>
    </author>
    <author>
      <name>Zihao Yu</name>
    </author>
    <author>
      <name>Tianyu Fu</name>
    </author>
    <author>
      <name>Hao Chen</name>
    </author>
    <author>
      <name>Xuanhao Xiang</name>
    </author>
    <author>
      <name>Jianbo Yu</name>
    </author>
    <author>
      <name>Chenchen Fu</name>
    </author>
    <author>
      <name>Wanyuan Wang</name>
    </author>
    <category term="cs.RO" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03385v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03385v1" />
    <published>2026-06-02T09:29:03Z</published>
    <updated>2026-06-02T09:29:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiahao Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;32 pages, project page: https://sites.google.com/view/gtp-fa/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CR] AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses</title>
    <author>
      <name>Maxime Schwarzer</name>
    </author>
    <author>
      <name>Johannes F. Loevenich</name>
    </author>
    <author>
      <name>Gustavo Sánchez</name>
    </author>
    <author>
      <name>Laurin Holz</name>
    </author>
    <author>
      <name>Thies Möhlenhof</name>
    </author>
    <author>
      <name>Tobias Hürten</name>
    </author>
    <author>
      <name>Roberto Rigolin F. Lopes</name>
    </author>
    <author>
      <name>Veit Hagenmeyer</name>
    </author>
    <category term="cs.CR" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03381v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03381v1" />
    <published>2026-06-02T09:25:29Z</published>
    <updated>2026-06-02T09:25:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Maxime Schwarzer et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Ensuring the protection of Artificial Intelligence (AI) models deployed in military Command and Control (C2) systems and critical infrastructure is essential for maintaining information superiority. Model Extraction Attacks (MEAs) pose a significant threat, as they enable adversaries to replicate proprietary models, compromise protected information, and prepare offline adversarial attacks. However, current defense strategies predominantly rely on the Single Client Assumption (SCA), which is the implicit assumption that attacks originate from isolated identities. This work systematically demonstrates that the SCA is fundamentally invalid in the presence of coordinated threat actors, such as Advanced Persistent Threats (APTs). We introduce a modular, open-source framework called CerberusAI for reproducible model-stealing research, and use it to simulate distributed attack scenarios. Our empirical evaluation shows that well-established defense mechanisms, such as Protecting Against Deep Neural Network Model Stealing Attacks (PRADA), can be bypassed by basic round-robin query distribution strategies, resulting in a significant reduction in detection performance. Furthermore, we demonstrate that even global aggregation approaches can be rendered operationally useless through adaptive traffic mixing. These results highlight the need for a paradigm shift towards stateful, identity-independent defense architectures in the field of model extraction attacks. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026 and won the best paper award.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization</title>
    <author>
      <name>Ruipeng Zhang</name>
    </author>
    <author>
      <name>Zhihao Li</name>
    </author>
    <author>
      <name>Haozhang Yuan</name>
    </author>
    <author>
      <name>C. L. Philip Chen</name>
    </author>
    <author>
      <name>Tong Zhang</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03376v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03376v2" />
    <published>2026-06-02T09:22:53Z</published>
    <updated>2026-06-03T03:21:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ruipeng Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing the perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection</title>
    <author>
      <name>Liulu He</name>
    </author>
    <author>
      <name>XuanAng Liu</name>
    </author>
    <author>
      <name>Juntao Liu</name>
    </author>
    <author>
      <name>Taolue Feng</name>
    </author>
    <author>
      <name>Ting Lu</name>
    </author>
    <author>
      <name>Chunsheng Gan</name>
    </author>
    <author>
      <name>Zhiyv Peng</name>
    </author>
    <author>
      <name>Yuan Du</name>
    </author>
    <author>
      <name>Huanrui Yang</name>
    </author>
    <author>
      <name>Yijiang Liu</name>
    </author>
    <author>
      <name>Li Du</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04050v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04050v1" />
    <published>2026-06-02T08:52:04Z</published>
    <updated>2026-06-02T08:52:04Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Liulu He et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2-bit or 3-bit), resulting in a &amp;quot;deployment gap&amp;quot; where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a &amp;quot;lift-then-project&amp;quot; mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional &amp;quot;lifted&amp;quot; space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, since the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant&amp;#x27;s decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining a hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026 Spotlight&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences</title>
    <author>
      <name>Gurvan Richardeau</name>
    </author>
    <author>
      <name>Gohar Dashyan</name>
    </author>
    <author>
      <name>Erwan Le Merrer</name>
    </author>
    <author>
      <name>Gilles Tredan</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03330v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03330v1" />
    <published>2026-06-02T08:39:50Z</published>
    <updated>2026-06-02T08:39:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Gurvan Richardeau et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Literature reveals that a Large Language Model&amp;#x27;s (LLM) behavior is conditioned not only by its original weights but also by its instance-level parameters, such as the instructional prompt, sampling configuration, or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method, FLIPS, exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance-level fingerprinting is both necessary for regulation and practically feasible. Code is available at https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;20 pages, 20 figures, 3 tables. 43rd International Conference on Machine Learning (ICML 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain</title>
    <author>
      <name>Tiancheng Han</name>
    </author>
    <author>
      <name>Yong Li</name>
    </author>
    <author>
      <name>Wuzhou Yu</name>
    </author>
    <author>
      <name>Qiaosheng Zhang</name>
    </author>
    <author>
      <name>Wenqi Shao</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03329v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03329v1" />
    <published>2026-06-02T08:39:03Z</published>
    <updated>2026-06-02T08:39:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tiancheng Han et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model&amp;#x27;s per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks</title>
    <author>
      <name>Po-Nien Kung</name>
    </author>
    <author>
      <name>Linfeng Song</name>
    </author>
    <author>
      <name>Dawsen Hwang</name>
    </author>
    <author>
      <name>Jinsung Yoon</name>
    </author>
    <author>
      <name>Chun-Liang Li</name>
    </author>
    <author>
      <name>Simone Severini</name>
    </author>
    <author>
      <name>Mirek Olšák</name>
    </author>
    <author>
      <name>Edward Lockhart</name>
    </author>
    <author>
      <name>Quoc V Le</name>
    </author>
    <author>
      <name>Burak Gokturk</name>
    </author>
    <author>
      <name>Thang Luong</name>
    </author>
    <author>
      <name>Tomas Pfister</name>
    </author>
    <author>
      <name>Nanyun Peng</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03303v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03303v2" />
    <published>2026-06-02T08:16:42Z</published>
    <updated>2026-06-03T06:16:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Po-Nien Kung et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP&amp;#x27;s research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth&amp;#x27;s Hamiltonian decomposition of even-order Cayley graphs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective</title>
    <author>
      <name>Yancheng Chen</name>
    </author>
    <author>
      <name>Dun Ma</name>
    </author>
    <author>
      <name>Shuai Zhang</name>
    </author>
    <author>
      <name>Yang Liu</name>
    </author>
    <author>
      <name>Xixun Lin</name>
    </author>
    <author>
      <name>Xiangyu Zhao</name>
    </author>
    <author>
      <name>Wenguo Yang</name>
    </author>
    <author>
      <name>Wei Chen</name>
    </author>
    <author>
      <name>Chuan Zhou</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03290v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03290v1" />
    <published>2026-06-02T07:52:54Z</published>
    <updated>2026-06-02T07:52:54Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yancheng Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduce Message Tuning for GFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation</title>
    <author>
      <name>Boyuan Xiao</name>
    </author>
    <author>
      <name>Bohong Chen</name>
    </author>
    <author>
      <name>Yumeng Li</name>
    </author>
    <author>
      <name>Ji Feng</name>
    </author>
    <author>
      <name>Yao-Xiang Ding</name>
    </author>
    <author>
      <name>Kun Zhou</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04046v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04046v1" />
    <published>2026-06-02T07:50:56Z</published>
    <updated>2026-06-02T07:50:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Boyuan Xiao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs &amp;amp; VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models&amp;#x27; inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs</title>
    <author>
      <name>Sungwon Kim</name>
    </author>
    <author>
      <name>Juho Song</name>
    </author>
    <author>
      <name>Seungmin Shin</name>
    </author>
    <author>
      <name>Guimok Cho</name>
    </author>
    <author>
      <name>Sangkook Kim</name>
    </author>
    <author>
      <name>Chanyoung Park</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03260v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03260v1" />
    <published>2026-06-02T07:23:06Z</published>
    <updated>2026-06-02T07:23:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sungwon Kim et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)-transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung-won-kim/EqGINO&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection</title>
    <author>
      <name>Gautam Gare</name>
    </author>
    <author>
      <name>John Galeotti</name>
    </author>
    <author>
      <name>Michael Mozer</name>
    </author>
    <author>
      <name>Deva Ramanan</name>
    </author>
    <author>
      <name>Nan Rosemary Ke</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03251v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03251v1" />
    <published>2026-06-02T07:12:30Z</published>
    <updated>2026-06-02T07:12:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Gautam Gare et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MA] OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery</title>
    <author>
      <name>Jinliang Xu</name>
    </author>
    <category term="cs.MA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03163v3</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03163v3" />
    <published>2026-06-02T05:18:14Z</published>
    <updated>2026-06-05T05:42:43Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinliang Xu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection and discoverable AI resource products. It specifies the role architecture, did:oan identity objects, registration workflow, governance-backed Root lifecycle enforcement, Root-verified package model, authorization-aware Discovery, Root-issued infrastructure authorization VCs, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, domain-specific Agent protocols, Skills, MCP Servers, and Tool/API resources. OAN does not define the entire business conversation among Agents or the native protocol of every resource; it defines how resource identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MA] OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection</title>
    <author>
      <name>Jinliang Xu</name>
    </author>
    <category term="cs.MA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03161v3</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03161v3" />
    <published>2026-06-02T05:14:34Z</published>
    <updated>2026-06-05T05:40:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinliang Xu&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides did:oan-based resource identity, governance-backed admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, Root-issued infrastructure authorization VCs, and signed trusted invocation. The architectural center of OAN is the combination of federated governance, resource identity, and trusted Discovery, rather than a single directory or naming service. This white paper explains the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, on-chain governance layer, prototype status, performance profile, and roadmap of OAN.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Channel-Oriented Design for EEG-to-Music Reconstruction</title>
    <author>
      <name>Jiaxin Qing</name>
    </author>
    <author>
      <name>Junwei Lu</name>
    </author>
    <author>
      <name>Lexin Li</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04040v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04040v1" />
    <published>2026-06-02T04:13:37Z</published>
    <updated>2026-06-02T04:13:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaxin Qing et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration</title>
    <author>
      <name>Wenkai Wang</name>
    </author>
    <author>
      <name>Tao Xiong</name>
    </author>
    <author>
      <name>Jingchen Ni</name>
    </author>
    <author>
      <name>Yunpeng Bao</name>
    </author>
    <author>
      <name>Xiyun Li</name>
    </author>
    <author>
      <name>Tianqi Liu</name>
    </author>
    <author>
      <name>Hongcan Guo</name>
    </author>
    <author>
      <name>Zilong Huang</name>
    </author>
    <author>
      <name>Shengyu Zhang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03103v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03103v1" />
    <published>2026-06-02T03:42:34Z</published>
    <updated>2026-06-02T03:42:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenkai Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.NE] Beyond Static Priors: Dynamic Neural Guidance for Large-Scale Ant Colony Optimization</title>
    <author>
      <name>Dat Thanh Tran</name>
    </author>
    <author>
      <name>Van Khu Vu</name>
    </author>
    <author>
      <name>Yining Ma</name>
    </author>
    <category term="cs.NE" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04039v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04039v1" />
    <published>2026-06-02T03:32:36Z</published>
    <updated>2026-06-02T03:32:36Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Dat Thanh Tran et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neural-guided Ant Colony Optimization (ACO) suffers from a fundamental training-inference misalignment: policies are typically trained to generate static priors (e.g., heatmaps), yet deployed to guide iterative, long-horizon search processes. In this paper, we present DyNACO, a novel framework that achieves dynamic neural guidance by periodically observing the pheromone distribution and the incumbent solution. To make DyNACO tractable at scale, we pair the policy with a perturbation-based ACO backend and a scope-restricted refinement mechanism that jointly ensure efficacy and stable credit assignment. On TSP, DyNACO scales to 100,000-node instances and outperforms neural baselines while often reducing total runtime compared to the unguided solver. We extend DyNACO to CVRP via a capacity-aware backend, consistently improving the unguided baseline with less than 1% neural overhead. We further provide in-depth analysis validating the model&amp;#x27;s generalization capabilities and elucidating why dynamic guidance outperforms static priors. Our work underscores the necessity of aligning neural training with iterative search dynamics in learning-guided optimization. The code is available at https://github.com/shoraaa/DyNACO.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at KDD 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs</title>
    <author>
      <name>Xu Wan</name>
    </author>
    <author>
      <name>Speed Zhu</name>
    </author>
    <author>
      <name>Jianwei Cai</name>
    </author>
    <author>
      <name>Guang Chen</name>
    </author>
    <author>
      <name>XiMing Huang</name>
    </author>
    <author>
      <name>Wiggin Zhou</name>
    </author>
    <author>
      <name>Mingyang Sun</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03092v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03092v2" />
    <published>2026-06-02T03:26:55Z</published>
    <updated>2026-06-08T17:22:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xu Wan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models&amp;#x27; performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds.
  Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Constitutional On-Policy Safe Distillation</title>
    <author>
      <name>Ming Wen</name>
    </author>
    <author>
      <name>Yuxuan Liu</name>
    </author>
    <author>
      <name>Kun Yang</name>
    </author>
    <author>
      <name>Yunhao Feng</name>
    </author>
    <author>
      <name>Zhuoer Xu</name>
    </author>
    <author>
      <name>Yuhao Sun</name>
    </author>
    <author>
      <name>Shiwen Cui</name>
    </author>
    <author>
      <name>Xiang Zheng</name>
    </author>
    <author>
      <name>Xingjun Ma</name>
    </author>
    <author>
      <name>Yu-Gang Jiang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03089v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03089v1" />
    <published>2026-06-02T03:17:56Z</published>
    <updated>2026-06-02T03:17:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ming Wen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees</title>
    <author>
      <name>Haoran Tan</name>
    </author>
    <author>
      <name>Zeyu Zhang</name>
    </author>
    <author>
      <name>Zhicheng Cao</name>
    </author>
    <author>
      <name>Rui Li</name>
    </author>
    <author>
      <name>Xu Chen</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03083v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03083v1" />
    <published>2026-06-02T03:13:50Z</published>
    <updated>2026-06-02T03:13:50Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haoran Tan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding</title>
    <author>
      <name>Mingkuan Zhao</name>
    </author>
    <author>
      <name>Xiayu Sun</name>
    </author>
    <author>
      <name>Wentao Hu</name>
    </author>
    <author>
      <name>Suquan Chen</name>
    </author>
    <author>
      <name>Jiaxuan Li</name>
    </author>
    <author>
      <name>Xiaoyan Zhu</name>
    </author>
    <author>
      <name>Xin Lai</name>
    </author>
    <author>
      <name>Jiayin Wang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03080v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03080v1" />
    <published>2026-06-02T03:11:39Z</published>
    <updated>2026-06-02T03:11:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mingkuan Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline&amp;#x27;s 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements</title>
    <author>
      <name>Aqsa Naseer</name>
    </author>
    <author>
      <name>Maryam Bibi</name>
    </author>
    <author>
      <name>Syeda Samiya Urooj</name>
    </author>
    <author>
      <name>Muhammad Khurram Shahzad</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03069v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03069v1" />
    <published>2026-06-02T02:59:35Z</published>
    <updated>2026-06-02T02:59:35Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Aqsa Naseer et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 6 figures; code available at https://github.com/213269/WT-PSE-code-main&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs</title>
    <author>
      <name>Jaejun Lee</name>
    </author>
    <author>
      <name>Joyce Jiyoung Whang</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03068v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03068v1" />
    <published>2026-06-02T02:57:13Z</published>
    <updated>2026-06-02T02:57:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jaejun Lee, Joyce Jiyoung Whang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing VN-based methods have limitations, such as constraining all nodes to connect to the same number of VNs, fixing the connections before applying MPNNs, and connecting a node to a VN independently of the other nodes that connect to the same VN. We propose MAVN, an end-to-end differentiable MPNN framework that allows non-constrained connections between nodes and VNs and dynamically introduces VNs on demand in response to evolving node representations across layers. Specifically, MAVN learns to adaptively determine when (at which layer) and where (to which nodes) to introduce and connect VNs based on the relative importance of connections. From a pool of candidate VNs, MAVN selects the necessary VNs in each layer, where each selected VN is connected to a nonempty subset of nodes, guided by a dual-perspective scoring mechanism that jointly captures the nodes&amp;#x27; preferences for VNs and the VNs&amp;#x27; preferences for nodes. We theoretically prove that for any node-VN connectivity pattern, there exists a set of MAVN&amp;#x27;s parameters that can simulate the pattern. Experiments on nine real-world datasets demonstrate that MAVN consistently improves the performance of backbone MPNNs, achieving up to 46.5% improvement over the backbones and outperforming the baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 6 figures, 10 tables, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection</title>
    <author>
      <name>Jinjie Shen</name>
    </author>
    <author>
      <name>Yaxiong Wang</name>
    </author>
    <author>
      <name>Yujiao Wu</name>
    </author>
    <author>
      <name>Lechao Cheng</name>
    </author>
    <author>
      <name>Tianrui Hui</name>
    </author>
    <author>
      <name>Nan Pu</name>
    </author>
    <author>
      <name>Zhihui Li</name>
    </author>
    <author>
      <name>Zhun Zhong</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03066v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03066v1" />
    <published>2026-06-02T02:53:48Z</published>
    <updated>2026-06-02T02:53:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinjie Shen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification</title>
    <author>
      <name>Thanh Luong Tuan</name>
    </author>
    <author>
      <name>Abhijit Sanyal</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04037v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04037v2" />
    <published>2026-06-02T02:37:11Z</published>
    <updated>2026-06-04T15:00:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thanh Luong Tuan, Abhijit Sanyal&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam&amp;#x27;s 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;26 pages, 3 figures. Companion to arXiv:2604.00555. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-6&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases</title>
    <author>
      <name>Phillip Jiang</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03040v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03040v1" />
    <published>2026-06-02T02:25:53Z</published>
    <updated>2026-06-02T02:25:53Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Phillip Jiang&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi-table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks -- a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form-filling assistant. We propose RelGT-AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF-IDF text encoder that automatically detects and encodes free-text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel-trial, rel-f1, rel-stack), RelGT-AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks via the TF-IDF encoder.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 6 figures. Code and model checkpoints available at https://github.com/jiangdmv/graph-transformer&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MA] Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks</title>
    <author>
      <name>Gaurav Naresh Mittal</name>
    </author>
    <category term="cs.MA" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03034v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03034v1" />
    <published>2026-06-02T02:17:30Z</published>
    <updated>2026-06-02T02:17:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Gaurav Naresh Mittal&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor.
  We argue these difficulties share one cause: the market for lemons. When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today&amp;#x27;s agent protocols.
  We make four contributions: (1) a failure taxonomy that names confident-wrong as a non-adversarial, correlated subclass of Byzantine faults that classical fault-tolerance mismodels; (2) a market-for-lemons model showing that faith-based protocols admit only a low-trust equilibrium; (3) the Trust Layer, a thin, protocol-agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability-composition bound for delegation chains with an end-to-end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates</title>
    <author>
      <name>Paiheng Xu</name>
    </author>
    <author>
      <name>Jing Liu</name>
    </author>
    <author>
      <name>Wei Ai</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03029v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03029v1" />
    <published>2026-06-02T02:07:46Z</published>
    <updated>2026-06-02T02:07:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Paiheng Xu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers&amp;#x27; domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature-covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization</title>
    <author>
      <name>Mingkuan Zhao</name>
    </author>
    <author>
      <name>Wentao Hu</name>
    </author>
    <author>
      <name>Tianchen Huang</name>
    </author>
    <author>
      <name>Yuheng Min</name>
    </author>
    <author>
      <name>Suquan Chen</name>
    </author>
    <author>
      <name>Yide Gao</name>
    </author>
    <author>
      <name>Yanbo Zhai</name>
    </author>
    <author>
      <name>Shuangyong Song</name>
    </author>
    <author>
      <name>Xuelong Li</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03022v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03022v1" />
    <published>2026-06-02T01:56:29Z</published>
    <updated>2026-06-02T01:56:29Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mingkuan Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Pretraining Language Models on Historical Text</title>
    <author>
      <name>Xiaoxi Luo</name>
    </author>
    <author>
      <name>Zachary Shinnick</name>
    </author>
    <author>
      <name>Niclas Griesshaber</name>
    </author>
    <author>
      <name>Yixuan Wang</name>
    </author>
    <author>
      <name>Junchi Yu</name>
    </author>
    <author>
      <name>Freda Shi</name>
    </author>
    <author>
      <name>Philip Torr</name>
    </author>
    <author>
      <name>Yao Lu</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.02991v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.02991v1" />
    <published>2026-06-02T00:59:06Z</published>
    <updated>2026-06-02T00:59:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xiaoxi Luo et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CV] Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion</title>
    <author>
      <name>Oskar Natan</name>
    </author>
    <author>
      <name>Jun Miura</name>
    </author>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.02979v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.02979v1" />
    <published>2026-06-02T00:35:42Z</published>
    <updated>2026-06-02T00:35:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Oskar Natan, Jun Miura&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird&amp;#x27;s eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is conducted to clarify its performance and effectiveness against a combination of some recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across three different CARLA simulation datasets and one real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.AI] WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition</title>
    <author>
      <name>Maheen Arshad</name>
    </author>
    <author>
      <name>Qindeel E Zahra</name>
    </author>
    <author>
      <name>Muhammad Khuram Shahzad</name>
    </author>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.02974v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.02974v1" />
    <published>2026-06-02T00:25:46Z</published>
    <updated>2026-06-02T00:25:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Maheen Arshad et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: &amp;quot;No Presence&amp;quot; (empty room), &amp;quot;Walking&amp;quot;, and &amp;quot;Walking + Arm-waving&amp;quot; using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;8 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <code-available-feed:processed>
    <code-available-feed:article url="https://arxiv.org/abs/2606.02967v1" updated="2026-06-02T00:09:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02974v1" updated="2026-06-02T00:25:46Z" repo_found_in="pdf" repo_urls="https://github.com/maheenarshad198-jpg/HAR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02979v1" updated="2026-06-02T00:35:42Z" repo_found_in="comment" repo_urls="https://ieeexplore.ieee.org/document/9712213" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02991v1" updated="2026-06-02T00:59:06Z" repo_found_in="pdf" repo_urls="https://huggingface.co/typewriter-ai" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02994v1" updated="2026-06-02T01:11:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02995v1" updated="2026-06-02T01:11:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03003v1" updated="2026-06-02T01:20:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03019v1" updated="2026-06-02T01:50:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03022v1" updated="2026-06-02T01:56:29Z" repo_found_in="abstract" repo_urls="https://github.com/Harry-Miral/DCO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03026v1" updated="2026-06-02T02:03:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03029v1" updated="2026-06-02T02:07:46Z" repo_found_in="pdf" repo_urls="https://github.com/rmovva/HypotheSAEs" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03034v1" updated="2026-06-02T02:17:30Z" repo_found_in="pdf" repo_urls="https://github.com/grvnmttl/lemons-market-sim" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03040v1" updated="2026-06-02T02:25:53Z" repo_found_in="comment" repo_urls="https://github.com/jiangdmv/graph-transformer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03057v1" updated="2026-06-02T02:45:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03061v1" updated="2026-06-02T02:50:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03066v1" updated="2026-06-02T02:53:48Z" repo_found_in="abstract" repo_urls="https://github.com/shen8424/CORE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03068v1" updated="2026-06-02T02:57:13Z" repo_found_in="pdf" repo_urls="https://github.com/bdi-lab/MAVN" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03069v1" updated="2026-06-02T02:59:35Z" repo_found_in="comment" repo_urls="https://github.com/213269/WT-PSE-code-main" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03073v1" updated="2026-06-02T03:02:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03077v1" updated="2026-06-02T03:09:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03080v1" updated="2026-06-02T03:11:39Z" repo_found_in="pdf" repo_urls="https://github.com/RegretPretraining/Code2026" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03083v1" updated="2026-06-02T03:13:50Z" repo_found_in="abstract" repo_urls="https://github.com/import-myself/DeltaMem" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03089v1" updated="2026-06-02T03:17:56Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct https://sii-fleeecermw.github.io/SafeC-OPSD/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03092v2" updated="2026-06-08T17:22:17Z" repo_found_in="pdf" repo_urls="https://github.com/waunx/CLEAR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03097v1" updated="2026-06-02T03:36:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03099v1" updated="2026-06-02T03:38:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03103v1" updated="2026-06-02T03:42:34Z" repo_found_in="abstract" repo_urls="https://github.com/mrwwk/DeskCraft" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03137v1" updated="2026-06-02T04:26:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03144v1" updated="2026-06-02T04:40:25Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03161v3" updated="2026-06-05T05:40:02Z" repo_found_in="pdf" repo_urls="https://github.com/OpenAgenet" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03163v3" updated="2026-06-05T05:42:43Z" repo_found_in="pdf" repo_urls="https://github.com/OpenAgenet" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03198v1" updated="2026-06-02T05:58:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03201v2" updated="2026-06-03T06:57:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03237v1" updated="2026-06-02T06:54:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03238v1" updated="2026-06-02T06:55:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03251v1" updated="2026-06-02T07:12:30Z" repo_found_in="pdf" repo_urls="https://github.com/slachapelle/dcdi" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03257v1" updated="2026-06-02T07:18:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03260v1" updated="2026-06-02T07:23:06Z" repo_found_in="abstract" repo_urls="https://github.com/sung-won-kim/EqGINO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03280v2" updated="2026-06-05T14:15:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03288v1" updated="2026-06-02T07:51:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03290v1" updated="2026-06-02T07:52:54Z" repo_found_in="pdf" repo_urls="https://github.com/CYCUCAS/MTG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03303v2" updated="2026-06-03T06:16:05Z" repo_found_in="pdf" repo_urls="https://github.com/AxiomMath/putnam2025 https://github.com/PatrickMassot/leanblueprint https://github.com/google-deepmind/superhuman/tree/main/leap https://github.com/project-numina/Numina-Putnam2025 https://imobench.github.io https://leanprover-community.github.io/blog/posts/FLT-announcement/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03310v1" updated="2026-06-02T08:24:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03328v2" updated="2026-06-08T13:05:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03329v1" updated="2026-06-02T08:39:03Z" repo_found_in="abstract" repo_urls="https://github.com/GenSouKa1/InfoMem" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03330v1" updated="2026-06-02T08:39:50Z" repo_found_in="abstract" repo_urls="https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03331v1" updated="2026-06-02T08:40:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03357v1" updated="2026-06-02T09:05:25Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03376v2" updated="2026-06-03T03:21:53Z" repo_found_in="pdf" repo_urls="https://github.com/ZrpChuang/ https://github.com/ZrpChuang/P2-DPO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03381v1" updated="2026-06-02T09:25:29Z" repo_found_in="pdf" repo_urls="https://github.com/lMaxTl/Cerberus-AI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03385v1" updated="2026-06-02T09:29:03Z" repo_found_in="comment" repo_urls="https://sites.google.com/view/gtp-fa/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03391v1" updated="2026-06-02T09:33:33Z" repo_found_in="abstract" repo_urls="https://github.com/huangcb01/HARC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03398v1" updated="2026-06-02T09:39:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03419v5" updated="2026-06-09T13:18:56Z" repo_found_in="pdf" repo_urls="https://github.com/emmerichmtm/UnitDistanceProblemOptimizationOfSawinsLowerBound" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03428v1" updated="2026-06-02T10:18:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03430v1" updated="2026-06-02T10:18:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03432v1" updated="2026-06-02T10:19:35Z" repo_found_in="pdf" repo_urls="https://github.com/RGF-team/rgf https://github.com/dmlc/xgboost/ https://github.com/xiaozhouwang/kaggle_Microsoft_Malware" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03444v1" updated="2026-06-02T10:28:32Z" repo_found_in="pdf" repo_urls="https://github.com/robotyingtang/PRISM-VFM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03453v1" updated="2026-06-02T10:32:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03459v1" updated="2026-06-02T10:36:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03461v1" updated="2026-06-02T10:37:47Z" repo_found_in="pdf" repo_urls="https://stephen0808.github.io/terminal-lego.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03463v1" updated="2026-06-02T10:41:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03465v1" updated="2026-06-02T10:45:21Z" repo_found_in="abstract" repo_urls="https://github.com/brain-lab-research/TT-LLM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03467v1" updated="2026-06-02T10:45:49Z" repo_found_in="abstract" repo_urls="https://github.com/taiyu-zhu/StepFinder" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03471v1" updated="2026-06-02T10:48:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03483v1" updated="2026-06-02T11:00:49Z" repo_found_in="pdf" repo_urls="https://github.com/brain-lab-research/ https://github.com/brain-lab-research/hc-stream-collapse" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03486v1" updated="2026-06-02T11:01:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03489v1" updated="2026-06-02T11:07:20Z" repo_found_in="pdf" repo_urls="https://github.com/Easonnoway/TSP" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03503v1" updated="2026-06-02T11:21:27Z" repo_found_in="pdf" repo_urls="https://github.com/ziyanliux/ThoughtFold" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03504v1" updated="2026-06-02T11:23:49Z" repo_found_in="comment" repo_urls="https://github.com/mohdali-dev/BaltiVoice-ASR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03512v1" updated="2026-06-02T11:29:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03517v1" updated="2026-06-02T11:38:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03521v1" updated="2026-06-02T11:43:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03523v1" updated="2026-06-02T11:44:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03544v1" updated="2026-06-02T12:08:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03566v1" updated="2026-06-02T12:32:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03568v1" updated="2026-06-02T12:34:14Z" repo_found_in="abstract" repo_urls="https://github.com/rst-tu-dortmund/learned-3d-nms" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03598v2" updated="2026-06-03T03:44:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03601v1" updated="2026-06-02T13:07:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03602v1" updated="2026-06-02T13:07:43Z" repo_found_in="abstract" repo_urls="https://github.com/OpenCausaLab/CauTion" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03606v2" updated="2026-06-03T12:40:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03608v1" updated="2026-06-02T13:11:09Z" repo_found_in="abstract" repo_urls="https://github.com/shanjf666/CoCoV" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03618v1" updated="2026-06-02T13:17:45Z" repo_found_in="pdf" repo_urls="https://huggingface.co/microsoft/ https://huggingface.co/microsoft/llmlingua-2-xlm-roberta-large-meetingbank" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03620v1" updated="2026-06-02T13:20:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03624v1" updated="2026-06-02T13:23:28Z" repo_found_in="pdf" repo_urls="https://github.com/stevenzyzhao/CRGC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03628v1" updated="2026-06-02T13:26:17Z" repo_found_in="abstract" repo_urls="https://github.com/TreeLLi/hallucination-rejection-sampling" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03635v1" updated="2026-06-02T13:31:57Z" repo_found_in="comment" repo_urls="https://iyttor.github.io/VidMsg" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03641v1" updated="2026-06-02T13:35:12Z" repo_found_in="comment" repo_urls="https://github.com/wongqihan/ai-behavioral-experiments" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03647v1" updated="2026-06-02T13:39:15Z" repo_found_in="pdf" repo_urls="https://github.com/SEML-Lab/IHO https://huggingface.co/collections/SEML-Lab/iho" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03648v1" updated="2026-06-02T13:39:17Z" repo_found_in="pdf" repo_urls="https://github.com/IBM/SafeLoRA https://github.com/PKU-Alignment/beavertails https://github.com/SORRY-Bench/SORRY-Bench https://github.com/UKGovernmentBEIS/inspect_ https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/arc https://github.com/openai/grade-school-math https://github.com/paul-rottger/xstest https://github.com/tatsu-lab/stanford_alpaca https://huggingface.co/datasets/google/boolq https://huggingface.co/datasets/nvidia/ https://huggingface.co/datasets/nvidia/Nemotron-Science-v1 https://huggingface.co/docs/trl https://huggingface.co/meta-llama/ https://huggingface.co/meta-llama/Llama-Guard-3-8B" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03650v2" updated="2026-06-04T10:01:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03655v1" updated="2026-06-02T13:44:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03657v1" updated="2026-06-02T13:46:04Z" repo_found_in="pdf" repo_urls="https://github.com/JimmmmmL/NovelAPIBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03660v2" updated="2026-06-03T14:05:21Z" repo_found_in="pdf" repo_urls="https://github.com/fresnellll/ChemCoTBench-V2 https://huggingface.co/datasets/fresnellll/ChemCoTBench-V2" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03664v1" updated="2026-06-02T13:50:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03685v1" updated="2026-06-02T14:09:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03689v1" updated="2026-06-02T14:11:23Z" repo_found_in="pdf" repo_urls="https://github.com/marianaw/ https://github.com/marianaw/frozenhazard.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03692v1" updated="2026-06-02T14:14:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03704v1" updated="2026-06-02T14:22:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03741v1" updated="2026-06-02T14:55:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03743v1" updated="2026-06-02T14:56:10Z" repo_found_in="pdf" repo_urls="https://github.com/pelicanhere/proof-refactor" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03748v1" updated="2026-06-02T15:01:13Z" repo_found_in="abstract" repo_urls="https://github.com/ultralytics/ultralytics" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03755v1" updated="2026-06-02T15:03:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03762v1" updated="2026-06-02T15:16:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03763v1" updated="2026-06-02T15:18:03Z" repo_found_in="pdf" repo_urls="https://github.com/ln9527/publishing-production-function" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03770v1" updated="2026-06-02T15:23:28Z" repo_found_in="pdf" repo_urls="https://github.com/ggml-org/llama.cpp" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03777v1" updated="2026-06-02T15:29:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03796v1" updated="2026-06-02T15:45:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03800v1" updated="2026-06-02T15:48:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03803v2" updated="2026-06-09T17:04:13Z" repo_found_in="pdf" repo_urls="https://sonycslparis.github.io/liveband-companion" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03808v1" updated="2026-06-02T15:53:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03810v2" updated="2026-06-03T10:22:34Z" repo_found_in="pdf" repo_urls="https://github.com/AI-Safety-Institute/ https://github.com/AI-Safety-Institute/consistency-misalignment" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03811v1" updated="2026-06-02T15:54:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03814v1" updated="2026-06-02T15:57:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03841v1" updated="2026-06-02T16:20:58Z" repo_found_in="abstract" repo_urls="https://github.com/usail-hkust/EvoDS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03846v1" updated="2026-06-02T16:25:54Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/ccqq77/clustered_self_assessment" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03852v1" updated="2026-06-02T16:29:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03858v1" updated="2026-06-02T16:32:53Z" repo_found_in="pdf" repo_urls="https://github.com/huggingface/Math-Verify https://github.com/optifine233-ship-it/ https://github.com/optifine233-ship-it/PyraMathBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03867v1" updated="2026-06-02T16:39:07Z" repo_found_in="pdf" repo_urls="https://github.com/lupanh/VietnameseMDS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03879v1" updated="2026-06-02T16:46:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03883v1" updated="2026-06-02T16:49:19Z" repo_found_in="pdf" repo_urls="https://github.com/ETH-DISCO/ https://github.com/ETH-DISCO/llm-reasoning-efficiency" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03892v2" updated="2026-06-03T17:27:54Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/ https://huggingface.co/Qwen/Qwen2 https://huggingface.co/Qwen/Qwen2.5-7B-Instruct https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 https://huggingface.co/Qwen/Qwen3-8B https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct https://huggingface.co/ibm-granite/ https://huggingface.co/ibm-granite/granite-4.1-8b-instruct" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03895v1" updated="2026-06-02T16:53:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03906v1" updated="2026-06-02T17:00:49Z" repo_found_in="abstract" repo_urls="https://github.com/Bunnybeibei/scTranslation" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03907v1" updated="2026-06-02T17:01:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03910v1" updated="2026-06-02T17:06:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03918v1" updated="2026-06-02T17:11:56Z" repo_found_in="abstract" repo_urls="https://github.com/Trata-Inc/trata-hedge-bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03937v2" updated="2026-06-03T10:57:29Z" repo_found_in="pdf" repo_urls="https://github.com/Leonnnnnn929/VEPO https://huggingface.co/collections/Qwen/qwen25-vl" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03938v2" updated="2026-06-03T02:07:12Z" repo_found_in="pdf" repo_urls="https://github.com/qlabs-eng/slowrun" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03939v1" updated="2026-06-02T17:28:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03957v1" updated="2026-06-02T17:46:12Z" repo_found_in="pdf" repo_urls="https://huggingface.co/coqui/XTTS-v2 https://huggingface.co/nvidia/stt_en_ https://huggingface.co/nvidia/stt_en_fastconformer_ctc_large https://huggingface.co/openai/ https://huggingface.co/openai/whisper-large-v3" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03963v3" updated="2026-06-09T15:09:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03965v1" updated="2026-06-02T17:51:30Z" repo_found_in="abstract" repo_urls="https://github.com/Andree-9/ACTS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03967v1" updated="2026-06-02T17:52:18Z" repo_found_in="pdf" repo_urls="https://github.com/QuentinFuxa/AlignAtt4LLM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03968v1" updated="2026-06-02T17:53:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03969v1" updated="2026-06-02T17:53:45Z" repo_found_in="comment" repo_urls="https://github.com/yale-nlp/faithful_lrm" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04037v2" updated="2026-06-04T15:00:59Z" repo_found_in="comment" repo_urls="https://github.com/frank-luongt/faos-research/tree/main/RA-6" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04039v1" updated="2026-06-02T03:32:36Z" repo_found_in="abstract" repo_urls="https://github.com/shoraaa/DyNACO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04040v1" updated="2026-06-02T04:13:37Z" repo_found_in="pdf" repo_urls="https://github.com/jqin4749/EEG-to-Music" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04045v1" updated="2026-06-02T07:13:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04046v1" updated="2026-06-02T07:50:56Z" repo_found_in="abstract" repo_urls="https://future-item.github.io/SceneDiver" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04048v1" updated="2026-06-02T08:45:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04050v1" updated="2026-06-02T08:52:04Z" repo_found_in="abstract" repo_urls="https://github.com/Heliulu/LiftQuant" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04053v1" updated="2026-06-02T09:33:06Z" repo_found_in="abstract" repo_urls="https://github.com/EduardoTerres/bta_paper" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04057v1" updated="2026-06-02T11:17:28Z" repo_found_in="pdf" repo_urls="https://github.com/mpi- https://github.com/mpi-dsg/invisible-lottery" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04063v1" updated="2026-06-02T12:57:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04074v1" updated="2026-06-02T15:49:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04075v1" updated="2026-06-02T16:29:48Z" repo_found_in="pdf" repo_urls="https://github.com/thinkwee/SocioHack" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04104v1" updated="2026-06-02T18:10:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04115v1" updated="2026-06-02T18:23:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04123v1" updated="2026-06-02T18:33:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04143v1" updated="2026-06-02T19:01:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04145v1" updated="2026-06-02T19:03:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04152v1" updated="2026-06-02T19:19:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04164v1" updated="2026-06-02T19:26:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04167v1" updated="2026-06-02T19:29:35Z" repo_found_in="pdf" repo_urls="https://github.com/dimichai/tabular-tndp https://github.com/weiyu123112/City-Metro-Network-Expansion-with-RL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04171v1" updated="2026-06-02T19:35:44Z" repo_found_in="comment" repo_urls="https://github.com/mjbommar/mimelens-training https://huggingface.co/mjbommar" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04177v1" updated="2026-06-02T19:46:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04182v1" updated="2026-06-02T19:54:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04191v1" updated="2026-06-02T20:20:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04193v1" updated="2026-06-02T20:21:58Z" repo_found_in="comment" repo_urls="https://github.com/juanfiguera/sello" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04202v1" updated="2026-06-02T20:40:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04205v1" updated="2026-06-02T20:49:20Z" repo_found_in="abstract" repo_urls="https://github.com/sadjadeb/DetectZoo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04223v1" updated="2026-06-02T21:21:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04226v1" updated="2026-06-02T21:25:49Z" repo_found_in="comment" repo_urls="https://percept-twin.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04227v2" updated="2026-06-06T00:13:11Z" repo_found_in="comment" repo_urls="https://github.com/Jasonleonardvolk/sigma" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04236v1" updated="2026-06-02T21:35:54Z" repo_found_in="pdf" repo_urls="https://github.com/Jerryaa98/AXON.git https://huggingface.co/Dream-org/Dream-v0-Instruct- https://huggingface.co/GSAI-ML/LLaDA-1.5 https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04238v1" updated="2026-06-02T21:37:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04240v1" updated="2026-06-02T21:39:32Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/Jingbiao/M2KR-WWW2025-Challenge https://huggingface.co/datasets/MMDocIR/MMDocIR-Challenge" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04246v1" updated="2026-06-02T21:52:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04269v1" updated="2026-06-02T22:46:20Z" repo_found_in="abstract" repo_urls="https://instant-fold.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04271v1" updated="2026-06-02T22:50:55Z" repo_found_in="abstract" repo_urls="https://github.com/stepankonev/StandardE2E" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04273v1" updated="2026-06-02T22:58:19Z" repo_found_in="pdf" repo_urls="https://collinskatie.github.io/hai_formalization.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04275v1" updated="2026-06-02T23:02:54Z" repo_found_in="comment" repo_urls="https://openreview.net/forum?id=TdiRLe3rPA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04284v1" updated="2026-06-02T23:19:12Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen3.6-27B https://huggingface.co/Qwen/Qwen3.6-35B-A3B https://huggingface.co/Ray2333/GRM-Llama3.2-3B-rewardmodel-ft https://huggingface.co/datasets/hendrydong/preference_700K https://huggingface.co/datasets/microsoft/rpr https://huggingface.co/datasets/namkoong-lab/PersonalLLM https://huggingface.co/datasets/stanfordnlp/SHP" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04287v1" updated="2026-06-02T23:34:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04296v1" updated="2026-06-02T23:54:27Z" repo_found_in="comment" repo_urls="https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04298v1" updated="2026-06-02T23:59:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04315v1" updated="2026-06-03T00:42:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04320v1" updated="2026-06-03T00:48:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04321v1" updated="2026-06-03T00:49:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04326v1" updated="2026-06-03T01:01:05Z" repo_found_in="comment" repo_urls="https://github.com/ustunb/concept-benchmark" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04342v1" updated="2026-06-03T01:50:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04373v2" updated="2026-06-05T08:17:28Z" repo_found_in="abstract" repo_urls="https://github.com/hfutqian/MaskAQ" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04374v1" updated="2026-06-03T02:42:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04381v1" updated="2026-06-03T02:54:59Z" repo_found_in="abstract" repo_urls="https://github.com/chuchen2017/SLM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04387v1" updated="2026-06-03T03:05:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04388v1" updated="2026-06-03T03:06:39Z" repo_found_in="comment" repo_urls="https://github.com/error8149/FedAnilPlus-Optimized" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04391v1" updated="2026-06-03T03:11:50Z" repo_found_in="abstract" repo_urls="https://github.com/plusnli/skill-dynamic-retrieval" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04402v1" updated="2026-06-03T03:29:57Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen3-8B https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04405v1" updated="2026-06-03T03:33:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04408v1" updated="2026-06-03T03:36:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04409v2" updated="2026-06-06T02:49:10Z" repo_found_in="abstract" repo_urls="https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04419v1" updated="2026-06-03T03:56:25Z" repo_found_in="abstract" repo_urls="https://github.com/sodicksonlab/L-TGVN" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04421v2" updated="2026-06-06T03:16:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04425v1" updated="2026-06-03T04:14:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04435v1" updated="2026-06-03T04:33:47Z" repo_found_in="pdf" repo_urls="https://github.com/sarmishra/ https://github.com/sarmishra/CHARM-agentic-rag" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04438v1" updated="2026-06-03T04:38:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04442v1" updated="2026-06-03T04:44:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04445v1" updated="2026-06-03T04:47:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04455v1" updated="2026-06-03T04:58:17Z" repo_found_in="comment" repo_urls="https://meta-agent-challenge.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04459v1" updated="2026-06-03T05:06:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04460v1" updated="2026-06-03T05:06:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04465v1" updated="2026-06-03T05:20:30Z" repo_found_in="comment" repo_urls="https://github.com/taowangcheng/SePO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04468v1" updated="2026-06-03T05:27:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04469v1" updated="2026-06-03T05:29:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04484v1" updated="2026-06-03T06:02:52Z" repo_found_in="pdf" repo_urls="https://github.com/modelscope/AgentJet" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04493v1" updated="2026-06-03T06:18:05Z" repo_found_in="abstract" repo_urls="https://github.com/Kirito14IT/SFMambaNet" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04505v1" updated="2026-06-03T06:36:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04516v1" updated="2026-06-03T06:47:50Z" repo_found_in="pdf" repo_urls="https://github.com/gczhu/GeoMin" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04517v1" updated="2026-06-03T06:52:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04522v1" updated="2026-06-03T07:00:50Z" repo_found_in="pdf" repo_urls="https://github.com/WeiJiuQi/ https://github.com/facebookresearch/faiss https://github.com/gaoj0017/ https://github.com/gaoj0017/RaBitQ https://github.com/gouyt13/SymphonyQG https://github.com/spotify/annoy https://huggingface.co/datasets/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04535v1" updated="2026-06-03T07:18:23Z" repo_found_in="pdf" repo_urls="https://github.com/Westlake-AGI-Lab/DIA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04550v1" updated="2026-06-03T07:34:46Z" repo_found_in="comment" repo_urls="https://github.com/andersvestrum/carbon-aware-recsys" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04560v2" updated="2026-06-04T06:53:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04581v2" updated="2026-06-07T06:19:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04592v1" updated="2026-06-03T08:30:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04594v1" updated="2026-06-03T08:32:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04597v1" updated="2026-06-03T08:35:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04602v2" updated="2026-06-09T05:11:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04619v1" updated="2026-06-03T08:55:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04620v1" updated="2026-06-03T08:55:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04646v1" updated="2026-06-03T09:14:43Z" repo_found_in="pdf" repo_urls="https://github.com/ZHANG-MENGAO/qo-bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04656v1" updated="2026-06-03T09:26:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04662v1" updated="2026-06-03T09:40:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04672v2" updated="2026-06-04T09:25:02Z" repo_found_in="pdf" repo_urls="https://github.com/adhocmp122/CTDG-SSM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04684v1" updated="2026-06-03T10:05:18Z" repo_found_in="comment" repo_urls="https://github.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04699v1" updated="2026-06-03T10:24:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04705v1" updated="2026-06-03T10:34:15Z" repo_found_in="abstract" repo_urls="https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04718v2" updated="2026-06-09T06:35:28Z" repo_found_in="pdf" repo_urls="https://core-moe.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04735v1" updated="2026-06-03T11:19:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04739v1" updated="2026-06-03T11:20:37Z" repo_found_in="abstract" repo_urls="https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04743v1" updated="2026-06-03T11:23:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04750v1" updated="2026-06-03T11:31:37Z" repo_found_in="pdf" repo_urls="https://github.com/ajvish91/fog-of-love-rl" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04752v2" updated="2026-06-08T08:15:04Z" repo_found_in="comment" repo_urls="https://github.com/OssiLehtinen/channel-encoder-audit" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04755v1" updated="2026-06-03T11:38:30Z" repo_found_in="pdf" repo_urls="https://archi-physics.github.io/archi/ https://archi-physics.github.io/website/ https://github.com/archi-physics/archi" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04769v1" updated="2026-06-03T11:51:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04778v1" updated="2026-06-03T12:01:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04781v1" updated="2026-06-03T12:02:49Z" repo_found_in="pdf" repo_urls="https://github.com/zach-blumenfeld/aip https://github.com/zach-blumenfeld/aip-skillbench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04812v2" updated="2026-06-05T04:02:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04815v1" updated="2026-06-03T12:38:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04816v1" updated="2026-06-03T12:39:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04820v1" updated="2026-06-03T12:42:59Z" repo_found_in="pdf" repo_urls="https://github.com/tobna/oa-cutmix https://huggingface.co/datasets/TNauen/Sam3-ImageNet-Segmentations" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04833v1" updated="2026-06-03T13:02:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04860v1" updated="2026-06-03T13:27:46Z" repo_found_in="pdf" repo_urls="https://github.com/siddzzzz/ https://github.com/siddzzzz/empirical-admissible-neural-heuristics" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04867v1" updated="2026-06-03T13:33:42Z" repo_found_in="abstract" repo_urls="https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04877v1" updated="2026-06-03T13:41:08Z" repo_found_in="pdf" repo_urls="https://github.com/data61/PSL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04903v1" updated="2026-06-03T14:01:33Z" repo_found_in="pdf" repo_urls="https://github.com/Thistleseeds/ https://github.com/Thistleseeds/agentic-redux" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04906v1" updated="2026-06-03T14:05:14Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/TristoneJiang/SenDetEX https://github.com/UKPLab/ https://github.com/UKPLab/arxiv2026-aitdna https://github.com/google/ https://github.com/google/diff-match-patch https://huggingface.co/datasets/ https://huggingface.co/datasets/UKPLab/AITDNA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04922v1" updated="2026-06-03T14:17:57Z" repo_found_in="comment" repo_urls="https://github.com/tientrandinh/OGKD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04923v1" updated="2026-06-03T14:18:23Z" repo_found_in="abstract" repo_urls="https://github.com/THUAIS-Lab/CHERRL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04935v1" updated="2026-06-03T14:24:53Z" repo_found_in="pdf" repo_urls="https://github.com/biaslab/ https://github.com/biaslab/UAI-MP-AIF-JAX" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04967v1" updated="2026-06-03T14:49:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04987v1" updated="2026-06-03T15:08:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05004v1" updated="2026-06-03T15:23:06Z" repo_found_in="pdf" repo_urls="https://github.com/NusIoraPrivacy/ https://github.com/NusIoraPrivacy/SharedRequest" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05008v1" updated="2026-06-03T15:28:57Z" repo_found_in="abstract" repo_urls="https://pku-value-lab.github.io/m3eval-homepage" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05025v1" updated="2026-06-03T15:48:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05037v1" updated="2026-06-03T16:02:11Z" repo_found_in="abstract" repo_urls="https://github.com/arquicanedo/self-reflective-apis" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05043v1" updated="2026-06-03T16:05:25Z" repo_found_in="pdf" repo_urls="https://github.com/Universal-Commerce-Protocol https://github.com/Universal-Commerce-Protocol/samples" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05080v1" updated="2026-06-03T16:36:54Z" repo_found_in="comment" repo_urls="https://autolab.moe/ https://github.com/autolabhq/autolab" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05085v1" updated="2026-06-03T16:49:57Z" repo_found_in="pdf" repo_urls="https://allenai.github.io/scispacy/ https://github.com/tohidarehman/ https://github.com/tohidarehman/Title-Generation-ResearchPapers https://huggingface.co/datasets/TRnlp/ https://huggingface.co/datasets/TRnlp/LREC-COLING-2024-Abstract-Title https://huggingface.co/datasets/TRnlp/SpringerSSAT https://huggingface.co/facebook/bart-base https://huggingface.co/google/pegasus-large https://huggingface.co/t5-base https://huggingface.co/unsloth/llama-3-8b-bnb-4bit" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05130v1" updated="2026-06-03T17:34:01Z" repo_found_in="abstract" repo_urls="https://github.com/Unknown-zoo/AgentMob" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05142v1" updated="2026-06-03T17:49:48Z" repo_found_in="comment" repo_urls="https://gem-nr.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05152v2" updated="2026-06-05T17:44:05Z" repo_found_in="pdf" repo_urls="https://github.com/rishabh-1086/distIL https://rishabh-1086.github.io/project-distIL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05158v1" updated="2026-06-03T17:57:04Z" repo_found_in="comment" repo_urls="https://zhenyangcs.github.io/StreamMA-website/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05233v1" updated="2026-06-03T01:21:59Z" repo_found_in="pdf" repo_urls="https://github.com/RPC2/AutoInject" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05241v1" updated="2026-06-03T07:11:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05252v1" updated="2026-06-03T14:26:25Z" repo_found_in="pdf" repo_urls="https://github.com/a https://github.com/alemaiorano/detection-as-code-synthesis" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05256v1" updated="2026-06-03T15:58:32Z" repo_found_in="pdf" repo_urls="https://github.com/kokiljaidka/UnauthorizedRedditCMVPosts" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05261v1" updated="2026-06-03T16:17:43Z" repo_found_in="abstract" repo_urls="https://github.com/ndvbd/NIV" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05262v1" updated="2026-06-03T16:18:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05275v1" updated="2026-06-03T17:59:30Z" repo_found_in="comment" repo_urls="https://thaoshibe.github.io/camroll" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05290v1" updated="2026-06-03T18:00:04Z" repo_found_in="comment" repo_urls="https://aimagelab.github.io/cross-model-safety-representations/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05296v1" updated="2026-06-03T18:00:07Z" repo_found_in="abstract" repo_urls="https://github.com/layer6ai-labs/Agentic-Monte-Carlo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05304v1" updated="2026-06-03T18:00:22Z" repo_found_in="abstract" repo_urls="https://github.com/iNLP-Lab/PACT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05308v1" updated="2026-06-03T18:01:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05315v1" updated="2026-06-03T18:05:50Z" repo_found_in="pdf" repo_urls="https://github.com/rmsolgi/lori" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05326v1" updated="2026-06-03T18:11:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05332v1" updated="2026-06-03T18:17:40Z" repo_found_in="pdf" repo_urls="https://github.com/birla-ai-labs/gitco" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05357v1" updated="2026-06-03T18:59:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05378v1" updated="2026-06-03T19:27:07Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/skydancerosel/spectral-probe-circuits" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05383v1" updated="2026-06-03T19:36:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05384v1" updated="2026-06-03T19:37:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05389v1" updated="2026-06-03T19:49:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05391v1" updated="2026-06-03T19:53:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05395v1" updated="2026-06-03T20:02:35Z" repo_found_in="comment" repo_urls="https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05396v1" updated="2026-06-03T20:06:52Z" repo_found_in="pdf" repo_urls="https://github.com/dessertlab/AblitEval" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05400v1" updated="2026-06-03T20:09:39Z" repo_found_in="abstract" repo_urls="https://github.com/YuanheZ/LeanMarathon" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05402v1" updated="2026-06-03T20:12:26Z" repo_found_in="abstract" repo_urls="https://github.com/jinulee-v/reasoningflow" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05404v1" updated="2026-06-03T20:20:34Z" repo_found_in="abstract" repo_urls="https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05405v1" updated="2026-06-03T20:20:46Z" repo_found_in="comment" repo_urls="https://agents-last-exam.org https://github.com/rdi-berkeley/agents-last-exam" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05408v1" updated="2026-06-03T20:22:29Z" repo_found_in="abstract" repo_urls="https://github.com/can-gurkan/lmca" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05411v1" updated="2026-06-03T20:25:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05413v1" updated="2026-06-03T20:27:48Z" repo_found_in="pdf" repo_urls="https://github.com/ZZQ-NTU/CausalPOI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05414v1" updated="2026-06-03T20:28:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05415v1" updated="2026-06-03T20:28:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05433v1" updated="2026-06-03T20:57:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05434v1" updated="2026-06-03T20:57:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05436v1" updated="2026-06-03T20:58:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05441v2" updated="2026-06-07T02:08:53Z" repo_found_in="comment" repo_urls="https://github.com/zadid6pretam/GOTabPFN https://huggingface.co/spaces/zadid6pretam/GOTabPFN https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU https://pypi.org/project/gotabpfn https://www.zadidhabib.com/gotabpfn.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05444v1" updated="2026-06-03T21:06:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05449v1" updated="2026-06-03T21:10:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05461v1" updated="2026-06-03T21:37:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05463v2" updated="2026-06-09T17:31:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05464v1" updated="2026-06-03T21:43:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05494v1" updated="2026-06-03T22:34:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05510v1" updated="2026-06-03T23:28:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05522v1" updated="2026-06-03T23:53:27Z" repo_found_in="pdf" repo_urls="https://github.com/Faria-Binte-Kader/ https://github.com/Faria-Binte-Kader/South-Asian-Music-data" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05525v1" updated="2026-06-04T00:14:25Z" repo_found_in="abstract" repo_urls="https://github.com/KuangshiAi/SciVisAgentSkills" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05528v1" updated="2026-06-04T00:18:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05531v1" updated="2026-06-04T00:21:22Z" repo_found_in="abstract" repo_urls="https://github.com/qcri/Almieyar-Oryx-BloomBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05532v1" updated="2026-06-04T00:21:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05533v1" updated="2026-06-04T00:26:04Z" repo_found_in="comment" repo_urls="https://A4Dance-reasoning.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05535v1" updated="2026-06-04T00:37:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05551v1" updated="2026-06-04T01:05:57Z" repo_found_in="pdf" repo_urls="https://github.com/Telvc/AC-RAC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05552v1" updated="2026-06-04T01:06:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05555v1" updated="2026-06-04T01:09:20Z" repo_found_in="pdf" repo_urls="https://github.com/johanobandoc/ScaleMRL.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05561v1" updated="2026-06-04T01:16:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05584v1" updated="2026-06-04T01:58:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05587v1" updated="2026-06-04T02:04:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05602v1" updated="2026-06-04T02:25:19Z" repo_found_in="abstract" repo_urls="https://misoshiruseijin.github.io/SENSEI/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05606v1" updated="2026-06-04T02:27:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05608v1" updated="2026-06-04T02:30:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05609v1" updated="2026-06-04T02:31:29Z" repo_found_in="abstract" repo_urls="https://github.com/youai058/SlotGCG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05613v1" updated="2026-06-04T02:36:30Z" repo_found_in="pdf" repo_urls="https://github.com/iNLP-Lab/BK-MOO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05625v1" updated="2026-06-04T02:50:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05633v1" updated="2026-06-04T03:00:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05646v1" updated="2026-06-04T03:17:21Z" repo_found_in="abstract" repo_urls="https://xhguo7.github.io/MemOp/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05647v1" updated="2026-06-04T03:22:17Z" repo_found_in="pdf" repo_urls="https://github.com/CHATS-lab/coding-agent-safety-monitor" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05654v2" updated="2026-06-05T15:08:55Z" repo_found_in="pdf" repo_urls="https://github.com/Jeromeyluck/CodeMixBench https://github.com/Surajtk/ https://github.com/Surajtk/workflow-moderation-code# https://huggingface.co/datasets/TrustAIRLab/HateBenchSet https://huggingface.co/google-bert/bert-base-multilingual-cased" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05658v1" updated="2026-06-04T03:38:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05661v1" updated="2026-06-04T03:43:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05670v1" updated="2026-06-04T03:50:47Z" repo_found_in="comment" repo_urls="https://github.com/LINs-lab/MASArena/tree/BenchAgent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05678v1" updated="2026-06-04T04:00:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05679v1" updated="2026-06-04T04:01:24Z" repo_found_in="abstract" repo_urls="https://github.com/dataflowcontrol/data-flow-control" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05682v2" updated="2026-06-05T01:01:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05684v1" updated="2026-06-04T04:06:08Z" repo_found_in="abstract" repo_urls="https://github.com/yunx-z/AdaMEM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05688v1" updated="2026-06-04T04:13:05Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/nvidia/ https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05692v1" updated="2026-06-04T04:18:28Z" repo_found_in="pdf" repo_urls="https://github.com/complex-ai-lab/epi-cf-benchmark" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05702v1" updated="2026-06-04T04:49:09Z" repo_found_in="abstract" repo_urls="https://github.com/LuoRenqiang/ChronoVision" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05704v1" updated="2026-06-04T04:52:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05710v1" updated="2026-06-04T05:05:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05720v1" updated="2026-06-04T05:24:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05728v1" updated="2026-06-04T05:37:31Z" repo_found_in="abstract" repo_urls="https://github.com/puddingyeah/DiG-Plan" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05734v1" updated="2026-06-04T05:49:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05737v1" updated="2026-06-04T05:58:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05740v1" updated="2026-06-04T06:07:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05749v1" updated="2026-06-04T06:23:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05754v1" updated="2026-06-04T06:29:25Z" repo_found_in="abstract" repo_urls="https://github.com/wawa-abc/das" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05758v1" updated="2026-06-04T06:37:10Z" repo_found_in="pdf" repo_urls="https://dragonlzm.github.io/DRIFT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05770v1" updated="2026-06-04T06:53:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05776v1" updated="2026-06-04T07:04:57Z" repo_found_in="pdf" repo_urls="https://github.com/mikhlasmscs25seecs-ship-it/IoT-Intrusion-" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05779v1" updated="2026-06-04T07:08:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05785v1" updated="2026-06-04T07:16:06Z" repo_found_in="pdf" repo_urls="https://github.com/shawaiz202/ https://github.com/shawaiz202/-Automated-License-Plate-Detection-with-Real-Time-Alert" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05792v1" updated="2026-06-04T07:22:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05805v1" updated="2026-06-04T07:34:35Z" repo_found_in="abstract" repo_urls="https://github.com/YUHAOSUNABC/TRIAD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05806v1" updated="2026-06-04T07:38:46Z" repo_found_in="abstract" repo_urls="https://github.com/Zhudongsheng75/ToolMaze" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05817v1" updated="2026-06-04T07:58:55Z" repo_found_in="pdf" repo_urls="https://github.com/c-wei/AttCT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05818v1" updated="2026-06-04T07:59:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05852v1" updated="2026-06-04T08:27:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05861v1" updated="2026-06-04T08:35:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05871v1" updated="2026-06-04T08:45:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05872v2" updated="2026-06-05T19:21:00Z" repo_found_in="pdf" repo_urls="https://github.com/olahsymbo/agent-eval https://github.com/olahsymbo/agent-eval1" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05873v1" updated="2026-06-04T08:47:08Z" repo_found_in="abstract" repo_urls="https://ladderman-robot.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05888v1" updated="2026-06-04T08:57:45Z" repo_found_in="pdf" repo_urls="https://github.com/nissymori/ReMAC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05890v1" updated="2026-06-04T08:59:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05901v1" updated="2026-06-04T09:07:06Z" repo_found_in="pdf" repo_urls="https://github.com/NICD-UK/??? https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05922v1" updated="2026-06-04T09:26:00Z" repo_found_in="comment" repo_urls="https://github.com/wbopan/retro-harness https://paper-rho.wenbo.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05924v1" updated="2026-06-04T09:27:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05925v1" updated="2026-06-04T09:28:54Z" repo_found_in="pdf" repo_urls="https://github.com/Yuejingkun/Awesome-Biomedical-World-Models" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05931v1" updated="2026-06-04T09:33:58Z" repo_found_in="pdf" repo_urls="https://huggingface.co/ https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05932v2" updated="2026-06-08T06:52:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05956v1" updated="2026-06-04T09:53:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05970v1" updated="2026-06-04T10:14:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05976v1" updated="2026-06-04T10:17:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05983v1" updated="2026-06-04T10:25:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05998v1" updated="2026-06-04T10:44:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06003v1" updated="2026-06-04T10:56:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06021v3" updated="2026-06-09T02:20:46Z" repo_found_in="abstract" repo_urls="https://github.com/ShenzhiYang2000/OPRD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06027v1" updated="2026-06-04T11:20:10Z" repo_found_in="abstract" repo_urls="https://github.com/Ahghaffari/redditpersona" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06034v1" updated="2026-06-04T11:29:05Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06054v1" updated="2026-06-04T11:54:29Z" repo_found_in="pdf" repo_urls="https://github.com/Kevin-Zh-CS/MemGate" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06055v1" updated="2026-06-04T11:54:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06056v1" updated="2026-06-04T11:57:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06058v1" updated="2026-06-04T11:58:59Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/m-salmani78/MDP-GRPO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06076v2" updated="2026-06-06T12:18:07Z" repo_found_in="abstract" repo_urls="https://github.com/Oranger-l/MGSD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06080v1" updated="2026-06-04T12:16:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06081v1" updated="2026-06-04T12:17:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06087v1" updated="2026-06-04T12:26:09Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/yuaofan0-oss/LatentSkill" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06090v1" updated="2026-06-04T12:26:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06094v1" updated="2026-06-04T12:32:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06096v1" updated="2026-06-04T12:34:15Z" repo_found_in="abstract" repo_urls="https://github.com/paavo5/ordergrad" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06102v1" updated="2026-06-04T12:42:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06109v1" updated="2026-06-04T12:57:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06133v1" updated="2026-06-04T13:17:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06136v1" updated="2026-06-04T13:19:19Z" repo_found_in="pdf" repo_urls="https://eonmath.github.io/mechmath https://github.com/EonMath/Vasc-n9-cert/releases/tag/v1.0.0" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06160v1" updated="2026-06-04T13:32:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06168v1" updated="2026-06-04T13:40:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06178v1" updated="2026-06-04T13:53:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06203v1" updated="2026-06-04T14:08:30Z" repo_found_in="pdf" repo_urls="https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06207v1" updated="2026-06-04T14:14:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06217v1" updated="2026-06-04T14:31:11Z" repo_found_in="abstract" repo_urls="https://github.com/TanmouTT/DisasterBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06225v1" updated="2026-06-04T14:35:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06235v1" updated="2026-06-04T14:41:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06240v1" updated="2026-06-04T14:46:52Z" repo_found_in="comment" repo_urls="https://github.com/ZenAlexa/toki-bitemporal-memory" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06242v1" updated="2026-06-04T14:47:40Z" repo_found_in="abstract" repo_urls="https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot https://huggingface.co/datasets/ai4data/data-snapshot" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06245v1" updated="2026-06-04T14:48:44Z" repo_found_in="pdf" repo_urls="https://github.com/EDGSCOUT/MPCoT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06261v1" updated="2026-06-04T15:05:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06272v1" updated="2026-06-04T15:14:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06284v1" updated="2026-06-04T15:24:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06285v1" updated="2026-06-04T15:25:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06303v1" updated="2026-06-04T15:41:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06311v1" updated="2026-06-04T15:52:21Z" repo_found_in="pdf" repo_urls="https://github.com/Marchetz/MANTRA-CVPR20" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06315v1" updated="2026-06-04T15:54:34Z" repo_found_in="pdf" repo_urls="https://github.com/Thibaud-Ardoin/ https://github.com/Thibaud-Ardoin/LLM-Self-Recognition.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06316v1" updated="2026-06-04T15:54:53Z" repo_found_in="pdf" repo_urls="https://github.com/georgepwhuang/rare_event" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06333v1" updated="2026-06-04T16:08:25Z" repo_found_in="pdf" repo_urls="https://github.com/arshandalili/sasa" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06337v1" updated="2026-06-04T16:12:28Z" repo_found_in="comment" repo_urls="https://github.com/Shweta-Mishra-ai/tokenmizer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06356v1" updated="2026-06-04T16:24:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06357v1" updated="2026-06-04T16:25:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06380v1" updated="2026-06-04T16:47:41Z" repo_found_in="comment" repo_urls="https://github.com/wuzengqing001225/ConsciousAI_Indexicality/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06390v1" updated="2026-06-04T16:58:43Z" repo_found_in="abstract" repo_urls="https://kairos-homeworld.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06396v1" updated="2026-06-04T17:02:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06416v1" updated="2026-06-04T17:20:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06423v1" updated="2026-06-04T17:28:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06458v1" updated="2026-06-04T17:50:32Z" repo_found_in="pdf" repo_urls="https://github.com/injurise/ICMIL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06460v1" updated="2026-06-04T17:50:54Z" repo_found_in="comment" repo_urls="https://github.com/mthamil107/Recuse" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06462v1" updated="2026-06-04T17:52:04Z" repo_found_in="comment" repo_urls="https://benchmarkagent.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06464v1" updated="2026-06-04T17:53:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06467v1" updated="2026-06-04T17:54:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06468v1" updated="2026-06-04T17:54:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06470v1" updated="2026-06-04T17:55:11Z" repo_found_in="abstract" repo_urls="https://github.com/Empath-aln/PC-layer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06473v1" updated="2026-06-04T17:55:59Z" repo_found_in="abstract" repo_urls="https://github.com/InternScience/MLEvolve" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06474v1" updated="2026-06-04T17:56:27Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/pauljngr/SARDI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06481v1" updated="2026-06-04T17:58:05Z" repo_found_in="comment" repo_urls="https://github.com/VILA-Lab/OpAI-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06486v1" updated="2026-06-04T17:59:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06492v1" updated="2026-06-04T17:59:46Z" repo_found_in="abstract" repo_urls="https://huggingface.co/code2lora" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06493v3" updated="2026-06-09T07:34:32Z" repo_found_in="comment" repo_urls="https://lzyang2000.github.io/HANDOFF/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06515v1" updated="2026-06-02T10:57:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06518v1" updated="2026-06-02T14:19:51Z" repo_found_in="abstract" repo_urls="https://github.com/shanxierdan/DiBS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06519v1" updated="2026-06-02T14:51:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06521v1" updated="2026-06-02T17:29:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06523v1" updated="2026-06-02T18:46:50Z" repo_found_in="pdf" repo_urls="https://github.com/RickySkywalker/Lean4Agent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06526v1" updated="2026-06-02T20:38:39Z" repo_found_in="pdf" repo_urls="https://github.com/text-machine-lab/crowdmath" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06533v1" updated="2026-06-03T17:58:14Z" repo_found_in="comment" repo_urls="https://icml.cc/virtual/2026/poster/67142" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06534v1" updated="2026-06-03T18:12:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06539v1" updated="2026-06-04T04:01:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06543v1" updated="2026-06-04T06:23:13Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/yuncifor/Railway-Short-Term-Based-on-QUBO-and-Hybrid-Quantum-Algorithms" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06545v1" updated="2026-06-04T07:12:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06547v1" updated="2026-06-04T08:00:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06554v2" updated="2026-06-08T07:48:55Z" repo_found_in="pdf" repo_urls="https://github.com/danimp94/PLASTICS-THz https://github.com/danimp94/PLASTICS-THz/tree/main/data/experiment_5_plastics https://github.com/roshni-mahtani/MSFAN" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06559v1" updated="2026-06-04T12:39:44Z" repo_found_in="pdf" repo_urls="https://github.com/snakers4/silero-vad https://huggingface.co/datasets/ICTNLP/Instr https://huggingface.co/datasets/ICTNLP/InstructS2S-200K https://microsoft.github.io/msmarco/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06560v1" updated="2026-06-04T14:01:32Z" repo_found_in="pdf" repo_urls="https://github.com/MacPaw/ https://github.com/MacPaw/MacArena https://github.com/utmapp/UTM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06564v1" updated="2026-06-04T16:15:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06570v1" updated="2026-06-04T17:51:49Z" repo_found_in="pdf" repo_urls="https://aj730.github.io/PhylogeneticsForMalware/inter_family_vis.html https://aj730.github.io/PhylogeneticsF https://github.com/AJ730/MalwareEvolution" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06572v1" updated="2026-06-04T17:59:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06601v1" updated="2026-06-04T18:00:34Z" repo_found_in="comment" repo_urls="https://gong1130.github.io/DIRECT/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06615v1" updated="2026-06-04T18:05:39Z" repo_found_in="comment" repo_urls="https://nishitanand.github.io/figma-website/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06618v1" updated="2026-06-04T18:09:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06627v1" updated="2026-06-04T18:24:23Z" repo_found_in="comment" repo_urls="https://richardrl.github.io/what-matters-cotraining-human-videos/index.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06635v1" updated="2026-06-04T18:36:42Z" repo_found_in="pdf" repo_urls="https://github.com/sisl/ https://github.com/sisl/LMTwoFailureModeFramework" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06641v1" updated="2026-06-04T18:47:45Z" repo_found_in="pdf" repo_urls="https://github.com/cjchristopher/accelerated-fourier-sat" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06646v1" updated="2026-06-04T18:51:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06664v1" updated="2026-06-04T19:25:40Z" repo_found_in="abstract" repo_urls="https://github.com/deep-real/ViSAE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06679v1" updated="2026-06-04T19:53:12Z" repo_found_in="abstract" repo_urls="https://github.com/xuanxixi/HKJudge" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06694v1" updated="2026-06-04T20:17:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06715v1" updated="2026-06-04T21:05:48Z" repo_found_in="pdf" repo_urls="https://github.com/upasanachatterjee/ https://github.com/upasanachatterjee/causal-inference-on-text https://huggingface.co/datasets/ https://huggingface.co/datasets/upasanachatterjee/AllSides-sentiments" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06717v1" updated="2026-06-04T21:06:31Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/SaketR1/ https://huggingface.co/datasets/SaketR1/shallowbench/tree/main" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06735v2" updated="2026-06-08T18:02:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06741v1" updated="2026-06-04T21:55:48Z" repo_found_in="comment" repo_urls="https://github.com/OpenLAIR/OpenSkill" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06743v1" updated="2026-06-04T21:57:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06748v1" updated="2026-06-04T22:19:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06764v1" updated="2026-06-04T23:04:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06772v1" updated="2026-06-04T23:31:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06779v1" updated="2026-06-04T23:39:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06787v1" updated="2026-06-05T00:11:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06800v1" updated="2026-06-05T01:01:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06805v1" updated="2026-06-05T01:13:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06813v1" updated="2026-06-05T01:27:44Z" repo_found_in="comment" repo_urls="https://github.com/daheekwon/DAVE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06823v1" updated="2026-06-05T01:56:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06833v1" updated="2026-06-05T02:18:23Z" repo_found_in="pdf" repo_urls="https://github.com/jnxie/semantic-gambit" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06838v1" updated="2026-06-05T02:24:25Z" repo_found_in="pdf" repo_urls="https://github.com/LaurieWired/GhidraMCP https://github.com/terryyin/lizard" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06840v1" updated="2026-06-05T02:32:24Z" repo_found_in="pdf" repo_urls="https://github.com/research-anon-487/xcube/tree/reasoning" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06872v1" updated="2026-06-05T03:40:10Z" repo_found_in="abstract" repo_urls="https://egopressdiff.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06885v1" updated="2026-06-05T04:06:42Z" repo_found_in="abstract" repo_urls="https://freeani.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06893v1" updated="2026-06-05T04:19:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06895v1" updated="2026-06-05T04:27:34Z" repo_found_in="pdf" repo_urls="https://github.com/QuantBlockchain/ieee-icbc-tutorial-qbe" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06903v1" updated="2026-06-05T04:39:46Z" repo_found_in="abstract" repo_urls="https://directanimator.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06920v1" updated="2026-06-05T05:34:13Z" repo_found_in="abstract" repo_urls="https://github.com/gulguluu/tiny-slm-finetune-compare" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06923v1" updated="2026-06-05T05:38:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06941v1" updated="2026-06-05T06:12:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06943v1" updated="2026-06-05T06:12:48Z" repo_found_in="abstract" repo_urls="https://github.com/sunoh-kim/SS-TPT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06946v1" updated="2026-06-05T06:19:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06950v1" updated="2026-06-05T06:24:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06959v1" updated="2026-06-05T06:38:40Z" repo_found_in="comment" repo_urls="https://github.com/Nellie179/Hallucination-Detection" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06972v1" updated="2026-06-05T07:01:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06983v1" updated="2026-06-05T07:21:11Z" repo_found_in="abstract" repo_urls="https://alibaba-damo-academy.github.io/DaX/benchboard/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07007v1" updated="2026-06-05T07:52:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07030v1" updated="2026-06-05T08:19:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07032v1" updated="2026-06-05T08:23:25Z" repo_found_in="abstract" repo_urls="https://github.com/sotayang/ZeroSight" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07047v2" updated="2026-06-08T08:14:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07054v1" updated="2026-06-05T08:54:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07074v1" updated="2026-06-05T09:10:50Z" repo_found_in="pdf" repo_urls="https://github.com/AQ-MedAI/AntAFu-DeepResearch https://github.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07080v1" updated="2026-06-05T09:19:24Z" repo_found_in="pdf" repo_urls="https://github.com/rednote-hilab/dots.tts https://huggingface.co/collections/rednote-hilab/dotstts https://rednote-hilab.github.io/dots.tts-demo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07082v1" updated="2026-06-05T09:20:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07094v1" updated="2026-06-05T09:42:01Z" repo_found_in="pdf" repo_urls="https://github.com/MetaConfigurator/meta-configurator/tree/develop/documentation_user/examples/rdf" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07102v1" updated="2026-06-05T09:53:30Z" repo_found_in="abstract" repo_urls="https://github.com/tms-byte/GP-Adapter" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07108v2" updated="2026-06-08T06:45:26Z" repo_found_in="abstract" repo_urls="https://github.com/yu-lin-li/DyCon" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07113v1" updated="2026-06-05T10:08:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07114v1" updated="2026-06-05T10:09:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07116v1" updated="2026-06-05T10:11:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07117v1" updated="2026-06-05T10:13:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07119v1" updated="2026-06-05T10:20:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07141v1" updated="2026-06-05T10:51:20Z" repo_found_in="pdf" repo_urls="https://github.com/anuragiiser/REMIDI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07150v1" updated="2026-06-05T11:07:55Z" repo_found_in="pdf" repo_urls="https://github.com/dangoldbj/agent- https://github.com/dangoldbj/agent-metadata-privacy" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07157v1" updated="2026-06-05T11:17:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07167v1" updated="2026-06-05T11:35:27Z" repo_found_in="pdf" repo_urls="https://github.com/mbzuai-nlp/urdu-mmlu https://huggingface.co/datasets/MBZUAI/UrduMMLU https://mbzuai-nlp.github.io/UrduMMLU/ https://mbzuai-nlp.github.io/UrduMMLU/leaderboard.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07172v1" updated="2026-06-05T11:40:13Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/ https://huggingface.co/datasets/visheratin/google_landmarks_places" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07181v1" updated="2026-06-05T11:45:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07211v1" updated="2026-06-05T12:23:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07222v1" updated="2026-06-05T12:36:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07226v1" updated="2026-06-05T12:42:56Z" repo_found_in="pdf" repo_urls="https://github.com/tzwo/DEFINED" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07237v1" updated="2026-06-05T13:07:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07244v1" updated="2026-06-05T13:11:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07245v1" updated="2026-06-05T13:11:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07253v1" updated="2026-06-05T13:26:11Z" repo_found_in="comment" repo_urls="https://topsis-ranking.vercel.app/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07271v1" updated="2026-06-05T13:46:37Z" repo_found_in="pdf" repo_urls="https://github.com/sourisimos/rectified-flow-membership" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07299v1" updated="2026-06-05T14:10:48Z" repo_found_in="pdf" repo_urls="https://github.com/baidubce/qianfan-deepresearch" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07308v1" updated="2026-06-05T14:24:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07309v1" updated="2026-06-05T14:26:06Z" repo_found_in="pdf" repo_urls="https://huggingface.co/mispeech/midashenglm-7b-0804-fp32 https://huggingface.co/nvidia/audio-flamingo-3# https://huggingface.co/nvidia/audio-flamingo-3#think-mode-reasoning-with-peft-adapter-af-think" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07311v1" updated="2026-06-05T14:28:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07313v1" updated="2026-06-05T14:34:37Z" repo_found_in="pdf" repo_urls="https://github.com/Atmyre/sv-detect/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07316v1" updated="2026-06-05T14:35:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07325v1" updated="2026-06-05T14:43:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07351v1" updated="2026-06-05T15:00:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07365v1" updated="2026-06-05T15:08:50Z" repo_found_in="pdf" repo_urls="https://github.com/Nokia-Bell-Labs/ https://github.com/Nokia-Bell-Labs/papagei-foundation-model https://github.com/hsd1503/resnet1d" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07368v1" updated="2026-06-05T15:11:08Z" repo_found_in="pdf" repo_urls="https://github.com/DeepMicroscopy/MIDOG25_T2_evaluation_docker https://github.com/DeepMicroscopy/MIDOG25_T1_evaluation_ https://github.com/DeepMicroscopy/MIDOG25_T1_evaluation_docker https://github.com/DeepMicroscopy/MIDOG25_T1_reference_ https://github.com/DeepMicroscopy/MIDOG25_T1_reference_docker https://github.com/DeepMicroscopy/MIDOG25_T2_ https://github.com/DeepMicroscopy/MIDOG25_T2_reference_docker" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07379v2" updated="2026-06-08T01:53:05Z" repo_found_in="pdf" repo_urls="https://github.com/ishida-lab/capcode https://github.com/ishida-lab/capreward https://huggingface.co/datasets/ishidalab/capcode https://ishida-lab.github.io/blog_capcode.html https://ishida-lab.github.io/blog_capreward.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07381v1" updated="2026-06-05T15:21:22Z" repo_found_in="pdf" repo_urls="https://github.com/NVlabs/SPADE https://github.com/virginiafdez/brainSPADE3D_rel" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07392v1" updated="2026-06-05T15:29:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07410v1" updated="2026-06-05T15:57:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07412v1" updated="2026-06-05T16:00:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07422v2" updated="2026-06-08T19:53:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07433v1" updated="2026-06-05T16:29:13Z" repo_found_in="abstract" repo_urls="https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07437v1" updated="2026-06-05T16:35:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07451v1" updated="2026-06-05T16:54:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07462v1" updated="2026-06-05T17:13:36Z" repo_found_in="abstract" repo_urls="https://github.com/AARR-bench/AARRI-bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07464v1" updated="2026-06-05T17:16:21Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec https://huggingface.co:nvidia/PhysicalAI-Autonomous-Vehicles-NuRec" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07473v1" updated="2026-06-05T17:26:23Z" repo_found_in="pdf" repo_urls="https://github.com/audiosae/audio-sae https://huggingface.co/Egorgij21/ https://huggingface.co/Egorgij21/Audio-SAE-Whisper-large-v3 https://huggingface.co/Egorgij21/Audio-SAE-Whisper-small" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07475v1" updated="2026-06-05T17:28:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07479v1" updated="2026-06-05T17:34:07Z" repo_found_in="pdf" repo_urls="https://github.com/ysufsimsek/Supervision-versus-Demonstration-Based-In-Context-Learning-for-Multiword-Expression-Classification" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07489v1" updated="2026-06-05T17:45:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07491v1" updated="2026-06-05T17:46:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07512v1" updated="2026-06-05T17:59:21Z" repo_found_in="pdf" repo_urls="https://aim-uofa.github.io/MemDreamer/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07515v1" updated="2026-06-05T17:59:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07645v1" updated="2026-06-02T01:10:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07653v1" updated="2026-06-02T23:08:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07664v1" updated="2026-06-03T19:01:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07673v1" updated="2026-06-04T17:39:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07678v1" updated="2026-06-04T20:23:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07681v1" updated="2026-06-04T22:18:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07684v1" updated="2026-06-05T02:46:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07687v1" updated="2026-06-05T04:43:02Z" repo_found_in="pdf" repo_urls="https://github.com/danijar/dreamerv3" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07690v1" updated="2026-06-05T06:31:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07695v1" updated="2026-06-05T07:33:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07696v1" updated="2026-06-05T07:40:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07698v1" updated="2026-06-05T07:47:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07700v1" updated="2026-06-05T08:43:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07702v1" updated="2026-06-05T09:09:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07703v1" updated="2026-06-05T09:13:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07704v1" updated="2026-06-05T09:18:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07705v1" updated="2026-06-05T10:00:19Z" repo_found_in="abstract" repo_urls="https://github.com/Zhaolutuan/SAW" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07706v1" updated="2026-06-05T10:10:53Z" repo_found_in="abstract" repo_urls="https://github.com/Rishabhpm23/MLingualFC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07708v1" updated="2026-06-05T11:01:42Z" repo_found_in="pdf" repo_urls="https://github.com/oth-aifiud/Cross-View-Urban-Traffic-Dataset https://huggingface.co/datasets/prakharbh/CrossViewUrbanTrafficDataset" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07710v1" updated="2026-06-05T13:23:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07711v1" updated="2026-06-05T13:50:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07712v1" updated="2026-06-05T14:11:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07713v1" updated="2026-06-05T14:44:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07714v1" updated="2026-06-05T14:46:50Z" repo_found_in="pdf" repo_urls="https://huggingface.co/albert/albert-base-v2" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07716v1" updated="2026-06-05T15:25:51Z" repo_found_in="comment" repo_urls="https://github.com/maryamzaman-git/SHEILD-IDS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07717v1" updated="2026-06-05T15:30:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07718v1" updated="2026-06-05T15:38:18Z" repo_found_in="pdf" repo_urls="https://github.com/kaihorstmann/neuro-d2d-eval" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07720v1" updated="2026-06-05T15:45:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07721v1" updated="2026-06-05T15:57:35Z" repo_found_in="pdf" repo_urls="https://huggingface.co/docs/transformers/en/index" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07722v1" updated="2026-06-05T16:04:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07766v1" updated="2026-06-05T18:24:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07771v1" updated="2026-06-05T18:29:19Z" repo_found_in="pdf" repo_urls="https://github.com/.............." />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07780v1" updated="2026-06-05T18:46:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07798v1" updated="2026-06-05T19:23:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07802v1" updated="2026-06-05T19:32:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07805v1" updated="2026-06-05T19:33:58Z" repo_found_in="pdf" repo_urls="https://github.com/leonardeee/MAC-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07810v1" updated="2026-06-05T19:38:15Z" repo_found_in="abstract" repo_urls="https://anishh15.github.io/SLMJury/ https://github.com/anishh15/SLMJury" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07812v1" updated="2026-06-05T19:39:35Z" repo_found_in="pdf" repo_urls="https://yikee.github.io/open-model-collaboration/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07819v1" updated="2026-06-05T20:02:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07828v1" updated="2026-06-05T20:30:08Z" repo_found_in="pdf" repo_urls="https://github.com/jyh/jas" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07833v1" updated="2026-06-05T20:47:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07834v1" updated="2026-06-05T20:51:51Z" repo_found_in="pdf" repo_urls="https://github.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07836v1" updated="2026-06-05T20:55:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07846v1" updated="2026-06-05T21:13:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07853v1" updated="2026-06-05T21:29:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07865v1" updated="2026-06-05T21:53:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07866v1" updated="2026-06-05T21:54:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07882v1" updated="2026-06-05T22:35:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07889v1" updated="2026-06-05T22:52:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07897v1" updated="2026-06-05T23:16:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07904v1" updated="2026-06-05T23:47:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07907v1" updated="2026-06-06T00:10:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07909v1" updated="2026-06-06T00:15:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07915v1" updated="2026-06-06T00:37:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07916v1" updated="2026-06-06T00:43:13Z" repo_found_in="pdf" repo_urls="https://github.com/UofT-CIFAR/Synthetic-Evidence-Document-Corpus" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07923v1" updated="2026-06-06T01:16:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07924v1" updated="2026-06-06T01:17:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07929v1" updated="2026-06-06T01:39:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07936v1" updated="2026-06-06T01:55:56Z" repo_found_in="abstract" repo_urls="https://github.com/larchlab/Illusions-of-the-Gold-Standard" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07936v2" updated="2026-06-09T12:36:24Z" repo_found_in="abstract" repo_urls="https://github.com/larchlab/Illusions-of-the-Gold-Standard" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07943v1" updated="2026-06-06T02:10:03Z" repo_found_in="pdf" repo_urls="https://github.com/liofoil/Ski https://github.com/liofoil/SkillSafety" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07951v1" updated="2026-06-06T02:53:31Z" repo_found_in="pdf" repo_urls="https://github.com/UCIDataLab/ https://github.com/UCIDataLab/certainty-distortion-in-llm-rewriting" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07953v1" updated="2026-06-06T03:06:10Z" repo_found_in="abstract" repo_urls="https://github.com/hellozzk/MMIO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07954v1" updated="2026-06-06T03:16:19Z" repo_found_in="pdf" repo_urls="https://github.com/efficiency-learning/PartitionSEL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07963v1" updated="2026-06-06T03:41:44Z" repo_found_in="pdf" repo_urls="https://huggingface.co/meta-llama/ https://huggingface.co/meta-llama/Llama-Guard-3-8B" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07965v1" updated="2026-06-06T03:48:12Z" repo_found_in="pdf" repo_urls="https://github.com/hellozzk/MMIO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07968v1" updated="2026-06-06T03:52:27Z" repo_found_in="pdf" repo_urls="https://github.com/abidaziz1/recurguard" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07969v1" updated="2026-06-06T04:04:52Z" repo_found_in="pdf" repo_urls="https://github.com/imanif/animal-stories" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07970v1" updated="2026-06-06T04:04:57Z" repo_found_in="abstract" repo_urls="https://github.com/haomingwen/patcher" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07974v1" updated="2026-06-06T04:18:10Z" repo_found_in="pdf" repo_urls="https://yuhaiw.github.io/PRISM_web/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07988v1" updated="2026-06-06T05:35:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07992v1" updated="2026-06-06T06:07:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07996v1" updated="2026-06-06T06:27:54Z" repo_found_in="pdf" repo_urls="https://github.com/HaoooWang/llm-knowledge-cutoff-dates https://github.com/hiyouga/LLaMA-Factory/tree/main https://huggingface.co/models" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07998v1" updated="2026-06-06T06:33:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07998v2" updated="2026-06-09T07:24:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07999v1" updated="2026-06-06T06:33:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08000v1" updated="2026-06-06T06:38:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08011v1" updated="2026-06-06T07:00:44Z" repo_found_in="pdf" repo_urls="https://huggingface.co/google/gemma-4-31B-it" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08014v1" updated="2026-06-06T07:09:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08016v1" updated="2026-06-06T07:11:08Z" repo_found_in="comment" repo_urls="https://github.com/OpenDFM/Image_Edit_Agent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08018v1" updated="2026-06-06T07:14:53Z" repo_found_in="abstract" repo_urls="https://github.com/JerryGao818/UniQL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08020v1" updated="2026-06-06T07:17:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08021v1" updated="2026-06-06T07:31:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08027v1" updated="2026-06-06T07:40:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08030v1" updated="2026-06-06T07:46:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08034v1" updated="2026-06-06T07:51:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08036v1" updated="2026-06-06T07:56:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08037v1" updated="2026-06-06T07:57:22Z" repo_found_in="abstract" repo_urls="https://github.com/labhai/SafeECGMatch" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08044v1" updated="2026-06-06T08:10:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08046v1" updated="2026-06-06T08:18:21Z" repo_found_in="pdf" repo_urls="https://github.com/d-michail/osmgraphclip" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08049v1" updated="2026-06-06T08:27:18Z" repo_found_in="pdf" repo_urls="https://github.com/Am1n3e/skill-nb.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08051v1" updated="2026-06-06T08:32:38Z" repo_found_in="pdf" repo_urls="https://github.com/inflaton/gaime-slm.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08056v1" updated="2026-06-06T08:39:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08057v1" updated="2026-06-06T08:39:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08063v1" updated="2026-06-06T08:58:12Z" repo_found_in="abstract" repo_urls="https://github.com/jqtangust/Robust-U1" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08076v1" updated="2026-06-06T09:54:31Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/esradonmez/llm_persuasion https://huggingface.co/docs/ https://huggingface.co/docs/text-generation-inference" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08081v1" updated="2026-06-06T10:05:49Z" repo_found_in="pdf" repo_urls="https://github.com/diff94/tan_ana" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08090v1" updated="2026-06-06T10:32:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08093v1" updated="2026-06-06T10:36:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08094v1" updated="2026-06-06T10:45:40Z" repo_found_in="abstract" repo_urls="https://fai-modelopt-tech.github.io/vla-cpp.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08098v1" updated="2026-06-06T10:57:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08102v1" updated="2026-06-06T11:07:13Z" repo_found_in="comment" repo_urls="https://conquer-project.pages.dev/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08106v1" updated="2026-06-06T11:12:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08107v1" updated="2026-06-06T11:14:21Z" repo_found_in="abstract" repo_urls="https://egopipaper.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08122v1" updated="2026-06-06T11:56:44Z" repo_found_in="pdf" repo_urls="https://github.com/yuppielqx/Next-POI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08123v1" updated="2026-06-06T11:57:16Z" repo_found_in="comment" repo_urls="https://github.com/rubendflorezzela/hcbf-driver-monitoring" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08129v1" updated="2026-06-06T12:14:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08131v1" updated="2026-06-06T12:15:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08146v1" updated="2026-06-06T12:48:45Z" repo_found_in="abstract" repo_urls="https://github.com/yichenC1c/SAGE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08151v1" updated="2026-06-06T13:02:28Z" repo_found_in="comment" repo_urls="https://github.com/stephen-guan-researcher/CICL https://huggingface.co/XinyuGuan/CICL" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08153v1" updated="2026-06-06T13:08:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08156v1" updated="2026-06-06T13:13:02Z" repo_found_in="pdf" repo_urls="https://github.com/labhai/RAPID" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08158v1" updated="2026-06-06T13:14:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08167v1" updated="2026-06-06T13:31:38Z" repo_found_in="abstract" repo_urls="https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08168v1" updated="2026-06-06T13:31:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08169v1" updated="2026-06-06T13:33:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08172v1" updated="2026-06-06T13:36:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08191v1" updated="2026-06-06T14:21:55Z" repo_found_in="abstract" repo_urls="https://github.com/Kewei2023/AMPCliff/tree/FLaG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08194v1" updated="2026-06-06T14:24:05Z" repo_found_in="abstract" repo_urls="https://huggingface.co/datasets/iNLP-Lab/GlobeAudio" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08196v1" updated="2026-06-06T14:25:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08200v1" updated="2026-06-06T14:37:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08218v1" updated="2026-06-06T15:12:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08234v1" updated="2026-06-06T15:44:50Z" repo_found_in="abstract" repo_urls="https://opensciagent.github.io/SciTrace/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08239v1" updated="2026-06-06T15:51:25Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08247v1" updated="2026-06-06T16:11:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08251v1" updated="2026-06-06T16:39:28Z" repo_found_in="pdf" repo_urls="https://github.com/listar2000/science-reward-model" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08251v2" updated="2026-06-09T03:31:39Z" repo_found_in="pdf" repo_urls="https://github.com/listar2000/science-reward-model" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08256v1" updated="2026-06-06T17:01:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08267v1" updated="2026-06-06T17:27:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08270v1" updated="2026-06-06T17:33:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08272v1" updated="2026-06-06T17:37:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08275v1" updated="2026-06-06T17:44:23Z" repo_found_in="comment" repo_urls="https://github.com/jaineet17/causal-agent-replay" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08282v1" updated="2026-06-06T18:06:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08285v1" updated="2026-06-06T18:14:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08292v1" updated="2026-06-06T18:29:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08296v1" updated="2026-06-06T18:44:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08310v1" updated="2026-06-06T19:43:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08311v1" updated="2026-06-06T19:52:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08312v1" updated="2026-06-06T19:55:54Z" repo_found_in="pdf" repo_urls="https://github.com/ashkanans/nesy_rl_test" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08314v1" updated="2026-06-06T19:59:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08323v1" updated="2026-06-06T20:24:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08324v1" updated="2026-06-06T20:25:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08327v1" updated="2026-06-06T20:38:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08340v1" updated="2026-06-06T21:13:43Z" repo_found_in="abstract" repo_urls="https://github.com/alem-world/alem-env" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08360v1" updated="2026-06-06T22:31:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08364v1" updated="2026-06-06T22:42:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08365v1" updated="2026-06-06T22:57:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08367v1" updated="2026-06-06T22:59:27Z" repo_found_in="pdf" repo_urls="https://github.com/EmergenceAI/Emergence-World" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08369v1" updated="2026-06-06T23:06:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08376v1" updated="2026-06-07T00:06:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08379v1" updated="2026-06-07T00:20:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08381v1" updated="2026-06-07T00:20:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08382v1" updated="2026-06-07T00:24:02Z" repo_found_in="abstract" repo_urls="https://github.com/PriyanshBhatnagar/STAR-KV" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08400v1" updated="2026-06-07T01:31:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08402v1" updated="2026-06-07T01:38:39Z" repo_found_in="pdf" repo_urls="https://jhkim0759.github.io/projects/SceneConductor" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08403v1" updated="2026-06-07T01:41:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08405v1" updated="2026-06-07T01:59:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08408v1" updated="2026-06-07T02:13:17Z" repo_found_in="pdf" repo_urls="https://github.com/rayendito/ https://github.com/rayendito/dlm_steer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08410v1" updated="2026-06-07T02:19:59Z" repo_found_in="pdf" repo_urls="https://github.com/caolinfeng/MO-PQUCB" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08414v1" updated="2026-06-07T02:27:53Z" repo_found_in="pdf" repo_urls="https://ethan-iai.github.io/pact https://github.com/thu-ml/PACT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08415v1" updated="2026-06-07T02:29:41Z" repo_found_in="pdf" repo_urls="https://github.com/NJU-LINK/CoVEBench https://huggingface.co/datasets/NJU-LINK/CoVEBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08417v1" updated="2026-06-07T02:35:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08432v1" updated="2026-06-07T03:17:25Z" repo_found_in="abstract" repo_urls="https://github.com/louieworth/trd" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08433v1" updated="2026-06-07T03:20:03Z" repo_found_in="comment" repo_urls="https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08445v1" updated="2026-06-07T04:21:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08446v1" updated="2026-06-07T04:24:45Z" repo_found_in="pdf" repo_urls="https://github.com/Infini-AI-Lab/Sparrow https://infini-ai-lab.github.io/sparrow https://infini-ai-lab.github.io/sparrow_project_release/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08447v1" updated="2026-06-07T04:27:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08450v1" updated="2026-06-07T04:45:19Z" repo_found_in="comment" repo_urls="https://github.com/KAG778/GIFT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08451v1" updated="2026-06-07T04:50:40Z" repo_found_in="pdf" repo_urls="https://github.com/aryashah2k/Multilingual-Sycophancy https://huggingface.co/datasets/aryashah00/multilingual-sycophancy" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08467v1" updated="2026-06-07T06:16:58Z" repo_found_in="abstract" repo_urls="https://github.com/CaptainCuong/Graph-Calibration-Attack.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08471v1" updated="2026-06-07T06:27:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08476v1" updated="2026-06-07T06:45:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08477v1" updated="2026-06-07T06:47:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08480v1" updated="2026-06-07T06:51:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08481v1" updated="2026-06-07T06:53:09Z" repo_found_in="pdf" repo_urls="https://github.com/suraj-ranganath/PIPE-Cypher/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08483v1" updated="2026-06-07T07:01:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08484v1" updated="2026-06-07T07:05:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08491v1" updated="2026-06-07T07:30:34Z" repo_found_in="pdf" repo_urls="https://github.com/cy623/ https://github.com/cy623/Structural_Optimizer_RDL.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08492v1" updated="2026-06-07T07:34:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08497v1" updated="2026-06-07T07:54:49Z" repo_found_in="pdf" repo_urls="https://github.com/ggomaeng514/GLASS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08500v1" updated="2026-06-07T07:57:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08503v1" updated="2026-06-07T08:05:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08508v1" updated="2026-06-07T08:18:11Z" repo_found_in="comment" repo_urls="https://air-embodied-brain.github.io/actprobe" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08529v1" updated="2026-06-07T09:14:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08530v1" updated="2026-06-07T09:23:16Z" repo_found_in="abstract" repo_urls="https://github.com/babynabeauty/GEAR-VLA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08531v1" updated="2026-06-07T09:23:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08532v1" updated="2026-06-07T09:26:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08539v1" updated="2026-06-07T09:39:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08542v1" updated="2026-06-07T09:49:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08543v1" updated="2026-06-07T09:51:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08552v1" updated="2026-06-07T10:12:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08565v1" updated="2026-06-07T10:43:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08571v1" updated="2026-06-07T11:01:13Z" repo_found_in="pdf" repo_urls="https://github.com/SubramanyamSahoo/Calibration-of-Structured-Ignorance-Certificates-for-Diagnosing-Unknown-Unknowns-in-Reasoning-Models" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08590v1" updated="2026-06-07T12:05:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08596v1" updated="2026-06-07T12:20:32Z" repo_found_in="abstract" repo_urls="https://beiwenzhang.github.io/Co-pi-tree/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08601v1" updated="2026-06-07T12:27:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08602v1" updated="2026-06-07T12:28:51Z" repo_found_in="abstract" repo_urls="https://rpfey.github.io/rldt/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08610v1" updated="2026-06-07T12:49:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08630v1" updated="2026-06-07T13:50:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08633v1" updated="2026-06-07T13:55:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08644v1" updated="2026-06-07T14:16:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08649v1" updated="2026-06-07T14:28:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08652v1" updated="2026-06-07T14:41:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08653v1" updated="2026-06-07T14:41:22Z" repo_found_in="comment" repo_urls="https://fibertune.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08657v1" updated="2026-06-07T14:53:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08658v1" updated="2026-06-07T14:56:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08661v1" updated="2026-06-07T15:05:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08674v1" updated="2026-06-07T15:23:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08674v2" updated="2026-06-09T14:14:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08676v1" updated="2026-06-07T15:24:30Z" repo_found_in="comment" repo_urls="https://github.com/arkosioscambions/CodeTalkers" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08682v1" updated="2026-06-07T15:34:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08696v1" updated="2026-06-07T15:53:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08702v1" updated="2026-06-07T15:59:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08710v1" updated="2026-06-07T16:07:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08712v1" updated="2026-06-07T16:07:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08714v1" updated="2026-06-07T16:10:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08718v1" updated="2026-06-07T16:25:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08728v1" updated="2026-06-07T16:50:07Z" repo_found_in="abstract" repo_urls="https://github.com/Starscream-11813/awesome-AI4Math" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08735v1" updated="2026-06-07T17:08:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08761v1" updated="2026-06-07T18:01:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08769v1" updated="2026-06-07T18:16:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08770v1" updated="2026-06-07T18:20:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08775v1" updated="2026-06-07T18:39:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08777v1" updated="2026-06-07T18:45:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08790v1" updated="2026-06-07T19:12:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08791v1" updated="2026-06-07T19:16:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08793v1" updated="2026-06-07T19:18:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08797v1" updated="2026-06-07T19:27:59Z" repo_found_in="abstract" repo_urls="https://github.com/corail-research/DFL-LD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08800v1" updated="2026-06-07T19:39:03Z" repo_found_in="pdf" repo_urls="https://behavior-in-the-wild.github.io/fest.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08804v1" updated="2026-06-07T19:49:01Z" repo_found_in="pdf" repo_urls="https://github.com/psmiz/Q-Delta" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08806v1" updated="2026-06-07T19:53:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08814v1" updated="2026-06-07T20:07:24Z" repo_found_in="pdf" repo_urls="https://github.com/psmiz/STAR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08815v1" updated="2026-06-07T20:08:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08816v1" updated="2026-06-07T20:09:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08831v1" updated="2026-06-07T20:52:10Z" repo_found_in="pdf" repo_urls="https://github.com/tinattw/ITCR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08832v1" updated="2026-06-07T20:52:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08840v1" updated="2026-06-07T21:10:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08841v1" updated="2026-06-07T21:11:47Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Rapidata/datasets" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08847v1" updated="2026-06-07T21:32:50Z" repo_found_in="abstract" repo_urls="https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08849v1" updated="2026-06-07T21:42:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08850v1" updated="2026-06-07T21:43:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08854v1" updated="2026-06-07T21:47:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08855v1" updated="2026-06-07T21:50:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08858v1" updated="2026-06-07T22:18:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08875v1" updated="2026-06-07T23:04:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08881v1" updated="2026-06-07T23:39:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08893v1" updated="2026-06-08T00:35:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08896v1" updated="2026-06-08T00:47:58Z" repo_found_in="abstract" repo_urls="https://github.com/hit636/FAME" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08897v1" updated="2026-06-08T00:50:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08898v1" updated="2026-06-08T00:50:39Z" repo_found_in="abstract" repo_urls="https://github.com/cgq2971-afk/FCIAC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08904v1" updated="2026-06-08T01:10:07Z" repo_found_in="pdf" repo_urls="https://github.com/Explorermomo/OrderPlace" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08908v1" updated="2026-06-08T01:15:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08919v1" updated="2026-06-08T01:52:22Z" repo_found_in="comment" repo_urls="https://github.com/turangenesis/headroom" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08920v1" updated="2026-06-08T01:59:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08932v1" updated="2026-06-08T02:17:53Z" repo_found_in="pdf" repo_urls="https://github.com/AlexJJJChen/NormBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08935v1" updated="2026-06-08T02:24:40Z" repo_found_in="abstract" repo_urls="https://github.com/pantheon5100/PAI" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08936v1" updated="2026-06-08T02:31:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08938v1" updated="2026-06-08T02:34:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08948v1" updated="2026-06-08T02:50:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08952v1" updated="2026-06-08T02:55:05Z" repo_found_in="pdf" repo_urls="https://github.com/Heathcliff-saku/AlloSpatial" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08960v1" updated="2026-06-08T03:00:56Z" repo_found_in="pdf" repo_urls="https://github.com/ScalingIntelligence/KernelBench/blob/main/src/kernelbench/timing.py#L485 https://github.com/few-sh/harden-v0 https://github.com/few-sh/terminal-wrench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08969v1" updated="2026-06-08T03:10:45Z" repo_found_in="pdf" repo_urls="https://github.com/som-shahlab/CARE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08970v1" updated="2026-06-08T03:10:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08974v1" updated="2026-06-08T03:17:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08976v1" updated="2026-06-08T03:21:33Z" repo_found_in="pdf" repo_urls="https://github.com/hkust-zhiyao/RTL-BenchLS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08982v1" updated="2026-06-08T03:27:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08982v2" updated="2026-06-09T14:24:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08992v1" updated="2026-06-08T03:42:08Z" repo_found_in="pdf" repo_urls="https://charles-donne.github.io/SpaceVLN/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08998v1" updated="2026-06-08T03:53:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09004v1" updated="2026-06-08T04:03:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09006v1" updated="2026-06-08T04:09:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09012v1" updated="2026-06-08T04:21:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09019v1" updated="2026-06-08T04:32:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09027v1" updated="2026-06-08T04:53:01Z" repo_found_in="abstract" repo_urls="https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09028v1" updated="2026-06-08T04:53:10Z" repo_found_in="pdf" repo_urls="https://github.com/11isnotavailable/ATM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09030v1" updated="2026-06-08T04:53:44Z" repo_found_in="comment" repo_urls="https://github.com/HyeongWon-Jang/TRIAGE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09037v1" updated="2026-06-08T05:07:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09038v1" updated="2026-06-08T05:10:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09039v1" updated="2026-06-08T05:10:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09041v1" updated="2026-06-08T05:14:01Z" repo_found_in="pdf" repo_urls="https://github.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09048v1" updated="2026-06-08T05:36:42Z" repo_found_in="abstract" repo_urls="https://barewave.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09052v1" updated="2026-06-08T05:40:36Z" repo_found_in="abstract" repo_urls="https://github.com/FFishy-git/INFUSER" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09059v1" updated="2026-06-08T05:49:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09064v1" updated="2026-06-08T06:02:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09065v1" updated="2026-06-08T06:02:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09071v1" updated="2026-06-08T06:11:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09073v1" updated="2026-06-08T06:15:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09079v1" updated="2026-06-08T06:25:54Z" repo_found_in="comment" repo_urls="https://github.com/libertywing/FlashMemory-Deepseek-V4 https://huggingface.co/libertywing/FlashMemory-Deepseek-V4" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09079v2" updated="2026-06-09T17:04:45Z" repo_found_in="comment" repo_urls="https://github.com/libertywing/FlashMemory-Deepseek-V4 https://huggingface.co/libertywing/FlashMemory-Deepseek-V4" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09084v1" updated="2026-06-08T06:29:42Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/YukaiYang0803/Safe_Multi_Agents_System" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09086v1" updated="2026-06-08T06:32:49Z" repo_found_in="abstract" repo_urls="https://github.com/csjiezhao/DynaOD" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09090v1" updated="2026-06-08T06:36:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09104v1" updated="2026-06-08T06:58:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09105v1" updated="2026-06-08T06:58:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09105v2" updated="2026-06-09T09:03:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09112v1" updated="2026-06-08T07:04:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09117v1" updated="2026-06-08T07:11:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09118v1" updated="2026-06-08T07:11:56Z" repo_found_in="comment" repo_urls="https://gem-workshop.com/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09122v1" updated="2026-06-08T07:15:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09123v1" updated="2026-06-08T07:18:38Z" repo_found_in="abstract" repo_urls="https://github.com/HITlixian/TGRS_GSFF" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09124v1" updated="2026-06-08T07:18:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09125v1" updated="2026-06-08T07:19:42Z" repo_found_in="pdf" repo_urls="https://github.com/tiejin98/Privacy_ https://github.com/tiejin98/Privacy_Different_Task" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09131v1" updated="2026-06-08T07:28:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09132v1" updated="2026-06-08T07:30:20Z" repo_found_in="pdf" repo_urls="https://github.com/tiejin98/VLM_ https://github.com/tiejin98/VLM_Deidentification" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09134v1" updated="2026-06-08T07:32:06Z" repo_found_in="pdf" repo_urls="https://github.com/JTShuai/USD_2_KG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09135v1" updated="2026-06-08T07:32:44Z" repo_found_in="pdf" repo_urls="https://github.com/UzL-ITS/ https://github.com/UzL-ITS/Hidden-Communication-via-LLM-Seeds https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF https://huggingface.co/unsloth/Phi-4-mini-instruct-GGUF https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-GGUF" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09142v1" updated="2026-06-08T07:39:42Z" repo_found_in="pdf" repo_urls="https://github.com/danyayay/EgoCross-VLM.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09157v1" updated="2026-06-08T07:47:57Z" repo_found_in="pdf" repo_urls="https://huggingface.co/HannaAbiAkl/ https://huggingface.co/HannaAbiAkl/LOGIC-NL-CLIF-Flan-T5-Large https://huggingface.co/HannaAbiAkl/LOGIC-NL-Flan-T5-Large https://huggingface.co/google/flan-t5-large https://huggingface.co/google/flan-t5-small" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09159v1" updated="2026-06-08T07:50:12Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/BlueWhaleLab/Uni-E" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09160v1" updated="2026-06-08T07:51:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09165v1" updated="2026-06-08T08:02:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09169v1" updated="2026-06-08T08:08:20Z" repo_found_in="pdf" repo_urls="https://github.com/ccccEsion/IMUG-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09175v1" updated="2026-06-08T08:14:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09178v1" updated="2026-06-08T08:17:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09189v1" updated="2026-06-08T08:24:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09198v1" updated="2026-06-08T08:32:21Z" repo_found_in="pdf" repo_urls="https://github.com/tjunlp-lab/MASS_ https://github.com/tjunlp-lab/MASS_DeepResearch" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09200v1" updated="2026-06-08T08:33:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09227v1" updated="2026-06-08T09:02:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09234v1" updated="2026-06-08T09:07:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09236v1" updated="2026-06-08T09:14:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09243v1" updated="2026-06-08T09:20:03Z" repo_found_in="abstract" repo_urls="https://egotactile.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09245v1" updated="2026-06-08T09:20:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09257v1" updated="2026-06-08T09:30:34Z" repo_found_in="pdf" repo_urls="https://github.com/zadid6pretam/BSTabDiff" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09266v1" updated="2026-06-08T09:37:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09278v1" updated="2026-06-08T09:44:31Z" repo_found_in="abstract" repo_urls="https://github.com/Huawei-AI4Math/PyGeoX" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09311v1" updated="2026-06-08T10:16:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09315v1" updated="2026-06-08T10:19:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09316v1" updated="2026-06-08T10:24:32Z" repo_found_in="pdf" repo_urls="https://github.com/ECNU-ICALK/AutoSkill" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09316v2" updated="2026-06-09T09:39:52Z" repo_found_in="pdf" repo_urls="https://github.com/ECNU-ICALK/AutoSkill" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09323v1" updated="2026-06-08T10:39:25Z" repo_found_in="abstract" repo_urls="https://github.com/LOGO-CUHKSZ/TRL-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09327v1" updated="2026-06-08T10:47:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09331v1" updated="2026-06-08T10:54:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09343v1" updated="2026-06-08T11:08:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09353v1" updated="2026-06-08T11:27:11Z" repo_found_in="pdf" repo_urls="https://github.com/GuillaumeMougeot/DogFaceNet/releases/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09365v1" updated="2026-06-08T11:37:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09368v1" updated="2026-06-08T11:40:48Z" repo_found_in="abstract" repo_urls="https://github.com/ZMH-SDUST/PhysScene" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09371v1" updated="2026-06-08T11:48:55Z" repo_found_in="pdf" repo_urls="https://github.com/al123123123-123/htl" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09377v1" updated="2026-06-08T11:56:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09377v2" updated="2026-06-09T11:54:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09380v1" updated="2026-06-08T11:57:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09390v1" updated="2026-06-08T12:05:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09392v1" updated="2026-06-08T12:07:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09399v1" updated="2026-06-08T12:18:22Z" repo_found_in="pdf" repo_urls="https://github.com/runagent-dev/runagent-superbrowser" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09404v1" updated="2026-06-08T12:22:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09408v1" updated="2026-06-08T12:25:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09409v1" updated="2026-06-08T12:26:54Z" repo_found_in="pdf" repo_urls="https://github.com/socialfoundations/correct-looks-better" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09410v1" updated="2026-06-08T12:26:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09414v1" updated="2026-06-08T12:28:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09416v1" updated="2026-06-08T12:29:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09419v1" updated="2026-06-08T12:36:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09426v1" updated="2026-06-08T12:39:23Z" repo_found_in="pdf" repo_urls="https://weavebench.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09430v1" updated="2026-06-08T12:41:15Z" repo_found_in="pdf" repo_urls="https://github.com/Agentic-Intelligence-Lab/LargeMonitor" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09433v1" updated="2026-06-08T12:44:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09441v1" updated="2026-06-08T12:50:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09447v1" updated="2026-06-08T12:55:42Z" repo_found_in="pdf" repo_urls="https://github.com/AlibabaResearch/aliyun-console-agent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09450v1" updated="2026-06-08T12:57:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09470v1" updated="2026-06-08T13:27:40Z" repo_found_in="pdf" repo_urls="https://github.com/Aditya3107/speechllm-l2-assessment" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09475v1" updated="2026-06-08T13:30:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09475v2" updated="2026-06-09T13:08:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09483v1" updated="2026-06-08T13:39:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09489v1" updated="2026-06-08T13:44:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09499v1" updated="2026-06-08T13:50:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09500v1" updated="2026-06-08T13:51:04Z" repo_found_in="comment" repo_urls="https://github.com/Aperivue/medsci-skills" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09500v2" updated="2026-06-09T14:24:13Z" repo_found_in="comment" repo_urls="https://github.com/Aperivue/medsci-skills" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09508v1" updated="2026-06-08T14:02:18Z" repo_found_in="abstract" repo_urls="https://github.com/SHA-4096/EntropyInfer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09520v1" updated="2026-06-08T14:09:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09525v1" updated="2026-06-08T14:11:07Z" repo_found_in="pdf" repo_urls="https://allenai.github.io/open-instruct/ https://github.com/copenlu/context-characteristics-sensitivity" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09548v1" updated="2026-06-08T14:27:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09549v1" updated="2026-06-08T14:29:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09551v1" updated="2026-06-08T14:30:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09556v1" updated="2026-06-08T14:31:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09559v1" updated="2026-06-08T14:33:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09563v1" updated="2026-06-08T14:37:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09568v1" updated="2026-06-08T14:42:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09572v1" updated="2026-06-08T14:46:43Z" repo_found_in="pdf" repo_urls="https://embodied-ai-research.github.io/ct-vam/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09578v1" updated="2026-06-08T14:52:46Z" repo_found_in="pdf" repo_urls="https://github.com/mbzuai-nlp/TABVERSE https://huggingface.co/datasets/MBZUAI/TABVERSE https://mbzuai-nlp.github.io/TABVERSE/ https://mbzuai-nlp.github.io/TABVERSE/leaderboard.html" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09585v1" updated="2026-06-08T14:58:59Z" repo_found_in="pdf" repo_urls="https://github.com/ModalityDance/Optical-Reasoning https://github.com/python-pillow/Pillow" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09587v1" updated="2026-06-08T14:59:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09589v1" updated="2026-06-08T15:01:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09605v1" updated="2026-06-08T15:13:57Z" repo_found_in="pdf" repo_urls="https://github.com/joncarter1/hypnos" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09607v1" updated="2026-06-08T15:17:54Z" repo_found_in="pdf" repo_urls="https://github.com/skydancerosel/coactivation-closure" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09610v1" updated="2026-06-08T15:18:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09613v1" updated="2026-06-08T15:20:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09617v1" updated="2026-06-08T15:22:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09630v1" updated="2026-06-08T15:29:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09634v1" updated="2026-06-08T15:32:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09643v1" updated="2026-06-08T15:38:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09646v1" updated="2026-06-08T15:40:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09648v1" updated="2026-06-08T15:40:57Z" repo_found_in="pdf" repo_urls="https://olgaovcharenko.github.io/ArtiFact/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09658v1" updated="2026-06-08T15:42:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09659v1" updated="2026-06-08T15:43:16Z" repo_found_in="pdf" repo_urls="https://github.com/LeonLixyz/LCLM https://github.com/NVIDIA/kvpress/blob/main/notebooks/speed_and_memory.ipynb https://huggingface.co/latent-context" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09663v1" updated="2026-06-08T15:45:15Z" repo_found_in="comment" repo_urls="https://github.com/DunLi-Tsinghua/MetaAI-Mini" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09666v1" updated="2026-06-08T15:49:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09669v1" updated="2026-06-08T15:51:51Z" repo_found_in="pdf" repo_urls="https://spatial-world.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09670v1" updated="2026-06-08T15:52:05Z" repo_found_in="pdf" repo_urls="https://github.com/openvinotoolkit/anomalib" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09671v1" updated="2026-06-08T15:54:10Z" repo_found_in="pdf" repo_urls="https://github.com/YilinZhang00/Digital-Twin-for-AD.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09672v1" updated="2026-06-08T15:54:28Z" repo_found_in="pdf" repo_urls="https://github.com/dotsin/lbm-benchmarking-embeddingsFT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09674v1" updated="2026-06-08T15:57:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09677v1" updated="2026-06-08T15:58:31Z" repo_found_in="pdf" repo_urls="https://github.com/rlaehghks5/MECO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09677v2" updated="2026-06-09T08:08:09Z" repo_found_in="pdf" repo_urls="https://github.com/rlaehghks5/MECO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09686v1" updated="2026-06-08T16:04:15Z" repo_found_in="comment" repo_urls="https://github.com/gHashTag/paper3-methodology" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09692v1" updated="2026-06-08T16:10:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09701v1" updated="2026-06-08T16:21:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09711v1" updated="2026-06-08T16:32:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09724v1" updated="2026-06-08T16:46:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09730v1" updated="2026-06-08T16:52:26Z" repo_found_in="pdf" repo_urls="https://search-swarm.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09746v1" updated="2026-06-08T17:06:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09748v1" updated="2026-06-08T17:08:36Z" repo_found_in="abstract" repo_urls="https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09751v1" updated="2026-06-08T17:11:42Z" repo_found_in="abstract" repo_urls="https://github.com/BrightbeamAI/chap" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09758v1" updated="2026-06-08T17:18:19Z" repo_found_in="comment" repo_urls="https://weirdlabuw.github.io/darp-site/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09762v1" updated="2026-06-08T17:24:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09767v1" updated="2026-06-08T17:29:08Z" repo_found_in="pdf" repo_urls="https://github.com/achulzhanov/mayan-mt5" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09774v1" updated="2026-06-08T17:35:17Z" repo_found_in="pdf" repo_urls="https://github.com/GEOS-DEV/GEOS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09778v1" updated="2026-06-08T17:39:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09800v1" updated="2026-06-08T17:53:05Z" repo_found_in="pdf" repo_urls="https://github.com/corvolin/CSE4AgenticSoftDev" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09802v1" updated="2026-06-08T17:53:29Z" repo_found_in="pdf" repo_urls="https://github.com/riiswa/context_drift_lin_bandits/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09806v1" updated="2026-06-08T17:54:33Z" repo_found_in="abstract" repo_urls="https://circle-group.github.io/research/TNO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09809v1" updated="2026-06-08T17:55:02Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/cfahlgren1/hub-stats" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09809v2" updated="2026-06-09T17:58:20Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/cfahlgren1/hub-stats" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09811v1" updated="2026-06-08T17:55:18Z" repo_found_in="comment" repo_urls="https://serene-sivy.github.io/aha-wam/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09816v1" updated="2026-06-08T17:56:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09825v1" updated="2026-06-08T17:59:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09826v1" updated="2026-06-08T17:59:43Z" repo_found_in="pdf" repo_urls="https://mxlin043.github.io/OmniGameArena/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09871v1" updated="2026-06-02T07:50:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09872v1" updated="2026-06-02T09:53:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09873v1" updated="2026-06-02T18:05:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09875v1" updated="2026-06-02T20:57:14Z" repo_found_in="pdf" repo_urls="https://github.com/qcri/GLU.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09883v1" updated="2026-06-03T06:40:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09884v1" updated="2026-06-03T06:49:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09886v1" updated="2026-06-03T08:41:48Z" repo_found_in="abstract" repo_urls="https://github.com/Alizen-1009/Shapley-Moe" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09887v1" updated="2026-06-03T09:08:29Z" repo_found_in="pdf" repo_urls="https://github.com/Liuz-rui/SocraticPO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09890v1" updated="2026-06-03T13:26:41Z" repo_found_in="pdf" repo_urls="https://github.com/oyarsa/preact-bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09891v1" updated="2026-06-03T19:02:19Z" repo_found_in="pdf" repo_urls="https://github.com/sinaBaharlouei/CurriculumRepresentation" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09893v1" updated="2026-06-03T19:45:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09896v1" updated="2026-06-04T22:50:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09899v1" updated="2026-06-05T05:20:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09900v1" updated="2026-06-05T11:43:56Z" repo_found_in="comment" repo_urls="https://github.com/ly-wang19/engram" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09902v1" updated="2026-06-05T13:59:23Z" repo_found_in="pdf" repo_urls="https://github.com/Anmol2059/connectome-reservoir-optimisation" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09907v1" updated="2026-06-06T07:50:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09908v1" updated="2026-06-06T07:59:08Z" repo_found_in="abstract" repo_urls="https://github.com/tisl-lab/Interdependent_Privacy_Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09909v1" updated="2026-06-06T07:59:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09912v1" updated="2026-06-06T12:10:38Z" repo_found_in="pdf" repo_urls="https://github.com/birla-ai-labs/mix-dont-pick" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09916v1" updated="2026-06-06T15:54:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09919v1" updated="2026-06-07T00:02:21Z" repo_found_in="comment" repo_urls="https://co-glance.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09922v1" updated="2026-06-07T04:18:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09923v1" updated="2026-06-07T08:52:38Z" repo_found_in="pdf" repo_urls="https://github.com/physai/physai" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09924v1" updated="2026-06-07T09:19:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09926v1" updated="2026-06-07T14:06:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09927v1" updated="2026-06-07T14:17:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09928v1" updated="2026-06-07T14:24:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09929v1" updated="2026-06-07T14:46:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09931v1" updated="2026-06-07T16:36:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09932v1" updated="2026-06-07T17:58:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09935v1" updated="2026-06-07T19:17:18Z" repo_found_in="pdf" repo_urls="https://github.com/ceferisbarov/GitInject" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09936v1" updated="2026-06-07T19:27:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09937v1" updated="2026-06-07T21:36:20Z" repo_found_in="abstract" repo_urls="https://github.com/AnirudhSekar/RKSC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09940v1" updated="2026-06-08T00:15:44Z" repo_found_in="abstract" repo_urls="https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09942v1" updated="2026-06-08T00:50:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09944v1" updated="2026-06-08T03:30:20Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09949v1" updated="2026-06-08T08:25:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09953v1" updated="2026-06-08T10:45:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09954v1" updated="2026-06-08T10:51:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09958v1" updated="2026-06-08T12:04:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09959v1" updated="2026-06-08T12:28:28Z" repo_found_in="abstract" repo_urls="https://github.com/gijsvn/TA-SmaAt-UNet" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09960v1" updated="2026-06-08T12:52:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09961v1" updated="2026-06-08T14:26:05Z" repo_found_in="abstract" repo_urls="https://github.com/genalyu/3SPO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09962v1" updated="2026-06-08T14:41:24Z" repo_found_in="pdf" repo_urls="https://github.com/FunAudioLLM/CosyVoice https://github.com/SWivid/F5-TTS https://github.com/li1jkdaw/CDCD-TTS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09963v1" updated="2026-06-08T14:45:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10010v1" updated="2026-06-08T18:01:20Z" repo_found_in="pdf" repo_urls="https://github.com/JethroWangSir/DeRA-MOS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10019v1" updated="2026-06-08T18:07:11Z" repo_found_in="pdf" repo_urls="https://github.com/ToyotaResearchInstitute/gcvo.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10029v1" updated="2026-06-08T18:09:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10044v1" updated="2026-06-08T18:16:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10046v1" updated="2026-06-08T18:18:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10062v1" updated="2026-06-08T18:38:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10064v1" updated="2026-06-08T18:39:15Z" repo_found_in="comment" repo_urls="https://huggingface.co/collections/oro-ai/shoppingbench-sn15-trajectory-primitive" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10066v1" updated="2026-06-08T18:40:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10068v1" updated="2026-06-08T18:42:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10071v1" updated="2026-06-08T18:47:31Z" repo_found_in="pdf" repo_urls="https://github.com/TanzilaKhan1/TSNN-Code" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10080v1" updated="2026-06-08T18:54:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10084v1" updated="2026-06-08T19:08:22Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10086v1" updated="2026-06-08T19:15:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10089v1" updated="2026-06-08T19:16:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10094v1" updated="2026-06-08T19:18:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10099v1" updated="2026-06-08T19:28:52Z" repo_found_in="pdf" repo_urls="https://github.com/rrivera1849/lusr" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10106v1" updated="2026-06-08T19:35:37Z" repo_found_in="pdf" repo_urls="https://github.com/agno-agi/agno https://github.com/openai/codex https://huggingface.co/blog/agent-glossary" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10112v1" updated="2026-06-08T19:42:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10113v1" updated="2026-06-08T19:46:00Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Lsthf/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10120v1" updated="2026-06-08T19:52:08Z" repo_found_in="pdf" repo_urls="https://github.com/Arefeen06088/MetaPlate" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10124v1" updated="2026-06-08T19:55:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10126v1" updated="2026-06-08T19:57:13Z" repo_found_in="pdf" repo_urls="https://github.com/tunazislam/Pareto-Guided-Teacher-Alignment https://huggingface.co/SamLowe/roberta-base-go_emotions" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10135v1" updated="2026-06-08T20:08:41Z" repo_found_in="pdf" repo_urls="https://github.com/LynnReal-AI/BiWM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10147v1" updated="2026-06-08T20:26:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10156v1" updated="2026-06-08T20:35:45Z" repo_found_in="abstract" repo_urls="https://github.com/nbharaths/tau-rec" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10159v1" updated="2026-06-08T20:38:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10173v1" updated="2026-06-08T21:07:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10180v1" updated="2026-06-08T21:16:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10183v1" updated="2026-06-08T21:21:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10184v1" updated="2026-06-08T21:21:42Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10194v1" updated="2026-06-08T21:30:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10196v1" updated="2026-06-08T21:35:11Z" repo_found_in="abstract" repo_urls="https://github.com/AtlasAnalyticsLab/FisherAdapTune" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10197v1" updated="2026-06-08T21:35:52Z" repo_found_in="pdf" repo_urls="https://github.com/harrypenguin/MultiModal" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10198v1" updated="2026-06-08T21:36:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10200v1" updated="2026-06-08T21:42:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10208v1" updated="2026-06-08T22:00:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10209v1" updated="2026-06-08T22:01:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10213v1" updated="2026-06-08T22:07:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10216v1" updated="2026-06-08T22:13:42Z" repo_found_in="pdf" repo_urls="https://gitlab.com/adaptdata" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10219v1" updated="2026-06-08T22:17:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10223v1" updated="2026-06-08T22:22:48Z" repo_found_in="pdf" repo_urls="https://github.com/piotrkawa/audio-deepfake-source-tracing" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10228v1" updated="2026-06-08T22:40:45Z" repo_found_in="pdf" repo_urls="https://github.com/montrealrobotics/shapo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10237v1" updated="2026-06-08T22:51:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10238v1" updated="2026-06-08T22:57:39Z" repo_found_in="pdf" repo_urls="https://github.com/Fitzgerald-Lab-NU/hyperbolic-tuning-curve" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10241v1" updated="2026-06-08T23:04:35Z" repo_found_in="comment" repo_urls="https://github.com/yoheinakajima/regimes" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10244v1" updated="2026-06-08T23:21:14Z" repo_found_in="comment" repo_urls="https://yubi.airoa.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10246v1" updated="2026-06-08T23:26:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10250v1" updated="2026-06-08T23:36:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10254v1" updated="2026-06-08T23:40:34Z" repo_found_in="comment" repo_urls="https://github.com/RicharMd/RealMath-Eval https://huggingface.co/datasets/RicharMd/RealMath-Eval" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10267v1" updated="2026-06-09T00:24:00Z" repo_found_in="abstract" repo_urls="https://jiahenghu.github.io/hi-vla" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10276v1" updated="2026-06-09T00:50:29Z" repo_found_in="comment" repo_urls="https://project-edith.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10278v1" updated="2026-06-09T00:59:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10279v1" updated="2026-06-09T01:00:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10286v1" updated="2026-06-09T01:20:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10294v1" updated="2026-06-09T01:32:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10296v1" updated="2026-06-09T01:33:15Z" repo_found_in="pdf" repo_urls="https://github.com/openai/grade-school-math" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10298v1" updated="2026-06-09T01:34:13Z" repo_found_in="abstract" repo_urls="https://github.com/keith-Jiang/conflict-aware-decoding" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10299v1" updated="2026-06-09T01:34:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10314v1" updated="2026-06-09T02:09:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10315v1" updated="2026-06-09T02:11:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10321v1" updated="2026-06-09T02:18:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10328v1" updated="2026-06-09T02:23:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10329v1" updated="2026-06-09T02:24:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10334v1" updated="2026-06-09T02:28:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10338v1" updated="2026-06-09T02:33:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10346v1" updated="2026-06-09T02:55:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10357v1" updated="2026-06-09T03:13:52Z" repo_found_in="pdf" repo_urls="https://huggingface.co/BAAI/bge-m3 https://huggingface.co/Qwen/Qwen3-4B" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10358v1" updated="2026-06-09T03:14:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10359v1" updated="2026-06-09T03:18:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10366v1" updated="2026-06-09T03:25:02Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10368v1" updated="2026-06-09T03:27:30Z" repo_found_in="abstract" repo_urls="https://github.com/Sslnon/ELF-S2T" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10371v1" updated="2026-06-09T03:31:09Z" repo_found_in="abstract" repo_urls="https://tako-attack.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10376v1" updated="2026-06-09T03:38:53Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10380v1" updated="2026-06-09T03:42:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10381v1" updated="2026-06-09T03:42:50Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10384v1" updated="2026-06-09T03:49:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10385v1" updated="2026-06-09T03:51:41Z" repo_found_in="comment" repo_urls="https://vanhowe.github.io/AR-OPD/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10388v1" updated="2026-06-09T03:54:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10389v1" updated="2026-06-09T03:55:31Z" repo_found_in="abstract" repo_urls="https://github.com/1xiangliu1/FAMOU-CoEvo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10392v1" updated="2026-06-09T04:14:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10394v1" updated="2026-06-09T04:16:35Z" repo_found_in="pdf" repo_urls="https://github.com/LiangThree/STAGE-Claw.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10402v1" updated="2026-06-09T04:25:28Z" repo_found_in="pdf" repo_urls="https://github.com/google-deepmind/alphaevolve_repository_of_problems https://github.com/vinid/einstein-arena" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10406v1" updated="2026-06-09T04:30:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10412v1" updated="2026-06-09T04:38:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10413v1" updated="2026-06-09T04:39:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10431v1" updated="2026-06-09T05:15:25Z" repo_found_in="pdf" repo_urls="https://github.com/gshuangchun/VaFM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10448v1" updated="2026-06-09T05:55:06Z" repo_found_in="comment" repo_urls="https://github.com/ZeyuLIU-UST/FPQC-SAC-main" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10456v1" updated="2026-06-09T06:04:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10457v1" updated="2026-06-09T06:05:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10458v1" updated="2026-06-09T06:06:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10460v1" updated="2026-06-09T06:15:36Z" repo_found_in="pdf" repo_urls="https://github.com https://github.com/lakeagent/datalake-qa https://lakeqa-bench.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10461v1" updated="2026-06-09T06:16:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10466v1" updated="2026-06-09T06:36:06Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10471v1" updated="2026-06-09T06:39:31Z" repo_found_in="pdf" repo_urls="https://github.com/boonjiashen/speculative-language-recognizer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10475v1" updated="2026-06-09T06:43:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10479v1" updated="2026-06-09T06:50:15Z" repo_found_in="comment" repo_urls="https://simplified-reasoning.github.io/ComBench/docs/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10481v1" updated="2026-06-09T06:50:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10487v1" updated="2026-06-09T07:01:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10489v1" updated="2026-06-09T07:07:48Z" repo_found_in="pdf" repo_urls="https://github.com/aibasel/ https://github.com/aibasel/downward-benchmarks" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10493v1" updated="2026-06-09T07:17:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10499v1" updated="2026-06-09T07:23:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10500v1" updated="2026-06-09T07:24:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10504v1" updated="2026-06-09T07:29:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10507v1" updated="2026-06-09T07:35:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10525v1" updated="2026-06-09T07:54:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10530v1" updated="2026-06-09T08:01:11Z" repo_found_in="pdf" repo_urls="https://github.com/sk2299/neuro-ai-survey" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10531v1" updated="2026-06-09T08:02:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10532v1" updated="2026-06-09T08:03:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10543v1" updated="2026-06-09T08:11:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10554v1" updated="2026-06-09T08:21:56Z" repo_found_in="pdf" repo_urls="https://github.com/dice-group/Benchmarking-KE https://github.com/dice-group/Benchmarking-KE/tree/main https://github.com/dig-team/amie https://github.com/kmeng01/rome" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10569v1" updated="2026-06-09T08:32:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10571v1" updated="2026-06-09T08:34:01Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10572v1" updated="2026-06-09T08:36:08Z" repo_found_in="abstract" repo_urls="https://github.com/zz1358m/Latent-Memory-Master" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10580v1" updated="2026-06-09T08:45:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10582v1" updated="2026-06-09T08:45:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10583v1" updated="2026-06-09T08:47:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10587v1" updated="2026-06-09T08:52:49Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/zoom-wang112358/EvoDiverse" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10595v1" updated="2026-06-09T09:00:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10596v1" updated="2026-06-09T09:01:18Z" repo_found_in="pdf" repo_urls="https://github.com/SangliTeng/ https://github.com/SangliTeng/Continuous-Hybrid-System-Learning" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10601v1" updated="2026-06-09T09:04:32Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10607v1" updated="2026-06-09T09:09:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10613v1" updated="2026-06-09T09:12:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10616v1" updated="2026-06-09T09:15:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10620v1" updated="2026-06-09T09:17:55Z" repo_found_in="pdf" repo_urls="https://github.com/gintmr/ImagineTime https://huggingface.co/datasets/Xin-Rui/ImageTime_Benchmark" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10621v1" updated="2026-06-09T09:18:09Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/opensearch-project/msmarco-hard-negatives-llm-scores" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10632v1" updated="2026-06-09T09:36:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10650v1" updated="2026-06-09T09:57:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10658v1" updated="2026-06-09T10:06:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10660v1" updated="2026-06-09T10:08:36Z" repo_found_in="comment" repo_urls="https://doi.org/10.5281/zenodo.20443586" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10662v1" updated="2026-06-09T10:13:07Z" repo_found_in="abstract" repo_urls="https://yuzhenmao.github.io/DeLM/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10669v1" updated="2026-06-09T10:19:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10677v1" updated="2026-06-09T10:31:51Z" repo_found_in="pdf" repo_urls="https://github.com/infinigence/Infini-Memory" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10683v1" updated="2026-06-09T10:38:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10684v1" updated="2026-06-09T10:40:55Z" repo_found_in="pdf" repo_urls="https://huggingface.co/PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-7b-it-em-grpo-v0.2 https://huggingface.co/PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-7b-it-em-grpo-v0.3 https://huggingface.co/PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-7b-it-em-ppo-v0.2" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10699v1" updated="2026-06-09T10:59:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10705v1" updated="2026-06-09T11:08:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10706v1" updated="2026-06-09T11:09:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10709v1" updated="2026-06-09T11:12:58Z" repo_found_in="pdf" repo_urls="https://github.com/cxcscmu/agentic_search_query_recycling" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10713v1" updated="2026-06-09T11:19:09Z" repo_found_in="abstract" repo_urls="https://github.com/sofia-adelie/plusplusnnunet.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10716v1" updated="2026-06-09T11:24:07Z" repo_found_in="pdf" repo_urls="https://github.com/RobertoMCA/attention-expansion https://huggingface.co/allenai/scibert_scivocab_uncased https://huggingface.co/answerdotai/ModernBERT-base https://huggingface.co/bloomberg/KBIR https://huggingface.co/datasets/midas/duc2001 https://huggingface.co/datasets/midas/inspec https://huggingface.co/datasets/midas/ldkp3k https://huggingface.co/datasets/midas/nus https://huggingface.co/datasets/midas/semeval2010 https://huggingface.co/distilbert/distilbert-base-uncased https://huggingface.co/microsoft/deberta-v3-base https://huggingface.co/tasks/token-classification" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10718v1" updated="2026-06-09T11:25:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10736v1" updated="2026-06-09T11:47:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10738v1" updated="2026-06-09T11:50:06Z" repo_found_in="abstract" repo_urls="https://github.com/dieKarotte/Spatial-Omni" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10740v1" updated="2026-06-09T11:50:28Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507 https://huggingface.co/datasets/UVSKKR/when-cot-knows-better https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B https://huggingface.co/microsoft/Phi-4-mini-reasoning" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10747v1" updated="2026-06-09T11:57:02Z" repo_found_in="abstract" repo_urls="https://github.com/aisilab/arbiter" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10749v1" updated="2026-06-09T12:01:07Z" repo_found_in="pdf" repo_urls="https://ling-yuchen.github.io/LLMAgentSecuritySurvey/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10752v1" updated="2026-06-09T12:02:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10782v1" updated="2026-06-09T12:35:25Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10787v1" updated="2026-06-09T12:37:43Z" repo_found_in="pdf" repo_urls="https://github.com/azreasoners/NeurASP" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10794v1" updated="2026-06-09T12:43:49Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10796v1" updated="2026-06-09T12:44:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10799v1" updated="2026-06-09T12:46:35Z" repo_found_in="pdf" repo_urls="https://github.com/celainica/A-lightweight-natural-language-proof-verification-agent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10802v1" updated="2026-06-09T12:48:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10803v1" updated="2026-06-09T12:49:11Z" repo_found_in="pdf" repo_urls="https://github.com/ModalityDance/PhysTool-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10806v1" updated="2026-06-09T12:50:56Z" repo_found_in="pdf" repo_urls="https://github.com/DeepMathLLM/Moonshine" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10819v1" updated="2026-06-09T13:01:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10820v1" updated="2026-06-09T13:02:00Z" repo_found_in="pdf" repo_urls="https://github.com/alibaba-damo-academy/K-Forcing" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10827v1" updated="2026-06-09T13:16:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10829v1" updated="2026-06-09T13:17:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10833v1" updated="2026-06-09T13:20:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10835v1" updated="2026-06-09T13:24:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10852v1" updated="2026-06-09T13:31:13Z" repo_found_in="pdf" repo_urls="https://github.com/PolydorosG/Janus" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10861v1" updated="2026-06-09T13:39:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10862v1" updated="2026-06-09T13:39:49Z" repo_found_in="abstract" repo_urls="https://github.com/litsh/Libero-Occ" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10881v1" updated="2026-06-09T13:54:31Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10890v1" updated="2026-06-09T14:03:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10892v1" updated="2026-06-09T14:04:51Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/black-forest-labs/flux" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10902v1" updated="2026-06-09T14:13:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10906v1" updated="2026-06-09T14:14:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10908v1" updated="2026-06-09T14:20:05Z" repo_found_in="pdf" repo_urls="https://github.com/Security-FIT/RAT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10911v1" updated="2026-06-09T14:20:55Z" repo_found_in="pdf" repo_urls="https://security-fit.github.io/deepfake_speech_datasets_app/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10912v1" updated="2026-06-09T14:21:45Z" repo_found_in="pdf" repo_urls="https://github.com/Security-FIT/IG_for_SSL_detectors" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10917v1" updated="2026-06-09T14:28:07Z" repo_found_in="pdf" repo_urls="https://github.com/AMAP-ML/roleagent" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10928v1" updated="2026-06-09T14:38:11Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10929v1" updated="2026-06-09T14:38:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10933v1" updated="2026-06-09T14:44:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10934v1" updated="2026-06-09T14:44:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10935v1" updated="2026-06-09T14:45:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10937v1" updated="2026-06-09T14:46:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10940v1" updated="2026-06-09T14:47:24Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10942v1" updated="2026-06-09T14:48:26Z" repo_found_in="pdf" repo_urls="https://github.com/kiarashRezaei/llm-for-xai-qotEstimation" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10949v1" updated="2026-06-09T14:53:32Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/Percena/locomo-mc10" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10953v1" updated="2026-06-09T14:55:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10956v1" updated="2026-06-09T14:59:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10968v1" updated="2026-06-09T15:09:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10972v1" updated="2026-06-09T15:12:58Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10979v1" updated="2026-06-09T15:15:21Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10989v1" updated="2026-06-09T15:26:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11007v1" updated="2026-06-09T15:41:48Z" repo_found_in="pdf" repo_urls="https://github.com/axios/axios/issues/10604 https://github.com/openclaw/openclaw/commits/main/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11015v1" updated="2026-06-09T15:53:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11016v1" updated="2026-06-09T15:54:35Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11019v1" updated="2026-06-09T15:56:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11033v1" updated="2026-06-09T16:05:23Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/WillHeld/SD-QA https://huggingface.co/datasets/yijingwu/HeySQuAD_human" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11042v1" updated="2026-06-09T16:10:16Z" repo_found_in="pdf" repo_urls="https://workflow-gym.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11045v1" updated="2026-06-09T16:12:55Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11063v1" updated="2026-06-09T16:24:16Z" repo_found_in="pdf" repo_urls="https://github.com/JoachimSchaeffer/CIAware-Bench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11070v1" updated="2026-06-09T16:32:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11074v1" updated="2026-06-09T16:34:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11078v1" updated="2026-06-09T16:39:10Z" repo_found_in="comment" repo_urls="https://github.com/G-JWLee/HiViG" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11081v1" updated="2026-06-09T16:40:54Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11087v1" updated="2026-06-09T16:45:57Z" repo_found_in="pdf" repo_urls="https://github.com/zhouzypaul/qgf" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11092v1" updated="2026-06-09T16:51:13Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11105v1" updated="2026-06-09T17:03:19Z" repo_found_in="pdf" repo_urls="https://huggingface.co/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11106v1" updated="2026-06-09T17:03:37Z" repo_found_in="abstract" repo_urls="https://github.com/mahmoodphd/FADA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11116v1" updated="2026-06-09T17:13:40Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11117v1" updated="2026-06-09T17:14:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11119v1" updated="2026-06-09T17:16:03Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11120v1" updated="2026-06-09T17:16:30Z" repo_found_in="pdf" repo_urls="https://github.com/andrewkang12345/monteCarloPassSearch" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11127v1" updated="2026-06-09T17:24:27Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/Lexsi/provenance-grounded-synthetic-qa https://huggingface.co/datasets/wikimedia/wikipedia" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11140v1" updated="2026-06-09T17:29:47Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11150v1" updated="2026-06-09T17:35:37Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11164v1" updated="2026-06-09T17:44:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11166v1" updated="2026-06-09T17:46:10Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11169v1" updated="2026-06-09T17:48:41Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11173v1" updated="2026-06-09T17:50:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11182v1" updated="2026-06-09T17:57:16Z" repo_found_in="pdf" repo_urls="https://github.com/Princeton-AI2-Lab/EEVEE https://princeton-ai2-lab.github.io/EEVEE/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.11189v1" updated="2026-06-09T17:59:54Z" repo_found_in="pdf" repo_urls="https://txie1.github.io/Target-SFT/" />
  </code-available-feed:processed>
</feed>