SE 论文推荐 2026-03-20

基于主人在 Zotero 的研究画像自动推荐

主人研究画像：软件供应链安全（package-dashboard 16 篇、python deps issues 8 篇、安全标签 18 篇）、开源软件生态（OSS-Lab 6 篇）、SE4AI（SE4AI collection）

数据来源：arXiv cs.SE 最新抓取（截至 2026-03-20）

筛选逻辑：优先选择与软件供应链安全、Python 生态、仓库级安全分析、AI Agent 可靠性/安全性、开源软件工程直接相关的论文；不为凑数而收录泛教育、泛生成式 AI 或弱相关工作。

本轮结论（先看这个）

这一轮最值得优先关注的方向有三条：

Python + 原生依赖的跨生态漏洞分析正在变成实战问题：不仅要看 PyPI 依赖图，还要追踪 wheel 中 vendored native library、宿主 OS 包版本以及可达性。
AI 代码智能体的安全薄弱点正在从“模型能力”转向“流程与交互偏差”：PR 元数据 framing、工具调用链覆盖不足、回归控制不充分，都是非常现实的切口。
SE4AI 的下一步不只是“让 agent 更会写代码”，而是“让 agent 在真实仓库里更可信”：包括 repo-level 漏洞数据集、impact-aware 测试、tool-call safety audit、模型供应链投毒扫描。

1. Cross-Ecosystem Vulnerability Analysis for Python Applications

arXiv：http://arxiv.org/abs/2603.18693v1
PDF：https://arxiv.org/pdf/2603.18693v1

英文摘要

Python applications depend on native libraries that may be vendored within package distributions or installed on the host system. When vulnerabilities are discovered in these libraries, determining which Python packages are affected requires cross-ecosystem analysis spanning Python dependency graphs and OS package versions. Current vulnerability scanners produce false negatives by missing vendored vulnerabilities and false positives by ignoring security patches backported by OS distributions. We present a provenance-aware vulnerability analysis approach that resolves vendored libraries to specific OS package versions or upstream releases. Our approach queries vendored libraries against a database of historical OS package artifacts using content-based hashing, and applies library-specific dynamic analyses to extract version information from binaries built from upstream source. We then construct cross-ecosystem call graphs by stitching together Python and binary call graphs across dependency boundaries, enabling reachability analysis of vulnerable functions. Evaluating on 100,000 Python packages and 10 known CVEs associated with third-party native dependencies, we identify 39 directly vulnerable packages (47M+ monthly downloads) and 312 indirectly vulnerable client packages affected through dependency chains. Our analysis achieves up to 97% false positive reduction compared to upstream version matching.

中文摘要

解决的问题：Python 应用的安全风险并不只存在于 Python 包本身，还藏在 wheel 内 vendored 的原生库、系统安装的共享库以及跨语言调用边界里。现有漏洞扫描器要么漏掉 vendored 依赖，要么忽略 Linux 发行版的 backport 修复，导致误报漏报都很严重。

方法：

做一个 provenance-aware 的跨生态漏洞分析框架；
用内容哈希把 vendored library 映射到历史 OS package artifact 或上游 release；
对二进制做库特定的动态分析，抽取版本信息；
把 Python 调用图与二进制调用图拼接起来，做 跨边界可达性分析。

核心发现：

在 10 万个 Python 包、10 个真实 CVE 上验证；
发现 39 个直接受影响包（月下载量合计 4700 万+）与 312 个间接受影响客户端包；
相比简单的 upstream version matching，误报最多可降 97%。

与主人研究的相关性：极高。这篇几乎正中主人在 Python 生态 + 软件供应链安全 + 依赖风险传播 的交叉点。尤其是“跨 PyPI 与 OS 发行版”的视角，很适合延展到 package-dashboard / Python deps 风险画像工作。

我建议重点看的点：

是否能迁移到 PyPI + Debian/Ubuntu/Alpine 的持续监测；
可达性分析能否和你的依赖风险分级体系结合；
vendored native library 的识别与归因能否单独形成一篇方法论文。

2. Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

arXiv：http://arxiv.org/abs/2603.18740v1
PDF：https://arxiv.org/pdf/2603.18740v1

英文摘要

Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases.

中文摘要

解决的问题：越来越多的安全代码审查由 LLM assistant 或 autonomous agent 参与，但它们会不会像人一样受到“先入为主”的 framing 影响？如果会，这种偏差能不能被攻击者利用，伪装成正常 PR 来绕过审查？

方法：

Study 1：在 250 组 CVE patch 对上做受控实验，比较不同 prompt framing；
Study 2：模拟真实 adversarial PR，把恶意修改包装成“安全改进”或“紧急修复”，测试 GitHub Copilot 和 Claude Code 的审查表现；
同时评估去偏策略，如元数据遮蔽和明确的防偏指令。

核心发现：

把改动描述成“没问题”，漏洞检测率会下降 16%–93%；
偏差主要体现为 false negative 暴涨，而不是 false positive 明显变化；
对 Claude Code 这类 autonomous agent，攻击者可迭代优化 framing，成功率可达 88%；
简单的 metadata redaction + explicit instructions 就能显著恢复检测能力。

与主人研究的相关性：极高。这不是普通的“LLM 安全”论文，而是非常贴近 AI-assisted development / code review / 供应链攻击面 的实证工作。对主人做 SE4AI 与软件供应链安全 的交叉研究非常有启发。

我建议重点看的点：

可直接发展为 AI code review pipeline 的 threat model；
很适合做后续研究：不同 PR metadata、commit message、issue context 对 agent 判断的影响；
也可衔接你对 agent 在真实软件工程流程中引入的新型攻击面 的梳理。

3. Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

arXiv：http://arxiv.org/abs/2603.18245v1
PDF：https://arxiv.org/pdf/2603.18245v1

英文摘要

Large Language Model (LLM) agents increasingly act through external tools, making their safety contingent on tool-call workflows rather than text generation alone. While recent benchmarks evaluate agents across diverse environments and risk categories, a fundamental question remains unanswered: how complete are existing test suites, and what unsafe interaction patterns persist even after an agent passes the benchmark? We propose SafeAudit, a meta-audit framework that addresses this gap through two contributions. First, an LLM-based enumerator that systematically generates test cases by enumerating valid tool-call workflows and diverse user scenarios. Second, we introduce rule-resistance, a non-semantic, quantitative metric that distills compact safety rules from existing benchmarks and identifies unsafe interaction patterns that remain uncovered under those rules. Across 3 benchmarks and 12 environments, SafeAudit uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose, with coverage growing monotonically as the testing budget increases.

中文摘要

解决的问题：现有 agent 安全 benchmark 就算“测过了”，也不代表真的安全。真正危险的地方往往出现在 多工具调用链 中，而且 benchmark 本身可能覆盖不全。作者问的是：谁来审计这些 benchmark 自己的盲区？

方法：

提出 SafeAudit 元审计框架；
第一部分是 LLM-based enumerator：系统枚举合法工具工作流和用户场景，自动生成更多测试样例；
第二部分是 rule-resistance 指标：从现有 benchmark 中压缩出紧凑安全规则，再看哪些不安全模式仍未被这些规则覆盖。

核心发现：

在 3 个 benchmark、12 个环境上，发现现有测试集漏掉了 20%+ 的残余不安全行为；
随着测试预算增加，覆盖率持续增长，说明“没测到”而非“本来就安全”；
论文把 agent safety evaluation 从“做 benchmark”往前推进到“审 benchmark 的完备性”。

与主人研究的相关性：很高。这篇和主人当前使用/构建的 agent 系统非常贴近，尤其对 tool-orchestrated workflow 的安全评测、agent benchmark 设计、SE4AI 的测试方法论 都很有价值。

我建议重点看的点：

能否把 SafeAudit 思路迁移到你自己的 agent workflows；
rule-resistance 是否可变成一个更普适的 workflow safety coverage metric；
这条线很适合和前一篇 confirmation bias 工作串起来，形成“流程级 agent 安全评测”专题。

4. Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

arXiv：http://arxiv.org/abs/2603.17974v1
PDF：https://arxiv.org/pdf/2603.17974v1

英文摘要

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

中文摘要

解决的问题：现有漏洞检测 benchmark 主要是函数级、片段级，离真实仓库环境差得很远；而 repo-level benchmark 又往往太依赖人工构造，导致规模上不去。结果就是：大家都在测“会不会做题”，而不是测“会不会在真实仓库里发现漏洞”。

方法：

自动向真实仓库中 注入 realistic vulnerabilities；
自动合成 可复现的 proof-of-vulnerability（PoV）exploit；
生成精确标注的数据集，用来训练/评估仓库级漏洞检测 agent；
进一步考虑 注入 agent 与检测 agent 的对抗共演化，提升基准鲁棒性。

核心发现：

这是一篇博士研究计划/短文，核心贡献是问题定义与研究路线；
明确指出 repo-level security benchmark 的瓶颈不在“有没有数据”，而在 能否自动、可扩展、可执行地生成带 exploit 的真实样本。

与主人研究的相关性：很高。主人若继续做 仓库级供应链风险检测 / AI agent 安全评测 / repo-level vulnerability detection，这篇是重要的方向信号。它非常适合作为后续选题地图，而不是立即采用的成熟方法。

我建议重点看的点：

你自己的研究可进一步强调 供应链场景，比如 package、build script、CI workflow、dependency update；
“带 exploit 的 repo-level benchmark” 可以和 agent-based audit 结合，形成更强的评测闭环；
若想发方法论文，这类“自动注入 + 自动验证”路线值得提前卡位。

5. TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

arXiv：http://arxiv.org/abs/2603.17973v2
PDF：https://arxiv.org/pdf/2603.17973v2

英文摘要

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions — breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool that performs pre-change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self-correct. The map is delivered as a lightweight agent skill — a static text file the agent queries at runtime. Evaluated on SWE-bench Verified with two open-weight models running on consumer hardware (Qwen3-Coder 30B, 100 instances; Qwen3.5-35B-A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% — worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%.

中文摘要

解决的问题：AI coding agent 会修 bug，但也很容易顺手引入 regression。现有 benchmark 基本只看 resolution rate，不认真看“修好了同时有没有把别的地方搞坏”。

方法：

提出 TDAD；
在改代码前先做 impact analysis，建立源代码与测试之间的依赖映射；
把映射结果以轻量 skill/静态文本形式暴露给 agent，帮助它知道应重点回归哪些测试并进行自纠。

核心发现：

在 SWE-bench Verified 上，回归率从 6.08% 降到 1.82%，下降约 70%；
只给 agent 一般性的 TDD 程序化指令，反而把回归率拉高到 9.94%；
在另一套模型/框架里，issue 解决率也从 24% 提升到 32%。

与主人研究的相关性：很高。这篇对 SE4AI / AI coding agent 可靠性 很实在，尤其启发一点：给 agent 正确的结构化上下文，比给它抽象流程口号更有用。 这和很多 agent skill 设计经验是一致的。

我建议重点看的点：

可迁移成“repo-aware regression guardrail”；
很适合与主人现有 agent skill / workflow 研究衔接，形成“skill 如何改善真实软件工程结果”的实验设计；
也可延伸到安全方向：除了 regression test，还能不能做 security regression。

6. scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

arXiv：http://arxiv.org/abs/2603.17893v1
PDF：https://arxiv.org/pdf/2603.17893v1

英文摘要

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

中文摘要

解决的问题：科研 Python 代码里有一类很麻烦的 bug：程序能跑、结果也看起来合理，但方法学是错的，比如 data leakage、交叉验证错误、随机种子没生效。传统 lint/static analysis 抓不住这些问题。

方法：

提出 两层架构：
- build time 用 frontier model 自动生成检测 pattern；
- runtime 用小型本地模型执行这些 pattern；
强调 pattern 是“生成出来的”，不是每条都手写维护。

核心发现：

在 Kaggle notebook 上，数据泄漏检测做到 100% recall / 65% precision；
在 38 篇已发表 AI/ML 论文对应代码上，precision 约 62%；
控制实验里对 66 类 pattern 的总体准确率达 97.7%。

与主人研究的相关性：中高。它不直接打供应链安全，但和 Python 生态、科研软件质量、AI 生成代码质量保障 很接近。如果主人后续想把视角从“安全漏洞”扩到“科研软件可信性”，这篇很有意思。

我建议重点看的点：

pattern-as-artifact 这个思想很值得借鉴，可迁移到安全规则、依赖风险规则、代码审查规则；
如果你想做“面向科研 Python 的 SE4AI 工具”，这篇可作为近期 related work；
也可以反向思考：methodology bugs 与 security bugs 在 agent 审计框架里是否能统一表示。

7. Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

arXiv：http://arxiv.org/abs/2603.17174v1
PDF：https://arxiv.org/pdf/2603.17174v1

英文摘要

Code generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce the generation of insecure code, yet effective defenses remain limited. Existing scanning approaches rely on token-level generation consistency to invert attack targets, which is ineffective for source code where identical semantics can appear in diverse syntactic forms. We present CodeScan, which, to the best of our knowledge, is the first poisoning-scanning framework tailored to code generation models. CodeScan identifies attack targets by analyzing structural similarities across multiple generations conditioned on different clean prompts. It combines iterative divergence analysis with abstract syntax tree (AST)-based normalization to abstract away surface-level variation and unify semantically equivalent code, isolating structures that recur consistently across generations. CodeScan then applies LLM-based vulnerability analysis to determine whether the extracted structures contain security vulnerabilities and flags the model as compromised when such a structure is found. We evaluate CodeScan against four representative attacks under both backdoor and poisoning settings across three real-world vulnerability classes. Experiments on 108 models spanning three architectures and multiple model sizes demonstrate 97%+ detection accuracy with substantially lower false positives than prior methods.

中文摘要

解决的问题：代码生成模型可能在训练/微调阶段被投毒，导致看似正常的 prompt 触发不安全代码生成。已有扫描方法主要看 token 一致性，但代码里“语义相同、写法不同”太常见，导致这类方法很不稳。

方法：

提出 CodeScan 黑盒扫描框架；
在多个干净 prompt 下生成代码，看结构层面的重复模式；
用 迭代差异分析 + AST 归一化 去掉表面语法差异；
再用 LLM 做漏洞分析，判断抽出的共享结构是否含漏洞。

核心发现：

在 108 个模型、3 类真实漏洞、4 种攻击下测试；
检测准确率 97%+，且误报明显低于已有方法；
说明“从结构与漏洞模式出发”，比 token-level 逆推更适合代码模型投毒检测。

与主人研究的相关性：高。它更偏 模型供应链安全 / code LLM 安全，与主人主线中的供应链安全并不完全同层，但很适合作为“未来 3-5 年值得提前布局”的方向，尤其在 AI 开发工具可信性 这个主题下。

我建议重点看的点：

这条线可与 agent coding workflow 风险衔接，形成“从模型到 PR 的端到端可信链”；
很适合放进你对 SE4AI × security 的前瞻综述里；
若未来研究模型供应链，可考虑把 poisoning scan 与 benchmark / deployment policy 联动。

本轮筛选后没收的方向

这轮也有一些论文看起来热闹，但我故意没收：

纯教育/教学法论文；
泛化的软件架构、代码理解、自然语言到代码任务；
与主人主线只有“都用了 LLM”这种弱关联的工作。

这样做的原因很简单：保持推荐集合的密度，不让真正重要的方向被噪声淹掉。

给主人的 3 个研究信号

1) Python 供应链安全值得继续深挖“跨生态传播”

不是只看 PyPI 包版本，而是要把：

wheel / vendored native libs
OS package backport
reachability
间接客户端传播

统一放进一个风险传播框架里。这个方向很有论文味，也有系统价值。

2) Agent 安全评测会从“单轮 prompt 安全”转向“流程安全”

confirmation bias、tool-call workflow、benchmark completeness 这三篇其实拼成了一个更完整的图景：

输入上下文会带偏 agent；
多工具流程会暴露新攻击面；
现有 benchmark 还测不全。

这是很适合主人切入的 SE4AI 研究线。

3) 代码智能体的质量保障开始出现“轻量 skill 化”趋势

TDAD 和 scicode-lint 都说明一件事：

不一定需要巨大的端到端模型更新；
很多改进来自 结构化知识、静态工件、可执行规则；
这和主人正在构建的 agent skill / knowledge workflow 非常契合。

我认为最值得优先读的 4 篇

如果时间有限，我建议优先顺序是：

Cross-Ecosystem Vulnerability Analysis for Python Applications
→ 最贴主线，直接可转成研究问题。
Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
→ 现实攻击面非常强，容易衍生新工作。
Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety
→ 对 agent safety evaluation 很关键。
TDAD: Test-Driven Agentic Development
→ 对 agent skill / coding workflow 的工程与研究价值都很高。

备注

本轮确实有新论文，因此已生成报告。
解析基于 arXiv 摘要与 PDF 前部内容整理。
原计划中的 Tavily 相关工作检索在当前环境不可用，因此本报告没有单独展开外部 related work 检索段，而是以论文摘要、方法描述和研究画像匹配为主完成筛选与解读。