Hugging Face Blog · Research and Frontiers

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

June 4, 2026, 11:24

Key Takeaways

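In large language model development, the question is no longer only how much data a model sees, but also whether the data contains enough structured learning signals. General web, code, math, multilingual, and domain data provide a broad base; task-seeded synthetic Q&A generation (SDG) complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by 1.8 points, average code by 1.9 points, and commonsense understanding by 1.6 points.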


By Markus Kliegl (mkliegl-nv) and Dan Su (sudandandansu1), NVIDIA. Published June 4, 2026.

In large-scale LLM development, the question is no longer simply how much data a model sees. It is also whether the data contains enough structured learning signals. General web, code, math, multilingual, and domain data provide a broad base. Task-seeded synthetic Q&A complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1, while average math remained stable.

This post describes a task-seeded synthetic Q&A generation workflow developed for Nemotron-family training, including Ultra and Super training runs. The workflow uses training splits from broad public task families as capability seeds, generates new task-aligned examples, enriches them with reasoning and relevant knowledge, and filters them into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes can then decide how to mix those datasets with the broader corpus.

Figure 1. The task-seeded SDG pipeline ends at curated generated data. Training mixture design and reported evaluations happen downstream.

TL;DR

- We use public task training splits as capability seeds, not as examples to memorize.
- We frame the data through transfer learning across task families: a model can learn reusable behaviors from broad seed tasks, then apply them to related applications and evaluations.
- The pipeline generates similar questions and answer-enriched examples with reasoning and task-relevant context.
- Multiple-choice tasks are easier to verify; open generation tasks need task-specific extraction and filtering.
- In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1 while keeping average math stable.

At A Glance

| Element | Value |
| --- | --- |
| Seed source | Public task training splits available through lm-eval-harness |
| Scale | About 70 tasks and about 700 subtasks |
| Data types | Similar questions, answer-enriched samples, reasoning/context traces |
| Verification | Schema checks, format checks, deduplication, majority-voted answer checks |
| Training use | Late-stage Nemotron-family training, including Ultra/Super workstreams |
| Main result | Gains on MMLU-Pro, code, commonsense, and GPQA in a 100B-token Nemotron-3 Nano continuation |

Generation Pipeline

The generation workflow is a compact loop: collect training-split seeds, normalize heterogeneous task records, generate new examples, enrich answers, and filter the resulting data.

In the internal pipeline, we used roughly 70 public task datasets from lm-eval-harness, covering about 700 subtasks. For each task, we used only suitable training splits as SDG seeds; held-out test data was not used for generation, and tasks without suitable training data were excluded from seed collection.

The seed pool covered both knowledge-intensive and reasoning-intensive tasks:

| Seed group | Approximate coverage | Purpose |
| --- | --- | --- |
| Knowledge-intensive tasks | 39 tasks, about 300 subtasks, about 3M seed samples | Improve factual, scientific, multilingual, and domain-specific QA behavior |
| Reasoning-intensive tasks | 34 tasks, about 400 subtasks, about 1.5M seed samples | Improve analytical reasoning, logical reasoning, math, code, and commonsense reasoning |

For Nemotron Ultra and Super pretraining, we used a license-compatible subset of the generated data suitable for commercial model training.

The end-to-end process has five stages:

1. Collect seed tasks. Enumerate available lm-eval-harness tasks, group them by output type, and keep only tasks with suitable training splits.
2. Normalize records. Since each lm-eval-harness task defines its own fields and formatting in YAML, we convert task records into a unified JSONL-style schema. For multiple-choice tasks, the normalized record contains the question and candidate options. For generative tasks, it contains the question or prompt, plus context when the task provides it.
3. Generate similar examples. Given a seed example, the generator creates a new question that preserves the underlying capability while changing the content.
4. Enrich answers. The generator solves the generated questions and adds the final answer plus relevant reasoning, knowledge, or context.
5. Filter and package. The pipeline applies schema checks, format checks, deduplication, and task-specific answer validation where possible. Multiple-choice data is easier to verify directly; generation-style data requires more cautious task-specific handling.

One practical formatting choice is to store semantic answer text rather than only option labels when possible. For example, writing the answer as "dirt trapped under the fingernails" gives the model a clearer training signal than only writing "B".

Why Task-Seeded Data?

Public task datasets are imperfect, but their training splits contain compact examples of how information is requested, constrained, and resolved. They capture useful correlations among task framing, domain knowledge, reasoning depth, candidate answers, and final response form. A model may see abundant raw text during pretraining and still benefit from synthetic data that makes those correlations explicit.

Task-seeded synthetic data addresses this gap by turning public task training splits into data generation templates.
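The normalization and filtering stages described in the Generation Pipeline section could look roughly like the sketch below for multiple-choice data. It is illustrative only: the field names (`question`, `options`, `answer`), the function names, and the majority threshold are assumptions, not the schema or code used in the Nemotron pipeline, and the sample record is invented for the demo.

```python
import json
from collections import Counter
from typing import List, Optional


def normalize_mc_record(raw: dict, question_key: str, options_key: str, label_key: str) -> dict:
    """Map one task-specific multiple-choice record onto a shared (hypothetical) schema."""
    options = [str(option).strip() for option in raw[options_key]]
    label = raw[label_key]
    # Accept either an integer index or a single-letter label such as "B".
    index = label if isinstance(label, int) else ord(str(label).strip().upper()) - ord("A")
    return {
        "question": str(raw[question_key]).strip(),
        "options": options,
        # Store the semantic answer text rather than only the option label.
        "answer": options[index],
    }


def passes_schema(record: dict) -> bool:
    """Schema/format check: non-empty question, two or more distinct options, answer among them."""
    question = record.get("question")
    options = record.get("options")
    if not isinstance(question, str) or not question.strip():
        return False
    if not isinstance(options, list) or len(set(options)) < 2:
        return False
    return record.get("answer") in options


def majority_answer(votes: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the most common vote if its share is strictly above min_agreement, else None."""
    if not votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count / len(votes) > min_agreement else None


def filter_records(records: List[dict], votes_per_record: List[List[str]]) -> List[dict]:
    """Keep records that pass the schema check, are not duplicates, and match the majority vote."""
    seen = set()
    kept = []
    for record, votes in zip(records, votes_per_record):
        if not passes_schema(record):
            continue
        # Case- and whitespace-insensitive key for exact-duplicate removal.
        key = " ".join(record["question"].lower().split())
        if key in seen:
            continue
        if majority_answer(votes) != record["answer"]:
            continue
        seen.add(key)
        kept.append(record)
    return kept


if __name__ == "__main__":
    # Invented record in a made-up task-specific layout.
    raw = {
        "q": "After gardening, which of these is most likely to carry bacteria?",
        "choices": ["a freshly washed towel", "dirt trapped under the fingernails"],
        "gold": "B",
    }
    record = normalize_mc_record(raw, "q", "choices", "gold")
    votes = [["dirt trapped under the fingernails"] * 2 + ["a freshly washed towel"]]
    for item in filter_records([record], votes):
        print(json.dumps(item, ensure_ascii=False))  # one JSONL-style line per kept record
```

Here the votes stand in for several independent solver outputs per generated question. Open-ended generation tasks would need a task-specific answer extraction step before any such comparison, which this sketch does not attempt.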
Using only suitable training splits from broad task families, we generate new examples that preserve useful properties of the source interaction:

- task framing, such as whether the example asks for selection, generation, classification, or explanation;
- answer structure, such as multiple-choice options, short answers, free-form responses, or format-constrained outputs;
- domain and context, such as science, commonsense, factual knowledge, math, code, multilingual QA, or reading comprehension;
- difficulty and reasoning depth, such as whether the example requires a direct fact, a comparison among alternatives, or several reasoning steps;
- explanatory signal, such as task-relevant knowledge, reasoning, or context that helps connect the question to the answer.

This lets us expose the model to reusable reasoning and knowledge-use patterns across task families, without tying the dataset to the surface format of one data source.

Why Use Broader Seed Tasks?

A useful way to interpret this pipeline is through transfer learning across task families. Many improvements do not come from learning a single task's surface format. They come from strengthening reusable behaviors that appear across many tasks: identifying the information need, applying relevant domain knowledge, separating plausible alternatives, following response constraints, doing multi-step reasoning, and grounding a final answer in the right context.

Because of this, we do not generate from a narrow set of task formats. Instead, we collect a broader set of training-split seed samples from lm-eval-harness and use them to cover many neighboring capability regions. A science QA seed can help with commonsense physical reasoning. A logical reasoning seed can help with careful alternative comparison. A math or code seed can help with multi-step planning even when the final application is not exactly the same task.
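One way to picture how a seed from this pool becomes a generation request is a small template that carries over the properties the post says should be preserved (task framing, answer structure, domain, reasoning depth, explanatory signal) while asking for new content. The sketch below is hypothetical: `SeedProfile`, `build_generation_prompt`, and the prompt wording are illustrative names and text, not the prompts used for Nemotron, and no model call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedProfile:
    """Properties of a seed example that a generated example should preserve."""
    task_framing: str        # e.g. "selection", "generation", "classification"
    answer_structure: str    # e.g. "multiple-choice with 4 options", "short answer"
    domain: str              # e.g. "science", "commonsense", "code"
    reasoning_depth: str     # e.g. "direct fact", "several reasoning steps"
    explanatory_signal: str  # e.g. "task-relevant knowledge", "reasoning"


def build_generation_prompt(seed_question: str, profile: SeedProfile) -> str:
    """Build a prompt that asks for a new question with the same profile but different content."""
    lines = [
        "Write one new question modeled on the seed below.",
        "Preserve these properties:",
        f"- Task framing: {profile.task_framing}",
        f"- Answer structure: {profile.answer_structure}",
        f"- Domain: {profile.domain}",
        f"- Reasoning depth: {profile.reasoning_depth}",
        f"- Explanatory signal: {profile.explanatory_signal}",
        "Change the topic and surface content; do not copy or paraphrase the seed.",
        "",
        f"Seed: {seed_question}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    profile = SeedProfile(
        task_framing="selection",
        answer_structure="multiple-choice with 4 options",
        domain="science",
        reasoning_depth="comparison among alternatives",
        explanatory_signal="task-relevant knowledge",
    )
    print(build_generation_prompt("Why does ice float on water?", profile))
```

Keeping the profile separate from the seed text makes it explicit which aspects are meant to carry over and which are meant to change.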
The goal is positive transfer learning across task families, while reducing the risk that the model simply learns the quirks of a single data source.

This motivation is also consistent with earlier evidence in Nemotron Nano pretraining. We found that using AGIEval training data improved MMLU-Pro, suggesting that structured Q&A data from one task family can improve behavior outside t
