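How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent
Key Takeaways
NVIDIA has introduced Nemotron 3.5 ASR, a 600M-parameter streaming multilingual speech-to-text model that transcribes 40 language-locales in real time from a single checkpoint, with punctuation and capitalization built in. It succeeds the English-only Nemotron 3 ASR model, released earlier this year on Hugging Face and as a NIM, which has been validated by independent benchmarks at Artificial Analysis.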
Published June 4, 2026
By Maryam Motamedi, Adi Margolin, Francesco (fciannella), Myungjong Kim, and Enas Albasiri (NVIDIA)

Introducing NVIDIA Nemotron 3.5 ASR, streaming multilingual: a 600M-parameter speech-to-text model that transcribes 40 language-locales from a single checkpoint, in real time, with punctuation and capitalization built in. It is the successor to the popular English-only Nemotron 3 ASR model, which was released on Hugging Face and as a NIM earlier this year.

Since its release, Nemotron 3 ASR has been validated by independent benchmarks at Artificial Analysis, where it ranks 2nd in latency among all streaming ASR models, with just 0.07 seconds to final transcript after end of speech. It also sits in the "most attractive quadrant" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, placing it among the best models on the combined accuracy-latency tradeoff.

The model uses a Cache-Aware FastConformer-RNNT architecture that streams audio without the redundant recomputation that makes most streaming ASR slow, so you get low latency and high accuracy rather than one at the expense of the other.

Nemotron 3.5 ASR ships as open weights on Hugging Face: you can inspect, fine-tune, and deploy it without API dependencies or per-call billing. No data leaves your infrastructure unless you choose. And because it's a strong base model, you can fine-tune it for your own language, domain, or accent. The second half of this post walks through exactly how.

The problem with multilingual speech recognition today

If you've ever built a product that needs to transcribe speech, you've probably hit one of these walls:

The polyglot tax.
You want to support multiple languages, so you stitch together 40 different models, or 40 different vendor APIs, each with its own quirks, latency profile, and billing. Your infrastructure becomes a museum of one-off integrations.

The streaming-vs-accuracy tradeoff. Real-time captioning needs low latency, but most "streaming" ASR systems fake it by re-processing overlapping windows of audio over and over. That burns compute and adds delay. Turn down the latency and accuracy falls off a cliff.

The post-processing pipeline. Raw ASR output is often an unpunctuated, lowercase wall of text. You bolt on a second model for punctuation and capitalization, adding yet another moving part.

The "known language" assumption. Many systems require you to tell them the language up front. But what about a customer-support line where callers switch between English and Spanish mid-sentence?

Nemotron 3.5 ASR was built to collapse all four of those problems into one model.

What it does

One model, 40 language-locales. A single 600M-parameter checkpoint transcribes English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai. No per-language deployment, no model-swapping.

Real-time streaming, done right. The model is built on a Cache-Aware FastConformer encoder. Traditional "buffered" streaming re-processes overlapping chunks of audio at every step, doing the same work many times over. This model instead caches the encoder's internal state and reuses it, so every audio frame is processed exactly once, with no overlap. The result is dramatically lower compute and end-to-end latency, with no accuracy penalty.

Punctuation and capitalization, natively.
The output is production-ready text (proper casing, commas, periods, question marks) straight from the model. No separate punctuation-restoration step.

Language conditioning, your choice. You can run it two ways:

- Tell the model the input language (target_lang=en-US) when you know it. This typically gives the best accuracy.
- Let the model detect the language (target_lang=auto) when you don't. The model identifies the language and transcribes accordingly.

How it works (the 2-minute version)

The model has two main pieces:

- A Cache-Aware FastConformer encoder (24 layers). FastConformer is an efficient evolution of the Conformer architecture with linearly scalable attention. The "cache-aware" part is the streaming magic: the encoder keeps a cache of its self-attention and convolution activations from previous frames, so as new audio arrives it only computes what's genuinely new. Nothing is recomputed.
- An RNNT (Recurrent Neural Network Transducer) decoder. RNNT is the workhorse decoder for streaming ASR: it emits text as audio streams in, frame by frame, which is exactly what you want for live transcription.

On top of this, the model adds prompt-based language-ID conditioning: a language signal is fed alongside the audio, which lets one set of weights specialize its output to the target language or, in auto mode, infer the language itself. It was trained on a large speech corpus spanning all supported languages, using a blend of public and proprietary data normalized to punctuated, properly cased text.

A knob worth knowing: att_context_size

Streaming ASR is fundamentally a tradeoff between how soon you emit text and how much future audio the model gets to "peek at" before committing.
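As a rough illustration of that tradeoff, the sketch below estimates chunk latency from the look-ahead (the second value of att_context_size). It assumes each encoder frame covers 80 ms of audio, a figure inferred from the latency values quoted for this model's operating points and not a documented constant. The function name chunk_latency_ms is hypothetical and is not part of NeMo.

```python
def chunk_latency_ms(att_context_size, frame_ms=80):
    """Estimate chunk latency for an [left, right] attention context.

    Assumption: one encoder frame spans `frame_ms` of audio, and the model
    waits for the current frame plus `right` look-ahead frames before
    committing text. This is a back-of-the-envelope model, not NeMo code.
    """
    left, right = att_context_size
    if left < 0 or right < 0:
        raise ValueError("context sizes must be non-negative")
    # Current frame + look-ahead frames, each frame_ms long.
    return (right + 1) * frame_ms


if __name__ == "__main__":
    for ctx in ([56, 0], [56, 1], [56, 3], [56, 6], [56, 13]):
        print(ctx, chunk_latency_ms(ctx), "ms")
```

Under this assumption, more look-ahead frames mean more future context for the model and a proportionally longer wait before text is emitted; the left context (56) does not add latency because it only covers audio that has already arrived.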
Nemotron ASR exposes this directly through the attention context size:

| Attention context | Chunk size (latency) | Use case |
| --- | --- | --- |
| [56, 0] | 80 ms (ultra-low) | Ultra-low-latency voice agents |
| [56, 1] | 160 ms (low) | Interactive voice agents, conversational AI |
| [56, 3] | 320 ms (balanced) | Conversational AI, live captioning |
| [56, 6] | 560 ms (medium) | High accuracy with reasonable latency |
| [56, 13] | 1.12 s (high) | Highest accuracy with high latency |

The same checkpoint covers the whole spectrum: you choose the operating point at inference time, no retraining required.

Try it in minutes

The model ships as a NeMo checkpoint. Clone the NeMo repository and point the streaming inference script at your audio:

    git clone https://github.com/NVIDIA-NeMo/NeMo.git

Transcribe with a known language:

    python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
        model_path=${MODEL_PATH} \
        dataset_manifest=${MANIFEST_PATH} \
        output_path=${OUTPUT_FOLDER} \
        target_lang=es-ES \
        att_context_size="[56,3]" \
        strip_lang_tags=true

Or let the model detect the language:

    python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
        model_path=${MODEL_PATH} \
        dataset_manifest=${MANIFEST_PATH} \
        output_path=${OUTPUT_FOLDER} \
        target_lang=auto \
        att_context_size="[56,3]" \
        strip_lang_tags=true

Audio should be mono-channel .wav. The manifest is a standard NeMo JSON-lines file, with one JSON object per line:

    {"audio_filepath": "/path/to/clip.wav", "duration": 4.27, "text": "reference transcript"}

The model automatically predicts a language tag at the end of each completed sentence, for example "This is a test sample. <en-US>". Setting strip_lang_tags=true removes the <xx-XX> tag for better readability.

Deep Dive: Fine-Tuning Nemotron ASR for Your Language

Nemotron 3.5 ASR is strong out of the box, but it was trained on a mix where some languages have far more data than others.
The long-tail locales have headroom, and a few hours of in-domain audio plus the right recipe closes a surprising amount of it. To make this concrete, we ran a worked example: take the base model and sharpen it on two mid-resource European languages, Greek and Bulgarian, then measure honestly on held-out data. The results below are from that run. This section is a high-level overview; the code example lives in the companion GitHub repo. When we publish an agentic SKILL.md covering the whole process, this blog will be updated accordingly.

Why fine-t