Miso Labs Releases MisoTTS: An Open-Weights, 8-Billion-Parameter Emotional Text-to-Speech Model

June 4, 2026, 08:11

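Key Summary

Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It generates expressive speech from text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range, which avoids scaling a single flat vocabulary at a fixed parameter count. MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder and generates Mimi audio codes from text and optional audio context. The model conditions on both text and prior audio, and the latter lets it respond to the speaker's tone. Its text vocabulary is 128,256 tokens, and it includes 32 audio codebooks. Mimi is the audio tokenizer, and the maximum sequence length is…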

On-site AI-compiled draft

Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It generates expressive speech from both text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range. This avoids scaling a single flat vocabulary while keeping parameter count fixed.

What is MisoTTS

MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer. It is inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder. It generates Mimi audio codes from text and optional audio context. The model conditions on both text and prior audio. That second input lets it respond to the speaker’s tone.

The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and max sequence length is 2,048. Default inference runs in torch.bfloat16. Miso Labs claims 110ms latency. It lists ElevenLabs at 700ms and Sesame at 300ms.

The Vocabulary Size Problem

Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target space. Human speech does not fit that assumption. It varies across pitch, rhythm, emphasis, emotion, and accent. Expanding the audio vocabulary is the obvious fix. But larger vocabularies need more parameters in a standard transformer. Each token must be represented and predicted by the model. Miso Labs calls this the vocabulary size problem.

The second issue is conditioning. Most TTS models condition only on text. They ignore the interlocutor’s tone. Miso Labs argues this contributes to the “uncanny valley” effect.

Residual Vector Quantization: The Core Idea

MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image-generation research and to Sesame’s CSM for audio. Instead of one token index, the model emits a vector of indices. Each audio token is 32 codebook indices over 2048-way codebooks.
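To make the "vector of indices" idea concrete, here is a minimal sketch of generic residual vector quantization in plain Python. It is not Mimi's or MisoTTS's implementation. The codebooks are random toy data, the sizes are deliberately small (the article's figures are 32 codebooks of 2048 entries), and every function name is hypothetical.

```python
import random

def make_codebooks(depth, size, dim, seed=0):
    """Build `depth` toy codebooks, each with `size` random vectors of length `dim`."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(size)]
            for _ in range(depth)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize level by level: each codebook encodes what earlier levels left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def addressable_tokens(size, depth):
    """Number of distinct index vectors: one of `size` choices at each of `depth` positions."""
    return size ** depth

# Toy configuration (assumption): 4 codebooks of 16 entries over 8-dimensional vectors.
codebooks = make_codebooks(depth=4, size=16, dim=8)
target = [0.3, -0.7, 0.1, 0.9, -0.2, 0.5, -0.4, 0.0]
indices, residual = rvq_encode(codebooks, target)
approx = rvq_decode(codebooks, indices)
```

With the article's figures, `addressable_tokens(2048, 32)` equals `2**352`, a 106-digit integer, while the number of stored codebook entries is only 32 × 2048 = 65,536.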
The model keeps a separate codebook for each position in the vector. To recover the sound, it sums the looked-up vectors. Each codebook adds another refinement to the signal.

This is what makes the scaling work. Addressable vocabulary equals codebook size raised to the depth. Growing the depth adds no parameters to the model. So MisoTTS reaches about 2048^32, or roughly 10^105 addressable tokens. Miso Labs notes naive scaling would require a far larger network.

https://www.misolabs.ai/blog/miso-tts-8b

The Two-Transformer Architecture

The model splits into a backbone and a decoder. The backbone is a 7.7B-parameter transformer, autoregressive over time. It predicts the first codebook index and a final hidden state. A 300M-parameter decoder then runs autoregressively over depth. It predicts the remaining codebook indices, one position at a time. Each prediction conditions on the indices already chosen in the frame. The same 300M parameters are reused for every position.

Embeddings follow the same logic. Text tokens use a single lookup. An audio token’s embedding is the sum of per-position codebook lookups. Interleaving text and audio lets the backbone use conversation history. That is how it carries context across turns.

Strengths and Challenges

Strengths:
- Open weights on day one, under a modified MIT license.
- RVQ scales the sonic range without scaling parameter count.
- Conditions on audio context, not text alone.
- Local deployment keeps sensitive audio data in-house.
- The architecture and math are documented in a public blog post.

Challenges:
- Half-duplex only, with no turn-taking yet.
- The large model needs a capable CUDA GPU.
- API access is announced but not yet available.
- Latency and quality claims still need third-party testing.
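The control flow of the two-transformer design described above can be sketched as follows. This is a toy illustration only: `toy_backbone` and `toy_decoder` are hypothetical arithmetic stand-ins for the 7.7B backbone and the 300M decoder, `DIM` and `TABLES` are invented for the example, and text conditioning is omitted. Only the loop structure (one backbone step over time, then a reused decoder stepping over depth, then a summed embedding fed back as history) follows the article.

```python
import random

DEPTH = 32            # codebook positions per audio frame (from the article)
CODEBOOK_SIZE = 2048  # entries per codebook (from the article)
DIM = 4               # toy embedding width (assumption, far smaller than a real model)

_rng = random.Random(0)
# One toy embedding table per codebook position (hypothetical random values).
TABLES = [[[_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
           for _ in range(CODEBOOK_SIZE)] for _ in range(DEPTH)]

def embed_frame(indices):
    """Audio-token embedding: the sum of the per-position codebook lookups."""
    out = [0.0] * DIM
    for pos, idx in enumerate(indices):
        out = [o + r for o, r in zip(out, TABLES[pos][idx])]
    return out

def toy_backbone(history):
    """Stand-in for the backbone: consumes the history over time and
    returns (first codebook index, final hidden state)."""
    hidden = [sum(col) for col in zip(*history)] if history else [0.0] * DIM
    first = int(abs(sum(hidden)) * 1000) % CODEBOOK_SIZE
    return first, hidden

def toy_decoder(hidden, chosen):
    """Stand-in for the small decoder: the same function is reused at every
    depth position and sees the indices already chosen in this frame."""
    score = sum(hidden) + 0.618 * len(chosen) + sum(chosen)
    return int(abs(score) * 1000) % CODEBOOK_SIZE

def generate_frame(history):
    """One audio frame: backbone picks index 0, decoder fills positions 1..DEPTH-1."""
    first, hidden = toy_backbone(history)
    frame = [first]
    while len(frame) < DEPTH:
        frame.append(toy_decoder(hidden, frame))
    return frame

def generate(num_frames):
    """Autoregressive over time: each finished frame is embedded and appended to history."""
    history, frames = [], []
    for _ in range(num_frames):
        frame = generate_frame(history)
        frames.append(frame)
        history.append(embed_frame(frame))
    return frames

frames = generate(3)
```

In this arrangement the expensive model runs once per frame and the cheap one runs once per remaining codebook position, which matches the article's description of a large backbone paired with a small, reused decoder.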

Original source: MarkTechPost AI ↗


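ITHome · Model updates

A first among Chongqing automakers: Changan Automobile's in-house large model passes national generative AI filing approval

Changan Automobile's fully in-house developed Tianshu large model has officially passed filing approval, making Changan the first automaker in Chongqing to clear national-level filing. This means the "Tianshu large model," independently developed by Changan Technology, can now be offered to the public as an independently trained and operated generative AI large-model service or product.

Just now · Read analysis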

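36Kr · Model updates

Microsoft's "accidental leak": Claude Mythos at a trillion parameters, with its training scale coming to light?

This item focuses on 'Microsoft's "accidental leak": Claude Mythos at a trillion parameters, with its training scale coming to light?' The original lede reads: "Long live scaling!" From an AI-intelligence perspective, content like this is worth watching for the technical progress behind it, product rollout, industry competition, and subsequent market impact.

Just now · Read analysis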

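36Kr · Model updates

Charging money is DeepSeek's real "coming-of-age ceremony"

This item focuses on 'Charging money is DeepSeek's real "coming-of-age ceremony."' The original lede reads: "Doubao scouts the path for DeepSeek first." From an AI-intelligence perspective, content like this is worth watching for the technical progress behind it, product rollout, industry competition, and subsequent market impact.

54 minutes ago · Read analysis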

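ITHome · Model updates

Pivot to closed source hits challenges: Meta reportedly keeps delaying the launch of its AI model Muse Spark

According to The Wall Street Journal, development of Meta's most powerful AI model, "Muse Spark," has run into obstacles, and its launch has been postponed multiple times. As of now, the model's API has still not been opened to developers.

2 hours ago · Read analysis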

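Hugging Face Blog · Model updates

How to fine-tune Nemotron 3.5 ASR for your language, domain, or accent

NVIDIA has introduced Nemotron 3.5 ASR, a 600-million-parameter streaming multilingual speech-to-text model that transcribes 40 language locales in real time from a single checkpoint, with built-in punctuation and capitalization. It succeeds the English-only Nemotron 3 ASR released earlier this year on Hugging Face and NIM, which was validated by independent benchmarks from Artificial Analysis.

4 hours ago · Read analysis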

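Leiphone · Model updates

He Xiaopeng's internal speech revealed: is the mass-production timeline for the "most beautiful" robot out?

Whether robots can be delivered stably at scale is a life-or-death question for the whole industry.

Author: Li Xi | Editor: Ma Xiaoning

"We must get mass production done in the fourth quarter." At a recent XPeng robot mass-production mobilization meeting, He Xiaopeng set his internal team a highly demanding timeline: complete mass production in Q4 2026, enter sales-guide roles in domestic car showrooms in Q1 2027, and begin entering overseas markets in Q2 2027. What He Xiaopeng stressed repeatedly in this speech was not demos, videos, or model parameters, but three key phrases: "mass production, full-stack in-house development, cross-domain integration." If the past two years of the embodied-intelligence industry are understood as a "model showcase," then this internal XPeng speech reads more like a genuine manufacturing mobilization order. In our view, XPeng is trying to build the robot all over again according to "car-making logic."

01 What He Xiaopeng said

IRON, the new-generation humanoid robot XPeng unveiled at the end of last year, drew considerable attention and was at one point called the "most beautiful" robot. Beyond the mass-production timeline, He Xiaopeng said XPeng Robotics is the only robot maker in the country that develops every domain in-house and integrates across domains. XPeng's in-house development also runs deep. XPeng Motors spent five years on in-house development to bring its first version up to industry level, then another five years to integrate multiple domains with different capabilities. If you only do simple product definition and integration-level development, you will never achieve cross-domain integration. (Because) you will always be looking at requirements others hand you and then negotiating with suppliers, and if the supplier says it cannot be done, you cannot do it. So He Xiaopeng declared firmly at the meeting that XPeng wants to be the Apple of robotics, developing everything in-house from chips to the operating system and from joints to hands, because only then is it possible to be different. The up-front investment of time and the difficulty are of course very large, but the imagination, innovation, and capacity for transformation are also very different.

What kind of robot will this mass-produced robot be? He Xiaopeng's product definition is that XPeng is carving out a different product and business path. XPeng's robot is an elegant, beautiful, safe robot, one that can interact with people at close range.

02 From "robot demo" to "robot engineering"

In his speech, He Xiaopeng recalled XPeng Motors' early experience developing autonomous driving. At the time, the team believed that "hardware comes first and software can follow via OTA," but later found that fogging at minus 30 degrees, electromagnetic interference,

7 hours ago · Read analysis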

Related articles