MarkTechPost AI生成式AI

Google DeepMind 推出 Gemma 4 12B:免編碼器多模態模型,原生支援音訊,可在 16GB 筆電上執行

2026年6月3日 18:46
Google DeepMind 推出 Gemma 4 12B:免編碼器多模態模型,原生支援音訊,可在 16GB 筆電上執行

重點摘要

Google DeepMind 今日正式釋出 Gemma 4 12B,這是一款密集多模態模型,完全捨棄傳統編碼器。視覺與音訊資料直接流入大型語言模型主幹。結果是:該模型能在配備 16GB RAM 的消費級筆電上執行代理工作流程。採用 Apache 2.0 授權釋出。 模型概述與取得方式 Gemma 4 12B 是一個 120 億參數的純解碼器 Transformer,原生處理文字、圖像、音訊與影片,無需獨立的視覺或音訊編碼器。解碼器結構與 Gemma 4 31B Dense 模型相同,銜接了邊緣友善的 E4B 與更大的 26B 混合專家變體之間的差距。 架構:統一的、免編碼器純解碼器 Transformer。 模態:文字、圖像、影片與原生音訊輸入——這是首款中等尺寸的 Gemma,具備…

站內 AI 整理稿

Google DeepMind just released Gemma 4 12B, a dense multimodal model that strips out traditional encoders entirely. Vision and audio flow straight into the LLM backbone. The result is a model that runs agentic workflows on a consumer laptop with 16 GB of RAM. It ships under the Apache 2.0 license. Model Overview & Access Gemma 4 12B is a 12-billion-parameter decoder-only transformer. It handles text, images, audio, and video natively. There are no separate vision or audio encoders. The decoder uses the same structure as the Gemma 4 31B Dense model. It bridges the gap between the edge-friendly E4B and the larger 26B Mixture of Experts variant. Architecture: Unified, encoder-free decoder-only transformer. Modalities: Text, image, video, and native audio input — the first mid-sized Gemma with audio. Hardware requirement: 16 GB VRAM or unified memory. Runs on consumer GPU laptops and Apple Silicon Macs. License: Apache 2.0. Weights are open and publicly downloadable. Inference stack: Compatible with llama.cpp, MLX, vLLM, Ollama, SGLang, Unsloth, and LM Studio. Download: Hugging Face and Kaggle. The instruct variant is google/gemma-4-12B-it. Integration: Hugging Face Transformers, LiteRT-LM CLI, and an OpenAI-compatible local API server via litert-lm serve. A dedicated Multi-Token Prediction (MTP) drafter model is also released. It reduces inference latency on local hardware. Architecture: The Encoder-Free Design Every prior mid-sized Gemma model used separate Transformer encoders for vision and audio. Those encoders added latency and parameter overhead. The medium-sized Gemma 4 models carry a 550M-parameter vision encoder. The E2B and E4B models include a 300M-parameter audio encoder. All of that is gone in the 12B. Vision embedder (35M parameters): Raw images are split into 48×48 pixel patches. Each patch is projected to the LLM’s hidden dimension with a single matrix multiplication. There is no attention layer; each patch is processed independently. Spatial position is injected using a factorized coordinate lookup: a learned X matrix and a learned Y matrix. For a patch at (x, y), the model looks up two learned embeddings and adds them to form a position vector. This is added to the patch embedding, followed by normalization. That is the entire vision pipeline. Audio wave projection: Raw 16 kHz audio is sliced into 40 ms frames. Each frame contains 640 values. Those values are linearly projected into the same embedding space as text tokens. There is no feature extraction and no conformer layers. The LLM’s existing Rotary Position Embedding (RoPE) handles the 1-D temporal sequence. The audio encoder in the E2B and E4B used 12 conformer layers. All of that is removed. Importance: The unified weight space means you no longer co-tune separate frozen encoders. Downstream fine-tuning with LoRA or full tuning updates vision, audio, and text processing in a single pass. Hugging Face Transformers and Unsloth already support this. The encoder-free design reduces multimodal latency. The LLM backbone starts processing immediately. No encoder must finish first. Capabilities & Performance Google DeepMind team has not published full benchmark results in the initial launch materials. The official release notes state the 12B model performs nearing the 26B MoE model on standard benchmarks, at less than half the total memory footprint. https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/ The model’s demonstrated capabilities include: Automatic speech recognition. Transcribes audio natively without an external ASR pipeline. Agentic reasoning. Runs multi-step workflows locally, with performance approaching the 26B MoE model. Diarization. Distinguishes speakers in audio input. Video understanding. Processes video frames alongside audio. A demo analyzed a 5-minute Google I/O keynote segment using 313 frames at 1 FPS with a visual token budget of 70 per frame. Coding. Built a Gradio image-processing app using its own code generation, served locally with llama.cpp. Multimodal agentic workflows. The official Gemma Skills repository at github.com/google-gemma/gemma-skills provides pre-built agent capabilities. In Google’s own Google AI Edge Eloquent app, the switch to Gemma 4 12B produced what Google reports as a 60%+ jump in overall quality, with improved instruction following and scope adherence. Marktechpost’s Visual Explainer #mtp-g4-slider hr, #mtp-g4-slider p:empty, #mtp-g4-slider del, #mtp-g4-slider s { display:none !important; } #mtp-g4-slider *{ box-sizing:border-box !important; margin:0; padding:0; } #mtp-g4-slider{ --g-blue:#4285F4; --g-red:#EA4335; --g-yellow:#FBBC04; --g-green:#34A853; --g-ink:#202124; --g-grey:#5f6368; --g-line:#dadce0; --g-surface:#f8f9fa; font-family:'Google Sans','Product Sans',Roboto,Arial,sans-serif !important; color:var(--g-ink) !important; background:#ffffff !important; border:1px solid var(--g-line) !important; border-radius:20px !important; max-width:840px; margin:24px auto; overflow:hidden; box-shadow:0 1px 3px rgba(60,64,67,.16), 0 8px 28px rgba(60,64,67,.10); } /* four-color brand bar */ #mtp-g4-slider .mtpg4-bar{ display:flex; height:6px; width:100%; } #mtp-g4-slider .mtpg4-bar span{ flex:1; } #mtp-g4-slider .mtpg4-bar span:nth-child(1){ background:var(--g-blue) !important; } #mtp-g4-slider .mtpg4-bar span:nth-child(2){ background:var(--g-red) !important; } #mtp-g4-slider .mtpg4-bar span:nth-child(3){ background:var(--g-yellow) !important; } #mtp-g4-slider .mtpg4-bar span:nth-child(4){ background:var(--g-green) !important; } #mtp-g4-slider .mtpg4-viewport{ overflow:hidden; position:relative; } #mtp-g4-slider .mtpg4-track{ display:flex; transition:transform .45s cubic-bezier(.4,0,.2,1); } #mtp-g4-slider .mtpg4-slide{ min-width:100%; padding:38px 44px 30px; } #mtp-g4-slider .mtpg4-eyebrow{ display:inline-block; font-size:12px; font-weight:700; letter-spacing:.10em; text-transform:uppercase; padding:5px 12px; border-radius:999px; background:var(--g-surface) !important; color:var(--g-grey) !important; border:1px solid var(--g-line) !important; margin-bottom:16px; } #mtp-g4-slider .mtpg4-slide h2{ font-size:27px; line-height:1.2; font-weight:700; color:var(--g-ink) !important; margin-bottom:8px; letter-spacing:-.01em; } #mtp-g4-slider .mtpg4-slide h3{ font-size:15px; font-weight:500; color:var(--g-grey) !important; margin-bottom:20px; } #mtp-g4-slider .mtpg4-slide p.lead{ font-size:16px; line-height:1.6; color:var(--g-ink) !important; max-width:62ch; margin-bottom:18px; } #mtp-g4-slider ul.mtpg4-list{ list-style:none !important; padding:0 !important; } #mtp-g4-slider ul.mtpg4-list li{ position:relative; padding:11px 14px 11px 18px; margin-bottom:10px; font-size:15px; line-height:1.5; color:var(--g-ink) !important; background:var(--g-surface) !important; border-radius:10px; border-left:4px solid var(--accent) !important; } #mtp-g4-slider ul.mtpg4-list li b{ font-weight:700; } #mtp-g4-slider code{ font-family:'Roboto Mono',ui-monospace,monospace !important; font-size:13px; padding:2px 6px; border-radius:5px; background:#ffffff !important; color:var(--g-blue) !important; border:1px solid var(--g-line) !important; } /* per-slide accent */ #mtp-g4-slider .mtpg4-slide[data-accent="blue"]{ --accent:var(--g-blue); } #mtp-g4-slider .mtpg4-slide[data-accent="red"]{ --accent:var(--g-red); } #mtp-g4-slider .mtpg4-slide[data-accent="yellow"]{ --accent:var(--g-yellow); } #mtp-g4-slider .mtpg4-slide[data-accent="green"]{ --accent:var(--g-green); } #mtp-g4-slider .mtpg4-slide .mtpg4-eyebrow{ color:var(--accent) !important; border-color:var(--accent) !important; } /* controls */ #mtp-g4-slider .mtpg4-nav{ display:flex; align-items:center; justify-content:space-between; padding:14px 24px; border-top:1px solid var(--g-line) !important; background:#ffffff !important; } #mtp-g4-slider .mtpg4-dots{ display:flex; gap:8px; } #mtp-g4-slider .mtpg4-dot{ width:8px; height:8px; border-radius:999px; border:none !important; backgrou

Related

相關文章

Hugging Face Blog生成式AI

Nemotron 3.5 內容安全:為全球企業 AI 打造可自訂的多模態安全防護

回顧過去兩年,NVIDIA 的內容安全技術棧已從一個專注於英文的分類器,發展為一系列專業模型,逐步擴展至新的模態、語言與推論模式。2026 年 3 月推出的 Nemotron 3 Content Safety 首次在單一 4B 參數模型中整合多模態與多語言能力。今日我們發布 Nemotron 3.5 Content Safety,補齊最後一塊拼圖:一個統一處理多模態輸入的單一模型。

10 分鐘前
IT之家生成式AI

全球最強開源生圖 AI 模型:Ideogram 4.0 登場

Ideogram 於6月3日正式發表4.0版本,這是一款採用開放權重架構的文字轉圖片生成模型,官方宣稱其為「全球最佳開源生圖AI模型」。開發人員與研究人員可下載模型權重進行本地部署與二次開發,此舉有望進一步拉高開源模型的品質天花板。

5 小時前
雷峰網生成式AI

全球首個!材科源圖發佈有機高分子應用智能體

在人工智能重塑科研範式的科技浪潮中,因體系複雜、配方變量多,長期面臨高度依賴專家經驗、試錯成本高、知識難以沉澱複用等行業瓶頸,研發效率提升亟待突破。近日,據雷峰網瞭解,蘇州材科源圖(MatSource)正式發佈全球首個有機高分子材料研發應用智能體(Organic Polymer Agent)。該智能體依託自主構建的通用材料科學智能體框架(Materials Agent Framework),面向高分子材料研發場景打造專家級人工智能系統,推動“人工驅動”向“人工智能協同驅動”加速躍遷,為高新材料的高效自主研發提供了關鍵的技術支撐。01 面向複雜研發場景,構建高分子材料研發“智能中樞”作為材科源圖(MatSource) 材料科學智能體體系的重要組成部分,有機高分子應用智能體聚焦高分子材料研發中的關鍵痛點,融合材料知識圖譜、多模態數據理解、大模型推理與領域機理模型能力,構建覆蓋“設計-預測-優化-決策”的全流程智能研發體系。依託這一技術架構,系統可實現高分子分子結構設計與性能預測、配方體系智能生成與多目標優化、工藝參數推薦與實驗路徑規劃,以及文獻知識解析、研發知識沉澱等核心功能,推動專家經驗向數字化能力轉化。通過“知識+模型+工具”的深度協同,顯著提升研發效率與決策質量,為行業由傳統“經驗驅動”向“智能驅動”轉型提供新的技術路徑。02 率先落地光刻膠,完成產業級驗證作為有機高分子材料中技術壁壘最高、研發難度最大的典型代表,光刻膠成為該智能體的首個驗證場景。目前,系統已完成在ArF光刻膠研發場景中的實測驗證,實現從樹脂設計、配方篩選到性能預測的全流程支持,並完成關鍵指標驗證,證明瞭其在複雜有機高分子體系中的工程化能力與應用價值。這意味著,材科源圖(MatSource)不僅驗證了“AI+高分子材料”的技術可行性,也打通了從實驗室研發到產業應用的關鍵路徑。03 從ArF到EUV,持續拓

5 小時前
雷峰網生成式AI

不卷價格和參數,中國汽車如何賣到5000萬輛?

2026年,國內新能源汽車滲透率突破60%,中國汽車品牌的售價提升到80萬元。中國乘聯會秘書長崔東樹說,國產車未來要達到5000萬輛銷售規模,在全球市場中,佔比超過50%。中國汽車越過規模大關,但高速發展之下,行業參數內卷、體驗同質化、盈利承壓等痛點日益凸顯。第四屆未來汽車先行者大會上,奇瑞副總經理王琅直言,行業進入新的“無人區”,不能再卷參數了。跳出價格與參數之外,國產車如何尋找下一個增長點?01元戎啟行周光:智駕幾十公里接管一次和1000公里接管一次,是兩個物種最近幾年,智駕行業的技術重心從端到端、VLA向著大模型、基座模型和物理AI快速迭代。元戎啟行CEO周光分享了他對物理AI基座模型的思考。他認為,過去5年,智駕行業走的是小模型路線,已經到了能力的上限,投入越來越多,提升越來越慢。這個現象可以用“蹺蹺板效應”來形容:在小模型系統裡,當一個版本解決了上海、武漢等城市的問題,可能就會在深圳、廣州等地效果變差,引入新問題。版本之間因此要反反覆覆地修改。周光說,這種蹺蹺板效應在行業中非常普遍,這也是用戶難以長期信任這個系統的原因。2026年,行業認知進入到大模型階段。周光解釋,大模型並不是一個更大的小模型,而是有一整套技術邏輯,在技術棧、網絡結構、訓練方式和模式上都有變化。他舉了一個例子,來說明大模型和小模型的認知區別。假設一條狗被染上斑馬的條紋,小模型會識別為一隻斑馬;但大模型會作出這是一隻狗的判斷。“小模型擅長條件反射、局部特徵相應,大模型擅長高級認知”,周光總結。自動駕駛從一開始的被激活,城區安全接管,再到更高的認知理解,做到像人一樣的整體判斷和泛化能力,需要從執行系統升級到認知系統。周光判斷,今年年底到明年初,行業裡會迎來從小模型到大模型、基座模型的轉換浪潮。技術陡峭升級,大模型成為智駕發展的下一個技術範式。他透露,元戎啟行很早就判斷要全面擁抱大模型和多模態,202

7 小時前
IT之家生成式AI

奧爾特曼:OpenAI 內部有人每月用掉約 1000 億個詞元

從六年前月耗十萬詞元到如今月耗千億,OpenAI 的詞元消耗量呈爆炸式增長。公司內部設有消耗排行榜,員工甚至曬圖炫耀,與亞馬遜等嚴控成本的企業形成鮮明對比。奧爾特曼承認成本已成難題,正尋求降本增效。 #AI 成本# #詞元消耗#

8 小時前