MLMs / Janus: Translation and Commentary on "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling"

处女座的程序猿, published 2025-01-28 in Shanghai

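Overview: This paper introduces Janus-Pro, an improved unified model for multimodal understanding and generation. By optimizing the training strategy, expanding the training data, and scaling up the model, Janus-Pro makes clear gains in both multimodal understanding and text-to-image generation.

>> Background and pain points: Existing unified multimodal understanding and generation models typically use the same visual encoder for both tasks. Because the two tasks need different image representations, this leads to suboptimal multimodal understanding. Janus addressed part of the problem by decoupling visual encoding, but at the 1B parameter scale, with limited training data and small model capacity, it still performed poorly on image generation from short prompts and produced text-to-image results of unstable quality.

>> Solution: Janus-Pro addresses the shortcomings of Janus with improvements in three areas:

● Optimized training strategy: The three-stage training pipeline of Janus is revised. Stage I is lengthened so that the model makes full use of ImageNet data to model pixel dependencies; Stage II trains only on ordinary text-to-image data, which improves training efficiency; and the data ratio for supervised fine-tuning in Stage III is adjusted to balance multimodal understanding against visual generation.

● Data scaling: The training data is expanded substantially. For multimodal understanding, about 90 million samples are added, covering image captioning, table, chart, and document understanding data. For visual generation, about 72 million synthetic aesthetic samples are added, raising data quality and improving the stability and aesthetics of generated images.

● Model scaling: The model is scaled from 1.5B to 7B parameters, which validates the scalability of the decoupled visual encoding approach.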
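The Stage III adjustment amounts to changing how often each type of data is drawn during supervised fine-tuning. Below is a minimal sketch of ratio-based sampling. The ratio values are made up for illustration (the summary above gives no numbers), and the labels `multimodal`, `text`, and `text_to_image` as well as both function names are hypothetical.

```python
import random


def mixture_probabilities(ratio):
    """Turn a ratio such as {"a": 5, "b": 1} into sampling probabilities."""
    total = sum(ratio.values())
    return {name: weight / total for name, weight in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n data-source labels in proportion to the given ratio."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Illustrative ratio only; not taken from the text above.
example_ratio = {"multimodal": 5, "text": 1, "text_to_image": 4}
print(mixture_probabilities(example_ratio))
print(sample_sources(example_ratio, n=10))
```

Changing the ratio dictionary is all it takes to rebalance understanding data against generation data in such a setup.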

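>> Core idea and steps: The core idea of Janus-Pro is to decouple visual encoding, using separate encoders for multimodal understanding and for visual generation. The steps are:

● Independent encoding: A SigLIP encoder extracts high-dimensional semantic features from the image for understanding tasks; a VQ tokenizer converts the image into discrete IDs for generation tasks.

● Feature mapping: An understanding adaptor and a generation adaptor map the image features into the input space of the LLM.

● Multimodal fusion: The mapped feature sequence is concatenated with the text prompt to form a multimodal feature sequence.

● Unified processing: The multimodal feature sequence is fed into a single unified autoregressive Transformer.

● Separate prediction head: For visual generation, a randomly initialized prediction head predicts the image.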
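The steps above can be sketched as a toy forward pass. This only illustrates the data flow and is not the authors' implementation: all dimensions, the random weights, the concatenation order, and helper names such as `linear` and `transformer_stub` are assumptions, and the SigLIP encoder, VQ tokenizer, and Transformer are replaced by placeholders.

```python
import random

random.seed(0)

# Tiny, made-up sizes chosen only so the sketch runs quickly.
D_MODEL = 8         # hidden size of the LLM
D_SIGLIP = 6        # size of one SigLIP feature vector
D_CODE = 4          # size of one VQ codebook embedding
CODEBOOK_SIZE = 16  # number of discrete image IDs


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(vectors, weight):
    """Multiply each input vector by a weight matrix of shape (in_dim, out_dim)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns] for vec in vectors]


und_adaptor = rand_matrix(D_SIGLIP, D_MODEL)    # understanding adaptor
codebook = rand_matrix(CODEBOOK_SIZE, D_CODE)   # embeddings for the discrete IDs
gen_adaptor = rand_matrix(D_CODE, D_MODEL)      # generation adaptor
gen_head = rand_matrix(D_MODEL, CODEBOOK_SIZE)  # randomly initialized image head


def understanding_sequence(siglip_features, text_embeddings):
    """Understanding path: adapted SigLIP features joined with the text prompt."""
    return text_embeddings + linear(siglip_features, und_adaptor)


def generation_sequence(image_ids, text_embeddings):
    """Generation path: discrete IDs -> codebook vectors -> adaptor -> join with text."""
    code_vectors = [codebook[i] for i in image_ids]
    return text_embeddings + linear(code_vectors, gen_adaptor)


def transformer_stub(sequence):
    """Placeholder for the unified autoregressive Transformer (returns input unchanged)."""
    return sequence


def predict_next_image_id(hidden_state):
    """Generation head: score every codebook entry and pick the highest."""
    logits = linear([hidden_state], gen_head)[0]
    return max(range(CODEBOOK_SIZE), key=lambda k: logits[k])


text = rand_matrix(3, D_MODEL)  # 3 fake text-token embeddings
und_seq = understanding_sequence(rand_matrix(5, D_SIGLIP), text)
gen_seq = generation_sequence([1, 7, 3], text)
next_id = predict_next_image_id(transformer_stub(gen_seq)[-1])
print(len(und_seq), len(gen_seq), 0 <= next_id < CODEBOOK_SIZE)
```

The point of the sketch is that the two paths share nothing before the LLM input space: each has its own encoder output and its own adaptor, and only the generation path uses the image prediction head.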

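>> Advantages:

● Stronger multimodal understanding: Janus-Pro achieves the best results on several multimodal understanding benchmarks, clearly outperforming Janus and some other models, and it remains competitive even with models that have more parameters. For example, Janus-Pro-7B scores 79.2 on MMBench versus 69.4 for Janus.

● Markedly better text-to-image generation: Janus-Pro improves substantially on both GenEval and DPG-Bench (Janus-Pro-7B scores 0.80 on GenEval versus 0.61 for Janus). It follows instructions well and generates images of higher quality, with richer detail and better stability.

● Scalability: The 7B Janus-Pro model validates that the approach scales, and the larger model converges faster.

>> Conclusions and takeaways:

● By improving the training strategy, expanding the data, and increasing the model size, Janus-Pro significantly improves both multimodal understanding and text-to-image generation.

● Decoupling visual encoding is key to improving the performance of unified multimodal models.

● Despite this progress, Janus-Pro still has limitations: the 384×384 input resolution limits its performance on fine-grained tasks such as OCR, and the low image resolution leaves generated images short on fine detail. Raising the image resolution could mitigate both problems.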

MLMs / Janus: Introduction, Installation, Usage, and Example Applications of Janus/Janus-Pro

https://yunyaniu.blog.csdn.net/article/details/145385376

Translation and Commentary on "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling"

Link	Paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
Date	January 27, 2025
Authors	DeepSeek team

Abstract

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

Figure 1 | Multimodal understanding and visual generation results from our Janus-Pro. For multimodal understanding, we average the accuracy of POPE, MME-Perception, GQA, and MMMU. The scores of MME-Perception are divided by 20 to scale to [0, 100]. For visual generation, we evaluate the performance on two instruction-following benchmarks, GenEval and DPG-Bench. Overall, Janus-Pro outperforms the previous state-of-the-art unified multimodal models as well as some task-specific models. Best viewed on screen.
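The averaging described in the Figure 1 caption is simple arithmetic, and a short sketch makes the rescaling step explicit. The function name and the input values are made up for illustration; they are not results from the paper.

```python
def understanding_average(pope, mme_perception, gqa, mmmu):
    """Average four benchmark scores after dividing MME-Perception by 20."""
    scores = [pope, mme_perception / 20.0, gqa, mmmu]
    return sum(scores) / len(scores)


# Made-up inputs, only to show the calculation.
print(understanding_average(80.0, 1500.0, 60.0, 40.0))  # prints 63.75
```

Dividing by 20 puts MME-Perception on the same [0, 100] scale as the other three benchmarks, so it does not dominate the mean.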

1. Introduction

Recent advancements in unified multimodal understanding and generation models have demonstrated significant progress [30, 40, 45, 46, 48, 50, 54, 55]. These approaches have been proven to enhance the instruction-following capabilities in visual generation tasks while reducing model redundancy. Most of these methods utilize the same visual encoder to process inputs for both multimodal understanding and generation tasks. Since the representations required for these two tasks differ, this often results in suboptimal performance in multimodal understanding. To address this issue, Janus [46] proposes decoupling visual encoding, which alleviates the conflict between multimodal understanding and generation tasks, achieving excellent performance in both tasks.

As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal performance on short prompts image generation and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes: 1B and 7B, demonstrating scalability of the visual encoding decoding method.

We evaluate Janus-Pro on multiple benchmarks, and the results reveal its superior multimodal understanding capabilities and significantly improved text-to-image instruction-following performance. Specifically, Janus-Pro-7B achieved a score of 79.2 on the multimodal understanding benchmark MMBench [29], surpassing state-of-the-art unified multimodal models such as Janus [46] (69.4), TokenFlow [34] (68.9) and MetaMorph [42] (75.2). Additionally, in the text-to-image instruction-following leaderboard GenEval [14], Janus-Pro-7B scores 0.80, outperforming Janus [46] (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium [11] (0.74).

Figure 2 | Comparison of text-to-image generation between Janus-Pro and its predecessor, Janus. Janus-Pro delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text. The image resolution is 384 × 384. Best viewed on screen.

Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Best viewed on screen.

Conclusion

This paper introduces improvements to Janus from three aspects: training strategy, data, and model size. These enhancements have led to significant advancements in both multimodal understanding and text-to-image instruction-following capabilities. However, Janus-Pro still has certain limitations. In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.
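To get a feel for what raising the resolution would involve, it helps to count image tokens. The sketch below assumes the vision tokenizer downsamples each side of the image by a factor of 16; that factor and the function name are assumptions for illustration and are not stated in the text above.

```python
def image_token_count(height, width, downsample=16):
    """Number of discrete image tokens for a given resolution.

    `downsample` is an assumed per-side reduction factor of the tokenizer.
    """
    if height % downsample or width % downsample:
        raise ValueError("resolution must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


for side in (384, 768):
    print(side, image_token_count(side, side))
```

Under this assumption a 384 × 384 image maps to 576 tokens and a 768 × 768 image to 2304, so doubling the side length quadruples the sequence the model has to generate.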