VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Zixuan Chen¹, Ruijie Su¹, Jiahao Zhu¹, Guangcong Wang², Lingxiao Yang¹, Jian-Huang Lai¹, Xiaohua Xie¹

¹Sun Yat-Sen University

²Great Bay University

VividDreamer

Text-to-3D Examples


"Iron Man"

"A bichon frise wearing academic regalia"

"A plush dragon toy"

Training Process

Abstract

Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality: widely used objectives such as Score Distillation Sampling (SDS) omit the U-Net Jacobian to speed up generation, which introduces significant bias relative to the "true" gradient obtained by full denoising sampling. This bias yields inconsistent update directions and results in implausible 3D generation (e.g., color deviation, the Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel and efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds a pose-dependent consistency function within diffusion trajectories, allowing the true gradient to be approximated with only a few sampling steps (1 to 3). Compared to SDS, PCDS obtains a more accurate update direction at the same sampling cost (1 sampling step), while also enabling few-step (2 to 3) sampling that trades compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy that first uses 1-step PCDS to create the basic structure of a 3D object and then gradually increases the number of PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state of the art in generation quality and training efficiency, and noticeably alleviates the implausible 3D generation issues caused by deviated update directions. Specifically, VividDreamer can create ready-to-use 3D assets within 10 minutes and photorealistic 3D objects within 30 minutes. Moreover, it can be readily applied to many 3D generative applications, such as 3D portrait generation, 3D avatar generation, and text-to-3D editing, to yield impressive 3D assets.
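The coarse-to-fine strategy described in the abstract can be pictured as a schedule that maps a training iteration to a number of PCDS sampling steps. The sketch below is only an illustration of that idea, not the authors' implementation: the function name `pcds_steps`, the `coarse_fraction` parameter, and the even split of the fine phase are all assumptions made for this example. Only the range of 1 to 3 steps and the "start at 1, then increase" behavior come from the abstract.

```python
def pcds_steps(iteration: int, total_iterations: int,
               coarse_fraction: float = 0.5, max_steps: int = 3) -> int:
    """Hypothetical coarse-to-fine schedule for the number of PCDS steps.

    Assumption: the first `coarse_fraction` of training uses 1-step PCDS,
    and the remainder is split evenly among step counts 2..max_steps.
    The actual schedule used by VividDreamer is not stated on this page.
    """
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    if not 0 <= iteration < total_iterations:
        raise ValueError("iteration must be in [0, total_iterations)")
    if not 0.0 <= coarse_fraction <= 1.0:
        raise ValueError("coarse_fraction must be in [0, 1]")

    coarse_end = int(coarse_fraction * total_iterations)

    # Coarse phase: a single sampling step builds the basic 3D structure.
    if iteration < coarse_end or max_steps <= 1:
        return 1

    # Fine phase: progress in [0, 1) selects a step count from 2 to max_steps.
    fine_length = total_iterations - coarse_end
    progress = (iteration - coarse_end) / fine_length
    return 2 + int(progress * (max_steps - 1))


if __name__ == "__main__":
    total = 1000
    for it in (0, 499, 500, 749, 750, 999):
        print(it, pcds_steps(it, total))
```

With this choice, the cheap 1-step objective covers the first half of training and the more expensive 2-step and 3-step objectives are reserved for the second half, which mirrors the compute-for-quality trade-off the abstract describes.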


Visual Comparisons

Stable DreamFusion (~1 h)

GaussianDreamer (~9 min)

LucidDreamer (~45 min)

VividDreamer (Ours) (~30 min)

"A DSLR photo of a chow chow puppy"

"A zoomed out DSLR photo of a wizard raccoon casting a spell"

"A DSLR photo of a baby dragon drinking boba"

"A DSLR photo of a tray of Sushi containing pugs"

"A goat drinking beer"

"A DSLR photo of a terracotta bunny"



Visual Comparisons (~10 minutes)

GaussianDreamer (SDS) (~9 min)

LucidDreamer (ISM) (~10 min)

VividDreamer (Ours) (~10 min)

"A pig wearing a backpack"

"A DSLR photo of a corgi puppy"

"A yellow schoolbus"


More Generated Results

"A DSLR photo of a piglet sitting in a teacup"

"A red panda"

"A DSLR photo of a shiba inu wearing golf clothes and hat"

"A DSLR photo of a cocker spaniel wearing a crown"

"A zoomed out DSLR photo of a corgi wearing a top hat"

"A DSLR photo of a squirrel dressed like a clown"

"A zoomed out DSLR photo of a kingfisher bird"

"A DSLR photo of a mandarin duck swimming in a pond"

"A plush toy of a corgi nurse"

"A plush dragon toy"

"A DSLR photo of a hippo wearing a sweater"

"A zoomed out DSLR photo of a lion's mane jellyfish"

"A DSLR photo of a robot dinosaur"

"A DSLR photo of a shiny silver robot cat"

"A DSLR photo of a hippo made out of chocolate"

"A DSLR photo of an origami hippo in a river"

"An airplane made out of wood"

"A Panther De Ville car"

"A DSLR photo of a steam engine train, high resolution"

"A DSLR photo of an amigurumi motorcycle"

"A delicious chocolate brownie dessert with ice cream"

"A DSLR photo of an ice cream sundae"

"A DSLR photo of spaghetti and meatballs"

"Tower Bridge made out of gingerbread and candy"

"A 20-sided die made out of glass"

"A DSLR photo of a football helmet"

"A DSLR photo of the leaning tower of Pisa"

"An erupting volcano"


Applications

3D Portrait Generation

"A boy with facial painting, head, HDR, photorealistic, 8K"

"Barack Obama, head, HDR, photorealistic, 8K"

"Portrait of young norwegian woman, steampunk, long hair"

"Robert Pattinson, head, HDR, photorealistic, 8K"


3D Avatar Generation

"Iron man"

"Ant man"

"Bat man"

"Groot"

"Sun Wukong"

"Captain Marvel"

"A young man wearing a turtleneck"

"A Mediterranean with beard wearing white linen shirt"


Text-to-3D Editing





Citation

@article{chen2024vividdreamer,
    title={VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation},
    author={Chen, Zixuan and Su, Ruijie and Zhu, Jiahao and Yang, Lingxiao and Lai, Jian-Huang and Xie, Xiaohua},
    journal={arXiv preprint arXiv:2406.14964},
    year={2024}
}
                

Acknowledgements


This project is supported by the Natural Science Foundation of China (No. 62072482) and by the Project of Guangdong Provincial Key Laboratory of Information Security Technology (Grant No. 2023B1212060026).
We also thank Lior Yariv for the website template.