
Cinamon Inc. · Motion Stylization

🎨 Motion Stylization

Project period: 2024.09 - 2025.02

Tech stack

PyTorch, PyTorch Lightning, Streamlit, GitHub

I developed a ==motion stylization model== that applies a chosen style to a content motion while preserving what the motion means, then built and deployed a demo page for it. Because the content and style motion data are unpaired, I started from a ==Transformer VAE== PoC and then moved to a DDIM-based approach with ==mixed attention==.

Media

๋™์ผํ•œ ๋™์ž‘์„ ๋‹ค๋ฅธ ์Šคํƒ€์ผ๋กœ ๋ณ€ํ™˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

📝 Work performed

- To give the team a ==basis for feature-planning decisions== on AI-based motion content, I ran a project that validated both the feasibility of a style-conversion feature and the internal user experience.
  - The core challenge was ==how to define "the style was applied well" numerically==.
  - Defining an encoder that extracts style information from motion could not guarantee generalization, because no large-scale paired dataset was available.
  - So, instead of defining style as an explicit loss function, I proposed a style-application method that relies on the prior knowledge of a pretrained model and on the attention operation.
- PoC: designed a motion content encoder and a style generator as a ==2-stage latent VAE== model.
  - Stage 1: train a motion-only VAE
  - Stage 2: train the content encoder and the style generator jointly
Stage 2 detailed logic
flowchart LR
  subgraph Stage2["Stage 2: Content Encoder + Style Generator"]
    x["Content Motion x"] --> E["Frozen Stage-1 VAE Encoder"]
    E --> z["Motion Latent z"]
    z --> CE["Content Encoder"]
    CE --> zc["Style-invariant Content Latent z_c"]
    y["Target Style Label"] --> CLIP["CLIP Text Encoder"]
    CLIP --> s["Style Embedding s"]
    subgraph SG["Style Generator"]
      zc --> SA["Content-aware Attention"]
      s --> TA["Style-conditioned Tokens / Attention"]
      TA --> F["Latent Fusion Transformer"]
      SA --> F
    end
    F --> zhat["Stylized Latent z_hat"]
    zhat --> D["Frozen Stage-1 VAE Decoder"]
    D --> xt["Stylized Motion x_hat"]
  end
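The Stage 2 data flow can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `Stage2Stylizer`, the layer sizes, and the single `nn.TransformerDecoderLayer` that stands in for the whole style generator are all hypothetical, the VAE encoder and decoder are passed in as plain modules, and `style_emb` is a placeholder for the CLIP text embedding of the style label.

```python
import torch
import torch.nn as nn


class Stage2Stylizer(nn.Module):
    """Hypothetical Stage-2 wrapper: frozen Stage-1 VAE, trainable content encoder and style generator."""

    def __init__(self, vae_encoder: nn.Module, vae_decoder: nn.Module,
                 latent_dim: int = 64, style_dim: int = 32, n_heads: int = 4):
        super().__init__()
        self.vae_encoder = vae_encoder
        self.vae_decoder = vae_decoder
        # Freeze the Stage-1 weights so only the Stage-2 modules are updated.
        for p in list(vae_encoder.parameters()) + list(vae_decoder.parameters()):
            p.requires_grad_(False)
        # Content encoder: maps the motion latent z to a content latent z_c.
        self.content_encoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )
        # Style generator (simplified): project the style embedding to one token
        # and let the content tokens cross-attend to it.
        self.style_proj = nn.Linear(style_dim, latent_dim)
        self.fusion = nn.TransformerDecoderLayer(
            d_model=latent_dim, nhead=n_heads, dim_feedforward=4 * latent_dim,
            dropout=0.0, batch_first=True,
        )

    def forward(self, x: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.vae_encoder(x)                    # (B, T, latent_dim)
        z_c = self.content_encoder(z)                  # content latent z_c
        s = self.style_proj(style_emb).unsqueeze(1)    # (B, 1, latent_dim) style token
        z_hat = self.fusion(tgt=z_c, memory=s)         # stylized latent z_hat
        return self.vae_decoder(z_hat)                 # stylized motion x_hat
```

Gradients still flow through the frozen decoder back to `z_hat`, so a loss on the stylized motion trains only the content encoder and the style generator.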
- ๊ณ ๋„ํ™” : Diffusion model์˜ ==์ƒ˜ํ”Œ๋ง ์ „๋žต์„ ํ™œ์šฉ==ํ•˜๋Š” motion style transfer๋กœ ํ™•์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.
flowchart LR
  content["Content Motion"] --> inversion["DDIM Inversion<br/>+ Cached Decoder States"]
  styleChoice["User Style Selection"] --> match["Style Candidate"]
  match --> styleMotion["Style Motion Selection"]
  styleMotion --> sampling["Mixed-Attention Sampling"]
  inversion --> sampling
  sampling --> output["Stylized Motion"]
  - Even with prior knowledge available, ==training stability== dropped when the encoder and generator were trained jointly, as in the PoC.
  - I proposed a strategy that still uses the diffusion model's prior knowledge but applies the style during the ==sampling process==.
  - I implemented ==Mixed Attention==, a modified attention operation that explicitly exchanges information between the content motion and the style motion.
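DDIM inversion, the first step of the sampling-time pipeline, can be sketched as below. This is a generic illustration of deterministic DDIM (eta = 0), not the project's implementation: `ddim_step`, `ddim_invert`, `ddim_sample`, the `eps_model(x, i)` callable, and the `alpha_bars` schedule are assumed names, and caching the decoder states during inversion (for example with forward hooks) is left out.

```python
import torch


def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """One deterministic DDIM update between two noise levels, given predicted noise eps."""
    x0_pred = (x - (1 - alpha_bar_from).sqrt() * eps) / alpha_bar_from.sqrt()
    return alpha_bar_to.sqrt() * x0_pred + (1 - alpha_bar_to).sqrt() * eps


def ddim_invert(x0, eps_model, alpha_bars):
    """Walk from the clean motion toward noise; traj[i] is the sample at noise level i."""
    traj = [x0]
    x = x0
    for i in range(len(alpha_bars) - 1):
        eps = eps_model(x, i)
        x = ddim_step(x, eps, alpha_bars[i], alpha_bars[i + 1])
        traj.append(x)
    return traj


def ddim_sample(x_T, eps_model, alpha_bars):
    """Walk back from the noisiest level to level 0 with the same update rule."""
    x = x_T
    for i in range(len(alpha_bars) - 1, 0, -1):
        eps = eps_model(x, i)
        x = ddim_step(x, eps, alpha_bars[i], alpha_bars[i - 1])
    return x
```

Here `alpha_bars` is a 1-D tensor of cumulative noise-schedule products, ordered from the cleanest level to the noisiest. With a real noise predictor the round trip is only approximate, which is one reason to reuse states recorded during inversion when sampling.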
Mixed attention detailed logic
flowchart TD
  X["Current mixed sample at timestep t"] --> D{"Is decoder layer i the first (index = 0) layer?"}
  D -->|"Yes"| N["Use current state only<br/>tgt = current output<br/>memory = current state"]
  D -->|"No, mid DDIM steps"| L{"Layer idx"}
  L -->|"layer idx = 1..5"| S1["Style-guided cross attention<br/>tgt = current output<br/>memory = cached style kv_l(t)"]
  L -->|"layer idx = 6..7"| S2["Content-style mixed attention<br/>tgt = w * cached content q_l(t) + (1-w) * current output<br/>memory = cached style kv_l(t)"]
  N --> T["TransformerDecoderLayer(tgt, memory)"]
  S1 --> T
  S2 --> T
  T --> Y["Next decoder layer / next DDIM step"]
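The per-layer routing in the flowchart can be sketched as a small function that picks `tgt` and `memory` for each decoder layer. The names `route_attention_inputs` and `run_decoder`, the cache layout, and the `mid_step` flag are assumptions made for illustration. In particular, the flowchart does not say what happens outside the mid DDIM steps, so falling back to the current state there is a guess, as is reading "current state" as the hidden state that enters the decoder.

```python
import torch
import torch.nn as nn


def route_attention_inputs(layer_idx, current_output, current_state,
                           cached_content_q, cached_style_kv, w=0.5, mid_step=True):
    """Return (tgt, memory) for one decoder layer, following the flowchart branches."""
    if layer_idx == 0 or not mid_step:
        # First layer (or, by assumption, non-mid steps): no style injection.
        return current_output, current_state
    if 1 <= layer_idx <= 5:
        # Style-guided cross attention: attend to the cached style features.
        return current_output, cached_style_kv
    if 6 <= layer_idx <= 7:
        # Content-style mixed attention: blend cached content into the target.
        tgt = w * cached_content_q + (1.0 - w) * current_output
        return tgt, cached_style_kv
    raise ValueError(f"unexpected decoder layer index: {layer_idx}")


def run_decoder(layers, x, cache, w=0.5, mid_step=True):
    """Run all decoder layers for one DDIM step with the routing above."""
    out = x
    for i, layer in enumerate(layers):
        tgt, memory = route_attention_inputs(
            i, out, x, cache["content_q"][i], cache["style_kv"][i], w, mid_step
        )
        out = layer(tgt, memory)  # nn.TransformerDecoderLayer(tgt, memory)
    return out
```

In this sketch `w` controls how strongly the cached content is pulled into the last two layers: `w = 1.0` uses only the cached content target, and `w = 0.0` uses only the current output.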
- Designed a ==user-friendly input/output structure== to make the model more accessible.
  - I found edge cases where inference quality depends on the motion the user provides.
    - Example: when the content motion and the style motion differ greatly, such as walking versus lying down, differences in global translation and velocity can produce artifacts.
  - I kept a simple input structure in which the user picks a style as a label, which keeps the model easy to use.
  - Internally, the model caches kinematic features of the style motions and, when a request arrives, selects the style motion most similar to the input motion.
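The cached-feature lookup described in the last bullet could look roughly like the sketch below. The feature choice (mean horizontal root speed and mean root height), the Euclidean distance, and the names `kinematic_features` and `select_style_motion` are all assumptions made for illustration; the original does not state which kinematic features or which similarity measure it uses.

```python
import math


def kinematic_features(root_positions, fps=30.0):
    """Summarise a motion as (mean horizontal root speed, mean root height).

    root_positions is a list of (x, y, z) root coordinates per frame, with y up.
    """
    speeds = [
        math.hypot(b[0] - a[0], b[2] - a[2]) * fps
        for a, b in zip(root_positions, root_positions[1:])
    ]
    mean_speed = sum(speeds) / len(speeds)
    mean_height = sum(p[1] for p in root_positions) / len(root_positions)
    return (mean_speed, mean_height)


def select_style_motion(content_positions, label, cache):
    """Pick the cached style clip for `label` whose features are closest to the input motion."""
    query = kinematic_features(content_positions)
    clip_id, _ = min(cache[label], key=lambda item: math.dist(query, item[1]))
    return clip_id


# Build the cache once, offline: style label -> list of (clip id, features).
walk = [(0.04 * i, 0.9, 0.0) for i in range(30)]
lie = [(0.0, 0.2, 0.0) for _ in range(30)]
style_cache = {
    "old": [("old_walk", kinematic_features(walk)), ("old_lie", kinematic_features(lie))]
}

# At request time the user only supplies a label; the clip is chosen internally.
user_motion = [(0.035 * i, 0.95, 0.0) for i in range(30)]
chosen = select_style_motion(user_motion, "old", style_cache)
```

Matching a walking input to a walking style clip rather than a lying one avoids the large global-translation and velocity mismatch noted in the edge case above.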