
cnnmmd

Individual: 2D Image: Workflow (ComfyUI)


images

overview

This is the conversation flow in a GUI workflow environment (ComfyUI).

There is no need to prepare a special client application; you can start a conversation using just this environment.

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_cmf_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_appcmf
Rights: Use
Voice data (voice synthesis): Trained model: Anneli / kaunista (Style-Bert-VITS2)
Character (image generation): Trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)

※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.

Individual: 2D Image: Desktop (Electron)


images

overview

This is the conversation flow on a computer desktop.

This development environment (Electron) allows code written for web browsers to be reused almost as is, so libraries and settings are shared with the web-browser version.

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_web_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkelc
Rights: Use
Voice data (voice synthesis): Trained model: Shikoku Metan: Whisper (VOICEVOX)
Character (image generation): Trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)

※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.

Individual: 2D Image: Web browser (PC/mobile device)


images

overview

This is the conversation flow in a web browser on a PC or mobile device.

Browser security is stricter on mobile devices such as smartphones, so appropriate measures are required on the server side (for example, serving the page over HTTPS; see the sketch at the end of this section).

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_web_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkweb
Rights: Use
Speech data (speech synthesis): Trained model: male28 / 852 (Style-Bert-VITS2)
Character (image generation): Trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)

※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.
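
One of the measures referred to above is serving the client page over HTTPS: mobile browsers generally require a secure context before they grant microphone access. Below is a minimal sketch using only the Python standard library; the port, the certificate file names, and the choice to terminate TLS in Python itself (rather than behind a reverse proxy) are assumptions for illustration, not the actual cnnmmd_xoxxox_tlkweb setup.

# Minimal HTTPS static server (illustration only).
# Assumes a self-signed certificate/key pair, e.g. created with:
#   openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 365
import http.server
import ssl

HOST, PORT = "0.0.0.0", 8443          # assumed port; adjust to your environment
CERT, KEY = "cert.pem", "key.pem"     # assumed file names

handler = http.server.SimpleHTTPRequestHandler   # serves files from the current directory
httpd = http.server.HTTPServer((HOST, PORT), handler)

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile=CERT, keyfile=KEY)
httpd.socket = context.wrap_socket(httpd.socket, server_side=True)

print(f"Serving client files at https://{HOST}:{PORT}/")
httpd.serve_forever()

With a self-signed certificate, the mobile browser will show a warning that has to be accepted (or the certificate installed as trusted) before the microphone becomes available; any WebSocket connections likewise need to switch to wss://.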

Individual: 3D Model: VR/MR (Unity)


images
images

*
Top: 3D models and morphs from DAZ are introduced, and hair from VRoid and a VRM cel-look shader are applied. In addition to lip sync, configurable joints, physics calculations, and an active ragdoll are also used.
*
Bottom: A VRoid character walking autonomously through reinforcement learning (ML-Agents) (not a fixed animation, but actually walking on two legs against gravity).

overview

Although it is not shown in the video, the game engine (Unity) can handle a variety of 3D model characters. [※1][※2][※3][※4]

Configuration/Usage
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_uni_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkuni
Rights: Use
Character (3D model): G8F (DAZ)
Character (3D model: morph): Sakura 8 (DAZ)
Character (3D model): VRoid
Character (3D model: morph): original

※1
Models can be exported from multiple character generators (DAZ Studio, VRoid Studio, ...). By applying different shaders, you can draw anything from realistic to anime-style, and by introducing morphs you can customize characters and express emotions in the game.
※2
Characters can be touched by using configurable joints and physics calculations, and can be moved by copying animations (active ragdoll). By using a library for reinforcement learning (ML-Agents), characters can also act autonomously.
※3
The apps you create can run on standalone VR/MR devices, etc.
※4
However, even something as simple as lip syncing requires different handling depending on the source of the character (including homemade characters), so some advance preparation is required. On the other hand, with the appropriate knowledge you can set triggers/actions and make the character behave in a variety of ways.

Individual: 3D Model (Anime): VR/MR (Virt-A-Mate + Quest)


images

overview

Conversation flow for VR/MR.

This app (Virt-A-Mate) is an execution environment that runs 3D models from a character generator (DAZ) in real time. It also has simple trigger/action functions, which can be used via plug-ins. [※1][※2]

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_vam_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkvam
Rights: Use [※3]
Speech data (speech synthesis): Trained model: merge28_ds / 852 (Style-Bert-VITS2)
Character (3D model): G2F (DAZ)
Character (3D model: morph): Mona / Genshin (VaM)

※1
The 3D model has fairly appropriate physics applied to it, so it can be handled almost like a real figure. The default shader is realistic only, but a volunteer has provided an anime-style (cel-look) shader plug-in, which is used in the video.
※2
VaM is based on Unity, so plug-ins can also be written in C# (the plug-in interpreter appears to use a dynamic execution engine such as Roslyn), but dedicated classes must be used. In a case like this conversation, however, since we only manipulate the network, threads, microphone, and speaker, code written for Unity can be used almost as is (although, because security within the environment is strengthened to comply with the license for the 3D models (DAZ), only low-level libraries can be used).
※3
The character is a copyrighted work appearing in an existing game ("Genshin Impact"), but we decided to use it because: (1) the game's production company (miHoYo) has officially released the 3D models; (2) the video only uses the data (the data itself is not redistributed); and (3) the expression is unlikely to damage the character's dignity. (The audio is also synthesized speech that is believed to be close to the original; no live audio was used as training data.)
※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.

Individual: 3D model (realistic): VR/MR (Virt-A-Mate + Quest)


images

overview

This is the conversation flow in VR/MR. [※1]

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_vam_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkvam
Rights: Use
Speech data (speech synthesis): Trained model: 170sec28_B / 852 (Style-Bert-VITS2)
Character (3D model): G2F (DAZ)
Character (3D model: morph): original (VaM)
Clothes (3D model): YameteOuji

※1
For this video, the rendering quality has been lowered: resolution, hair physics, cloth simulation, etc. (This was due to the low GPU performance of the environment, as well as the load of recording and communication.)
※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.

Individual: 2D Image: Microcontroller + LCD (M5Stack)


images

overview

This is the conversation flow on the microcontroller screen.

This device comes with a speaker and a screen (LCD), so all you need to do is add a microphone to be able to have conversations.

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_mcu_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkmcu
Rights: Use
Voice data (voice synthesis): Trained model: Anneli / kaunista (Style-Bert-VITS2)
Character (image generation): Trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)

※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.

Individual: Figure + Microcontroller (M5Stack) - Tracking a person's face with object recognition


images

overview

This is the conversation flow with a figure.

Lip-syncing is achieved by lighting a three-color LED, and attention to the user's face is expressed by recognizing the face with the camera (object recognition) and rotating the head with a motor (see the sketch at the end of this section). [※1][※2][※3]

Configuration/Usage [※D]
Server startup & workflow: cnnmmd_xoxxox_mgrcmf_fig_tlk_wsp_vox_gpt_001
Client settings and startup: cnnmmd_xoxxox_tlkfig
Rights: Use
Voice data (voice synthesis): Trained model: kouon28 / 852 (Style-Bert-VITS2)
Character (figure): Hatsune Miku (Nendoroid Doll)
Character (figure: face parts): Custom head (Nendoroid Doll)
Clothes (cloth clothes): JANAYA

※1
In order to avoid modifying the figure, everything is attached to the frame (the figure is fixed to the frame with plastic ties and adhesive rubber). The exception is the area around the mouth, where holes had to be drilled so that the LED inserted inside the head could be seen; a separately sold face part is used for this.
※2
Since the body is made of PVC, the clothes have been replaced with white ones made by volunteers to prevent color transfer.
※3
The reason Lego blocks are used for the frame is that the physical connections on the microcontroller side (M5Stack) are designed for these blocks.
※D
This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To match the content of the video, you will need to refer to the resources used (obtaining them if necessary) and rearrange the servers and the flow to be launched.
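
As a rough illustration of the face-tracking mechanism described above, the sketch below detects a face with OpenCV, computes how far it sits from the centre of the camera image, and converts that offset into a pan angle sent to the microcontroller. This is not the actual cnnmmd_xoxxox_tlkfig implementation: the serial port, the camera field of view, and the "PAN" command format are hypothetical stand-ins.

# Host-side illustration of "face recognition -> head rotation" (not the actual
# cnnmmd_xoxxox_tlkfig code; the PAN command is a hypothetical stand-in).
import cv2      # opencv-python
import serial   # pyserial; assumed USB-serial link to the M5Stack

PORT = "/dev/ttyUSB0"   # assumed serial port
FOV_DEG = 60.0          # assumed horizontal field of view of the camera

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)
link = serial.Serial(PORT, 115200, timeout=0.1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
        # horizontal offset of the face centre from the image centre, in [-0.5, 0.5]
        offset = (x + w / 2) / frame.shape[1] - 0.5
        angle = offset * FOV_DEG                             # rough pan angle toward the face
        link.write(f"PAN {angle:.1f}\n".encode())            # hypothetical motor command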

Individual: 2D image generation (ComfyUI) - Reflecting sentiment analysis of responses in prompts


images

overview

Conversation flow with real-time image generation.

The responses from the conversation are analyzed, and the results are reflected in the prompts used to generate the character image -- this allows the character image to be generated according to the emotion of the moment (a sketch of the idea follows the notes below). [※1][※2][※3]

Rights: Use
Speech data (speech synthesis): Trained model: male28 / 852 (Style-Bert-VITS2)
Character (image generation): Trained model: Counterfeit-V3.0 / gsdf (Stable Diffusion)

※1
Phrases that express emotions based on the analysis results (such as "smile" or "sad") are inserted into the prompt.
※2
To speed up generation, the model uses SD 1.5 + LCM. In this example, images are generated using only prompts, but you can use LoRA / IP-Adapter / ControlNet (OpenPose / Scribble) to stabilize the character's appearance and posture. However, there is a trade-off between quality and speed (you can also use AnimateDiff to turn the output into video, but quality and speed will drop further).
※3
Local diffusion and flow-based models are now (as of June 2025) capable of generating high-quality video, although video generation still requires significant resources and time.
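
The sketch below illustrates the mechanism described in this section: classify the emotion of the reply, pick a matching phrase, patch it into the positive prompt of a workflow exported from ComfyUI in API format, and queue it through ComfyUI's HTTP endpoint (POST /prompt). The keyword-based classifier, the phrase table, the workflow file name, and the assumption that node "6" holds the positive prompt are placeholders; the actual cnnmmd flow may be wired quite differently (for example, inside the workflow itself).

# Sketch of "reply sentiment -> emotion phrase -> image prompt" (illustration only;
# the emotion labels, phrase table, and workflow node id are assumptions).
import json
import urllib.request

EMOTION_PHRASES = {                     # phrase inserted into the prompt per detected emotion
    "joy": "smile, cheerful expression",
    "sadness": "sad, teary eyes",
    "anger": "angry, furrowed brows",
    "neutral": "calm expression",
}

def classify_emotion(text: str) -> str:
    # Placeholder sentiment analysis; the real flow would call an LLM or a
    # dedicated classifier here.
    if any(w in text for w in ("happy", "glad", "great")):
        return "joy"
    if any(w in text for w in ("sorry", "sad", "unfortunately")):
        return "sadness"
    return "neutral"

def queue_image(reply: str, workflow_path: str = "workflow_api.json",
                host: str = "127.0.0.1:8188") -> None:
    # Load a workflow exported with ComfyUI's "Save (API Format)" and patch the
    # positive-prompt node; the node id "6" is an assumption about that file.
    with open(workflow_path, encoding="utf-8") as f:
        workflow = json.load(f)
    phrase = EMOTION_PHRASES[classify_emotion(reply)]
    workflow["6"]["inputs"]["text"] = f"1girl, {phrase}, masterpiece"
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

queue_image("I'm glad you liked it!")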

Overall


This video is not a direct recording of an actual conversation.

Based on the text and audio data recorded during the conversations, the flow was changed for the demo, and only the parts suitable for public viewing were played back -- this is because the responses of the generative AI are random, and in some cases audio could not be recorded due to equipment or app limitations. Some parts were also slightly revised to eliminate grammatical errors and awkward voice synthesis. [※1]


※1
This tool supports not only flexible conversations with AI, but also standard input and output, such as recording output text or audio to a file or playing back existing text or audio lists -- allowing you to flexibly reconfigure your workflow according to your situation and needs.