Introduction > Installation > Video Instructions
Individual: 2D image: On the workflow (ComfyUI)

- ◯
- Overview
This is a conversation flow created in the GUI workflow creation environment (ComfyUI).
The client needs no special devices or apps; conversations can be held entirely within this environment. (A minimal sketch of driving the workflow through the ComfyUI API follows the notes below.)
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_cmf_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_appcmf
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: Anneli / kaunista (Style-Bert-VITS2) [※F]
- ・
- Character (image generation): Pre-trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)
- ※F
- The natural intonation of this model demonstrates the power of SBV2, but the source of its training data is unknown - if any similarity to an existing performer is identified, we will replace it with another model.
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
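As a rough illustration of how such a flow can be driven programmatically, the sketch below queues a workflow on a running ComfyUI server via its /prompt endpoint. It assumes the default server address (127.0.0.1:8188) and a workflow exported in API format as workflow_api.json; it is not part of the distributed cnnmmd_xoxxox_* flows, only a minimal example.

```python
# Minimal sketch: queue a workflow on a running ComfyUI server.
# Assumptions: default address 127.0.0.1:8188 and a workflow exported in
# API format as "workflow_api.json" (both are placeholders, not project files).
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"

def queue_workflow(path: str) -> dict:
    """POST a workflow (API-format JSON) to ComfyUI's /prompt endpoint."""
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as res:
        return json.load(res)  # response includes the prompt_id of the queued run

if __name__ == "__main__":
    print(queue_workflow("workflow_api.json"))
```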
Individual: 2D image: Desktop (Electron)

- ◯
- Overview
This is the conversation flow on a computer desktop.
This development environment allows web-browser code to be reused almost as is, so libraries and settings are shared.
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_web_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkelc
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: Shikoku Metan: Whisper (VOICEVOX)
- ・
- Character (image generation): Pre-trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
Individual: 2D image: Web browser (PC/mobile device)

- ◯
- Overview
This is the conversation flow in a web browser on a PC or mobile device.
Security is stricter on mobile devices such as smartphones, so appropriate measures are required on the server side (a minimal sketch of one such measure follows the notes below).
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_web_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkweb
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: male28 / 852 (Style-Bert-VITS2)
- ・
- Character (image generation): Pre-trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
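Regarding the server-side measures mentioned in the overview: mobile browsers generally expose the microphone (getUserMedia) only to pages loaded from a secure context, so the client files usually need to be served over HTTPS. Below is a minimal sketch of that idea using Python's standard library; the certificate file names and port are placeholders, and the project's own servers may handle TLS differently.

```python
# Minimal sketch: serve the browser client over HTTPS so that mobile browsers
# allow microphone access. "server.crt" / "server.key" are placeholder file names
# (e.g. a self-signed or mkcert certificate).
import http.server
import ssl

ADDRESS = ("0.0.0.0", 8443)

httpd = http.server.HTTPServer(ADDRESS, http.server.SimpleHTTPRequestHandler)

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.crt", keyfile="server.key")
httpd.socket = context.wrap_socket(httpd.socket, server_side=True)

print(f"Serving client files at https://{ADDRESS[0]}:{ADDRESS[1]}/")
httpd.serve_forever()
```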
Individual: 3D model: VR/MR (Unity)


- *
- Top row: A DAZ 3D model and morphs are used, with VRoid hair and a VRM cel-look shader applied. In addition to lip syncing, configurable joints, physics calculations, and an active ragdoll are also included.
- *
- Bottom row: A VRoid character walking autonomously via reinforcement learning (ML-Agents) - not a fixed animation, but actually walking on two legs under gravity.
- ◯
- Overview
Although it is not shown in the video, the game engine (Unity) can handle a variety of 3D model characters. [※1][※2][※3][※4]
- ◯
- Configuration/Usage
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_uni_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkuni
- ◯
- Rights: Use
- ・
- Character (3D model): G8F (DAZ)
- ・
- Character (3D model: morph): Sakura 8 (DAZ)
- ・
- Character (3D model): VRoid
- ・
- Character (3D model: morph): original
- ※1
- Models exported from various character generators (DAZ Studio, VRoid Studio, etc.) are supported. By applying different shaders you can get anything from realistic to anime-style rendering, and by introducing morphs you can customize the characters and have them express emotions in-game.
- ※2
- Configurable joints and physics calculations allow you to touch characters, and you can even copy animations to make them move (active ragdolls). Using a reinforcement learning library (ML-Agents), characters can also act autonomously.
- ※3
- The apps you create can be run on standalone VR/MR devices, etc.
- ※4
- However, even something as simple as lip syncing requires some preparation beforehand, since the setup varies with the character's source (including your own creations). On the other hand, with the appropriate knowledge you can set up triggers/actions and make the characters behave in a wide variety of ways.
Individual: 3D model (anime): Screen sharing/tablet (Virt-A-Mate + iPad + Splashtop)

- ◯
- Overview
This is an example of a configuration in which the character is controlled from a tablet (rather than from the computer screen).
A streamer app is needed on the PC and a receiver app on the tablet, but this lets you control the character with the handiness of a tablet. [※1][※2][※3]
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_vam_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkvam
- ・
- Remote control app (streamer/receiver): Splashtop
- ・
- Key binder: AutoHotKey
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: merge28_ds / 852 (Style-Bert-VITS2) [※G]
- ・
- Character (3D model): G2F (DAZ)
- ・
- Character (3D model: morph): Mona / Genshin Impact (VaM)
- ※1
- The computer screen is rotated 90 degrees to match the tablet's vertical aspect ratio.
- ※2
- Controlling the computer from the tablet is done by moving the mouse cursor, but the middle mouse button is generally not supported by remote-control apps. In this case we worked around it by changing the key bindings: the 3D app on the computer can also move the camera up, down, left, and right from the keyboard (in addition to the middle mouse button), so by mapping these controls to the arrow keys the camera can be operated from the UI provided on the tablet.
- ※G
- This character is a copyrighted work appearing in an existing game (Genshin Impact), but we decided to use it because: (1) the game's developer (Mihoyo) has officially released the 3D model; (2) this is a video made using the data, not a redistribution of the data; and (3) the expression is unlikely to detract from the character's dignity. (The voice is also synthesized; although it may sound close to the original, no actual performer's voice was used as training data.)
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
Individual: 3D model (anime): VR/MR (Virt-A-Mate + Quest)

- ◯
- Overview
This is the conversation flow for VR/MR.
This app (Virt-A-Mate) is an execution environment that runs 3D models from a character generator (DAZ) in real time. It also has simple trigger/action functions, and plug-ins can be used as well. [※1][※2]
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_vam_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkvam
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: merge28_ds / 852 (Style-Bert-VITS2) [※G]
- ・
- Character (3D model): G2F (DAZ)
- ・
- Character (3D model: morph): Mona / Genshin Impact (VaM)
- ※1
- The 3D models have reasonably plausible physics applied to them, making them feel almost like real figures. The default shaders are realistic, but a volunteer has published a plugin for an anime-style (cel-look) effect, and that shader is used in the video.
- ※2
- Because VaM is based on Unity, plugins can also be written in C# (the plugin interpreter appears to use a dynamic execution engine such as Roslyn), although dedicated classes must normally be used. In a case like this conversation flow, which involves only network, thread, microphone, and speaker operations, Unity code can be reused almost as is. Note, however, that security inside the environment is tightened to comply with the license of the 3D models (DAZ), so only low-level libraries are available.
- ※G
- This character is a copyrighted work appearing in an existing game (Genshin Impact), but we decided to use it because: (1) the game's developer (Mihoyo) has officially released the 3D model; (2) this is a video made using the data, not a redistribution of the data; and (3) the expression is unlikely to detract from the character's dignity. (The voice is also synthesized; although it may sound close to the original, no actual performer's voice was used as training data.)
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
Individual: 3D model (realistic): VR/MR (Virt-A-Mate + Quest)

- ◯
- Overview
This is the conversation flow in VR/MR. [※1]
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_vam_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkvam
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: 170sec28_B / 852 (Style-Bert-VITS2)
- ・
- Character (3D model): G2F (DAZ)
- ・
- Character (3D model: morph): original (VaM)
- ・
- Clothes (3D model): YameteOuji
- ※1
- In this video, the rendering quality has been lowered - resolution, hair physics, cloth simulation, etc. (due to the low GPU performance of the environment, as well as the load of recording and communication).
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
Individual: 2D image: Microcontroller + LCD (M5Stack)

- ◯
- Overview
This is the conversation flow on the microcontroller screen.
This device comes with a speaker and a screen (LCD), so all you need to do is add a microphone to start having conversations.
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_mcu_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkmcu
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: Anneli / kaunista (Style-Bert-VITS2) [※F]
- ・
- Character (image generation): Pre-trained models: Animagine XL / cagliostrolab / linaqruf (Stable Diffusion XL)
- ※F
- The natural intonation of this model demonstrates the power of SBV2, but the source of its training data is unknown - if any similarity to an existing performer is identified, we will replace it with another model.
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
Individual: Figure + Microcontroller (M5Stack) ~ Tracking a person's face with object recognition

- ◯
- Overview
This is the conversation flow with a figure.
Lip-syncing is expressed by lighting a three-color LED, and attention to the user's face is expressed by recognizing the face with the camera (object recognition) and turning the neck with a motor. [※1][※2][※3] (A minimal sketch of the lip-sync signal follows the notes below.)
- ◯
- Configuration/Usage [※D]
- ・
- Server startup & workflow: cnnmmd_xoxxox_mgrcmf_fig_tlk_wsp_vox_lcp_001
- ・
- Client settings ~ Start: cnnmmd_xoxxox_tlkfig
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: kouon28 / 852 (Style-Bert-VITS2)
- ・
- Character (figure): Hatsune Miku (Nendoroid Doll)
- ・
- Character (figure: facial parts): Custom Head (Nendoroid Doll)
- ・
- Clothes (fabric clothing): JANAYA
- ※1
- To avoid altering the figure, everything is attached to the frame (the figure is secured to the frame with vinyl ties and adhesive rubber). The exception is the area around the mouth, where holes must be drilled so that the LED inserted inside the head can shine through; this requires a separately sold face part.
- ※2
- Since the body is made of PVC, the clothes have been replaced with white ones made by volunteers to prevent color transfer.
- ※3
- Lego blocks are used for the frame because the physical mounting on the microcontroller side (M5Stack) is designed to connect to these blocks.
- ※D
- This workflow is a simplified version for this environment (a minimal flow that can run even on a low-performance PC). To reproduce what is shown in the video, check the resources it uses (obtaining them if necessary) and rearrange the servers and flows you start accordingly.
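As a rough sketch of the LED lip-sync described in the overview: the loudness of the synthesized reply can be converted into a per-frame "mouth openness" value and sent to the device, which maps it to LED brightness. The WAV file name, device address, and /mouth endpoint below are hypothetical stand-ins, not the project's actual client protocol.

```python
# Minimal host-side sketch of the LED lip-sync idea (not the project's actual code).
# Assumptions: the synthesized reply is a 16-bit mono WAV file, and the microcontroller
# exposes a hypothetical HTTP endpoint (/mouth?level=...) that drives the LED.
import array
import time
import urllib.request
import wave

DEVICE_URL = "http://192.168.0.50"   # hypothetical M5Stack address
FRAME_SEC = 0.05                     # update the LED every 50 ms

def mouth_levels(path: str):
    """Yield a 0-255 'mouth openness' level per frame, based on RMS loudness."""
    with wave.open(path, "rb") as wav:
        samples_per_frame = int(wav.getframerate() * FRAME_SEC)
        while True:
            raw = wav.readframes(samples_per_frame)
            if not raw:
                break
            pcm = array.array("h", raw)  # 16-bit signed samples
            rms = (sum(s * s for s in pcm) / len(pcm)) ** 0.5
            yield min(255, int(rms / 32768 * 255 * 4))  # crude gain so quiet speech is visible

def play_lipsync(path: str) -> None:
    for level in mouth_levels(path):
        urllib.request.urlopen(f"{DEVICE_URL}/mouth?level={level}", timeout=1).close()
        time.sleep(FRAME_SEC)

if __name__ == "__main__":
    play_lipsync("response.wav")  # hypothetical synthesized reply
```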
Individual: 2D image generation (ComfyUI) - Reflecting sentiment analysis of responses in prompts

- ◯
- Overview
This is the conversation flow with real-time image generation.
The replies in the conversation are analyzed for sentiment, and the result is reflected in the prompt used to generate the character image - this way, a character image matching the current emotion is produced. [※1][※2][※3]
- ◯
- Rights: Use
- ・
- Speech data (speech synthesis): Trained model: male28 / 852 (Style-Bert-VITS2)
- ・
- Character (image generation): Trained model: Counterfeit-V3.0 / gsdf (Stable Diffusion)
- ※1
- Phrases expressing the detected emotion (such as "smile" or "sad") are inserted into the prompt based on the analysis results (see the sketch after these notes).
- ※2
- To speed up generation, the model is SD 1.5 + LCM. In this example images are generated from prompts alone, but LoRA, IP-Adapter, or ControlNet (OpenPose / Scribble) can be used to stabilize the character's appearance and pose to some extent; there is, however, a trade-off between quality and speed (the output can also be turned into video with AnimateDiff and the like, at a further cost in quality and speed).
- ※3
- Local diffusion and flow-based models are currently (as of June 2025) capable of generating high-quality video - however, video generation still requires considerable resources and time.
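As a rough sketch of note ※1: the reply's sentiment label is simply mapped to a phrase that is appended to the image-generation prompt. The label set, base prompt, and stand-in analyzer below are illustrative placeholders; in the actual flow, the label would come from the workflow's own analysis step.

```python
# Minimal sketch: turn a sentiment label from the reply text into a prompt phrase.
# BASE_PROMPT, EMOTION_PHRASES, and the stand-in analyzer are placeholders.
BASE_PROMPT = "1girl, solo, upper body, looking at viewer"

EMOTION_PHRASES = {
    "joy":     "smile, cheerful expression",
    "sadness": "sad, teary eyes",
    "anger":   "angry, furrowed brows",
    "neutral": "calm expression",
}

def build_prompt(reply_text: str, analyze) -> str:
    """Append the emotion phrase chosen from the analyzer's label to the base prompt."""
    label = analyze(reply_text)  # e.g. "joy"
    phrase = EMOTION_PHRASES.get(label, EMOTION_PHRASES["neutral"])
    return f"{BASE_PROMPT}, {phrase}"

if __name__ == "__main__":
    # Stand-in analyzer; a real flow would use an LLM or a sentiment classifier here.
    fake_analyze = lambda text: "joy" if "!" in text else "neutral"
    print(build_prompt("That sounds great!", fake_analyze))
```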
Overall
These videos are not direct recordings of actual conversation turns.
Based on the text and audio data recorded during the conversations, the flow was modified for the demos and only the parts suitable for public viewing were played back - this is because the responses of generative AI are nondeterministic, and in some cases audio cannot be recorded due to device or app limitations. Some parts were also slightly revised to remove grammatical errors and awkward voice synthesis. [※1]
- ※1
- This tool supports not only flexible conversations with AI, but also standardized input and output, such as recording output text or audio to a file or replaying existing text or audio lists - allowing you to customize the workflow to suit the situation and your needs.