
cnnmmd

overview


cnnmmd is a tool that connects various characters with various AI engines.



*
The demo video's description (including rights information) is available here: dscmov

This tool features:

Various characters and AI engines can be flexibly combined. [※1]
Any number of triggers and actions can be set. [※2]
Parts can be selected and flows created (configuration and flow) using both a GUI and a CLI. [※3]
Everything can be customized, and plugins can be created and shared. [※4][※5]

*1
You can connect various characters (from homemade to ready-made; from virtual (2D images to 3D models, in web browsers, on desktops, or in VR/MR) to real (dolls and figures); etc.) with various AI engines (speech recognition, speech synthesis, language generation, emotion analysis, object recognition; local, remote, or cloud; CPU, GPU, or API; etc.).
*2
The basic flow is a one-to-one conversation (chat) between the user and a character, but many-to-many conversations between characters are also possible (memory function). Also, since there is no limit to the number of endpoints used for an interaction, various behaviors can be created by controlling triggers (changes in the environment) and actions (character movements): changing facial expressions based on emotion-analysis results, making the character react when touched, and so on.
*3
Conversation flows can be created visually, with no need to write code (ComfyUI version), while complex flows with branching and repetition can also be handled (script version). In either case, flows can be assembled and reassembled like building blocks: for example, you can send generated images to a microcontroller screen, or change a single conversion node to repurpose a flow built for a web browser so it works in a game engine (see the node sketch after these notes).
*4
You can customize every part of the tool and create and share plugins (even the core parts of the tool are plugins): you can incorporate your own models and servers, and run only the containers you need, individually. The procedures for installing, removing, starting, and stopping plugins are all unified (see the lifecycle sketch after these notes).
*5
These days, by using a language generation model, you can do everything yourself, from setting up servers to writing code (some of the plugins provided with this tool were created this way, and each is just one example): you can build a conversation app for your characters in whatever interface and style you prefer.
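
As a concrete reference for note ※3, this is the general shape of a custom node in ComfyUI's standard custom-node format. The node itself (a trivial text conversion) is a hypothetical stand-in for a real conversion node, not something this tool ships:

class ToUpperText:
    # Declare the node's input sockets; ComfyUI reads this classmethod.
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"multiline": True, "default": ""})}}

    RETURN_TYPES = ("STRING",)    # one output socket
    FUNCTION = "run"              # the method ComfyUI calls when the node executes
    CATEGORY = "cnnmmd/examples"  # hypothetical menu location

    def run(self, text):
        # Swap this body to retarget a flow (e.g., reshape a payload
        # meant for a web browser so a game engine can consume it).
        return (text.upper(),)

# ComfyUI discovers custom nodes through these module-level mappings.
NODE_CLASS_MAPPINGS = {"ToUpperText": ToUpperText}
NODE_DISPLAY_NAME_MAPPINGS = {"ToUpperText": "To Upper Text"}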
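
For note ※4's unified lifecycle, the actual plugin interface is not documented in this section, so the following is only a minimal sketch, assuming each plugin maps to one container; it uses the Docker SDK for Python, and the image/container names are hypothetical:

import docker  # Docker SDK for Python

class PluginLifecycle:
    """Sketch only: install/remove/start/stop as the same four container
    operations for every plugin (an assumption, not this tool's actual API)."""

    def __init__(self, image, name):
        self.client = docker.from_env()
        self.image = image  # e.g. a hypothetical "cnnmmd/voicevox" image
        self.name = name

    def install(self):
        self.client.images.pull(self.image)  # fetch the plugin's image

    def start(self):
        self.client.containers.run(self.image, name=self.name, detach=True)

    def stop(self):
        self.client.containers.get(self.name).stop()

    def remove(self):
        self.client.containers.get(self.name).remove()
        self.client.images.remove(self.image)

Because every plugin would go through the same four verbs, incorporating your own model or server amounts to pointing install() at a different image.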


audience


This tool is aimed at the following people: [※1]

I want to talk to my own characters
I want to create a system for conversation

*1
This tool is aimed at people interested not just in using, but also in building (there are already plenty of AI-powered chat apps, most of which offer a wealth of customization features; if that's all you need, you probably don't need this tool).

environment


Currently, connectivity has been verified for the following characters and AI engines:

Characters (= client device + app)
On Workflow (ComfyUI)

+ Mobile device + OS (iOS / Android) / PC + OS (Windows / macOS / Linux)
Web browser

+ Mobile device + OS (iOS / Android) / PC + OS (Windows / macOS / Linux)
Desktop app (Electron)

+ PC + OS (Windows / macOS / Linux)
Game engine (Unity)

+ PC + OS (Windows / macOS)
VR/MR app (Unity / Virt-A-Mate)

+ VR/MR equipment (Quest) + PC (Windows)
Microcontroller: Bare metal (M5Stack)

+ LCD/microphone (M5Stack)
Figures (Nendoroid Dolls)

+ Microcontroller: Bare metal (M5Stack) + Microphone/Camera/Speaker/Motor (M5Stack)
AI engines (= AI-related model operation/service usage)
Speech recognition model:

・Local (wav2vec2-large-japanese [GPU] / OpenAI: Whisper [CPU / GPU])

・Service (OpenAI: Whisper [API])
Speech synthesis model:

・Local (VOICEVOX [CPU] / Style-Bert-VITS2 [CPU / GPU])

・Service (NIJI Voice [API])
Language generation model:

・Local (Gemma 3n [CPU/GPU] / Vecteus-v1 [GPU])

・Services (OpenAI: GPT [API] / NovelAI [API])
Emotion analysis model:

・Local (luke-japanese-large-sentiment-analysis-wrime [GPU])

・Service (OpenAI: GPT [API])
Image generation model:

・Local (* [GPU])

・Service (OpenAI: DALL-E [API] / NovelAI [API])
Object recognition model:

・Local (OpenCV: Haar Cascades [CPU])
Relay server (= a server that relays communication between servers, and between servers and clients)
PC + OS (Windows (WSL2 (Linux (Ubuntu))))

+ Containers (Docker / Docker Desktop)
PC (Mac) + OS (macOS)

+ Containers (Docker Desktop)
PC + OS (Linux (Ubuntu))

+ Container (Docker)
Microcomputer (Raspberry Pi) + OS (Linux (Ubuntu))

+ Container (Docker)

*
The characters (= the devices and apps on which the characters run) have been tested for connectivity across media ranging from virtual to real, including images, video, 3D, VR, MR, and figures. The AI engines (= AI-related model operation/service usage) have been tested for speech recognition, speech synthesis, language generation, emotion analysis, image generation, and object recognition, covering CPU/GPU operation and API calls. The relay server that connects them runs in environments ranging from PCs to microcomputers (Raspberry Pi).

*
The code for the clients, servers, and engines is provided as samples only (correct operation is not guaranteed). What this tool itself provides is the creation of relay-server handlers and their workflows (i.e., flows that connect the handlers as nodes); a sketch of the idea follows.
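
The handler/workflow API is not spelled out in this section, so this is only a minimal sketch of the idea under assumed names: each handler is a coroutine that transforms one message, and a workflow is a chain of handlers connected like nodes:

import asyncio

# Hypothetical handlers, for illustration only: each one transforms a message.
async def recognize_speech(msg):
    msg["text"] = "(recognized text)"  # placeholder for a speech-recognition call
    return msg

async def generate_reply(msg):
    msg["reply"] = f"echo: {msg['text']}"  # placeholder for a language-model call
    return msg

async def run_workflow(msg, handlers):
    for handler in handlers:  # "connect the handlers as nodes"
        msg = await handler(msg)
    return msg

print(asyncio.run(run_workflow({"audio": b""}, [recognize_speech, generate_reply])))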

constraints


Currently, the following limitations exist:

A container is required to use the tool. [※1]
At this time, conversational connections are unstable. [※2]
Backward compatibility is not guaranteed.

*1
We use OS-level container technology (Docker) to prevent conflicts between apps and to allow various AI engines to be used without restrictions.
*2
To keep client code easy to implement even on low-performance devices (such as bare-metal microcontrollers), communication is limited to HTTP polling for now. Streaming will likewise be implemented in a simplified form first (WebSocket/WebRTC will be added later). If the timing doesn't line up, the connection between the server and the client will fail; in that case, you'll need to stop and restart the relay server or the GUI workflow-creation environment (ComfyUI). A sketch of the polling pattern follows.
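
As a minimal sketch of the polling pattern described in note ※2 (the server address, /poll endpoint, and client ID are assumptions for illustration, not this tool's actual API):

import time

import requests  # assumed HTTP client; any minimal client works

SERVER = "http://localhost:8000"  # hypothetical relay-server address
CLIENT_ID = "m5stack-01"          # hypothetical client identifier

def poll_forever():
    while True:
        try:
            # Ask the relay server whether a message is waiting for this client.
            r = requests.get(f"{SERVER}/poll", params={"client": CLIENT_ID}, timeout=5)
            r.raise_for_status()
            message = r.json().get("message")
            if message:
                print("received:", message)
        except requests.RequestException:
            # As note ※2 says, a timing mismatch can drop the connection;
            # back off and keep polling instead of giving up.
            pass
        time.sleep(1)  # polling interval

if __name__ == "__main__":
    poll_forever()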

rights


The tool's own code is distributed under the following license:

MIT License

The resources used by this tool (models, libraries, and applications) all belong to their respective owners. Resources that require special care are clearly noted in the description of each plugin. [※1]


*1
Some of the models used by this tool's engines (speech synthesis models, image generation models, video generation models, etc.) can work with audio, images, and video trained on material from specific individuals. Beyond personal use, this tool is not intended to be used for any activity that infringes portrait rights, copyrights, or authors' moral rights.