GenAI Model
Neat uses the GenAI APIs for LLM, VLM, and ASR models that were prepared with
LLiMa. This is the generative-model counterpart to Model:
load a model directory, create a request, then either wait for the full answer
or stream tokens as they are produced.
Use the GenAI APIs when your application asks a model to generate text,
answer questions about images, call tools, or transcribe audio. Use the classic
Model API for fixed-shape discriminative models such as classification,
detection, segmentation, or embedding models.
Where GenAI fits
The GenAI path has the same high-level shape as the rest of Neat:
- Prepare or download a model artifact for Modalix.
- Put the model on the target, commonly under /media/nvme/llima/models/.
- Load the model from your C++ or Python application.
- Send a GenerationRequest.
- Read a GenerationResult or consume a GenerationStream.
LLiMa owns GenAI model preparation, command-line testing, and benchmarking. Neat owns the application-facing API and runtime integration once the model is used inside your app. For model preparation details, see GenAI with LLiMa.
Choose the right handle
For most applications, start with GenAIModel. It auto-detects the model task
from the model directory and exposes capability checks:
- accepts_text()
- accepts_image()
- accepts_audio()
- task()
- model_id()
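As an illustration of how the capability checks might drive application logic, the sketch below builds a short summary of the inputs a model accepts. It is written as a template so it compiles without the Neat headers. It assumes that accepts_text(), accepts_image(), and accepts_audio() return a value convertible to bool, which this page does not state. The helper accepted_inputs and the FakeModel stand-in are hypothetical.

```cpp
#include <string>

// Hypothetical helper: summarizes the capability checks as e.g. "text+image".
// ModelT is any type exposing accepts_text(), accepts_image(), accepts_audio().
// Assumption: each check returns something convertible to bool.
template <typename ModelT>
std::string accepted_inputs(ModelT& model) {
  std::string out;
  auto add = [&out](bool accepted, const char* name) {
    if (!accepted) return;
    if (!out.empty()) out += '+';
    out += name;
  };
  add(static_cast<bool>(model.accepts_text()), "text");
  add(static_cast<bool>(model.accepts_image()), "image");
  add(static_cast<bool>(model.accepts_audio()), "audio");
  if (out.empty()) return "none";
  return out;
}

// Stand-in type used only to exercise the helper without a device.
struct FakeModel {
  bool text;
  bool image;
  bool audio;
  bool accepts_text() const { return text; }
  bool accepts_image() const { return image; }
  bool accepts_audio() const { return audio; }
};
```

If the assumption about return types holds, the same helper could be called with a loaded GenAIModel to decide which request fields the application should offer.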
Use the task-specific handles when the application knows what it is loading and wants the narrower API:
| Handle | Use for |
|---|---|
| genai::GenAIModel | Auto-detected LLM, VLM, or ASR model directories. |
| genai::VisionLanguageModel | Text-only LLMs and image-capable VLMs. |
| genai::ASRModel | Speech-to-text models. |
| genai::GenAIServer | Serving one or more GenAI models through HTTP endpoints. |
Run a text request
#include "neat/genai.h"

#include <iostream>

int main() {
  // Load the model directory; GenAIModel auto-detects the task.
  simaai::neat::genai::GenAIModel model(
      "/media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4");

  simaai::neat::genai::GenerationRequest request;
  request.prompt = "Explain what an API gateway is in one sentence.";
  request.max_new_tokens = 64;

  // run() blocks until generation finishes.
  auto result = model.run(request);
  std::cout << result.text << "\n";
}
run() is synchronous: it returns after generation finishes. It is the simplest
shape for tests, scripts, and request/response application code.
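Because run() blocks until the full answer is ready, its end-to-end latency can be measured by timing the call itself. The sketch below is a generic wall-clock wrapper built on the standard library. The name timed_call is hypothetical, and the wrapper assumes the callable's result type can be moved, which this page does not confirm for GenerationResult.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: invokes `fn` once and returns {result, elapsed milliseconds}.
// Assumption: the result type of `fn` is movable.
template <typename Fn>
auto timed_call(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  auto result = std::forward<Fn>(fn)();
  const std::chrono::duration<double, std::milli> elapsed =
      std::chrono::steady_clock::now() - start;
  return std::make_pair(std::move(result), elapsed.count());
}
```

Under that assumption, a test could wrap the blocking call in a lambda, for example timed_call([&] { return model.run(request); }), and log the second element of the returned pair.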
Stream generated tokens
Use stream() when the caller should see output as it is generated. Each item is
a TokenSample containing the latest text fragment, the current metrics, and,
once generation ends, the final status.
// Continues from the previous example: model is already loaded.
simaai::neat::genai::GenerationRequest request;
request.prompt = "Give me three practical tips for designing a small REST API.";
request.max_new_tokens = 96;

simaai::neat::genai::GenerationStream stream_handle = model.stream(request);
for (const auto& token : stream_handle) {
  // Print each fragment as soon as it arrives.
  std::cout << token.text << std::flush;
}
std::cout << "\n";
Call cancel() on the stream if the user closes the request, changes prompts,
or your application times out the generation.
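One way to apply an application-side timeout is to check a deadline inside the consuming loop and call cancel() once it has passed. The sketch below shows that pattern as a template, so it can be read and tested without the Neat headers. It relies only on what this page shows: the stream is iterable, items have a text member, and the stream has cancel(). The helper collect_with_timeout and the FakeToken and FakeStream stand-ins are hypothetical. Whether cancel() may be called from inside the loop on a real GenerationStream is an assumption.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical helper: appends streamed fragments to `out` and cancels the
// stream once `budget` has elapsed. Returns true if the stream ran to its end,
// false if it was cancelled.
// Assumption: calling cancel() from inside the consuming loop is allowed.
template <typename StreamT>
bool collect_with_timeout(StreamT& stream, std::chrono::milliseconds budget,
                          std::string& out) {
  const auto deadline = std::chrono::steady_clock::now() + budget;
  for (const auto& token : stream) {
    out += token.text;
    if (std::chrono::steady_clock::now() >= deadline) {
      stream.cancel();
      return false;  // stop consuming after cancellation
    }
  }
  return true;
}

// Stand-in types used only to exercise the helper without a device.
struct FakeToken {
  std::string text;
};

struct FakeStream {
  std::vector<FakeToken> tokens;
  bool cancelled = false;
  auto begin() const { return tokens.begin(); }
  auto end() const { return tokens.end(); }
  void cancel() { cancelled = true; }
};
```

The deadline is checked only when a token arrives, so a stalled stream would not be interrupted by this loop. A watchdog on another thread could cover that case, but it would require cancel() to be safe to call concurrently, which this page does not state.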
Add images for VLMs
VLMs accept text plus one or more images. Images are passed through
GenerationRequest.images for a simple prompt, or through ChatMessage.images
when you use chat history.
Images passed as Tensor values should be uint8 HWC RGB tensors. OpenCV
cv::Mat inputs follow the Neat/OpenCV convention: three-channel matrices are
treated as BGR and converted to RGB before they are stored in the request.
simaai::neat::genai::VisionLanguageModel model(
    "/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4");

// cv::imread returns BGR data, or an empty matrix if the file cannot be read.
cv::Mat image = cv::imread("scene.jpg");

simaai::neat::genai::GenerationRequest request;
request.prompt = "What is visible in this image?";
request.images = {image};
request.max_new_tokens = 128;

auto result = model.run(request);
std::cout << result.text << "\n";
For repeated questions about the same image, VisionLanguageModel.encode(...)
can cache image embeddings in the model. Then set request.use_cached_images = true or use a chat message with use_cached_images = true.
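The tensor layout described in this section matters when the pixels do not come from a cv::Mat, for example a raw BGR buffer from a camera driver. Neat converts cv::Mat inputs itself, so the sketch below applies only to the Tensor path. It reorders an interleaved uint8 HWC buffer from BGR to RGB using the standard library. The helper name bgr_to_rgb_hwc is hypothetical, and wrapping the result in a Neat Tensor is left out because this page does not describe that constructor.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: converts an interleaved HWC uint8 buffer from BGR to RGB.
// Pixel (y, x) starts at index (y * width + x) * 3.
std::vector<std::uint8_t> bgr_to_rgb_hwc(const std::vector<std::uint8_t>& bgr,
                                         std::size_t height, std::size_t width) {
  constexpr std::size_t kChannels = 3;
  if (bgr.size() != height * width * kChannels) {
    throw std::invalid_argument("buffer size does not match height * width * 3");
  }
  std::vector<std::uint8_t> rgb(bgr.size());
  for (std::size_t i = 0; i < bgr.size(); i += kChannels) {
    rgb[i] = bgr[i + 2];      // R comes from the third BGR channel
    rgb[i + 1] = bgr[i + 1];  // G stays in place
    rgb[i + 2] = bgr[i];      // B comes from the first BGR channel
  }
  return rgb;
}
```

The size check catches the common mistake of passing a single-channel or four-channel buffer, which would otherwise produce a shifted image rather than an obvious failure.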
Transcribe audio
ASR models use the same request/result shape, but the request must provide
audio. Use audio_file for a file path or audio for an audio tensor.
simaai::neat::genai::ASRModel model("/media/nvme/llima/models/whisper-model");
simaai::neat::genai::GenerationRequest request;
request.audio_file = "meeting.wav";
request.language = "en";
auto result = model.run(request);
std::cout << result.text << "\n";
Compose GenAI into a Graph
Direct run() / stream() calls are the shortest path for most GenAI
applications. When GenAI is one stage in a larger Neat graph, use the public
Graph fragments:
- genai::graphs::VisionLanguage(...)
- genai::graphs::SpeechTranscriber(...)
These fragments expose GenAI stages through named graph endpoints so you can
compose them with the same Graph and Run model used by the rest of Neat.
auto model = std::make_shared<simaai::neat::genai::VisionLanguageModel>(
    "/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4");

simaai::neat::genai::VisionLanguageOptions options;
options.max_new_tokens = 128;
options.streaming = true;

simaai::neat::Graph fragment =
    simaai::neat::genai::graphs::VisionLanguage(model, options, "vlm");
Serve GenAI models
Use GenAIServer when the application boundary should be an HTTP service rather
than direct in-process calls. The server can host multiple model directories and
binds to 0.0.0.0:9998 by default.
simaai::neat::genai::GenAIServer server;
server.add_model("/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4",
                 "qwen-vlm");
server.serve();
Use direct APIs for embedded application logic and tests. Use the server flow when a UI, service, or remote client should talk to the model over a network boundary.
Request rules
GenerationRequest is intentionally explicit:
- Use either prompt or messages, not both.
- Use system_prompt only with prompt.
- Attach images directly only with prompt; attach per-message images through ChatMessage.images.
- Use either direct images or cached images, not both.
- ASR requests use audio fields, not text or image fields.
These rules let Neat fail early with a clear request error instead of sending an ambiguous prompt to the runtime.
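The rules above can be expressed as a small application-side check, which is useful when requests are assembled from user input and you want to report every problem at once. This is a sketch and not Neat's own validation. The RequestShape struct and check_request function are hypothetical, and the error strings are illustrative. The requirement that ASR requests carry audio comes from the transcription section of this page.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of which GenerationRequest fields the caller filled in.
struct RequestShape {
  bool has_prompt = false;
  bool has_messages = false;
  bool has_system_prompt = false;
  bool has_direct_images = false;  // images attached on the request itself
  bool use_cached_images = false;
  bool has_audio = false;          // audio_file or audio
};

// Returns one message per violated rule; an empty vector means the shape is valid.
std::vector<std::string> check_request(const RequestShape& r, bool is_asr) {
  std::vector<std::string> errors;
  if (is_asr) {
    if (!r.has_audio) {
      errors.push_back("ASR request must provide audio");
    }
    if (r.has_prompt || r.has_messages || r.has_system_prompt ||
        r.has_direct_images || r.use_cached_images) {
      errors.push_back("ASR request must not set text or image fields");
    }
    return errors;
  }
  if (r.has_prompt && r.has_messages) {
    errors.push_back("set prompt or messages, not both");
  }
  if (r.has_system_prompt && !r.has_prompt) {
    errors.push_back("system_prompt requires prompt");
  }
  if (r.has_direct_images && !r.has_prompt) {
    errors.push_back("direct images require prompt");
  }
  if (r.has_direct_images && r.use_cached_images) {
    errors.push_back("use direct images or cached images, not both");
  }
  return errors;
}
```

The sketch encodes only the rules listed on this page. For instance, it does not flag a non-ASR request that sets neither prompt nor messages, because the page does not say how the runtime treats that case.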
Next steps
- Prepare and benchmark GenAI models with GenAI with LLiMa.
- Learn classic model execution with Run / Inference.
- Build larger application graphs with Graph.