GenAI Model

Neat uses the GenAI APIs for LLM, VLM, and ASR models that were prepared with LLiMa. This is the generative-model counterpart to Model: load a model directory, create a request, then either wait for the full answer or stream tokens as they are produced.

Use the GenAI APIs when your application asks a model to generate text, answer questions about images, call tools, or transcribe audio. Use the classic Model API for fixed-shape discriminative models such as classification, detection, segmentation, or embedding models.

Where GenAI fits

The GenAI path has the same high-level shape as the rest of Neat:

  1. Prepare or download a model artifact for Modalix.
  2. Put the model on the target, commonly under /media/nvme/llima/models/.
  3. Load the model from your C++ or Python application.
  4. Send a GenerationRequest.
  5. Read a GenerationResult or consume a GenerationStream.

LLiMa owns GenAI model preparation, command-line testing, and benchmarking. Neat owns the application-facing API and runtime integration once the model is used inside your app. For model preparation details, see GenAI with LLiMa.

Choose the right handle

For most applications, start with GenAIModel. It auto-detects the model task from the model directory and exposes capability checks:

  • accepts_text()
  • accepts_image()
  • accepts_audio()
  • task()
  • model_id()
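As a rough illustration of how the capability checks might gate a request before it is built, the sketch below is templated over any handle that exposes accepts_text(), accepts_image(), and accepts_audio(). It assumes those calls return something convertible to bool, which the text above does not state. StubModel, RequestInputs, and first_unsupported_input are hypothetical names used only so the example compiles without Neat.

```cpp
#include <string>

// Hypothetical description of what a pending request carries.
struct RequestInputs {
    bool has_text;
    bool has_image;
    bool has_audio;
};

// Hypothetical stand-in so the sketch compiles without Neat; in an
// application a genai::GenAIModel would be passed instead.
struct StubModel {
    bool text;
    bool image;
    bool audio;
    bool accepts_text() const { return text; }
    bool accepts_image() const { return image; }
    bool accepts_audio() const { return audio; }
};

// Returns the name of the first input kind the model does not accept,
// or an empty string when every input in the request is supported.
// Assumption: the accepts_*() checks are convertible to bool.
template <typename Model>
std::string first_unsupported_input(const Model& model, const RequestInputs& in) {
    if (in.has_text && !model.accepts_text()) return "text";
    if (in.has_image && !model.accepts_image()) return "image";
    if (in.has_audio && !model.accepts_audio()) return "audio";
    return "";
}
```

Checking capabilities up front lets the application reject, for example, an image upload against a text-only model before any request is constructed.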

Use the task-specific handles when the application knows what it is loading and wants the narrower API:

Handle                       Use for
genai::GenAIModel            Auto-detected LLM, VLM, or ASR model directories.
genai::VisionLanguageModel   Text-only LLMs and image-capable VLMs.
genai::ASRModel              Speech-to-text models.
genai::GenAIServer           Serving one or more GenAI models through HTTP endpoints.

Run a text request

#include "neat/genai.h"

#include <iostream>

int main() {
    simaai::neat::genai::GenAIModel model(
        "/media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4");

    simaai::neat::genai::GenerationRequest request;
    request.prompt = "Explain what an API gateway is in one sentence.";
    request.max_new_tokens = 64;

    auto result = model.run(request);
    std::cout << result.text << "\n";
}

run() is synchronous: it returns after generation finishes. It is the simplest shape for tests, scripts, and request/response application code.

Stream generated tokens

Use stream() when the caller should see output as it is generated. Each item is a TokenSample containing the latest text fragment, the current metrics, and, once generation ends, the final status.

simaai::neat::genai::GenerationRequest request;
request.prompt = "Give me three practical tips for designing a small REST API.";
request.max_new_tokens = 96;

simaai::neat::genai::GenerationStream stream_handle = model.stream(request);
for (const auto& token : stream_handle) {
    std::cout << token.text << std::flush;
}
std::cout << "\n";

Call cancel() on the stream if the user closes the request, changes prompts, or your application times out the generation.
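One way an application-side timeout could drive cancel() is sketched below. It is written as a template over any stream that supports range-for iteration, yields items with a text member, and has cancel(), which is all the text above shows. StubToken, StubStream, and collect_with_deadline are hypothetical names so the example compiles without Neat, and stopping iteration right after cancel() is an assumption about how the real stream behaves.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical stand-ins so the sketch compiles without Neat.
struct StubToken {
    std::string text;
};

struct StubStream {
    std::vector<StubToken> tokens;
    bool cancelled = false;

    std::vector<StubToken>::const_iterator begin() const { return tokens.begin(); }
    std::vector<StubToken>::const_iterator end() const { return tokens.end(); }
    void cancel() { cancelled = true; }
};

// Collects streamed text until the stream ends or the time budget is spent.
// On timeout it calls cancel() on the stream and stops reading.
template <typename Stream>
std::string collect_with_deadline(Stream& stream, std::chrono::milliseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    std::string text;
    for (const auto& token : stream) {
        text += token.text;
        if (std::chrono::steady_clock::now() >= deadline) {
            stream.cancel();
            break;
        }
    }
    return text;
}
```

The deadline is only checked when a token arrives, so this pattern does not help if the stream stalls. Handling that case would mean calling cancel() from another thread, and whether that is safe is not stated here.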

Add images for VLMs

VLMs accept text plus one or more images. Images are passed through GenerationRequest.images for a simple prompt, or through ChatMessage.images when you use chat history.

Images passed as Tensor values should be uint8 HWC RGB tensors. OpenCV cv::Mat inputs follow the Neat/OpenCV convention: three-channel matrices are treated as BGR and converted to RGB before they are stored in the request.

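simaai::neat::genai::VisionLanguageModel model(
    "/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4");

// cv::imread loads three-channel images as BGR and returns an empty
// cv::Mat if the file cannot be read; check image.empty() before use.
cv::Mat image = cv::imread("scene.jpg");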

simaai::neat::genai::GenerationRequest request;
request.prompt = "What is visible in this image?";
request.images = {image};
request.max_new_tokens = 128;

auto result = model.run(request);
std::cout << result.text << "\n";
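For cv::Mat inputs the BGR-to-RGB conversion described above is done for you. If you instead build a Tensor from your own packed BGR buffer, the channel swap might look like the following sketch. It uses only the standard library, and bgr_to_rgb_hwc is a hypothetical helper, not part of Neat. How the resulting bytes are wrapped into a Tensor is not covered here.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Converts a packed uint8 HWC image from BGR to RGB channel order by
// swapping the first and third byte of every pixel.
// Returns an empty vector if the buffer size is not height * width * 3.
std::vector<std::uint8_t> bgr_to_rgb_hwc(const std::vector<std::uint8_t>& bgr,
                                         std::size_t height, std::size_t width) {
    const std::size_t channels = 3;
    if (bgr.size() != height * width * channels) {
        return std::vector<std::uint8_t>();  // size does not match H*W*3
    }
    std::vector<std::uint8_t> rgb(bgr);
    for (std::size_t i = 0; i < rgb.size(); i += channels) {
        std::swap(rgb[i], rgb[i + 2]);  // B <-> R, G stays in place
    }
    return rgb;
}
```

In HWC layout the three channel bytes of a pixel are adjacent, which is why a single pass with a stride of three is enough.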

For repeated questions about the same image, VisionLanguageModel.encode(...) can cache image embeddings in the model. Then set request.use_cached_images = true or use a chat message with use_cached_images = true.

Transcribe audio

ASR models use the same request/result shape, but the request must provide audio. Use audio_file for a file path or audio for an audio tensor.

simaai::neat::genai::ASRModel model("/media/nvme/llima/models/whisper-model");

simaai::neat::genai::GenerationRequest request;
request.audio_file = "meeting.wav";
request.language = "en";

auto result = model.run(request);
std::cout << result.text << "\n";

Compose GenAI into a Graph

Direct run() / stream() calls are the shortest path for most GenAI applications. When GenAI is one stage in a larger Neat graph, use the public Graph fragments:

  • genai::graphs::VisionLanguage(...)
  • genai::graphs::SpeechTranscriber(...)

These fragments expose GenAI stages through named graph endpoints so you can compose them with the same Graph and Run model used by the rest of Neat.

auto model = std::make_shared<simaai::neat::genai::VisionLanguageModel>(
    "/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4");

simaai::neat::genai::VisionLanguageOptions options;
options.max_new_tokens = 128;
options.streaming = true;

simaai::neat::Graph fragment =
    simaai::neat::genai::graphs::VisionLanguage(model, options, "vlm");

Serve GenAI models

Use GenAIServer when the application boundary should be an HTTP service rather than direct in-process calls. The server can host multiple model directories and binds to 0.0.0.0:9998 by default, that is, port 9998 on all IPv4 network interfaces.

simaai::neat::genai::GenAIServer server;
server.add_model("/media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4",
                 "qwen-vlm");
server.serve();

Use direct APIs for embedded application logic and tests. Use the server flow when a UI, service, or remote client should talk to the model over a network boundary.

Request rules

GenerationRequest is intentionally explicit:

  • Use either prompt or messages, not both.
  • Use system_prompt only with prompt.
  • Attach images directly only with prompt; attach per-message images through ChatMessage.images.
  • Use either direct images or cached images, not both.
  • ASR requests use audio fields, not text or image fields.

These rules let Neat fail early with a clear request error instead of sending an ambiguous prompt to the runtime.
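To make the rules above concrete, here is a minimal sketch of the same checks as a standalone function. It is not Neat's validation code: RequestShape is a hypothetical summary of which request fields are set, first_rule_violation is a hypothetical helper, and the error strings are invented for illustration. The requirement that ASR requests provide audio comes from the transcription section.

```cpp
#include <string>

// Hypothetical summary of a request; NOT the real GenerationRequest layout.
struct RequestShape {
    bool is_asr;             // request targets an ASR model
    bool has_prompt;
    bool has_messages;
    bool has_system_prompt;
    bool has_direct_images;  // images attached to the request itself
    bool use_cached_images;
    bool has_audio;          // audio_file or audio is set
};

// Returns a description of the first rule the request breaks,
// or an empty string if none of the listed rules is violated.
std::string first_rule_violation(const RequestShape& r) {
    if (r.is_asr) {
        if (r.has_prompt || r.has_messages || r.has_system_prompt ||
            r.has_direct_images || r.use_cached_images) {
            return "ASR requests use audio fields, not text or image fields";
        }
        if (!r.has_audio) return "ASR requests must provide audio";
        return "";
    }
    if (r.has_prompt && r.has_messages) {
        return "use either prompt or messages, not both";
    }
    if (r.has_system_prompt && !r.has_prompt) {
        return "system_prompt requires prompt";
    }
    if (r.has_direct_images && !r.has_prompt) {
        return "direct images require prompt";
    }
    if (r.has_direct_images && r.use_cached_images) {
        return "use either direct or cached images, not both";
    }
    return "";
}
```

The sketch only encodes the rules listed in this section. Cases they do not mention, such as a request with neither prompt nor messages, are left unchecked.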

Next steps