Run an LLM
| Field | Value |
|---|---|
| Difficulty | Beginner |
| Estimated Read Time | 10 minutes |
| Labels | genai, llm, chat, history, streaming |
The classic Model tutorials use .tar.gz MPK archives. GenAI models use LLiMa model directories and the neat::genai API instead. Start with the smallest request: load a model, set request.prompt, run it, and print the answer. Once that works, switch to request.messages when you need conversation state.
Walkthrough
Load the model directory
Point GenAIModel at a deployed LLiMa model directory. This tutorial uses GenAIModel because it auto-detects whether the directory is an LLM, VLM, or ASR model.
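Construct simaai::neat::genai::GenAIModel from the model path. The snippets use the alias namespace genai = simaai::neat::genai, and args.model holds the path passed with --model.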
genai::GenAIModel model(args.model);
Send one prompt
Build a GenerationRequest with prompt and a token budget. This is the shortest path for one-off questions, tests, and scripts.
genai::GenerationRequest request;
request.prompt = "Give me three practical tips for designing a small REST API.";
request.max_new_tokens = 96;
const genai::GenerationResult first = model.run(request);
std::cout << "assistant: " << first.text << "\n\n";
Define a system prompt
Use a short system instruction to steer the model's behavior. You can attach it to a simple prompt request with system_prompt; when you switch to chat history, carry the same instruction into the message list as a system message.
const std::string system_prompt = "You are concise and practical.";
genai::GenerationRequest concise_request;
concise_request.system_prompt = system_prompt;
concise_request.prompt = "Give me one rule of thumb for designing a small REST API.";
concise_request.max_new_tokens = 64;
const genai::GenerationResult concise = model.run(concise_request);
std::cout << "assistant: " << concise.text << "\n\n";
Switch to messages
For chat-style requests, use messages instead of prompt: start with a system message and a user message, run the request, then store the assistant response. The model does not remember earlier run() calls by itself; your application owns the message history.
std::vector<genai::ChatMessage> messages;
messages.push_back(genai::ChatMessage{
    .role = "system",
    .content = system_prompt,
});
messages.push_back(genai::ChatMessage{
    .role = "user",
    .content = "Give me three practical tips for writing API documentation.",
});
genai::GenerationRequest chat_request;
chat_request.messages = messages;
chat_request.max_new_tokens = 96;
const genai::GenerationResult chat_result = model.run(chat_request);
std::cout << "assistant: " << chat_result.text << "\n\n";
messages.push_back(genai::ChatMessage{.role = "assistant", .content = chat_result.text});
Ask a follow-up with history
Append another user message, send the updated message list, and read the answer. The model now sees the full conversation your application kept.
messages.push_back(genai::ChatMessage{
    .role = "user",
    .content = "Which tip should I apply first for a prototype?",
});
genai::GenerationRequest follow_up;
follow_up.messages = messages;
follow_up.max_new_tokens = 96;
const genai::GenerationResult second = model.run(follow_up);
std::cout << "assistant: " << second.text << "\n\n";
messages.push_back(genai::ChatMessage{.role = "assistant", .content = second.text});
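Because the application owns the history, the append, run, append pattern above can be wrapped in a small helper. The sketch below is one possible shape and is not part of the neat::genai API: Conversation, RunFn, and the local ChatMessage struct are hypothetical stand-ins so the block compiles without the SDK. In a real program, RunFn would presumably build a GenerationRequest, assign messages, call model.run(), and return the result's text.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for genai::ChatMessage (same two fields as in the tutorial).
struct ChatMessage {
  std::string role;
  std::string content;
};

// Hypothetical helper that owns the message history for one conversation.
class Conversation {
 public:
  // Stand-in for "build a request from these messages, run the model, return the text".
  using RunFn = std::function<std::string(const std::vector<ChatMessage>&)>;

  Conversation(std::string system_prompt, RunFn run) : run_(std::move(run)) {
    assert(run_ && "a run function is required");
    messages_.push_back({"system", std::move(system_prompt)});
  }

  // Append the user turn, run on the full history, then store the assistant reply.
  std::string ask(std::string user_text) {
    messages_.push_back({"user", std::move(user_text)});
    std::string reply = run_(messages_);
    messages_.push_back({"assistant", reply});
    return reply;
  }

  const std::vector<ChatMessage>& messages() const { return messages_; }

 private:
  RunFn run_;
  std::vector<ChatMessage> messages_;
};
```

Keeping the run step behind a callable also makes the history logic testable with a fake model, without a device or a model directory.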
Stream an answer
For UI-style output, call stream() and iterate the returned GenerationStream. Each TokenSample carries the latest text fragment in its text field; flush after each write so partial output appears immediately.
messages.push_back(genai::ChatMessage{
    .role = "user",
    .content = "Turn that advice into a short checklist.",
});
genai::GenerationRequest streaming_request;
streaming_request.messages = messages;
streaming_request.max_new_tokens = 96;
genai::GenerationStream stream_handle = model.stream(streaming_request);
std::cout << "assistant: ";
for (const genai::TokenSample& token : stream_handle) {
    std::cout << token.text << std::flush;
}
std::cout << "\n";
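The streaming loop prints fragments but does not keep them, so the streamed reply cannot be appended to the message history afterwards. A minimal sketch of collecting while printing follows. It assumes only what the tutorial loop shows: the stream can be iterated and each item has a text field. TokenSample here is a local stand-in, and collect_stream is a hypothetical helper name, not a library function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for genai::TokenSample; only the `text` field is used.
struct TokenSample {
  std::string text;
};

// Pass each fragment to `on_fragment` as it arrives and return the concatenated reply.
// `Stream` can be any range whose items expose a `text` member.
template <typename Stream, typename OnFragment>
std::string collect_stream(Stream&& stream, OnFragment&& on_fragment) {
  std::string full;
  for (const auto& token : stream) {
    on_fragment(token.text);  // e.g. print and flush for UI-style output
    full += token.text;       // keep the fragment so the reply can be stored later
  }
  return full;
}
```

With the real types, the callback would typically write to std::cout and flush, and the returned string could then be pushed onto messages as an assistant message, the same way the non-streaming turns do.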
Run
First, download an LLM such as Qwen3 4B from Hugging Face using the LLiMa CLI:
llima pull Qwen3-4B-Instruct-2507-GPTQ-a16w4
Run the tutorial on Modalix with the deployed model directory:
C++ (prebuilt):
./lib/sima-neat/tutorials/tutorial_019_run_an_llm \
--model /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4
C++ (build from source):
./build.sh --target tutorial_019_run_an_llm
./build/tutorials-standalone/tutorial_019_run_an_llm \
--model /media/nvme/llima/models/Qwen3-4B-Instruct-2507-GPTQ-a16w4
Expected output is five assistant replies: a simple prompt answer, a system-prompted answer, a first chat answer, a context-aware follow-up, and a streamed final response.
In Practice
Keep only the amount of message history your application needs. Long histories consume context tokens and increase time to first token. For persistent chat applications, store the conversation outside the model object and rebuild GenerationRequest.messages for each turn.
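One simple way to bound history is to keep the system message plus the most recent turns. The sketch below illustrates that policy under stated assumptions: ChatMessage is a local stand-in for genai::ChatMessage, trim_history is a hypothetical helper, and the limit counts messages rather than tokens because the tutorial does not show a tokenizer API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for genai::ChatMessage (same two fields as in the tutorial).
struct ChatMessage {
  std::string role;
  std::string content;
};

// Keep a leading system message (if present) plus the last `max_recent` other messages.
std::vector<ChatMessage> trim_history(const std::vector<ChatMessage>& history,
                                      std::size_t max_recent) {
  std::vector<ChatMessage> trimmed;
  std::size_t first = 0;
  if (!history.empty() && history.front().role == "system") {
    trimmed.push_back(history.front());
    first = 1;
  }
  // Number of non-system messages to drop from the oldest end.
  const std::size_t remaining = history.size() - first;
  const std::size_t skip = remaining > max_recent ? remaining - max_recent : 0;
  for (std::size_t i = first + skip; i < history.size(); ++i) {
    trimmed.push_back(history[i]);
  }
  assert(trimmed.size() <= max_recent + 1);
  return trimmed;
}
```

The trimmed copy would be assigned to GenerationRequest.messages for the next turn while the full history stays in application storage. A count-based cut can separate a user message from its assistant reply, so an even max_recent or a pair-aware policy may suit some applications better.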
Full source
#include "neat/genai.h"

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace genai = simaai::neat::genai;

struct Args {
  std::filesystem::path model;
};

Args parse_args(int argc, char** argv) {
  Args args;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--model" && i + 1 < argc) {
      args.model = argv[++i];
    } else {
      throw std::runtime_error("usage: run_an_llm --model <llima_model_dir>");
    }
  }
  if (args.model.empty()) {
    throw std::runtime_error("missing required --model <llima_model_dir>");
  }
  return args;
}

int main(int argc, char** argv) {
  try {
    const Args args = parse_args(argc, argv);
    genai::GenAIModel model(args.model);

    genai::GenerationRequest request;
    request.prompt = "Give me three practical tips for designing a small REST API.";
    request.max_new_tokens = 96;
    const genai::GenerationResult first = model.run(request);
    std::cout << "assistant: " << first.text << "\n\n";

    const std::string system_prompt = "You are concise and practical.";
    genai::GenerationRequest concise_request;
    concise_request.system_prompt = system_prompt;
    concise_request.prompt = "Give me one rule of thumb for designing a small REST API.";
    concise_request.max_new_tokens = 64;
    const genai::GenerationResult concise = model.run(concise_request);
    std::cout << "assistant: " << concise.text << "\n\n";

    std::vector<genai::ChatMessage> messages;
    messages.push_back(genai::ChatMessage{
        .role = "system",
        .content = system_prompt,
    });
    messages.push_back(genai::ChatMessage{
        .role = "user",
        .content = "Give me three practical tips for writing API documentation.",
    });
    genai::GenerationRequest chat_request;
    chat_request.messages = messages;
    chat_request.max_new_tokens = 96;
    const genai::GenerationResult chat_result = model.run(chat_request);
    std::cout << "assistant: " << chat_result.text << "\n\n";
    messages.push_back(genai::ChatMessage{.role = "assistant", .content = chat_result.text});

    messages.push_back(genai::ChatMessage{
        .role = "user",
        .content = "Which tip should I apply first for a prototype?",
    });
    genai::GenerationRequest follow_up;
    follow_up.messages = messages;
    follow_up.max_new_tokens = 96;
    const genai::GenerationResult second = model.run(follow_up);
    std::cout << "assistant: " << second.text << "\n\n";
    messages.push_back(genai::ChatMessage{.role = "assistant", .content = second.text});

    messages.push_back(genai::ChatMessage{
        .role = "user",
        .content = "Turn that advice into a short checklist.",
    });
    genai::GenerationRequest streaming_request;
    streaming_request.messages = messages;
    streaming_request.max_new_tokens = 96;
    genai::GenerationStream stream_handle = model.stream(streaming_request);
    std::cout << "assistant: ";
    for (const genai::TokenSample& token : stream_handle) {
      std::cout << token.text << std::flush;
    }
    std::cout << "\n";
    return 0;
  } catch (const std::exception& e) {
    std::cerr << "error: " << e.what() << "\n";
    return 1;
  }
}