
Run a VLM


Difficulty: Beginner
Estimated Read Time: 10-15 minutes
Labels: genai, vlm, image, cache, multimodal

Vision-language models can accept text plus image tensors. For one question, attach the image directly to GenerationRequest.images. For repeated questions, encode the image once and reuse the cached image embeddings in follow-up requests.

Walkthrough

Load the VLM and image

Load a VisionLanguageModel from a deployed LLiMa model directory and decode an image from disk.

Use OpenCV to read the image. Neat treats three-channel cv::Mat inputs as BGR and converts them to RGB internally.

tutorials/020_run_a_vlm/run_a_vlm.cpp
genai::VisionLanguageModel model(args.model);
cv::Mat image = cv::imread(args.image.string(), cv::IMREAD_COLOR);
if (image.empty()) {
  throw std::runtime_error("failed to read image: " + args.image.string());
}
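
For readers unfamiliar with channel order, the following is a minimal sketch of what a BGR to RGB conversion does to pixel data. OpenCV's cv::imread returns color images in BGR order, and Neat performs the conversion internally, so the tutorial does not need this step. The helper name bgr_to_rgb and the packed byte layout are assumptions made for illustration; this is not Neat's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Neat or OpenCV.
// Takes a tightly packed, interleaved 8-bit buffer (3 bytes per pixel) and
// returns a copy with the first and third byte of every pixel swapped.
std::vector<std::uint8_t> bgr_to_rgb(std::vector<std::uint8_t> pixels) {
  if (pixels.size() % 3 != 0) {
    throw std::invalid_argument("buffer length must be a multiple of 3");
  }
  for (std::size_t i = 0; i < pixels.size(); i += 3) {
    std::swap(pixels[i], pixels[i + 2]);  // B <-> R; G stays in place
  }
  return pixels;
}
```

A pure-blue BGR pixel (255, 0, 0) becomes (0, 0, 255) in RGB order. Because Neat already does this for three-channel cv::Mat inputs, converting the image yourself before passing it in would swap the channels twice.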

Ask with a direct image

Attach the image directly to the first request. This is the simplest path and is often enough for one-shot visual questions.

tutorials/020_run_a_vlm/run_a_vlm.cpp
genai::GenerationRequest direct;
direct.prompt = "Describe this image in one sentence.";
direct.images = {image};
direct.max_new_tokens = 96;

const genai::GenerationResult first = model.run(direct);
std::cout << "direct image: " << first.text << "\n\n";

Cache the image embedding

Call encode(...) to cache image embeddings in the model. The call returns true when the image was accepted and cached.

tutorials/020_run_a_vlm/run_a_vlm.cpp
if (!model.encode(image)) {
  throw std::runtime_error("VLM did not accept the image for caching");
}
std::cout << "cached_images=" << model.cached_image_count() << "\n";

Ask follow-up questions

Set use_cached_images = true on each request that should reuse the cached image. You can ask multiple questions about the same cached image. Requests without that flag behave normally: text-only requests use no image, direct-image requests use their own images, and another encode(...) call replaces the cached image.

tutorials/020_run_a_vlm/run_a_vlm.cpp
genai::GenerationRequest cached;
cached.prompt = "What details should I inspect more closely?";
cached.use_cached_images = true;
cached.max_new_tokens = 96;

const genai::GenerationResult follow_up = model.run(cached);
std::cout << "cached image: " << follow_up.text << "\n\n";

genai::GenerationRequest second_cached;
second_cached.prompt = "Summarize the image in three keywords.";
second_cached.use_cached_images = true;
second_cached.max_new_tokens = 48;

const genai::GenerationResult second_follow_up = model.run(second_cached);
std::cout << "cached image keywords: " << second_follow_up.text << "\n\n";
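
The cache rules described in this section can be modeled with a small toy class. This is only a sketch of the stated behavior using stand-in types: ToyImageCache, Request, and Image are hypothetical names, not the Neat API, and an image is represented by a plain string label. What Neat does when a request sets use_cached_images and also attaches images, or sets the flag with nothing cached, is not covered by the text, so the toy's handling of those cases is an arbitrary assumption.

```cpp
#include <optional>
#include <string>
#include <vector>

// Stand-in types for illustration only; they are NOT the Neat API.
using Image = std::string;  // an image is just a label in this toy

struct Request {
  std::vector<Image> images;       // mirrors GenerationRequest.images
  bool use_cached_images = false;  // mirrors GenerationRequest.use_cached_images
};

class ToyImageCache {
 public:
  // A new encode call replaces whatever was cached before.
  bool encode(const Image& image) {
    cached_ = image;
    return true;
  }

  // Returns the images a request would see under the rules in the text.
  std::vector<Image> images_for(const Request& request) const {
    if (request.use_cached_images) {
      // Assumption: directly attached images are ignored when the flag is
      // set, and an empty cache yields no image.
      if (cached_.has_value()) {
        return {*cached_};
      }
      return {};
    }
    return request.images;  // direct images, or empty for text-only
  }

 private:
  std::optional<Image> cached_;
};
```

Calling encode twice and then asking with use_cached_images set yields only the second image, while a request that leaves the flag unset sees exactly the images attached to it.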

Attach an image to a chat message

When you use messages, attach images to the user message that needs them. This keeps the image next to the exact text it belongs to.

tutorials/020_run_a_vlm/run_a_vlm.cpp
genai::ChatMessage image_message;
image_message.role = "user";
image_message.content = "What is the main subject of this image?";
image_message.images = {image};

genai::GenerationRequest message_request;
message_request.messages = {image_message};
message_request.max_new_tokens = 96;

const genai::GenerationResult message_result = model.run(message_request);
std::cout << "message image: " << message_result.text << "\n";

Run

First, download a VLM such as Qwen3-VL 4B from Hugging Face using the LLiMa CLI:

llima pull Qwen3-VL-4B-Instruct-GPTQ-a16w4

Run the tutorial on Modalix with the deployed model directory and a local image:

C++ (prebuilt):

./lib/sima-neat/tutorials/tutorial_020_run_a_vlm \
--model /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--image tests/images/people.jpg

C++ (build from source):

./build.sh --target tutorial_020_run_a_vlm
./build/tutorials-standalone/tutorial_020_run_a_vlm \
--model /media/nvme/llima/models/Qwen3-VL-4B-Instruct-GPTQ-a16w4 \
--image tests/images/people.jpg

The expected output is one answer from the direct-image request, a cached_images= line with the cache count, two follow-up answers that reuse the cached image, and one answer from the message-level image request.

In Practice

Use image caching when the user asks several questions about the same frame, product image, diagram, or document page. Avoid caching when each request uses a different image because the direct-image path is simpler and keeps prompt state obvious.

Some model families may not support cached reuse. In that case, use direct images on each request.
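
The guidance above can be sketched as a small planning helper that picks the cached path only when there are several questions and the cache is usable, and otherwise attaches the image directly. SketchRequest and plan_requests are hypothetical names, and the image is a string label standing in for cv::Mat. The cache_available parameter is assumed to be the value returned by encode(...); whether an unsupported model family reports that by returning false is an assumption here, not something the tutorial confirms.

```cpp
#include <string>
#include <vector>

// Stand-in types for illustration only; they are NOT the Neat API.
using Image = std::string;

struct SketchRequest {
  std::string prompt;
  std::vector<Image> images;       // mirrors GenerationRequest.images
  bool use_cached_images = false;  // mirrors GenerationRequest.use_cached_images
};

// Hypothetical helper: builds one request per prompt about the same image.
// cache_available is assumed to be the result of a prior encode(image) call.
std::vector<SketchRequest> plan_requests(const Image& image,
                                         const std::vector<std::string>& prompts,
                                         bool cache_available) {
  std::vector<SketchRequest> plan;
  const bool reuse_cache = cache_available && prompts.size() > 1;
  for (const std::string& prompt : prompts) {
    SketchRequest request;
    request.prompt = prompt;
    if (reuse_cache) {
      request.use_cached_images = true;  // several questions, one encode
    } else {
      request.images = {image};  // one-shot question or fallback path
    }
    plan.push_back(request);
  }
  return plan;
}
```

With a single prompt the helper always attaches the image directly, which matches the advice that the direct path is simpler when nothing is reused.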

Use ChatMessage.images when you are building a conversation and only one message should carry the image. Use top-level GenerationRequest.images for the simpler one-prompt shape.

Full source

tutorials/020_run_a_vlm/run_a_vlm.cpp
#include "neat/genai.h"

#include <opencv2/imgcodecs.hpp>

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>

namespace genai = simaai::neat::genai;

struct Args {
  std::filesystem::path model;
  std::filesystem::path image;
};

Args parse_args(int argc, char** argv) {
  Args args;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--model" && i + 1 < argc) {
      args.model = argv[++i];
    } else if (arg == "--image" && i + 1 < argc) {
      args.image = argv[++i];
    } else {
      throw std::runtime_error("usage: run_a_vlm --model <vlm_model_dir> --image <image>");
    }
  }
  if (args.model.empty() || args.image.empty()) {
    throw std::runtime_error("missing required --model <vlm_model_dir> or --image <image>");
  }
  return args;
}

int main(int argc, char** argv) {
  try {
    const Args args = parse_args(argc, argv);

    genai::VisionLanguageModel model(args.model);
    cv::Mat image = cv::imread(args.image.string(), cv::IMREAD_COLOR);
    if (image.empty()) {
      throw std::runtime_error("failed to read image: " + args.image.string());
    }

    genai::GenerationRequest direct;
    direct.prompt = "Describe this image in one sentence.";
    direct.images = {image};
    direct.max_new_tokens = 96;

    const genai::GenerationResult first = model.run(direct);
    std::cout << "direct image: " << first.text << "\n\n";

    if (!model.encode(image)) {
      throw std::runtime_error("VLM did not accept the image for caching");
    }
    std::cout << "cached_images=" << model.cached_image_count() << "\n";

    genai::GenerationRequest cached;
    cached.prompt = "What details should I inspect more closely?";
    cached.use_cached_images = true;
    cached.max_new_tokens = 96;

    const genai::GenerationResult follow_up = model.run(cached);
    std::cout << "cached image: " << follow_up.text << "\n\n";

    genai::GenerationRequest second_cached;
    second_cached.prompt = "Summarize the image in three keywords.";
    second_cached.use_cached_images = true;
    second_cached.max_new_tokens = 48;

    const genai::GenerationResult second_follow_up = model.run(second_cached);
    std::cout << "cached image keywords: " << second_follow_up.text << "\n\n";

    genai::ChatMessage image_message;
    image_message.role = "user";
    image_message.content = "What is the main subject of this image?";
    image_message.images = {image};

    genai::GenerationRequest message_request;
    message_request.messages = {image_message};
    message_request.max_new_tokens = 96;

    const genai::GenerationResult message_result = model.run(message_request);
    std::cout << "message image: " << message_result.text << "\n";

    return 0;
  } catch (const std::exception& e) {
    std::cerr << "error: " << e.what() << "\n";
    return 1;
  }
}
