GLM-4.6V
GLM-4.6V is Z.ai's full-scale 106B vision-language foundation model with a context window of 128K tokens, native multimodal function calling, interleaved image-text generation, and pixel-accurate frontend replication from screenshots.
import { streamText } from 'ai'
const result = streamText({
  model: 'zai/glm-4.6v',
  prompt: 'Why is the sky blue?',
})
Playground
Try out GLM-4.6V by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
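A provider preference can be expressed as an ordered list of slugs. The sketch below assumes the gateway reads that list from a `providerOptions.gateway.order` field; check the field name against the AI Gateway docs before relying on it. The helper name `preferProviders` and the slugs `provider-a` and `provider-b` are placeholders, not real provider names.

```typescript
// Assumed shape of the gateway routing options; verify against the AI Gateway docs.
type GatewayRouting = { gateway: { order: string[] } };

// Build routing options that try providers in the given order.
// Slugs are trimmed and lowercased; empty and duplicate entries are dropped.
function preferProviders(slugs: string[]): GatewayRouting {
  const seen = new Set<string>();
  const order: string[] = [];
  for (const raw of slugs) {
    const slug = raw.trim().toLowerCase();
    if (slug !== '' && !seen.has(slug)) {
      seen.add(slug);
      order.push(slug);
    }
  }
  return { gateway: { order } };
}

// Placeholder slugs for illustration only.
const providerOptions = preferProviders(['provider-a', 'provider-b', 'provider-a']);
```

The resulting object would be passed as `providerOptions` alongside `model: 'zai/glm-4.6v'` in the `streamText` call.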
About GLM-4.6V
GLM-4.6V is Z.ai's full-scale 106B parameter vision-language model, designed for cloud and high-performance cluster deployment. Released September 30, 2025, it upgrades GLM-4.5V with a context window of 128K tokens, native multimodal function calling, and benchmark results that Z.ai reports as competitive among multimodal models of comparable parameter scale.
A defining capability is native multimodal function calling, which lets you pass images and screenshots directly as tool inputs without converting them to text descriptions first. This enables agentic workflows where the model reasons about visual content, decides to call tools based on what it sees, and processes visual results across multiple steps. Combined with interleaved image-text content generation, GLM-4.6V produces mixed-media outputs from complex inputs.
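As a rough illustration of a tool that takes an image directly, the sketch below defines a hypothetical `cropImage` tool: a JSON Schema the model would see, a runtime check on the input, and a stub handler. The tool name, the schema fields, and the `ToolCall` shape are all assumptions made for this example, not part of the GLM-4.6V or AI Gateway API.

```typescript
// Hypothetical input for a tool that receives an image reference, not a caption.
type ImageToolInput = { imageUrl: string; region?: [number, number, number, number] };
// Hypothetical shape of a tool call emitted by the model.
type ToolCall = { toolName: string; input: unknown };

// JSON Schema describing the tool input to the model.
const cropImageSchema = {
  type: 'object',
  properties: {
    imageUrl: { type: 'string', description: 'URL of the image to crop' },
    region: { type: 'array', items: { type: 'number' }, minItems: 4, maxItems: 4 },
  },
  required: ['imageUrl'],
} as const;

// Runtime guard: model output is untrusted, so validate before running the tool.
function isImageToolInput(value: unknown): value is ImageToolInput {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v['imageUrl'] !== 'string') return false;
  const region = v['region'];
  if (region === undefined) return true;
  return Array.isArray(region) && region.length === 4 && region.every((n) => typeof n === 'number');
}

// Dispatch one tool call and return a text result for the next model step.
function runToolCall(call: ToolCall): string {
  if (call.toolName !== 'cropImage') return `error: unknown tool ${call.toolName}`;
  if (!isImageToolInput(call.input)) return 'error: invalid input for cropImage';
  const region: [number, number, number, number] = call.input.region ?? [0, 0, 0, 0];
  const [x, y, w, h] = region;
  // A real handler would fetch and crop the image; this stub only echoes the request.
  return `cropped ${call.input.imageUrl} at ${x},${y} size ${w}x${h}`;
}
```

Returning errors as strings, instead of throwing, lets the model see what went wrong and retry with corrected input on the next step.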
GLM-4.6V handles frontend replication and visual editing: given a screenshot, it can reconstruct pixel-accurate HTML/CSS and apply iterative modifications. This capability, combined with multimodal document understanding (joint interpretation of text, layout, charts, and figures), suits UI development workflows and complex document processing pipelines.
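A screenshot-to-HTML request can be sketched as a single user message that pairs an instruction with the image. The local types below mirror the text and image message parts used by the AI SDK as an assumed shape; the helper name `screenshotToHtmlMessage`, the prompt wording, and the example URL are illustrative only.

```typescript
// Local types mirroring multimodal user message parts (assumed shape).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Build one user message asking for HTML/CSS that matches a screenshot.
// An optional change request covers the iterative-modification case.
function screenshotToHtmlMessage(screenshotUrl: string, change?: string): UserMessage {
  const base = 'Reconstruct this screenshot as a single HTML file with inline CSS.';
  const instruction = change === undefined ? base : `${base} Then apply this change: ${change}`;
  return {
    role: 'user',
    content: [
      { type: 'text', text: instruction },
      { type: 'image', image: screenshotUrl },
    ],
  };
}

// Example URL is a placeholder.
const messages = [screenshotToHtmlMessage('https://example.com/landing.png', 'make the header sticky')];
```

The `messages` array would replace the `prompt` field in a `streamText` call that uses `model: 'zai/glm-4.6v'`.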
What To Consider When Choosing a Provider
- Model size and latency: At 106B parameters, GLM-4.6V is designed for cloud deployment. For latency-sensitive applications, evaluate GLM-4.6V-Flash (9B), which trades peak capability for speed.
- Tool schemas: GLM-4.6V processes images as tool inputs natively, so agentic pipelines need no intermediate text conversion. Design your tool schemas to accept image inputs directly.
- Context window: The 128K-token context supports long documents and multi-image inputs, which reduces the need to chunk inputs across multiple requests.
- Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
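The authentication point above can be sketched as a small helper that picks whichever credential is present and builds a bearer header. The environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN` are assumptions here; confirm them against the AI Gateway documentation. The helper name `gatewayAuthHeader` is hypothetical.

```typescript
// Environment is passed in explicitly so the helper is easy to test.
type Env = Record<string, string | undefined>;

// Return a bearer header from the first non-empty credential.
// Variable names are assumptions; check the AI Gateway docs.
function gatewayAuthHeader(env: Env): { Authorization: string } {
  const candidates = [env['AI_GATEWAY_API_KEY'], env['VERCEL_OIDC_TOKEN']];
  const token = candidates.find((value) => value !== undefined && value.trim() !== '');
  if (token === undefined) {
    throw new Error('Set AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN before calling AI Gateway');
  }
  return { Authorization: `Bearer ${token.trim()}` };
}
```

In a Node.js process this would be called as `gatewayAuthHeader(process.env)`. No Z.ai credential appears anywhere, which matches the point that provider credentials are managed by the gateway.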
When to Use GLM-4.6V
Best For
- Frontend development and UI replication: Pixel-accurate HTML/CSS reconstruction from screenshots accelerates development workflows
- Agentic visual workflows: Native multimodal function calling passes images directly as tool inputs without text conversion
- Complex document understanding: Richly formatted pages parsed with joint analysis of text, layout, charts, and figures
- Multi-image analysis and comparison: The context window of 128K tokens fits image sequences and visual datasets
- Interleaved content generation: Mixed image-text outputs produced from complex multimodal inputs
Consider Alternatives When
- Latency-constrained deployments: GLM-4.6V-Flash (9B) offers vision capabilities with lower inference time for local and low-latency workloads
- Text-only capabilities: GLM-4.6 provides the coding-focused model without vision processing overhead
- Simple image tasks: A lighter vision model may be more cost-effective for basic classification or captioning
- Reasoning plus vision: If you need deeper reasoning alongside vision, evaluate newer models in the GLM lineup for their multimodal advances
Conclusion
GLM-4.6V is Z.ai's full-scale 106B vision-language model, combining native multimodal function calling and a context window of 128K tokens. It fits teams building multimodal applications that need frontend replication, document understanding, and agentic visual workflows.
Frequently Asked Questions
What is native multimodal function calling in GLM-4.6V?
It allows you to pass images and screenshots directly as tool inputs in agentic workflows. The model reasons about visual content and calls tools based on what it sees, without requiring intermediate text conversion of images.
How does GLM-4.6V compare to GLM-4.5V?
GLM-4.6V is a major upgrade over GLM-4.5V, which was built on GLM-4.5-Air. The 106B parameter model brings a context window of 128K tokens, native multimodal function calling, interleaved image-text generation, and improved frontend replication.
Can GLM-4.6V reconstruct HTML/CSS from screenshots?
Yes. GLM-4.6V can produce pixel-accurate HTML/CSS from screenshots and apply iterative modifications, making it effective for frontend development and UI replication workflows.
What is the difference between GLM-4.6V and GLM-4.6V-Flash?
GLM-4.6V is the full 106B parameter model for maximum capability. GLM-4.6V-Flash is a 9B parameter lightweight variant designed for local deployment and low-latency applications.
How do I authenticate with GLM-4.6V through AI Gateway?
AI Gateway authenticates requests with a single API key or OIDC token, so no separate Z.ai account is needed. Use the model identifier zai/glm-4.6v to route requests, or configure BYOK (bring your own key) to use your own provider credentials.
What context window does GLM-4.6V support?
128K tokens, supporting extended document analysis, multi-image understanding, and multimodal inputs in a single request.
Does GLM-4.6V support video input?
The model is designed primarily for image and document understanding. For video understanding capabilities, see the GLM model documentation.