GLM-4.6V

GLM-4.6V is Z.ai's full-scale 106B vision-language foundation model with a context window of 128K tokens, native multimodal function calling, interleaved image-text generation, and pixel-accurate frontend replication from screenshots.

Vision (Image) • File Input • Reasoning • Tool Use • Implicit Caching

index.ts

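import { streamText } from 'ai'

// The 'zai/glm-4.6v' model string is routed through AI Gateway.
const result = streamText({
  model: 'zai/glm-4.6v',
  prompt: 'Why is the sky blue?'
})

// Print the response as it streams in.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}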

Playground

Try out GLM-4.6V by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
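
As a rough sketch of setting a provider preference, the snippet below builds the options object that would sit alongside `model` and `prompt` in a `streamText` call. It assumes the gateway reads an ordered list of provider slugs from `providerOptions.gateway.order`; check the docs before relying on that shape. The slug `'zai'` is a placeholder for whatever slug you copy from the table, and `buildGatewayOptions` is a hypothetical helper.

```typescript
// Hypothetical helper: turns a list of provider slugs into gateway routing options.
// Assumption: the gateway reads preferences from providerOptions.gateway.order.
type GatewayOptions = { gateway: { order: string[] } }

function buildGatewayOptions(slugs: string[]): GatewayOptions {
  // Drop blanks and duplicates while keeping the caller's priority order.
  const seen = new Set<string>()
  const order: string[] = []
  for (const raw of slugs) {
    const slug = raw.trim()
    if (slug !== '' && !seen.has(slug)) {
      seen.add(slug)
      order.push(slug)
    }
  }
  return { gateway: { order } }
}

// 'zai' is a placeholder slug; copy the real one from the provider table.
const providerOptions = buildGatewayOptions(['zai', ' zai ', ''])
console.log(JSON.stringify(providerOptions))
```

The resulting object would be passed as `providerOptions` next to `model: 'zai/glm-4.6v'`.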

Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
128K	1.7s	153tps	$0.30/M	$0.90/M	$0.05/M	—	09/30/2025

Legal: Terms • Privacy
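
To make the listed rates concrete, here is a minimal cost estimate using the prices shown for this provider ($0.30/M input, $0.90/M output, $0.05/M cache read). It assumes cached prompt tokens are billed at the cache-read rate instead of the input rate; `estimateCostUSD` and the `Usage` shape are illustrative names, not part of any SDK.

```typescript
// Listed rates in USD per million tokens, taken from the provider row.
const RATES = { input: 0.30, output: 0.90, cacheRead: 0.05 }

interface Usage {
  inputTokens: number       // uncached prompt tokens
  cachedInputTokens: number // prompt tokens served from cache (assumed billed at the read rate)
  outputTokens: number
}

function estimateCostUSD(u: Usage): number {
  const perToken = (ratePerMillion: number) => ratePerMillion / 1_000_000
  return (
    u.inputTokens * perToken(RATES.input) +
    u.cachedInputTokens * perToken(RATES.cacheRead) +
    u.outputTokens * perToken(RATES.output)
  )
}

// 100K uncached input + 20K cached input + 5K output tokens.
const cost = estimateCostUSD({ inputTokens: 100_000, cachedInputTokens: 20_000, outputTokens: 5_000 })
console.log(cost.toFixed(4)) // prints "0.0355"
```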

More models by Z.ai

Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
205K	0.3s	113tps	$1.30/M	$4.30/M	$0.26/M	—	04/07/2026
203K	4.3s	67tps	$1.20/M	$4.00/M	$0.24/M	—	03/15/2026
203K	0.3s	115tps	$0.80/M	$2.56/M	$0.16/M	—	02/12/2026
205K	0.1s	445tps	$2.25/M	$2.75/M	$2.25/M	—	12/22/2025
205K	0.4s	241tps	$0.60/M	$2.20/M	$0.11/M	—	09/30/2025
200K	0.2s	135tps	$0.07/M	$0.40/M	$0.01/M	—	

About GLM-4.6V

GLM-4.6V is Z.ai's full-scale 106B parameter vision-language model, designed for cloud and high-performance cluster deployment. Released September 30, 2025, it upgrades GLM-4.5V with a context window of 128K tokens and native multimodal function calling. Z.ai also reports multimodal benchmark results for it against models of comparable parameter scale.

A defining capability is native multimodal function calling, which lets you pass images and screenshots directly as tool inputs without converting them to text descriptions first. This enables agentic workflows where the model reasons about visual content, decides to call tools based on what it sees, and processes visual results across multiple steps. Combined with interleaved image-text content generation, GLM-4.6V produces mixed-media outputs from complex inputs.
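
The flow described above can be sketched without any SDK. The example builds a user turn that carries the screenshot itself, then runs a tool call of the kind the model might emit after looking at it. The types are local stand-ins loosely modeled on chat-message content parts, the `fileBugReport` tool is hypothetical, and the tool call is hard-coded where a real model response would be.

```typescript
// Local types loosely modeled on chat-message content parts; not imported from any SDK.
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: string } // URL or base64 data
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> }

// Builds one user turn that sends the screenshot itself, with no text description of it.
function screenshotTurn(instruction: string, imageUrl: string): UserMessage {
  return {
    role: 'user',
    content: [
      { type: 'text', text: instruction },
      { type: 'image', image: imageUrl },
    ],
  }
}

// Hypothetical tool the model could choose to call after looking at the image.
type ToolCall = { toolName: string; input: Record<string, string> }
const tools: Record<string, (input: Record<string, string>) => string> = {
  fileBugReport: (input) => `Filed: ${input.title} (${input.screenshotUrl})`,
}

// Runs a tool call emitted by the model and returns the result for the next step.
function runToolCall(call: ToolCall): string {
  const handler = tools[call.toolName]
  if (!handler) throw new Error(`Unknown tool: ${call.toolName}`)
  return handler(call.input)
}

const turn = screenshotTurn('File a bug for any layout problem you see.', 'https://example.com/checkout.png')
// Stand-in for a model response; a real tool call would come back from the API.
const toolResult = runToolCall({
  toolName: 'fileBugReport',
  input: { title: 'Button overlaps footer', screenshotUrl: 'https://example.com/checkout.png' },
})
console.log(turn.content.length, toolResult)
```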

GLM-4.6V handles frontend replication and visual editing: given a screenshot, it can reconstruct pixel-accurate HTML/CSS and apply iterative modifications. This capability, combined with multimodal document understanding (joint interpretation of text, layout, charts, and figures), suits UI development workflows and complex document processing pipelines.
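
One way to structure the iterative part is to keep the whole exchange in a message history, so each change request builds on the markup the model last returned. The sketch below only assembles that history; the message shapes are local stand-ins, `startReplication` and `requestChange` are hypothetical helpers, and `'<html>...</html>'` is a placeholder for real model output.

```typescript
type Part = { type: 'text'; text: string } | { type: 'image'; image: string }
type Message =
  | { role: 'user'; content: Part[] }
  | { role: 'assistant'; content: string }

// First turn: ask for markup that reproduces the screenshot.
function startReplication(screenshotUrl: string): Message[] {
  return [{
    role: 'user',
    content: [
      { type: 'text', text: 'Reproduce this screenshot as a single HTML file with inline CSS.' },
      { type: 'image', image: screenshotUrl },
    ],
  }]
}

// Later turns: append the model's last markup and the requested change,
// so each revision builds on the previous output instead of starting over.
function requestChange(history: Message[], lastHtml: string, change: string): Message[] {
  return [
    ...history,
    { role: 'assistant', content: lastHtml },
    { role: 'user', content: [{ type: 'text', text: change }] },
  ]
}

let history = startReplication('https://example.com/landing.png')
// '<html>...</html>' stands in for markup returned by the model.
history = requestChange(history, '<html>...</html>', 'Make the header sticky.')
console.log(history.map((m) => m.role).join(' > ')) // prints "user > assistant > user"
```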

What To Consider When Choosing a Provider

Configuration: At 106B parameters, GLM-4.6V is designed for cloud deployment. For latency-sensitive applications, weigh the added inference time and evaluate GLM-4.6V-Flash (9B) where speed matters more than peak capability.
Configuration: When building agentic pipelines, GLM-4.6V's native ability to process images as tool inputs eliminates the need for intermediate text conversion. Design your tool schemas to accept image inputs directly.
Configuration: The context window of 128K tokens supports extended document and multi-image understanding. For workflows processing many images or long documents, this reduces the need to chunk inputs across multiple requests.
Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
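
For the point about tool schemas, a tool that accepts an image directly might look roughly like the sketch below. The `cropRegion` tool, its fields, and the `validateInput` check are all hypothetical; the schema follows JSON Schema conventions, and how you register it depends on the SDK you use.

```typescript
// Hypothetical tool definition whose input is an image reference rather than a caption.
const cropRegionTool = {
  name: 'cropRegion',
  description: 'Crop a rectangular region out of an image for closer inspection.',
  inputSchema: {
    type: 'object',
    properties: {
      imageUrl: { type: 'string', description: 'URL or data URI of the source image' },
      x: { type: 'number' },
      y: { type: 'number' },
      width: { type: 'number' },
      height: { type: 'number' },
    },
    required: ['imageUrl', 'x', 'y', 'width', 'height'],
  },
} as const

// Minimal check that a tool call supplies every required field with the declared type.
function validateInput(input: Record<string, unknown>): string[] {
  const { properties, required } = cropRegionTool.inputSchema
  const errors: string[] = []
  for (const key of required) {
    const expected = properties[key].type
    if (typeof input[key] !== expected) errors.push(`${key}: expected ${expected}`)
  }
  return errors
}

// A complete call passes; a call with a string x and missing fields is rejected.
console.log(validateInput({ imageUrl: 'https://example.com/page.png', x: 0, y: 0, width: 320, height: 200 }))
console.log(validateInput({ imageUrl: 'https://example.com/page.png', x: '0' }))
```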

When to Use GLM-4.6V

Best For

Frontend development and UI replication: Pixel-accurate HTML/CSS reconstruction from screenshots accelerates development workflows
Agentic visual workflows: Native multimodal function calling passes images directly as tool inputs without text conversion
Complex document understanding: Richly formatted pages parsed with joint analysis of text, layout, charts, and figures
Multi-image analysis and comparison: The context window of 128K tokens fits image sequences and visual datasets
Interleaved content generation: Mixed image-text outputs produced from complex multimodal inputs
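
For multi-image and long-document work, a simple planning step can tell you whether everything fits in one request. The sketch below packs items into requests in order. It assumes 128K means 128,000 tokens, reserves part of the window for the reply, and relies on per-item token estimates that you supply yourself, since no per-image token cost is stated here. `planRequests` is a hypothetical helper.

```typescript
// Assumption: 128K is treated as 128,000 tokens, and part of it is reserved for the reply.
const CONTEXT_TOKENS = 128_000

interface Item { id: string; estimatedTokens: number } // estimates supplied by the caller

// Greedily packs items, in order, into requests that fit the input budget.
function planRequests(items: Item[], reservedForOutput = 8_000): string[][] {
  const budget = CONTEXT_TOKENS - reservedForOutput
  const requests: string[][] = []
  let current: string[] = []
  let used = 0
  for (const item of items) {
    if (item.estimatedTokens > budget) {
      throw new Error(`${item.id} alone exceeds the input budget`)
    }
    // Start a new request when the next item would overflow the current one.
    if (used + item.estimatedTokens > budget) {
      requests.push(current)
      current = []
      used = 0
    }
    current.push(item.id)
    used += item.estimatedTokens
  }
  if (current.length > 0) requests.push(current)
  return requests
}

const plan = planRequests([
  { id: 'scan-1', estimatedTokens: 50_000 },
  { id: 'scan-2', estimatedTokens: 60_000 },
  { id: 'scan-3', estimatedTokens: 30_000 },
])
console.log(plan) // two requests: scan-1 and scan-2 together, then scan-3
```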

Consider Alternatives When

Latency-constrained deployments: GLM-4.6V-Flash (9B) offers vision capabilities with lower inference time for local and low-latency workloads
Text-only workloads: GLM-4.6 is the coding-focused text model and avoids vision processing overhead
Simple image tasks: A lighter vision model may be more cost-effective for basic classification or captioning
Reasoning plus vision: Evaluate newer models in the GLM lineup if you need deeper reasoning combined with more recent multimodal capabilities
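
The guidance in these two lists can be reduced to a small routing rule. This is only a sketch: `pickModel` is a hypothetical helper, and of the three identifiers only `'zai/glm-4.6v'` appears on this page. The other two are placeholders you would replace with the real model identifiers.

```typescript
// Only 'zai/glm-4.6v' appears on this page; the other two identifiers are placeholders.
type Task = { needsVision: boolean; latencySensitive: boolean }

function pickModel(task: Task): string {
  if (!task.needsVision) return 'zai/glm-4.6'            // text-only: skip vision overhead
  if (task.latencySensitive) return 'zai/glm-4.6v-flash' // lighter 9B variant
  return 'zai/glm-4.6v'                                   // full 106B model
}

console.log(pickModel({ needsVision: true, latencySensitive: false }))
```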

Conclusion

GLM-4.6V is Z.ai's full-scale 106B vision-language model, combining native multimodal function calling and a context window of 128K tokens. It fits teams building multimodal applications that need frontend replication, document understanding, and agentic visual workflows.

Frequently Asked Questions

What is native multimodal function calling in GLM-4.6V?

It allows you to pass images and screenshots directly as tool inputs in agentic workflows. The model reasons about visual content and calls tools based on what it sees, without requiring intermediate text conversion of images.

How does GLM-4.6V compare to GLM-4.5V?

GLM-4.6V is a major upgrade over GLM-4.5V (which was built on GLM-4.5-Air): 106B parameters, a context window of 128K tokens, native multimodal function calling, interleaved image-text generation, and improved frontend replication.

Can GLM-4.6V reconstruct HTML/CSS from screenshots?

Yes. GLM-4.6V can produce pixel-accurate HTML/CSS from screenshots and apply iterative modifications, making it effective for frontend development and UI replication workflows.

What is the difference between GLM-4.6V and GLM-4.6V-Flash?

GLM-4.6V is the full 106B parameter model for maximum capability. GLM-4.6V-Flash is a 9B parameter lightweight variant designed for local deployment and low-latency applications.

How do I authenticate with GLM-4.6V through AI Gateway?

AI Gateway provides a unified API key. No separate Z.ai account is needed. Use the model identifier to route requests, or configure BYOK for direct provider access.

What context window does GLM-4.6V support?

128K tokens, supporting extended document analysis, multi-image understanding, and multimodal inputs in a single request.

Does GLM-4.6V support video input?

The model is designed primarily for image and document understanding. For dedicated video understanding capabilities, see the GLM model documentation for video-specific features.
