Running GLM 4.7 MLX 8-bit GS32 on M3 Ultra 512GB: A Deep Dive into Preserved Thinking and Local Agentic Coding

A comprehensive guide to running GLM 4.7 with MLX 8-bit GS32 quantization on Apple M3 Ultra with 512GB unified memory, leveraging Preserved Thinking and Interleaved Thinking modes with LM Studio and opencode.

Executive Summary

Running frontier AI models locally has shifted from experimental curiosity to production-ready reality. This guide explores the optimal configuration for running GLM 4.7 (353B parameters) using MLX 8-bit GS32 quantization on an Apple M3 Ultra with 512GB unified memory—a setup that represents the current sweet spot for local agentic coding workflows. Unlike cloud APIs that charge per token and impose rate limits, this configuration delivers sustained 8-12 tokens per second with 128K-token context windows, no usage quotas, and full data sovereignty.

The real breakthrough isn’t just the hardware capability—it’s GLM 4.7’s three thinking modes: Interleaved Thinking, Preserved Thinking, and Turn-level Thinking. These features fundamentally change how local models handle multi-turn conversations, tool calling, and complex reasoning chains. When properly configured with opencode and LM Studio, this setup enables coding agents that maintain coherent reasoning across dozens of file edits and terminal commands, rivaling the performance of Claude Code and GPT-5 class models at zero marginal cost.

Key Insight: The 8-bit GS32 quantization at 9 bits per weight achieves an optimal balance—delivering ~95% of FP16 quality while fitting comfortably within the M3 Ultra’s 512GB unified memory with room for 128K+ context windows.

1. Why This Configuration Matters

1.1 The Hardware Sweet Spot

The M3 Ultra represents a unique inflection point for local AI deployment. With 512GB of unified memory shared between CPU and GPU, it eliminates the PCIe bottleneck that cripples multi-GPU setups. This matters because GLM 4.7’s 353B total parameters (of which only a fraction are active per token) require approximately 397GB in 8-bit GS32 quantization—leaving roughly 115GB for context windows, KV cache, and system overhead.

| Configuration | Memory Required | Fits M3 Ultra 512GB? | Performance |
|---|---|---|---|
| FP16 (BF16) | ~706GB | ❌ No | Baseline |
| INT8 | ~353GB | ⚠️ Tight | ~90% quality |
| 8-bit GS32 | ~397GB | ✅ Yes | ~95% quality |
| 6-bit | ~265GB | ✅ Yes, with room | ~92% quality |
| 4-bit | ~177GB | ✅ Yes, with room | ~88% quality |

The GS32 (group size 32) quantization method is particularly important—each group of 32 weights carries its own scale and bias, which works out to roughly 9 effective bits per weight rather than 8, preserving more precision in the weight matrices while still achieving significant compression. This is why community benchmarks show 8-bit GS32 hitting 95.7% on AIME 2025 compared to 94.2% for standard INT8.
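The ~397GB figure falls straight out of the effective bits-per-weight. A back-of-the-envelope check (the helper name is illustrative; the ~9 bits/weight includes per-group scale overhead):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough in-memory size of quantized weights, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

total_params = 353e9  # GLM 4.7, mlx-community quantized version

print(f"8-bit GS32 (~9 bpw): {quantized_size_gb(total_params, 9):.0f} GB")    # ~397 GB
print(f"FP16 baseline (16 bpw): {quantized_size_gb(total_params, 16):.0f} GB")  # ~706 GB
```

The same formula reproduces the other rows of the table above (e.g. 4-bit at ~4 bpw gives ~177GB).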

1.2 The Model: GLM 4.7’s Architecture

GLM 4.7 is a Mixture-of-Experts (MoE) model with approximately 353B total parameters but only a subset activated per token. Unlike dense models where every parameter participates in every forward pass, GLM 4.7’s MoE architecture activates roughly 12-15% of parameters per token—dramatically reducing compute requirements while maintaining capacity.
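Using the rough 12-15% activation figure above, the per-token compute works out to a few tens of billions of parameters (illustrative arithmetic only):

```python
total_params = 353e9  # total parameters

# ~12-15% of parameters participate in each forward pass
for frac in (0.12, 0.15):
    print(f"{frac:.0%} active: ~{total_params * frac / 1e9:.0f}B parameters per token")
```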

Key specifications:

  • Total Parameters: 353B (mlx-community quantized version)
  • Architecture: Mixture-of-Experts with shared attention layers
  • Context Window: 128K tokens (tested), theoretically 200K
  • License: MIT (fully open-weight)
  • Training Data: Up to December 2024

The model’s design philosophy centers on agentic workflows rather than chat. It’s explicitly optimized for:

  • Multi-file code editing (SWE-bench Verified: 73.8%)
  • Terminal-based automation (Terminal Bench 2.0: 41.0%)
  • Tool use and function calling (τ²-Bench: 87.4%)
  • Long-context reasoning with maintained coherence

2. Understanding GLM 4.7’s Three Thinking Modes

The true differentiator for GLM 4.7 isn’t raw benchmark scores—it’s the thinking architecture that enables stable multi-turn reasoning. Understanding these modes is critical for maximizing local performance.

2.1 Interleaved Thinking: Reasoning Between Actions

Interleaved Thinking (supported since GLM 4.5) allows the model to think between tool calls and after receiving tool results. This isn’t just generating text—it’s maintaining an explicit reasoning chain that interprets each tool output before deciding the next action.

How it works:

  1. User requests complex task (e.g., “Debug this failing test suite”)
  2. Model generates reasoning block: “I need to first examine the test file to understand what’s failing, then look at the implementation…”
  3. Model calls tool (read_file)
  4. Model receives tool result
  5. Model generates new reasoning block based on result: “The test is failing because of a null pointer exception on line 47. I should check the constructor…”
  6. Model calls next tool

Critical Implementation Detail: When using Interleaved Thinking with tools, thinking blocks must be explicitly preserved and returned together with tool results. If you drop the reasoning_content, the model loses coherence across turns.
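On the client side, "preserved and returned together with tool results" amounts to appending the assistant turn to the history with its reasoning_content field intact, followed by the tool message. A minimal sketch (the reasoning_content field name follows Z.ai's API; the helper names are made up for illustration):

```python
def append_assistant_turn(history: list, message: dict) -> list:
    """Append an assistant turn, carrying its reasoning block forward unchanged."""
    turn = {"role": "assistant", "content": message.get("content", "")}
    if message.get("tool_calls"):
        turn["tool_calls"] = message["tool_calls"]
    # The critical step: do NOT drop or edit reasoning_content.
    if "reasoning_content" in message:
        turn["reasoning_content"] = message["reasoning_content"]
    history.append(turn)
    return history

def append_tool_result(history: list, tool_call_id: str, result: str) -> list:
    """Tool output follows the assistant turn that requested it."""
    history.append({"role": "tool", "tool_call_id": tool_call_id, "content": result})
    return history
```

If `append_assistant_turn` silently dropped `reasoning_content`, each subsequent turn would start from the plain text alone, which is exactly the coherence loss described above.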

2.2 Preserved Thinking: Maintaining Reasoning State

Preserved Thinking is GLM 4.7’s breakthrough feature for coding scenarios. It allows the model to retain reasoning content from previous assistant turns in the context, preserving reasoning continuity across multi-turn conversations.

Why this matters:

  • Without preserved thinking: Each turn starts fresh. The model might contradict previous reasoning or lose track of the overall plan.
  • With preserved thinking: The model maintains a consistent chain of thought across 20+ file edits, remembering why certain decisions were made and adapting the plan based on new information.

Implementation in opencode:

{
  "glm-4.7-gs32": {
    "name": "glm-4.7-gs32",
    "tool_call": true,
    "reasoning": true,
    "options": {
      "extra_body": {
        "clear_thinking": false
      }
    }
  }
}

The key parameter is "clear_thinking": false. This tells the model to NOT clear reasoning blocks between turns, enabling Preserved Thinking mode. When set to true (or omitted), the model clears reasoning content after each response, which is faster but loses continuity.

Important: All consecutive reasoning_content blocks must exactly match the original sequence generated by the model. Do not reorder or edit these blocks—doing so degrades performance and reduces cache hit rates.

2.3 Turn-level Thinking: Dynamic Control

Turn-level Thinking lets you control reasoning computation on a per-request basis. Within the same session, each request can independently choose to enable or disable thinking.

Use cases:

  • Disable thinking: Quick factual queries (“What does this function do?”), simple edits (“Change this variable name”)
  • Enable thinking: Complex planning, debugging, architectural decisions, multi-file refactoring

Benefits:

  • Cost/Latency Control: Lightweight turns get faster responses; heavy tasks get deeper reasoning
  • Smooth Multi-turn Experience: The model feels “smarter when things are hard, faster when things are simple”
  • Agent Optimization: Reduce reasoning overhead on tool execution turns; enable deep thinking on decision turns

To enable/disable thinking dynamically, modify the request:

# Assumes an OpenAI-compatible client, e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Enable thinking for complex task
response = client.chat.completions.create(
    model="glm-4.7",
    messages=messages,
    extra_body={
        "thinking": {
            "type": "enabled",
            "clear_thinking": False  # Preserve across turns (Python bool, not JSON false)
        }
    }
)

# Disable thinking for simple query
response = client.chat.completions.create(
    model="glm-4.7",
    messages=messages,
    extra_body={
        "thinking": {
            "type": "disabled"
        }
    }
)

3. LM Studio Setup Guide

3.1 Installation and Configuration

Step 1: Download and install LM Studio from lmstudio.ai. The M3 Ultra version is optimized for Apple Silicon.

Step 2: Download the model. In LM Studio, navigate to the model search (Cmd+Shift+M) and search for “GLM-4.7-8bit-gs32”. Select the mlx-community version:

lms get mlx-community/GLM-4.7-8bit-gs32

Step 3: Configure the model parameters. In LM Studio’s model settings:

| Parameter | Recommended Value | Notes |
|---|---|---|
| Context Length | 65536-131072 (64K-128K) | Start with 64K, test up to 128K |
| Temperature | 0.7 | Official Z.ai recommendation for coding/agentic tasks |
| Top P | 1.0 | Official Z.ai recommendation (not 0.9) |
| Top K | 0 (disabled) or default | Not specified by Z.ai; let Top P handle sampling |
| Repeat Penalty | 1.0 | See detailed explanation below |

Official Z.ai Parameters vs. General Recommendations

These settings come directly from the GLM-4.7 model card (Z.ai official):

  • Default tasks: Temperature 1.0, Top P 0.95
  • Coding/Agentic tasks (what you’re doing): Temperature 0.7, Top P 1.0
  • Tool use (τ²-Bench): Temperature 0 (greedy), Top P 1.0

Why These Specific Values?

  • Temperature 0.7: Balances creativity with determinism for coding. Lower than the default 1.0 to reduce hallucinations, but not so low (0.0-0.3) that the model becomes rigid.
  • Top P 1.0: Lets the model consider the full probability distribution. Unlike many models that use 0.9, GLM 4.7 was benchmarked with 1.0 for coding tasks.
  • Top K: Z.ai doesn’t specify this parameter, meaning it’s not critical. Set to 0 (disabled) to rely entirely on Top P, or leave at LM Studio’s default.
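When wiring these into request code, a small lookup keeps the three presets straight. The values below are the model-card numbers quoted above; the function itself is illustrative glue, not an official API:

```python
# Official Z.ai sampling presets for GLM 4.7, as quoted above.
SAMPLING_PRESETS = {
    "default":  {"temperature": 1.0, "top_p": 0.95},
    "coding":   {"temperature": 0.7, "top_p": 1.0},
    "tool_use": {"temperature": 0.0, "top_p": 1.0},  # greedy decoding
}

def sampling_for(task: str) -> dict:
    """Return sampling parameters for a task type, falling back to the default."""
    return dict(SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["default"]))

print(sampling_for("coding"))  # {'temperature': 0.7, 'top_p': 1.0}
```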

Why Set Repeat Penalty to 1.0?

In LM Studio, set the Repeat Penalty slider to 1.0 (not “turned off” via a toggle—there is no toggle, just change the numeric value from the default ~1.1-1.18 down to 1.0).

Reasoning: Default repeat penalties (1.1-1.18) tell the model “don’t repeat tokens,” which interferes with thinking modes because:

  • Reasoning involves natural repetition (“Let me check… let me verify… this is important because…”)
  • Penalties can cut off reasoning mid-thought or cause abrupt topic changes
  • A value of 1.0 = neutral (no penalty), allowing the model to reason naturally and maintain coherent thought chains

Note: MLX models use Apple’s MLX framework, which handles batch processing automatically—there’s no manual “batch size” setting in the GUI.

Step 4: Enable thinking mode in LM Studio’s advanced settings. Since LM Studio doesn’t yet expose the clear_thinking parameter in the UI, you’ll need to use the Server API mode with custom parameters for full Preserved Thinking support.

3.2 Server Mode for Advanced Features

To access Preserved Thinking with LM Studio, run it in server mode:

# Start LM Studio server
lms server start

# Configure with preserved thinking
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/GLM-4.7-8bit-gs32",
    "messages": [{"role": "user", "content": "Hello"}],
    "thinking": {
      "type": "enabled",
      "clear_thinking": false
    }
  }'
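From Python, the same request body can be constructed as follows (construction only; POST it with any HTTP client, or pass the thinking object via the OpenAI SDK's extra_body parameter, which merges its keys into the request body). Whether LM Studio forwards the thinking block to the model unchanged is worth verifying on your build:

```python
import json

# Request body for POST http://localhost:1234/v1/chat/completions.
# The "thinking" field names follow Z.ai's API documentation.
payload = {
    "model": "mlx-community/GLM-4.7-8bit-gs32",
    "messages": [{"role": "user", "content": "Hello"}],
    "thinking": {"type": "enabled", "clear_thinking": False},
}
print(json.dumps(payload, indent=2))
```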

4. Opencode Integration

4.1 Configuration Analysis

The opencode.json configuration provided represents the optimal setup for agentic coding with GLM 4.7:

{
  "glm-4.7-gs32": {
    "name": "glm-4.7-gs32",
    "tool_call": true,
    "reasoning": true,
    "options": {
      "extra_body": {
        "clear_thinking": false
      }
    }
  }
}

Breaking down each parameter:

  • "tool_call": true: Enables function calling capabilities. GLM 4.7 supports parallel tool calling and can chain multiple tools with reasoning between each call.
  • "reasoning": true: Tells opencode to expect and handle reasoning_content blocks. Without this, reasoning tokens may be dropped or displayed as regular text.
  • "clear_thinking": false: Critical for Preserved Thinking. This prevents the model from clearing reasoning blocks between turns, maintaining continuity across multi-file edits.

4.2 Complete Opencode Configuration

For a complete setup with LM Studio local server:

{
  "models": {
    "local-glm-4.7": {
      "name": "local-glm-4.7",
      "provider": "openai-compatible",
      "base_url": "http://localhost:1234/v1",
      "model": "mlx-community/GLM-4.7-8bit-gs32",
      "api_key": "not-needed",
      "tool_call": true,
      "reasoning": true,
      "options": {
        "extra_body": {
          "thinking": {
            "type": "enabled",
            "clear_thinking": false
          }
        }
      }
    }
  },
  "default_model": "local-glm-4.7"
}

4.3 Handling Reasoning Content

When using GLM 4.7 with opencode, the model returns both content and reasoning_content. The reasoning_content contains the model’s internal monologue—its planning, analysis, and decision-making process.

Example workflow:

# User asks: "Refactor this authentication module to use JWT"

# Turn 1: Model analyzes and plans
{
  "reasoning_content": "The user wants to refactor authentication to use JWT. I need to:\n1. Examine the current auth implementation\n2. Identify where sessions are currently handled\n3. Plan JWT integration strategy\n4. Update relevant files\n\nLet me start by reading the current auth module.",
  "content": "I'll help you refactor the authentication module to use JWT. Let me start by examining the current implementation.",
  "tool_calls": [{"name": "read_file", "arguments": {"path": "src/auth.js"}}]
}

# Turn 2: After tool result, model continues reasoning
{
  "reasoning_content": "Current auth uses session-based storage with express-session. To migrate to JWT:\n1. Need to replace session middleware with JWT verification\n2. Update login endpoint to generate tokens\n3. Add token refresh logic\n4. Update protected route middleware\n\nI'll start by installing the jsonwebtoken package and creating the JWT utility.",
  "content": "I can see you're currently using session-based authentication. I'll migrate this to JWT by creating a token utility and updating the middleware.",
  "tool_calls": [{"name": "write_file", "arguments": {"path": "src/jwt.js", "content": "..."}}]
}

Key Point: The reasoning_content from Turn 1 must be passed back to the model in Turn 2’s context. This is what "clear_thinking": false enables—it tells the model to maintain this reasoning chain.
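Concretely, the Turn-2 request for the JWT example is assembled like this (message shapes follow the OpenAI-compatible format; the reasoning and content strings are abbreviated from the turns above):

```python
# Turn-1 assistant message, kept verbatim -- reasoning_content included.
turn1_assistant = {
    "role": "assistant",
    "reasoning_content": "The user wants to refactor authentication to use JWT. ...",
    "content": "I'll help you refactor the authentication module to use JWT. ...",
    "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "read_file", "arguments": '{"path": "src/auth.js"}'},
    }],
}

messages = [
    {"role": "user", "content": "Refactor this authentication module to use JWT"},
    turn1_assistant,  # unedited, in original order
    {"role": "tool", "tool_call_id": "call_1", "content": "<contents of src/auth.js>"},
]
```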

5. Performance Benchmarks and Expectations

5.1 Real-World Performance on M3 Ultra 512GB

Based on community testing and benchmarks, here’s what to expect:

| Metric | 8-bit GS32 | Notes |
|---|---|---|
| Memory Usage | ~397GB | Fits comfortably in 512GB |
| Context Window | 128K tokens | Tested stable; 200K possible |
| Prefill Speed | 450-600 tokens/sec | Input processing |
| Generation Speed | 8-12 tokens/sec | Output generation |
| Time to First Token | 2-5 seconds | Depends on prompt length |

Comparative Context:

  • Claude 3.5 Sonnet (API): ~50-80 tokens/sec but with rate limits
  • GPT-4 Turbo (API): ~30-50 tokens/sec with usage caps
  • Local Qwen3-30B-A3B: ~25-35 tokens/sec but less capable
  • Local GLM 4.7 Flash: ~35-50 tokens/sec but smaller model

The 8-12 tokens/sec might seem slow compared to cloud APIs, but it’s sustained—no rate limits, no token quotas, no network latency. For coding workflows where the model is making intelligent decisions across 20+ files, the coherence benefits of local deployment often outweigh raw speed.

5.2 Optimization Tips

1. Use Batch Processing for Multiple Files

Instead of editing files one at a time, batch related changes:

// Instead of 10 separate tool calls
// Use 1 batch call with multiple edits
{
  "tool_calls": [
    {"name": "edit_file", "arguments": {"path": "src/auth.js", ...}},
    {"name": "edit_file", "arguments": {"path": "src/middleware.js", ...}},
    {"name": "edit_file", "arguments": {"path": "src/routes.js", ...}}
  ]
}

2. Enable KV Cache Quantization

In LM Studio, enable KV cache quantization to 8-bit. This reduces memory usage for long contexts:

Settings → Advanced → KV Cache → Quantization → 8-bit

3. Optimize Context Window Dynamically

Don’t use 128K context for simple queries. Adjust based on task:

  • Simple edits: 4K-8K context
  • File refactoring: 16K-32K context
  • Multi-file architecture: 64K-128K context

4. Use Preserved Thinking Strategically

Enable "clear_thinking": false for:

  • Multi-file refactoring projects
  • Debugging sessions requiring context
  • Long agentic workflows

Disable (set to true) for:

  • Simple Q&A
  • Single-file edits
  • Quick lookups
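The context-size tiers and the clear_thinking guidance above can be folded into one per-task profile. A sketch (the task names and tier boundaries are illustrative, not an opencode feature):

```python
# Illustrative task profiles; tiers mirror the guidance above.
TASK_PROFILES = {
    "qa":            {"context_tokens": 4_096,   "clear_thinking": True},
    "simple_edit":   {"context_tokens": 8_192,   "clear_thinking": True},
    "file_refactor": {"context_tokens": 32_768,  "clear_thinking": False},
    "multi_file":    {"context_tokens": 131_072, "clear_thinking": False},
}

def request_options(task: str) -> dict:
    """Build per-request options from a task profile (defaults to 'qa')."""
    p = TASK_PROFILES.get(task, TASK_PROFILES["qa"])
    return {
        "max_context": p["context_tokens"],
        "extra_body": {"thinking": {"type": "enabled",
                                    "clear_thinking": p["clear_thinking"]}},
    }
```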

6. Advanced Workflows

6.1 Multi-Agent Orchestration

With 512GB of memory, you can run multiple instances simultaneously:

# Terminal 1: Main coding agent
opencode --model local-glm-4.7 --workspace ./project

# Terminal 2: Documentation agent
opencode --model local-glm-4.7 --workspace ./project/docs

# Terminal 3: Testing agent  
opencode --model local-glm-4.7 --workspace ./project/tests

The loaded model occupies ~397GB, so only one full GLM 4.7 instance fits in 512GB—but a single LM Studio server can serve all three opencode sessions concurrently from the same loaded weights. Alternatively, run one GLM 4.7 instance alongside smaller specialized models (Qwen3-30B, Nemotron 3 Nano) for specific tasks.

6.2 RAG with Massive Context

GLM 4.7’s 128K context window enables true “whole codebase” understanding:

# Load entire codebase into context
with open('codebase_snapshot.txt', 'r') as f:
    full_codebase = f.read()

messages = [
    {"role": "system", "content": "You are analyzing an entire codebase."},
    {"role": "user", "content": f"Analyze this codebase for security vulnerabilities:\n\n{full_codebase}"}
]

# Model can reason across entire codebase at once

With Preserved Thinking enabled, the model maintains analysis context across follow-up questions without re-loading the codebase.
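Before dumping a codebase into the prompt, a rough token estimate guards against silently overflowing the window. The 4-characters-per-token ratio is a crude rule of thumb, not GLM's actual tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text and code."""
    return len(text) // 4

def fits_context(text: str, context_tokens: int = 131_072, reserve: int = 16_384) -> bool:
    """Leave headroom for the system prompt, reasoning blocks, and the reply."""
    return rough_token_count(text) <= context_tokens - reserve

sample = "def add(a, b):\n    return a + b\n" * 1000
print(rough_token_count(sample), fits_context(sample))
```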

6.3 Terminal Automation at Scale

GLM 4.7 excels at terminal-based workflows. Combined with Preserved Thinking, it can execute complex deployment pipelines:

User: "Deploy this app to production with zero downtime"

Model reasoning: "Zero-downtime deployment requires:\n1. Health checks on current deployment\n2. Blue-green deployment strategy\n3. Database migration planning\n4. Rollback procedure\n5. Traffic cutover\n\nLet me start by checking current deployment status..."

[Tool call: kubectl get deployments]
[Tool call: kubectl get pods]
[Tool call: helm list]

Model reasoning: "Current deployment shows 3 replicas running. Database is on version 47. I'll deploy to the green environment first, run migrations, then cutover traffic..."

[Tool call: helm upgrade --install app-green ./chart]
[Tool call: kubectl rollout status deployment/app-green]
[Tool call: kubectl exec -it pod/db-runner -- migrate up]

The Preserved Thinking mode ensures the model remembers each step of the deployment strategy across 15+ tool calls.

7. Troubleshooting Common Issues

7.1 “Model gets stuck in loops”

Symptom: Model repeats the same reasoning or tool calls indefinitely.

Solution:

  • Check that reasoning_content is being passed back correctly
  • Ensure "clear_thinking": false is set if you want continuity
  • Add explicit instructions: “You have already examined this file. Move to the next step.”

7.2 “Out of memory errors”

Symptom: System reports memory pressure or model fails to load.

Solution:

  • Close other applications (Chrome can use 50GB+ with many tabs)
  • Reduce context window to 64K
  • Enable KV cache quantization
  • Check for memory leaks in opencode (restart if needed)

7.3 “Slow token generation”

Symptom: Sub-5 tokens/sec performance.

Solution:

  • Ensure MLX is using GPU, not CPU fallback
  • Check macOS Activity Monitor for memory pressure (yellow/red)
  • Lower the context length or number of concurrent requests if memory pressure is throttling generation
  • Disable unnecessary system processes

7.4 “Reasoning content not displayed”

Symptom: Model works but reasoning isn’t visible.

Solution:

  • Verify "reasoning": true in opencode.json
  • Check opencode version (thinking mode support added in recent versions)
  • Enable verbose logging: opencode --verbose

8. Economic Analysis: Local vs. Cloud

8.1 Cost Comparison

Hardware Cost:

  • M3 Ultra Mac Studio (512GB): ~$12,000
  • Amortized over 3 years: $333/month

Cloud API Costs (for equivalent usage):

| Usage Pattern | Claude 3.5 Sonnet | GPT-4 Turbo | Local GLM 4.7 |
|---|---|---|---|
| Light (1M tokens/month) | $30/month | $60/month | $0 |
| Medium (10M tokens/month) | $300/month | $600/month | $0 |
| Heavy (100M tokens/month) | $3,000/month | $6,000/month | $0 |
| Break-even (at heavy usage) | 4 months | 2 months | N/A |

Key Insight: At heavy usage, local deployment pays for itself within 2-4 months; even at $1,000/month of API spend, the hardware amortizes within a year. More importantly, there are no rate limits—critical for agentic workflows that can consume millions of tokens in a single session.
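The break-even row in the table is simple division of hardware cost by monthly API spend (ignoring electricity and resale value):

```python
import math

HARDWARE_COST = 12_000  # M3 Ultra Mac Studio, 512GB

def breakeven_months(monthly_api_spend: float) -> int:
    """Months of API spend needed to cover the hardware cost."""
    return math.ceil(HARDWARE_COST / monthly_api_spend)

print(breakeven_months(3_000))  # heavy Claude-tier spend -> 4 months
print(breakeven_months(6_000))  # heavy GPT-4 Turbo-tier spend -> 2 months
```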

8.2 Hidden Benefits

Data Sovereignty: Code never leaves your machine. Critical for:

  • Financial services
  • Healthcare
  • Proprietary algorithms
  • Security-sensitive projects

Availability: 100% uptime, no API outages, no service degradation during peak hours.

Customization: Fine-tune on proprietary codebases without data sharing agreements.

9. Future Outlook

9.1 MLX Framework Evolution

Apple’s MLX framework is rapidly evolving. Expected improvements in 2026:

  • MLX 2.0: Better memory management for 400GB+ models
  • Flash Attention 3: 2-3x speedup for long contexts
  • Sparse Attention: Support for 1M+ context windows
  • Unified Memory Optimization: Better utilization of 512GB+ configurations

9.2 GLM 4.7 Ecosystem

Community projects extending GLM 4.7 capabilities:

  • GLM Code Extensions: VSCode plugin with thinking visualization
  • Agent Frameworks: LangChain and AutoGPT integrations
  • Fine-tuning Guides: Domain-specific versions for legal, medical, financial code

9.3 Hardware Trajectory

The M4 Ultra (expected late 2026) will likely feature:

  • 40-80% faster Neural Engine
  • 512GB-1TB unified memory options
  • Better power efficiency for sustained loads

For current M3 Ultra owners, this represents a 3-4 year competitive window before an upgrade is necessary.

10. Conclusion

Running GLM 4.7 MLX 8-bit GS32 on an M3 Ultra with 512GB unified memory represents the current pinnacle of local AI deployment. The combination of:

  1. 353B parameter frontier model with 95.7% AIME 2025 performance
  2. Three thinking modes enabling coherent multi-turn reasoning
  3. Zero marginal cost with no rate limits or data exposure
  4. 8-12 tokens/sec sustained performance across 128K-token context windows

…creates a viable alternative to Claude Code and GPT-5 for serious development work.

The key to unlocking this potential is understanding and properly configuring the thinking modes. Preserved Thinking ("clear_thinking": false) transforms GLM 4.7 from a stateless chatbot into a true coding partner that maintains context across complex, multi-file workflows. Interleaved Thinking enables sophisticated tool use with reasoning between each action. Turn-level Thinking provides the flexibility to optimize for speed or depth on a per-request basis.

For developers already spending hundreds monthly on AI APIs, the economics are compelling. But beyond cost savings, local deployment offers something cloud APIs cannot: complete control. Your code never leaves your machine. Your workflows aren’t subject to rate limits. Your agent can think for hours without interruption.

The infrastructure for truly powerful local AI has arrived. The question isn’t whether you can afford to deploy it—it’s whether you can afford not to.


Quick Reference Card

Model: mlx-community/GLM-4.7-8bit-gs32
Memory Required: ~397GB
Context Window: 128K tokens (tested)
Generation Speed: 8-12 tokens/sec
Hardware: Apple M3 Ultra 512GB unified memory

Opencode Config:

{
  "glm-4.7-gs32": {
    "name": "glm-4.7-gs32",
    "tool_call": true,
    "reasoning": true,
    "options": {
      "extra_body": {
        "clear_thinking": false
      }
    }
  }
}

LM Studio Download:

lms get mlx-community/GLM-4.7-8bit-gs32

Thinking Modes:

  • Preserved: "clear_thinking": false (maintains reasoning across turns)
  • Interleaved: Default with tools (reasoning between tool calls)
  • Turn-level: Dynamic enable/disable per request

Sources

  1. Z.ai, GLM-4.7 Technical Report and Thinking Mode Documentation (Dec 2025) – https://docs.z.ai/guides/capabilities/thinking-mode

  2. mlx-community, GLM-4.7-8bit-gs32 Model Card – https://huggingface.co/mlx-community/GLM-4.7-8bit-gs32

  3. Z.ai, GLM-4.7 Release Announcement (Dec 22, 2025) – https://z.ai/blog/glm-4.7

  4. Apple, M3 Ultra Technical Specifications – https://www.apple.com/mac-studio/

  5. MLX Framework, Apple Silicon Inference Guide – https://ml-explore.github.io/mlx

  6. LM Studio, Model Documentation – https://lmstudio.ai/models/glm-4.7

  7. Opencode Documentation, Model Configuration – https://opencode.ai

  8. Unsloth, GLM-4.7 Local Deployment Guide – https://unsloth.ai/docs/models/glm-4.7

  9. Hacker News Discussion, GLM-4.7 Performance Analysis (Dec 2025) – https://news.ycombinator.com/item?id=46357287

  10. Algustionesa Yoshi, GLM-4.7: Benchmarks, Local Hardware, and Real Costs (Dec 2025) – https://algustionesa.com/glm-4-7-benchmarks-local-hardware-and-real-costs/


Published February 11, 2026