Instructions to use mudler/Step-3.7-Flash-APEX-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mudler/Step-3.7-Flash-APEX-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="mudler/Step-3.7-Flash-APEX-GGUF",
	filename="Step-3.7-Flash-APEX-Balanced.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use mudler/Step-3.7-Flash-APEX-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Use Docker

docker model run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16

LM Studio
Jan
Ollama
How to use mudler/Step-3.7-Flash-APEX-GGUF with Ollama:
```
ollama run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16
```

Unsloth Studio new

How to use mudler/Step-3.7-Flash-APEX-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting

Pi new

How to use mudler/Step-3.7-Flash-APEX-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mudler/Step-3.7-Flash-APEX-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use mudler/Step-3.7-Flash-APEX-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mudler/Step-3.7-Flash-APEX-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use mudler/Step-3.7-Flash-APEX-GGUF with Docker Model Runner:
```
docker model run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16
```

Lemonade

How to use mudler/Step-3.7-Flash-APEX-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull mudler/Step-3.7-Flash-APEX-GGUF:F16

Run and chat with the model

lemonade run user.Step-3.7-Flash-APEX-GGUF-F16

List all available models

lemonade list

Step-3.7-Flash — APEX GGUF quants

APEX (Adaptive Precision for EXpert models) quantizations of stepfun-ai/Step-3.7-Flash, a ~198B-parameter vision-language MoE with ~11B active per token (288 routed experts + 1 shared expert, top-8 routing; 3 dense layers + 42 MoE layers; hidden 4096).

What is APEX?

APEX assigns per-tensor precision along a layer-wise gradient rather than using one quant type for the whole model:

Edges high, middle low. The first and last MoE blocks are far more sensitive to quantization noise than the middle of the stack, so APEX keeps the edges at a higher precision (e.g. Q5_K / Q6_K) and progressively drops the middle toward the profile's base type.
Dense layers kept at the edge precision. Step-3.7's three leading dense FFN layers (0–2) are not gated and carry every token, so they are quantized at the same precision as the edge MoE layers rather than being lumped in with the deep experts.
Shared expert protected. The shared expert sees every token in every MoE layer, so its weights (ffn_*_shexp) are kept at a high precision (Q6_K or Q8_0 depending on profile) across the whole stack, while the 288 routed experts (ffn_*_exps) follow the layer-wise gradient.
Attention kept above experts. Attention weights (Q/K/V/O) are held above the expert precision, since they are reused at every token regardless of routing.

The I- variants additionally use an importance matrix (imatrix) computed on a calibration mix — quantization noise is preferentially placed in directions the calibration data shows are least active.

Available files

File	Base	Profile	Size	Notes
`Step-3.7-Flash-APEX-I-Quality.gguf`	Q6_K	quality + imatrix	123 GB	highest fidelity
`Step-3.7-Flash-APEX-I-Balanced.gguf`	Q5_K	balanced + imatrix	141 GB	recommended
`Step-3.7-Flash-APEX-I-Compact.gguf`	Q4_K	compact + imatrix	90 GB	best size/quality tradeoff
`Step-3.7-Flash-APEX-I-Mini.gguf`	Q3_K	mini + imatrix	73 GB	smallest with imatrix
`Step-3.7-Flash-APEX-Quality.gguf`	Q6_K	quality	123 GB	no-imatrix
`Step-3.7-Flash-APEX-Balanced.gguf`	Q5_K	balanced	141 GB	no-imatrix
`Step-3.7-Flash-APEX-Compact.gguf`	Q4_K	compact	90 GB	no-imatrix
`mmproj-step3.7-flash-f16.gguf`	F16	vision tower	4.0 GB	mirrored from StepFun, pair with any of the above for VLM use
`imatrix.dat`	—	—	466 MB	importance matrix (BF16-derived)

The included imatrix.dat is the same matrix used to produce the I- files above. It is published so you can apply it yourself to any other quantization of the same BF16 weights without re-running calibration.

The mmproj is the vision tower in F16, mirrored as-is from stepfun-ai/Step-3.7-Flash-GGUF. Pair it with any of the language-tower files above (--mmproj mmproj-step3.7-flash-f16.gguf in llama.cpp) to run image+text inference.

How these were built

Source weights: stepfun-ai/Step-3.7-Flash-GGUF (BF16) — the upstream BF16 GGUF (9 shards).
Importance matrix: computed on the BF16 GGUF via llama-imatrix on a calibration mix and published here as imatrix.dat.
Quantization: per-tensor precision targets emitted by APEX, applied via llama-quantize --tensor-type-file (with --imatrix imatrix.dat for the I- variants).

Because the imatrix is keyed by tensor name, it is portable across any GGUF that comes from the same converted BF16 weights — no GPU needed to reproduce these quants from the published BF16 GGUF.

Architecture notes

Step-3.7-Flash's HF config (Step3p7ForConditionalGeneration) is mapped to the existing STEP35 arch at GGUF level: the convert script registers the Step-3.7 architecture and tokenizer hash, while runtime/quantize uses the standard step35 compute graph.

Layers: 45 (3 dense FFN + 42 MoE)
Hidden: 4096
Experts: 288 routed (top-8) + 1 shared
Total params: ~198B
Active params per token: ~11B
Modality: vision-language (mmproj included)

License

Inherits the upstream Apache-2.0 license from stepfun-ai/Step-3.7-Flash.

Downloads last month: -

GGUF

Model size

197B params

Architecture

step35

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mudler/Step-3.7-Flash-APEX-GGUF

Base model

stepfun-ai/Step-3.7-Flash

Quantized

(8)

this model

Collection including mudler/Step-3.7-Flash-APEX-GGUF

APEX Quants (GGUF)

Collection

MoE models quantized with the APEX Quantization technique ( https://github.com/mudler/apex-quant ) • 36 items • Updated about 9 hours ago • 102