Instructions to use mudler/Step-3.7-Flash-APEX-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use mudler/Step-3.7-Flash-APEX-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="mudler/Step-3.7-Flash-APEX-GGUF", filename="Step-3.7-Flash-APEX-Balanced.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use mudler/Step-3.7-Flash-APEX-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16 # Run inference directly in the terminal: llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16 # Run inference directly in the terminal: llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16 # Run inference directly in the terminal: ./llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Use Docker
docker model run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16
- LM Studio
- Jan
- Ollama
How to use mudler/Step-3.7-Flash-APEX-GGUF with Ollama:
ollama run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16
- Unsloth Studio new
How to use mudler/Step-3.7-Flash-APEX-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for mudler/Step-3.7-Flash-APEX-GGUF to start chatting
- Pi new
How to use mudler/Step-3.7-Flash-APEX-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "mudler/Step-3.7-Flash-APEX-GGUF:F16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use mudler/Step-3.7-Flash-APEX-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf mudler/Step-3.7-Flash-APEX-GGUF:F16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default mudler/Step-3.7-Flash-APEX-GGUF:F16
Run Hermes
hermes
- Docker Model Runner
How to use mudler/Step-3.7-Flash-APEX-GGUF with Docker Model Runner:
docker model run hf.co/mudler/Step-3.7-Flash-APEX-GGUF:F16
- Lemonade
How to use mudler/Step-3.7-Flash-APEX-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull mudler/Step-3.7-Flash-APEX-GGUF:F16
Run and chat with the model
lemonade run user.Step-3.7-Flash-APEX-GGUF-F16
List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
)Step-3.7-Flash — APEX GGUF quants
APEX (Adaptive Precision for EXpert models) quantizations of stepfun-ai/Step-3.7-Flash, a ~198B-parameter vision-language MoE with ~11B active per token (288 routed experts + 1 shared expert, top-8 routing; 3 dense layers + 42 MoE layers; hidden 4096).
What is APEX?
APEX assigns per-tensor precision along a layer-wise gradient rather than using one quant type for the whole model:
- Edges high, middle low. The first and last MoE blocks are far more sensitive to quantization noise than the middle of the stack, so APEX keeps the edges at a higher precision (e.g. Q5_K / Q6_K) and progressively drops the middle toward the profile's base type.
- Dense layers kept at the edge precision. Step-3.7's three leading dense FFN layers (0–2) are not gated and carry every token, so they are quantized at the same precision as the edge MoE layers rather than being lumped in with the deep experts.
- Shared expert protected. The shared expert sees every token in every MoE layer, so its weights (
ffn_*_shexp) are kept at a high precision (Q6_K or Q8_0 depending on profile) across the whole stack, while the 288 routed experts (ffn_*_exps) follow the layer-wise gradient. - Attention kept above experts. Attention weights (Q/K/V/O) are held above the expert precision, since they are reused at every token regardless of routing.
The I- variants additionally use an importance matrix (imatrix) computed on a calibration mix — quantization noise is preferentially placed in directions the calibration data shows are least active.
Available files
| File | Base | Profile | Size | Notes |
|---|---|---|---|---|
Step-3.7-Flash-APEX-I-Quality.gguf |
Q6_K | quality + imatrix | 123 GB | highest fidelity |
Step-3.7-Flash-APEX-I-Balanced.gguf |
Q5_K | balanced + imatrix | 141 GB | recommended |
Step-3.7-Flash-APEX-I-Compact.gguf |
Q4_K | compact + imatrix | 90 GB | best size/quality tradeoff |
Step-3.7-Flash-APEX-I-Mini.gguf |
Q3_K | mini + imatrix | 73 GB | smallest with imatrix |
Step-3.7-Flash-APEX-Quality.gguf |
Q6_K | quality | 123 GB | no-imatrix |
Step-3.7-Flash-APEX-Balanced.gguf |
Q5_K | balanced | 141 GB | no-imatrix |
Step-3.7-Flash-APEX-Compact.gguf |
Q4_K | compact | 90 GB | no-imatrix |
mmproj-step3.7-flash-f16.gguf |
F16 | vision tower | 4.0 GB | mirrored from StepFun, pair with any of the above for VLM use |
imatrix.dat |
— | — | 466 MB | importance matrix (BF16-derived) |
The included imatrix.dat is the same matrix used to produce the I- files above. It is published so you can apply it yourself to any other quantization of the same BF16 weights without re-running calibration.
The mmproj is the vision tower in F16, mirrored as-is from stepfun-ai/Step-3.7-Flash-GGUF. Pair it with any of the language-tower files above (--mmproj mmproj-step3.7-flash-f16.gguf in llama.cpp) to run image+text inference.
How these were built
- Source weights: stepfun-ai/Step-3.7-Flash-GGUF (BF16) — the upstream BF16 GGUF (9 shards).
- Importance matrix: computed on the BF16 GGUF via
llama-imatrixon a calibration mix and published here asimatrix.dat. - Quantization: per-tensor precision targets emitted by APEX, applied via
llama-quantize --tensor-type-file(with--imatrix imatrix.datfor theI-variants).
Because the imatrix is keyed by tensor name, it is portable across any GGUF that comes from the same converted BF16 weights — no GPU needed to reproduce these quants from the published BF16 GGUF.
Architecture notes
Step-3.7-Flash's HF config (Step3p7ForConditionalGeneration) is mapped to the existing STEP35 arch at GGUF level: the convert script registers the Step-3.7 architecture and tokenizer hash, while runtime/quantize uses the standard step35 compute graph.
- Layers: 45 (3 dense FFN + 42 MoE)
- Hidden: 4096
- Experts: 288 routed (top-8) + 1 shared
- Total params: ~198B
- Active params per token: ~11B
- Modality: vision-language (mmproj included)
License
Inherits the upstream Apache-2.0 license from stepfun-ai/Step-3.7-Flash.
- Downloads last month
- -
We're not able to determine the quantization variants.
Model tree for mudler/Step-3.7-Flash-APEX-GGUF
Base model
stepfun-ai/Step-3.7-Flash
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="mudler/Step-3.7-Flash-APEX-GGUF", filename="", )