
Mo Xiaoguo (莫小果, MacBook M5 Pro 64G): MLX vs Ollama Comparison Test Log

Date: 2026-06-03
Machine: MacBook M5 Pro 64G / macOS 26.4 / Apple M5 Pro / arm64

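Test goal

Evaluate whether Mo Xiaoguo (the local compute hub) should switch from Ollama to the MLX stack (MLX-LM / Rapid-MLX / oMLX) to fix the complaint that "everyday operations are slow."

Test method

Every engine ran the same set of 3 prompts, covering three scenarios: everyday use (daily_short), troubleshooting (ops_reasoning), and planning (agent_style). Settings: max_tokens=256, non-streaming, temperature=0.2.

The 27B MLX model could not be fully downloaded because the HuggingFace mirror link was unstable (about 10GB of 16GB fetched), so Qwen3.5-4B-MLX-4bit was used first to verify that the MLX stack works and to measure its end-to-end speed. The measured MLX numbers therefore come from a 4B 4-bit model and are not directly comparable with the 27B Q8 Ollama numbers; the same-scale 27B comparison relies on community benchmarks.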
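The measurement loop described above can be sketched as follows. This is a minimal illustration, not the script actually used: `measure`, `PROMPT_KINDS`, and `GEN_SETTINGS` are hypothetical names, and `generate` stands in for whichever engine call returns the number of generated tokens.

```python
import time
from typing import Callable

# Scenario labels and generation settings taken from the test method above.
PROMPT_KINDS = ("daily_short", "ops_reasoning", "agent_style")
GEN_SETTINGS = {"max_tokens": 256, "temperature": 0.2, "stream": False}


def measure(generate: Callable[[str], int], prompt: str,
            clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one non-streaming generation and derive tokens per second.

    `generate` is a hypothetical callable that runs the prompt on some
    engine and returns the count of generated tokens.
    """
    start = clock()
    n_tokens = generate(prompt)
    elapsed = clock() - start
    tok_per_s = n_tokens / elapsed if elapsed > 0 else float("nan")
    return {"tokens": n_tokens, "seconds": elapsed, "tok_per_s": tok_per_s}
```

Injecting `clock` keeps the helper deterministic under test; in a real run the default `time.perf_counter` would be used.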

Measured results

Ollama (existing stack)

Model: qwen3.5:27b-q8_0 (GGUF Q8_0, 29GB, 27.8B params)
Engine: Ollama 0.24 (llama.cpp)
Service: localhost:11434

prompt	tok/s	total time	prompt_tps	memory after load (RSS)
daily_short	8.12	39.6s	82.78	54 GB
ops_reasoning	8.17	32.1s	74.04	54 GB
agent_style	8.14	32.2s	72.23	54 GB
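For reference, the tok/s and prompt_tps columns can be derived from the timing fields that Ollama returns in the final `/api/generate` response (durations are reported in nanoseconds). The sketch below assumes the numbers were collected that way, which the notes do not state; `ollama_rates` is a hypothetical helper and the sample values in the usage note are illustrative, not measured.

```python
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def ollama_rates(resp: dict) -> dict:
    """Convert Ollama timing fields into the columns used in the table."""
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / NS_PER_S)
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / NS_PER_S)
    total_s = resp["total_duration"] / NS_PER_S
    return {"tok_per_s": gen_tps, "prompt_tps": prompt_tps, "total_s": total_s}
```

For example, a response with `eval_count=256` and `eval_duration=32_000_000_000` yields 8.0 tok/s, which is the order of magnitude seen in the table.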

MLX-LM (native Apple Silicon)

Model: mlx-community/Qwen3.5-4B-MLX-4bit (MLX, 2.9GB, ~4B params)
Engine: mlx-lm 0.31.3 / mlx 0.31.2
Method: Python script using mlx_lm.generate (loaded locally, not an HTTP service)

prompt	tok/s	total time	memory after load
daily_short	92.76	2.76s	2.8 GB
ops_reasoning	94.91	2.70s	n/a
agent_style	94.42	2.71s	n/a

Model load time: 0.70s

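Rapid-MLX (OpenAI-compatible HTTP service)

Model: same as above (loaded from a local file path)
Engine: Rapid-MLX 0.6.80 (wraps mlx-lm 0.31.3)
Service: localhost:18000 → OpenAI-compatible API

prompt	tok/s	total time	memory
daily_short	82.53	3.10s	2.8 GB RSS + 7GB cache
ops_reasoning	90.24	2.84s	n/a
agent_style	90.62	2.83s	n/a

Rapid-MLX is about 6-7% slower than bare MLX-LM on average (87.8 vs 94.0 tok/s), which is the cost of the service layer.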
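The overhead figure can be checked directly from the two tables. A minimal sketch, using only the tok/s values reported above (`slowdown_pct` is a hypothetical helper name):

```python
from statistics import mean

# tok/s values copied from the MLX-LM and Rapid-MLX tables.
MLX_LM = [92.76, 94.91, 94.42]
RAPID_MLX = [82.53, 90.24, 90.62]


def slowdown_pct(baseline: list, candidate: list) -> float:
    """Percentage by which the candidate's mean tok/s trails the baseline's."""
    return (1 - mean(candidate) / mean(baseline)) * 100


overall = slowdown_pct(MLX_LM, RAPID_MLX)              # all three prompts
warm_only = slowdown_pct(MLX_LM[1:], RAPID_MLX[1:])    # skip the first request
```

Over all three prompts the gap is about 6.6%; dropping the first request on each side narrows it to about 4.5%, which suggests part of the gap comes from the first call rather than steady-state serving.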

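oMLX (macOS-native inference service)

Model: same as above (filesystem path ~/llm-bench/models/)
Engine: oMLX 0.4.2rc1 (wraps mlx-lm; EnginePool multi-model management)
Service: localhost:18001 → OpenAI + Anthropic API
Install method: source (git clone + pip install -e .); about 3 minutes when the network is healthy

prompt	tok/s	total time	notes
daily_short	56.04	4.57s	cold start (first model load)
ops_reasoning	90.34	2.83s	warm, full MLX speed
agent_style	90.26	2.84s	warm

oMLX manages multiple models through an EnginePool, so the first request has to load the model into memory and a cold start is 30-50% slower (here 56.04 vs ~90.3 tok/s, about 38%). Once warm it runs at about 90 tok/s, matching Rapid-MLX and within about 5% of bare MLX-LM. It ships with 5 preset model engines (LLM/VLM/Embedding/Reranker/MarkItDown).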
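To make the cold-versus-warm gap concrete, the sketch below treats the first oMLX request as the cold sample and the remaining ones as warm. The helper name is hypothetical and the inputs are the table values above.

```python
from statistics import mean

# (tok/s, total seconds) per request, in the order they were issued.
OMLX_RUNS = [(56.04, 4.57), (90.34, 2.83), (90.26, 2.84)]


def cold_start_penalty(runs: list) -> dict:
    """Compare the first (cold) request against the mean of the warm ones."""
    cold_tps, cold_s = runs[0]
    warm_tps = mean(tps for tps, _ in runs[1:])
    warm_s = mean(sec for _, sec in runs[1:])
    return {
        "throughput_drop_pct": (1 - cold_tps / warm_tps) * 100,
        "latency_ratio": cold_s / warm_s,
    }


penalty = cold_start_penalty(OMLX_RUNS)
```

With these inputs the cold request has roughly 38% lower throughput and takes about 1.6x as long as a warm one, so a warm-up request after service start would hide most of the penalty from interactive use.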

27B same-scale cross-check (community benchmarks)

engine	model	tok/s	source
Ollama	qwen3:32b (Q4)	~27	Rapid-MLX README
Ollama	your qwen3.5:27b Q8	~8	measured here
MLX-LM	Qwen3.5-27B-8bit	~55	mlx-lm benchmark
Rapid-MLX	Qwen3.5-27B-8bit	~66	Rapid-MLX README
MLX-LM	Qwen3.5-27B-4bit	~80	Rapid-MLX README

For your 27B Q8 (8 tok/s), MLX at the same precision (8bit) is expected to reach 30-55 tok/s, a speedup of roughly 3.75-6.9x.
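The projection is simple arithmetic, sketched below with hypothetical helper names. The 30 and 55 tok/s bounds are the expected range quoted above, not measurements on this machine.

```python
def speedup(baseline_tps: float, target_tps: float) -> float:
    """How many times faster the target throughput is than the baseline."""
    return target_tps / baseline_tps


def generation_seconds(tokens: int, tps: float) -> float:
    """Time to generate `tokens` at a steady rate, ignoring prompt processing."""
    return tokens / tps


low, high = speedup(8, 30), speedup(8, 55)   # 3.75x and 6.875x
now = generation_seconds(256, 8)             # 32.0 s at the current rate
best = generation_seconds(256, 55)           # under 5 s at the upper bound
```

A 256-token reply at 8 tok/s takes about 32 seconds of generation time, which matches the 32-40s totals measured on Ollama and explains why everyday operations feel slow.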

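Conclusion: run both stacks side by side

Keep Ollama

Broadest model library; easiest to download and install
Existing Hermes CLI integration (http://localhost:11434/v1)
GGUF ecosystem compatibility
A stable fallback when something goes wrong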

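Add Rapid-MLX as the primary MLX service

Expected speedup of roughly 4-8x (8 tok/s → 30-66 tok/s)
OpenAI-compatible API at /v1/chat/completions; Hermes/OpenCode can connect directly
Continuous batching for multiple concurrent requests
Automatic tool_choice / reasoning parser (hermes / qwen3)
Mo Xiaoguo has 64G; running 27B-4bit takes roughly 14-18GB, leaving ample headroom
Easy to manage: rapid-mlx serve <alias> starts it with one command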
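The 14-18GB figure is consistent with a back-of-envelope weight estimate. The sketch below assumes group-wise quantization with a 16-bit scale and a 16-bit bias per group of 64 weights (which I believe is MLX's default layout, but treat it as an assumption) and a round 27e9 parameter count; KV cache and runtime buffers come on top.

```python
def effective_bits(bits: int, group_size: int = 64, meta_bits: int = 32) -> float:
    """Bits per weight including per-group scale and bias (assumed 16 bits each)."""
    return bits + meta_bits / group_size


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_27B = 27e9  # assumed round figure for a 27B model

raw_4bit = weight_gb(PARAMS_27B, 4)                      # 13.5 GB
with_meta_4bit = weight_gb(PARAMS_27B, effective_bits(4))  # about 15.2 GB
with_meta_8bit = weight_gb(PARAMS_27B, effective_bits(8))  # about 28.7 GB
```

The 4-bit estimate of about 15GB sits inside the 14-18GB range quoted above, and even the 8-bit variant would fit in 64G with room for cache.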

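oMLX installed (usable; full evaluation deferred until the network recovers)

Installed via git clone + pip install -e .; the ~/.venv/bin/omlx CLI is ready
Performance on par with MLX-LM/Rapid-MLX (~90 tok/s warm); EnginePool multi-model management adds about 5.6% memory
Cold start is slower than Rapid-MLX (the first request has to load the model: about 4.5s vs 3.1s)
Pros: native tiered KV cache + macOS menubar app (available when installed via homebrew)
Cons: heavier install flow than Rapid-MLX (requires git clone + pyproject build)
Recommended usage order: everyday inference → Rapid-MLX (lightest); long-context/agent workflows → oMLX (better tiered cache); backup/testing → MLX-LM (closest to the underlying library, no service overhead)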
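One way to express the recommended order with Ollama as the fallback is a small routing table. This is a sketch only: the task labels, `PREFERENCE`, and `pick_backend` are hypothetical, the Rapid-MLX port follows the startup guide (8001), and the `/v1` suffix on the oMLX URL is an assumption.

```python
BACKENDS = {
    "rapid-mlx": "http://localhost:8001/v1",
    "omlx": "http://localhost:18001/v1",   # "/v1" path is assumed
    "ollama": "http://localhost:11434/v1",
}

# Ordered preferences per workload; Ollama is always the last resort.
PREFERENCE = {
    "daily": ["rapid-mlx", "ollama"],
    "long_context": ["omlx", "rapid-mlx", "ollama"],
}


def pick_backend(task: str, healthy: set) -> str:
    """Return the base URL of the first preferred backend that is healthy."""
    for name in PREFERENCE.get(task, PREFERENCE["daily"]):
        if name in healthy:
            return BACKENDS[name]
    raise RuntimeError(f"no healthy backend for task {task!r}")
```

The `healthy` set would come from whatever health probe is already in place; keeping the decision in a pure function makes the fallback order easy to test.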

Startup guide

# Activate the environment
source ~/llm-bench/.venv/bin/activate

# Start Ollama (already in place): localhost:11434
ollama serve

# Start Rapid-MLX (new): localhost:8001
rapid-mlx serve qwen3.5-27b \
  --port 8001 \
  --gpu-memory-utilization 0.50 \
  --no-mllm \
  --served-model-name qwen35-27b-4bit

# Test the API
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen35-27b-4bit","messages":[{"role":"user","content":"你好"}],"max_tokens":256}'
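The same smoke test can be run from Python with only the standard library. This sketch assumes the service is reachable at the port and model name used in the startup guide and that it returns the usual OpenAI-style `choices[0].message.content` shape; `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "qwen35-27b-4bit",
                       base_url: str = "http://localhost:8001/v1") -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request("你好"))` mirrors the curl command; pointing `base_url` at `http://localhost:11434/v1` with an Ollama model name should exercise the fallback stack the same way.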

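Open items

⏳ The MLX 27B model download must be resumed (HF mirror disconnected; about 6GB left, roughly 20-30 minutes)
⏳ oMLX to be evaluated once the network recovers (brew install / source build)
⏳ Rapid-MLX's built-in rapid-mlx doctor integrated evaluation still has to be run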

Environment

Isolated path: ~/llm-bench/ (a complete MLX evaluation environment; does not affect the system)
Ollama: untouched; App 0.24 still on :11434
MLX stack: mlx==0.31.2 / mlx-lm==0.31.3 / rapid-mlx==0.6.80
Python: 3.12.13 (Homebrew)
Disk: 1.6TiB free (sufficient)
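To confirm that an environment matches the versions listed above before re-running the benchmark, a check along these lines may help. It is a sketch: the distribution names are taken from the pins above and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Pins copied from the environment list above.
EXPECTED = {"mlx": "0.31.2", "mlx-lm": "0.31.3", "rapid-mlx": "0.6.80"}


def check_versions(expected: dict = EXPECTED, lookup=metadata.version) -> dict:
    """Map each package to (wanted, installed or None, matches)."""
    report = {}
    for name, want in expected.items():
        try:
            got = lookup(name)
        except metadata.PackageNotFoundError:
            got = None  # package is not installed in this environment
        report[name] = (want, got, got == want)
    return report
```

Run inside the activated `~/llm-bench/.venv`, `check_versions()` reports any drift from the benchmarked stack; `lookup` is injectable so the function can be tested without the packages installed.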
