hmo/skills

Files

T

hmo 04db423416 Initial commit: skills library

- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches

2026-04-26 19:27:40 +08:00

5.0 KiB

Raw Blame History

name, description

name	description
agent-vision-awareness	Enables automatic visual content detection and processing for OMO agents. Uses direct API calls to vision models instead of custom agent delegation. Triggers on image files, diagrams, charts, screenshots, or any visual media references.

Agent Vision Awareness

Overview

This skill provides automatic visual content detection for OMO agents working with text-only models.

Key Change: Instead of problematic custom agent delegation, this skill uses direct API integration with 火山方舟 (VolcEngine) vision models, which works reliably with current OMO versions.

Core Capabilities

✅ Automatic detection of images, diagrams, charts, screenshots in user input
✅ Direct API integration with 火山方舟 Vision API (doubao-1.5-vision-pro)
✅ Context-aware analysis with appropriate modes (OCR, chart analysis, etc.)
✅ Graceful degradation when vision processing fails
✅ No custom agent dependency - works with standard OMO configuration

Detection Logic

Trigger Patterns

Detects visual content when user input contains:

1. Image File Extensions:

.png, .jpg, .jpeg, .gif, .bmp, .webp
Case-insensitive matching

2. Visual Content Keywords:

Chinese: "图片", "图像", "照片", "截图", "图表", "图示"
English: "diagram", "chart", "graph", "screenshot", "image", "photo"

3. File Path Patterns:

Absolute paths: C:/path/to/image.png
Relative paths: ./assets/diagram.png
URLs: https://example.com/image.png

Integration Workflow

Step 1: Detect Visual Content

When receiving user input, scan for visual content signals using the detection logic above.

Step 2: Direct API Processing

When visual content is detected, make direct API calls to VolcEngine:

Uses volcengine API Key from OpenCode config
Supports all common image formats
Handles local files and URLs

Step 3: Result Integration

Seamlessly integrates visual analysis results into responses
Maintains conversation context
Provides natural language descriptions

Usage Examples

Example 1: User requests image analysis

User Input: "描述 temp/稿定设计-1.png 这张图片的内容"

Agent Response: Automatically detects the PNG file, processes it via API, and returns the detailed description (as demonstrated in our testing).

Example 2: User mentions screenshot

User Input: "帮我分析这个错误截图 error.png"

Agent Response: Detects "截图" keyword + .png extension, processes the image, and provides error analysis.

Example 3: No visual content

User Input: "写一个 Python 脚本"

Agent Response: No detection triggered, processes as normal text-only request.

Configuration

Required Setup

Vision Model: doubao-seed-2.0-pro (火山方舟直接调用)
API Endpoint: https://ark.cn-beijing.volces.com/api/coding/v3
API Key: Uses existing volcengine API Key from OpenCode config

Known Limitations

响应时间较长 (20-60秒)
不够稳定，偶尔超时
推荐: 压缩图片到1024px可提升响应速度

Loading the Skill

Add to OMO configuration:

skills:
  - agent-vision-awareness

Script Integration

The skill includes executable scripts in scripts/ directory:

vision_processor.py - Main vision processing script
Handles both detection and API integration
Can be used standalone or integrated into agent workflows

API Integration Details

The skill uses 火山方舟 (VolcEngine) Ark API for vision understanding:

from openai import OpenAI

client = OpenAI(
    base_url='https://ark.cn-beijing.volces.com/api/v3',
    api_key='YOUR_API_KEY'  # 从config.json的volcengine配置获取
)

# Vision model name
model = 'doubao-seed-2.0-pro'

# 支持的图片格式: base64, URL

Graceful Degradation

If vision processing fails:

Provides clear error messages
Suggests alternatives (describe content in text)
Continues with text-only processing when possible

Best Practices

Do's

Trust automatic detection - the system will handle visual content seamlessly
Provide clear context - mention what you want analyzed from the image
Use natural language - just ask normally, no special commands needed

Don'ts

Don't specify agents - no need for @multimodal-looker commands
Don't worry about file paths - the system handles relative/absolute paths
Don't repeat requests - automatic processing happens on first mention

Troubleshooting

Issue: Vision processing not working

Check: Ensure volcengine API Key is valid in OpenCode config
Check: Verify image file exists and is accessible
Fix: Test with simple image description request

Issue: Detection not triggering

Check: Ensure input contains detectable patterns (file extensions, keywords)
Fix: Use explicit file paths or visual keywords

This skill enables fully automatic visual content processing without requiring manual intervention or custom agent commands.

5.0 KiB Raw Blame History