Files
skills/agent-vision-awareness/SKILL.md
T
hmo 04db423416 Initial commit: skills library
- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches
2026-04-26 19:27:40 +08:00

5.0 KiB

name, description
name description
agent-vision-awareness Enables automatic visual content detection and processing for OMO agents. Uses direct API calls to vision models instead of custom agent delegation. Triggers on image files, diagrams, charts, screenshots, or any visual media references.

Agent Vision Awareness

Overview

This skill provides automatic visual content detection for OMO agents working with text-only models.

Key Change: Instead of problematic custom agent delegation, this skill uses direct API integration with 火山方舟 (VolcEngine) vision models, which works reliably with current OMO versions.

Core Capabilities

  • Automatic detection of images, diagrams, charts, screenshots in user input
  • Direct API integration with 火山方舟 Vision API (doubao-1.5-vision-pro)
  • Context-aware analysis with appropriate modes (OCR, chart analysis, etc.)
  • Graceful degradation when vision processing fails
  • No custom agent dependency - works with standard OMO configuration

Detection Logic

Trigger Patterns

Detects visual content when user input contains:

1. Image File Extensions:

  • .png, .jpg, .jpeg, .gif, .bmp, .webp
  • Case-insensitive matching

2. Visual Content Keywords:

  • Chinese: "图片", "图像", "照片", "截图", "图表", "图示"
  • English: "diagram", "chart", "graph", "screenshot", "image", "photo"

3. File Path Patterns:

  • Absolute paths: C:/path/to/image.png
  • Relative paths: ./assets/diagram.png
  • URLs: https://example.com/image.png

Integration Workflow

Step 1: Detect Visual Content

When receiving user input, scan for visual content signals using the detection logic above.

Step 2: Direct API Processing

When visual content is detected, make direct API calls to VolcEngine:

  • Uses volcengine API Key from OpenCode config
  • Supports all common image formats
  • Handles local files and URLs

Step 3: Result Integration

  • Seamlessly integrates visual analysis results into responses
  • Maintains conversation context
  • Provides natural language descriptions

Usage Examples

Example 1: User requests image analysis

User Input: "描述 temp/稿定设计-1.png 这张图片的内容"

Agent Response: Automatically detects the PNG file, processes it via API, and returns the detailed description (as demonstrated in our testing).

Example 2: User mentions screenshot

User Input: "帮我分析这个错误截图 error.png"

Agent Response: Detects "截图" keyword + .png extension, processes the image, and provides error analysis.

Example 3: No visual content

User Input: "写一个 Python 脚本"

Agent Response: No detection triggered, processes as normal text-only request.

Configuration

Required Setup

  • Vision Model: doubao-seed-2.0-pro (火山方舟直接调用)
  • API Endpoint: https://ark.cn-beijing.volces.com/api/coding/v3
  • API Key: Uses existing volcengine API Key from OpenCode config

Known Limitations

  • 响应时间较长 (20-60秒)
  • 不够稳定,偶尔超时
  • 推荐: 压缩图片到1024px可提升响应速度

Loading the Skill

Add to OMO configuration:

skills:
  - agent-vision-awareness

Script Integration

The skill includes executable scripts in scripts/ directory:

  • vision_processor.py - Main vision processing script
  • Handles both detection and API integration
  • Can be used standalone or integrated into agent workflows

API Integration Details

The skill uses 火山方舟 (VolcEngine) Ark API for vision understanding:

from openai import OpenAI

client = OpenAI(
    base_url='https://ark.cn-beijing.volces.com/api/v3',
    api_key='YOUR_API_KEY'  # 从config.json的volcengine配置获取
)

# Vision model name
model = 'doubao-seed-2.0-pro'

# 支持的图片格式: base64, URL

Graceful Degradation

If vision processing fails:

  • Provides clear error messages
  • Suggests alternatives (describe content in text)
  • Continues with text-only processing when possible

Best Practices

Do's

  1. Trust automatic detection - the system will handle visual content seamlessly
  2. Provide clear context - mention what you want analyzed from the image
  3. Use natural language - just ask normally, no special commands needed

Don'ts

  1. Don't specify agents - no need for @multimodal-looker commands
  2. Don't worry about file paths - the system handles relative/absolute paths
  3. Don't repeat requests - automatic processing happens on first mention

Troubleshooting

Issue: Vision processing not working

  • Check: Ensure volcengine API Key is valid in OpenCode config
  • Check: Verify image file exists and is accessible
  • Fix: Test with simple image description request

Issue: Detection not triggering

  • Check: Ensure input contains detectable patterns (file extensions, keywords)
  • Fix: Use explicit file paths or visual keywords

This skill enables fully automatic visual content processing without requiring manual intervention or custom agent commands.