Initial commit: skills library
- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches
@@ -0,0 +1,55 @@
# Agent Vision Awareness - Configuration

## Current OMO Compatibility

This skill is designed to work with the **火山方舟 (VolcEngine) API**:

- ✅ **No custom agent delegation required** - uses direct API calls
- ✅ **Compatible with standard OMO configuration**
- ✅ **Works with existing text-only models**
- ✅ **Uses the 火山方舟 API key** from the OpenCode config

## Required Configuration

### 1. API Key Setup

The 火山方舟 API key is already configured in `~/.config/opencode/config.json`:

- **API Key**: read from the `volcengine` entry in that config file (keep the key itself out of the repository)
- **Base URL**: `https://ark.cn-beijing.volces.com/api/coding/v3`

### 2. Supported Vision Model

**The only supported vision model**: `doubao-seed-code`

**Note**: The Coding Plan does not support dedicated vision models (such as doubao-vision-pro-32k).

### 3. Network Access

Ensure network connectivity to:

- `https://ark.cn-beijing.volces.com/api/coding/v3` (火山方舟 API)

## Removed Problematic Configurations

❌ **Custom Agent Delegation**: The `@multimodal-looker` approach has been **removed**

❌ **阿里云百炼 (Alibaba Cloud Bailian)**: no longer used (API unavailable)

## Working Implementation

✅ **Direct API Integration**: Uses 火山方舟 `doubao-seed-code`
✅ **Automatic Detection**: Built-in pattern matching for visual content
✅ **Graceful Degradation**: Clear error messages and fallback options
✅ **Simple Integration**: No special commands needed - just mention images naturally

## Verification

To verify that the configuration works:

1. Load the `agent-vision-awareness` skill
2. Test with: "分析这个截图 test.png" ("Analyze this screenshot test.png"; replace with an actual image path)
3. The skill should automatically detect and process the image

## Known Limitations

- Responses are slow (20-60 seconds)
- The service is not fully stable and occasionally times out
- Compressing images to 1024px is recommended to improve speed
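
The 1024px recommendation can be applied before an image is uploaded. The sketch below is a minimal illustration, assuming "1024px" refers to the longest side of the image; `fit_within` is a hypothetical helper and not part of the skill's scripts. It only computes the target size, which could then be passed to an image library's resize call.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Guard against a side collapsing to 0 for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4096x2048 screenshot would be reduced to 1024x512, while an 800x600 image is left unchanged.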
@@ -0,0 +1,147 @@
---
name: agent-vision-awareness
description: Enables automatic visual content detection and processing for OMO agents. Uses direct API calls to vision models instead of custom agent delegation. Triggers on image files, diagrams, charts, screenshots, or any visual media references.
---

# Agent Vision Awareness

## Overview

This skill provides **automatic visual content detection** for OMO agents working with text-only models.

**Key Change**: Instead of problematic custom agent delegation, this skill uses **direct API integration** with 火山方舟 (VolcEngine) vision models, which works reliably with current OMO versions.

## Core Capabilities

- ✅ **Automatic detection** of images, diagrams, charts, screenshots in user input
- ✅ **Direct API integration** with 火山方舟 Vision API (doubao-1.5-vision-pro)
- ✅ **Context-aware analysis** with appropriate modes (OCR, chart analysis, etc.)
- ✅ **Graceful degradation** when vision processing fails
- ✅ **No custom agent dependency** - works with standard OMO configuration

## Detection Logic

### Trigger Patterns
Detects visual content when user input contains:

**1. Image File Extensions**:
- `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp`
- Case-insensitive matching

**2. Visual Content Keywords**:
- Chinese: "图片", "图像", "照片", "截图", "图表", "图示"
- English: "diagram", "chart", "graph", "screenshot", "image", "photo"

**3. File Path Patterns**:
- Absolute paths: `C:/path/to/image.png`
- Relative paths: `./assets/diagram.png`
- URLs: `https://example.com/image.png`
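
The three trigger patterns can be sketched as a single detection function. This is an illustrative sketch only: `detect_visual_content` and the regular expression are assumptions, not the code in `scripts/`, and plain substring matching for keywords can over-trigger (for example "graph" inside "paragraph").

```python
import re

# Extensions and keywords copied from the lists above.
IMAGE_EXTENSIONS = ("png", "jpg", "jpeg", "gif", "bmp", "webp")
VISUAL_KEYWORDS = (
    "图片", "图像", "照片", "截图", "图表", "图示",
    "diagram", "chart", "graph", "screenshot", "image", "photo",
)

# A path or URL token ending in a known image extension (case-insensitive).
IMAGE_REF = re.compile(
    r"(?:https?://|[A-Za-z]:[\\/]|\.{0,2}/)?[\w\-./\\%]+\.(?:"
    + "|".join(IMAGE_EXTENSIONS)
    + r")\b",
    re.IGNORECASE,
)


def detect_visual_content(text: str) -> dict:
    """Return the image references and visual keywords found in text."""
    refs = [m.group(0) for m in IMAGE_REF.finditer(text)]
    lowered = text.lower()
    keywords = [kw for kw in VISUAL_KEYWORDS if kw in lowered]
    return {"image_refs": refs, "keywords": keywords, "detected": bool(refs or keywords)}
```

Because `\w` also matches Chinese characters, a path is extracted cleanly only when it is separated from the surrounding text by whitespace or punctuation, as in the usage examples in this document.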

## Integration Workflow

### Step 1: Detect Visual Content
When receiving user input, scan for visual content signals using the detection logic above.

### Step 2: Direct API Processing
When visual content is detected, make direct API calls to VolcEngine:
- Uses `volcengine` API Key from OpenCode config
- Supports all common image formats
- Handles local files and URLs
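
One way to handle both local files and URLs is to normalise every reference into a URL string before building the request. The sketch below assumes the endpoint accepts the OpenAI-compatible `image_url` message format with either an http(s) URL or a base64 data URL; `to_image_url` and `build_vision_messages` are hypothetical names, not functions from `vision_processor.py`.

```python
import base64
import mimetypes
from pathlib import Path


def to_image_url(ref: str) -> str:
    """Return ref unchanged if it is an http(s) URL, else a base64 data URL for the local file."""
    if ref.lower().startswith(("http://", "https://")):
        return ref
    path = Path(ref)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {ref}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_messages(ref: str, prompt: str) -> list:
    """Build a chat `messages` list with one text part and one image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_image_url(ref)}},
        ],
    }]
```

The returned list could be passed as the `messages` argument of an OpenAI-compatible chat completion call. Which MIME types `mimetypes` recognises depends on the Python version and platform, so less common extensions may need an explicit mapping.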

### Step 3: Result Integration
- Seamlessly integrates visual analysis results into responses
- Maintains conversation context
- Provides natural language descriptions

## Usage Examples

### Example 1: User requests image analysis
**User Input**: "描述 temp/稿定设计-1.png 这张图片的内容" ("Describe the content of the image temp/稿定设计-1.png")

**Agent Response**: Automatically detects the PNG file, processes it via the API, and returns a detailed description (as demonstrated in our testing).

### Example 2: User mentions screenshot
**User Input**: "帮我分析这个错误截图 error.png" ("Help me analyze this error screenshot error.png")

**Agent Response**: Detects the "截图" (screenshot) keyword and the `.png` extension, processes the image, and provides an error analysis.

### Example 3: No visual content
**User Input**: "写一个 Python 脚本" ("Write a Python script")

**Agent Response**: No detection is triggered; the input is processed as a normal text-only request.

## Configuration

### Required Setup
- **Vision Model**: `doubao-seed-2.0-pro` (called directly through 火山方舟)
- **API Endpoint**: `https://ark.cn-beijing.volces.com/api/coding/v3`
- **API Key**: Uses the existing volcengine API Key from the OpenCode config

### Known Limitations
- Responses are slow (20-60 seconds)
- The service is not fully stable and occasionally times out
- **Recommended**: compress images to 1024px to improve response time

### Loading the Skill
Add to OMO configuration:
```yaml
skills:
  - agent-vision-awareness
```

## Script Integration

The skill includes executable scripts in the `scripts/` directory:
- `vision_processor.py` - Main vision processing script
- Handles both detection and API integration
- Can be used standalone or integrated into agent workflows

## API Integration Details

The skill uses the 火山方舟 (VolcEngine) Ark API for vision understanding:

```python
from openai import OpenAI

client = OpenAI(
    base_url='https://ark.cn-beijing.volces.com/api/v3',
    api_key='YOUR_API_KEY'  # taken from the volcengine entry in config.json
)

# Vision model name
model = 'doubao-seed-2.0-pro'

# Supported image inputs: base64, URL
```

## Graceful Degradation

If vision processing fails:
- Provides clear error messages
- Suggests alternatives (describe content in text)
- Continues with text-only processing when possible
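
The degradation behaviour can be sketched as a thin wrapper around whatever function performs the vision call. This is a minimal sketch under assumptions: `analyze_with_fallback` and the shape of the returned dictionary are hypothetical, and the `analyze` callable stands in for the real API call.

```python
from typing import Callable


def analyze_with_fallback(image_ref: str, analyze: Callable[[str], str]) -> dict:
    """Run analyze(image_ref); on failure return a text-only fallback instead of raising."""
    try:
        return {"ok": True, "analysis": analyze(image_ref), "message": None}
    except Exception as exc:  # broad on purpose: any failure should degrade, not crash
        return {
            "ok": False,
            "analysis": None,
            "message": (
                f"Vision processing failed for {image_ref}: {exc}. "
                "Please describe the image content in text so the request can continue."
            ),
        }
```

Returning a result object rather than raising lets the agent continue with text-only processing and show the message to the user.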

## Best Practices

### Do's
1. **Trust automatic detection** - the system will handle visual content seamlessly
2. **Provide clear context** - mention what you want analyzed from the image
3. **Use natural language** - just ask normally, no special commands needed

### Don'ts
1. **Don't specify agents** - no need for `@multimodal-looker` commands
2. **Don't worry about file paths** - the system handles relative/absolute paths
3. **Don't repeat requests** - automatic processing happens on first mention

## Troubleshooting

**Issue**: Vision processing not working
- **Check**: Ensure volcengine API Key is valid in OpenCode config
- **Check**: Verify image file exists and is accessible
- **Fix**: Test with simple image description request

**Issue**: Detection not triggering
- **Check**: Ensure input contains detectable patterns (file extensions, keywords)
- **Fix**: Use explicit file paths or visual keywords

This skill enables **fully automatic visual content processing** without requiring manual intervention or custom agent commands.
@@ -0,0 +1,24 @@
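# Vision Recognition Skill Update Rules (2026-04-20)
## Problem Summary
1. OpenCode's `@ai-sdk/openai-compatible` compatibility layer does not support passing images to custom 火山方舟 models. No configuration field makes the system pass the image directly; the request is intercepted with the error `does not support image input`.
2. However, 火山方舟 doubao-seed-2.0-pro/code natively supports multimodal input, and calling the API directly recognizes images correctly.

## Solution
1. Add a `vision_direct.py` script that calls the 火山方舟 API directly for image recognition, with no dependency on OpenCode's native multimodal support.
2. Use a single temporary directory: all temporary output goes to `D:\F\NewI\opencode\daily-workspace\temp`; do not write temporary files anywhere else.
3. Automatic trigger rule: image recognition is invoked automatically when user input contains either of the following:
   - An image file extension: `.jpg`/`.jpeg`/`.png`/`.gif`/`.webp`/`.bmp`
   - A visual keyword: "图片" (image), "截图" (screenshot), "照片" (photo), "图" (picture), "识别" (recognize), "分析这张" (analyze this)

## Usage
### Automatic Trigger
When the user sends an image path or mentions an image, recognition is invoked automatically; no extra action is required.

### Manual Invocation
```bash
python scripts/vision_direct.py <image path/URL> [prompt]
```

## Known Limitations
1. Recognition takes 20-60 seconds; large images should be compressed to within 1024px.
2. All common image formats are supported, up to 20MB.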
@@ -0,0 +1,103 @@
# Agent Vision Awareness - Usage Guide

## Quick Setup

1. **Configure the API key**:
   - The 火山方舟 API key is already set in the OpenCode configuration
   - Or set the `VOLCENGINE_API_KEY` environment variable

2. **Add skill to OMO configuration**:
   ```yaml
   skills:
     - agent-vision-awareness
   ```

3. **Use naturally** - just mention images in your requests:
   - "分析这个截图 error.png" ("Analyze this screenshot error.png")
   - "描述 temp/image.jpg 的内容" ("Describe the content of temp/image.jpg")
   - "根据架构图 design/architecture.png 生成部署方案" ("Generate a deployment plan from the architecture diagram design/architecture.png")

## Integration Examples

### Basic Usage (Automatic)
```python
# No special code needed - automatic detection and processing
user_input = "帮我分析这个错误日志截图:./logs/error.png"
# The skill will automatically detect and process the image
```

### Manual Integration (When Needed)
```python
import os

# Relative import: this only works when run from inside the skill package
from .scripts.integrate_vision import process_user_input

# Process user input with visual content
result = process_user_input(
    user_input="分析图表 sales_chart.png",
    user_request="提取销售数据趋势",
    config={
        "api_key": os.environ.get("VOLCENGINE_API_KEY"),
        "base_url": "https://ark.cn-beijing.volces.com/api/coding/v3",
        "model": "doubao-seed-code"
    }
)

if result["confidence"] != "none":
    analysis = result["analysis_results"][0]["result"]
    # Use analysis in your response
    response = f"根据图片分析:{analysis}"
else:
    # Handle as normal text-only request
    response = "处理文本请求..."
```

## Configuration

Copy `config/settings.json.example` to `config/settings.json` and update with your API key:

```json
{
  "vision_api": {
    "key": "your_volcengine_api_key_here",
    "base_url": "https://ark.cn-beijing.volces.com/api/coding/v3",
    "model": "doubao-seed-code"
  }
}
```

## Supported Features

✅ **Automatic Detection**: File extensions, keywords, URLs, markdown syntax
✅ **Multiple Analysis Modes**: OCR, chart analysis, product analysis, scene description
✅ **Error Handling**: Graceful degradation with clear error messages
✅ **File Support**: Local files, relative paths, absolute paths, URLs
✅ **Format Support**: PNG, JPG, JPEG, WebP, GIF, BMP

## Limitations

⚠️ **No Custom Agent Delegation**: The `@multimodal-looker` approach doesn't work with current OMO
⚠️ **API Key Required**: Must have valid 火山方舟 API key
⚠️ **File Size**: Images should be < 4MB for optimal performance
⚠️ **Network**: Requires internet access to `https://ark.cn-beijing.volces.com`

## Troubleshooting

**Vision processing not working?**
- Check `VOLCENGINE_API_KEY` configuration
- Verify image file exists and is accessible
- Test with simple request: "描述 image.png"

**Detection not triggering?**
- Ensure input contains detectable patterns (file extensions like `.png`, keywords like "图片")
- Use explicit file paths instead of vague references

**API errors?**
- Check network connectivity to 火山方舟 API
- Verify API key is valid
- Check rate limits on your account

## Related Skills

- `image-service`: For image generation and editing
- `file-reader`: For reading document contents (complementary)

This skill provides **fully automatic visual content processing** without requiring manual intervention or custom agent commands.
@@ -0,0 +1,22 @@
{
  "vision_api": {
    "key": "your_volcengine_api_key_here",
    "base_url": "https://ark.cn-beijing.volces.com/api/coding/v3",
    "model": "doubao-seed-code"
  },
  "defaults": {
    "analysis_mode": "describe",
    "max_tokens": 2000,
    "timeout_seconds": 30
  },
  "limits": {
    "max_file_size_mb": 4,
    "supported_formats": ["png", "jpg", "jpeg", "webp", "gif", "bmp"],
    "max_concurrent_requests": 1
  },
  "retry": {
    "max_attempts": 3,
    "backoff_multiplier": 2,
    "initial_delay_seconds": 1
  }
}
@@ -0,0 +1,366 @@
|
||||
# Delegation Templates for multimodal-looker
|
||||
|
||||
Ready-to-use prompts for delegating visual analysis tasks.
|
||||
|
||||
## Template Structure
|
||||
|
||||
```markdown
|
||||
@multimodal-looker [分析类型] 这张 [图片类型]
|
||||
|
||||
**图片**: [路径/URL/描述]
|
||||
|
||||
**用户需求**: [原始需求]
|
||||
|
||||
**分析重点**:
|
||||
1. [重点 1]
|
||||
2. [重点 2]
|
||||
3. [重点 3]
|
||||
|
||||
**输出格式**: [期望格式]
|
||||
```
|
||||
|
||||
## Scenario Templates
|
||||
|
||||
### 1. Error Screenshot Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个错误日志截图
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: 诊断错误原因并提供解决方案
|
||||
|
||||
**分析重点**:
|
||||
1. 错误类型和错误代码
|
||||
2. 堆栈跟踪的关键信息
|
||||
3. 出错的文件名和行号
|
||||
4. 任何相关的上下文信息
|
||||
|
||||
**输出格式**:
|
||||
- 错误类型:[类型]
|
||||
- 错误位置:[文件:行号]
|
||||
- 可能原因:[分析]
|
||||
- 建议解决方案:[步骤]
|
||||
```
|
||||
|
||||
### 2. Architecture Diagram Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个架构图
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [理解系统架构/生成部署方案/识别组件]
|
||||
|
||||
**分析重点**:
|
||||
1. 所有组件/模块的名称和功能
|
||||
2. 组件之间的连接关系和数据流向
|
||||
3. 使用的技术栈标识(如果有)
|
||||
4. 架构模式(微服务、单体、分层等)
|
||||
|
||||
**输出格式**:
|
||||
- 组件列表:[表格形式]
|
||||
- 连接关系:[描述]
|
||||
- 架构模式:[类型]
|
||||
- 技术栈:[列表]
|
||||
```
|
||||
|
||||
### 3. Data Chart Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个数据图表
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [提取数据趋势/比较数据/理解指标]
|
||||
|
||||
**分析重点**:
|
||||
1. 图表类型(柱状图、折线图、饼图等)
|
||||
2. X 轴和 Y 轴的标签和范围
|
||||
3. 数据点的具体数值(尽可能读取)
|
||||
4. 趋势、峰值、谷值
|
||||
5. 图例和颜色含义
|
||||
|
||||
**输出格式**:
|
||||
- 图表类型:[类型]
|
||||
- 时间范围:[起止时间]
|
||||
- 数据系列:[列表]
|
||||
- 关键趋势:[描述]
|
||||
- 异常点:[如有]
|
||||
```
|
||||
|
||||
### 4. UI/UX Mockup Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个界面设计稿
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [实现界面/评估设计/提取需求]
|
||||
|
||||
**分析重点**:
|
||||
1. 界面布局和区域划分
|
||||
2. 所有 UI 元素(按钮、输入框、列表等)
|
||||
3. 文案内容和标签
|
||||
4. 配色方案和字体(如果能识别)
|
||||
5. 交互元素和状态
|
||||
|
||||
**输出格式**:
|
||||
- 布局结构:[描述]
|
||||
- UI 元素清单:[列表]
|
||||
- 配色方案:[颜色值]
|
||||
- 交互说明:[描述]
|
||||
```
|
||||
|
||||
### 5. Flowchart/Process Diagram Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个流程图
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [理解流程/生成文档/实现逻辑]
|
||||
|
||||
**分析重点**:
|
||||
1. 流程的起点和终点
|
||||
2. 所有步骤/节点的内容
|
||||
3. 决策点和分支条件
|
||||
4. 流程方向和箭头含义
|
||||
5. 并行流程或循环
|
||||
|
||||
**输出格式**:
|
||||
- 流程步骤:[有序列表]
|
||||
- 决策点:[条件 + 分支]
|
||||
- 流程图描述:[文字版]
|
||||
```
|
||||
|
||||
### 6. Table/Data Grid Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这个表格
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [提取数据/理解结构/转换格式]
|
||||
|
||||
**分析重点**:
|
||||
1. 表格的行列结构
|
||||
2. 表头和各列含义
|
||||
3. 所有单元格的数据内容
|
||||
4. 合并单元格(如果有)
|
||||
5. 表格的总计或汇总行
|
||||
|
||||
**输出格式**:
|
||||
- 表格结构:[行数 x 列数]
|
||||
- 列名:[列表]
|
||||
- 数据内容:[Markdown 表格]
|
||||
```
|
||||
|
||||
### 7. Code Screenshot Analysis (OCR)
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请识别这个代码截图中的文字
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [提取代码/理解逻辑/转换格式]
|
||||
|
||||
**分析重点**:
|
||||
1. 完整的代码内容(逐行识别)
|
||||
2. 代码语言(根据语法判断)
|
||||
3. 缩进和格式
|
||||
4. 注释内容
|
||||
5. 任何特殊符号
|
||||
|
||||
**输出格式**:
|
||||
- 代码语言:[语言]
|
||||
- 代码内容:[代码块]
|
||||
- 关键逻辑:[简述]
|
||||
```
|
||||
|
||||
### 8. Handwritten Notes Analysis
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请识别这个手写笔记
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [转录文字/理解内容/整理笔记]
|
||||
|
||||
**分析重点**:
|
||||
1. 所有可识别的文字内容
|
||||
2. 标题和分段
|
||||
3. 列表和要点
|
||||
4. 图示或草图(如果有)
|
||||
5. 标注和高亮
|
||||
|
||||
**输出格式**:
|
||||
- 标题:[标题]
|
||||
- 内容:[结构化文本]
|
||||
- 要点:[列表]
|
||||
- 备注:[识别不清的部分]
|
||||
```
|
||||
|
||||
### 9. Comparison Task (Multiple Images)
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请对比分析这两张图片
|
||||
|
||||
**图片 1**: [路径 1]
|
||||
**图片 2**: [路径 2]
|
||||
|
||||
**用户需求**: [比较差异/选择更好的/找出变化]
|
||||
|
||||
**分析重点**:
|
||||
1. 每张图片的独立分析
|
||||
2. 相似之处
|
||||
3. 差异之处
|
||||
4. 各自的优缺点
|
||||
|
||||
**输出格式**:
|
||||
- 图片 1 分析:[描述]
|
||||
- 图片 2 分析:[描述]
|
||||
- 相似点:[列表]
|
||||
- 差异点:[列表]
|
||||
- 建议:[如有]
|
||||
```
|
||||
|
||||
### 10. General Purpose (Open-ended)
|
||||
|
||||
```markdown
|
||||
@multimodal-looker 请分析这张图片
|
||||
|
||||
**图片**: [图片路径]
|
||||
|
||||
**用户需求**: [原始需求]
|
||||
|
||||
**分析重点**:
|
||||
1. 图片的整体内容描述
|
||||
2. 关键视觉元素
|
||||
3. 任何文字信息
|
||||
4. 颜色、布局、风格
|
||||
5. 与用户需求相关的部分
|
||||
|
||||
**输出格式**: 自由格式,但请结构化输出
|
||||
```
|
||||
|
||||
## Response Integration Patterns
|
||||
|
||||
After receiving analysis from multimodal-looker, integrate results:
|
||||
|
||||
### Pattern 1: Acknowledge + Connect
|
||||
```markdown
|
||||
感谢分析。我看到了 [图片内容简述]。
|
||||
|
||||
根据你的需求 [xxx],结合图片中的 [关键信息],我建议...
|
||||
```
|
||||
|
||||
### Pattern 2: Summary + Action
|
||||
```markdown
|
||||
根据图片分析,关键信息是:
|
||||
1. [要点 1]
|
||||
2. [要点 2]
|
||||
|
||||
基于此,下一步行动是...
|
||||
```
|
||||
|
||||
### Pattern 3: Validation + Expansion
|
||||
```markdown
|
||||
图片分析结果确认了 [某信息]。
|
||||
|
||||
除此之外,还需要考虑...
|
||||
```
|
||||
|
||||
## Error Handling Templates
|
||||
|
||||
### Timeout Response
|
||||
```markdown
|
||||
抱歉,图片分析超时了。可能原因:
|
||||
- 图片文件过大
|
||||
- 网络延迟
|
||||
- 服务繁忙
|
||||
|
||||
你可以:
|
||||
1. 压缩图片后重试
|
||||
2. 用文字描述关键信息
|
||||
3. 稍后重试
|
||||
```
|
||||
|
||||
### Format Not Supported
|
||||
```markdown
|
||||
这个图片格式([格式])可能不被支持。
|
||||
|
||||
建议:
|
||||
- 转换为 PNG 或 JPG 格式
|
||||
- 或者描述图片内容,我继续帮助你
|
||||
```
|
||||
|
||||
### Analysis Failed
|
||||
```markdown
|
||||
抱歉,无法分析这张图片。可能原因:
|
||||
- 图片文件损坏
|
||||
- 图片内容无法识别
|
||||
- 服务暂时不可用
|
||||
|
||||
请尝试:
|
||||
- 重新上传清晰的图片
|
||||
- 或者用文字描述你想了解的内容
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### DO:
|
||||
1. Be specific about what you want analyzed
|
||||
2. Provide context about why you're asking
|
||||
3. Specify output format if you have preferences
|
||||
4. Acknowledge the analysis when received
|
||||
5. Connect visual info to the broader task
|
||||
|
||||
### DON'T:
|
||||
1. Don't just say "analyze this" without context
|
||||
2. Don't expect perfect OCR on blurry images
|
||||
3. Don't ignore the analysis results
|
||||
4. Don't delegate if the image isn't crucial
|
||||
5. Don't forget to thank multimodal-looker (optional but nice)
|
||||
|
||||
## Example Conversations
|
||||
|
||||
### Example 1: Error Debugging
|
||||
```
|
||||
User: 帮我看看这个报错 screenshot.png
|
||||
|
||||
Agent: @multimodal-looker 请分析这个错误日志截图...
|
||||
[delegation with template 1]
|
||||
|
||||
multimodal-looker: 分析结果:
|
||||
- 错误类型:NullPointerException
|
||||
- 位置:UserService.java:42
|
||||
- 原因:user 对象为 null
|
||||
- 建议:添加 null 检查
|
||||
|
||||
Agent: 我看到了这个错误,是 UserService 第 42 行的空指针异常。
|
||||
根据你的需求,解决方案是:
|
||||
1. 在调用 user 之前检查是否为 null
|
||||
2. 或者使用 Optional 包装
|
||||
...
|
||||
```
|
||||
|
||||
### Example 2: Architecture Review
|
||||
```
|
||||
User: 根据这个架构图 arch.png,帮我设计部署方案
|
||||
|
||||
Agent: @multimodal-looker 请分析这个架构图...
|
||||
[delegation with template 2]
|
||||
|
||||
multimodal-looker: 分析结果:
|
||||
- 组件:Frontend(React), API Gateway(Nginx), Backend(Node.js), DB(PostgreSQL)
|
||||
- 连接:React → Nginx → Node.js → PostgreSQL
|
||||
- 架构:典型的三层架构
|
||||
|
||||
Agent: 根据架构图,系统包含三层:前端、API 层、数据层。
|
||||
基于此,部署方案如下:
|
||||
1. 前端:部署到 CDN 或 S3
|
||||
2. API 层:Docker 容器化,负载均衡
|
||||
3. 数据库:主从复制,定期备份
|
||||
...
|
||||
```
|
||||
@@ -0,0 +1,195 @@
|
||||
# Visual Content Detection Patterns
|
||||
|
||||
Complete reference for detecting visual content in user inputs.
|
||||
|
||||
## File Extension Patterns
|
||||
|
||||
### Image Files (High Priority)
|
||||
```
|
||||
.png, .jpg, .jpeg, .gif, .bmp, .webp, .svg, .ico, .tiff, .tif, .heic, .heif, .raw, .psd, .ai, .eps
|
||||
```
|
||||
|
||||
**Detection Rule**: Case-insensitive match anywhere in input
|
||||
|
||||
### Document Files with Visual Content (Medium Priority)
|
||||
```
|
||||
.pdf (may contain diagrams), .ppt, .pptx (slides with visuals), .vsdx (Visio), .drawio
|
||||
```
|
||||
|
||||
**Detection Rule**: File extension + visual keywords
|
||||
|
||||
## Keyword Patterns
|
||||
|
||||
### Chinese Visual Keywords
|
||||
```
|
||||
一级关键词(高优先级):
|
||||
图片,图像,照片,截图,图表,图示,图形,影像,画面
|
||||
|
||||
二级关键词(中优先级):
|
||||
流程图,架构图,时序图,ER 图,思维导图,柱状图,饼图,折线图
|
||||
设计图,原型图,线框图,界面,UI,UX
|
||||
表格,表单,清单,列表
|
||||
|
||||
三级关键词(低优先级):
|
||||
显示,展示,呈现,可视化,看图,读图
|
||||
```
|
||||
|
||||
### English Visual Keywords
|
||||
```
|
||||
High Priority:
|
||||
image, photo, picture, screenshot, snapshot, capture, diagram, chart, graph, plot, figure
|
||||
|
||||
Medium Priority:
|
||||
flowchart, architecture, sequence diagram, ER diagram, mind map, bar chart, pie chart, line graph
|
||||
design, mockup, wireframe, interface, UI, UX, layout
|
||||
table, form, list, grid
|
||||
|
||||
Low Priority:
|
||||
show, display, visualize, view, look at, see
|
||||
```
|
||||
|
||||
### Technical Visual Keywords
|
||||
```
|
||||
Schema, model, blueprint, spec, technical drawing
|
||||
Dashboard, widget, panel, visualization
|
||||
Map, heatmap, scatter plot, histogram
|
||||
Infographic, poster, banner, thumbnail
|
||||
```
|
||||
|
||||
## Pattern Matching Rules
|
||||
|
||||
### Rule 1: File Path + Extension
|
||||
```regex
|
||||
[\w\-\.\/]+?\.(png|jpg|jpeg|gif|bmp|webp|svg|ico|tiff|heic)
|
||||
```
|
||||
|
||||
**Action**: Immediate delegation to multimodal-looker
|
||||
|
||||
### Rule 2: Markdown Image Syntax
|
||||
```regex
|
||||
!\[([^\]]*)\]\(([^\)]+)\)
|
||||
```
|
||||
|
||||
**Action**: Extract alt text and URL, delegate to multimodal-looker
|
||||
|
||||
### Rule 3: Base64 Image Data
|
||||
```regex
|
||||
data:image\/(png|jpeg|gif|webp);base64,[A-Za-z0-9+/=]+
|
||||
```
|
||||
|
||||
**Action**: Extract base64 data, save to temp file, delegate
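
Rule 3's action (extract, decode, save to a temp file) might look like the following. This is a sketch under assumptions: `save_data_uri_images` is a hypothetical helper, the pattern is Rule 3's regex with capture groups added, and the caller is responsible for deleting the files afterwards.

```python
import base64
import re
import tempfile

# Rule 3's pattern, with capture groups for the image subtype and the payload.
DATA_URI = re.compile(r"data:image/(png|jpeg|gif|webp);base64,([A-Za-z0-9+/=]+)")


def save_data_uri_images(text: str) -> list:
    """Decode every base64 image data URI in text and write each one to a temp file."""
    paths = []
    for match in DATA_URI.finditer(text):
        subtype, payload = match.groups()
        # validate=True rejects payloads containing non-base64 characters.
        data = base64.b64decode(payload, validate=True)
        with tempfile.NamedTemporaryFile(suffix=f".{subtype}", delete=False) as fh:
            fh.write(data)
            paths.append(fh.name)
    return paths
```

`NamedTemporaryFile` also accepts a `dir` argument, which could be used if temporary files must live in one agreed directory.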
|
||||
|
||||
### Rule 4: Keyword + File Reference
|
||||
```
|
||||
(图片|图像|diagram|chart|screenshot).*?[\w\-\.\/]+\.(png|jpg|jpeg|gif|bmp|webp)
|
||||
```
|
||||
|
||||
**Action**: Confirm intent, then delegate
|
||||
|
||||
### Rule 5: Keyword Only (Ambiguous)
|
||||
```
|
||||
(帮我看看这个图|分析这张图片|这个图表显示)
|
||||
```
|
||||
|
||||
**Action**: Ask for clarification: "请问是哪张图片?"
|
||||
|
||||
## Context-Aware Detection
|
||||
|
||||
### Code Development Context
|
||||
When user is working on code:
|
||||
- `architecture.png` → Architecture diagram
|
||||
- `screenshot.png` → Error or UI screenshot
|
||||
- `mockup.jpg` → Design reference
|
||||
|
||||
**Action**: Assume technical visual, delegate with context
|
||||
|
||||
### Data Analysis Context
|
||||
When user mentions data:
|
||||
- `chart`, `graph`, `plot`, `visualization`
|
||||
- `sales_chart.png`, `trend_graph.jpg`
|
||||
|
||||
**Action**: Assume data visualization, request data extraction
|
||||
|
||||
### Design Context
|
||||
When user discusses design:
|
||||
- `mockup`, `wireframe`, `prototype`, `design`
|
||||
- `ui_design.png`, `wireframe.jpg`
|
||||
|
||||
**Action**: Assume design visual, request UI/UX analysis
|
||||
|
||||
## Detection Confidence Levels
|
||||
|
||||
| Level | Confidence | Triggers | Action |
|
||||
|-------|------------|----------|--------|
|
||||
| HIGH | 90-100% | Image file + visual keyword | Auto-delegate |
|
||||
| MEDIUM | 60-89% | Image file OR strong keyword | Confirm then delegate |
|
||||
| LOW | 30-59% | Weak keyword only | Ask for clarification |
|
||||
| NONE | 0-29% | No visual signals | Process as text |
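
The table can be read as a small decision function. The sketch below is one interpretation, not the skill's actual scoring: it assumes "visual keyword" in the HIGH row means a strong (high or medium priority) keyword, and `classify` and `Confidence` are hypothetical names.

```python
from enum import Enum


class Confidence(Enum):
    # Each value is the action from the table's last column.
    HIGH = "auto-delegate"
    MEDIUM = "confirm then delegate"
    LOW = "ask for clarification"
    NONE = "process as text"


def classify(has_image_file: bool, has_strong_keyword: bool,
             has_weak_keyword: bool = False) -> Confidence:
    """Map detection signals to a confidence level, following the table row by row."""
    if has_image_file and has_strong_keyword:
        return Confidence.HIGH
    if has_image_file or has_strong_keyword:
        return Confidence.MEDIUM
    if has_weak_keyword:
        return Confidence.LOW
    return Confidence.NONE
```

Checking the rows from top to bottom keeps the most specific combination (file plus keyword) from being shadowed by the broader "file OR keyword" rule.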
|
||||
|
||||
## Edge Cases
|
||||
|
||||
### Ambiguous References
|
||||
```
|
||||
"看这个" (without specifying what)
|
||||
"这个文件" (could be text or image)
|
||||
```
|
||||
**Handling**: Ask "请问是哪个文件?是图片吗?"
|
||||
|
||||
### Multiple Images
|
||||
```
|
||||
"比较这两张图:img1.png 和 img2.png"
|
||||
```
|
||||
**Handling**: Delegate both, request comparison
|
||||
|
||||
### Image in Code Block
|
||||
````
|
||||
```
|
||||

|
||||
```
|
||||
````
|
||||
**Handling**: Still detect as visual content (user may be documenting)
|
||||
|
||||
### URL Images
|
||||
```
|
||||
https://example.com/image.png
|
||||
http://cdn.site.com/chart.jpg
|
||||
```
|
||||
**Handling**: Detect as visual, may need download first
|
||||
|
||||
## Implementation Checklist
|
||||
|
||||
- [ ] Scan input for file extensions
|
||||
- [ ] Check for markdown image syntax
|
||||
- [ ] Search for visual keywords
|
||||
- [ ] Evaluate context (code, data, design)
|
||||
- [ ] Assign confidence level
|
||||
- [ ] Execute appropriate action (delegate/confirm/ask)
|
||||
|
||||
## Testing Examples
|
||||
|
||||
### Should Trigger (High Confidence)
|
||||
```
|
||||
分析这个截图:error.png
|
||||
看这张架构图 design/architecture.png
|
||||
 显示什么?
|
||||
帮我看看 data:image/png;base64,...
|
||||
```
|
||||
|
||||
### Should Trigger (Medium Confidence)
|
||||
```
|
||||
这个图片怎么优化?screenshot.png
|
||||
diagram.jpg 有什么改进建议
|
||||
```
|
||||
|
||||
### Should Ask (Low Confidence)
|
||||
```
|
||||
帮我看看这个图 (no file specified)
|
||||
这个设计怎么样?(unclear if visual attached)
|
||||
```
|
||||
|
||||
### Should Not Trigger
|
||||
```
|
||||
帮我写代码
|
||||
这个文本怎么格式化
|
||||
纯文字内容
|
||||
```
|
||||
@@ -0,0 +1,443 @@
|
||||
# Failure Handling and Graceful Degradation
|
||||
|
||||
Comprehensive guide for handling visual processing failures.
|
||||
|
||||
## Failure Scenarios
|
||||
|
||||
### Scenario 1: multimodal-looker Agent Unavailable
|
||||
|
||||
**Symptoms:**
|
||||
- No response from @multimodal-looker
|
||||
- Error: "Agent not found" or "Service unavailable"
|
||||
- Timeout after 60+ seconds
|
||||
|
||||
**Detection:**
|
||||
```python
if agent_response.status == "timeout" or "unavailable" in error_message:
    trigger_failure_handling("agent_unavailable")
```
|
||||
|
||||
**Response Template:**
|
||||
```markdown
|
||||
抱歉,视觉分析服务暂时不可用。
|
||||
|
||||
**可能原因**:
|
||||
- multimodal-looker 服务正在维护
|
||||
- 网络连接问题
|
||||
- 服务负载过高
|
||||
|
||||
**替代方案**:
|
||||
1. **稍后重试**: 等待 5-10 分钟后再次尝试
|
||||
2. **文字描述**: 请用文字描述图片内容,我可以继续帮助你
|
||||
3. **手动分析**: 如果你能提供图片的关键信息,我可以基于此给出建议
|
||||
|
||||
**需要我帮你做什么**:
|
||||
- [ ] 稍后自动重试
|
||||
- [ ] 继续其他任务(先跳过图片分析)
|
||||
- [ ] 你用文字描述,我继续处理
|
||||
```
|
||||
|
||||
**Follow-up Actions:**
|
||||
- Log the failure for monitoring
|
||||
- Offer to retry after delay
|
||||
- Proceed with text-only workflow if possible
|
||||
|
||||
---
|
||||
|
||||
### Scenario 2: Image Format Not Supported
|
||||
|
||||
**Symptoms:**
|
||||
- Error: "Unsupported image format"
|
||||
- Error: "Cannot process file type"
|
||||
- Recognition rate very low
|
||||
|
||||
**Supported Formats:**
|
||||
```
|
||||
✅ Supported: PNG, JPG/JPEG, GIF, BMP, WebP, SVG (rasterized)
|
||||
❌ Unsupported: PSD, AI, EPS, RAW, HEIC (without conversion)
|
||||
⚠️ Limited: PDF (first page only), TIFF (may be slow)
|
||||
```
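
A pre-flight check against this list can catch unsupported files before any request is made. The sketch below is illustrative: `format_support` is a hypothetical helper, the tiers are copied from the list above, and treating `.tif` the same as `.tiff` is an added assumption.

```python
from pathlib import Path

# Support tiers copied from the list above.
SUPPORTED = {"png", "jpg", "jpeg", "gif", "bmp", "webp", "svg"}
LIMITED = {"pdf", "tiff", "tif"}  # "tif" added as an assumed alias of TIFF
UNSUPPORTED = {"psd", "ai", "eps", "raw", "heic"}


def format_support(filename: str) -> str:
    """Return 'supported', 'limited', 'unsupported', or 'unknown' for a file name."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in SUPPORTED:
        return "supported"
    if ext in LIMITED:
        return "limited"
    if ext in UNSUPPORTED:
        return "unsupported"
    return "unknown"
```

An "unsupported" result would lead to the conversion advice in the response template, while "unknown" is better handled by asking the user what the file contains.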
|
||||
|
||||
**Response Template:**
|
||||
```markdown
|
||||
这个图片格式(`.xxx`)可能不被支持。
|
||||
|
||||
**当前支持的格式**:
|
||||
- ✅ PNG, JPG/JPEG, GIF, BMP, WebP
|
||||
- ⚠️ SVG, TIFF, PDF(有限支持)
|
||||
|
||||
**建议操作**:
|
||||
1. **转换格式**:
|
||||
- 使用在线工具转换为 PNG 或 JPG
|
||||
- 推荐工具:[工具名称/链接]
|
||||
|
||||
2. **截图替代**:
|
||||
- 如果是设计文件,可以截图为 PNG
|
||||
|
||||
3. **文字描述**:
|
||||
- 描述图片内容,我继续帮助你
|
||||
|
||||
**需要帮助转换吗**?我可以提供具体的转换步骤。
|
||||
```
|
||||
|
||||
**Recovery Options:**
|
||||
- Provide format conversion tools
|
||||
- Suggest screenshot as alternative
|
||||
- Accept text description
|
||||
|
||||
---
|
||||
|
||||
### Scenario 3: Image File Not Found
|
||||
|
||||
**Symptoms:**
|
||||
- Error: "File not found"
|
||||
- Error: "Cannot access file"
|
||||
- Empty response
|
||||
|
||||
**Detection:**
|
||||
```python
error_text = error.lower()  # the symptom messages are capitalised ("File not found")
if "file not found" in error_text or "cannot access" in error_text:
    trigger_failure_handling("file_not_found")
```
|
||||
|
||||
**Response Template:**
|
||||
````markdown
Image file not found: `[file path]`

**Possible causes**:
1. ❌ The file path is incorrect
2. ❌ The file has been deleted or moved
3. ❌ Access permission is missing
4. ❌ The file is in use by another program

**Please check**:
- [ ] The file path is complete and correct
- [ ] The file actually exists at that location
- [ ] You have permission to read the file
- [ ] The file is not locked by another program

**Solutions**:
1. **Provide the path again**: confirm the correct file path
2. **Upload again**: re-upload the image if possible
3. **Use an absolute path**: try the full absolute path

**Example**:
```
Wrong: ./image.png
Correct: D:/projects/my-app/assets/image.png
```
````
|
||||
|
||||
**Debugging Steps:**
|
||||
1. Verify file path syntax
|
||||
2. Check if file exists
|
||||
3. Verify permissions
|
||||
4. Try absolute path
|
||||
|
||||
---
|
||||
|
||||
### Scenario 4: Analysis Timeout
|
||||
|
||||
**Symptoms:**
|
||||
- No response after 60 seconds
|
||||
- Error: "Request timeout"
|
||||
- Partial response then silence
|
||||
|
||||
**Timeout Thresholds:**
|
||||
```
|
||||
Normal image (< 5MB): 10-30 seconds
|
||||
Large image (5-20MB): 30-60 seconds
|
||||
Very large (> 20MB): May timeout
|
||||
Complex diagram: 20-40 seconds
|
||||
```
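
The size-based thresholds can be turned into a lookup that decides how long to wait, or whether to compress first. This is a rough sketch: `expected_wait_seconds` is a hypothetical helper, the boundaries at exactly 5MB and 20MB are an assumption, and the "complex diagram" row is not modelled because complexity cannot be derived from file size.

```python
MB = 1024 * 1024


def expected_wait_seconds(size_bytes: int):
    """Return the (min, max) expected analysis time in seconds, or None above 20MB."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if size_bytes < 5 * MB:
        return (10, 30)   # normal image
    if size_bytes <= 20 * MB:
        return (30, 60)   # large image
    return None           # very large: may time out, so compress before sending
```

A caller could use the upper bound as the request timeout and treat `None` as a signal to compress or split the image.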
|
||||
|
||||
**Response Template:**
|
||||
```markdown
|
||||
图片分析超时(等待超过 60 秒)。
|
||||
|
||||
**可能原因**:
|
||||
1. 🐌 图片文件过大(超过 20MB)
|
||||
2. 🌐 网络延迟或不稳定
|
||||
3. 🔄 服务繁忙,处理队列长
|
||||
4. 🖼️ 图片内容过于复杂
|
||||
|
||||
**建议操作**:
|
||||
1. **压缩图片**:
|
||||
- 目标大小:< 5MB
|
||||
- 推荐工具:TinyPNG, Squoosh
|
||||
|
||||
2. **降低分辨率**:
|
||||
- 建议尺寸:1920x1080 或更低
|
||||
- 保持清晰度即可
|
||||
|
||||
3. **简化内容**:
|
||||
- 如果是长图,可以分段发送
|
||||
- 如果是多页,可以分页发送
|
||||
|
||||
4. **重试**:
|
||||
- 等待几分钟后再次尝试
|
||||
- 网络可能暂时不稳定
|
||||
|
||||
**需要我帮你压缩图片吗**?或者你可以先描述关键信息,我继续处理。
|
||||
```
|
||||
|
||||
**Optimization Tips:**
|
||||
- Recommend image compression
|
||||
- Suggest resolution reduction
|
||||
- Offer to split large images
|
||||
|
||||
---
|
||||
|
||||
### Scenario 5: Poor Image Quality
|
||||
|
||||
**Symptoms:**
|
||||
- Low confidence in analysis
|
||||
- "Cannot read text clearly"
|
||||
- Missing details in response
|
||||
- Blurry or distorted recognition
|
||||
|
||||
**Quality Issues:**
|
||||
```
|
||||
❌ Blurry: Out of focus, motion blur
|
||||
❌ Dark: Underexposed, low contrast
|
||||
❌ Small text: Resolution too low
|
||||
❌ Glare: Reflection, overexposed
|
||||
❌ Cropped: Important content cut off
|
||||
```
|
||||
|
||||
**Response Template:**
|
||||
```markdown
|
||||
图片分析完成,但质量不高,可能影响准确性。
|
||||
|
||||
**识别到的问题**:
|
||||
- ⚠️ 图片模糊,文字识别困难
|
||||
- ⚠️ 光线不足,部分细节不清
|
||||
- ⚠️ 分辨率低,小字无法辨认
|
||||
|
||||
**当前分析结果**(仅供参考):
|
||||
[分析结果,标注置信度]
|
||||
|
||||
**建议改进**:
|
||||
1. **重新拍摄**:
|
||||
- 确保光线充足
|
||||
- 保持相机稳定
|
||||
- 对焦清晰
|
||||
|
||||
2. **提高分辨率**:
|
||||
- 使用更高 DPI 扫描
|
||||
- 截图而非拍照
|
||||
|
||||
3. **调整角度**:
|
||||
- 避免反光和阴影
|
||||
- 正面拍摄,不要倾斜
|
||||
|
||||
**或者**:
|
||||
- 你可以直接告诉我图片中的关键信息
|
||||
- 或者指出你关心的具体部分,我尽力分析
|
||||
```
|
||||
|
||||
**Workarounds:**
|
||||
- Provide best-effort analysis with confidence disclaimer
|
||||
- Ask user to clarify specific areas
|
||||
- Accept text input for critical information
|
||||
|
||||
---
|
||||
|
||||
### Scenario 6: Partial Analysis (Incomplete Results)
|
||||
|
||||
**Symptoms:**
|
||||
- Response cuts off mid-sentence
|
||||
- Only partial image analyzed
|
||||
- Missing requested information
|
||||
|
||||
**Response Template:**
|
||||
```markdown
|
||||
图片分析似乎不完整,只获取了部分结果。
|
||||
|
||||
**已获取的信息**:
|
||||
- [列出已分析的内容]
|
||||
|
||||
**缺失的信息**:
|
||||
- [列出未分析的部分]
|
||||
|
||||
**建议操作**:
|
||||
1. **请求补充分析**:
|
||||
@multimodal-looker 请补充分析图片的 [具体部分]
|
||||
|
||||
2. **分段处理**:
|
||||
如果是长图或复杂图片,可以分段发送分析
|
||||
|
||||
3. **手动补充**:
|
||||
你可以补充缺失部分的信息,我整合分析
|
||||
|
||||
**需要我请求补充分析吗**?或者你提供额外信息?
|
||||
```
|
||||
|
||||
**Recovery:**
|
||||
- Request targeted re-analysis
|
||||
- Ask user to provide missing context
|
||||
- Combine partial results with text input
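
One way to notice a cut-off response programmatically is sketched below. It assumes the analysis result arrives as an OpenAI-compatible chat completion dictionary, where `finish_reason` is `"length"` when the token limit ended the output; whether the vision backend used here reports that field is an assumption, and `is_truncated` is a hypothetical helper name.

```python
from typing import Any, Dict


def is_truncated(response: Dict[str, Any]) -> bool:
    """Return True when the response looks incomplete."""
    choices = response.get("choices") or []
    if not choices:
        # No answer at all is treated as an incomplete result
        return True
    # "length" means the output stopped because the token limit was reached
    return choices[0].get("finish_reason") == "length"
```

When this returns `True`, the recovery steps above apply: request a targeted re-analysis or ask the user for the missing context.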
|
||||
|
||||
---
|
||||
|
||||
## Escalation Protocol
|
||||
|
||||
### Level 1: Automatic Retry
|
||||
```
|
||||
First failure → Wait 10 seconds → Retry once
|
||||
```
|
||||
|
||||
### Level 2: Alternative Approach
|
||||
```
|
||||
Second failure → Suggest alternative (compression, format conversion, text description)
|
||||
```
|
||||
|
||||
### Level 3: Human Intervention
|
||||
```
|
||||
Third failure → Acknowledge limitation → Request manual input
|
||||
```
|
||||
|
||||
### Level 4: Task Redesign
|
||||
```
|
||||
Persistent failure → Propose alternative workflow → Bypass visual processing
|
||||
```
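
The four levels above can be sketched as a small lookup plus a Level 1 retry helper. This is an illustrative sketch only: `next_action`, `analyze_with_retry`, and the action labels are hypothetical names, and `analyze` stands for whatever callable performs the image analysis.

```python
import time
from typing import Callable, Optional

RETRY_DELAY_SECONDS = 10  # Level 1 wait time from the protocol

ESCALATION_STEPS = {
    1: "retry_once",            # Level 1: automatic retry
    2: "suggest_alternative",   # Level 2: compression, format conversion, text description
    3: "request_manual_input",  # Level 3: acknowledge limitation, ask the user
}


def next_action(failure_count: int) -> str:
    """Pick the escalation step for the given number of failures so far."""
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    # Level 4: anything beyond the third failure bypasses visual processing
    return ESCALATION_STEPS.get(failure_count, "propose_alternative_workflow")


def analyze_with_retry(
    analyze: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Level 1: try once, wait after the first failure, retry once."""
    for attempt in (1, 2):
        try:
            return analyze()
        except Exception:
            if attempt == 1:
                sleep(RETRY_DELAY_SECONDS)
    return None  # both attempts failed; the caller moves on to Level 2
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting ten seconds.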
|
||||
|
||||
---
|
||||
|
||||
## Recovery Strategies
|
||||
|
||||
### Strategy 1: Text-Based Alternative
|
||||
```markdown
|
||||
既然图片分析不可用,我们可以用文字方式继续:
|
||||
|
||||
**请你描述**:
|
||||
1. 图片的主要内容是什么?
|
||||
2. 你关心图片中的哪些部分?
|
||||
3. 你希望基于图片完成什么任务?
|
||||
|
||||
**我会根据描述**:
|
||||
- 提供针对性建议
|
||||
- 继续后续工作
|
||||
- 必要时再尝试图片分析
|
||||
```
|
||||
|
||||
### Strategy 2: Incremental Processing
|
||||
```markdown
|
||||
我们可以分步处理:
|
||||
|
||||
**步骤 1**: 你先描述图片概要
|
||||
**步骤 2**: 我基于概要提供初步建议
|
||||
**步骤 3**: 针对关键点,我们再尝试图片分析
|
||||
**步骤 4**: 整合结果,完成整体任务
|
||||
|
||||
这样即使某一步失败,也能继续推进。
|
||||
```
|
||||
|
||||
### Strategy 3: External Tool Assistance
|
||||
```markdown
|
||||
如果内置工具不可用,可以尝试外部工具:
|
||||
|
||||
**推荐工具**:
|
||||
- OCR: 白描、ABBYY FineReader
|
||||
- 图表分析:ChartExpo、Plotly
|
||||
- 架构图:Draw.io、Lucidchart
|
||||
|
||||
**流程**:
|
||||
1. 使用外部工具分析
|
||||
2. 导出分析结果(文字/数据)
|
||||
3. 我基于结果继续处理
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring and Logging
|
||||
|
||||
### Failure Metrics to Track
|
||||
```
|
||||
- Total visual requests
|
||||
- Success rate
|
||||
- Average response time
|
||||
- Failure reasons distribution
|
||||
- Retry success rate
|
||||
```
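
The metrics listed above could be derived from per-request records along the following lines. The record fields (`success`, `duration_seconds`, `failure_type`, `retry_count`) and the function name `summarize` are assumptions made for this sketch; the guide does not prescribe a storage format for raw requests.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate the five tracked metrics from a list of request records."""
    total = len(records)
    successes = sum(1 for r in records if r["success"])
    retried = [r for r in records if r["retry_count"] > 0]
    retried_ok = sum(1 for r in retried if r["success"])
    return {
        "total_requests": total,
        "success_rate": successes / total if total else 0.0,
        "avg_response_seconds": (
            sum(r["duration_seconds"] for r in records) / total if total else 0.0
        ),
        # Count failures per reason, e.g. {"timeout": 2}
        "failure_reasons": dict(
            Counter(r["failure_type"] for r in records if not r["success"])
        ),
        "retry_success_rate": retried_ok / len(retried) if retried else 0.0,
    }
```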
|
||||
|
||||
### Log Format
|
||||
```json
|
||||
{
|
||||
"timestamp": "2026-02-26T10:30:00Z",
|
||||
"task_id": "vision-12345",
|
||||
"failure_type": "timeout",
|
||||
"image_size": "15MB",
|
||||
"retry_count": 2,
|
||||
"resolution": "text_description",
|
||||
"user_satisfaction": "pending"
|
||||
}
|
||||
```
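
A record in the format above could be produced with a helper like the one below. This is a sketch under stated assumptions: `build_failure_log` and its parameters are hypothetical, and the size is rounded to whole megabytes to match the `"15MB"` style of the example.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_failure_log(
    task_id: str,
    failure_type: str,
    image_size_bytes: int,
    retry_count: int,
    resolution: str,
    user_satisfaction: str = "pending",
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build one log entry with the same keys as the example record."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "task_id": task_id,
        "failure_type": failure_type,
        "image_size": f"{round(image_size_bytes / (1024 * 1024))}MB",
        "retry_count": retry_count,
        "resolution": resolution,
        "user_satisfaction": user_satisfaction,
    }


if __name__ == "__main__":
    entry = build_failure_log("vision-12345", "timeout", 15 * 1024 * 1024, 2, "text_description")
    print(json.dumps(entry, ensure_ascii=False))
```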
|
||||
|
||||
---
|
||||
|
||||
## User Communication Best Practices
|
||||
|
||||
### DO:
|
||||
1. ✅ Acknowledge the failure promptly
|
||||
2. ✅ Explain the reason clearly (without technical jargon)
|
||||
3. ✅ Provide concrete alternatives
|
||||
4. ✅ Offer specific next steps
|
||||
5. ✅ Maintain positive, helpful tone
|
||||
|
||||
### DON'T:
|
||||
1. ❌ Blame the user ("你的图片有问题")
|
||||
2. ❌ Make excuses without solutions
|
||||
3. ❌ Give up without alternatives
|
||||
4. ❌ Use vague language ("可能", "也许")
|
||||
5. ❌ Repeat the same failed approach
|
||||
|
||||
### Example Phrases
|
||||
|
||||
**Good**:
|
||||
- "抱歉,图片分析遇到了问题。我们可以试试这样的替代方案..."
|
||||
- "虽然无法分析图片,但如果你能描述 [关键信息],我可以继续帮助你..."
|
||||
- "这个格式暂时不支持,建议转换为 PNG。需要我提供转换工具吗?"
|
||||
|
||||
**Bad**:
|
||||
- "这个图片不行" (too vague)
- "我处理不了" (gives up)
- "你换个图片" (shifts responsibility to the user)
|
||||
|
||||
---
|
||||
|
||||
## Testing Checklist
|
||||
|
||||
Test failure handling for:
|
||||
|
||||
- [ ] Agent timeout (simulate delay)
|
||||
- [ ] Unsupported format (send .psd file)
|
||||
- [ ] File not found (invalid path)
|
||||
- [ ] Large file timeout (> 20MB)
|
||||
- [ ] Blurry image (low quality)
|
||||
- [ ] Partial response (cut off mid-analysis)
|
||||
- [ ] Network error (disconnect)
|
||||
- [ ] Service error (mock 500 response)
|
||||
|
||||
For each test:
|
||||
- [ ] Appropriate error message shown
|
||||
- [ ] Alternatives provided
|
||||
- [ ] User can continue task
|
||||
- [ ] No crash or hang
|
||||
- [ ] Logs recorded
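
Several checklist items can be exercised without a real service by injecting a failing callable. The sketch below is illustrative only: `handle_analysis`, the `reason` labels, and the `next_step` values are hypothetical and do not correspond to an existing function in this skill.

```python
from typing import Callable, Dict


def handle_analysis(analyze: Callable[[], str]) -> Dict[str, str]:
    """Run an analysis callable and convert failures into a structured result."""
    try:
        return {"status": "ok", "result": analyze()}
    except TimeoutError:
        return {"status": "failed", "reason": "timeout", "next_step": "compress_or_describe"}
    except FileNotFoundError:
        return {"status": "failed", "reason": "file_not_found", "next_step": "confirm_path"}
    except Exception:
        return {"status": "failed", "reason": "service_error", "next_step": "retry_later"}


def fake_timeout() -> str:
    """Simulate the 'Agent timeout' checklist item."""
    raise TimeoutError("simulated delay")


def fake_missing_file() -> str:
    """Simulate the 'File not found' checklist item."""
    raise FileNotFoundError("no such image")


if __name__ == "__main__":
    for case in (fake_timeout, fake_missing_file):
        outcome = handle_analysis(case)
        # Each failure must yield an alternative and must not crash
        assert outcome["status"] == "failed" and outcome["next_step"]
        print(case.__name__, outcome)
```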
|
||||
|
||||
---
|
||||
|
||||
## Continuous Improvement
|
||||
|
||||
After each failure:
|
||||
1. Record failure type and context
|
||||
2. Analyze root cause
|
||||
3. Update detection/prevention logic
|
||||
4. Improve response templates
|
||||
5. Test with similar scenarios
|
||||
|
||||
Share learnings with:
|
||||
- Skill maintainers
|
||||
- OMO development team
|
||||
- multimodal-looker operators
|
||||
@@ -0,0 +1,5 @@
|
||||
# agent-vision-awareness - dependencies
|
||||
httpx>=0.0.1
|
||||
requests>=2.28.0
|
||||
openai>=1.0.0
|
||||
vision-analyzer>=0.0.1
|
||||
@@ -0,0 +1,4 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Example script - delete if not needed."""
|
||||
|
||||
print("Hello from skill!")
|
||||
@@ -0,0 +1,167 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Integration script for agent-vision-awareness skill
|
||||
|
||||
This script demonstrates the complete workflow:
|
||||
1. Detect visual content in user input
|
||||
2. Extract image paths
|
||||
3. Analyze images using the vision analyzer
|
||||
4. Return structured results
|
||||
|
||||
This replaces the problematic custom agent delegation approach.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict, List, Optional


def process_user_input(
    user_input: str, user_request: str = "", config: Optional[Dict[str, str]] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Process user input for visual content and return analysis results.
|
||||
|
||||
Args:
|
||||
user_input: The user's input text that may contain visual content references
|
||||
user_request: The original user request/context (optional)
|
||||
config: Configuration for the vision analyzer (optional)
|
||||
|
||||
Returns:
|
||||
Dictionary containing detection results and analysis
|
||||
"""
|
||||
try:
|
||||
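        # Import the detector and analyzer. Relative imports work when this file is
        # loaded as part of a package; the fallback covers running it as a script.
        try:
            from .vision_detector import VisionContentDetector, DetectionConfidence
            from .standalone_vision_analyzer import StandaloneVisionAnalyzer
        except ImportError:
            from vision_detector import VisionContentDetector, DetectionConfidence
            from standalone_vision_analyzer import StandaloneVisionAnalyzer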
|
||||
|
||||
# Initialize components
|
||||
detector = VisionContentDetector()
|
||||
analyzer = StandaloneVisionAnalyzer(config)
|
||||
|
||||
# Step 1: Detect visual content
|
||||
confidence, detected_items = detector.detect_visual_content(user_input)
|
||||
result = {
|
||||
"status": "success",
|
||||
"confidence": confidence.value,
|
||||
"detected_items": detected_items,
|
||||
"analysis_results": [],
|
||||
"errors": [],
|
||||
}
|
||||
|
||||
if confidence == DetectionConfidence.NONE:
|
||||
result["message"] = "No visual content detected"
|
||||
return result
|
||||
|
||||
# Step 2: Extract image paths
|
||||
image_paths = detector.extract_image_paths(user_input)
|
||||
if not image_paths:
|
||||
result["message"] = "Visual content detected but no valid image paths found"
|
||||
result["errors"].append("No valid image paths found")
|
||||
return result
|
||||
|
||||
# Step 3: Determine analysis mode
|
||||
combined_text = (user_request + " " + user_input).lower()
|
||||
if any(
|
||||
word in combined_text for word in ["text", "文字", "ocr", "read", "识别"]
|
||||
):
|
||||
mode = "ocr"
|
||||
elif any(
|
||||
word in combined_text
|
||||
for word in ["chart", "graph", "plot", "图表", "数据", "趋势"]
|
||||
):
|
||||
mode = "chart"
|
||||
elif any(
|
||||
word in combined_text for word in ["fashion", "服装", "穿搭", "style"]
|
||||
):
|
||||
mode = "fashion"
|
||||
elif any(word in combined_text for word in ["product", "产品", "商品", "item"]):
|
||||
mode = "product"
|
||||
elif any(
|
||||
word in combined_text
|
||||
for word in ["scene", "场景", "环境", "location", "place"]
|
||||
):
|
||||
mode = "scene"
|
||||
else:
|
||||
mode = "describe"
|
||||
|
||||
# Step 4: Analyze each image
|
||||
for image_path in image_paths:
|
||||
try:
|
||||
# Handle relative paths
|
||||
if not os.path.isabs(image_path):
|
||||
image_path = os.path.join(os.getcwd(), image_path)
|
||||
|
||||
# Analyze the image
|
||||
if mode == "custom":
|
||||
analysis_result = analyzer.analyze_with_mode(
|
||||
Path(image_path),
|
||||
"custom",
|
||||
user_request or "Please analyze this image.",
|
||||
)
|
||||
else:
|
||||
analysis_result = analyzer.analyze_with_mode(Path(image_path), mode)
|
||||
|
||||
result["analysis_results"].append(
|
||||
{"image_path": image_path, "mode": mode, "result": analysis_result}
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Failed to analyze {image_path}: {str(e)}"
|
||||
result["errors"].append(error_msg)
|
||||
print(f"Error: {error_msg}", file=sys.stderr)
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"status": "error",
|
||||
"error": str(e),
|
||||
"message": f"Processing failed: {str(e)}",
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
"""Command line interface for testing."""
|
||||
import argparse
|
||||
import json
|
||||
|
||||
parser = argparse.ArgumentParser(description="Process visual content in user input")
|
||||
parser.add_argument("input", help="User input containing visual content references")
|
||||
parser.add_argument("--request", "-r", help="Original user request/context")
|
||||
    # --api-key and --config-file are alternative ways to supply credentials
    config_group = parser.add_mutually_exclusive_group()
    config_group.add_argument("--api-key", help="API key for vision service")
    config_group.add_argument("--config-file", help="Configuration file path")
|
||||
parser.add_argument("--output", "-o", help="Output file for results")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Build configuration
|
||||
config = {}
|
||||
if args.api_key:
|
||||
config["api_key"] = args.api_key
|
||||
config["base_url"] = "https://ark.cn-beijing.volces.com/api/coding/v3"
|
||||
config["model"] = "doubao-seed-code"
|
||||
elif args.config_file:
|
||||
|
||||
|
||||
with open(args.config_file, "r", encoding="utf-8") as f:
|
||||
config = json.load(f)
|
||||
|
||||
# Process the input
|
||||
result = process_user_input(args.input, args.request or "", config)
|
||||
|
||||
# Output results
|
||||
output = json.dumps(result, indent=2, ensure_ascii=False)
|
||||
|
||||
if args.output:
|
||||
with open(args.output, "w", encoding="utf-8") as f:
|
||||
f.write(output)
|
||||
print(f"Results saved to: {args.output}")
|
||||
else:
|
||||
print(output)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,216 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Standalone Vision Analyzer - Simplified version for agent-vision-awareness skill
|
||||
|
||||
This is a self-contained version of the vision analyzer that doesn't depend on
|
||||
the image-service skill structure, making it easier to integrate directly.
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, Optional
|
||||
import httpx
|
||||
|
||||
|
||||
class StandaloneVisionAnalyzer:
|
||||
"""Standalone vision analyzer using direct API calls."""
|
||||
|
||||
# Predefined analysis modes
|
||||
ANALYSIS_MODES = {
|
||||
"describe": "请详细描述这张图片的内容,包括:人物、场景、物品、颜色、布局等所有细节。",
|
||||
"ocr": "请仔细识别这张图片中的所有文字内容,按照文字在图片中的位置顺序输出。如果是中文,请保持原文输出。",
|
||||
"chart": "请分析这张图表的内容,包括:图表类型、数据趋势、关键数据点、标题标签、以及数据的结论或洞察。",
|
||||
"fashion": "请分析这张图片中人物的穿搭,包括:服装款式、颜色搭配、配饰、整体风格等。",
|
||||
"product": "请分析这张产品图片,包括:产品类型、外观特征、功能特点、品牌信息等。",
|
||||
"scene": "请描述这张图片的场景,包括:地点、环境、氛围、时间(白天/夜晚)等。",
|
||||
"custom": "用户自定义问题",
|
||||
}
|
||||
|
||||
def __init__(self, config: Optional[Dict[str, str]] = None):
|
||||
"""
|
||||
Initialize the analyzer.
|
||||
|
||||
Args:
|
||||
config: Configuration dictionary with api_key, base_url, model
|
||||
"""
|
||||
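        # Fall back to environment variables when no config (or an empty one) is given
        if not config:
            config = self._load_config()

        # The key must come from the config or the environment; nothing is
        # hardcoded, so a missing key is reported by the check below.
        self.api_key = config.get("api_key") or config.get("VOLCENGINE_API_KEY")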
|
||||
self.base_url = (
|
||||
config.get("base_url") or "https://ark.cn-beijing.volces.com/api/coding/v3"
|
||||
)
|
||||
self.model = config.get("model") or "doubao-seed-code"
|
||||
|
||||
if not self.api_key or not self.base_url:
|
||||
raise ValueError("Missing required API configuration: api_key and base_url")
|
||||
|
||||
def _load_config(self) -> Dict[str, str]:
|
||||
"""Load configuration from environment variables or config file."""
|
||||
config = {}
|
||||
|
||||
# Load from environment variables
|
||||
config["api_key"] = os.environ.get("VOLCENGINE_API_KEY") or os.environ.get(
|
||||
"DASHSCOPE_API_KEY"
|
||||
)
|
||||
config["base_url"] = os.environ.get("VISION_API_BASE_URL")
|
||||
config["model"] = os.environ.get("VISION_MODEL")
|
||||
|
||||
return config
|
||||
|
||||
def encode_image(self, image_path: Path) -> str:
|
||||
"""Encode image to base64."""
|
||||
with open(image_path, "rb") as image_file:
|
||||
return base64.b64encode(image_file.read()).decode("utf-8")
|
||||
|
||||
def analyze(self, image_path: Path, question: str) -> str:
|
||||
"""
|
||||
Analyze image content.
|
||||
|
||||
Args:
|
||||
image_path: Path to the image file
|
||||
question: Question/prompt for analysis
|
||||
|
||||
Returns:
|
||||
Analysis result text
|
||||
"""
|
||||
if not image_path.exists():
|
||||
raise FileNotFoundError(f"Image not found: {image_path}")
|
||||
|
||||
base64_image = self.encode_image(image_path)
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
payload = {
|
||||
"model": self.model,
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": question},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": f"data:image/png;base64,{base64_image}"
|
||||
},
|
||||
},
|
||||
],
|
||||
}
|
||||
],
|
||||
"max_tokens": 2000,
|
||||
}
|
||||
|
||||
try:
|
||||
with httpx.Client(timeout=30.0) as client:
|
||||
response = client.post(
|
||||
f"{self.base_url}/chat/completions", headers=headers, json=payload
|
||||
)
|
||||
response.raise_for_status()
|
||||
result = response.json()
|
||||
return result["choices"][0]["message"]["content"]
|
||||
except httpx.HTTPStatusError as e:
|
||||
if e.response.status_code == 404:
|
||||
raise ValueError(
|
||||
f"API endpoint not found, check base_url: {self.base_url}"
|
||||
)
|
||||
elif e.response.status_code == 401:
|
||||
raise ValueError("Invalid or expired API key")
|
||||
else:
|
||||
raise RuntimeError(f"API request failed: {e}")
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Analysis failed: {e}")
|
||||
|
||||
def analyze_with_mode(
|
||||
self,
|
||||
image_path: Path,
|
||||
mode: str = "describe",
|
||||
custom_question: Optional[str] = None,
|
||||
) -> str:
|
||||
"""
|
||||
Analyze image with predefined mode.
|
||||
|
||||
Args:
|
||||
image_path: Path to the image file
|
||||
mode: Analysis mode (describe, ocr, chart, fashion, product, scene, custom)
|
||||
custom_question: Custom question for custom mode
|
||||
|
||||
Returns:
|
||||
Analysis result text
|
||||
"""
|
||||
if mode not in self.ANALYSIS_MODES:
|
||||
raise ValueError(
|
||||
f"Unsupported mode: {mode}, available: {list(self.ANALYSIS_MODES.keys())}"
|
||||
)
|
||||
|
||||
if mode == "custom":
|
||||
if not custom_question:
|
||||
raise ValueError("Custom mode requires custom_question parameter")
|
||||
question = custom_question
|
||||
else:
|
||||
question = self.ANALYSIS_MODES[mode]
|
||||
|
||||
return self.analyze(image_path, question)
|
||||
|
||||
|
||||
def main():
|
||||
"""Command line interface."""
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(description="Standalone Vision Analyzer")
|
||||
parser.add_argument("image", help="Image path")
|
||||
parser.add_argument(
|
||||
"--mode",
|
||||
"-m",
|
||||
choices=["describe", "ocr", "chart", "fashion", "product", "scene", "custom"],
|
||||
default="describe",
|
||||
help="Analysis mode",
|
||||
)
|
||||
parser.add_argument("--question", "-q", help="Custom question for custom mode")
|
||||
parser.add_argument("--output", "-o", help="Output file")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
image_path = Path(args.image)
|
||||
if not image_path.exists():
|
||||
print(f"Error: Image not found: {image_path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
analyzer = StandaloneVisionAnalyzer()
|
||||
|
||||
if args.mode == "custom":
|
||||
if not args.question:
|
||||
print(
|
||||
"Error: Custom mode requires --question parameter", file=sys.stderr
|
||||
)
|
||||
sys.exit(1)
|
||||
result = analyzer.analyze_with_mode(image_path, "custom", args.question)
|
||||
else:
|
||||
result = analyzer.analyze_with_mode(image_path, args.mode)
|
||||
|
||||
if args.output:
|
||||
with open(args.output, "w", encoding="utf-8") as f:
|
||||
f.write(result)
|
||||
print(f"Result saved to: {args.output}")
|
||||
else:
|
||||
print("Analysis Result:")
|
||||
print("-" * 50)
|
||||
print(result)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,106 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test script for agent-vision-awareness skill
|
||||
|
||||
This script tests the vision detection and processing capabilities.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def test_detection():
|
||||
"""Test visual content detection."""
|
||||
print("Testing visual content detection...")
|
||||
|
||||
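    try:
        from .vision_detector import VisionContentDetector, DetectionConfidence
    except ImportError:
        # Direct script execution has no parent package, so use a plain import
        from vision_detector import VisionContentDetector, DetectionConfidence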
|
||||
|
||||
detector = VisionContentDetector()
|
||||
|
||||
test_cases = [
|
||||
("帮我分析这个截图 error.png", DetectionConfidence.HIGH),
|
||||
        # "图片" is a high-priority keyword (score 0.8), which maps to MEDIUM
        ("描述这张图片的内容", DetectionConfidence.MEDIUM),
|
||||
("根据架构图 design/architecture.png 生成部署方案", DetectionConfidence.HIGH),
|
||||
("写一个 Python 脚本", DetectionConfidence.NONE),
|
||||
(" 显示什么?", DetectionConfidence.HIGH),
|
||||
]
|
||||
|
||||
    all_passed = True
    for test_input, expected_confidence in test_cases:
        confidence, items = detector.detect_visual_content(test_input)
        passed = confidence == expected_confidence
        all_passed = all_passed and passed
        status = "✅" if passed else "❌"
        print(f"{status} Input: {test_input}")
        print(f"   Expected: {expected_confidence.value}, Got: {confidence.value}")
        if items:
            print(f"   Detected: {items}")
        print()

    # Report failure when any case produced an unexpected confidence level
    return all_passed
|
||||
|
||||
|
||||
def test_integration():
|
||||
"""Test integration with vision analyzer (if API key available)."""
|
||||
print("Testing vision integration...")
|
||||
|
||||
# Check if API key is available
|
||||
api_key = os.environ.get("VOLCENGINE_API_KEY") or os.environ.get(
|
||||
"DASHSCOPE_API_KEY"
|
||||
)
|
||||
if not api_key:
|
||||
print("⚠️ No API key found. Skipping integration test.")
|
||||
print(
|
||||
" Set VOLCENGINE_API_KEY or DASHSCOPE_API_KEY environment variable to test."
|
||||
)
|
||||
return False
|
||||
|
||||
try:
|
||||
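        try:
            from .integrate_vision import process_user_input
        except ImportError:
            # Direct script execution has no parent package, so use a plain import
            from integrate_vision import process_user_input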
|
||||
|
||||
# Test with a simple request (won't actually process image without file)
|
||||
result = process_user_input(
|
||||
"测试视觉处理",
|
||||
"这是一个测试",
|
||||
config={
|
||||
"api_key": api_key,
|
||||
"base_url": "https://ark.cn-beijing.volces.com/api/coding/v3",
|
||||
"model": "doubao-seed-code",
|
||||
},
|
||||
)
|
||||
|
||||
if result["status"] == "success":
|
||||
print("✅ Integration test passed (configuration valid)")
|
||||
return True
|
||||
else:
|
||||
print(f"❌ Integration test failed: {result.get('error', 'Unknown error')}")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Integration test failed: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def main():
|
||||
"""Run all tests."""
|
||||
print("🧪 Testing Agent Vision Awareness Skill")
|
||||
print("=" * 50)
|
||||
|
||||
success = True
|
||||
|
||||
# Test detection
|
||||
success &= test_detection()
|
||||
|
||||
# Test integration (if possible)
|
||||
success &= test_integration()
|
||||
|
||||
print("=" * 50)
|
||||
if success:
|
||||
print("✅ All tests passed!")
|
||||
else:
|
||||
print("⚠️ Some tests failed or were skipped.")
|
||||
|
||||
return success
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
@@ -0,0 +1,336 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Vision Content Detector - Detects visual content in user input for agent-vision-awareness skill
|
||||
|
||||
This script implements the detection logic described in the skill documentation,
|
||||
but integrates with the actual working vision processing implementation
|
||||
using direct API calls rather than custom agent delegation.
|
||||
"""
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Tuple, Optional
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class DetectionConfidence(Enum):
|
||||
HIGH = "high"
|
||||
MEDIUM = "medium"
|
||||
LOW = "low"
|
||||
NONE = "none"
|
||||
|
||||
|
||||
class VisionContentDetector:
|
||||
"""Detects visual content in user input based on various patterns."""
|
||||
|
||||
# Image file extensions (case-insensitive)
|
||||
IMAGE_EXTENSIONS = [
|
||||
".png",
|
||||
".jpg",
|
||||
".jpeg",
|
||||
".gif",
|
||||
".bmp",
|
||||
".webp",
|
||||
".svg",
|
||||
".ico",
|
||||
".tiff",
|
||||
".tif",
|
||||
".heic",
|
||||
".heif",
|
||||
".raw",
|
||||
".psd",
|
||||
".ai",
|
||||
".eps",
|
||||
]
|
||||
|
||||
# Document files with potential visual content
|
||||
DOCUMENT_EXTENSIONS = [".pdf", ".ppt", ".pptx", ".vsdx", ".drawio"]
|
||||
|
||||
# Chinese visual keywords
|
||||
CHINESE_KEYWORDS = {
|
||||
"high": [
|
||||
"图片",
|
||||
"图像",
|
||||
"照片",
|
||||
"截图",
|
||||
"图表",
|
||||
"图示",
|
||||
"图形",
|
||||
"影像",
|
||||
"画面",
|
||||
],
|
||||
"medium": [
|
||||
"流程图",
|
||||
"架构图",
|
||||
"时序图",
|
||||
"ER 图",
|
||||
"思维导图",
|
||||
"柱状图",
|
||||
"饼图",
|
||||
"折线图",
|
||||
"设计图",
|
||||
"原型图",
|
||||
"线框图",
|
||||
"界面",
|
||||
"UI",
|
||||
"UX",
|
||||
"表格",
|
||||
"表单",
|
||||
"清单",
|
||||
"列表",
|
||||
],
|
||||
"low": ["显示", "展示", "呈现", "可视化", "看图", "读图"],
|
||||
}
|
||||
|
||||
# English visual keywords
|
||||
ENGLISH_KEYWORDS = {
|
||||
"high": [
|
||||
"image",
|
||||
"photo",
|
||||
"picture",
|
||||
"screenshot",
|
||||
"snapshot",
|
||||
"capture",
|
||||
"diagram",
|
||||
"chart",
|
||||
"graph",
|
||||
"plot",
|
||||
"figure",
|
||||
],
|
||||
"medium": [
|
||||
"flowchart",
|
||||
"architecture",
|
||||
"sequence diagram",
|
||||
"ER diagram",
|
||||
"mind map",
|
||||
"bar chart",
|
||||
"pie chart",
|
||||
"line graph",
|
||||
"design",
|
||||
"mockup",
|
||||
"wireframe",
|
||||
"interface",
|
||||
"UI",
|
||||
"UX",
|
||||
"layout",
|
||||
"table",
|
||||
"form",
|
||||
"list",
|
||||
"grid",
|
||||
],
|
||||
"low": ["show", "display", "visualize", "view", "look at", "see"],
|
||||
}
|
||||
|
||||
# Technical visual keywords
|
||||
TECHNICAL_KEYWORDS = [
|
||||
"schema",
|
||||
"model",
|
||||
"blueprint",
|
||||
"spec",
|
||||
"technical drawing",
|
||||
"dashboard",
|
||||
"widget",
|
||||
"panel",
|
||||
"visualization",
|
||||
"map",
|
||||
"heatmap",
|
||||
"scatter plot",
|
||||
"histogram",
|
||||
"infographic",
|
||||
"poster",
|
||||
"banner",
|
||||
"thumbnail",
|
||||
]
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the detector with compiled regex patterns."""
|
||||
self._compile_patterns()
|
||||
|
||||
def _compile_patterns(self):
|
||||
"""Compile regex patterns for performance."""
|
||||
# File extension pattern
|
||||
ext_pattern = "|".join(re.escape(ext) for ext in self.IMAGE_EXTENSIONS)
|
||||
self.file_ext_pattern = re.compile(
|
||||
            # Each extension already starts with an escaped dot, so no extra "\." is needed
            rf"[\w\-\.\/]+?(?:{ext_pattern})", re.IGNORECASE
|
||||
)
|
||||
|
||||
# Markdown image syntax
|
||||
self.markdown_img_pattern = re.compile(r"!\[([^\]]*)\]\(([^\)]+)\)")
|
||||
|
||||
# Base64 image data
|
||||
self.base64_img_pattern = re.compile(
|
||||
r"data:image\/(png|jpeg|gif|webp);base64,[A-Za-z0-9+/=]+"
|
||||
)
|
||||
|
||||
# Keyword + file reference
|
||||
keyword_pattern = "|".join(
|
||||
[
|
||||
re.escape(k)
|
||||
for k in self.CHINESE_KEYWORDS["high"] + self.ENGLISH_KEYWORDS["high"]
|
||||
]
|
||||
)
|
||||
ext_pattern_short = "|".join(
|
||||
re.escape(ext) for ext in self.IMAGE_EXTENSIONS[:7]
|
||||
) # Common ones
|
||||
self.keyword_file_pattern = re.compile(
|
||||
            # Each extension already starts with an escaped dot, so no extra "\." is needed
            rf"({keyword_pattern}).*?[\w\-\.\/]+(?:{ext_pattern_short})",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
def detect_visual_content(
|
||||
self, user_input: str
|
||||
) -> Tuple[DetectionConfidence, List[str]]:
|
||||
"""
|
||||
Detect visual content in user input and return confidence level and detected items.
|
||||
|
||||
Args:
|
||||
user_input: The user's input text
|
||||
|
||||
Returns:
|
||||
Tuple of (confidence_level, detected_items)
|
||||
"""
|
||||
detected_items = []
|
||||
confidence_scores = []
|
||||
|
||||
# Check 1: File extensions
|
||||
file_matches = self.file_ext_pattern.findall(user_input)
|
||||
if file_matches:
|
||||
detected_items.extend(file_matches)
|
||||
confidence_scores.append(0.9) # High confidence
|
||||
|
||||
# Check 2: Markdown image syntax
|
||||
markdown_matches = self.markdown_img_pattern.findall(user_input)
|
||||
if markdown_matches:
|
||||
detected_items.extend([f"{alt}:{url}" for alt, url in markdown_matches])
|
||||
confidence_scores.append(0.9) # High confidence
|
||||
|
||||
# Check 3: Base64 image data
|
||||
base64_matches = self.base64_img_pattern.findall(user_input)
|
||||
if base64_matches:
|
||||
detected_items.extend([f"base64:{fmt}" for fmt in base64_matches])
|
||||
confidence_scores.append(0.9) # High confidence
|
||||
|
||||
# Check 4: Visual keywords
|
||||
keyword_confidence = self._check_keywords(user_input)
|
||||
if keyword_confidence > 0:
|
||||
confidence_scores.append(keyword_confidence)
|
||||
|
||||
# Check 5: URL images
|
||||
url_images = self._detect_url_images(user_input)
|
||||
if url_images:
|
||||
detected_items.extend(url_images)
|
||||
confidence_scores.append(0.8) # Medium-high confidence
|
||||
|
||||
# Determine overall confidence
|
||||
if not confidence_scores:
|
||||
return DetectionConfidence.NONE, []
|
||||
|
||||
max_confidence = max(confidence_scores)
|
||||
if max_confidence >= 0.9:
|
||||
return DetectionConfidence.HIGH, detected_items
|
||||
elif max_confidence >= 0.6:
|
||||
return DetectionConfidence.MEDIUM, detected_items
|
||||
else:
|
||||
return DetectionConfidence.LOW, detected_items
|
||||
|
||||
def _check_keywords(self, user_input: str) -> float:
|
||||
"""Check for visual keywords and return confidence score."""
|
||||
input_lower = user_input.lower()
|
||||
|
||||
# Check high priority keywords
|
||||
for keyword in self.CHINESE_KEYWORDS["high"] + self.ENGLISH_KEYWORDS["high"]:
|
||||
if keyword in input_lower:
|
||||
return 0.8
|
||||
|
||||
# Check medium priority keywords
|
||||
for keyword in (
|
||||
self.CHINESE_KEYWORDS["medium"] + self.ENGLISH_KEYWORDS["medium"]
|
||||
):
|
||||
if keyword in input_lower:
|
||||
return 0.6
|
||||
|
||||
# Check technical keywords
|
||||
for keyword in self.TECHNICAL_KEYWORDS:
|
||||
if keyword.lower() in input_lower:
|
||||
return 0.6
|
||||
|
||||
# Check low priority keywords
|
||||
for keyword in self.CHINESE_KEYWORDS["low"] + self.ENGLISH_KEYWORDS["low"]:
|
||||
if keyword in input_lower:
|
||||
return 0.4
|
||||
|
||||
return 0.0
|
||||
|
||||
def _detect_url_images(self, user_input: str) -> List[str]:
|
||||
"""Detect image URLs in the input."""
|
||||
url_pattern = re.compile(
|
||||
r"https?://[^\s]+?\.(?:png|jpg|jpeg|gif|bmp|webp)", re.IGNORECASE
|
||||
)
|
||||
return url_pattern.findall(user_input)
|
||||
|
||||
def extract_image_paths(self, user_input: str) -> List[str]:
|
||||
"""
|
||||
Extract actual image paths/URLs from user input.
|
||||
|
||||
Returns:
|
||||
List of image paths or URLs
|
||||
"""
|
||||
image_paths = []
|
||||
|
||||
# File paths with extensions
|
||||
file_matches = self.file_ext_pattern.findall(user_input)
|
||||
image_paths.extend(file_matches)
|
||||
|
||||
# Markdown image URLs
|
||||
markdown_matches = self.markdown_img_pattern.findall(user_input)
|
||||
image_paths.extend([url for alt, url in markdown_matches])
|
||||
|
||||
# Direct URLs
|
||||
url_images = self._detect_url_images(user_input)
|
||||
image_paths.extend(url_images)
|
||||
|
||||
# Remove duplicates while preserving order
|
||||
seen = set()
|
||||
unique_paths = []
|
||||
for path in image_paths:
|
||||
if path not in seen:
|
||||
unique_paths.append(path)
|
||||
seen.add(path)
|
||||
|
||||
return unique_paths
|
||||
|
||||
|
||||
def main():
|
||||
"""Command line interface for testing."""
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
parser = argparse.ArgumentParser(description="Detect visual content in user input")
|
||||
parser.add_argument("input", help="User input to analyze")
|
||||
parser.add_argument(
|
||||
"--extract-paths",
|
||||
action="store_true",
|
||||
help="Extract and return image paths only",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
detector = VisionContentDetector()
|
||||
|
||||
if args.extract_paths:
|
||||
paths = detector.extract_image_paths(args.input)
|
||||
for path in paths:
|
||||
print(path)
|
||||
else:
|
||||
confidence, items = detector.detect_visual_content(args.input)
|
||||
print(f"Confidence: {confidence.value}")
|
||||
if items:
|
||||
print("Detected items:")
|
||||
for item in items:
|
||||
print(f" - {item}")
|
||||
else:
|
||||
print("No visual content detected")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,82 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
import sys
|
||||
sys.stdout.reconfigure(errors='replace')
|
||||
sys.stderr.reconfigure(errors='replace')
|
||||
import os
|
||||
os.environ['PYTHONIOENCODING'] = 'utf-8'
|
||||
import base64
|
||||
import time
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from openai import OpenAI
|
||||
|
||||
# Shared temporary directory
|
||||
TEMP_DIR = r'D:\F\NewI\opencode\daily-workspace\temp'
|
||||
os.makedirs(TEMP_DIR, exist_ok=True)
|
||||
|
||||
# Read the 火山方舟 (VolcEngine Ark) API key from the OpenCode config
CONFIG_PATH = os.path.join(os.path.expanduser('~'), '.config', 'opencode', 'config.json')
|
||||
import json
|
||||
with open(CONFIG_PATH, 'r', encoding='utf-8') as f:
|
||||
config = json.load(f)
|
||||
API_KEY = config['provider']['volcengine']['options']['apiKey']
|
||||
BASE_URL = config['provider']['volcengine']['options']['baseURL']
|
||||
|
||||
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
|
||||
MODEL = 'doubao-seed-2.0-pro'
|
||||
|
||||
def analyze_image(image_path_or_url, prompt="详细描述这张图片的内容"):
|
||||
"""
|
||||
    Analyze image content. Accepts a local path or an http/https URL.
    :param image_path_or_url: image path or URL
    :param prompt: analysis prompt
    :return: analysis result text
|
||||
"""
|
||||
try:
|
||||
        # URL input: pass it through unchanged
|
||||
if image_path_or_url.lower().startswith(('http://', 'https://')):
|
||||
image_url = image_path_or_url
|
||||
else:
|
||||
            # Local file path
|
||||
image_path = Path(image_path_or_url)
|
||||
if not image_path.exists():
|
||||
return f"错误:图片不存在 {image_path}"
|
||||
            # Encode the file as base64
|
||||
with open(image_path, 'rb') as f:
|
||||
image_base64 = base64.b64encode(f.read()).decode('utf-8')
|
||||
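            # "jpg" is not a registered MIME subtype; map it to "jpeg"
            subtype = image_path.suffix.lstrip('.').lower()
            image_url = f"data:image/{'jpeg' if subtype == 'jpg' else subtype};base64,{image_base64}"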
|
||||
|
||||
        # Call the API
|
||||
response = client.chat.completions.create(
|
||||
model=MODEL,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": prompt},
|
||||
{"type": "image_url", "image_url": {"url": image_url}}
|
||||
]
|
||||
}
|
||||
],
|
||||
max_tokens=1000
|
||||
)
|
||||
return response.choices[0].message.content
|
||||
|
||||
except Exception as e:
|
||||
return f"图片识别失败:{type(e).__name__}: {str(e)}"
|
||||
|
||||
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python vision_direct.py <image path/URL> [prompt]")
        sys.exit(1)

    image_path = sys.argv[1]
    prompt = sys.argv[2] if len(sys.argv) > 2 else "详细描述这张图片的内容"

    result = analyze_image(image_path, prompt)
    print(result)

    # Save the result to a file in the temporary directory
    output_file = os.path.join(TEMP_DIR, f"vision_result_{int(time.time())}.txt")
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(result)
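The local-path branch of `analyze_image` derives the MIME type of the `data:` URL from the file extension. A minimal sketch of the same step using the standard-library `mimetypes` module, which maps `.jpg` to `image/jpeg` without a hand-written special case. The helper name `to_data_url` is hypothetical and not part of the committed script.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: Path) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # guess_type looks only at the file name, not at the file contents
    mime_type, _ = mimetypes.guess_type(image_path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {image_path.name}")
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Raising on an unrecognized extension is one possible design choice; the script above instead returns an error string, so a caller adopting this helper would need to catch `ValueError`.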
@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
Vision Processor - Integrates with image-service vision analyzer for agent-vision-awareness skill

This script provides the actual implementation that replaces the problematic
custom agent delegation approach described in the outdated documentation.
"""

import sys
import os
from pathlib import Path
from typing import Dict, Any, Optional, List
from enum import Enum

# Add the image-service scripts to path to reuse the vision analyzer
sys.path.append(str(Path(__file__).parent.parent.parent / "image-service" / "scripts"))

try:
    from vision_analyzer import VisionAnalyzer
except ImportError:
    # Fallback: try to import from current directory if vision_analyzer is copied here
    try:
        from .vision_analyzer import VisionAnalyzer
    except ImportError:
        raise ImportError(
            "Cannot find VisionAnalyzer. Please ensure image-service is properly installed."
        )


class AnalysisMode(Enum):
    """Available analysis modes."""

    DESCRIBE = "describe"
    OCR = "ocr"
    CHART = "chart"
    FASHION = "fashion"
    PRODUCT = "product"
    SCENE = "scene"
    CUSTOM = "custom"


class VisionProcessor:
    """Main vision processing class that integrates detection and analysis."""

    def __init__(self, config: Optional[Dict[str, str]] = None):
        """
        Initialize the vision processor.

        Args:
            config: Configuration dictionary for the VisionAnalyzer
        """
        self.analyzer = VisionAnalyzer(config)
        self.detector = None  # Will be created when needed

    def process_visual_content(
        self, user_input: str, user_request: str = ""
    ) -> Dict[str, Any]:
        """
        Process visual content in user input.

        Args:
            user_input: The user's input text that may contain visual content references
            user_request: The original user request/context

        Returns:
            Dictionary containing analysis results and metadata
        """
        # A relative import only works when this file is loaded as part of a package;
        # fall back to an absolute import so the command-line entry point also works.
        try:
            from .vision_detector import VisionContentDetector, DetectionConfidence
        except ImportError:
            from vision_detector import VisionContentDetector, DetectionConfidence

        # Initialize detector if not already done
        if self.detector is None:
            self.detector = VisionContentDetector()

        # Detect visual content
        confidence, detected_items = self.detector.detect_visual_content(user_input)
        result = {
            "confidence": confidence.value,
            "detected_items": detected_items,
            "analysis_results": [],
            "errors": [],
        }

        if confidence == DetectionConfidence.NONE:
            result["message"] = "No visual content detected"
            return result

        # Extract image paths
        image_paths = self.detector.extract_image_paths(user_input)
        if not image_paths:
            result["message"] = "Visual content detected but no valid image paths found"
            result["errors"].append("No valid image paths found")
            return result

        # Determine analysis mode based on user request
        analysis_mode = self._determine_analysis_mode(user_request, user_input)

        # Process each image
        for image_path in image_paths:
            try:
                # URL handling is not implemented: a full implementation would
                # download the URL to a temp file first. For now only local
                # paths are analyzed, and URLs are reported as errors and skipped.
                if image_path.startswith(("http://", "https://")):
                    result["errors"].append(
                        f"URL handling not implemented: {image_path}"
                    )
                    continue

                # Ensure path is absolute
                if not os.path.isabs(image_path):
                    # Resolve relative to the current working directory
                    image_path = os.path.join(os.getcwd(), image_path)

                # Analyze the image
                if analysis_mode == AnalysisMode.CUSTOM:
                    # Use the user request as the custom question
                    analysis_result = self.analyzer.analyze_with_mode(
                        Path(image_path),
                        "custom",
                        user_request or "Please analyze this image.",
                    )
                else:
                    analysis_result = self.analyzer.analyze_with_mode(
                        Path(image_path), analysis_mode.value
                    )

                result["analysis_results"].append(
                    {
                        "image_path": image_path,
                        "analysis_mode": analysis_mode.value,
                        "result": analysis_result,
                    }
                )

            except Exception as e:
                error_msg = f"Failed to analyze {image_path}: {str(e)}"
                result["errors"].append(error_msg)
                print(f"Error: {error_msg}", file=sys.stderr)

        return result

    def _determine_analysis_mode(
        self, user_request: str, user_input: str
    ) -> AnalysisMode:
        """
        Determine the appropriate analysis mode based on user context.

        Args:
            user_request: The user's original request
            user_input: The full input containing visual content references

        Returns:
            AnalysisMode enum value
        """
        combined_text = (user_request + " " + user_input).lower()

        # Check for specific keywords to determine mode; the first match wins
        if any(
            word in combined_text for word in ["text", "文字", "ocr", "read", "识别"]
        ):
            return AnalysisMode.OCR
        elif any(
            word in combined_text
            for word in ["chart", "graph", "plot", "图表", "数据", "趋势"]
        ):
            return AnalysisMode.CHART
        elif any(
            word in combined_text
            for word in ["fashion", "服装", "穿搭", "style"]
        ):
            return AnalysisMode.FASHION
        elif any(word in combined_text for word in ["product", "产品", "商品", "item"]):
            return AnalysisMode.PRODUCT
        elif any(
            word in combined_text
            for word in ["scene", "场景", "环境", "location", "place"]
        ):
            return AnalysisMode.SCENE
        else:
            return AnalysisMode.DESCRIBE


def main():
    """Command line interface for testing."""
    import argparse
    import json

    parser = argparse.ArgumentParser(description="Process visual content in user input")
    parser.add_argument("input", help="User input containing visual content references")
    parser.add_argument("--request", "-r", help="Original user request/context")
    parser.add_argument("--output", "-o", help="Output file for results")

    args = parser.parse_args()

    try:
        processor = VisionProcessor()
        result = processor.process_visual_content(args.input, args.request or "")

        output = json.dumps(result, indent=2, ensure_ascii=False)

        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Results saved to: {args.output}")
        else:
            print(output)

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
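`process_visual_content` records a "URL handling not implemented" error for remote images and skips them. One possible way to fill that gap is sketched here, assuming it is acceptable to download the image to a temporary file and pass the local path on to the analyzer. The function `download_to_temp` and its `timeout` and `allowed_schemes` parameters are hypothetical names, not part of the skill.

```python
import os
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_to_temp(url, timeout=30.0, allowed_schemes=("http", "https")):
    """Download `url` to a temporary file and return its path (hypothetical helper)."""
    parsed = urlparse(url)
    # urlopen also accepts other schemes such as file:, so restrict them explicitly
    if parsed.scheme not in allowed_schemes:
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    # Keep the original extension so later steps can still infer the image type
    suffix = Path(parsed.path).suffix or ".img"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    fd, temp_name = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return Path(temp_name)
```

The caller owns the returned file and should delete it once the analysis has finished. The sketch reads the whole response into memory, which is reasonable for screenshots but not for very large files.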