Initial commit to git.yoin

hmo
2026-02-11 22:02:47 +08:00
commit cf10ab6473
153 changed files with 14581 additions and 0 deletions

log-analyzer/README.md Normal file

@@ -0,0 +1,20 @@
# Log Analyzer
An intelligent log analyzer supporting multiple log types.
## Dependencies
None beyond the Python standard library; no extra installation required.
## Features
- Automatic log type detection: Java App / MySQL Binlog / Nginx / Trace / Alert
- Extraction of 20+ entity types: IP, thread_id, user_id, table names, and more
- Sensitive-operation detection and anomaly insights
- Streaming processing of large files (100 MB+)
## Usage
```bash
python scripts/preprocess.py <log_file> -o ./log_analysis
```

log-analyzer/SKILL.md Normal file

@@ -0,0 +1,109 @@
---
name: log-analyzer
description: Full-spectrum log analysis skill. Automatically detects the log type (Java app / MySQL binlog / Nginx / trace / alert), extracts key entities (IP, thread_id, trace_id, user, table name, etc.), and performs root-cause localization, alert analysis, and anomaly insight. Handles 100 MB+ files. Trigger phrases: analyze logs, log troubleshooting, root-cause localization, alert analysis, anomaly analysis.
---
# Log Analyzer
A full-spectrum intelligent log analysis skill based on RAPHL (Recursive Analysis Pattern for Hierarchical Logs). Streaming processing keeps memory usage low; 100 MB+ logs are analyzed in seconds.
## Core Capabilities
| Capability | Description |
|------|------|
| Auto detection | Automatically detects the log type: Java App / MySQL Binlog / Nginx / Trace / Alert |
| Entity extraction | 20+ entity types: IP, thread_id, trace_id, user_id, session_id, bucket, URL, table name, etc. |
| Operation analysis | Detects sensitive operations such as DELETE/UPDATE/INSERT/DROP |
| Correlation analysis | Builds timelines, causal chains, and operation chains |
| Smart insights | Automatically generates conclusions, evidence, and recommendations |
## Supported Log Types
| Type | Detection Signature | Extracted Content |
|------|----------|----------|
| **Java App** | ERROR/WARN + stack traces | exception type, stack trace, logger, timestamp |
| **MySQL Binlog** | server id, GTID, Table_map | table operations, thread_id, server_id, data changes |
| **Nginx Access** | IP + HTTP method + status code | client IP, URL, status code, latency |
| **Trace** | trace_id, span_id | call chains, call relationships, latency |
| **Alert** | CRITICAL / alert keywords | alert level, source, message |
| **General** | fallback | timestamps, IPs, keywords |
## Usage
```bash
python .opencode/skills/log-analyzer/scripts/preprocess.py <log_file> -o ./log_analysis
```
## Output Files
| File | Content | Purpose |
|------|------|------|
| `summary.md` | Full analysis report | **Read this first** |
| `entities.md` | Entity details (IPs, users, tables, etc.) | Trace operation origins |
| `operations.md` | Operation details | Inspect individual operations |
| `insights.md` | Smart insights | Problem localization and recommendations |
| `analysis.json` | Structured data | Programmatic processing |
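For downstream tooling, `analysis.json` can be consumed directly. A minimal sketch, assuming the default output directory and the field layout the script writes (`log_type`, `total_lines`, `insights`):

```python
import json
from pathlib import Path

# Load the structured analysis written by preprocess.py.
data = json.loads(Path("log_analysis/analysis.json").read_text(encoding="utf-8"))

print(data["log_type"], data["total_lines"])
for insight in data["insights"]:
    # Each insight carries severity, title, description, evidence, recommendation.
    print(f"[{insight['severity'].upper()}] {insight['title']}")
```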
## Entity Extraction Checklist
### Network / Connection
- IP address, IP:Port, URL, MAC address
### Tracing / Session
- trace_id, span_id, request_id, session_id, thread_id
### User / Permission
- user_id, ak (access_key), bucket
### Database
- database.table, server_id
### Performance / Status
- duration, http_status, error_code
## Sensitive-Operation Detection
| Type | Detection Pattern | Risk Level |
|------|----------|----------|
| Data deletion | DELETE, DROP, TRUNCATE | HIGH |
| Data modification | UPDATE, ALTER, MODIFY | MEDIUM |
| Permission change | GRANT, REVOKE, chmod | HIGH |
| Authentication | LOGIN, LOGOUT, AUTH | MEDIUM |
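The detection itself is plain precompiled-regex matching. A minimal sketch of the idea, with patterns abbreviated from the table above and risk levels taken from it (not the script itself):

```python
import re

# (pattern, risk level) pairs, abbreviated from the detection table above.
SENSITIVE_OPS = {
    "data_delete": (re.compile(r"\b(DELETE|DROP|TRUNCATE)\b", re.I), "HIGH"),
    "permission":  (re.compile(r"\b(GRANT|REVOKE|chmod)\b", re.I), "HIGH"),
    "data_modify": (re.compile(r"\b(UPDATE|ALTER|MODIFY)\b", re.I), "MEDIUM"),
}

def classify(line: str) -> list[tuple[str, str]]:
    """Return the (op_type, risk) pairs that match a log line."""
    return [(op, risk) for op, (rx, risk) in SENSITIVE_OPS.items() if rx.search(line)]
```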
## Insight Types
| Type | Description |
|------|------|
| security | Bulk deletes/updates, permission changes |
| anomaly | High-frequency IPs, operations at unusual hours |
| error | Critical exceptions, error clustering |
| audit | Operation origins, user behavior |
## Analysis Pipeline
```
Phase 1: Log type detection (samples the first 100 lines)
Phase 2: Full scan and extraction (streaming)
Phase 3: Correlation analysis (time ordering, aggregation)
Phase 4: Smart insights (anomaly detection, conclusions)
Phase 5: Report generation (Markdown + JSON)
```
## Technical Highlights
| Highlight | Description |
|------|------|
| Streaming | Line-by-line reading; a 100 MB file needs only a few MB of memory |
| Precompiled regexes | 20+ entity patterns compiled once for fast matching |
| Single pass | Extraction, statistics, and classification in one traversal |
| Type-specific parsing | A dedicated parser per log type |
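The first two highlights combine into one simple pattern: compile each regex once at module level, then iterate the file object so only one line is in memory at a time. A minimal sketch (the `count_ips` helper is hypothetical, not part of the script):

```python
import re
from collections import Counter

IP_RE = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b")  # compiled once

def count_ips(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:  # streams line by line; memory stays flat
            counts.update(IP_RE.findall(line))
    return counts
```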
## Caveats
1. **Binlog does not record client IPs**: only server_id and thread_id are available; combine with the general_log to identify the operator
2. **Redact sensitive data**: take care not to expose passwords or keys in reports
3. **Combine multiple log sources**: binlog + application logs + audit logs are needed for a full reconstruction

log-analyzer/scripts/preprocess.py Normal file

@@ -0,0 +1,849 @@
#!/usr/bin/env python3
"""
RAPHL 日志分析器 - 全维度智能分析
Author: 翟星人
Created: 2026-01-18
支持多种日志类型的自动识别和深度分析:
- Java/应用日志异常堆栈、ERROR/WARN
- MySQL BinlogDDL/DML 操作、表变更、事务分析
- 审计日志:用户操作、权限变更
- 告警日志:告警级别、告警源
- Trace 日志:链路追踪、调用关系
- 通用日志IP、时间、关键词提取
核心能力:
1. 自动识别日志类型
2. 提取关键实体IP、用户、表名、thread_id 等)
3. 时间线分析
4. 关联分析(因果链、操作链)
5. 智能洞察和异常检测
"""
import argparse
import re
import json
from pathlib import Path
from collections import defaultdict, Counter
from dataclasses import dataclass, field
from enum import Enum
class LogType(Enum):
JAVA_APP = "java_app"
MYSQL_BINLOG = "mysql_binlog"
NGINX_ACCESS = "nginx_access"
AUDIT = "audit"
TRACE = "trace"
ALERT = "alert"
GENERAL = "general"
@dataclass
class Entity:
"""提取的实体"""
type: str # ip, user, table, thread_id, trace_id, bucket, etc.
value: str
line_num: int
context: str = ""
@dataclass
class Operation:
"""操作记录"""
line_num: int
time: str
op_type: str # DELETE, UPDATE, INSERT, DROP, SELECT, API_CALL, etc.
target: str # 表名、接口名等
detail: str
entities: list = field(default_factory=list) # 关联的实体
raw_content: str = ""
@dataclass
class Alert:
"""告警记录"""
line_num: int
time: str
level: str # CRITICAL, WARNING, INFO
source: str
message: str
entities: list = field(default_factory=list)
@dataclass
class Trace:
"""链路追踪"""
trace_id: str
span_id: str
parent_id: str
service: str
operation: str
duration: float
status: str
line_num: int
@dataclass
class Insight:
"""分析洞察"""
category: str # security, performance, error, anomaly
severity: str # critical, high, medium, low
title: str
description: str
evidence: list = field(default_factory=list)
recommendation: str = ""
class SmartLogAnalyzer:
"""智能日志分析器 - 全维度感知"""
# ============ 实体提取模式 ============
ENTITY_PATTERNS = {
'ip': re.compile(r'\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b'),
'ip_port': re.compile(r'\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+)\b'),
        'mac': re.compile(r'\b((?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2})\b'),
'email': re.compile(r'\b[\w.-]+@[\w.-]+\.\w+\b'),
'url': re.compile(r'https?://[^\s<>"\']+'),
'uuid': re.compile(r'\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b', re.I),
'trace_id': re.compile(r'\b(?:trace[_-]?id|traceid|x-trace-id)[=:\s]*([a-zA-Z0-9_-]{16,64})\b', re.I),
'span_id': re.compile(r'\b(?:span[_-]?id|spanid)[=:\s]*([a-zA-Z0-9_-]{8,32})\b', re.I),
'request_id': re.compile(r'\b(?:request[_-]?id|req[_-]?id)[=:\s]*([a-zA-Z0-9_-]{8,64})\b', re.I),
'user_id': re.compile(r'\b(?:user[_-]?id|uid|userid)[=:\s]*([a-zA-Z0-9_-]+)\b', re.I),
'thread_id': re.compile(r'\bthread[_-]?id[=:\s]*(\d+)\b', re.I),
'session_id': re.compile(r'\b(?:session[_-]?id|sid)[=:\s]*([a-zA-Z0-9_-]+)\b', re.I),
'ak': re.compile(r'\b(?:ak|access[_-]?key)[=:\s]*([a-zA-Z0-9]{16,64})\b', re.I),
'bucket': re.compile(r'\bbucket[=:\s]*([a-zA-Z0-9_.-]+)\b', re.I),
'database': re.compile(r'`([a-zA-Z_][a-zA-Z0-9_]*)`\.`([a-zA-Z_][a-zA-Z0-9_]*)`'),
'duration_ms': re.compile(r'\b(?:duration|cost|elapsed|time)[=:\s]*(\d+(?:\.\d+)?)\s*(?:ms|毫秒)\b', re.I),
'duration_s': re.compile(r'\b(?:duration|cost|elapsed|time)[=:\s]*(\d+(?:\.\d+)?)\s*(?:s|秒)\b', re.I),
'error_code': re.compile(r'\b(?:error[_-]?code|errno|code)[=:\s]*([A-Z0-9_-]+)\b', re.I),
'http_status': re.compile(r'\b(?:status|http[_-]?code)[=:\s]*([1-5]\d{2})\b', re.I),
}
    # ============ Timestamp formats ============
TIME_PATTERNS = [
(re.compile(r'(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d{3})?)'), '%Y-%m-%d %H:%M:%S'),
(re.compile(r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})'), '%d/%b/%Y:%H:%M:%S'),
(re.compile(r'#(\d{6} \d{2}:\d{2}:\d{2})'), '%y%m%d %H:%M:%S'), # MySQL binlog
(re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})'), '%d/%b/%Y:%H:%M:%S'), # Nginx
]
    # ============ Log type signatures ============
LOG_TYPE_SIGNATURES = {
LogType.MYSQL_BINLOG: [
re.compile(r'server id \d+.*end_log_pos'),
re.compile(r'GTID.*last_committed'),
re.compile(r'Table_map:.*mapped to number'),
re.compile(r'(Delete_rows|Update_rows|Write_rows|Query).*table id'),
],
LogType.JAVA_APP: [
re.compile(r'(ERROR|WARN|INFO|DEBUG)\s+[\w.]+\s+-'),
re.compile(r'^\s+at\s+[\w.$]+\([\w.]+:\d+\)'),
re.compile(r'Exception|Error|Throwable'),
],
LogType.NGINX_ACCESS: [
re.compile(r'\d+\.\d+\.\d+\.\d+\s+-\s+-\s+\['),
re.compile(r'"(GET|POST|PUT|DELETE|HEAD|OPTIONS)\s+'),
],
LogType.TRACE: [
re.compile(r'trace[_-]?id', re.I),
re.compile(r'span[_-]?id', re.I),
re.compile(r'parent[_-]?id', re.I),
],
LogType.ALERT: [
re.compile(r'(CRITICAL|ALERT|EMERGENCY)', re.I),
re.compile(r'告警|报警|alarm', re.I),
],
}
    # ============ MySQL binlog patterns ============
BINLOG_PATTERNS = {
'gtid': re.compile(r"GTID_NEXT=\s*'([^']+)'"),
'thread_id': re.compile(r'thread_id=(\d+)'),
'server_id': re.compile(r'server id (\d+)'),
'table_map': re.compile(r'Table_map:\s*`(\w+)`\.`(\w+)`\s*mapped to number (\d+)'),
'delete_rows': re.compile(r'Delete_rows:\s*table id (\d+)'),
'update_rows': re.compile(r'Update_rows:\s*table id (\d+)'),
'write_rows': re.compile(r'Write_rows:\s*table id (\d+)'),
'query': re.compile(r'Query\s+thread_id=(\d+)'),
'xid': re.compile(r'Xid\s*=\s*(\d+)'),
'delete_from': re.compile(r'###\s*DELETE FROM\s*`(\w+)`\.`(\w+)`'),
'update': re.compile(r'###\s*UPDATE\s*`(\w+)`\.`(\w+)`'),
'insert': re.compile(r'###\s*INSERT INTO\s*`(\w+)`\.`(\w+)`'),
'time': re.compile(r'#(\d{6} \d{2}:\d{2}:\d{2})'),
}
    # ============ Alert levels ============
ALERT_PATTERNS = {
'CRITICAL': re.compile(r'\b(CRITICAL|FATAL|EMERGENCY|P0|严重|致命)\b', re.I),
'HIGH': re.compile(r'\b(ERROR|ALERT|P1|高|错误)\b', re.I),
'MEDIUM': re.compile(r'\b(WARN|WARNING|P2|中|警告)\b', re.I),
'LOW': re.compile(r'\b(INFO|NOTICE|P3|低|提示)\b', re.I),
}
    # ============ Sensitive operations ============
SENSITIVE_OPS = {
'data_delete': re.compile(r'\b(DELETE|DROP|TRUNCATE|REMOVE)\b', re.I),
'data_modify': re.compile(r'\b(UPDATE|ALTER|MODIFY|REPLACE)\b', re.I),
'permission': re.compile(r'\b(GRANT|REVOKE|chmod|chown|赋权|权限)\b', re.I),
'auth': re.compile(r'\b(LOGIN|LOGOUT|AUTH|认证|登录|登出)\b', re.I),
'config_change': re.compile(r'\b(SET|CONFIG|配置变更)\b', re.I),
}
def __init__(self, input_path: str, output_dir: str):
self.input_path = Path(input_path)
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
        # Analysis results
self.log_type: LogType = LogType.GENERAL
self.total_lines = 0
self.file_size_mb = 0
self.time_range = {'start': '', 'end': ''}
        # Extracted data
self.entities: dict[str, list[Entity]] = defaultdict(list)
self.operations: list[Operation] = []
self.alerts: list[Alert] = []
self.traces: list[Trace] = []
self.insights: list[Insight] = []
        # Statistics
        self.stats = defaultdict(Counter)
        # Binlog-specific state
self.table_map: dict[str, tuple[str, str]] = {} # table_id -> (db, table)
self.current_thread_id = ""
self.current_server_id = ""
self.current_time = ""
def run(self) -> dict:
print(f"\n{'='*60}")
print(f"RAPHL 智能日志分析器")
print(f"{'='*60}")
self.file_size_mb = self.input_path.stat().st_size / (1024 * 1024)
print(f"文件: {self.input_path.name}")
print(f"大小: {self.file_size_mb:.2f} MB")
# Phase 1: 识别日志类型
print(f"\n{''*40}")
print("Phase 1: 日志类型识别")
print(f"{''*40}")
self._detect_log_type()
print(f" ✓ 类型: {self.log_type.value}")
# Phase 2: 全量扫描提取
print(f"\n{''*40}")
print("Phase 2: 全量扫描提取")
print(f"{''*40}")
self._full_scan()
# Phase 3: 关联分析
print(f"\n{''*40}")
print("Phase 3: 关联分析")
print(f"{''*40}")
self._correlate()
# Phase 4: 生成洞察
print(f"\n{''*40}")
print("Phase 4: 智能洞察")
print(f"{''*40}")
self._generate_insights()
# Phase 5: 生成报告
print(f"\n{''*40}")
print("Phase 5: 生成报告")
print(f"{''*40}")
self._generate_reports()
print(f"\n{'='*60}")
print("分析完成")
print(f"{'='*60}")
return self._get_summary()
def _detect_log_type(self):
"""识别日志类型"""
sample_lines = []
with open(self.input_path, 'r', encoding='utf-8', errors='ignore') as f:
for i, line in enumerate(f):
sample_lines.append(line)
if i >= 100:
break
sample_text = '\n'.join(sample_lines)
scores = {t: 0 for t in LogType}
for log_type, patterns in self.LOG_TYPE_SIGNATURES.items():
for pattern in patterns:
matches = pattern.findall(sample_text)
scores[log_type] += len(matches)
best_type = max(scores.keys(), key=lambda x: scores[x])
if scores[best_type] > 0:
self.log_type = best_type
else:
self.log_type = LogType.GENERAL
def _full_scan(self):
"""全量扫描提取"""
if self.log_type == LogType.MYSQL_BINLOG:
self._scan_binlog()
elif self.log_type == LogType.JAVA_APP:
self._scan_java_app()
else:
self._scan_general()
print(f" ✓ 总行数: {self.total_lines:,}")
print(f" ✓ 时间范围: {self.time_range['start']} ~ {self.time_range['end']}")
# 实体统计
for entity_type, entities in self.entities.items():
unique = len(set(e.value for e in entities))
print(f"{entity_type}: {unique} 个唯一值, {len(entities)} 次出现")
if self.operations:
print(f" ✓ 操作记录: {len(self.operations)}")
if self.alerts:
print(f" ✓ 告警记录: {len(self.alerts)}")
def _scan_binlog(self):
"""扫描 MySQL Binlog"""
current_op = None
with open(self.input_path, 'r', encoding='utf-8', errors='ignore') as f:
for line_num, line in enumerate(f, 1):
self.total_lines += 1
                # Extract timestamp
time_match = self.BINLOG_PATTERNS['time'].search(line)
if time_match:
self.current_time = time_match.group(1)
self._update_time_range(self.current_time)
                # Extract server_id
server_match = self.BINLOG_PATTERNS['server_id'].search(line)
if server_match:
self.current_server_id = server_match.group(1)
self._add_entity('server_id', self.current_server_id, line_num, line)
                # Extract thread_id
thread_match = self.BINLOG_PATTERNS['thread_id'].search(line)
if thread_match:
self.current_thread_id = thread_match.group(1)
self._add_entity('thread_id', self.current_thread_id, line_num, line)
                # Extract the table map (table_id -> db.table)
table_match = self.BINLOG_PATTERNS['table_map'].search(line)
if table_match:
db, table, table_id = table_match.groups()
self.table_map[table_id] = (db, table)
self._add_entity('database', f"{db}.{table}", line_num, line)
                # Identify the operation type
for op_name, pattern in [
('DELETE', self.BINLOG_PATTERNS['delete_from']),
('UPDATE', self.BINLOG_PATTERNS['update']),
('INSERT', self.BINLOG_PATTERNS['insert']),
]:
match = pattern.search(line)
if match:
db, table = match.groups()
self.stats['operations'][op_name] += 1
self.stats['tables'][f"{db}.{table}"] += 1
if current_op is None or current_op.target != f"{db}.{table}":
if current_op:
self.operations.append(current_op)
current_op = Operation(
line_num=line_num,
time=self.current_time,
op_type=op_name,
target=f"{db}.{table}",
detail="",
entities=[
Entity('thread_id', self.current_thread_id, line_num),
Entity('server_id', self.current_server_id, line_num),
],
raw_content=line
)
                # Extract inline entities (IP, user, etc.)
self._extract_entities(line, line_num)
if line_num % 50000 == 0:
print(f" 已处理 {line_num:,} 行...")
if current_op:
self.operations.append(current_op)
def _scan_java_app(self):
"""扫描 Java 应用日志"""
current_exception = None
context_buffer = []
error_pattern = re.compile(
r'^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d{3})?)\s+'
r'(FATAL|ERROR|WARN|WARNING|INFO|DEBUG)\s+'
r'([\w.]+)\s+-\s+(.+)$'
)
stack_pattern = re.compile(r'^\s+at\s+')
exception_pattern = re.compile(r'^([a-zA-Z_$][\w.$]*(?:Exception|Error|Throwable)):\s*(.*)$')
with open(self.input_path, 'r', encoding='utf-8', errors='ignore') as f:
for line_num, line in enumerate(f, 1):
self.total_lines += 1
line = line.rstrip()
                # Extract entities
self._extract_entities(line, line_num)
error_match = error_pattern.match(line)
if error_match:
time_str, level, logger, message = error_match.groups()
self._update_time_range(time_str)
if level in ('ERROR', 'FATAL', 'WARN', 'WARNING'):
if current_exception:
self._finalize_exception(current_exception)
current_exception = {
'line_num': line_num,
'time': time_str,
'level': level,
'logger': logger,
'message': message,
'stack': [],
'context': list(context_buffer),
'entities': [],
}
context_buffer.clear()
elif current_exception:
if stack_pattern.match(line) or exception_pattern.match(line):
current_exception['stack'].append(line)
elif line.startswith('Caused by:'):
current_exception['stack'].append(line)
else:
self._finalize_exception(current_exception)
current_exception = None
context_buffer.append(line)
else:
context_buffer.append(line)
if len(context_buffer) > 5:
context_buffer.pop(0)
if line_num % 50000 == 0:
print(f" 已处理 {line_num:,} 行...")
if current_exception:
self._finalize_exception(current_exception)
def _finalize_exception(self, exc: dict):
"""完成异常记录"""
level_map = {'FATAL': 'CRITICAL', 'ERROR': 'HIGH', 'WARN': 'MEDIUM', 'WARNING': 'MEDIUM'}
self.alerts.append(Alert(
line_num=exc['line_num'],
time=exc['time'],
level=level_map.get(exc['level'], 'LOW'),
source=exc['logger'],
message=exc['message'],
entities=exc.get('entities', [])
))
        if exc['stack']:
            # Cluster exceptions by their leading stack line (exception class) when available
            first = exc['stack'][0]
            key = first.split(':')[0] if ':' in first else exc['level']
            self.stats['exceptions'][key] += 1
def _scan_general(self):
"""通用日志扫描"""
with open(self.input_path, 'r', encoding='utf-8', errors='ignore') as f:
for line_num, line in enumerate(f, 1):
self.total_lines += 1
                # Extract timestamp
for pattern, fmt in self.TIME_PATTERNS:
match = pattern.search(line)
if match:
self._update_time_range(match.group(1))
break
                # Extract entities
                self._extract_entities(line, line_num)
                # Detect alert level
for level, pattern in self.ALERT_PATTERNS.items():
if pattern.search(line):
self.stats['alert_levels'][level] += 1
break
                # Detect sensitive operations
for op_type, pattern in self.SENSITIVE_OPS.items():
if pattern.search(line):
self.stats['sensitive_ops'][op_type] += 1
if line_num % 50000 == 0:
print(f" 已处理 {line_num:,} 行...")
    def _extract_entities(self, line: str, line_num: int):
        """Extract inline entities from a single line."""
        for entity_type, pattern in self.ENTITY_PATTERNS.items():
            for match in pattern.finditer(line):
                if match.lastindex and match.lastindex > 1:
                    # Multi-group patterns (e.g. `db`.`table`) join into one value
                    value = '.'.join(match.groups())
                elif match.lastindex:
                    value = match.group(1)
                else:
                    value = match.group(0)
                self._add_entity(entity_type, value, line_num, line[:200])
def _add_entity(self, entity_type: str, value: str, line_num: int, context: str = ""):
"""添加实体"""
# 过滤无效值
if entity_type == 'ip' and value in ('0.0.0.0', '127.0.0.1', '255.255.255.255'):
return
if entity_type == 'duration_ms' and float(value) == 0:
return
self.entities[entity_type].append(Entity(
type=entity_type,
value=value,
line_num=line_num,
context=context
))
self.stats[f'{entity_type}_count'][value] += 1
def _update_time_range(self, time_str: str):
"""更新时间范围"""
if not self.time_range['start'] or time_str < self.time_range['start']:
self.time_range['start'] = time_str
if not self.time_range['end'] or time_str > self.time_range['end']:
self.time_range['end'] = time_str
def _correlate(self):
"""关联分析"""
# 操作按时间排序
if self.operations:
self.operations.sort(key=lambda x: x.time)
print(f" ✓ 操作时间线: {len(self.operations)}")
# 聚合相同操作
if self.log_type == LogType.MYSQL_BINLOG:
op_summary: dict[str, dict] = {}
for op in self.operations:
key = op.op_type
if key not in op_summary:
op_summary[key] = {'count': 0, 'tables': Counter(), 'thread_ids': set()}
op_summary[key]['count'] += 1
op_summary[key]['tables'][op.target] += 1
for e in op.entities:
if e.type == 'thread_id':
op_summary[key]['thread_ids'].add(e.value)
for op_type, data in op_summary.items():
tables_count = len(data['tables'])
thread_count = len(data['thread_ids'])
print(f"{op_type}: {data['count']} 次, 涉及 {tables_count} 个表, {thread_count} 个 thread_id")
        # IP activity analysis
if 'ip' in self.entities:
ip_activity = Counter(e.value for e in self.entities['ip'])
top_ips = ip_activity.most_common(5)
if top_ips:
print(f" ✓ Top IP:")
for ip, count in top_ips:
print(f" {ip}: {count}")
def _generate_insights(self):
"""生成智能洞察"""
# Binlog 洞察
if self.log_type == LogType.MYSQL_BINLOG:
# 大批量删除检测
delete_count = self.stats['operations'].get('DELETE', 0)
if delete_count > 100:
tables = self.stats['tables'].most_common(5)
thread_ids = list(set(e.value for e in self.entities.get('thread_id', [])))
server_ids = list(set(e.value for e in self.entities.get('server_id', [])))
self.insights.append(Insight(
category='security',
severity='high',
                    title='Bulk delete operations detected',
                    description=f'Detected {delete_count} DELETE operations',
                    evidence=[
                        f"Time range: {self.time_range['start']} ~ {self.time_range['end']}",
                        f"Tables: {', '.join(f'{t[0]} ({t[1]}x)' for t in tables)}",
                        f"Server ID: {', '.join(server_ids)}",
                        f"Thread ID: {', '.join(thread_ids[:5])}{'...' if len(thread_ids) > 5 else ''}",
                    ],
                    recommendation='To confirm the source: 1. look up the application connection by thread_id 2. check application logs for the same time window 3. confirm whether this is expected business behavior'
))
            # Operation source analysis
if self.entities.get('server_id'):
unique_servers = set(e.value for e in self.entities['server_id'])
if len(unique_servers) == 1:
server_id = list(unique_servers)[0]
                    self.insights.append(Insight(
                        category='audit',
                        severity='medium',
                        title='Operation source identified',
                        description=f'All operations come from the same database instance, server_id={server_id}',
                        evidence=[
                            f"Server ID: {server_id}",
                            "This identifies the database primary, not a client IP",
                            "Binlog does not record client IPs; check the general_log or audit logs",
                        ],
                        recommendation='To identify the operator IP, check: 1. the MySQL general_log 2. audit plugin logs 3. application connection logs'
                    ))
        # Error insights
if self.alerts:
critical_count = sum(1 for a in self.alerts if a.level == 'CRITICAL')
if critical_count > 0:
                self.insights.append(Insight(
                    category='error',
                    severity='critical',
                    title='Critical exceptions detected',
                    description=f'Detected {critical_count} critical-level exceptions',
                    evidence=[f"L{a.line_num}: {a.message[:100]}" for a in self.alerts if a.level == 'CRITICAL'][:5],
                    recommendation='Check the status of the affected services immediately'
                ))
        # High-frequency IP detection
if 'ip' in self.entities:
ip_counter = Counter(e.value for e in self.entities['ip'])
for ip, count in ip_counter.most_common(3):
if count > 100:
                    self.insights.append(Insight(
                        category='anomaly',
                        severity='medium',
                        title='High-frequency IP activity',
                        description=f'IP {ip} appeared {count} times',
                        evidence=[e.context[:100] for e in self.entities['ip'] if e.value == ip][:3],
                        recommendation="Confirm whether this IP's activity is expected"
                    ))
print(f" ✓ 生成 {len(self.insights)} 条洞察")
for insight in self.insights:
print(f" [{insight.severity.upper()}] {insight.title}")
def _generate_reports(self):
"""生成报告"""
self._write_summary()
self._write_entities()
self._write_operations()
self._write_insights()
self._write_json()
print(f"\n输出文件:")
for f in sorted(self.output_dir.iterdir()):
size = f.stat().st_size
print(f" - {f.name} ({size/1024:.1f} KB)")
def _write_summary(self):
"""写入摘要报告"""
path = self.output_dir / "summary.md"
with open(path, 'w', encoding='utf-8') as f:
f.write(f"# 日志分析报告\n\n")
f.write(f"## 概览\n\n")
f.write(f"| 项目 | 内容 |\n|------|------|\n")
f.write(f"| 文件 | {self.input_path.name} |\n")
f.write(f"| 大小 | {self.file_size_mb:.2f} MB |\n")
f.write(f"| 类型 | {self.log_type.value} |\n")
f.write(f"| 总行数 | {self.total_lines:,} |\n")
f.write(f"| 时间范围 | {self.time_range['start']} ~ {self.time_range['end']} |\n\n")
            # Entity statistics
            if self.entities:
                f.write("## Entity Statistics\n\n")
                f.write("| Type | Unique | Occurrences | Top value |\n|------|--------|-------------|-----------|\n")
                for entity_type, entities in sorted(self.entities.items()):
                    counter = Counter(e.value for e in entities)
                    unique = len(counter)
                    total = len(entities)
                    top = counter.most_common(1)[0] if counter else ('', 0)
                    f.write(f"| {entity_type} | {unique} | {total} | {top[0][:30]} ({top[1]}) |\n")
                f.write("\n")
            # Operation statistics
            if self.stats['operations']:
                f.write("## Operation Statistics\n\n")
                f.write("| Operation | Count |\n|-----------|-------|\n")
                for op, count in self.stats['operations'].most_common():
                    f.write(f"| {op} | {count:,} |\n")
                f.write("\n")
            if self.stats['tables']:
                f.write("## Table Operation Statistics\n\n")
                f.write("| Table | Operations |\n|-------|------------|\n")
                for table, count in self.stats['tables'].most_common(10):
                    f.write(f"| {table} | {count:,} |\n")
                f.write("\n")
            # Insights
            if self.insights:
                f.write("## Insights\n\n")
                for i, insight in enumerate(self.insights, 1):
                    f.write(f"### {i}. [{insight.severity.upper()}] {insight.title}\n\n")
                    f.write(f"{insight.description}\n\n")
                    if insight.evidence:
                        f.write("**Evidence:**\n")
                        for e in insight.evidence:
                            f.write(f"- {e}\n")
                        f.write("\n")
                    if insight.recommendation:
                        f.write(f"**Recommendation:** {insight.recommendation}\n\n")
                    f.write("---\n\n")
def _write_entities(self):
"""写入实体详情"""
path = self.output_dir / "entities.md"
with open(path, 'w', encoding='utf-8') as f:
f.write(f"# 实体详情\n\n")
for entity_type, entities in sorted(self.entities.items()):
counter = Counter(e.value for e in entities)
f.write(f"## {entity_type} ({len(counter)} 个唯一值)\n\n")
f.write(f"| 值 | 出现次数 | 首次行号 |\n|-----|----------|----------|\n")
first_occurrence = {}
for e in entities:
if e.value not in first_occurrence:
first_occurrence[e.value] = e.line_num
for value, count in counter.most_common(50):
f.write(f"| {value[:50]} | {count} | {first_occurrence[value]} |\n")
f.write(f"\n")
def _write_operations(self):
"""写入操作详情"""
if not self.operations:
return
path = self.output_dir / "operations.md"
with open(path, 'w', encoding='utf-8') as f:
f.write(f"# 操作详情\n\n")
f.write(f"{len(self.operations)} 条操作记录\n\n")
# 按表分组
by_table = defaultdict(list)
for op in self.operations:
by_table[op.target].append(op)
for table, ops in sorted(by_table.items(), key=lambda x: len(x[1]), reverse=True):
f.write(f"## {table} ({len(ops)} 次操作)\n\n")
op_types = Counter(op.op_type for op in ops)
f.write(f"操作类型: {dict(op_types)}\n\n")
thread_ids = set()
for op in ops:
for e in op.entities:
if e.type == 'thread_id':
thread_ids.add(e.value)
if thread_ids:
f.write(f"Thread IDs: {', '.join(sorted(thread_ids))}\n\n")
f.write(f"时间范围: {ops[0].time} ~ {ops[-1].time}\n\n")
f.write(f"---\n\n")
def _write_insights(self):
"""写入洞察报告"""
if not self.insights:
return
path = self.output_dir / "insights.md"
with open(path, 'w', encoding='utf-8') as f:
f.write(f"# 分析洞察\n\n")
# 按严重程度分组
by_severity = defaultdict(list)
for insight in self.insights:
by_severity[insight.severity].append(insight)
for severity in ['critical', 'high', 'medium', 'low']:
if severity not in by_severity:
continue
f.write(f"## {severity.upper()} 级别\n\n")
for insight in by_severity[severity]:
f.write(f"### {insight.title}\n\n")
f.write(f"**类别:** {insight.category}\n\n")
f.write(f"**描述:** {insight.description}\n\n")
if insight.evidence:
f.write(f"**证据:**\n")
for e in insight.evidence:
f.write(f"- {e}\n")
f.write(f"\n")
if insight.recommendation:
f.write(f"**建议:** {insight.recommendation}\n\n")
f.write(f"---\n\n")
def _write_json(self):
"""写入 JSON 数据"""
path = self.output_dir / "analysis.json"
data = {
'file': str(self.input_path),
'size_mb': self.file_size_mb,
'log_type': self.log_type.value,
'total_lines': self.total_lines,
'time_range': self.time_range,
'entities': {
k: {
'unique': len(set(e.value for e in v)),
'total': len(v),
'top': Counter(e.value for e in v).most_common(10)
}
for k, v in self.entities.items()
},
'stats': {k: dict(v) for k, v in self.stats.items()},
'insights': [
{
'category': i.category,
'severity': i.severity,
'title': i.title,
'description': i.description,
'evidence': i.evidence,
'recommendation': i.recommendation
}
for i in self.insights
]
}
with open(path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
def _get_summary(self) -> dict:
return {
'log_type': self.log_type.value,
'total_lines': self.total_lines,
'entity_types': len(self.entities),
'operation_count': len(self.operations),
'insight_count': len(self.insights),
'output_dir': str(self.output_dir)
}
def main():
    parser = argparse.ArgumentParser(description='RAPHL smart log analyzer')
    parser.add_argument('input', help='input log file')
    parser.add_argument('-o', '--output', default='./log_analysis', help='output directory')
    args = parser.parse_args()
    analyzer = SmartLogAnalyzer(args.input, args.output)
    result = analyzer.run()
    print(f"\nSee {result['output_dir']}/summary.md")
if __name__ == '__main__':
main()