Initial commit: skills library

- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches
2026-04-26 19:27:40 +08:00
commit 04db423416
861 changed files with 210414 additions and 0 deletions
@@ -0,0 +1,437 @@
+---
+name: file-reader
+description: File read/write/edit skill. Reads, creates, and edits docx/xlsx/pdf/pptx files, with special handling for Chinese encoding problems.
+---
+
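+# File Read/Write/Edit Skill
+
+## Overview
+
+Supports **reading, creating, and editing** four kinds of office documents: Word, Excel, PDF, and PPT.
+Includes workarounds for encoding problems in Chinese documents created with WPS/Office.
+
+## When to Use
+
+- Reading the contents of `.docx` / `.xlsx` / `.pdf` / `.pptx` files
+- Creating new office documents (reports, spreadsheets, presentations, PDFs)
+- Editing existing documents (content, formatting, formulas, and so on)
+- Fixing garbled Chinese text caused by encoding problems
+- Merging, splitting, OCR-ing, and watermarking PDFs
+- Building Excel formula models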
+
+---
+
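+## 1. Word Documents (docx)
+
+### 1.1 Reading
+
+```bash
+# Use the bundled script (handles Chinese encoding automatically)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.docx>
+```
+
+The content is saved to `temp/docx_output.txt` so that Chinese text stays readable even when the Windows terminal garbles it.
+
+**How the encoding fix works**: the internal XML of a docx created by WPS may be GBK-encoded while declaring UTF-8, in which case reading it directly with python-docx produces garbled text. The script reads the XML straight from the zip package, tries UTF-8 first, and falls back to GBK.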
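The fallback described above can be sketched as follows. This is a minimal illustration rather than the bundled script: `decode_document_xml` and `make_fake_docx` are hypothetical helper names, and the GBK fallback rests on the assumption stated above about some WPS-generated files.

```python
import io
import zipfile


def decode_document_xml(docx_bytes: bytes) -> str:
    """Return word/document.xml as text, trying UTF-8 first and GBK second."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        raw = archive.read("word/document.xml")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 were written as GBK.
        return raw.decode("gbk", errors="replace")


def make_fake_docx(xml_text: str, encoding: str) -> bytes:
    """Build an in-memory zip holding only word/document.xml (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml_text.encode(encoding))
    return buffer.getvalue()


sample = "<w:t>中文测试</w:t>"
utf8_result = decode_document_xml(make_fake_docx(sample, "utf-8"))
gbk_result = decode_document_xml(make_fake_docx(sample, "gbk"))
```

Trying strict UTF-8 first matters because GBK decoding rarely fails outright; reversing the order would turn valid UTF-8 into mojibake without raising an error.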
+
+### 1.2 Creating a New Document
+
+Generate the file with `docx-js` (Node.js). Install it with `npm install -g docx`.
+
+```javascript
+const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell,
+        ImageRun, Header, Footer, AlignmentType, HeadingLevel, BorderStyle,
+        WidthType, ShadingType, PageNumber, PageBreak, LevelFormat,
+        TableOfContents, PageOrientation } = require('docx');
+const fs = require('fs');
+
+const doc = new Document({
+  styles: {
+    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt
+    paragraphStyles: [
+      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal",
+        quickFormat: true,
+        run: { size: 32, bold: true, font: "Arial" },
+        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } },
+    ]
+  },
+  sections: [{
+    properties: {
+      page: {
+        size: { width: 12240, height: 15840 }, // US Letter (in DXA; 1440 DXA = 1 inch)
+        margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }
+      }
+    },
+    children: [/* content */]
+  }]
+});
+
+Packer.toBuffer(doc).then(buf => fs.writeFileSync("output.docx", buf));
+```
+
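+**Key rules**:
+- Set the page size explicitly (the default is A4; US Letter requires 12240x15840)
+- Landscape: pass the portrait dimensions plus `orientation: PageOrientation.LANDSCAPE` (the library swaps them internally)
+- Never use `\n` for line breaks; use multiple `Paragraph` elements
+- Never use Unicode bullet characters; use `LevelFormat.BULLET` with a numbering config
+- A `PageBreak` must be placed inside a `Paragraph`
+- An `ImageRun` must specify `type` (png, jpg, and so on)
+- Tables must set both `columnWidths` and the `width` of every cell, using `WidthType.DXA` (percentage widths break in Google Docs)
+- Shade table cells with `ShadingType.CLEAR` (not SOLID, which renders the cell solid black)
+- A table of contents (TOC) requires headings that use `HeadingLevel` and styles that include `outlineLevel`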
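The page sizes in the rules above are given in DXA (twentieths of a point, 1440 per inch). A small sketch of the arithmetic, in case other paper sizes are needed; the helper names are hypothetical and not part of `docx-js`:

```python
DXA_PER_INCH = 1440  # 1 inch = 72 points = 1440 twentieths of a point
MM_PER_INCH = 25.4


def inches_to_dxa(inches: float) -> int:
    """Convert inches to DXA, rounded to the nearest unit."""
    return round(inches * DXA_PER_INCH)


def mm_to_dxa(mm: float) -> int:
    """Convert millimetres to DXA, rounded to the nearest unit."""
    return round(mm / MM_PER_INCH * DXA_PER_INCH)


us_letter = (inches_to_dxa(8.5), inches_to_dxa(11))  # 8.5 x 11 inches
a4 = (mm_to_dxa(210), mm_to_dxa(297))                # 210 x 297 mm
print("US Letter:", us_letter)
print("A4:", a4)
```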
+
+### 1.3 Editing an Existing Document
+
+Use a three-step XML workflow: unpack, edit, repack.
+
+```bash
+# Step 1: unpack
+python scripts/office/unpack.py document.docx unpacked/
+
+# Step 2: edit the XML files under unpacked/word/
+# Do string replacements directly with the Edit tool; do not write a Python script
+
+# Step 3: repack
+python scripts/office/pack.py unpacked/ output.docx --original document.docx
+```
+
+**Tracked changes**:
+```xml
+<!-- Insertion -->
+<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
+  <w:r><w:t>new text</w:t></w:r>
+</w:ins>
+
+<!-- Deletion (note: use delText, not t) -->
+<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
+  <w:r><w:delText>deleted text</w:delText></w:r>
+</w:del>
+```
+
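+**Notes**:
+- Replace the whole `<w:r>` element; do not insert revision tags inside a run
+- Preserve the original `<w:rPr>` formatting block
+- Write quotation marks in new content as XML entities: `&#x201C;` (“), `&#x201D;` (”), `&#x2019;` (’)
+- When deleting an entire paragraph, also add `<w:del/>` inside `<w:pPr><w:rPr>`; otherwise an empty paragraph remains after the change is accepted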
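When many revisions have to be written, generating the markup from a small helper keeps the `w:t` / `w:delText` distinction and the XML escaping consistent. A minimal sketch: `tracked_run` is a hypothetical helper, and it deliberately omits the `<w:rPr>` block, which should be copied from the run being replaced.

```python
from xml.sax.saxutils import escape


def tracked_run(kind: str, text: str, change_id: int, author: str, date: str) -> str:
    """Build a <w:ins> or <w:del> element wrapping a single run."""
    if kind not in ("ins", "del"):
        raise ValueError("kind must be 'ins' or 'del'")
    # Deleted text must live in <w:delText>, inserted text in <w:t>.
    text_tag = "w:t" if kind == "ins" else "w:delText"
    safe_author = escape(author, {'"': "&quot;"})
    return (
        f'<w:{kind} w:id="{change_id}" w:author="{safe_author}" w:date="{date}">'
        f"<w:r><{text_tag}>{escape(text)}</{text_tag}></w:r>"
        f"</w:{kind}>"
    )


inserted = tracked_run("ins", "new text", 1, "Claude", "2025-01-01T00:00:00Z")
deleted = tracked_run("del", "A & B", 2, "Claude", "2025-01-01T00:00:00Z")
```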
+
+---
+
+## 2. PDF Documents (pdf)
+
+### 2.1 Reading
+
+```bash
+# Read with pypdf and save to a file (avoids garbled Chinese in the terminal)
+python -c "
+import os
+from pypdf import PdfReader
+r = PdfReader('<file.pdf>')
+os.makedirs('temp', exist_ok=True)
+with open('temp/pdf_output.txt', 'w', encoding='utf-8') as f:
+    for i, p in enumerate(r.pages):
+        f.write(f'--- Page {i+1} ---\n')
+        f.write((p.extract_text() or '') + '\n\n')
+"
+```
+
+**Extracting tables** (pdfplumber is stronger here):
+```python
+import pdfplumber
+with pdfplumber.open("doc.pdf") as pdf:
+    for page in pdf.pages:
+        tables = page.extract_tables()
+        for table in tables:
+            for row in table:
+                print(row)
+```
+
+### 2.2 Merging and Splitting
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+# Merge multiple PDFs
+writer = PdfWriter()
+for f in ["doc1.pdf", "doc2.pdf"]:
+    for page in PdfReader(f).pages:
+        writer.add_page(page)
+with open("merged.pdf", "wb") as out:
+    writer.write(out)
+
+# Split into single-page files
+reader = PdfReader("input.pdf")
+for i, page in enumerate(reader.pages):
+    w = PdfWriter()
+    w.add_page(page)
+    with open(f"page_{i+1}.pdf", "wb") as out:
+        w.write(out)
+```
+
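+### 2.3 Creating a New PDF
+
+Use `reportlab`:
+
+```python
+from reportlab.lib.pagesizes import letter
+from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
+from reportlab.pdfbase import pdfmetrics
+from reportlab.pdfbase.cidfonts import UnicodeCIDFont
+from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
+
+# The stock Latin fonts cannot draw Chinese glyphs, so register a CJK font first.
+# STSong-Light is a predefined Simplified Chinese CID font (not embedded in the PDF).
+pdfmetrics.registerFont(UnicodeCIDFont('STSong-Light'))
+
+doc = SimpleDocTemplate("report.pdf", pagesize=letter)
+styles = getSampleStyleSheet()
+# Derive CJK-capable variants of the stock styles
+title = ParagraphStyle('TitleCJK', parent=styles['Title'], fontName='STSong-Light')
+heading = ParagraphStyle('Heading1CJK', parent=styles['Heading1'], fontName='STSong-Light')
+body = ParagraphStyle('NormalCJK', parent=styles['Normal'], fontName='STSong-Light', wordWrap='CJK')
+story = [
+    Paragraph("报告标题", title),
+    Spacer(1, 12),
+    Paragraph("正文内容..." * 20, body),
+    PageBreak(),
+    Paragraph("第二页", heading),
+]
+doc.build(story)
+```
+
+**Pitfall**: do not use Unicode subscript or superscript characters (₀₁₂ ⁰¹²) in reportlab; the built-in fonts lack them and they render as black boxes. Use the `<sub>` and `<super>` tags instead.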
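If source text already contains such characters, they can be rewritten into paragraph markup before being passed to `Paragraph`. A minimal sketch, assuming only digit subscripts and superscripts need handling; `to_reportlab_markup` is a hypothetical helper, and it does not escape `&` or `<` in the input.

```python
# Map each Unicode subscript/superscript digit to its plain digit.
SUBSCRIPTS = {char: str(digit) for digit, char in enumerate("₀₁₂₃₄₅₆₇₈₉")}
SUPERSCRIPTS = {char: str(digit) for digit, char in enumerate("⁰¹²³⁴⁵⁶⁷⁸⁹")}


def to_reportlab_markup(text: str) -> str:
    """Replace Unicode sub/superscript digits with <sub>/<super> tags."""
    parts = []
    for char in text:
        if char in SUBSCRIPTS:
            parts.append(f"<sub>{SUBSCRIPTS[char]}</sub>")
        elif char in SUPERSCRIPTS:
            parts.append(f"<super>{SUPERSCRIPTS[char]}</super>")
        else:
            parts.append(char)
    return "".join(parts)


water = to_reportlab_markup("H₂O")
area = to_reportlab_markup("m²")
```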
+
+### 2.4 OCR for Scanned PDFs
+
+```python
+# Dependencies: pip install pytesseract pdf2image (plus the Tesseract and Poppler system tools)
+import pytesseract
+from pdf2image import convert_from_path
+
+images = convert_from_path('scanned.pdf')
+for i, img in enumerate(images):
+    text = pytesseract.image_to_string(img, lang='chi_sim')  # chi_sim = Simplified Chinese
+    print(f"Page {i+1}:\n{text}\n")
+```
+
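+### 2.5 Other Operations
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+# Rotate a page (in multiples of 90 degrees) and write it out
+reader = PdfReader("input.pdf")
+writer = PdfWriter()
+page = reader.pages[0]
+page.rotate(90)
+writer.add_page(page)
+with open("rotated.pdf", "wb") as out:
+    writer.write(out)
+
+# Add a watermark to every page
+watermark = PdfReader("watermark.pdf").pages[0]
+writer = PdfWriter()
+for page in PdfReader("doc.pdf").pages:
+    page.merge_page(watermark)
+    writer.add_page(page)
+
+# Encrypt (user password, then owner password) and save
+writer.encrypt("user_password", "owner_password")
+with open("watermarked_encrypted.pdf", "wb") as out:
+    writer.write(out)
+```
+
+**Command-line tools**:
+```bash
+pdftotext -layout input.pdf output.txt     # extract text (preserving layout)
+qpdf --empty --pages f1.pdf f2.pdf -- out.pdf  # merge
+pdfimages -j input.pdf prefix               # extract images
+```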
+
+---
+
+## 3. PPT Presentations (pptx)
+
+### 3.1 Reading
+
+```bash
+# Text extraction
+python -m markitdown presentation.pptx
+
+# Thumbnail preview
+python scripts/thumbnail.py presentation.pptx
+
+# Raw XML
+python scripts/office/unpack.py presentation.pptx unpacked/
+```
+
+### 3.2 Creating a New Presentation
+
+Use `pptxgenjs` (Node.js). Install it with `npm install -g pptxgenjs`.
+
+Without a template, build the deck from scratch; with a template, use the unpack, edit, repack workflow.
+
+### 3.3 Editing an Existing File
+
+```bash
+# 1. Analyze the template
+python scripts/thumbnail.py presentation.pptx
+
+# 2. Unpack, edit, repack
+python scripts/office/unpack.py presentation.pptx unpacked/
+# Edit the XML...
+python scripts/office/pack.py unpacked/ output.pptx --original presentation.pptx
+```
+
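+### 3.4 Design Guidelines
+
+**Color**: do not default to blue; choose a palette that suits the topic. Use dark backgrounds for the opening and closing slides and light backgrounds for content slides.
+
+**Font pairings** (heading/body): Georgia+Calibri, Arial Black+Arial, Cambria+Calibri
+
+**Font sizes**:
+
+| Element | Size |
+|------|------|
+| Slide title | 36-44pt bold |
+| Section heading | 20-24pt bold |
+| Body text | 14-16pt |
+| Captions | 10-12pt |
+
+**Common mistakes**:
+- Do not repeat the same layout; vary the composition
+- Left-align body text; center only titles
+- Every slide needs a visual element (image, icon, or chart); avoid text-only slides
+- Do not underline titles (an obvious sign of AI-generated slides)
+- Set the text box `margin: 0`, or offset it, so that decorative lines align with the text edge
+
+### 3.5 QA Checks (mandatory)
+
+```bash
+# Content check
+python -m markitdown output.pptx
+
+# Visual check: convert to images and review every slide
+python scripts/office/soffice.py --headless --convert-to pdf output.pptx
+pdftoppm -jpeg -r 150 output.pdf slide
+
+# Check for leftover placeholders
+python -m markitdown output.pptx | grep -iE "xxxx|lorem|ipsum"
+```
+
+**Checklist**: overlapping elements, text overflow, uneven spacing, low contrast, leftover placeholders.
+**Complete at least one fix-and-verify cycle before delivering.**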
+
+---
+
+## 4. Excel Spreadsheets (xlsx)
+
+### 4.1 Reading
+
+```bash
+# Use the bundled script (handles Chinese encoding automatically)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx>             # list all sheets
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx>  # read one sheet (0-based index; first 20 rows by default)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx> <max_rows>  # set the row limit
+```
+
+**How the encoding fix works**: the script reads the xlsx internal XML (sharedStrings.xml) directly and decodes it as UTF-8, bypassing any encoding problems openpyxl might have.
+
+**Analysis with pandas**:
+```python
+import pandas as pd
+df = pd.read_excel('file.xlsx')                          # first sheet by default
+all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # all sheets, as a dict of DataFrames
+df.describe()                                             # summary statistics
+```
+
+### 4.2 Creating and Editing
+
+```python
+from openpyxl import Workbook, load_workbook
+from openpyxl.styles import Font, PatternFill, Alignment
+
+# Create a new file
+wb = Workbook()
+ws = wb.active
+ws['A1'] = 'Title'
+ws['A1'].font = Font(bold=True, size=14)
+ws.column_dimensions['A'].width = 20
+wb.save('new.xlsx')
+
+# Edit an existing file
+wb = load_workbook('existing.xlsx')
+ws = wb['Sheet1']
+ws['B2'] = '=SUM(B3:B10)'
+
+wb.save('output.xlsx')
+```
+
+### 4.3 Formula Rules (non-negotiable)
+
+> **Always use Excel formulas; never compute a value in Python and hard-code it into a cell!**
+
+```python
+# ❌ Wrong: compute in Python and write a fixed value
+total = df['Sales'].sum()
+sheet['B10'] = total  # hard-codes 5000
+
+# ✅ Right: write Excel formulas
+sheet['B10'] = '=SUM(B2:B9)'
+sheet['C5'] = '=(C4-C2)/C2'
+sheet['D20'] = '=AVERAGE(D2:D19)'
+```
+
+After writing formulas, the workbook must be recalculated:
+```bash
+python scripts/recalc.py output.xlsx
+```
+
+The script returns JSON. If `status` is `errors_found`, fix the problems listed in `error_summary` and recalculate.
+
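+### 4.4 Financial Model Conventions
+
+**Color coding**:
+- Blue text (0,0,255): hard-coded inputs
+- Black text (0,0,0): formulas and calculations
+- Green text (0,128,0): cross-sheet references
+- Red text (255,0,0): links to external files
+- Yellow background (255,255,0): key assumptions
+
+**Number formats**:
+- Format years as text (so 2024 does not become 2,024)
+- Use `$#,##0` for currency and state the unit in the header
+- Display zero as `-`
+- Show negatives in parentheses, `(123)`, not with a minus sign
+- Percentages default to `0.0%`
+
+### 4.5 Notes
+
+- openpyxl cell indexes start at 1
+- `data_only=True` reads the cached calculated values, but saving the workbook afterwards permanently discards the formulas
+- When reading with pandas, pass `dtype={'id': str}` to avoid type-inference problems
+- Put assumptions in their own cells and reference them from formulas; do not embed constants
+- Cross-sheet reference format: `Sheet1!A1`
+- Check for formula errors: `#REF!`, `#DIV/0!`, `#VALUE!`, `#NAME?`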
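The last check can be automated once a workbook has been recalculated. A minimal sketch, assuming the cached cell values have already been loaded into a plain dict (for example from a workbook opened with `data_only=True`); `find_formula_errors` is a hypothetical helper and not part of this skill's scripts.

```python
# The error codes named in the notes above.
EXCEL_ERRORS = ("#REF!", "#DIV/0!", "#VALUE!", "#NAME?")


def find_formula_errors(cells: dict) -> dict:
    """Return the subset of cells whose cached value is an Excel error code.

    `cells` maps a cell reference such as "B2" to its cached value.
    """
    return {
        ref: value
        for ref, value in cells.items()
        if isinstance(value, str) and value.strip() in EXCEL_ERRORS
    }


sample_cells = {"A1": "Revenue", "B1": 1200, "B2": "#DIV/0!", "C3": "#REF!", "D4": None}
errors = find_formula_errors(sample_cells)
```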
+
+---
+
+## 5. Chinese Encoding Issues: Summary
+
+| File type | Root cause | Fix |
+|---------|---------|---------|
+| xlsx | WPS internal XML may be GBK while declaring UTF-8 | Read the XML inside the zip directly and decode it correctly |
+| docx | Same as above | Read word/document.xml directly; try UTF-8, then GBK |
+| pdf | Extracted Chinese text is garbled in the Windows terminal | Save to a UTF-8 file and read that |
+| pptx | Similar to docx | Use markitdown or unpack the XML |
+
+---
+
+## 6. Installing Dependencies
+
+```bash
+# Python core (reading)
+pip install python-docx openpyxl pypdf pdfplumber
+
+# PDF creation and OCR
+pip install reportlab pytesseract pdf2image
+
+# PPT text extraction
+pip install "markitdown[pptx]" Pillow
+
+# Node.js (document creation)
+npm install -g docx        # Word creation
+npm install -g pptxgenjs   # PPT creation
+
+# System tools (optional)
+# LibreOffice - PDF conversion, formula recalculation
+# Poppler (pdftoppm/pdftotext/pdfimages) - PDF processing
+# Tesseract - OCR
+```
+
+## 7. Output File Locations
+
+| File type | Default output path |
+|---------|------------|
+| xlsx read | `temp/xlsx_output.txt` |
+| docx read | `temp/docx_output.txt` |
+| pdf read | `temp/pdf_output.txt` |
+
+Output is saved to a file so that Chinese text can still be read correctly when the Windows terminal garbles it.
@@ -0,0 +1,146 @@
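+# File Read/Write/Edit Skill
+
+## Overview
+
+Supports **reading, creating, and editing** four kinds of office documents: Word, Excel, PDF, and PPT.
+Includes workarounds for encoding problems in Chinese documents created with WPS/Office.
+
+## When to Use
+
+- Reading the contents of `.docx` / `.xlsx` / `.pdf` / `.pptx` / **`.xls`** files
+- Creating new office documents (reports, spreadsheets, presentations, PDFs)
+- Editing existing documents (content, formatting, formulas, and so on)
+- Fixing garbled Chinese text caused by encoding problems
+- Merging, splitting, OCR-ing, and watermarking PDFs
+- Building Excel formula models
+- **Handling legacy Excel (.xls) files**
+
+---
+
+## 1. Word Documents (docx)
+
+[...existing content unchanged...]
+
+---
+
+## 2. PDF Documents (pdf)
+
+[...existing content unchanged...]
+
+---
+
+## 3. PPT Presentations (pptx)
+
+[...existing content unchanged...]
+
+---
+
+## 4. Excel Spreadsheets (xlsx/xls)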
+
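+### 4.1 Reading
+
+```bash
+# Use the bundled script (handles Chinese encoding automatically)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx>             # list all sheets
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx>  # read one sheet (0-based index; first 20 rows by default)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx> <max_rows>  # set the row limit
+
+# New: .xls support
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xls>              # list the sheets of a legacy Excel file
+```
+
+**How the encoding fix works**:
+- **.xlsx files**: the script reads the xlsx internal XML (sharedStrings.xml) directly and decodes it as UTF-8, bypassing any encoding problems openpyxl might have
+- **.xls files**: the script reads the workbook through pandas (which uses xlrd for .xls); if that fails, it retries the file as tab-separated text, trying GBK, UTF-8, and Latin-1 in turn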
+
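+### 4.2 Special Handling for .xls Files
+
+For .xls files, use the following approach:
+
+```python
+import pandas as pd
+import xlrd
+
+def read_xls_with_chinese_encoding(file_path):
+    """Read an .xls file that contains Chinese text."""
+    try:
+        # Try a normal read first (a real binary .xls workbook)
+        return pd.read_excel(file_path, engine='xlrd')
+    except (xlrd.XLRDError, ValueError, UnicodeDecodeError):
+        # xlrd rejects files that are not real workbooks, such as
+        # tab-separated text saved with an .xls extension; fall through.
+        pass
+    # Try strict UTF-8 first; gb18030 is a superset of gbk and gb2312.
+    for encoding in ('utf-8', 'gb18030'):
+        try:
+            # Handles tab-separated "fake" .xls files
+            return pd.read_csv(file_path, encoding=encoding, sep='\t')
+        except (UnicodeDecodeError, pd.errors.ParserError):
+            continue
+    raise ValueError("Unable to detect the file encoding")
+```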
+
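+### 4.3 Creating and Editing
+
+[...existing content unchanged...]
+
+### 4.4 Formula Rules (non-negotiable)
+
+[...existing content unchanged...]
+
+### 4.5 Notes
+
+- **.xlsx files**: handle with openpyxl
+- **.xls files**: handle with xlrd. xlrd 2.0 and later reads only the legacy .xls format (it was .xlsx support that was dropped), so pinning xlrd 1.2.0 is unnecessary, and pandas 2.x requires xlrd 2.0.1 or newer to read .xls
+- **Mislabeled files**: if a file is really tab-separated text with an .xls extension, read it with `pandas.read_csv()` and `sep='\t'`
+- **Chinese encoding**: real .xls workbooks (BIFF8) store text as Unicode, which xlrd decodes itself; for older workbooks with a missing or wrong code page, `xlrd.open_workbook(..., encoding_override='gbk')` forces the decoding. A GBK-versus-UTF-8 choice otherwise arises mainly for text files mislabeled as .xls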
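Whether an `.xls` file is a real workbook or mislabeled text can be decided from its first bytes before choosing a reader. A minimal sketch: `sniff_spreadsheet` is a hypothetical helper, and the "text" result is simply the fallback for anything that carries neither signature.

```python
# Legacy .xls workbooks are OLE2 compound files; .xlsx files are zip archives.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_spreadsheet(head: bytes) -> str:
    """Classify a file from its leading bytes: 'xls', 'zip', or 'text'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "text"


binary_xls = sniff_spreadsheet(OLE2_MAGIC + b"\x00" * 8)
zipped_xlsx = sniff_spreadsheet(ZIP_MAGIC + b"\x14\x00")
fake_xls = sniff_spreadsheet("姓名\t年龄\n".encode("gbk"))
```

In practice the argument would come from `open(path, "rb").read(8)`; a "text" result suggests going straight to `pandas.read_csv` with `sep='\t'`.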
+
+---
+
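+## 5. Chinese Encoding Issues: Summary
+
+| File type | Root cause | Fix |
+|---------|---------|---------|
+| xlsx | WPS internal XML may be GBK while declaring UTF-8 | Read the XML inside the zip directly and decode it correctly |
+| **xls** | **Some files are tab-separated text (often GBK) saved with an .xls extension** | **Read real workbooks with xlrd; retry mislabeled ones as text with GBK/UTF-8** |
+| docx | Same as xlsx | Read word/document.xml directly; try UTF-8, then GBK |
+| pdf | Extracted Chinese text is garbled in the Windows terminal | Save to a UTF-8 file and read that |
+| pptx | Similar to docx | Use markitdown or unpack the XML |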
+
+---
+
+## 6. Installing Dependencies
+
+```bash
+# Python core (reading)
+pip install python-docx openpyxl pypdf pdfplumber
+
+# Excel .xls support (xlrd 2.x reads .xls only; the reader script also imports pandas)
+pip install "xlrd>=2.0.1" pandas
+
+# PDF creation and OCR
+pip install reportlab pytesseract pdf2image
+
+# PPT text extraction
+pip install "markitdown[pptx]" Pillow
+
+# Node.js (document creation)
+npm install -g docx        # Word creation
+npm install -g pptxgenjs   # PPT creation
+
+# System tools (optional)
+# LibreOffice - PDF conversion, formula recalculation
+# Poppler (pdftoppm/pdftotext/pdfimages) - PDF processing
+# Tesseract - OCR
+```
+
+## 7. Output File Locations
+
+| File type | Default output path |
+|---------|------------|
+| xlsx read | `temp/xlsx_output.txt` |
+| **xls read** | **`temp/xls_output.txt`** |
+| docx read | `temp/docx_output.txt` |
+| pdf read | `temp/pdf_output.txt` |
+
+Output is saved to a file so that Chinese text can still be read correctly when the Windows terminal garbles it.
@@ -0,0 +1,3 @@
+# Example Reference
+
+This is an example reference file. Delete if not needed.
@@ -0,0 +1,2 @@
+# file-reader - dependencies
+python-docx>=0.0.1
@@ -0,0 +1,4 @@
+#!/usr/bin/env python3
+"""Example script - delete if not needed."""
+
+print("Hello from skill!")
@@ -0,0 +1,234 @@
+# -*- coding: utf-8 -*-
+"""
+File reading tool: reads binary office files such as docx and xlsx
+and works around Chinese encoding problems.
+"""
+
+import sys
+import os
+import zipfile
+import re
+
+
+def read_docx(file_path):
+    """Read the text content of a docx file."""
+    try:
+        with zipfile.ZipFile(file_path, "r") as z:
+            with z.open("word/document.xml") as f:
+                content = f.read()
+
+        # The internal XML may be GBK-encoded even though it declares UTF-8
+        try:
+            text = content.decode("utf-8", errors="strict")
+        except UnicodeDecodeError:
+            text = content.decode("gbk", errors="ignore")
+
+        # Extract the text runs (<w:t> elements); this includes text inside tables
+        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)
+
+        content_lines = []
+        for p in paragraphs:
+            if p.strip():
+                content_lines.append(p)
+
+        # Append each table again, one row per line with cells joined by " | "
+        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
+        for table in tables:
+            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
+            for row in rows:
+                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
+                row_data = []
+                for cell in cells:
+                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
+                    if cell_text:
+                        row_data.append("".join(cell_text).strip())
+                if row_data:
+                    content_lines.append(" | ".join(row_data))
+
+        return "\n".join(content_lines)
+
+    except Exception as e:
+        return f"Error reading docx: {e}"
+
+
+def read_docx_and_save(file_path):
+    """Read a docx file and save its text as a UTF-8 text file."""
+    try:
+        with zipfile.ZipFile(file_path, "r") as z:
+            with z.open("word/document.xml") as f:
+                content = f.read()
+
+        # Try UTF-8 first, then fall back to GBK.
+        # errors="ignore" never raises, so no further fallback is needed.
+        try:
+            text = content.decode("utf-8", errors="strict")
+        except UnicodeDecodeError:
+            text = content.decode("gbk", errors="ignore")
+
+        # Extract the text runs (<w:t> elements); this includes text inside tables
+        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)
+
+        content_lines = []
+        for p in paragraphs:
+            if p.strip():
+                content_lines.append(p)
+
+        # Append each table again, one row per line with cells joined by " | "
+        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
+        for table in tables:
+            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
+            for row in rows:
+                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
+                row_data = []
+                for cell in cells:
+                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
+                    if cell_text:
+                        row_data.append("".join(cell_text).strip())
+                if row_data:
+                    content_lines.append(" | ".join(row_data))
+
+        output = "\n".join(content_lines)
+
+        # Save to a file
+        output_file = "temp/docx_output.txt"
+        os.makedirs("temp", exist_ok=True)
+        with open(output_file, "w", encoding="utf-8") as f:
+            f.write(output)
+
+        return f"[Content saved to {output_file}]\n\n{output}"
+
+    except Exception as e:
+        return f"Error: {e}"
+
+
+def read_xlsx(file_path, sheet_index=0, max_rows=20):
+    """Read an xlsx sheet as text and save it to a UTF-8 file so Chinese displays correctly."""
+    try:
+        with zipfile.ZipFile(file_path, "r") as z:
+            # Read sharedStrings (UTF-8). Index by <si> item, not by <t> element:
+            # rich-text items hold several <t> runs, and <t> may carry attributes.
+            string_map = {}
+            if "xl/sharedStrings.xml" in z.namelist():
+                with z.open("xl/sharedStrings.xml") as f:
+                    text = f.read().decode("utf-8", errors="ignore")
+                items = re.findall(r"<si\s*/>|<si>(.*?)</si>", text, re.DOTALL)
+                for i, item in enumerate(items):
+                    string_map[i] = "".join(re.findall(r"<t[^>]*>([^<]*)</t>", item))
+
+            # Collect the sheet parts and sort them numerically (sheet2 before sheet10).
+            # Part numbering usually, but not always, matches the workbook tab order.
+            sheet_files = sorted(
+                (
+                    f
+                    for f in z.namelist()
+                    if f.startswith("xl/worksheets/sheet") and f.endswith(".xml")
+                ),
+                key=lambda name: [int(n) for n in re.findall(r"\d+", name)],
+            )
+            if sheet_index >= len(sheet_files):
+                return f"Sheet index {sheet_index} out of range"
+
+            sheet_file = sheet_files[sheet_index]
+            with z.open(sheet_file) as f:
+                sheet_content = f.read().decode("utf-8", errors="ignore")
+
+            # Parse the rows; [^>/]* skips self-closing (empty) elements
+            rows = re.findall(
+                r'<row r="(\d+)"[^>/]*>(.*?)</row>', sheet_content, re.DOTALL
+            )
+
+            result = [f"=== Sheet {sheet_index} ==="]
+            for row_num, row_content in rows[:max_rows]:
+                cells = re.findall(
+                    r'<c r="([A-Z]+\d+)"([^>/]*)>(.*?)</c>', row_content, re.DOTALL
+                )
+                row_data = []
+                for cell_ref, cell_attrs, cell_content in cells:
+                    v_match = re.search(r"<v>([^<]*)</v>", cell_content)
+                    t_match = re.search(r"<is><t[^>]*>([^<]*)</t></is>", cell_content)
+
+                    if t_match:
+                        val = t_match.group(1)
+                    elif v_match:
+                        v = v_match.group(1)
+                        # t="s" is an attribute of the <c> tag, so test the attributes
+                        if 't="s"' in cell_attrs and v.isdigit():
+                            idx = int(v)
+                            val = string_map.get(idx, f"[str:{v}]")
+                        else:
+                            try:
+                                val = str(int(v)) if "." not in v else str(float(v))
+                            except ValueError:
+                                val = v
+                    else:
+                        val = ""
+
+                    if val:
+                        row_data.append(val)
+
+                if row_data:
+                    result.append(" | ".join(row_data))
+
+            output = "\n".join(result)
+
+            # Write to a file so that Chinese text displays correctly
+            output_file = "temp/xlsx_output.txt"
+            os.makedirs("temp", exist_ok=True)
+            with open(output_file, "w", encoding="utf-8") as f:
+                f.write(output)
+
+            return f"[Content saved to {output_file}]\n\n{output}"
+
+    except Exception as e:
+        return f"Error: {e}"
+
+
+def list_sheets(file_path):
+    """List all sheets in an xlsx file."""
+    try:
+        import openpyxl
+
+        wb = openpyxl.load_workbook(file_path, data_only=True)
+        sheets = wb.sheetnames
+        return "Available sheets:\n" + "\n".join(f"{i}: {s}" for i, s in enumerate(sheets))
+    except Exception as e:
+        return f"Error: {e}"
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage:")
+        print("  python file_reader.py <file.docx>")
+        print("  python file_reader.py <file.xlsx>                    # list all sheets")
+        print("  python file_reader.py <file.xlsx> <sheet_index>      # read one sheet")
+        print("  python file_reader.py <file.xlsx> <sheet_index> <max_rows>")
+        print("\nExamples:")
+        print("  python file_reader.py test.xlsx           # list sheets")
+        print("  python file_reader.py test.xlsx 1         # read the 2nd sheet")
+        print("  python file_reader.py test.xlsx 1 30      # read the first 30 rows")
+        sys.exit(1)
+
+    file_path = sys.argv[1]
+
+    if not os.path.exists(file_path):
+        print(f"File not found: {file_path}")
+        sys.exit(1)
+
+    ext = os.path.splitext(file_path)[1].lower()
+
+    if ext == ".docx":
+        print(read_docx_and_save(file_path))
+    elif ext in [".xlsx", ".xls"]:
+        # Note: both readers below expect the zip-based .xlsx format,
+        # so a binary .xls workbook will come back as an error message.
+        if len(sys.argv) == 2:
+            print(list_sheets(file_path))
+        else:
+            sheet_index = int(sys.argv[2])
+            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
+            print(read_xlsx(file_path, sheet_index, max_rows))
+    else:
+        print(f"Unsupported file type: {ext}")
@@ -0,0 +1,265 @@
+# -*- coding: utf-8 -*-
+"""
+File reading tool: reads binary office files such as docx, xlsx, and xls
+and works around Chinese encoding problems.
+"""
+
+import sys
+import os
+import zipfile
+import re
+import pandas as pd
+
+
+def read_docx_and_save(file_path):
+    """Read a docx file and save its text as a UTF-8 text file."""
+    try:
+        with zipfile.ZipFile(file_path, "r") as z:
+            with z.open("word/document.xml") as f:
+                content = f.read()
+
+        # Try UTF-8 first, then fall back to GBK.
+        # errors="ignore" never raises, so no further fallback is needed.
+        try:
+            text = content.decode("utf-8", errors="strict")
+        except UnicodeDecodeError:
+            text = content.decode("gbk", errors="ignore")
+
+        # Extract the text runs (<w:t> elements); this includes text inside tables
+        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)
+
+        content_lines = []
+        for p in paragraphs:
+            if p.strip():
+                content_lines.append(p)
+
+        # Append each table again, one row per line with cells joined by " | "
+        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
+        for table in tables:
+            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
+            for row in rows:
+                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
+                row_data = []
+                for cell in cells:
+                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
+                    if cell_text:
+                        row_data.append("".join(cell_text).strip())
+                if row_data:
+                    content_lines.append(" | ".join(row_data))
+
+        output = "\n".join(content_lines)
+
+        # Save to a file
+        output_file = "temp/docx_output.txt"
+        os.makedirs("temp", exist_ok=True)
+        with open(output_file, "w", encoding="utf-8") as f:
+            f.write(output)
+
+        return f"[Content saved to {output_file}]\n\n{output}"
+
+    except Exception as e:
+        return f"Error: {e}"
+
+
+def read_xlsx(file_path, sheet_index=0, max_rows=20):
+    """Read an xlsx sheet as text and save it to a UTF-8 file so Chinese displays correctly."""
+    try:
+        with zipfile.ZipFile(file_path, "r") as z:
+            # Read sharedStrings (UTF-8). Index by <si> item, not by <t> element:
+            # rich-text items hold several <t> runs, and <t> may carry attributes.
+            string_map = {}
+            if "xl/sharedStrings.xml" in z.namelist():
+                with z.open("xl/sharedStrings.xml") as f:
+                    text = f.read().decode("utf-8", errors="ignore")
+                items = re.findall(r"<si\s*/>|<si>(.*?)</si>", text, re.DOTALL)
+                for i, item in enumerate(items):
+                    string_map[i] = "".join(re.findall(r"<t[^>]*>([^<]*)</t>", item))
+
+            # Collect the sheet parts and sort them numerically (sheet2 before sheet10).
+            # Part numbering usually, but not always, matches the workbook tab order.
+            sheet_files = sorted(
+                (
+                    f
+                    for f in z.namelist()
+                    if f.startswith("xl/worksheets/sheet") and f.endswith(".xml")
+                ),
+                key=lambda name: [int(n) for n in re.findall(r"\d+", name)],
+            )
+            if sheet_index >= len(sheet_files):
+                return f"Sheet index {sheet_index} out of range"
+
+            sheet_file = sheet_files[sheet_index]
+            with z.open(sheet_file) as f:
+                sheet_content = f.read().decode("utf-8", errors="ignore")
+
+            # Parse the rows; [^>/]* skips self-closing (empty) elements
+            rows = re.findall(
+                r'<row r="(\d+)"[^>/]*>(.*?)</row>', sheet_content, re.DOTALL
+            )
+
+            result = [f"=== Sheet {sheet_index} ==="]
+            for row_num, row_content in rows[:max_rows]:
+                cells = re.findall(
+                    r'<c r="([A-Z]+\d+)"([^>/]*)>(.*?)</c>', row_content, re.DOTALL
+                )
+                row_data = []
+                for cell_ref, cell_attrs, cell_content in cells:
+                    v_match = re.search(r"<v>([^<]*)</v>", cell_content)
+                    t_match = re.search(r"<is><t[^>]*>([^<]*)</t></is>", cell_content)
+
+                    if t_match:
+                        val = t_match.group(1)
+                    elif v_match:
+                        v = v_match.group(1)
+                        # t="s" is an attribute of the <c> tag, so test the attributes
+                        if 't="s"' in cell_attrs and v.isdigit():
+                            idx = int(v)
+                            val = string_map.get(idx, f"[str:{v}]")
+                        else:
+                            try:
+                                val = str(int(v)) if "." not in v else str(float(v))
+                            except ValueError:
+                                val = v
+                    else:
+                        val = ""
+
+                    if val:
+                        row_data.append(val)
+
+                if row_data:
+                    result.append(" | ".join(row_data))
+
+            output = "\n".join(result)
+
+            # Write to a file so that Chinese text displays correctly
+            output_file = "temp/xlsx_output.txt"
+            os.makedirs("temp", exist_ok=True)
+            with open(output_file, "w", encoding="utf-8") as f:
+                f.write(output)
+
+            return f"[Content saved to {output_file}]\n\n{output}"
+
+    except Exception as e:
+        return f"Error: {e}"
+
+
+def read_xls(file_path, sheet_index=0, max_rows=20):
+    """Read an .xls file, handling Chinese encoding."""
+
+    def render_and_save(df, title):
+        # Convert the DataFrame to " | "-separated text: title, column names, data rows
+        result = [title, " | ".join(str(col) for col in df.columns)]
+        for _, row in df.iterrows():
+            result.append(" | ".join("" if pd.isna(val) else str(val) for val in row))
+        output = "\n".join(result)
+
+        # Save to a file
+        output_file = "temp/xls_output.txt"
+        os.makedirs("temp", exist_ok=True)
+        with open(output_file, "w", encoding="utf-8") as f:
+            f.write(output)
+        return f"[Content saved to {output_file}]\n\n{output}"
+
+    try:
+        # First try reading it as a real workbook through pandas
+        df = pd.read_excel(file_path, sheet_name=sheet_index, nrows=max_rows)
+        return render_and_save(df, f"=== Sheet {sheet_index} ===")
+    except Exception as e:
+        # If pandas fails, the file may be tab-separated text with an .xls extension.
+        # latin1 accepts any byte sequence, so it acts as a last resort.
+        for encoding in ("gbk", "utf-8", "latin1"):
+            try:
+                df = pd.read_csv(
+                    file_path, encoding=encoding, sep="\t", nrows=max_rows
+                )
+                return render_and_save(df, f"=== Sheet {sheet_index} (CSV format) ===")
+            except Exception:
+                continue
+        return f"Error reading xls file: {e}"
+
+
+def list_sheets(file_path):
+    """List all sheets in an Excel file."""
+    try:
+        excel_file = pd.ExcelFile(file_path)
+        sheets = excel_file.sheet_names
+        return "Available sheets:\n" + "\n".join(f"{i}: {s}" for i, s in enumerate(sheets))
+    except Exception as e:
+        return f"Error: {e}"
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage:")
+        print("  python file_reader.py <file.docx>")
+        print("  python file_reader.py <file.xlsx>                    # list all sheets")
+        print("  python file_reader.py <file.xlsx> <sheet_index>      # read one sheet")
+        print("  python file_reader.py <file.xlsx> <sheet_index> <max_rows>")
+        print("  python file_reader.py <file.xls>                     # list all sheets")
+        print("  python file_reader.py <file.xls> <sheet_index>       # read one sheet")
+        print("\nExamples:")
+        print("  python file_reader.py test.xlsx           # list sheets")
+        print("  python file_reader.py test.xlsx 1         # read the 2nd sheet")
+        print("  python file_reader.py test.xlsx 1 30      # read the first 30 rows")
+        print("  python file_reader.py test.xls            # list the sheets of an xls file")
+        sys.exit(1)
+
+    file_path = sys.argv[1]
+
+    if not os.path.exists(file_path):
+        print(f"File not found: {file_path}")
+        sys.exit(1)
+
+    ext = os.path.splitext(file_path)[1].lower()
+
+    if ext == ".docx":
+        print(read_docx_and_save(file_path))
+    elif ext == ".xlsx":
+        if len(sys.argv) == 2:
+            print(list_sheets(file_path))
+        else:
+            sheet_index = int(sys.argv[2])
+            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
+            print(read_xlsx(file_path, sheet_index, max_rows))
+    elif ext == ".xls":
+        if len(sys.argv) == 2:
+            print(list_sheets(file_path))
+        else:
+            sheet_index = int(sys.argv[2])
+            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
+            print(read_xls(file_path, sheet_index, max_rows))
+    else:
+        print(f"Unsupported file type: {ext}")
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+"""
+Safe PPTX reader script: does not depend on markitdown, which avoids Google Vision API issues
+"""
+
+import sys
+import os
+from pptx import Presentation
+from pptx.enum.shapes import MSO_SHAPE_TYPE
+
+
+def extract_text_from_shape(shape):
+    """Extract the text from a shape."""
+    if not hasattr(shape, "text"):
+        return ""
+    return shape.text
+
+
+def extract_text_from_slide(slide):
+    """Extract all text from a slide."""
+    texts = []
+
+    # Extract the title
+    title_shape = slide.shapes.title
+    if title_shape is not None:
+        title = title_shape.text
+        if title.strip():
+            texts.append(f"Title: {title}")
+
+    # Extract text from every other shape that carries text (text boxes,
+    # placeholders, auto shapes). The title is skipped so it is not listed twice.
+    for shape in slide.shapes:
+        if title_shape is not None and shape.shape_id == title_shape.shape_id:
+            continue
+        text = extract_text_from_shape(shape)
+        if text.strip():
+            texts.append(text)
+
+    return texts
+
+
+def read_pptx_safe(file_path):
+    """Read a PPTX file safely; returns None if the file cannot be read."""
+    try:
+        prs = Presentation(file_path)
+        all_content = []
+
+        for i, slide in enumerate(prs.slides):
+            slide_content = extract_text_from_slide(slide)
+            if slide_content:
+                all_content.append(f"--- Slide {i + 1} ---")
+                all_content.extend(slide_content)
+                all_content.append("")
+
+        return "\n".join(all_content)
+
+    except Exception as e:
+        print(f"Error reading PPTX: {e}", file=sys.stderr)
+        return None
+
+
+def main():
+    if len(sys.argv) != 2:
+        print("Usage: python pptx_reader_safe.py <presentation.pptx>", file=sys.stderr)
+        sys.exit(1)
+
+    file_path = sys.argv[1]
+    if not os.path.exists(file_path):
+        print(f"File not found: {file_path}", file=sys.stderr)
+        sys.exit(1)
+
+    content = read_pptx_safe(file_path)
+    if content:
+        # Save to a temp file to avoid terminal encoding problems
+        output_path = "temp/pptx_output.txt"
+        os.makedirs("temp", exist_ok=True)
+        with open(output_path, "w", encoding="utf-8") as f:
+            f.write(content)
+        print(f"Content extracted to: {output_path}")
+        print("\n" + content)
+    else:
+        print("Failed to extract content", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()