Initial commit: skills library

- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches
2026-04-26 19:27:40 +08:00
commit 04db423416
861 changed files with 210414 additions and 0 deletions
@@ -0,0 +1,146 @@
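+# File Read/Write/Edit Skill
+
+## Overview
+
+Supports **reading, creating, and editing** four kinds of office documents: Word, Excel, PDF, and PPT.
+Provides specific workarounds for encoding problems in Chinese documents created with WPS/Office.
+
+## Use Cases
+
+- Read the contents of `.docx` / `.xlsx` / `.pdf` / `.pptx` / **`.xls`** files
+- Create new office documents (reports, spreadsheets, presentations, PDFs)
+- Edit existing documents (content, formatting, formulas, etc.)
+- Fix garbled Chinese text caused by encoding problems
+- Merge, split, OCR, and watermark PDFs
+- Build Excel formula models
+- **Handle legacy Excel (.xls) files**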
+
+---
+
+## 1. Word Documents (docx)
+
+[...existing content unchanged...]
+
+---
+
+## 2. PDF Documents (pdf)
+
+[...existing content unchanged...]
+
+---
+
+## 3. PPT Presentations (pptx)
+
+[...existing content unchanged...]
+
+---
+
+## 4. Excel Spreadsheets (xlsx/xls)
+
+### 4.1 Reading
+
+```bash
+# Use the built-in script (handles Chinese encoding automatically)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx>             # list all sheets
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx>  # read one sheet (first 20 rows by default)
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx> <max_rows>  # set the number of rows
+
+# New: .xls support
+python .opencode/skills/file-reader/scripts/file_reader.py <file.xls>              # read a legacy Excel file
+```
+
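+**How Chinese encoding is handled**:
+- **.xlsx files**: read the XML inside the xlsx archive (sharedStrings.xml) directly and decode it as UTF-8, bypassing possible encoding problems in openpyxl
+- **.xls files**: handled with the xlrd library, with automatic GBK/UTF-8 detection and special handling for Chinese .xls files created by WPS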
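The .xlsx approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual contents of `file_reader.py`; the function name `read_shared_strings` is hypothetical, and the sketch assumes the workbook stores its text in `xl/sharedStrings.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by xl/sharedStrings.xml
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def read_shared_strings(xlsx_source):
    """Return the shared-string table of an .xlsx file as a list of str.

    xlsx_source may be a file path or a binary file-like object.
    (Hypothetical helper, for illustration only.)
    """
    with zipfile.ZipFile(xlsx_source) as zf:
        raw = zf.read("xl/sharedStrings.xml")
    # Decode explicitly as UTF-8; utf-8-sig also drops a leading BOM
    root = ET.fromstring(raw.decode("utf-8-sig"))
    strings = []
    for si in root.findall(f"{{{MAIN_NS}}}si"):
        # A string item holds one <t>, or several rich-text runs <r><t>.
        # Phonetic runs (<rPh>) are not filtered out in this sketch.
        parts = [t.text or "" for t in si.iter(f"{{{MAIN_NS}}}t")]
        strings.append("".join(parts))
    return strings
```

Cells in the sheet XML refer to these strings by index, so a full reader would still need to resolve those indices against each worksheet.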
+
+### 4.2 Special Handling for .xls Files
+
+For .xls files, use the following approach:
+
+```python
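+import pandas as pd
+import xlrd
+
+
+def read_xls_with_chinese_encoding(file_path):
+    """Read an .xls file that contains Chinese text."""
+    try:
+        # First try to read it as a genuine binary .xls workbook
+        return pd.read_excel(file_path, engine='xlrd')
+    except (xlrd.XLRDError, UnicodeDecodeError):
+        # xlrd raises XLRDError when the file is not a real workbook.
+        # Treat it as a tab-separated text file saved with an .xls
+        # extension. UTF-8 goes first because it is the strictest codec;
+        # gb18030 is a superset of GBK and serves as the last resort.
+        for encoding in ('utf-8', 'gbk', 'gb18030'):
+            try:
+                return pd.read_csv(file_path, encoding=encoding, sep='\t')
+            except (UnicodeDecodeError, pd.errors.ParserError):
+                continue
+        raise ValueError("Unable to determine the file encoding")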
+```
+
+### 4.3 Creating and Editing
+
+[...existing content unchanged...]
+
+### 4.4 Formula Rules (Mandatory)
+
+[...existing content unchanged...]
+
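+### 4.5 Notes
+
+- **.xlsx files**: handled with openpyxl
+- **.xls files**: handled with xlrd. Note that xlrd 2.0+ dropped support for .xlsx and reads only .xls, so current releases remain suitable here; xlrd 1.2.0 also reads .xls, but recent pandas releases expect xlrd 2.0.1 or newer for `engine='xlrd'`
+- **Mismatched formats**: if a file is really tab-separated text with an .xls extension, read it with `pandas.read_csv()` and `sep='\t'`
+- **Chinese encoding**: .xls files created by WPS usually use GBK, while those created by Office may use UTF-8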
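One way to tell a genuine workbook from a mislabeled text file, as the notes above describe, is to look at the first bytes of the file before choosing a reader. The sketch below is an assumption about how such a check could look, not part of the skill's scripts; `sniff_spreadsheet_format` is a hypothetical name.

```python
# Signature of the OLE2 compound file that wraps legacy binary .xls workbooks
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header of a zip archive, the container used by .xlsx
ZIP_MAGIC = b"PK\x03\x04"


def sniff_spreadsheet_format(path):
    """Classify a file as 'xls', 'xlsx', or 'text' from its first bytes.

    Hypothetical helper: the signatures identify the container only, so
    any OLE2 file (for example .doc) or any zip archive matches as well.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "text"
```

A caller could then route `'xls'` to xlrd, `'xlsx'` to openpyxl, and `'text'` to `pandas.read_csv()` regardless of the file extension.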
+
+---
+
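+## 5. Summary of Chinese Encoding Problems
+
+| File type | Root cause | Solution |
+|---------|---------|---------|
+| xlsx | XML written by WPS may be GBK but declared as UTF-8 | Read the XML inside the zip directly and decode it correctly |
+| **xls** | **.xls files created by WPS use GBK encoding** | **Use xlrd with GBK handling** |
+| docx | Same as xlsx | Read word/document.xml directly, trying UTF-8/GBK |
+| pdf | Extracted Chinese text is garbled in the Windows terminal | Save to a UTF-8 file, then read that file |
+| pptx | Similar to docx | Use markitdown or unpack the XML |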
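The "try UTF-8/GBK" strategy in the table can be illustrated with a small fallback decoder. This is a sketch under the assumption that only these two encodings need to be distinguished; `decode_with_fallback` is a hypothetical name, not a function from the skill.

```python
def decode_with_fallback(data, encodings=("utf-8", "gbk")):
    """Decode bytes with the first encoding that succeeds.

    Returns (text, encoding_used). UTF-8 is tried first because it is
    strict: GBK-encoded Chinese is almost never valid UTF-8, whereas
    UTF-8 bytes can sometimes decode as GBK into garbled text.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

For example, `decode_with_fallback("中文".encode("gbk"))` falls through UTF-8 and decodes the bytes as GBK.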
+
+---
+
+## 6. Installing Dependencies
+
+```bash
+# Python core (reading)
+pip install python-docx openpyxl pypdf pdfplumber
+
+# Excel .xls support (important!)
+# xlrd 2.0+ reads .xls only; recent pandas expects xlrd 2.0.1 or newer
+pip install "xlrd>=2.0.1"
+
+# PDF creation and OCR
+pip install reportlab pytesseract pdf2image
+
+# PPT text extraction
+pip install "markitdown[pptx]" Pillow
+
+# Node.js (document creation)
+npm install -g docx        # Word creation
+npm install -g pptxgenjs   # PPT creation
+
+# System tools (optional)
+# LibreOffice - PDF conversion, formula recalculation
+# Poppler (pdftoppm/pdftotext/pdfimages) - PDF processing
+# Tesseract - OCR
+```
+
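+## 7. Output File Locations
+
+| File type | Default output path |
+|---------|------------|
+| xlsx read | `temp/xlsx_output.txt` |
+| **xls read** | **`temp/xls_output.txt`** |
+| docx read | `temp/docx_output.txt` |
+| pdf read | `temp/pdf_output.txt` |
+
+Output is saved to a file so that Chinese text can still be viewed correctly when the Windows terminal garbles it.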
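Writing the extracted text to a UTF-8 file, as described above, might look like the following. This is a minimal sketch; `save_text_utf8` is a hypothetical helper, and the default path simply mirrors one row of the table.

```python
from pathlib import Path


def save_text_utf8(text, out_path="temp/xls_output.txt"):
    """Write text to out_path as UTF-8, creating parent folders as needed.

    Hypothetical helper: the file can then be opened in any UTF-8 aware
    editor, independent of the terminal's code page.
    """
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text, encoding="utf-8")
    return out
```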