Initial commit: skills library

- 70 skills with code and documentation
- Add .gitignore (ignore __pycache__, output/, temp/, venv/)
- Clean up test intermediates and caches
@@ -0,0 +1,437 @@
---
name: file-reader
description: File read/write/edit skill. Reads, creates, and edits docx/xlsx/pdf/pptx files, with special handling for Chinese encoding issues.
---

# File Read/Write/Edit Skill

## Overview

Supports **reading, creating, and editing** four kinds of office documents: Word, Excel, PDF, and PPT.
It also provides workarounds for encoding problems in Chinese documents created with WPS/Office.

## Use Cases

- Read the contents of `.docx` / `.xlsx` / `.pdf` / `.pptx` files
- Create new office documents (reports, spreadsheets, presentations, PDFs)
- Edit existing documents (content, formatting, formulas, etc.)
- Fix garbled Chinese text caused by encoding problems
- Merge, split, OCR, and watermark PDFs
- Build Excel formula models

---

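## 1. Word Documents (docx)

### 1.1 Reading

```bash
# Use the bundled script (handles Chinese encoding automatically)
python .opencode/skills/file-reader/scripts/file_reader.py <file.docx>
```

The content is saved to `temp/docx_output.txt` so that Chinese text can still be viewed correctly when the Windows terminal garbles it.

**How the Chinese encoding fix works**: The internal XML of a docx created by WPS may be GBK-encoded even though it declares UTF-8, so reading it directly with python-docx produces garbled text. The script reads the XML straight from the zip package and decodes it as UTF-8 first, falling back to GBK.
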
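The decoding fallback described above can be sketched in a few lines. This is a minimal illustration, not the bundled script itself: the function names `decode_xml_bytes` and `docx_text_runs` are hypothetical, and the sample package is a stand-in built in memory rather than a real WPS file.

```python
import io
import re
import zipfile


def decode_xml_bytes(raw):
    """Decode XML bytes as UTF-8, falling back to GBK if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Mislabelled GBK content lands here; skip bytes GBK cannot map.
        return raw.decode("gbk", errors="ignore")


def docx_text_runs(docx_file):
    """Return the non-empty <w:t> text runs of a docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as z:
        xml = decode_xml_bytes(z.read("word/document.xml"))
    return [t for t in re.findall(r"<w:t[^>]*>([^<]*)</w:t>", xml) if t.strip()]


# Build a tiny in-memory package whose XML is GBK-encoded but claims UTF-8.
fake_xml = '<?xml version="1.0" encoding="UTF-8"?><w:p><w:r><w:t>中文测试</w:t></w:r></w:p>'
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", fake_xml.encode("gbk"))
buf.seek(0)

runs = docx_text_runs(buf)
print(runs)
```

Trying UTF-8 strictly first matters: GBK decoding rarely fails outright, so attempting it first could silently turn valid UTF-8 into garbage.
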
### 1.2 Creating a New Document

Generate documents with `docx-js` (Node.js). Install it with `npm install -g docx`.

```javascript
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell,
        ImageRun, Header, Footer, AlignmentType, HeadingLevel, BorderStyle,
        WidthType, ShadingType, PageNumber, PageBreak, LevelFormat,
        TableOfContents, PageOrientation } = require('docx');
const fs = require('fs');

const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt
    paragraphStyles: [
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal",
        quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } },
    ]
  },
  sections: [{
    properties: {
      page: {
        size: { width: 12240, height: 15840 }, // US Letter (DXA units; 1440 = 1 inch)
        margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }
      }
    },
    children: [/* content */]
  }]
});

Packer.toBuffer(doc).then(buf => fs.writeFileSync("output.docx", buf));
```

**Key rules**:
- Always set the page size explicitly (the default is A4; US Letter must be specified as 12240x15840)
- Landscape: pass the portrait dimensions plus `orientation: PageOrientation.LANDSCAPE` (the library swaps them internally)
- Do not use `\n` for line breaks; use multiple `Paragraph` elements
- Do not use Unicode bullet characters; use `LevelFormat.BULLET` with a numbering config
- `PageBreak` must be placed inside a `Paragraph`
- `ImageRun` must specify `type` (png/jpg, etc.)
- Tables must set both `columnWidths` and a `width` on every cell, using `WidthType.DXA` (percentage widths break in Google Docs)
- Use `ShadingType.CLEAR` for table shading (not SOLID, which renders solid black)
- A table of contents (TOC) requires headings that use `HeadingLevel` and styles that include `outlineLevel`

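Page sizes, margins, and table widths in this section are all given in DXA, where 1440 DXA equal one inch. The arithmetic can be sketched as below; the helper names are hypothetical and only illustrate the unit conversion, they are not part of `docx-js`.

```python
DXA_PER_INCH = 1440  # 1440 DXA = 1 inch, as noted in the page-size comment
MM_PER_INCH = 25.4


def inches_to_dxa(inches):
    """Convert inches to DXA, rounded to the nearest whole unit."""
    return round(inches * DXA_PER_INCH)


def mm_to_dxa(mm):
    """Convert millimetres to DXA, rounded to the nearest whole unit."""
    return round(mm / MM_PER_INCH * DXA_PER_INCH)


def content_width_dxa(page_width, left_margin, right_margin):
    """Width left for content (for example a full-width table) after margins."""
    return page_width - left_margin - right_margin


letter = (inches_to_dxa(8.5), inches_to_dxa(11))   # US Letter, portrait
a4 = (mm_to_dxa(210), mm_to_dxa(297))              # A4, portrait
usable = content_width_dxa(letter[0], 1440, 1440)  # one-inch side margins
print(letter, a4, usable)
```

The usable width is a reasonable target for the sum of a table's `columnWidths` when the table should span the text area.
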
### 1.3 Editing an Existing Document

Use a three-step workflow: unpack the XML, edit it, and repack it.

```bash
# Step 1: unpack
python scripts/office/unpack.py document.docx unpacked/

# Step 2: edit the XML files under unpacked/word/
# Make the string replacements directly with the Edit tool; do not write a Python script

# Step 3: repack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
```

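The `unpack.py` and `pack.py` scripts are not reproduced in this document. Assuming they are essentially zip extraction and re-compression, the round trip might look like the following sketch. The function names are hypothetical, the real scripts may do more (the `--original` option suggests they consult the source file), and the sample package is a stand-in rather than a valid docx.

```python
import os
import tempfile
import zipfile


def unpack_office_file(src_path, dest_dir):
    """Extract every part of an OOXML package (docx/pptx/xlsx) into dest_dir."""
    with zipfile.ZipFile(src_path) as z:
        z.extractall(dest_dir)


def pack_office_file(src_dir, dest_path):
    """Zip src_dir back into a package, storing paths relative to src_dir."""
    with zipfile.ZipFile(dest_path, "w", zipfile.ZIP_DEFLATED) as z:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                z.write(full, arcname)


# Round trip on a stand-in package.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "document.docx")
with zipfile.ZipFile(original, "w") as z:
    z.writestr("word/document.xml", "<w:t>old text</w:t>")

unpacked = os.path.join(workdir, "unpacked")
unpack_office_file(original, unpacked)

# The edit step: a plain string replacement in the extracted XML.
xml_path = os.path.join(unpacked, "word", "document.xml")
with open(xml_path, encoding="utf-8") as f:
    xml = f.read()
with open(xml_path, "w", encoding="utf-8") as f:
    f.write(xml.replace("old text", "new text"))

output = os.path.join(workdir, "output.docx")
pack_office_file(unpacked, output)
with zipfile.ZipFile(output) as z:
    repacked_xml = z.read("word/document.xml").decode("utf-8")
print(repacked_xml)
```
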
**Tracked changes**:
```xml
<!-- Insertion -->
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>new text</w:t></w:r>
</w:ins>

<!-- Deletion (note: use delText, not t) -->
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
```

**Notes**:
- Replace the entire `<w:r>` element; do not insert revision tags inside a run
- Preserve the original `<w:rPr>` formatting block
- Write quotation marks in new content as XML entities: `&#x201C;` (“), `&#x201D;` (”), `&#x2019;` (’)
- When deleting a whole paragraph, also add `<w:del/>` inside `<w:pPr><w:rPr>`; otherwise an empty paragraph remains after the change is accepted

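When many revisions have to be written, generating the markup with a small helper keeps the attribute layout consistent and escapes the text correctly. The sketch below assumes the element layout shown in the tracked-changes example; the function names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def _revision_attrs(rev_id, author, date):
    """Format the w:id, w:author, and w:date attributes shared by ins and del."""
    return (
        f"w:id={quoteattr(str(rev_id))} "
        f"w:author={quoteattr(author)} "
        f"w:date={quoteattr(date)}"
    )


def tracked_insertion(text, rev_id, author, date, rpr_xml=""):
    """Wrap a complete new run in <w:ins>; rpr_xml is the preserved <w:rPr> block."""
    return (
        f"<w:ins {_revision_attrs(rev_id, author, date)}>"
        f"<w:r>{rpr_xml}<w:t>{escape(text)}</w:t></w:r>"
        "</w:ins>"
    )


def tracked_deletion(text, rev_id, author, date, rpr_xml=""):
    """Wrap a complete run in <w:del>, using <w:delText> instead of <w:t>."""
    return (
        f"<w:del {_revision_attrs(rev_id, author, date)}>"
        f"<w:r>{rpr_xml}<w:delText>{escape(text)}</w:delText></w:r>"
        "</w:del>"
    )


ins_xml = tracked_insertion("A & B", 1, "Claude", "2025-01-01T00:00:00Z")
del_xml = tracked_deletion("old", 2, "Claude", "2025-01-01T00:00:00Z",
                           rpr_xml="<w:rPr><w:b/></w:rPr>")
print(ins_xml)
print(del_xml)
```

Each helper emits a whole run, which matches the rule of replacing entire `<w:r>` elements rather than splicing revision tags into an existing run.
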
---

## 2. PDF Documents (pdf)

### 2.1 Reading

```bash
# Read with pypdf and save to a file (avoids garbled Chinese in the terminal)
python -c "
import os
from pypdf import PdfReader
r = PdfReader('<file.pdf>')
os.makedirs('temp', exist_ok=True)
with open('temp/pdf_output.txt', 'w', encoding='utf-8') as f:
    for i, p in enumerate(r.pages):
        f.write(f'--- Page {i+1} ---\n')
        f.write(p.extract_text() + '\n\n')
"
```

**Extracting tables** (pdfplumber is stronger here):
```python
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```

### 2.2 Merging and Splitting

```python
from pypdf import PdfReader, PdfWriter

# Merge several PDFs
writer = PdfWriter()
for f in ["doc1.pdf", "doc2.pdf"]:
    for page in PdfReader(f).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as out:
    writer.write(out)

# Split into single pages
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as out:
        w.write(out)
```

### 2.3 Creating a New PDF

Use `reportlab`:

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [
    Paragraph("Report Title", styles['Title']),
    Spacer(1, 12),
    Paragraph("Body text... " * 20, styles['Normal']),
    PageBreak(),
    Paragraph("Page Two", styles['Heading1']),
]
doc.build(story)
```

**Pitfall**: Do not use Unicode subscript or superscript characters (₀₁₂ ⁰¹²) in reportlab. The built-in fonts do not support them, so they render as black boxes. Use the `<sub>` and `<super>` tags instead.

### 2.4 OCR for Scanned Documents

```python
# Dependencies: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img, lang='chi_sim')  # use chi_sim for Simplified Chinese
    print(f"Page {i+1}:\n{text}\n")
```

### 2.5 Other Operations

```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()

# Rotate a page (the change is only kept once the page is written out)
page = PdfReader("input.pdf").pages[0]
page.rotate(90)
writer.add_page(page)

# Add a watermark
watermark = PdfReader("watermark.pdf").pages[0]
for page in PdfReader("doc.pdf").pages:
    page.merge_page(watermark)
    writer.add_page(page)

# Encrypt, then save
writer.encrypt("user_password", "owner_password")
with open("output.pdf", "wb") as out:
    writer.write(out)
```

**Command-line tools**:
```bash
pdftotext -layout input.pdf output.txt          # extract text (preserving layout)
qpdf --empty --pages f1.pdf f2.pdf -- out.pdf   # merge
pdfimages -j input.pdf prefix                   # extract images
```

---

## 3. PPT Presentations (pptx)

### 3.1 Reading

```bash
# Text extraction
python -m markitdown presentation.pptx

# Thumbnail preview
python scripts/thumbnail.py presentation.pptx

# Raw XML
python scripts/office/unpack.py presentation.pptx unpacked/
```

### 3.2 Creating a New Presentation

Use `pptxgenjs` (Node.js). Install it with `npm install -g pptxgenjs`.

Without a template, build the deck from scratch; with a template, use the unpack, edit, repack workflow.

### 3.3 Editing an Existing File

```bash
# 1. Analyze the template
python scripts/thumbnail.py presentation.pptx

# 2. Unpack, edit, repack
python scripts/office/unpack.py presentation.pptx unpacked/
# Edit the XML...
python scripts/office/pack.py unpacked/ output.pptx --original presentation.pptx
```

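### 3.4 Design Guidelines

**Colors**: Do not default to blue; choose a palette that suits the topic. Use dark backgrounds for the opening and closing slides and light backgrounds for content slides.

**Font pairings** (heading/body): Georgia+Calibri, Arial Black+Arial, Cambria+Calibri

**Font sizes**:

| Element | Font size |
|---------|-----------|
| Slide title | 36-44pt, bold |
| Section heading | 20-24pt, bold |
| Body text | 14-16pt |
| Notes | 10-12pt |

**Common mistakes**:
- Do not repeat the same layout; vary the composition
- Left-align body text; center only titles
- Every slide needs a visual element (image, icon, or chart); avoid text-only slides
- Do not underline titles (an obvious sign of AI generation)
- Set the text box `margin: 0`, or offset the shapes, so that decorative lines align with the text edge
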
### 3.5 QA Checks (Required)

```bash
# Content check
python -m markitdown output.pptx

# Visual check: convert to images and review every slide
python scripts/office/soffice.py --headless --convert-to pdf output.pptx
pdftoppm -jpeg -r 150 output.pdf slide

# Check for leftover placeholders
python -m markitdown output.pptx | grep -iE "xxxx|lorem|ipsum"
```

**Checklist**: overlapping elements, text overflow, uneven spacing, low contrast, leftover placeholders.
**Complete at least one fix-and-verify cycle before delivering.**

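Where `grep` is unavailable (for example in a plain Windows shell), the placeholder check can be approximated in Python. This is a minimal sketch that reuses the same pattern as the `grep` command; the function name is hypothetical and the sample text stands in for the extracted slide text.

```python
import re

# Same alternatives as the grep -iE pattern, matched case-insensitively.
PLACEHOLDER_PATTERN = re.compile(r"xxxx|lorem|ipsum", re.IGNORECASE)


def find_placeholder_lines(text):
    """Return (line_number, line) pairs that still contain placeholder text."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER_PATTERN.search(line)
    ]


sample = "Quarterly Review\nLorem ipsum dolor sit amet\nRevenue grew 12%\nContact: XXXX"
hits = find_placeholder_lines(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```
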
---

## 4. Excel Spreadsheets (xlsx)

### 4.1 Reading

```bash
# Use the bundled script (handles Chinese encoding automatically)
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx>                          # list all sheets
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx>              # read one sheet (first 20 rows by default)
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx> <max_rows>   # set the number of rows
```

**How the Chinese encoding fix works**: The script reads the xlsx's internal XML (sharedStrings.xml) directly and decodes it as UTF-8, bypassing possible encoding problems in openpyxl.

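To make the shared-strings lookup concrete, here is a minimal sketch of reading `xl/sharedStrings.xml` straight from the zip package. It uses `xml.etree.ElementTree` instead of regular expressions, which is a substitution made here for robustness and not necessarily what the bundled script does; `load_shared_strings` is a hypothetical name, and the sample package is built in memory.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def load_shared_strings(xlsx_file):
    """Return the shared-string table of an xlsx as a list indexed by string id."""
    with zipfile.ZipFile(xlsx_file) as z:
        if "xl/sharedStrings.xml" not in z.namelist():
            return []
        root = ET.fromstring(z.read("xl/sharedStrings.xml"))
    strings = []
    for si in root.findall("m:si", NS):
        # A plain string has one <t>; rich text has one <t> inside each <r> run.
        pieces = si.findall("m:t", NS) + si.findall("m:r/m:t", NS)
        strings.append("".join(t.text or "" for t in pieces))
    return strings


shared_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<si><t>姓名</t></si>"
    "<si><r><t>销售</t></r><r><t>额</t></r></si>"
    "</sst>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("xl/sharedStrings.xml", shared_xml.encode("utf-8"))
buf.seek(0)

strings = load_shared_strings(buf)
print(strings)
```

A cell marked `t="s"` stores an index into this list, so keeping exactly one entry per `<si>` element is what keeps the indices aligned.
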
**Analysis with pandas**:
```python
import pandas as pd
df = pd.read_excel('file.xlsx')                            # first sheet by default
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)   # all sheets
df.describe()                                              # summary statistics
```

### 4.2 Creating and Editing

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill, Alignment

# Create a new file
wb = Workbook()
ws = wb.active
ws['A1'] = 'Title'
ws['A1'].font = Font(bold=True, size=14)
ws.column_dimensions['A'].width = 20
wb.save('new.xlsx')  # save before reusing the wb variable below

# Edit an existing file
wb = load_workbook('existing.xlsx')
ws = wb['Sheet1']
ws['B2'] = '=SUM(B3:B10)'

wb.save('output.xlsx')
```

### 4.3 Formula Rules (Non-Negotiable)

> **Always use Excel formulas. Never compute a value in Python and hard-code the result into a cell!**

```python
# ❌ Wrong: compute in Python and write a fixed value
total = df['Sales'].sum()
sheet['B10'] = total  # hard-codes 5000

# ✅ Right: write an Excel formula
sheet['B10'] = '=SUM(B2:B9)'
sheet['C5'] = '=(C4-C2)/C2'
sheet['D20'] = '=AVERAGE(D2:D19)'
```

After writing formulas, the workbook must be recalculated:
```bash
python scripts/recalc.py output.xlsx
```

The script returns JSON. If `status` is `errors_found`, fix the problems listed in `error_summary` and recalculate.

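### 4.4 Financial Model Conventions

**Color coding**:
- Blue text (0,0,255): hard-coded input values
- Black text (0,0,0): formulas and calculations
- Green text (0,128,0): cross-sheet references
- Red text (255,0,0): links to external files
- Yellow background (255,255,0): key assumptions

**Number formats**:
- Format years as text (so that 2024 does not become 2,024)
- Use `$#,##0` for currency and state the unit in the header
- Display zero values as `-`
- Show negative numbers in parentheses, `(123)`, not with a minus sign
- Percentages default to `0.0%`
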
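These conventions are easier to apply consistently when they live in one place as constants. The sketch below is one possible encoding, using only the standard library; all names are hypothetical. The combined currency format relies on Excel's standard `positive;negative;zero` section syntax to cover the parentheses and zero-dash rules in a single string.

```python
def rgb_to_hex(rgb):
    """Convert an (R, G, B) tuple to a 6-digit uppercase hex color string."""
    r, g, b = rgb
    return f"{r:02X}{g:02X}{b:02X}"


# Font colors by cell role, taken from the color-coding list above.
FONT_COLORS = {
    "input": rgb_to_hex((0, 0, 255)),        # hard-coded input values
    "formula": rgb_to_hex((0, 0, 0)),        # formulas and calculations
    "cross_sheet": rgb_to_hex((0, 128, 0)),  # cross-sheet references
    "external": rgb_to_hex((255, 0, 0)),     # links to external files
}
KEY_ASSUMPTION_FILL = rgb_to_hex((255, 255, 0))  # yellow background

# Excel number-format sections: positive;negative;zero
CURRENCY_FORMAT = '$#,##0;($#,##0);"-"'
PERCENT_FORMAT = "0.0%"

print(FONT_COLORS, KEY_ASSUMPTION_FILL, CURRENCY_FORMAT)
```

With openpyxl, these values would typically be applied through `Font(color=...)`, `PatternFill(...)`, and the cell's `number_format` attribute.
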
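### 4.5 Notes

- openpyxl cell indices start at 1
- `data_only=True` reads the calculated values, but saving the workbook afterwards permanently discards the formulas
- When reading with pandas, pass `dtype={'id': str}` to avoid type-inference problems
- Put assumptions in their own cells and reference them from formulas; do not embed constants
- Cross-sheet reference format: `Sheet1!A1`
- Check for formula errors: `#REF!`, `#DIV/0!`, `#VALUE!`, `#NAME?`
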
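The formula-error check in the last bullet can be automated once the calculated values are available as rows. The following sketch scans a list of rows for the four error literals and reports each hit with its 1-based cell reference; the function names are hypothetical and the sample rows are made up for illustration.

```python
FORMULA_ERRORS = {"#REF!", "#DIV/0!", "#VALUE!", "#NAME?"}


def column_letter(index):
    """Convert a 1-based column index to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def find_formula_errors(rows):
    """Return (cell_ref, value) pairs for every cell holding an error literal."""
    found = []
    for row_idx, row in enumerate(rows, start=1):
        for col_idx, value in enumerate(row, start=1):
            if isinstance(value, str) and value.strip() in FORMULA_ERRORS:
                found.append((f"{column_letter(col_idx)}{row_idx}", value))
    return found


rows = [["Item", "Growth"], ["A", 0.12], ["B", "#DIV/0!"], ["C", "#REF!"]]
errors = find_formula_errors(rows)
print(errors)
```

With openpyxl, the rows could come from `ws.iter_rows(values_only=True)` on a workbook opened with `data_only=True`, provided that workbook is not saved afterwards.
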
---

## 5. Summary of Chinese Encoding Issues

| File type | Root cause | Solution |
|-----------|------------|----------|
| xlsx | WPS internal XML may be GBK but declares UTF-8 | Read the XML inside the zip directly and decode it correctly |
| docx | Same as above | Read word/document.xml directly; try UTF-8, then GBK |
| pdf | Extracted Chinese text is garbled in the Windows terminal | Save to a UTF-8 file, then read that file |
| pptx | Similar to docx | Use markitdown or unpack the XML |

---

## 6. Installing Dependencies

```bash
# Python core (reading)
pip install python-docx openpyxl pypdf pdfplumber

# PDF creation and OCR
pip install reportlab pytesseract pdf2image

# PPT text extraction
pip install "markitdown[pptx]" Pillow

# Node.js (document creation)
npm install -g docx        # Word creation
npm install -g pptxgenjs   # PPT creation

# System tools (optional)
# LibreOffice - PDF conversion, formula recalculation
# Poppler (pdftoppm/pdftotext/pdfimages) - PDF processing
# Tesseract - OCR
```

## 7. Output File Locations

| File type | Default output path |
|-----------|---------------------|
| xlsx read | `temp/xlsx_output.txt` |
| docx read | `temp/docx_output.txt` |
| pdf read | `temp/pdf_output.txt` |

Output is saved to a file so that Chinese text can still be viewed correctly when the Windows terminal garbles it.
@@ -0,0 +1,146 @@
# File Read/Write/Edit Skill

## Overview

Supports **reading, creating, and editing** four kinds of office documents: Word, Excel, PDF, and PPT.
It also provides workarounds for encoding problems in Chinese documents created with WPS/Office.

## Use Cases

- Read the contents of `.docx` / `.xlsx` / `.pdf` / `.pptx` / **`.xls`** files
- Create new office documents (reports, spreadsheets, presentations, PDFs)
- Edit existing documents (content, formatting, formulas, etc.)
- Fix garbled Chinese text caused by encoding problems
- Merge, split, OCR, and watermark PDFs
- Build Excel formula models
- **Handle legacy Excel (.xls) files**

---

## 1. Word Documents (docx)

[...existing content unchanged...]

---

## 2. PDF Documents (pdf)

[...existing content unchanged...]

---

## 3. PPT Presentations (pptx)

[...existing content unchanged...]

---

## 4. Excel Spreadsheets (xlsx/xls)

### 4.1 Reading

```bash
# Use the bundled script (handles Chinese encoding automatically)
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx>                          # list all sheets
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx>              # read one sheet (first 20 rows by default)
python .opencode/skills/file-reader/scripts/file_reader.py <file.xlsx> <sheet_idx> <max_rows>   # set the number of rows

# New: .xls support
python .opencode/skills/file-reader/scripts/file_reader.py <file.xls>                           # read a legacy Excel file
```

**How the Chinese encoding fix works**:
- **.xlsx files**: read the xlsx's internal XML (sharedStrings.xml) directly and decode it as UTF-8, bypassing possible encoding problems in openpyxl
- **.xls files**: handled through the xlrd library, trying GBK and UTF-8, with special handling for Chinese .xls files created by WPS

### 4.2 Special Handling for .xls Files

For files in .xls format, use the following approach:

```python
import pandas as pd
import xlrd

def read_xls_with_chinese_encoding(file_path):
    """Read an .xls file that contains Chinese text."""
    try:
        # Try reading it as a real .xls workbook first
        df = pd.read_excel(file_path, engine='xlrd')
        return df
    except (UnicodeDecodeError, ValueError, xlrd.XLRDError):
        # On failure, try text encodings (strict UTF-8 first, since GBK
        # decoding can silently succeed on UTF-8 bytes and yield garbage)
        encodings = ['utf-8', 'gbk', 'gb2312']
        for encoding in encodings:
            try:
                # For tab-separated text files mislabeled as .xls
                df = pd.read_csv(file_path, encoding=encoding, sep='\t')
                return df
            except (UnicodeDecodeError, pd.errors.ParserError):
                continue
        raise ValueError("Unable to detect the file encoding")
```

### 4.3 Creating and Editing

[...existing content unchanged...]

### 4.4 Formula Rules (Non-Negotiable)

[...existing content unchanged...]

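### 4.5 Notes

- **.xlsx files**: handle with openpyxl
- **.xls files**: handle with xlrd. Note that xlrd 2.0+ reads only the .xls format (it dropped .xlsx support), so no version pin is needed for .xls, and recent pandas releases require xlrd 2.0.1 or later
- **Mislabeled files**: if a file is actually tab-separated text with an .xls extension, use `pandas.read_csv()` with `sep='\t'`
- **Chinese encoding**: .xls files created by WPS often use GBK, while those created by Office may use UTF-8
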
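Before choosing a reader, it can help to check what a file really is rather than trusting its extension. The sketch below inspects the leading bytes: legacy binary .xls files are OLE2 compound documents, and .xlsx files are zip archives. The function name and return labels are hypothetical, and the sample files are created on the fly for illustration.

```python
import os
import tempfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx and other zip packages


def sniff_spreadsheet_format(path):
    """Guess the real format of a spreadsheet file from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "text"  # for example tab-separated data saved with an .xls extension


workdir = tempfile.mkdtemp()

# Tab-separated GBK text that merely carries an .xls extension.
fake_xls = os.path.join(workdir, "fake.xls")
with open(fake_xls, "wb") as f:
    f.write("姓名\t金额\n张三\t100\n".encode("gbk"))

# A file that starts with the OLE2 signature (not a complete workbook).
binary_xls = os.path.join(workdir, "real.xls")
with open(binary_xls, "wb") as f:
    f.write(OLE2_MAGIC + b"\x00" * 16)

fake_kind = sniff_spreadsheet_format(fake_xls)
binary_kind = sniff_spreadsheet_format(binary_xls)
print(fake_kind, binary_kind)
```

A result of `text` suggests going straight to `pandas.read_csv()` with `sep='\t'` instead of waiting for the Excel reader to fail.
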
---

## 5. Summary of Chinese Encoding Issues

| File type | Root cause | Solution |
|-----------|------------|----------|
| xlsx | WPS internal XML may be GBK but declares UTF-8 | Read the XML inside the zip directly and decode it correctly |
| **xls** | **.xls files created by WPS use GBK** | **Read with xlrd and handle GBK decoding** |
| docx | Same as xlsx | Read word/document.xml directly; try UTF-8, then GBK |
| pdf | Extracted Chinese text is garbled in the Windows terminal | Save to a UTF-8 file, then read that file |
| pptx | Similar to docx | Use markitdown or unpack the XML |

---

## 6. Installing Dependencies

```bash
# Python core (reading)
pip install python-docx openpyxl pypdf pdfplumber

# Excel .xls support (important!)
# xlrd 2.x still reads .xls; recent pandas releases require xlrd 2.0.1 or later
pip install "xlrd>=2.0.1"

# PDF creation and OCR
pip install reportlab pytesseract pdf2image

# PPT text extraction
pip install "markitdown[pptx]" Pillow

# Node.js (document creation)
npm install -g docx        # Word creation
npm install -g pptxgenjs   # PPT creation

# System tools (optional)
# LibreOffice - PDF conversion, formula recalculation
# Poppler (pdftoppm/pdftotext/pdfimages) - PDF processing
# Tesseract - OCR
```

## 7. Output File Locations

| File type | Default output path |
|-----------|---------------------|
| xlsx read | `temp/xlsx_output.txt` |
| **xls read** | **`temp/xls_output.txt`** |
| docx read | `temp/docx_output.txt` |
| pdf read | `temp/pdf_output.txt` |

Output is saved to a file so that Chinese text can still be viewed correctly when the Windows terminal garbles it.
@@ -0,0 +1,3 @@
# Example Reference

This is an example reference file. Delete if not needed.
@@ -0,0 +1,2 @@
# file-reader - dependencies
python-docx>=0.0.1
@@ -0,0 +1,4 @@
#!/usr/bin/env python3
"""Example script - delete if not needed."""

print("Hello from skill!")
@@ -0,0 +1,234 @@
# -*- coding: utf-8 -*-
"""
File reading tool - reads binary office files such as docx and xlsx.
Works around Chinese encoding problems.
"""

import sys
import os
import zipfile
import re


def read_docx(file_path):
    """Read the text content of a docx file."""
    try:
        with zipfile.ZipFile(file_path, "r") as z:
            with z.open("word/document.xml") as f:
                content = f.read()

        # The XML inside a docx may be GBK-encoded even though it declares UTF-8
        try:
            text = content.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            text = content.decode("gbk", errors="ignore")

        # Extract the text runs
        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)

        content_lines = []
        for p in paragraphs:
            if p.strip():
                content_lines.append(p)

        # Read tables: find each table, then its rows and cells
        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
        for table in tables:
            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
            for row in rows:
                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
                row_data = []
                for cell in cells:
                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
                    if cell_text:
                        row_data.append("".join(cell_text).strip())
                if row_data:
                    content_lines.append(" | ".join(row_data))

        return "\n".join(content_lines)

    except Exception as e:
        return f"Error reading docx: {e}"


def read_docx_and_save(file_path):
    """Read a docx file and save its text as a UTF-8 text file."""
    try:
        with zipfile.ZipFile(file_path, "r") as z:
            with z.open("word/document.xml") as f:
                content = f.read()

        # Try UTF-8 first, then GBK (with errors="ignore" the GBK decode cannot fail)
        try:
            text = content.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            text = content.decode("gbk", errors="ignore")

        # Extract the text runs
        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)

        content_lines = []
        for p in paragraphs:
            if p.strip():
                content_lines.append(p)

        # Read tables
        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
        for table in tables:
            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
            for row in rows:
                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
                row_data = []
                for cell in cells:
                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
                    if cell_text:
                        row_data.append("".join(cell_text).strip())
                if row_data:
                    content_lines.append(" | ".join(row_data))

        output = "\n".join(content_lines)

        # Save to a file
        output_file = "temp/docx_output.txt"
        os.makedirs("temp", exist_ok=True)
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(output)

        return f"[Content saved to temp/docx_output.txt]\n\n{output}"

    except Exception as e:
        return f"Error: {e}"


def read_xlsx(file_path, sheet_index=0, max_rows=20):
    """Read an xlsx file so that Chinese displays correctly (output is saved to a file)."""
    try:
        with zipfile.ZipFile(file_path, "r") as z:
            # Read sharedStrings (UTF-8 encoded)
            string_map = {}
            if "xl/sharedStrings.xml" in z.namelist():
                with z.open("xl/sharedStrings.xml") as f:
                    ss_content = f.read()
                text = ss_content.decode("utf-8", errors="ignore")
                # One entry per <si>, so rich-text strings keep the indices aligned
                items = re.findall(r"<si>(.*?)</si>", text, re.DOTALL)
                for i, item in enumerate(items):
                    string_map[i] = "".join(re.findall(r"<t[^>]*>([^<]*)</t>", item))

            # Read the sheet (archive order, which may differ from the tab order)
            sheet_files = [
                f
                for f in z.namelist()
                if f.startswith("xl/worksheets/sheet") and f.endswith(".xml")
            ]
            if sheet_index >= len(sheet_files):
                return f"Sheet index {sheet_index} out of range"

            sheet_file = sheet_files[sheet_index]
            with z.open(sheet_file) as f:
                sheet_content = f.read().decode("utf-8", errors="ignore")

            # Parse the data
            rows = re.findall(
                r'<row r="(\d+)"[^>]*>(.*?)</row>', sheet_content, re.DOTALL
            )

            result = [f"=== Sheet {sheet_index} ==="]
            for row_num, row_content in rows[:max_rows]:
                # Capture the <c> attributes too; [^>/] skips self-closing empty cells
                cells = re.findall(
                    r'<c r="([A-Z]+\d+)"([^>/]*)>(.*?)</c>', row_content, re.DOTALL
                )
                row_data = []
                for cell_ref, cell_attrs, cell_content in cells:
                    v_match = re.search(r"<v>([^<]*)</v>", cell_content)
                    t_match = re.search(r"<is><t[^>]*>([^<]*)</t></is>", cell_content)

                    if t_match:
                        val = t_match.group(1)
                    elif v_match:
                        v = v_match.group(1)
                        # t="s" is an attribute of <c>, so check the tag, not its body
                        if 't="s"' in cell_attrs and v.isdigit():
                            idx = int(v)
                            val = string_map.get(idx, f"[str:{v}]")
                        else:
                            try:
                                val = str(int(v)) if "." not in v else str(float(v))
                            except ValueError:
                                val = v
                    else:
                        val = ""

                    if val:
                        row_data.append(val)

                if row_data:
                    result.append(" | ".join(row_data))

            output = "\n".join(result)

            # Write to a file so that Chinese displays correctly
            output_file = "temp/xlsx_output.txt"
            os.makedirs("temp", exist_ok=True)
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(output)

            return f"[Content saved to temp/xlsx_output.txt]\n\n{output}"

    except Exception as e:
        return f"Error: {e}"


def list_sheets(file_path):
    """List all sheets in an xlsx file."""
    try:
        import openpyxl

        wb = openpyxl.load_workbook(file_path, data_only=True)
        sheets = wb.sheetnames
        return "Available sheets:\n" + "\n".join(f"{i}: {s}" for i, s in enumerate(sheets))
    except Exception as e:
        return f"Error: {e}"


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python file_reader.py <file.docx>")
        print("  python file_reader.py <file.xlsx>                    # list all sheets")
        print("  python file_reader.py <file.xlsx> <sheet_index>      # read the given sheet")
        print("  python file_reader.py <file.xlsx> <sheet_index> <max_rows>")
        print("\nExamples:")
        print("  python file_reader.py test.xlsx          # list sheets")
        print("  python file_reader.py test.xlsx 1        # read the 2nd sheet")
        print("  python file_reader.py test.xlsx 1 30     # read the first 30 rows")
        sys.exit(1)

    file_path = sys.argv[1]

    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        sys.exit(1)

    ext = os.path.splitext(file_path)[1].lower()

    if ext == ".docx":
        print(read_docx_and_save(file_path))
    elif ext in [".xlsx", ".xls"]:
        # Note: both helpers expect a zip-based .xlsx; a binary .xls reports an error
        if len(sys.argv) == 2:
            print(list_sheets(file_path))
        else:
            sheet_index = int(sys.argv[2])
            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
            print(read_xlsx(file_path, sheet_index, max_rows))
    else:
        print(f"Unsupported file type: {ext}")
@@ -0,0 +1,265 @@
# -*- coding: utf-8 -*-
"""
File reading tool - reads binary office files such as docx, xlsx, and xls.
Works around Chinese encoding problems.
"""

import sys
import os
import zipfile
import re
import pandas as pd


def read_docx_and_save(file_path):
    """Read a docx file and save its text as a UTF-8 text file."""
    try:
        with zipfile.ZipFile(file_path, "r") as z:
            with z.open("word/document.xml") as f:
                content = f.read()

        # Try UTF-8 first, then GBK (with errors="ignore" the GBK decode cannot fail)
        try:
            text = content.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            text = content.decode("gbk", errors="ignore")

        # Extract the text runs
        paragraphs = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", text)

        content_lines = []
        for p in paragraphs:
            if p.strip():
                content_lines.append(p)

        # Read tables
        tables = re.findall(r"<w:tbl>(.*?)</w:tbl>", text, re.DOTALL)
        for table in tables:
            rows = re.findall(r"<w:tr>(.*?)</w:tr>", table, re.DOTALL)
            for row in rows:
                cells = re.findall(r"<w:tc>(.*?)</w:tc>", row, re.DOTALL)
                row_data = []
                for cell in cells:
                    cell_text = re.findall(r"<w:t[^>]*>([^<]*)</w:t>", cell)
                    if cell_text:
                        row_data.append("".join(cell_text).strip())
                if row_data:
                    content_lines.append(" | ".join(row_data))

        output = "\n".join(content_lines)

        # Save to a file
        output_file = "temp/docx_output.txt"
        os.makedirs("temp", exist_ok=True)
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(output)

        return f"[Content saved to temp/docx_output.txt]\n\n{output}"

    except Exception as e:
        return f"Error: {e}"


def read_xlsx(file_path, sheet_index=0, max_rows=20):
    """Read an xlsx file so that Chinese displays correctly (output is saved to a file)."""
    try:
        with zipfile.ZipFile(file_path, "r") as z:
            # Read sharedStrings (UTF-8 encoded)
            string_map = {}
            if "xl/sharedStrings.xml" in z.namelist():
                with z.open("xl/sharedStrings.xml") as f:
                    ss_content = f.read()
                text = ss_content.decode("utf-8", errors="ignore")
                # One entry per <si>, so rich-text strings keep the indices aligned
                items = re.findall(r"<si>(.*?)</si>", text, re.DOTALL)
                for i, item in enumerate(items):
                    string_map[i] = "".join(re.findall(r"<t[^>]*>([^<]*)</t>", item))

            # Read the sheet (archive order, which may differ from the tab order)
            sheet_files = [
                f
                for f in z.namelist()
                if f.startswith("xl/worksheets/sheet") and f.endswith(".xml")
            ]
            if sheet_index >= len(sheet_files):
                return f"Sheet index {sheet_index} out of range"

            sheet_file = sheet_files[sheet_index]
            with z.open(sheet_file) as f:
                sheet_content = f.read().decode("utf-8", errors="ignore")

            # Parse the data
            rows = re.findall(
                r'<row r="(\d+)"[^>]*>(.*?)</row>', sheet_content, re.DOTALL
            )

            result = [f"=== Sheet {sheet_index} ==="]
            for row_num, row_content in rows[:max_rows]:
                # Capture the <c> attributes too; [^>/] skips self-closing empty cells
                cells = re.findall(
                    r'<c r="([A-Z]+\d+)"([^>/]*)>(.*?)</c>', row_content, re.DOTALL
                )
                row_data = []
                for cell_ref, cell_attrs, cell_content in cells:
                    v_match = re.search(r"<v>([^<]*)</v>", cell_content)
                    t_match = re.search(r"<is><t[^>]*>([^<]*)</t></is>", cell_content)

                    if t_match:
                        val = t_match.group(1)
                    elif v_match:
                        v = v_match.group(1)
                        # t="s" is an attribute of <c>, so check the tag, not its body
                        if 't="s"' in cell_attrs and v.isdigit():
                            idx = int(v)
                            val = string_map.get(idx, f"[str:{v}]")
                        else:
                            try:
                                val = str(int(v)) if "." not in v else str(float(v))
                            except ValueError:
                                val = v
                    else:
                        val = ""

                    if val:
                        row_data.append(val)

                if row_data:
                    result.append(" | ".join(row_data))

            output = "\n".join(result)

            # Write to a file so that Chinese displays correctly
            output_file = "temp/xlsx_output.txt"
            os.makedirs("temp", exist_ok=True)
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(output)

            return f"[Content saved to temp/xlsx_output.txt]\n\n{output}"

    except Exception as e:
        return f"Error: {e}"


def read_xls(file_path, sheet_index=0, max_rows=20):
    """Read an xls file, handling Chinese encoding."""
    try:
        # First try reading directly with pandas
        df = pd.read_excel(file_path, sheet_name=sheet_index, nrows=max_rows)

        # Convert to text
        result = [f"=== Sheet {sheet_index} ==="]
        # Column names
        result.append(" | ".join(str(col) for col in df.columns))

        # Data rows
        for idx, row in df.iterrows():
            row_data = []
            for val in row:
                if pd.isna(val):
                    row_data.append("")
                else:
                    row_data.append(str(val))
            result.append(" | ".join(row_data))

        output = "\n".join(result)

        # Save to a file
        output_file = "temp/xls_output.txt"
        os.makedirs("temp", exist_ok=True)
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(output)

        return f"[Content saved to temp/xls_output.txt]\n\n{output}"

    except Exception as e:
        # If pandas fails, fall back to manual parsing (the file may be tab-separated text)
        try:
            # Strict UTF-8 first: GBK can silently mis-decode UTF-8 bytes
            encodings = ["utf-8", "gbk", "latin1"]
            for encoding in encodings:
                try:
                    df = pd.read_csv(
                        file_path, encoding=encoding, sep="\t", nrows=max_rows
                    )
                    result = [f"=== Sheet {sheet_index} (CSV format) ==="]
                    result.append(" | ".join(str(col) for col in df.columns))
                    for idx, row in df.iterrows():
                        row_data = []
                        for val in row:
                            if pd.isna(val):
                                row_data.append("")
                            else:
                                row_data.append(str(val))
                        result.append(" | ".join(row_data))
                    output = "\n".join(result)

                    output_file = "temp/xls_output.txt"
                    os.makedirs("temp", exist_ok=True)
                    with open(output_file, "w", encoding="utf-8") as f:
                        f.write(output)

                    return f"[Content saved to temp/xls_output.txt]\n\n{output}"
                except Exception:
                    continue
            return f"Error reading xls file: {e}"
        except Exception as e2:
            return f"Error: {e} | {e2}"


def list_sheets(file_path):
    """List all sheets in an Excel file."""
    try:
        excel_file = pd.ExcelFile(file_path)
        sheets = excel_file.sheet_names
        return "Available sheets:\n" + "\n".join(f"{i}: {s}" for i, s in enumerate(sheets))
    except Exception as e:
        return f"Error: {e}"


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python file_reader.py <file.docx>")
        print("  python file_reader.py <file.xlsx>                    # list all sheets")
        print("  python file_reader.py <file.xlsx> <sheet_index>      # read the given sheet")
        print("  python file_reader.py <file.xlsx> <sheet_index> <max_rows>")
        print("  python file_reader.py <file.xls>                     # list all sheets")
        print("  python file_reader.py <file.xls> <sheet_index>       # read the given sheet")
        print("\nExamples:")
        print("  python file_reader.py test.xlsx          # list sheets")
        print("  python file_reader.py test.xlsx 1        # read the 2nd sheet")
        print("  python file_reader.py test.xlsx 1 30     # read the first 30 rows")
        print("  python file_reader.py test.xls           # read an xls file")
        sys.exit(1)

    file_path = sys.argv[1]

    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        sys.exit(1)

    ext = os.path.splitext(file_path)[1].lower()

    if ext == ".docx":
        print(read_docx_and_save(file_path))
    elif ext == ".xlsx":
        if len(sys.argv) == 2:
            print(list_sheets(file_path))
        else:
            sheet_index = int(sys.argv[2])
            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
            print(read_xlsx(file_path, sheet_index, max_rows))
    elif ext == ".xls":
        if len(sys.argv) == 2:
            print(list_sheets(file_path))
        else:
            sheet_index = int(sys.argv[2])
            max_rows = int(sys.argv[3]) if len(sys.argv) > 3 else 20
            print(read_xls(file_path, sheet_index, max_rows))
    else:
        print(f"Unsupported file type: {ext}")
@@ -0,0 +1,92 @@
#!/usr/bin/env python3
"""
Safe PPTX reading script - does not depend on markitdown, avoiding Google Vision API problems.
"""

import sys
import os
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE


def extract_text_from_shape(shape):
    """Extract the text from a shape ("" if the shape carries no text)."""
    if not hasattr(shape, "text"):
        return ""
    return shape.text


def extract_text_from_slide(slide):
    """Extract all text from a slide."""
    texts = []

    # Extract the title
    title_shape = slide.shapes.title
    if title_shape is not None:
        title = title_shape.text
        if title.strip():
            texts.append(f"Title: {title}")

    # Extract every other shape that carries text (text boxes, placeholders, ...)
    for shape in slide.shapes:
        if title_shape is not None and shape.shape_id == title_shape.shape_id:
            continue  # already reported as the title above
        text = extract_text_from_shape(shape)
        if text.strip():
            texts.append(text)

    return texts


def read_pptx_safe(file_path):
    """Read a PPTX file safely; returns None on failure."""
    try:
        prs = Presentation(file_path)
        all_content = []

        for i, slide in enumerate(prs.slides):
            slide_content = extract_text_from_slide(slide)
            if slide_content:
                all_content.append(f"--- Slide {i + 1} ---")
                all_content.extend(slide_content)
                all_content.append("")

        return "\n".join(all_content)

    except Exception as e:
        print(f"Error reading PPTX: {e}", file=sys.stderr)
        return None


def main():
    if len(sys.argv) != 2:
        print("Usage: python pptx_reader_safe.py <presentation.pptx>", file=sys.stderr)
        sys.exit(1)

    file_path = sys.argv[1]
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}", file=sys.stderr)
        sys.exit(1)

    content = read_pptx_safe(file_path)
    # None signals a read failure; an empty string is a deck without text
    if content is not None:
        # Save to a temp file to avoid terminal encoding problems
        output_path = "temp/pptx_output.txt"
        os.makedirs("temp", exist_ok=True)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(content)
        print(f"Content extracted to: {output_path}")
        print("\n" + content)
    else:
        print("Failed to extract content", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()