How to Call the API - Text Recognition - Volcengine

Last updated: 2025.03.05 11:04:45; first published: 2024.09.12 11:33:01

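API overview

This API performs deep parsing and structuring of digital or scanned PDFs and images. Using layout analysis and text recognition, it extracts text, tables, formulas, figures, and other key content in reading order, organizes the result into a semi-structured document that carries semantic information and logical structure, and returns it as Markdown and JSON. It covers common file types such as papers, books, industry reports, and internal company documents, and is intended to speed up the training, development, and application of large language models.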

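Limits

Name	Details
File requirements	1. File format: PDF or image (common image formats such as JPG, JPEG, PNG, and BMP; JPG is recommended). 2. File size: a. If you pass Base64, the payload must not exceed 8 MB after Base64 encoding and URL encoding. b. If you pass a full file URL, Volcengine object storage is recommended; other external links may be slower or less stable and can cause the request to fail. 3. When the input file is too large, the HTTP status code returned is one of 400, 413, or 502.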
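The size rule above can be checked locally before a request is sent. The following is a minimal sketch, assuming that "8 MB" means 8 * 1024 * 1024 bytes (the page does not say which definition applies); the helper names `encoded_size` and `fits_base64_limit` are illustrative and not part of any SDK.

```python
import base64
from urllib.parse import quote_plus

# Assumption: "8 MB" is read as 8 * 1024 * 1024 bytes; the page does not specify this.
MAX_ENCODED_BYTES = 8 * 1024 * 1024


def encoded_size(file_bytes: bytes) -> int:
    """Return the payload size after Base64 encoding followed by URL encoding."""
    b64 = base64.b64encode(file_bytes).decode("ascii")
    # quote_plus escapes "+", "/" and "=", so the result is longer than plain Base64.
    return len(quote_plus(b64))


def fits_base64_limit(file_bytes: bytes) -> bool:
    """True if the encoded payload stays within the assumed 8 MB limit."""
    return encoded_size(file_bytes) <= MAX_ENCODED_BYTES
```

If the check fails, the file can be sent by URL instead, or split into smaller parts first.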

Request description

Basic information

Name	Details
Endpoint	https://visual.volcengineapi.com
Method	POST
Content-Type	application/x-www-form-urlencoded
Authentication required	Yes

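Request parameters

The list below covers only the API-specific request parameters and the required common parameters. For the full list of common parameters, see Common Parameters.

Name	Type	Required	Description
X-Date	String	Yes	UTC time, accurate to the second, in the format `YYYYMMDD'T'HHMMSS'Z'`, for example `20201103T104027Z`
Authorization	String	Yes	The value has the form `HMAC-SHA256 Credential={AccessKeyId}/{ShortDate}/{Region}/{Service}/request, SignedHeaders={SignedHeaders}, Signature={Signature}`, where: HMAC-SHA256 is the signing method; Credential is the signing credential; AccessKeyId is the access key ID, available from Access Key; ShortDate is the short request date in UTC, accurate to the day, in the format `YYYYMMDD`, for example `20180201`; Region is the request region, usually `cn-north-1` within China; Service is the requested service, usually `cv` for text recognition; SignedHeaders lists the headers included in the signature calculation, with `content-type` and `host` required; Signature is the signature, computed as described in Signing Method. Note: SDKs and signing examples are provided for quick integration; see Quick Start.
X-Security-Token	String	No	The SessionToken of the temporary security credentials issued by the Security Token Service (STS). Not needed when long-term keys are used.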
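The header formats in the table can be illustrated with a short sketch. It formats `X-Date` and assembles the `Authorization` string from parts that are already known; it does not compute the signature, which is described in the signing documentation. The function names are illustrative, and `signed_headers` is passed in as a ready-made string because this page does not state how multiple header names are joined.

```python
from datetime import datetime, timezone


def format_x_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as X-Date (UTC, accurate to the second)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def build_authorization(access_key_id: str, short_date: str, signed_headers: str,
                        signature: str, region: str = "cn-north-1",
                        service: str = "cv") -> str:
    """Assemble the Authorization header value from precomputed parts.

    The signature must be computed separately; this only does the string layout.
    """
    credential = f"{access_key_id}/{short_date}/{region}/{service}/request"
    return (f"HMAC-SHA256 Credential={credential}, "
            f"SignedHeaders={signed_headers}, Signature={signature}")
```

In practice the SDK performs these steps, so hand-written code like this is mainly useful for debugging a rejected request.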

Query parameters

Parameter	Optional/Required	Type	Description
Action	Required	String	API name. Value: OCRPdf
Version	Required	String	API version. Value: 2021-08-23
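As a small illustration, the two query parameters can be appended to the endpoint as follows. This is a sketch only: it assumes the query string attaches directly to the endpoint root, and `build_request_url` is an illustrative name.

```python
from urllib.parse import urlencode

ENDPOINT = "https://visual.volcengineapi.com"


def build_request_url(action: str = "OCRPdf", version: str = "2021-08-23") -> str:
    """Return the endpoint with the Action and Version query parameters attached."""
    query = urlencode({"Action": action, "Version": version})
    return f"{ENDPOINT}?{query}"
```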

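Body parameters

Parameter	Optional/Required	Type	Description
image_base64	Either this or image_url	String	Base64 encoding of the file. Note: pass only the Base64 value of the file; the payload must not exceed 8 MB after Base64 encoding and URL encoding.
image_url	Either this or image_base64	String	URL of the file. Note: provide either image_base64 or image_url; if both are present, image_base64 is parsed first. Volcengine object storage is recommended; other external links may be slower or less stable.
version	Required	String	Version. Value: v3
file_type	Optional	String	File type: "pdf" or "image". Default: "pdf"
page_start	Optional	Int	Page at which PDF parsing starts. Default: 0
page_num	Optional	Int	Number of PDF pages to parse. Default: 16; maximum: 300
parse_mode	Optional	String	Text parsing mode: "auto" or "ocr". Default: "auto". "auto" combines text recognition with parsing and is faster; "ocr" uses text recognition only. If the "auto" result contains garbled text that does not match the original, try "ocr" to correct it.
table_mode	Optional	String	Format in which tables are returned: "html" or "markdown". Default: "markdown"
filter_header	Optional	String	Switch for filtering headers, footers, and footnotes: "true" or "false". Default: "true". When set to "false", these elements are inserted into the body text.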
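The constraints in the table can be enforced before the request is sent. The sketch below builds the form body and rejects values the table does not allow; `build_form` is an illustrative helper, not an SDK function, and the validation choices (for example raising `ValueError`) are assumptions.

```python
from typing import Optional


def build_form(image_base64: Optional[str] = None, image_url: Optional[str] = None,
               file_type: str = "pdf", page_start: int = 0, page_num: int = 16,
               parse_mode: str = "auto", table_mode: str = "markdown",
               filter_header: str = "true") -> dict:
    """Build the request body from the documented parameters and defaults."""
    if not image_base64 and not image_url:
        raise ValueError("provide image_base64 or image_url")
    if file_type not in ("pdf", "image"):
        raise ValueError("file_type must be 'pdf' or 'image'")
    if parse_mode not in ("auto", "ocr"):
        raise ValueError("parse_mode must be 'auto' or 'ocr'")
    if table_mode not in ("html", "markdown"):
        raise ValueError("table_mode must be 'html' or 'markdown'")
    if filter_header not in ("true", "false"):
        raise ValueError("filter_header must be 'true' or 'false'")
    if page_start < 0 or not 1 <= page_num <= 300:
        raise ValueError("page_start must be >= 0 and page_num between 1 and 300")

    form = {
        "version": "v3",
        "file_type": file_type,
        "page_start": page_start,
        "page_num": page_num,
        "parse_mode": parse_mode,
        "table_mode": table_mode,
        "filter_header": filter_header,
    }
    # image_base64 is parsed first when both are present, so send only one of them.
    if image_base64:
        form["image_base64"] = image_base64
    else:
        form["image_url"] = image_url
    return form
```

Sending only one of the two file fields keeps the request unambiguous, even though the service accepts both.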

Output description

Common output parameters

See Common Response Fields and Error Codes.

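Business output parameters

data fields

Field	Type	Description	Remarks
markdown	String	Markdown string	Contains the parsing result for the whole PDF. Images are returned as URL links that stay valid for 30 minutes; download and save them promptly.
detail	Array of Result	Structured PDF parsing information	See the result fields below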
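Because image links expire after 30 minutes, it can help to collect them from the Markdown as soon as the response arrives. The sketch below assumes images are embedded with the standard Markdown image syntax `![alt](url)`; the page does not show an example of such a link, so treat the pattern as an assumption. `extract_image_urls` is an illustrative name.

```python
import re
from typing import List

# Assumption: images appear as standard Markdown image links, ![alt](url).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def extract_image_urls(markdown: str) -> List[str]:
    """Return the image URLs found in a Markdown string, in document order."""
    return IMAGE_LINK.findall(markdown)
```

The returned URLs can then be downloaded with any HTTP client before the links expire.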

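result fields

Field	Type	Description	Remarks
page_md	String	Markdown string for the page
page_image_hw	String	Page image information	Shown as an object with `h` and `w` in the output example
textblocks	Array of Textblock	Paragraph information	See the textblock fields below
page_id	Int	Current page number	Page number within the current PDF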

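textblock fields

Field	Type	Description	Remarks
text	String	Paragraph text
box	Array of Float	Coordinates of the element block	Top-left and bottom-right coordinates (shown as `x0`, `y0`, `x1`, `y1` in the output example)
label	String	Category of the paragraph text	See the label values below
norm_box	Array of Float	Coordinates of the element block	Normalized coordinates
font_size	Int	Font size	For reference only
is_bold	Bool	Whether the text is bold	For reference only
is_italic	Bool	Whether the text is italic	For reference only
url	String	Hyperlink of the image	Used for rendering in the Markdown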
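The relationship between `norm_box` and `box` is not spelled out on this page, but the values in the output example are consistent with `norm_box` being `box` divided by the page width (for x) and height (for y) from `page_image_hw`. Under that assumption, a minimal sketch for converting normalized coordinates back to pixels looks like this (`denormalize_box` is an illustrative name):

```python
def denormalize_box(norm_box: dict, page_image_hw: dict) -> dict:
    """Convert a normalized box to pixel coordinates.

    Assumption: x values are scaled by the page width "w" and y values by the
    page height "h", which matches the numbers in the output example.
    """
    w, h = page_image_hw["w"], page_image_hw["h"]
    return {
        "x0": round(norm_box["x0"] * w),
        "y0": round(norm_box["y0"] * h),
        "x1": round(norm_box["x1"] * w),
        "y1": round(norm_box["y1"] * h),
    }
```

Normalized coordinates are convenient when the page is rendered at a different resolution, since the same call works with any target width and height.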

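label values

Label value	Type	Description
title	String	Title
author	String	Author
sec	String	Section heading
para	String	Ordinary paragraph
header	String	Page header
foot	String	Page footer
fnote	String	Footnote
image	String	Image
table	String	Table
cap	String	Figure or table caption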
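The labels make it possible to post-process a page on the client side, for example to keep only body content. The sketch below works on one entry of `detail` as shown in the output example; the choice of which labels count as "non-body" is an assumption for illustration, and both function names are hypothetical.

```python
from collections import defaultdict

# Assumption for illustration: headers, footers, and footnotes are not body text.
NON_BODY_LABELS = {"header", "foot", "fnote"}


def texts_by_label(page: dict) -> dict:
    """Group the text of each textblock on a page by its label."""
    grouped = defaultdict(list)
    for block in page.get("textblocks", []):
        grouped[block.get("label")].append(block.get("text", ""))
    return dict(grouped)


def body_text(page: dict) -> str:
    """Join the text of all textblocks except headers, footers, and footnotes."""
    return "\n".join(
        block.get("text", "")
        for block in page.get("textblocks", [])
        if block.get("label") not in NON_BODY_LABELS
    )
```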

Output example

{
    "code": 10000,
    "data": {
        "markdown": "Article 2: Active region PIC experiment\n63",
        "detail": [
            {
                "page_id": 0,
                "page_md": "Article 2: Active region PIC experiment\n63",
                "page_image_hw": {
                    "h": 1200,
                    "w": 848
                },
                "textblocks": [
                    {
                        "box": {
                            "x0": 172,
                            "y0": 175,
                            "x1": 591,
                            "y1": 207
                        },
                        "text": "Article 2: Active region PIC experiment",
                        "label": "para",
                        "norm_box": {
                            "y0": 0.14583333333333334,
                            "x1": 0.6969339622641509,
                            "y1": 0.1725,
                            "x0": 0.2028301886792453
                        },
                        "font_size": 14,
                        "is_bold": false,
                        "is_italic": false
                    },
                    {
                        "text": "63",
                        "label": "foot",
                        "norm_box": {
                            "x0": 0.4834905660377358,
                            "y0": 0.88,
                            "x1": 0.5094339622641509,
                            "y1": 0.8958333333333334
                        },
                        "font_size": 9,
                        "is_bold": false,
                        "is_italic": false,
                        "box": {
                            "x0": 410,
                            "y0": 1056,
                            "x1": 432,
                            "y1": 1075
                        }
                    }
                ]
            }
        ]
    },
    "message": "Success",
    "request_id": "021629427766315fdbddc01010500400000000000000068da22fd",
    "time_elapsed": "5.330714543s"
}

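Error codes

Common error codes

See Common Response Fields and Error Codes.

Business error codes

HttpCode	Error code	Error message	Description
200	10000	None	Request succeeded
401	50205	"Image Size Exceeds Maximum Limit: please compress the image"	File size exceeds the limit
400	50207	"Image Decode Error: image format unsupported"	File decoding error; the file content is empty or the format is wrong
401	50400	"Access denied due to invalid authentication information"	Authentication failed
404	50402	"Invalid Request URL"	Invalid request path
500	50500	"Internal Error: please contact with bytedance engineering team"	Internal error; contact the development team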
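A caller can turn these codes into exceptions instead of checking them at every call site. The sketch below assumes that error responses carry `code` and `message` in the same top-level fields as the success example; `OcrPdfError` and `check_response` are illustrative names, not SDK symbols.

```python
class OcrPdfError(Exception):
    """Raised when the response carries a business error code."""


# Descriptions taken from the business error code table.
BUSINESS_ERRORS = {
    50205: "File size exceeds the limit",
    50207: "File decoding error; the file content is empty or the format is wrong",
    50400: "Authentication failed",
    50402: "Invalid request path",
    50500: "Internal error; contact the development team",
}


def check_response(resp: dict) -> dict:
    """Return resp["data"] on success (code 10000), otherwise raise OcrPdfError."""
    code = resp.get("code")
    if code == 10000:
        return resp.get("data") or {}
    description = BUSINESS_ERRORS.get(code, "Unlisted code; see the common error codes")
    raise OcrPdfError(f"{code}: {resp.get('message', '')} ({description})")
```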

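Appendix

PDF pagination

For stable parsing, splitting PDFs with many pages is recommended.
You can also control the parsed range with page_start and page_num.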

import math
import os

import fitz  # PyMuPDF==1.18.19


def split_pdf(input_dir, output_dir, split_num=16):
    """
    Split every PDF file in a directory into several smaller PDF files.

    Args:
    input_dir (str): directory that contains the input PDF files.
    output_dir (str): directory that receives the split PDF files.
    split_num (int): number of pages per output file, 16 by default.
    """
    os.makedirs(output_dir, exist_ok=True)
    # Iterate over all PDF files in the input directory
    for pdf_name in os.listdir(input_dir):
        if not pdf_name.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(input_dir, pdf_name)
        pdf = fitz.open(pdf_path)
        count = pdf.page_count  # total number of pages in the PDF
        stem = os.path.splitext(pdf_name)[0]  # file name without ".pdf"
        for chunk in range(math.ceil(count / split_num)):
            output_pdf = fitz.open()  # create a new, empty PDF

            # First and last page of this chunk (zero-based, inclusive)
            start_page = chunk * split_num
            end_page = min(start_page + split_num - 1, count - 1)
            output_pdf.insert_pdf(pdf, from_page=start_page, to_page=end_page)  # copy the pages into the new PDF

            out_name = os.path.join(output_dir, f"{stem}_{start_page}_{end_page}.pdf")
            output_pdf.save(out_name)  # save the new PDF file
            output_pdf.close()
        pdf.close()
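As an alternative to physically splitting the file, the same document can be requested several times with different `page_start` and `page_num` values. The sketch below computes those windows; it assumes `page_start` is zero-based (its documented default is 0), and `page_windows` is an illustrative name.

```python
def page_windows(total_pages: int, window: int = 16) -> list:
    """Return (page_start, page_num) pairs that cover total_pages pages.

    Assumption: page_start is zero-based. The window must not exceed 300,
    the documented maximum for page_num.
    """
    if not 1 <= window <= 300:
        raise ValueError("window must be between 1 and 300")
    return [
        (start, min(window, total_pages - start))
        for start in range(0, total_pages, window)
    ]
```

For a 40-page PDF with the default window, this yields three requests: pages 0 to 15, 16 to 31, and 32 to 39.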

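Uploading to Volcengine object storage and requesting by URL

If the Base64-encoded file exceeds 8 MB, upload it to Volcengine object storage and send the request with its URL.
For details, see the object storage SDK.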

import os
import posixpath

import tos


def upload_file_tob(pdf_path, bucket_name, object_key_prefix):
    """
    Upload a PDF file to TOS (the object storage service).

    Args:
    pdf_path (str): local path of the PDF file to upload.
    bucket_name (str): name of the TOS bucket.
    object_key_prefix (str): prefix of the object key.

    Returns:
    str: URL of the uploaded file.
    """

    ak = ""  # access key ID
    sk = ""  # secret access key
    endpoint = "tos-cn-beijing.volces.com"
    region = "cn-beijing"
    # Object keys use "/" on every platform, so join with posixpath rather than os.path
    object_key = posixpath.join(object_key_prefix, os.path.basename(pdf_path))
    url = f"https://{bucket_name}.{endpoint}/{object_key}"

    with open(pdf_path, "rb") as f:
        content = f.read()

    try:
        client = tos.TosClientV2(ak, sk, endpoint, region)
        result = client.put_object(bucket_name, object_key, content=content)
        # HTTP status code
        print('http status code:{}'.format(result.status_code))
        # Request ID: the unique identifier of this request; logging it is recommended
        print('request_id: {}'.format(result.request_id))
        # hash_crc64_ecma is the 64-bit CRC of the object and can be used to verify the upload
        print('crc64: {}'.format(result.hash_crc64_ecma))
    except Exception as e:
        print('fail with unknown error: {}'.format(e))
        raise  # do not return a URL for an object that was not uploaded

    return url

Python request example

Download the Python SDK

import base64
import json

from volcengine.visual.VisualService import VisualService

if __name__ == '__main__':
    visual_service = VisualService()
    # Call the two methods below if ak and sk are not set in $HOME/.volc/config
    visual_service.set_ak('')
    visual_service.set_sk('')

    path = ""  # local path of the image or PDF to parse
    with open(path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode()

    form = {
        "image_base64": image_base64,     # Base64 of the file (image or PDF)
        "image_url": "",                  # file URL; image_base64 is parsed first if both are set
        "version": "v3",                  # version
        "page_start": 0,                  # first page to parse
        "page_num": 16,                   # number of pages to parse
        "table_mode": "html",             # format in which tables are returned
        "filter_header": "true"           # filter headers, footers, and footnotes
    }

    # Send the request
    resp = visual_service.ocr_pdf(form)

    if resp.get("data"):
        markdown = resp["data"]["markdown"]  # Markdown string
        detail = resp["data"]["detail"]      # detailed structured information

        with open("resp.md", "w", encoding="utf-8") as f:
            f.write(markdown)

        # The original sample decodes detail with json.loads; do so only if it arrives as a string
        if isinstance(detail, str):
            detail = json.loads(detail)

        # Save the JSON
        with open("resp.json", "w", encoding="utf-8") as f:
            json.dump(detail, f, indent=4, ensure_ascii=False)
    else:
        print("request error")