从财报中提取财务数据
在金融领域,一项常见的任务是从财报中提取财务数据。财报的格式 notoriously 糟糕,因为 SEC 没有规定必须生成机器可读的文档。
财报通常以 HTML 文档的形式提供,这使得解析变得困难。投资者通常使用复杂的解析系统或手动审查来提取数据。甚至有整个公司专门致力于自动化这项任务。
本操作指南是一个概念验证,展示了如何使用 LLM 将财务数据直接提取到 CSV 中。逗号分隔值格式良好,可以用正则表达式定义,Outlines 可以使用正则表达式来指导 LLM 的输出。
此示例是完整演示的一个较小子集,完整演示可在此处找到。该演示包含将原始 HTML 转换为结构化 CSV 文件所需的完整预处理步骤,并测试了三家公司的 10-K 报告结果。
设置
安装 outlines 和所需依赖项
# Later versions of torch can have difficulty with certain CUDA drivers.
# We recommend using 2.4.0 for now, but you may wish to experiment with
# other versions.
pip install outlines pandas transformers torch==2.4.0 accelerate
加载模型
选择您的语言模型。我们将使用 Phi-3 mini,它足够小,可以在相对较小的机器上运行。
import outlines
import torch
model_name = 'microsoft/Phi-3-mini-4k-instruct'
model = outlines.models.transformers(
model_name,
device='auto',
model_kwargs={
# To reduce memory usage, we'll use bfloat16
"torch_dtype": torch.bfloat16,
},
)
设置数据
为简洁起见,我们附加了英伟达 10-K 报告的 markdown 版本。完整演示处理报告的原始 HTML 版本,将其转换为这些 markdown 表格。页面根据是否包含损益表进行过滤,然后压缩成您在下方看到的字符串。
income_statement = """
Table of ContentsNVIDIA Corporation and SubsidiariesConsolidated Statements of Income(In millions, except per share data)
| | | | | | | | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | Year Ended | | | | | | | | | | | | | | |
| | | | Jan 28, 2024 | | | | | | Jan 29, 2023 | | | | | | Jan 30, 2022 | | |
| Revenue | | | $ | 60,922 | | | | | $ | 26,974 | | | | | $ | 26,914 | |
| Cost of revenue | | | 16,621 | | | | | | 11,618 | | | | | | 9,439 | | |
| Gross profit | | | 44,301 | | | | | | 15,356 | | | | | | 17,475 | | |
| Operating expenses | | | | | | | | | | | | | | | | | |
| Research and development | | | 8,675 | | | | | | 7,339 | | | | | | 5,268 | | |
| Sales, general and administrative | | | 2,654 | | | | | | 2,440 | | | | | | 2,166 | | |
| Acquisition termination cost | | | | | | | | | 1,353 | | | | | | | | |
| Total operating expenses | | | 11,329 | | | | | | 11,132 | | | | | | 7,434 | | |
| Operating income | | | 32,972 | | | | | | 4,224 | | | | | | 10,041 | | |
| Interest income | | | 866 | | | | | | 267 | | | | | | 29 | | |
| Interest expense | | | (257) | | | | | | (262) | | | | | | (236) | | |
| Other, net | | | 237 | | | | | | (48) | | | | | | 107 | | |
| Other income (expense), net | | | 846 | | | | | | (43) | | | | | | (100) | | |
| Income before income tax | | | 33,818 | | | | | | 4,181 | | | | | | 9,941 | | |
| Income tax expense (benefit) | | | 4,058 | | | | | | (187) | | | | | | 189 | | |
| Net income | | | $ | 29,760 | | | | | $ | 4,368 | | | | | $ | 9,752 | |
| | | | | | | | | | | | | | | | | | |
| Net income per share: | | | | | | | | | | | | | | | | | |
| Basic | | | $ | 12\.05 | | | | | $ | 1\.76 | | | | | $ | 3\.91 | |
| Diluted | | | $ | 11\.93 | | | | | $ | 1\.74 | | | | | $ | 3\.85 | |
| | | | | | | | | | | | | | | | | | |
| Weighted average shares used in per share computation: | | | | | | | | | | | | | | | | | |
| Basic | | | 2,469 | | | | | | 2,487 | | | | | | 2,496 | | |
| Diluted | | | 2,494 | | | | | | 2,507 | | | | | | 2,535 | | |
"""
从财报中提取的 markdown 表格在行名、列数、数据类型等方面可能差异很大。LLM 在此处的优势在于,我们可以根据数据类型定义所需数据,LLM 将按照所需格式输出数据。
相比之下,以下是损益表在原始 HTML 中的样子
定义所需数据
Outlines 通常用于 JSON 输出,但也可用于 CSV。我们知道要提取的列,也知道列的数据类型。例如,年份始终是四位数字,收入是带逗号的数字,依此类推。
我们可以为每种列类型定义一个正则表达式模式
# Define the column type regex patterns
column_types = {
# Year is always a four-digit number
"year": r"\d{4}",
# Revenue, operating income, and net income are always numbers with commas.
# This regex permits integers that may begin with a minus sign, and may have
# commas separating the thousands, millions, etc.
"integer_comma": r"((-?\d+),?\d+|(-?\d+))",
# Number is currently not used, but it represents a number with up to two decimal places.
"number": r"(-?\d+(?:\.\d{1,2})?)",
}
接下来,选择我们要提取的列。我们需要:
- 年份,始终是四位数字
- 收入,一个带逗号的数字
- 营业收入,一个带逗号的数字
- 净利润,一个带逗号的数字
# Define the columns to extract, and their data types.
columns_to_extract = {
"year": "year",
"revenue": "integer_comma",
"operating_income": "integer_comma",
"net_income": "integer_comma",
}
您可以修改 column_type_regex
来匹配您要提取的列的数据类型。添加新的财务指标来提取就像向 columns_to_extract
添加新的键值对一样简单
额外列的准确性未经过充分测试,请谨慎使用。
创建描述所需数据的正则表达式
# Create the header line. This is the requested column names
# separated by commas, i.e. "year,revenue,..."
header = ",".join(columns_to_extract.keys())
# Create the data capture patterns. These are the regex patterns
# that will be used to capture the data in each column
data_patterns = [column_types[dtype] for dtype in columns_to_extract.values()]
data_line = ",".join(data_patterns)
# Our final regex pattern.
max_rows = 3 # We expect 3 rows of data, firms usually report 3 years of income statements
csv_regex = f"{header}(\n{data_line}){{,{max_rows}}}\n\n"
print(csv_regex)
这就得到了
year,revenue,operating_income,net_income,basic_earnings_per_share(
\d{4},((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+)),(-?\d+(?:\.\d{1,2})?)){,3}
相当复杂,对吧?幸运的是,我们有一个简单的函数可以为您构建此正则表达式。该正则表达式定义了一个标题行,后跟一个数据行,该数据行针对我们要提取的每行数据重复。将正则表达式传递给 outlines.generate.regex
将生成一个函数,该函数将始终生成一个与正则表达式一致的 CSV 字符串。
向模型发送提示
Outlines 默认不添加系统或指令 token,因此我们需要使用 transformers.AutoTokenizer
为我们使用的任何模型添加它们。
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
def add_instruction(prompt):
return tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
print(add_instruction("Howdy"))
我们的提示大致描述了我们希望模型执行的任务,以及它可能需要知道的一些关于损益表的信息。
def extract_financial_data_prompt(columns_to_extract, income_statement):
user_prompt = f"""
Extract annual financial data from this set of pages. Pages
are from a 10k filing and were chosen because they may contain
a comprehensive income statement. Note that selected pages may
be incorrectly extracted, so you should verify that you are extracting
from the comprehensive income statement and not some other financial
statement.
Create a row for each year available in the income statement with the
following columns: {', '.join(columns_to_extract.keys())}. Firms typically report the
most recent 3 years of data, but this can vary.
Each column has types: {', '.join(columns_to_extract.values())}.
# Relevant pages:
{income_statement}
# Key instructions:
1. Look ONLY at the "Consolidated Statements of Income" table
2. For operating income, look for "Income from operations" or "Operating income"
3. For net income, use the TOTAL net income figure, not amounts allocated to specific share classes
4. Use NULL for missing values
5. Operating income must be less than revenue
6. Net income must be less than operating income
7. Ignore segment breakdowns, quarterly data, or per-share amounts
# Output format:
- CSV format with headers: {','.join(columns_to_extract.keys())}
- Use NULL for missing values
- If no data are found, do not create a row.
- Enter two newline characters to terminate the CSV when no more data are found.
# Definitions:
- Revenue: Total sales of goods and services. Usually this is at the top of the
income statement.
- Operating income: Revenue minus operating expenses for the entire company. This is revenue
minus costs. Operating income is also called operating profit, EBIT, or income from
operations.
- Net income: Operating income minus taxes. This is the bottom line of the
income statement.
"""
return add_instruction(user_prompt)
运行模型
现在我们有了提示和正则表达式,我们可以运行模型了。
构建我们的正则表达式提取函数。我们将使用贪婪采样器,它在每一步采样最可能的下一个 token。这是一种简单的采样器,比多项式采样更具可复现性。
向模型提供提示并运行它
csv_data = csv_extractor(
extract_financial_data_prompt(columns_to_extract, income_statement),
max_tokens=1024,
)
print(csv_data)
year,revenue,operating_income,net_income
2024,60922,32972,29760
2023,26974,4224,4368
2022,26914,10041,9752
瞧!我们已经从损益表中提取了财务数据,并且经过检查是正确的。
您甚至可以将其加载到 pandas
DataFrame 中进行进一步分析