JP6889038B2

JP6889038B2 - Financial results analysis system and financial results analysis program

Info

Publication number: JP6889038B2
Application number: JP2017110899A
Authority: JP
Inventors: 洋二郎関; 大輔宮代; 夏樹八木
Original assignee: Xenodata Lab Co Ltd
Current assignee: Xenodata Lab Co Ltd
Priority date: 2017-01-23
Filing date: 2017-06-05
Publication date: 2021-06-18
Anticipated expiration: 2037-01-23
Also published as: JP2018120567A

Description

The present invention relates to a financial results analysis system and a financial results analysis program that analyze financial-results-related information and extract the factors behind accounting events.

In general, financial-results-related information, such as the summaries of financial results disclosed through stock exchanges, revisions to earnings forecasts announced by companies, and financial results analysis reports published by securities firms, is an important source of information for individual and institutional investors when trading stocks, bonds, and other investments. For this reason, various methods have been proposed for processing financial-results-related information automatically by computer.

For example, Patent Document 1 discloses an article creation system that creates articles based on the results of analyzing financial statements. The system has a template containing a condition on information obtained from financial statement data and a character string to be used when the obtained information satisfies that condition. When financial statement data written in XBRL (eXtensible Business Reporting Language) is received, the system determines whether the information obtained from the data satisfies the condition described in the template. If it does, article data representing an article that includes the character string described in the template is created automatically. This shortens the time required to create articles based on the analysis of financial statements.

Patent Document 2 discloses an apparatus for analyzing a real estate portfolio. The apparatus stores attribute data for a plurality of real estate properties, financial results data on the operation of those properties, and standardization rule information for standardizing the financial results data used in the analysis. The financial results data is converted into predetermined standardized data based on the standardization rule information. Indicators such as the rate of return are then calculated from the standardized data and the property attribute data. This makes portfolio analysis for real estate investment easy to perform.

Patent Document 1: Japanese Unexamined Patent Publication No. 2011-008527
Patent Document 2: Japanese Unexamined Patent Publication No. 2008-140294

With the spread of the Internet and the like, a wide variety of financial-results-related information has become easy to obtain, and collecting and analyzing this information and turning it into reports is useful for individual and institutional investors. What is particularly useful is to extract and identify, for an accounting event expressed mainly by an account item and an amount (for example, "operating profit was XX hundred million yen"), the factor behind it (for example, "due to the effect of exchange rates and the like"). Conventionally, however, no method exists for automatically extracting such factors from financial-results-related information.

The present invention has been made in view of these circumstances, and its object is to extract factors from financial-results-related information efficiently and flexibly and thereby provide information useful to individual and institutional investors.

To solve this problem, the first invention provides a financial results analysis system that analyzes financial-results-related information and has a pattern storage unit, a document analysis unit, and a report creation unit. The pattern storage unit stores factor patterns. Each factor pattern has an expression of an accounting event that includes at least an account item and amount information, and an expression of the factor behind that event. The document analysis unit compares the morpheme sequence of a sentence to be processed with the factor patterns stored in the pattern storage unit. The morpheme sequence of the sentence to be processed has been abstracted by assigning, to the morphemes or combinations of morphemes that make up the sequence, an attribute label unique to each attribute that classifies at least account items and amounts. For a morpheme sequence that matches a factor pattern, the document analysis unit extracts the portion designated by the factor pattern as the factor, links the extracted factor to the account item and amount information in the morpheme sequence, and stores them as one set of data. The report creation unit outputs a financial results analysis report based on this set of data.

In the first invention, the document analysis unit preferably extracts as the factor, from the morpheme sequence that matches the factor pattern, an indefinite repeated portion that is contiguous in front of or behind the expression indicating a factor and to which no attribute label has been assigned.

In the first invention, when any of the elements (account item, amount information, or factor) is missing from one sentence, the document analysis unit may analyze other sentences in the order in which they appear in the text and link the account item, amount information, and factor once the missing element has been obtained.

The first invention may further include a morphological analysis dictionary that stores predefined character strings. In this case, the document analysis unit treats each character string stored in the morphological analysis dictionary as a single morpheme. A labeling dictionary that stores combinations of morphemes in association with attribute labels may also be provided. In this case, the document analysis unit assigns the attribute label specified by the labeling dictionary to a combination of morphemes that is treated as a single morpheme.

The second invention provides a financial results analysis program that analyzes financial-results-related information using a computer in which factor patterns are stored in advance, each factor pattern having an expression of an accounting event that includes at least an account item and amount information, and an expression of the factor behind that event. The program causes the computer to execute processing having the following first to third steps. In the first step, the morpheme sequence of a sentence to be processed is compared with the factor patterns. The morpheme sequence of the sentence to be processed has been abstracted by assigning, to the morphemes or combinations of morphemes that make up the sequence, an attribute label unique to each attribute that classifies at least account items and amounts. In the second step, for a morpheme sequence that matches a factor pattern, the portion designated by the factor pattern is extracted as the factor, and the extracted factor is linked to the account item and amount information in the morpheme sequence and stored as one set of data. In the third step, a financial results analysis report is output based on this set of data.

In the second invention, the second step is preferably a step of extracting as the factor, from the morpheme sequence that matches the factor pattern, an indefinite repeated portion that is contiguous in front of or behind the expression indicating a factor and to which no attribute label has been assigned.

In the second invention, the second step may include a step in which, when any of the elements (account item, amount information, or factor) is missing from one sentence, other sentences are analyzed in the order in which they appear in the text, and the account item, amount information, and factor are linked once the missing element has been obtained.

In the second invention, a morphological analysis dictionary that stores combinations of morphemes for predefined character strings may be stored in the computer in advance. In this case, the first step includes a step of treating each character string stored in the morphological analysis dictionary as a single morpheme. A labeling dictionary that stores combinations of morphemes in association with attribute labels may also be stored in the computer in advance. In this case, the first step includes a step of assigning the attribute label specified by the labeling dictionary to a combination of morphemes that is treated as a single morpheme.

Further, in the first and second inventions, the amount information is preferably information about an increase or decrease in an amount.

According to the present invention, what is compared with the predefined factor patterns is not the sentence itself but the morpheme sequence obtained by morphologically analyzing the sentence and abstracting it with attribute labels. When the two match, the portion of the morpheme sequence designated by the factor pattern is extracted as the factor. Because the comparison is made on morpheme sequences abstracted by attribute labels, the number of factor patterns that must be defined can be kept down effectively, and factors can be extracted efficiently and flexibly. In addition, by linking the account item, amount information, and factor, storing them as one set of data, and outputting a financial results analysis report based on that data, information useful to individual and institutional investors can be provided.

FIG. 1: Block diagram of the financial results analysis system
FIG. 2: Display example of a financial results analysis report
FIG. 3: Flowchart of the document analysis routine
FIG. 4: Example of a sentence to be processed
FIG. 5: List of the names of labeled morpheme sequences
FIG. 6: Examples of factor patterns
FIG. 7: Illustration of an example of factor extraction in dependency parsing

FIG. 1 is a block diagram of the financial results analysis system according to the present embodiment. The financial results analysis system 1 analyzes input financial-results-related information and creates and outputs a report as the result of the analysis. The system 1 can typically be implemented on a computer and has, as its functional blocks, a data preprocessing unit 2, a document analysis unit 3, and a report creation unit 4.

The data preprocessing unit 2 acquires financial-results-related information published on the Internet, such as summaries of financial results disclosed through stock exchanges, revisions to earnings forecasts announced by companies, and financial results analysis reports published by securities firms. This information can be acquired by crawling each company's website (IR pages). The data preprocessing unit 2 also parses the acquired XBRL data to obtain financial information, and parses the acquired PDF data to obtain financial information by segment, by region, and so on. XBRL (eXtensible Business Reporting Language) is a language built on the XML (Extensible Markup Language) standard. By turning business reports such as financial statements into electronic documents, it aims to make them more efficient to prepare and to enable secondary uses such as comparison and analysis. The data preprocessing unit 2 further parses the PDF data serving as financial-results-related information to obtain the text written in the PDF.

The document analysis unit 3 extracts, from the text acquired by the data preprocessing unit 2, the factors that gave rise to accounting events, which are the results of the financial period. Here, an accounting event is a statement consisting mainly of an account item and amount information, such as "operating profit was XX hundred million yen". In this specification, "amount information" refers to information about an amount appearing in the text. Specifically, the following patterns are envisaged, and increases and decreases in an amount deserve particular attention as important information.

[Patterns of amount information]
1. Account item + amount (e.g., 売上高１億円, "net sales of 100 million yen")
2. Account item + increase/decrease in amount (e.g., 売上高が１億円増加, "net sales increased by 100 million yen")
3. Account item + increase/decrease in amount (e.g., 増収, "revenue increase")

A factor is a statement of the cause of or reason for an accounting event, such as "due to the effect of exchange rates and the like". The document analysis unit 3 links such an event to its factor and stores them as one set of data. This linking of events and factors is performed over the entire text, sentence by sentence, as described later.

The report creation unit 4 analyzes the content of the financial results based on the data acquired by the document analysis unit 3 and outputs the result as a financial results analysis report. FIG. 2 shows a display example of the financial results analysis report. In this example, for each business segment, the factors behind each event such as "net sales", "operating profit", and "profit" are given as "commentary". Each pairing of an event with its "commentary" is generated from the linking of information performed by the factor extraction unit 8 described later. When the report is created, not only the analysis results from the document analysis unit 3 but also the numerical data extracted by the XBRL and PDF parsing described above are used as appropriate.
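As a rough illustration of how linked records might be turned into a per-segment report like the one described for FIG. 2, the following Python sketch groups records by segment and prints each event with its commentary. The record keys (`segment`, `account`, `amount`, `cause`), the function name `build_report`, and the text layout are all assumptions made for this example; the specification does not define a concrete output format.

```python
from collections import defaultdict
from typing import Dict, Iterable


def build_report(records: Iterable[Dict[str, str]]) -> str:
    """Group linked (account, amount, cause) records by business segment.

    The record layout is hypothetical; the patent only states that each
    event is shown together with its "commentary" for each segment.
    """
    by_segment = defaultdict(list)
    for record in records:
        by_segment[record["segment"]].append(record)

    lines = []
    for segment, items in by_segment.items():  # insertion order is preserved
        lines.append(f"[{segment}]")
        for item in items:
            lines.append(f"  {item['account']}: {item['amount']}")
            lines.append(f"    Commentary: {item['cause']}")
    return "\n".join(lines)
```

Calling `build_report` on a list of such dictionaries returns plain text with one bracketed heading per segment, followed by that segment's events and commentary.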

The document analysis unit 3 has, as its functional sub-blocks, a morphological analysis unit 5, a labeling unit 6, a pattern comparison unit 7, and a factor extraction unit 8. It also has a morphological analysis dictionary 9, a labeling dictionary 10, and a pattern storage unit 11 as storage units that hold the predefined information needed for document analysis. The morphological analysis dictionary 9 stores a large number of predefined character strings that should be treated as a single morpheme in morphological analysis (for example, "month", "previous year", and account items equivalent to "net sales"). The labeling dictionary 10 is used in the processing of the labeling unit 6 and stores combinations of morphemes in association with attribute labels. The attribute label specified by the labeling dictionary 10 is attached to a combination of morphemes that is treated as a single morpheme. The pattern storage unit 11 stores a large number of predefined factor patterns. Each factor pattern has an expression of an accounting event that includes at least an account item and amount information, and an expression of the factor behind that event. The contents of the morphological analysis dictionary 9, the labeling dictionary 10, and the pattern storage unit 11 can be added to or changed as needed.

The morphological analysis unit 5 splits the text contained in the financial-results-related information into sentences, performs morphological analysis on each sentence, and generates a morpheme sequence for each sentence. Because a PDF has no concept of lines, the character string is first reshaped, before morphological analysis, so that each line ends at a sentence boundary (a period, an indent, or the like). Morphological analysis means dividing the sentence to be processed into its smallest grammatical units, called morphemes, and analyzing them. In the analysis of Japanese, words are cut out of the sentence one after another, and the part of speech (verb, adjective, noun, adverb, adnominal, conjunction, auxiliary verb, particle, and so on) and the conjugation of each word are estimated. A character string predefined in the morphological analysis dictionary 9 is handled as a single morpheme in the analysis and is given predetermined metadata.

The labeling unit 6 assigns an attribute label unique to each attribute to the morphemes, or combinations of morphemes, that make up the morpheme sequence of a sentence (labeling). For example, the combination of a number and "円" (yen) is given the attribute label "amount". At a minimum, the attribute labels must be able to distinguish "account item" and "amount", but other attributes may be defined as appropriate. Combinations of morphemes predefined in the labeling dictionary 10 (those to be treated as a single morpheme) are given the attribute label specified by the term dictionary 9.

The pattern comparison unit 7 compares the morpheme sequence of a sentence, abstracted by attribute labels, with the factor patterns stored in the pattern storage unit 11 and determines whether the morpheme sequence matches a factor pattern.

For a morpheme sequence that matches a factor pattern, the factor extraction unit 8 extracts the portion designated by the factor pattern as the factor and links the extracted factor to the account item and amount information in the morpheme sequence. For example, for a morpheme sequence that matches the factor pattern country name + particle + account item + particle + indefinite repetition + "により" ("due to") + amount difference, the "indefinite repetition" portion, which is contiguous in front of the factor-indicating expression "により" and has no attribute label, is extracted as the factor. In regular-expression terms, an "indefinite repetition" can be expressed, for example, as the combination of "." (any single character to which no label such as "account item" or "amount information" has been assigned) and "+" (one or more repetitions of the preceding pattern). There are many different expressions that indicate a factor, and for some of them the contiguous "indefinite repetition" portion behind the expression is taken as the factor. In the present embodiment, the three elements (account item, amount, and factor) are basically linked into one set sentence by sentence, but when these elements appear in separate sentences, one set is extracted from a plurality of sentences while checking which of the three elements are present or missing.
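The example pattern above (country name + particle + account item + particle + indefinite repetition + "により" + amount difference) can be sketched as a regular expression over a label-annotated string. This is only an illustrative sketch: the `<LABEL:surface>` encoding, the label names `COUNTRY`, `PARTICLE`, `ACCOUNT`, and `AMOUNT_CHANGE`, the helper names, and the sample sentence are assumptions made here, not part of the specification. Which tokens carry a label is decided by the labeling step, so it is hard-coded in the usage note.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (surface, attribute label or None)


def encode(tokens: List[Token]) -> str:
    """Render labeled tokens as <LABEL:surface>; unlabeled tokens stay as-is.

    Assumes surfaces never contain '<' or '>'.
    """
    return "".join(f"<{label}:{surface}>" if label else surface
                   for surface, label in tokens)


def slot(label: str, group: str) -> str:
    """Regex fragment matching one token with the given label."""
    return rf"<{label}:(?P<{group}>[^<>]*)>"


# "Indefinite repetition" = one or more characters outside any labeled token.
FACTOR_PATTERN = re.compile(
    slot("COUNTRY", "country") + slot("PARTICLE", "p1")
    + slot("ACCOUNT", "account") + slot("PARTICLE", "p2")
    + r"(?P<cause>[^<>]+)" + "により"
    + slot("AMOUNT_CHANGE", "change")
)
```

With the made-up token list `[("米国", "COUNTRY"), ("で", "PARTICLE"), ("売上高", "ACCOUNT"), ("は", "PARTICLE"), ("為替の影響", None), ("により", None), ("100億円の減収", "AMOUNT_CHANGE")]`, `FACTOR_PATTERN.search(encode(tokens))` yields a match whose `cause` group is the unlabeled run immediately in front of "により".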

FIG. 3 is a flowchart of the document analysis routine executed by the document analysis unit 3. This document analysis processing is carried out by installing on a computer a program that causes it to execute the processing of FIG. 3. The details of the document analysis are described below, taking the sentence shown in FIG. 4 as an example.

First, in step 1, the morphological analysis unit 5 splits the document obtained from the PDF into sentences and assigns each sentence a sentence number in ascending order, following the order of the text. In step 2, the loop variable n, which designates the number of the sentence to be processed, is set to 1, and processing of the first sentence in the text begins.
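Steps 1 and 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name `number_sentences` is invented here, line breaks left over from PDF extraction are simply removed, and only the Japanese full stop "。" is treated as a sentence boundary (the description also mentions indents, which this sketch ignores).

```python
import re
from typing import List, Tuple


def number_sentences(raw_text: str) -> List[Tuple[int, str]]:
    """Split extracted text into sentences and number them from 1."""
    # Join lines that the PDF extraction broke in the middle of a sentence.
    joined = re.sub(r"\s*\n\s*", "", raw_text)
    # Split after every full stop, keeping the full stop with its sentence.
    sentences = [s for s in re.split(r"(?<=。)", joined) if s]
    return list(enumerate(sentences, start=1))
```

A loop such as `for n, sentence in number_sentences(text): ...` then plays the role of the loop variable n, starting at the first sentence.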

In step 3, the morphological analysis unit 5 performs morphological analysis on the sentence being processed. As noted above, this is no different from ordinary morphological analysis except that combinations of morphemes predefined in the term dictionary are treated as a single morpheme.

In step 4, the labeling unit 6 labels the morphemes, or combinations of morphemes, that make up the morpheme sequence. There are three kinds of labeling: (1) labeling of simple morpheme sequences, (2) labeling of predefined morpheme sequences, and (3) labeling of expressions of an increase or decrease in amount.

(1) Labeling of simple morpheme sequences
Attribute labels are assigned to combinations of simple morpheme sequences such as punctuation marks and numeric values. For a comma, when a morpheme whose metadata is "symbol" and "comma" matches ",", the attribute label "," / "、" is assigned. For a period, when a morpheme whose metadata is "symbol" and "period" matches ".", it is labeled "。". For a numeric value, let A be a run of one or more morphemes whose metadata is "noun" and "number", and let B be a comma or period followed by one or more morphemes whose metadata is "noun" and "number"; anything matching A or AB is given the attribute label "numeric value". For an amount, a "numeric value" followed by "円" (yen) is given the attribute label "amount". In the example sentence of FIG. 4, the morpheme sequences "1,616億円", "3,621億円", "6,128億円", "944億円", and "53億円" (161.6, 362.1, 612.8, 94.4, and 5.3 billion yen, respectively) are each given the "amount" label.
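The A / AB rule for numeric values and the "numeric value + 円" rule for amounts can be sketched over a list of morphemes as follows. The `Morpheme` tuple, the English metadata values (`"noun"`, `"number"`, `"symbol"`, `"comma"`, `"period"`), and the label names `NUMBER` and `AMOUNT` are assumptions made for this sketch; a real morphological analyzer would supply its own tag set.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morpheme(NamedTuple):
    surface: str
    pos: str     # hypothetical coarse part of speech, e.g. "noun", "symbol"
    subpos: str  # hypothetical subtype, e.g. "number", "comma", "period"


def is_number(m: Morpheme) -> bool:
    return m.pos == "noun" and m.subpos == "number"


def is_separator(m: Morpheme) -> bool:
    return m.pos == "symbol" and m.subpos in ("comma", "period")


def label_numbers(morphemes: List[Morpheme]) -> List[Tuple[str, Optional[str]]]:
    """Merge A or AB runs into NUMBER, and NUMBER followed by 円 into AMOUNT."""
    out: List[Tuple[str, Optional[str]]] = []
    i, n = 0, len(morphemes)
    while i < n:
        if not is_number(morphemes[i]):
            out.append((morphemes[i].surface, None))
            i += 1
            continue
        j = i
        while j < n and is_number(morphemes[j]):        # A: one or more numbers
            j += 1
        if j + 1 < n and is_separator(morphemes[j]) and is_number(morphemes[j + 1]):
            j += 1                                      # B: separator ...
            while j < n and is_number(morphemes[j]):    # ... then numbers
                j += 1
        label = "NUMBER"
        if j < n and morphemes[j].surface == "円":
            j += 1
            label = "AMOUNT"
        out.append(("".join(m.surface for m in morphemes[i:j]), label))
        i = j
    return out
```

For the morphemes of "減少額1,616億円", assuming the analyzer splits the figure into "1", ",", "616", "億", and "円", the five trailing morphemes collapse into a single `AMOUNT` token.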

(2) Labeling of predefined morpheme sequences
Attribute labels are assigned to morpheme sequences already defined in the labeling dictionary 10, such as months, the previous year, and account items equivalent to net sales. For example, for account items equivalent to net sales, let A be the expression "連結" (consolidated), B be an expression such as "売上収益", "売上高", "売上", or "営業収益" (sales revenue, net sales, sales, operating revenue), and C be a morpheme sequence enclosed in parentheses; anything matching B, AB, BC, or ABC is given an attribute label such as "net sales" or "sales".
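The B, AB, BC, and ABC combinations amount to "optional A, required B, optional C", which a single regular expression can express. The sketch below works on raw characters rather than on morphemes, which is a simplification; the label name `NET_SALES` and the function name are assumptions for illustration.

```python
import re
from typing import List, Tuple

# A: optional "連結"; B: one of the sales expressions (longer forms listed
# before "売上" so the longest one wins); C: optional full-width parentheses.
NET_SALES = re.compile(
    r"(?:連結)?(?:売上収益|売上高|売上|営業収益)(?:（[^（）]*）)?"
)


def label_net_sales(sentence: str) -> List[Tuple[str, str]]:
    """Return every span matching B, AB, BC, or ABC with a NET_SALES label."""
    return [(m.group(0), "NET_SALES") for m in NET_SALES.finditer(sentence)]
```

For a made-up sentence such as "連結売上高（注）は増加し、営業収益も伸びた。", this returns one ABC match and one plain B match.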

(3) Labeling of expressions of an increase or decrease in amount
Attribute labels are assigned according to patterns such as simple expressions, expressions with parenthetical notes, and expressions in terms of ratios. For example, let A be "過去最高の" (record-high), B be the expression immediately preceding the result, C be an expression for the previous period or such an expression followed by a comma, D be one or more repetitions of an amount or ratio, E be morphemes enclosed in parentheses, F be a comma + amount or ratio, or a comma + amount or ratio + E, G be an expression of increase or decrease, and H be a comma. Combinations such as ABCBDEFGH, BCBDEFGH, and CBDEFGH are then defined, and anything that matches is given the attribute label "increase/decrease in amount". Because enumerating every combination to be defined would make the description enormous, in practice a technique such as regular expressions is used rather than a full enumeration of combinations. In the example sentence of FIG. 4, "3,621億円の増収" (revenue increase of 362.1 billion yen), "6,128億円の減収" (revenue decrease of 612.8 billion yen), "944億円の増収" (revenue increase of 94.4 billion yen), "53億円の減収" (revenue decrease of 5.3 billion yen), and "減少額1,616億円" (decrease of 161.6 billion yen) are given the "increase/decrease in amount" label.
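To illustrate why a regular expression is more compact than enumerating combinations, the following sketch covers only the two surface forms seen in the FIG. 4 examples: "amount + の + 増収/減収" and "増加額/減少額 + amount". It is far narrower than the A to H scheme described above, and the pattern and function names are assumptions for this example.

```python
import re
from typing import List

# An amount: digits with optional thousands separators, an optional
# magnitude unit (兆, 億, 万), and 円. Optional whitespace is tolerated
# because PDF extraction can leave a gap before the unit.
AMOUNT = r"\d[\d,]*\s*(?:兆|億|万)?円"

AMOUNT_CHANGE = re.compile(
    rf"(?:{AMOUNT}の(?:増収|減収|増益|減益)|(?:増加|減少)額{AMOUNT})"
)


def find_amount_changes(sentence: str) -> List[str]:
    """Return the spans that would receive the increase/decrease label."""
    return AMOUNT_CHANGE.findall(sentence)
```

Applied to a string containing "3,621億円の増収", "6,128億円の減収", and "減少額1,616億円", the function returns those three spans in order, while a bare "販売数量の増加により" is not matched because it contains no amount.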

In step 5, the pattern comparison unit 7 compares the morpheme sequence of the sentence, abstracted by attribute labels, with the factor patterns stored in the pattern storage unit 11 and determines whether the two match (matching). Consider the case where the names listed in FIG. 5 are used as the names of the labeled morpheme sequences. In this case, the morpheme-sequence patterns to be matched (factor patterns) may include, as shown in FIG. 6, the A family (A1 to A5 ...), the B family (B1 to B3 ...), the C family (C1 to C3 ...), and so on. For example, factor pattern A consists of [factor prefix] + "segment prefix" + "segment expression" + "factor (reverse) candidate" + "factor prefix (incl. factor)" + "factor (reverse) candidate" + the literal "営業利益率は前年を維持し、" ("the operating margin was maintained at the previous year's level,") + "modifier (quantity)" + "account item expression" + "price expression" + "end of line". A large number of such factor patterns are prepared by surveying many samples of financial-results-related information so that no factors are missed in extraction.

If the two sequences match, the process proceeds from the affirmative determination in step 6 to factor extraction in step 7, where the factor extraction unit 8 extracts the portions designated by the factor pattern as the result (factor, account item, and amount increase/decrease). For example, for the A-series and C-series factor patterns, "CAUSE_THOUGH" is extracted as the factor (reverse), "CAUSE" as the factor (forward), "ACCOUNT_PHRASE" as the account item, and "PRICE_SET" as the amount increase/decrease. For the B-series factor patterns, "CAUSE_THOUGH" is extracted as the factor (reverse), "CAUSE" as the factor (forward), and "PRICE_SET_WITH_ACCOUNT" as both the account item and the amount increase/decrease.

The factor extracted in step 7 is linked with the account item and amount information in the morpheme string to form one set. In the example text of FIG. 4, the factor "with respect to sales of crude oil and natural gas, due to an increase in sales volume" is extracted for the event "revenue increase of 362.1 billion yen", the factor "due to a decline in the average unit price" for the event "revenue decrease of 612.8 billion yen", and the factor "due to the depreciation of the yen in the average exchange rate for sales" for the event "revenue increase of 94.4 billion yen". Note that expressions such as "revenue increase" and "revenue decrease" denote net sales as the account item.
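Steps 5 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions: a factor pattern is modeled as a plain ordered list of attribute labels that must match exactly, the pattern `EXAMPLE_PATTERN` and the English surface strings are hypothetical, and only the label names (CAUSE_THOUGH, CAUSE, ACCOUNT_PHRASE, PRICE_SET) come from the text. The actual patterns in FIG. 6 are richer than this.

```python
from typing import Optional

# Roles assigned to the attribute labels named in the text (A- and C-series patterns).
ROLE_BY_LABEL = {
    "CAUSE_THOUGH": "factor_reverse",
    "CAUSE": "factor_forward",
    "ACCOUNT_PHRASE": "account_item",
    "PRICE_SET": "amount_change",
}

# Hypothetical factor pattern: an ordered list of attribute labels.
EXAMPLE_PATTERN = ["CAUSE", "ACCOUNT_PHRASE", "PRICE_SET"]


def extract_set(
    labeled: list[tuple[str, str]], pattern: list[str]
) -> Optional[dict[str, str]]:
    """Return one linked record if the label order equals the pattern, else None."""
    if [label for label, _ in labeled] != pattern:
        return None  # corresponds to the negative determination in step 6
    # Step 7: pull out the designated parts and keep them together as one set.
    return {
        ROLE_BY_LABEL[label]: text
        for label, text in labeled
        if label in ROLE_BY_LABEL
    }


# Illustrative labeled sentence (surface strings are English stand-ins).
sentence = [
    ("CAUSE", "due to an increase in sales volume"),
    ("ACCOUNT_PHRASE", "net sales"),
    ("PRICE_SET", "increased by 362.1 billion yen"),
]
record = extract_set(sentence, EXAMPLE_PATTERN)
print(record)
```

Because the comparison runs on labels rather than on the surface text, the same pattern would match any sentence whose labeled units appear in this order.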

If, on the other hand, the two do not match, the process proceeds from the negative determination in step 6 to step 8, where the factor extraction unit 8 performs a combination determination spanning multiple sentences. That is, when any of the account item, the amount information, or the factor is missing from a sentence, the other sentences are analyzed in the order of the text, and once the missing element is obtained, the account item, amount information, and factor are linked together.
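The cross-sentence combination in step 8 might look like the sketch below. The text only states that other sentences are analyzed in text order until the missing element is found; searching forward only and taking the first value found are assumptions of this sketch, as are the field names and the example strings.

```python
from typing import Optional

FIELDS = ("account_item", "amount", "factor")
Partial = dict[str, Optional[str]]


def is_complete(record: Partial) -> bool:
    """True when account item, amount, and factor are all present."""
    return all(record.get(field) is not None for field in FIELDS)


def complete_record(sentences: list[Partial], start: int) -> Optional[Partial]:
    """Fill elements missing from sentences[start] using later sentences, in order.

    Returns the linked record, or None if some element is never found.
    """
    record = dict(sentences[start])
    for other in sentences[start + 1:]:
        if is_complete(record):
            break
        for field in FIELDS:
            if record.get(field) is None and other.get(field) is not None:
                record[field] = other[field]
    return record if is_complete(record) else None


# Illustrative input: the factor appears only in the second sentence.
sentences = [
    {"account_item": "net sales", "amount": "decreased by 5.3 billion yen", "factor": None},
    {"account_item": None, "amount": None, "factor": "due to lower unit prices"},
]
linked = complete_record(sentences, 0)
print(linked)
```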

Then, in step 9, it is determined whether the loop variable n has reached its last value, in other words, whether processing of the last sentence in the text has been completed. If the loop variable n has not reached its last value, it is incremented in step 10 and the process returns to step 3 to process a new sentence. If it has, the series of processes ends.

As described above, according to the present embodiment, what is compared with the predefined factor patterns is not the sentence itself but the morpheme string obtained by morphologically analyzing the sentence and abstracting it with attribute labels. When the two match, the portion of the morpheme string designated by the factor pattern is extracted as a factor. In general, settlement-related information tends to be written in text that follows a fairly fixed format. In view of this tendency, variations in phrasing are collected from a large number of texts that describe the relationship between events and factors, and each variation is defined as a factor pattern. Performing the comparison on the morpheme string abstracted by attribute labels keeps the number of factor patterns that must be defined small while allowing factors to be extracted efficiently. In addition, when a new variation is found, it suffices to add a newly defined factor pattern to the pattern storage unit 11, so the approach is also highly flexible.

Further, according to the present embodiment, when any of the account item, the amount information, or the factor is missing from a sentence to be processed, the other sentences are analyzed in the order of the text, and once the missing element is obtained, the event (account item and amount information) is linked with the factor. This makes factor extraction more effective.

Furthermore, according to the present embodiment, predefined character strings are registered and stored in the morphological analysis dictionary 9, and each such character string is treated as a single morpheme in morphological analysis. This allows the terms appearing in settlement-related information to be recognized more accurately and, as a result, improves the accuracy of factor extraction.

In the embodiment described above, factor extraction was explained using an example in which the order of morphemes in the morpheme string is matched against the order of elements in the factor pattern. The present invention is not limited to this, however, and similar processing may be performed based on patterns of dependency relations between morphemes. In that case, a factor pattern is defined as a pattern of dependency relations among elements including the event and the factor.

FIG. 7 is an explanatory diagram of an example of factor extraction using dependency analysis. Suppose that morphological analysis of the example sentence yields "国内" (domestic), "の", "販売" (sales), "が", "好調" (strong), "に", "推移" (trend), "した", "こと", "から" (because), "、", "増収" (revenue increase), "となり" (resulted in), "まし", "た". First, the morphemes are labeled: for example, "国内" is a segment, "増収" is account item + increase/decrease, and "、", "まし", "た", and the like are to be ignored. Next, the sentence is divided into the groups that extend from "となり", which serves as the starting point. For example: (1) "国内の販売が" (domestic sales) falls under the segment; (2) "好調に推移したことから" (because [they] trended strongly) consists of morphemes with no label at all and is therefore likely to be a factor; (3) "、" consists only of ignored morphemes; (4) "増収" contains account item + increase/decrease; (5) "まし" consists only of ignored morphemes; and (6) "た" likewise consists only of ignored morphemes. Finally, after the ignored groups are removed and the rest is organized, "増収" (revenue increase) is extracted as the account item + increase/decrease and "好調に推移したことから" (because [they] trended strongly) is extracted as the factor.
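The group classification in the FIG. 7 example can be sketched as follows. The sketch takes the dependency groups as given input (no parser is called), and the label names `SEGMENT`, `ACCOUNT_CHANGE`, `IGNORE`, and `FACTOR_CANDIDATE` are hypothetical stand-ins for the labels described above. The rule that a group made up entirely of unlabeled morphemes is treated as a factor candidate follows the text; how a mixed group is resolved is an assumption.

```python
from typing import Optional

# (surface form, attribute label or None when the morpheme has no label)
Morpheme = tuple[str, Optional[str]]


def classify_group(group: list[Morpheme]) -> Optional[str]:
    """Return the role of one dependency group, or None if it is to be ignored."""
    labels = [label for _, label in group]
    if all(label == "IGNORE" for label in labels):
        return None
    for label in labels:
        if label not in (None, "IGNORE"):
            return label  # the group carries a labeled element
    return "FACTOR_CANDIDATE"  # no labeled element: likely a factor


def extract_roles(groups: list[list[Morpheme]]) -> dict[str, str]:
    """Map each non-ignored role to the concatenated surface text of its group."""
    roles: dict[str, str] = {}
    for group in groups:
        role = classify_group(group)
        if role is not None:
            roles[role] = "".join(surface for surface, _ in group)
    return roles


# The six groups extending from the starting point in the FIG. 7 example.
groups = [
    [("国内", "SEGMENT"), ("の", None), ("販売", None), ("が", None)],
    [("好調", None), ("に", None), ("推移", None), ("した", None), ("こと", None), ("から", None)],
    [("、", "IGNORE")],
    [("増収", "ACCOUNT_CHANGE")],
    [("まし", "IGNORE")],
    [("た", "IGNORE")],
]
roles = extract_roles(groups)
print(roles)
```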

1 Settlement analysis system
2 Data preprocessing unit
3 Document analysis unit
4 Report creation unit
5 Morphological analysis unit
6 Labeling unit
7 Pattern comparison unit
8 Factor extraction unit
9 Morphological analysis dictionary
10 Labeling dictionary
11 Pattern storage unit

Claims

A settlement analysis system for analyzing settlement-related information, comprising:
a pattern storage unit that stores factor patterns, each having at least an expression of an accounting event including account item and amount information and an expression of a factor of the event;
a document analysis unit that compares a morpheme string of a sentence to be processed with the factor patterns stored in the pattern storage unit, the morpheme string having been abstracted by assigning, to the morphemes constituting the morpheme string or to combinations thereof, an attribute label unique to each attribute, the attributes classifying at least account items and amounts, that extracts, for a morpheme string matching a factor pattern, the portion designated by the factor pattern as a factor, and that stores the extracted factor as one set of data in association with the account item and amount information in the morpheme string; and
a report creation unit that outputs a settlement analysis report based on the one set of data.

A settlement analysis program for analyzing settlement-related information using a computer in which factor patterns, each having at least an expression of an accounting event including account item and amount information and an expression of a factor of the event, are stored in advance, the program causing the computer to execute processing comprising:
a first step of comparing a morpheme string of a sentence to be processed with the factor patterns, the morpheme string having been abstracted by assigning, to the morphemes constituting the morpheme string or to combinations thereof, an attribute label unique to each attribute, the attributes classifying at least account items and amounts;
a second step of extracting, for a morpheme string matching a factor pattern, the portion designated by the factor pattern as a factor, and storing the extracted factor as one set of data in association with the account item and amount information in the morpheme string; and
a third step of outputting a settlement analysis report based on the one set of data.