CN116150361A - Event extraction method, system and storage medium for financial statement notes - Google Patents

Info

Publication number
CN116150361A
Authority
CN
China
Prior art keywords
event
title
notes
vector
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211680822.XA
Other languages
Chinese (zh)
Inventor
潘定
周星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202211680822.XA
Publication of CN116150361A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an event extraction method, system and storage medium for financial statement notes. The method comprises the following steps: acquiring a financial report PDF document and, after data preprocessing, obtaining a TXT document of the financial statement notes text; identifying and labeling the titles, title levels and paragraphs of the TXT document of the financial statement notes text to obtain a title set and a paragraph set; identifying and labeling the event arguments of the financial events in the financial statement notes based on a Transformer encoder, and at the same time obtaining vector representations of the event arguments; representing the semantic features of the paragraphs, the titles and their levels as vectors, and splicing the vector representations of the words contained in the event arguments with the vector representations of the titles and their levels into a vector matrix; learning the features of the event arguments and of the titles and their levels; and learning the features of the event arguments, the titles, the title levels and a memory vector, and filling the event arguments into the current event role of an event table based on a Transformer encoder and a linear binary classifier, to obtain all event records contained in the current paragraph.
The invention extracts the titles and their levels from the financial statement notes text as chapter-level semantic information of the notes, uses this chapter-level semantic information together with the event argument information to identify the event categories in the notes text, and uses event table filling to extract multiple event records simultaneously, thereby improving the overall accuracy of event extraction from financial statement notes.

Description

Event extraction method, system and storage medium for financial statement notes
Technical Field
The invention relates to the technical field of event extraction of financial statement notes, in particular to an event extraction method, an event extraction system and a storage medium of financial statement notes based on chapter structure identification.
Background
Quality analysis of an enterprise's financial condition is based on its financial statements. Relying on the implicit associations between the line items of the financial statements and the financial statement notes, the finance-related events disclosed in the notes are analyzed further to obtain the quality analysis result.
Because the disclosures in financial statement notes are increasingly complex, a decision maker analyzing financial condition must spend considerable time and labor to find the financial events in the notes that correspond to the line items of the financial statements. Enterprises therefore need a method that automatically converts the unstructured text of financial statement notes into a structured representation, so as to improve analysis efficiency.
The task of event extraction is to extract event arguments from unstructured text and organize them into a structured form (e.g., an event table). It includes two subtasks: identifying event categories and filling in event arguments. The event table is a storage structure, namely a two-dimensional table formed by event categories and event roles and used to describe events, in which rows represent event categories and columns represent event roles. An event argument is a participant in, or an attribute of, an event. An event role is the specific semantic role that an event argument plays in an event. An event record is composed of event arguments arranged as a sequence covering the event roles required by a given event category, and corresponds to one row of the event table.
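To make these terms concrete, the following is a minimal Python sketch of an event table. The event category `EquityPledge`, its role names and the sample arguments are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass, field

# Hypothetical schema: one event category mapped to its ordered event roles.
EVENT_ROLES = {
    "EquityPledge": ["pledger", "pledgee", "pledged_shares"],
}


@dataclass
class EventRecord:
    """One row of the event table: event arguments keyed by event role."""
    category: str
    arguments: dict = field(default_factory=dict)

    def missing_roles(self):
        # Roles required by the category that have no argument yet.
        return [r for r in EVENT_ROLES[self.category] if r not in self.arguments]


# The event table holds many records, possibly several per category.
event_table = []
record = EventRecord("EquityPledge")
record.arguments["pledger"] = "Company A"
record.arguments["pledgee"] = "Bank B"
event_table.append(record)
```

Here `record.missing_roles()` lists the event roles that still need an event argument before the record is complete.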
Existing event extraction methods fall into three groups. Pattern-matching methods are mainly algorithms based on syntax trees or regular expressions; they depend on domain-specific event templates, which require strong professional knowledge to construct manually. Machine-learning methods generally convert event extraction into a classification problem, using common classification algorithms such as logistic regression, naive Bayes, nearest neighbors, decision trees and support vector machines; they depend on a large annotated corpus and suitable feature selection, and generalize better, but are limited in mining complex nonlinear relations. Deep-learning methods are typically based on convolutional networks, attention mechanisms, pre-trained models and graph neural networks; they have strong nonlinear expressive power and address the problems of manually designed features, poor scalability and dependence on complex NLP tools. However, classical deep-learning methods focus on the semantics of a single sentence and capture only short-range, sentence-level context. No event extraction method aimed specifically at financial statement notes was found in existing research. Events in financial statement notes have distinctive characteristics: one notes text may contain events of several categories, several event records of one category may be distributed across different parts of the text, and the event arguments of one event may be spread over several non-adjacent sentences. Existing methods tend to miss long-range, chapter-level context semantics, which leads to inaccurate event category classification, incomplete event record extraction and inaccurate event argument identification.
Therefore, how to efficiently extract finance-related events from financial statement notes is a technical problem to be solved in the field of quality analysis of financial condition.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides an event extraction method for financial statement notes. The method first extracts the titles and their levels from the financial statement notes text as key chapter-level semantic information, then identifies and labels the event arguments of the financial events in the notes based on a Transformer encoder, identifies the event categories in the notes text using the chapter-level semantic information and the event argument information, and finally extracts multiple event records simultaneously by filling an event table, thereby improving the overall accuracy of event extraction from financial statement notes.
A second object of the present invention is to provide an event extraction system for financial statement notes.
A third object of the present invention is to provide a computer-readable storage medium.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides an event extraction method for financial statement notes, which comprises the following steps:
acquiring the PDF document of a financial report from a database file, converting the PDF document into a TXT document through data preprocessing, and matching the TXT document of the financial statement notes text using regular expressions in a knowledge base;
identifying and labeling the titles, title levels and paragraphs of the TXT document of the financial statement notes text based on the knowledge base to obtain a title set and a paragraph set;
performing sentence segmentation and word segmentation on the paragraph set to obtain a word segmentation list, learning the semantics within the paragraphs based on a Transformer encoder, inputting the vector matrix of the output layer of the Transformer encoder into a CRF model, and identifying and labeling the event arguments of the financial events in the financial statement notes to obtain vector representations of the event arguments;
splicing the vector representations of the words contained in the event arguments with the vector representations of the titles and their levels into a vector matrix, inputting the spliced vector matrix into a Transformer encoder to obtain a vector matrix that integrates the title, title level and paragraph semantics, inputting this vector matrix into an event category classifier to obtain the probabilities of all event categories for the current vector matrix, selecting the event category with the highest probability as the event category of the current vector matrix, looking up by index the event roles of the currently triggered event category in the predefined event information, and outputting the event roles in the predefined order to obtain vector representations of the event roles;
constructing a memory vector for recording the event argument filling process, splicing the vector representation of the event argument, the vector representation of the event role and the memory vector and inputting the result into a Transformer encoder, connecting the output layer of the Transformer encoder to a linear binary classifier to obtain the probability that the current event argument fills the current event role of the event table, selecting the event argument whose probability meets a set value and filling it into the current event role of the event table, repeating the iteration until all event roles of the current event category are filled, and updating the memory vector with the filled event arguments and the vector representations of the event roles, to obtain all event records contained in the current paragraph.
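The role-by-role filling loop in the last step can be sketched as follows. This is only an illustration of the control flow: the Transformer encoder and linear binary classifier are replaced by a hypothetical `toy_score` lookup, the memory is kept as a plain list of (role, argument) pairs rather than a vector, and the role names, arguments, scores and the 0.5 threshold are all assumptions.

```python
def fill_event_record(roles, arguments, score, threshold=0.5):
    """Fill each event role in order with the best-scoring event argument."""
    memory = []   # (role, argument) pairs filled so far
    record = {}
    for role in roles:
        best_arg, best_p = None, threshold
        for arg in arguments:
            # score() stands in for the encoder plus binary classifier.
            p = score(arg, role, memory)
            if p >= best_p:
                best_arg, best_p = arg, p
        record[role] = best_arg           # None if nothing reaches the threshold
        if best_arg is not None:
            memory.append((role, best_arg))
    return record


# Hypothetical match probabilities, for illustration only.
TOY_SCORES = {
    ("Company A", "pledger"): 0.9,
    ("Bank B", "pledgee"): 0.8,
    ("Company A", "pledgee"): 0.2,
}


def toy_score(arg, role, memory):
    # An argument already recorded in memory is not reused in this toy scorer.
    if arg in {a for _, a in memory}:
        return 0.0
    return TOY_SCORES.get((arg, role), 0.1)


filled = fill_event_record(
    ["pledger", "pledgee", "pledged_shares"],
    ["Company A", "Bank B"],
    toy_score,
)
```

Passing the memory to the scorer at every step is what lets earlier filling decisions influence later ones, which is the purpose the text gives for the memory vector.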
As a preferred technical solution, the data preprocessing includes deleting headers, page numbers, tables, and special symbols.
As a preferred technical solution, the TXT document of the financial statement notes text is matched using regular expressions in the knowledge base file, the regular expressions being expressed as:
start_line = [.* financial statement notes | .* [company | this company | enterprise | group | this group] basic information] and end_line = [.* documents available for inspection];
wherein [ ] represents a regular expression set, .* represents any other character string, | separates alternatives, start_line represents the regular expression for the beginning line of the financial statement notes text, and end_line represents the regular expression for the ending line of the financial statement notes text;
when the target character string matches the rule of the regular expression, the re.search() function returns a match object, and when the target character string does not match the rule, re.search() returns None;
traversing the TXT document after data preprocessing with the current line held as the string line: when the return value of re.search(start_line, line) is not None, that is, when the line matches the start_line expression, lines are kept from the current line onward until the return value of re.search(end_line, line) is not None, at which point the traversal stops and the TXT document of the financial statement notes text is obtained.
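A minimal sketch of this start/end matching with Python's `re` module is given here. The two patterns are hypothetical English stand-ins for the knowledge-base expressions, and the choice to exclude the ending line itself from the result is an assumption, since the text does not say whether that line is kept.

```python
import re

# Hypothetical English stand-ins for the knowledge-base regular expressions.
START_LINE = re.compile(
    r"notes to the financial statements|basic information of the (company|group)",
    re.IGNORECASE,
)
END_LINE = re.compile(r"documents available for inspection", re.IGNORECASE)


def extract_notes(lines):
    """Keep lines from the first start match up to (not including) the end match."""
    kept, inside = [], False
    for line in lines:
        if not inside:
            if START_LINE.search(line) is not None:
                inside = True
                kept.append(line)
        elif END_LINE.search(line) is not None:
            break  # stop traversing at the ending line
        else:
            kept.append(line)
    return kept


sample = [
    "Important notice",
    "VII. Notes to the financial statements",
    "1. Cash and bank balances",
    "XII. Documents available for inspection",
    "Signature page",
]
notes = extract_notes(sample)
```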
As a preferred technical solution, identifying and labeling the titles, title levels and paragraphs of the TXT document of the financial statement notes text specifically comprises the following steps:
acquiring the TXT document of the financial statement notes text, obtaining from the knowledge base the regular expressions for identifying titles and the label style set of titles, and adding a mark symbol at the beginning of each line identified as a title;
traversing the TXT document of the financial statement notes text and judging, based on a bigram statistical language model, whether each line carrying the mark symbol is a semantically complete sentence; if so, the mark symbol is retained, otherwise the mark symbol at the beginning of the line is deleted, to obtain the candidate titles;
traversing the TXT document of the financial statement notes text, sorting out the label styles and sequence numbers of the lines carrying the mark symbol, loading the label style set of titles from the knowledge base, wherein each label style corresponds to a unique code, constructing a hierarchical stack of labels to record the title level corresponding to each title, and labeling each title with a numeric code containing its title level information;
arranging the text between titles in the TXT document into one line per paragraph, prefixing each paragraph with a numeric code as its label, obtaining the title level corresponding to each paragraph from the labels of the titles, and re-dividing the paragraphs of the text using Chinese punctuation marks and sequence-number labels, based on the titles and their levels.
As a preferred technical solution, the regular expressions for identifying titles and the label style set of titles are obtained from the knowledge base, the specific rules being expressed as follows:
first rule: the line contains Chinese characters, the Chinese characters being matched by their Unicode encoding;
second rule: the line contains a label style; the label style set is a union of numbering forms built from {0} and {1}, including parenthesized forms such as ({0}) and ({1}), forms followed by a separator such as {0}, and {1}, and the section and chapter forms Section {0}, Section {1}, Chapter {0} and Chapter {1}, wherein [ ] represents a set, | represents a union, {0} represents any sequence of Chinese numeral characters, and {1} represents any sequence of digits forming a positive integer;
third rule: the line has no sentence features, that is, it does not contain sentence punctuation such as the full stop, exclamation mark, question mark or semicolon;
fourth rule: the line has no data features, that is, it does not contain long numeric data, although it may contain a year;
fifth rule: the line does not end with a conjunction;
lines satisfying the first to fifth rules are initially identified as titles. The first and second rules are expressed as regular expressions and stored in a list denoted L1; the third, fourth and fifth rules are expressed as regular expressions and stored in a list denoted L2. The TXT document of the financial statement notes text is traversed with the current line held as the string line; when the return value of re.search(L1, line) is not None and the return value of re.search(L2, line) is None, the current line is marked as a title, re.search being the search function of the re library in Python.
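The L1/L2 test can be sketched in Python as follows. The concrete patterns are assumptions: the label styles covered, the five-digit cutoff for "long numeric data" and the list of conjunctions are illustrative choices, not the knowledge-base contents.

```python
import re

# {0}: Chinese numerals, {1}: positive integers (hypothetical coverage).
NUM = r"(?:[一二三四五六七八九十百]+|\d+)"

# L1 (rules 1-2): a label style at the start of the line, followed by Chinese text.
L1 = re.compile(
    r"^\s*(?:[(（]" + NUM + r"[)）]|" + NUM + r"[、.．]|第" + NUM + r"[章节])"
    r"\s*[\u4e00-\u9fff]"
)

# L2 (rules 3-5): sentence punctuation, long numeric data, or a trailing conjunction.
L2 = re.compile(r"[。！？；]|\d{5,}|\d+,\d{3}|[和及与或]$")


def is_title(line):
    """A line is a candidate title when L1 matches and L2 does not."""
    text = line.strip()
    return L1.search(text) is not None and L2.search(text) is None


candidates = [
    "一、公司基本情况",                    # label + Chinese text
    "（1）应收账款",                       # parenthesized label
    "2021年度营业收入为1,234,567.89元。",  # ordinary sentence with data
    "3、本公司及",                         # ends with a conjunction
]
flags = [is_title(c) for c in candidates]
```

A four-digit year passes the data check in this sketch because only runs of five or more digits, or comma-grouped amounts, are treated as long numeric data.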
As a preferred technical solution, constructing a hierarchical stack of labels to record the title level corresponding to each title, and labeling each title with a numeric code containing its title level information, specifically comprises:
constructing a dictionary TP comprising a plurality of key-value pairs, each pair consisting of a key and a value joined by a colon, wherein the key is the code of a label style and the value is the label style, keys and values being in one-to-one correspondence, and constructing a dictionary T for storing the titles and their levels;
obtaining the label style and sequence number of the current line and comparing the obtained label style with all label styles in the dictionary TP; if a similar style exists, returning the key corresponding to that value, namely its code; if no similar style exists, judging whether the obtained sequence number n is 1: if the sequence number is 1, recording the current label style as a value in the dictionary TP and assigning it a new key, and if the sequence number is not 1, deleting the current title;
inputting the code and sequence number of the label style into the hierarchical stack and judging whether the stack is empty; if the stack is empty, judging whether the element to be pushed is a title with sequence number 1 and, if so, pushing it; if the stack is not empty, obtaining the top element, comparing the codes of the element to be pushed and the top element, determining the relation between the titles from their codes and sequence numbers, and pushing the element according to that relation;
sequentially outputting the codes and sequence numbers of the titles in the hierarchical stack as keys and storing the contents of the titles as values in the dictionary T, completing the extraction of the titles and their levels;
representing each title and its level in the dictionary T by a numeric code, labeling the current title with it, and saving the result in the TXT file.
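The stack logic can be sketched as follows, under several simplifying assumptions: label styles are compared by exact equality rather than similarity, titles whose sequence number does not continue the open level are dropped, and the dotted numeric code (for example `1.2`) is a hypothetical encoding of the title level. The label styles and title texts are made up for illustration.

```python
def assign_levels(titles):
    """Map each (label_style, sequence_number, text) title to a dotted level code."""
    tp = {}       # label style -> code (dictionary TP, inverted for lookup)
    stack = []    # one [code, sequence_number] entry per open title level
    result = {}   # dictionary T: numeric level code -> title text
    for style, n, text in titles:
        if style not in tp:
            if n != 1:
                continue          # unseen style not starting at 1: drop the title
            tp[style] = len(tp) + 1
        code = tp[style]
        open_codes = [c for c, _ in stack]
        if code in open_codes:
            # Same style as an open level: a sibling at that level.
            idx = open_codes.index(code)
            if n != stack[idx][1] + 1:
                continue          # sequence number does not continue: drop
            del stack[idx + 1:]   # close any deeper levels
            stack[idx][1] = n
        else:
            if n != 1:
                continue          # a new deeper level must start at 1
            stack.append([code, n])
        result[".".join(str(k) for _, k in stack)] = text
    return result


levels = assign_levels([
    ("{0}、", 1, "Company overview"),
    ("({0})", 1, "History"),
    ("({0})", 2, "Business scope"),
    ("{0}、", 2, "Accounting policies"),
    ("({0})", 1, "Accounting period"),
    ("{1}.", 3, "Stray line"),
])
```

The depth of the stack at the moment a title is recorded gives its level, so the key `2.1` marks the first second-level title under the second top-level title.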
As a preferred technical solution, performing sentence segmentation and word segmentation on the paragraph set to obtain the word segmentation list specifically comprises:
performing word segmentation on the current line T_i to obtain a list [w_1, …, w_i, …, w_n], where w_i is the character string of the i-th word, and inputting the list T_i = [w_1, …, w_i, …, w_n] into a Doc2vec model to obtain the vector representation Ti-V of the current line;
performing sentence segmentation and word segmentation on the current paragraph and adding an SEP label between sentences to obtain the word segmentation list P_i = [w_1, …, w_i, …, SEP, …, w_n];
learning the semantics within the paragraph based on the Transformer encoder, wherein the self-attention mechanism of the Transformer encoder captures the context information of each word in the sentence, and the input word segmentation list passes through the self-attention module to yield weighted feature vectors forming a vector matrix;
obtaining from the CRF model the conditional probability that each sample is output as the corresponding label, and outputting the list of labels corresponding to the vector matrix.
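The construction of the word segmentation list with SEP labels can be sketched as follows. Whitespace splitting is used here as a stand-in for a real Chinese word segmenter, and the sentence-boundary pattern is an assumption; the Transformer encoder and CRF stages are not shown.

```python
import re

# A sentence ends at Chinese sentence punctuation, at ASCII punctuation
# followed by whitespace or end of text, or at the end of the paragraph.
SENTENCE = re.compile(r".+?(?:[。！？；]|[.!?;](?=\s|$)|$)")


def segment_paragraph(paragraph):
    """Return the token list of a paragraph with 'SEP' between sentences."""
    sentences = [s.strip() for s in SENTENCE.findall(paragraph)]
    sentences = [s for s in sentences if s]
    tokens = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            tokens.append("SEP")
        # Stand-in tokenizer: split on whitespace.
        tokens.extend(sentence.split())
    return tokens


p_i = segment_paragraph(
    "Company A pledged 1.5 million shares. The pledgee is Bank B."
)
```

The lookahead keeps the decimal point in `1.5` from being treated as a sentence boundary.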
As a preferred technical solution, the spliced vector matrix is input into a Transformer encoder to obtain a vector matrix that integrates the title, title level and paragraph semantics, expressed as E = [e_1, e_2, …, e_i, …, e_n], where e_i is the vector representation of the i-th event argument after the title and level information has been merged in;
the output layer of the Transformer encoder is connected to an event category classifier to classify event categories and obtain the event category triggered by the current paragraph, the label of the triggered event category being the one with the maximum probability; the vector matrix E = [e_1, e_2, …, e_i, …, e_n] is input into a softmax classifier, which calculates
[Formula image BDA0004019269590000071 is not reproduced in this text]
wherein W and b represent learnable parameter matrices and k represents the number of event category labels;
cross entropy is used to measure the information difference between the true event category label and the prediction output by the softmax classifier, the loss function is defined as the average cross entropy, and W and b are randomly initialized and updated by gradient descent.
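Since the formula image is not available in this text, the following sketch shows only a standard softmax classifier with average cross-entropy loss that is consistent with the symbols W, b and k defined here; it is not a reconstruction of the patent's formula. It assumes that the matrix E has already been pooled into a single feature vector `x`, and the numeric values are made up.

```python
import math


def linear(W, b, x):
    """Compute W x + b for a k-by-d matrix W, a length-k bias b and a length-d x."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def softmax(logits):
    """Numerically stable softmax over k event-category scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(batch_probs, labels):
    """Average of -log p(true label) over the batch."""
    return -sum(math.log(p[y]) for p, y in zip(batch_probs, labels)) / len(labels)


# Toy example with k = 3 event categories and d = 2 features (values are made up).
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
x = [2.0, 0.0]

probs = softmax(linear(W, b, x))
predicted = probs.index(max(probs))     # category with the highest probability
loss = mean_cross_entropy([probs], [0])
```

Gradient-descent updates of W and b would then reduce `loss` over the labeled training set.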
An event extraction system for financial statement notes comprises: a document acquisition unit, a chapter structure identification unit, an event argument identification unit, an event category classification unit and an event table filling unit;
the document acquisition unit is used for acquiring the PDF document of a financial report from a database file, converting the PDF document into a TXT document through data preprocessing, and matching the TXT document of the financial statement notes text using regular expressions in a knowledge base;
the chapter structure identification unit is used for identifying and labeling the titles, title levels and paragraphs of the TXT document of the financial statement notes text based on the knowledge base to obtain a title set and a paragraph set, representing the titles in the title set and their levels as vectors based on a Doc2vec model, and inputting them into the event category classification unit;
the event argument identification unit is used for performing sentence segmentation and word segmentation on the paragraph set to obtain a word segmentation list, learning the semantics within the paragraphs based on a Transformer encoder, inputting the vector matrix of the output layer of the Transformer encoder into a CRF model, and identifying and labeling the event arguments of the financial events in the financial statement notes to obtain vector representations of the event arguments;
the event category classification unit is used for splicing the vector representations of the words contained in the event arguments with the vector representations of the titles and their levels into a vector matrix, inputting the spliced vector matrix into a Transformer encoder to obtain a vector matrix that integrates the title, title level and paragraph semantics, inputting this vector matrix into an event category classifier to obtain the probabilities of all event categories for the current vector matrix, selecting the event category with the highest probability as the event category of the current vector matrix, looking up by index the event roles of the currently triggered event category in the predefined event information, and outputting the event roles in the predefined order to obtain vector representations of the event roles;
the event table filling unit is used for splicing the vector representation of the event argument, the vector representation of the event role and the memory vector that records the event argument filling process and inputting the result into a Transformer encoder, connecting the output layer of the Transformer encoder to a linear binary classifier to obtain the probability that the event argument fills the current event role, selecting the event argument whose probability meets a set value and filling it into the current event role of the event table, repeating the iteration until all event roles of the current event category are filled, and updating the memory vector with the filled event arguments and the vector representations of the event roles, to obtain all event records contained in the current paragraph.
A computer-readable storage medium stores a program which, when executed by a processor, implements the event extraction method for financial statement notes as claimed in any one of claims 1 to 8.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention uses regular expressions and a bigram language model to screen the titles in the financial statement notes text, uses a hierarchical stack to organize the hierarchical structure of the titles, and labels the titles with numeric codes, thereby constructing a mechanism for identifying and labeling the titles of the financial statement notes text and their levels. This addresses the problem that event extraction methods miss long-range, chapter-level context semantics when learning the features of financial statement notes, and improves the accuracy of the event extraction method on the event classification task.
(2) The invention uses a Doc2vec model to convert the contextual semantic features of the titles and their levels into vector representations, and uses the self-attention mechanism to convert the contextual semantic features of the words and sentences in the paragraphs into vector representations, which strengthens the ability of the event extraction method to mine paragraph context semantics in the financial statement notes text. Vector splicing simplifies the fusion of paragraph-level and chapter-level semantic information in the notes text and improves the efficiency of feature fusion.
(3) The invention uses a Transformer encoder to learn the features of the event arguments, the titles and their levels in the financial statement notes text, and the memory vector, wherein the memory vector records the classification results of the event arguments. A single-layer linear binary classifier is added on top to judge whether an event argument matches an event role. The Transformer encoder and the linear binary classifier thus convert the event table filling problem into a binary classification problem of whether an event argument matches the target event role, which handles the case where one event category has multiple event records and further improves the accuracy of event argument extraction.
Drawings
FIG. 1 is a flow chart of a method for event extraction of financial statement notes of the present invention;
FIG. 2 is a schematic view of a PDF document of a financial report of the present invention;
FIG. 3 is a schematic flow chart of labeling according to the present invention;
FIG. 4 is a schematic flow chart of the label pattern finishing of the present invention;
FIG. 5 is a schematic flow diagram of an event extraction system for financial statement notes of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
The embodiment provides an event extraction method for financial statement notes, which comprises the following steps:
S1: Acquire a PDF document of a financial report from the database file, convert the PDF document into a TXT document through data preprocessing, and delete all text that is not part of the financial statement notes;
S11: Obtain the PDF document of the financial report from the database file using the pdfplumber library in Python, as shown in FIG. 2;
S12: Convert the PDF document into a TXT document using the extract_text function of the page module in the pdfplumber library, delete headers, page numbers, tables and special symbols, and match the financial statement notes text using the regular expressions in the knowledge base file.
(1) Deleting the header: the PDF document is read page by page with the extract_text function of the page module in the pdfplumber library. The first row of each page is treated as the header and is not stored; the remaining rows are stored row by row in the TXT document.
(2) Deleting page numbers: the PDF document is read page by page with the extract_text function of the page module in the pdfplumber library. The last row of each page is treated as the page number and is not stored.
(3) Deleting tables: a list LT is established to store the text strings of table rows. The text of all tables is extracted from the PDF document with the extract_tables function of the page module in the pdfplumber library; the values in the tables are converted to strings and stored in LT row by row. Each row of the TXT document is then traversed and compared with the values in LT using the Python comparison operator "==". If the string of the current row equals a value in LT, the row is deleted; otherwise it is retained.
(4) Deleting special symbols: a list LS is established to store the Unicode codes of Chinese characters, digits, English letters, and Chinese and English punctuation marks. Each row of the TXT document is traversed, and the search function of the re library in Python is used to judge whether the character strings in the current row exist in LS; those that exist are retained and the rest are deleted.
(5) Matching the financial statement notes text: the regular expressions obtained from the knowledge base file are start_line = [[company|own company|enterprise|group|present group] basic condition] and end_line = [[standby file]], where [ ] denotes a set, | denotes a union, and all other characters stand for themselves. start_line is the regular expression for the first line of the financial statement notes text and end_line is the regular expression for its last line. Non-notes text in the financial report, such as "important prompts, catalogues and paraphrases" and "company profile and main financial index", is deleted using these regular expressions and the search function of the re library. The search function is called as re.search() and takes three parameters: the first is the matching rule, usually a regular expression; the second is the target string; the third is the matching mode, which defaults to 0 and can be omitted. When the target string contains the rule expressed by the regular expression, re.search() returns a match object; otherwise it returns None. The TXT document from which headers, page numbers, tables and special symbols have been deleted is traversed with the current line held as the string line. Lines are kept starting from the first line for which re.search(start_line, line) is not None, i.e. the line matches start_line, and the traversal stops at the first line for which re.search(end_line, line) is not None, yielding the text data of the financial statement notes.
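The table-row deletion in (3) and the start/end matching in (5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the English patterns START_LINE and END_LINE are hypothetical stand-ins for the knowledge base expressions, the helper names are invented for the example, and excluding the end line itself from the result is an assumption.

```python
import re

# Hypothetical English stand-ins for the start_line / end_line patterns
# that the method reads from the knowledge base file.
START_LINE = re.compile(r"(company|group|enterprise) basic (condition|information)", re.IGNORECASE)
END_LINE = re.compile(r"standby file", re.IGNORECASE)


def drop_table_rows(lines, table_rows):
    """Remove lines whose text equals a row already extracted from a table (list LT)."""
    lt = set(table_rows)
    return [line for line in lines if line not in lt]


def extract_notes(lines, start=START_LINE, end=END_LINE):
    """Keep lines from the first start match up to (not including) the first end match."""
    kept, inside = [], False
    for line in lines:
        if not inside:
            if start.search(line):
                inside = True
                kept.append(line)
        elif end.search(line):
            break
        else:
            kept.append(line)
    return kept


sample = [
    "Important prompts, catalogues and paraphrases",
    "I. Company basic information",
    "The company was established in 1980.",
    "Item A 100 200",
    "XX. Standby file catalogue",
]
notes = extract_notes(drop_table_rows(sample, ["Item A 100 200"]))
```

Here notes holds only the two lines between the start and end markers, with the table row removed.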
S2: Identify and label the titles, levels and paragraphs in the TXT document of the financial statement notes, using the disclosure features of the notes to help the event extraction system capture global information about the document;
S21: Obtain the TXT document of the financial statement notes text, and obtain from the knowledge base the regular expressions for identifying titles and the label pattern set of titles. The rules are as follows:
(1) The line contains Chinese characters; Chinese characters are matched by their Unicode codes.
(2) A label pattern is present. The label pattern set is [({0}) | ({1}) | {0}, | {1}, | {0}) | {1}) | ({0}), | ({1}), | Section {0} | Section {1} | Chapter {0} | Chapter {1}], where [ ] denotes a set, | denotes a union, {0} stands for any Chinese numeral such as "one" or "two", and {1} stands for any positive integer written in Arabic digits such as "1" or "2".
(3) No sentence features: the line does not contain sentence punctuation such as "。", "！", "？" or "；".
(4) No data features: the line does not contain long numbers, although it may contain years.
(5) The line does not end with a conjunction such as "and" or "or".
Lines satisfying all 5 rules above are preliminarily identified as titles and are labeled with the symbol "S" at the beginning of the line. Rules (1) and (2) are expressed as regular expressions stored in a list L1; rules (3), (4) and (5) are expressed as regular expressions stored in a list L2. Whether the current line satisfies the 5 rules is judged with the search function of the re library in Python. The TXT document of the financial statement notes is traversed with the current line held as the string line. When re.search(L1, line) is not None and re.search(L2, line) is None, the current line is labeled as a title by adding the symbol S at the beginning of the line, and the labeled lines are stored in the TXT document.
S22: Traverse the TXT document of the financial statement notes text, obtain the binary statistical language model from the model library file, and input each line labeled S into the trained model to judge whether the current line is a sentence with complete semantics. If so, the symbol S is retained; if not, the symbol S at the beginning of the line is deleted, giving accurate candidate titles.
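The five title rules can be illustrated with a small sketch. The regular expressions below are simplified assumptions written for this example (for instance, a label is taken to be an optional opening bracket, a Chinese or Arabic number, and one closing symbol, and "long number" is taken to mean five or more digits); they are not the expressions stored in the knowledge base.

```python
import re

CN_NUM = "一二三四五六七八九十百"

# L1-style rules (must match): a label pattern at the start of the line,
# and at least one Chinese character somewhere in the line.
LABEL = re.compile(rf"^[（(]?([{CN_NUM}]+|\d+)[）)、.．]")
HAS_CHINESE = re.compile(r"[\u4e00-\u9fa5]")

# L2-style rules (must not match): sentence punctuation, a long number
# (assumed here to be 5+ digits, so 4-digit years pass), or a trailing conjunction.
NOT_TITLE = re.compile(r"[。！？；]|\d{5,}|[和与或及]$")


def mark_title(line):
    """Prefix the line with 'S' when it satisfies all title rules."""
    is_candidate = (
        LABEL.search(line)
        and HAS_CHINESE.search(line)
        and not NOT_TITLE.search(line)
    )
    return "S" + line if is_candidate else line


marked = [mark_title(x) for x in ["一、公司基本情况", "本公司于1992年成立。"]]
```

Only the first sample line is labeled; the second has no label pattern and contains sentence punctuation.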
The binary (bigram) statistical language model in the model library assumes that the current line $W$ is composed of the words $w_1, \dots, w_i, \dots, w_n$ and that $W$ occurs with probability $P(W)$. By the conditional probability formula, $P(W)$ can be written as the product of the conditional probabilities of all its words. Under the Markov assumption, the probability of word $w_i$ depends only on $w_{i-1}$, so:

$$P(W) = P(w_1)\prod_{i=2}^{n} P(w_i \mid w_{i-1}) \qquad (1)$$

Under this assumption, computing $P(W)$ only requires the frequency of each single word and the frequency with which two adjacent words occur together. Applying maximum likelihood estimation gives:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} \qquad (2)$$

where $\mathrm{count}(w_{i-1}, w_i)$ is the frequency with which $w_{i-1}$ and $w_i$ occur adjacently in the corpus, and $\mathrm{count}(w_{i-1})$ is the frequency of $w_{i-1}$ in the corpus.
The invention stores, in list form in the model library, the word frequencies of 33872 words and the frequencies of their pairwise combinations. To judge whether the current line is a sentence with complete semantics, the following is done: the accounting term vocabulary is obtained from the knowledge base as the word segmentation dictionary; the current line W is segmented into w_1, …, w_i, …, w_n using the precise mode of the Python-based jieba Chinese word segmentation component; the word frequencies count(w_1), count(w_2), …, count(w_n) and the frequencies of the adjacent word pairs count(w_1, w_2), count(w_2, w_3), …, count(w_{n-1}, w_n) are looked up in the model library and substituted into formula (2) to obtain P(w_i | w_{i-1}); and P(w_i | w_{i-1}) is substituted into formula (1) to obtain P(W). When P(W) ≥ 0.8 the symbol S is retained; otherwise the symbol S at the beginning of the line is deleted.
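A minimal sketch of the bigram estimate described above, using only counted frequencies. The function names are hypothetical, the toy corpus replaces the stored frequency lists, and estimating P(w_1) as the relative unigram frequency is an assumption that the text does not spell out.

```python
from collections import Counter


def train_bigram(corpus):
    """Count single-word and adjacent-pair frequencies over tokenized lines."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def line_probability(words, unigrams, bigrams):
    """P(W) = P(w1) * product of count(w_{i-1}, w_i) / count(w_{i-1})."""
    total = sum(unigrams.values())
    if not words or total == 0:
        return 0.0
    # Assumption: P(w1) is the relative frequency of the first word.
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p


toy_corpus = [["a", "b", "c"], ["a", "b", "d"]]
uni, bi = train_bigram(toy_corpus)
p_seen = line_probability(["a", "b", "c"], uni, bi)    # 2/6 * 2/2 * 1/2
p_unseen = line_probability(["a", "c"], uni, bi)       # pair never observed
```

A line whose adjacent word pairs were never observed gets probability 0 under plain maximum likelihood, which is why the stored pair frequencies matter for the threshold test.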
S23: As shown in FIG. 3, traverse the TXT document of the financial statement notes text, collect the lines labeled S, and sort out the label sequence of the titles. Obtain the label pattern set from the knowledge base, in which each label pattern corresponds to a unique code; construct a hierarchical stack to record the title level corresponding to each label; and label each title with a 10-digit numeric code containing the title level information. Specifically:
(1) Construct the dictionary TP. TP contains several key-value pairs, each consisting of one key and one value joined by a colon, such as {A: "(one)"}. The key is the code of a label pattern and ranges over the capital letters A-Z; the value is a label pattern and ranges over the label pattern set. Keys and values correspond one to one. Construct the hierarchical stack to ensure that sequence numbers are continuous and the hierarchy is correct. Construct the dictionary T to store the titles and their levels.
(2) Obtain the label pattern and sequence number of the current line with the match function of the re library in Python. The match function is called as re.match() and takes three parameters: the first is the matching rule, usually a regular expression; the second is the target string; the third is the matching mode, which defaults to 0. When the beginning of the target string matches the rule expressed by the regular expression, re.match() returns a match object from which the label can be read; otherwise it returns None. The regular expression of the label pattern is pattern = "symbol + Chinese number/Arabic number + symbol", where the symbols are brackets and separators such as "(", ")" and ","; a symbol may be absent, and the label is 1-5 characters long. With the current line held as the string line, re.match(pattern, line) is called. When the result is not None, the Chinese or Arabic number in the current label is obtained and converted to numeric form as the sequence number n of the current title; when the result is None, the current title is deleted and processing continues with the next title.
(3) The label pattern obtained from re.match(pattern, line) is compared with all label patterns in the dictionary TP using the Python comparison operator "==". If a pattern of the same kind exists, the code that is its key is returned. If no such pattern exists, the sequence number n is checked: if n is 1, the current label pattern is recorded as a new value in TP and given a new key; if n is not 1, the current title is deleted and processing continues with the next title.
(4) The code and sequence number of the label pattern are pushed to the hierarchical stack. First check whether the stack is empty. If it is empty, check whether the element to be pushed is a title with sequence number 1; if so, it is pushed. If the stack is not empty, obtain the stack-top element, compare the codes of the element to be pushed and the stack-top element, determine the relationship between the titles from their codes and sequence numbers, and push the element accordingly. There are three kinds of relationship: a child-level title, for which the code must follow on from the stack top and the sequence number must be 1; a same-level title, for which the sequence numbers must be continuous; and a parent-level title, for which the most recent title of the same level is searched from the stack top and sequence number continuity is judged against that title.
(5) The codes and sequence numbers of the titles in the hierarchical stack are output in order as keys, and the title contents are stored as values in the dictionary T, completing the extraction of the titles and their hierarchy.
(6) The titles and their hierarchy in the dictionary T are expressed as numeric codes, and the labeled titles are saved in the TXT document to provide the data basis for paragraph labeling. For example, when the current title is a first-level title with sequence number 2, it is labeled 0200000000. When the current title is a third-level title with sequence number 2, the sequence numbers of its parent titles are looked up; if the second-level parent title is 1 and the first-level parent title is 2, the current title is labeled 0201020000.
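The stack logic in steps (4) to (6) can be sketched as below. This is an illustrative simplification under stated assumptions: each title arrives as a (pattern_code, sequence_number) pair, the 10-digit code is assumed to hold four title levels of two digits each followed by "00" for a title, and rejected titles are returned as None rather than deleted. The function name encode_titles is invented for the example.

```python
def encode_titles(titles, max_levels=4):
    """Assign a 10-digit code to each (pattern_code, number) title, or None if rejected."""
    stack, codes = [], []
    for style, number in titles:
        styles = [s for s, _ in stack]
        if style in styles:
            # Same-level or parent-level title: numbering must continue
            # from the most recent title that uses the same label pattern.
            depth = styles.index(style)
            if number != stack[depth][1] + 1:
                codes.append(None)
                continue
            stack = stack[:depth] + [(style, number)]
        else:
            # Child-level title with a new label pattern: must start at 1.
            if number != 1 or len(stack) >= max_levels:
                codes.append(None)
                continue
            stack.append((style, number))
        digits = "".join(f"{n:02d}" for _, n in stack)
        # Pad unused levels with zeros; the final "00" marks a title line.
        codes.append(digits.ljust(2 * max_levels, "0") + "00")
    return codes


sequence = [("A", 1), ("A", 2), ("B", 1), ("C", 1), ("C", 2), ("A", 3), ("B", 2)]
title_codes = encode_titles(sequence)
```

In this run the fifth title receives 0201020000, matching the third-level example in (6), and the last title is rejected because pattern B must restart at 1 under the new first-level title.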
S24: The text between titles in the TXT document is arranged one paragraph per line, and each paragraph is labeled with a 10-digit numeric code placed before it. The meaning of the code is shown in FIG. 4, and the title level corresponding to a paragraph is obtained from the label of its title. Based on the titles and their hierarchy from step S23, the paragraphs of the text are divided again using Chinese punctuation marks and sequence number labels, in the following steps:
(1) Determine whether a line that is not a title ends with one of [。；！？]. Traverse the TXT document, take the last character of the current line, and compare it with each of [。；！？] using a loop and the comparison operator "==" in Python. If none is equal, the current line is kept and the next line is read, repeating the operation. If one is equal, the sentence ends with one of these marks: all previously kept lines are joined into one paragraph with the string concatenation operator "+" in Python, a 10-digit numeric paragraph code is added before the paragraph, and reading resumes with the next line.
(2) On the basis of step (1), determine whether a line that is not a title begins with a sequence number pattern such as "(1)" or "1,". If so, the text between one sequence number pattern and the next is merged into one paragraph. The sequence number patterns are stored in a list LD. The TXT document is traversed, the first five characters of the current line are taken as line_five, and re.match(LD, line_five) is used to judge whether the current line begins with a sequence number pattern. When re.match(LD, line_five) is not None, the current line begins with a sequence number pattern, a paragraph code is added before the line, and the next line is read, repeating the operation.
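Step (1) can be sketched as a simple line merger. The sketch assumes that the paragraph code reuses the first eight digits of the enclosing title's code and numbers paragraphs from 01 in the last two digits, which is inferred from the labeling examples rather than stated as a rule; the function name and the space after the code are choices made for the example.

```python
END_MARKS = ("。", "；", "！", "？")


def merge_paragraphs(lines, title_code):
    """Join non-title lines until one ends with a sentence mark, then prefix a paragraph code."""
    paragraphs, buffer = [], ""
    prefix = title_code[:8]  # assumed: the title-level digits are inherited

    def flush(text):
        paragraphs.append(f"{prefix}{len(paragraphs) + 1:02d} {text}")

    for line in lines:
        buffer += line.strip()
        if buffer.endswith(END_MARKS):
            flush(buffer)
            buffer = ""
    if buffer:
        # Trailing text without a closing mark is kept as its own paragraph.
        flush(buffer)
    return paragraphs


merged = merge_paragraphs(
    ["本公司的前身为", "中外合资经营企业。", "1992年改组；", "未结束的行"],
    "0100000000",
)
```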
In this embodiment, an example of labeling results is given as follows:
[0100000000 I. Basic information of the Company
0100000001 China International Marine Containers (Group) Co., Ltd. (hereinafter "the Company"), formerly China International Marine Containers Co., Ltd., is a Sino-foreign joint venture established by China Merchants Steam Navigation Co., Ltd., the East Asiatic Company of Denmark and Ocean Containers Inc. of the United States.
0100000002 In 1992, with the approval of the Shenzhen Municipal Government Office in document Shenfuban [1992] No. 1736 and of the Shenzhen Special Economic Zone Branch of the People's Bank of China in document Shenrenyinfuzi (1992) No. 261, the Company was reorganized, with its original legal-person shareholders as promoters, into a joint stock limited company established by directed placement, and was renamed China International Marine Containers Co., Ltd.
......
0200000000 II. Main accounting policies and accounting estimates
......
0201000000 1. Basis of preparation of the financial statements
......]
S3: Represent the semantic features of the paragraphs, the titles and their hierarchy as vectors. On the basis of incorporating knowledge of event arguments, splice the vector representations of the words contained in event arguments with the vector representations of the titles and their hierarchy into a vector matrix, providing the data basis for event category classification and event table filling. The specific steps are:
S31: Input the TXT document, match the first 10 digits of each line, and divide each document into a title set and a paragraph set according to the code. The 9th and 10th characters of the current line are extracted in Python and compared with "00" using the comparison operator "=="; if they are equal, the current line is assigned to the title set, and otherwise to the paragraph set;
specific examples are:
Title set: ["0100000000 I. Basic information of the Company", "0200000000 II. Main accounting policies and accounting estimates", "0201000000 1. Basis of preparation of the financial statements", "1900000000 XIX. Supplementary information to the financial statements", "1901000000 1. Statement of non-recurring profit and loss", "1902000000 2. Return on net assets and earnings per share"]
Paragraph set: ["0100000001 China International Marine Containers (Group) Co., Ltd. (hereinafter "the Company"), formerly China International Marine Containers Co., Ltd., is a Sino-foreign joint venture established by China Merchants Steam Navigation Co., Ltd., the East Asiatic Company of Denmark and Ocean Containers Inc. of the United States.", "1901000002 According to the CSRC's Explanatory Announcement No. 1 on Information Disclosure by Companies Offering Securities to the Public: Non-recurring Profit and Loss [2008], non-recurring profit and loss refers to profit and loss arising from transactions and events that are not directly related to the Company's normal business, or that are related to normal business but, owing to their special and occasional nature, affect the ability of report users to correctly judge the Company's operating performance and profitability."]
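The split in S31 reduces to one slice comparison per line. A minimal sketch, with a hypothetical function name and shortened sample lines:

```python
def split_title_paragraph(lines):
    """Send a line to the title set when its 9th and 10th characters are '00'."""
    titles, paragraphs = [], []
    for line in lines:
        # line[8:10] is the 9th and 10th character (0-based slicing).
        (titles if line[8:10] == "00" else paragraphs).append(line)
    return titles, paragraphs


coded_lines = [
    "0100000000 I. Basic information of the Company",
    "0100000001 The Company is a Sino-foreign joint venture.",
    "0201000000 1. Basis of preparation of the financial statements",
]
title_set, paragraph_set = split_title_paragraph(coded_lines)
```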
S32: Obtain the trained Doc2Vec model from the model library file, obtain the accounting term vocabulary from the knowledge base file as the word segmentation dictionary, segment the title set of each document, and input the segmented title set into the Doc2Vec model to obtain the vector representation of each title line, giving the vector list T_V of all titles;
A specific example of the vector list T_V is:
T_V=[[0.67,2.8,0.46,-1.03,1.57,-1.57......-0.34,-0.71,0.78],[0.04,2.11,0.13,0.07,1.24,-0.24......0.74,-0.7,0.06],......[-0.47,0.44,-0.08,0.15,0.15,0,-0.3,0.31,0.14,0.18]]
(1) The current line T_i is segmented using the precise mode of the Python-based jieba Chinese word segmentation component, giving the list [w_1, …, w_i, …, w_n], where w_i is the string of the i-th word.
(2) Based on 29,280 financial statement notes title documents of listed companies, a Doc2Vec model integrating upper-level and lower-level title information is trained with the gensim.models.doc2vec library in Python and stored in the model library. The list T_i = [w_1, …, w_i, …, w_n] is input to the Doc2Vec model, and Doc2Vec.infer_vector(T_i) yields the vector representation T_i_V of the current line; the length of T_i_V is 16. The initialization parameters are Doc2Vec(min_count=1, window=5, vector_size=15, sample=1e-3, workers=4, hs=1, epochs=100), where min_count is the frequency threshold below which words are ignored, window is the maximum distance between the current word and the predicted word within a sentence, vector_size is the dimension of the feature vector, sample is the threshold for randomly downsampling high-frequency words, workers is the number of parallel training threads, hs selects the optimization method (hierarchical softmax when hs=1; negative sampling when hs=0), and epochs is the number of training iterations.
S33: Obtain the accounting term vocabulary from the knowledge base file as the word segmentation dictionary, and split the paragraph set of each document into sentences and words to obtain the word list of each paragraph. Obtain the trained event argument identification model from the model library file and input the word list of the paragraph into it to obtain the vector matrix P_i_V fused with event argument information. The vector matrix contains the feature vectors of all words in the paragraph, and the event arguments are labeled with BIO tags. Specifically:
(1) Using the precise mode of the Python-based jieba Chinese word segmentation component, the current paragraph is split into sentences and words, and SEP tags are added between sentences, giving the word list P_i = [w_1, …, w_i, …, SEP, …, w_n], where w_i is the string of the i-th word.
(2) P_i is input to the event argument identification model to obtain the vector matrix

$$P_i\_V = [\mathbf{v}_1, \dots, \mathbf{v}_i, \dots, \mathbf{v}_n]$$

where $\mathbf{v}_i$ is the feature vector of the i-th word. The label list P_i_L is also obtained. The label list consists of the tags of the event arguments, e.g. P_i_L = [o, o, o, …, B-Pledger, I-Pledger, I-Pledger, …, o, o], where o marks a word that is not part of an event argument, B-Pledger marks the start position in the sentence of an event argument whose event role is Pledger, and I-Pledger marks a subsequent position inside that event argument.
(3) The event argument identification model captures the context information of each word in the sentence with the self-attention mechanism of a Transformer encoder, and then classifies with a CRF model to predict the label of each word. The Transformer encoder consists of 6 encoding blocks. The input data is weighted by the self-attention module to obtain the feature vector

$$Z = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K and V are obtained from the input data P_i by linear transformation, and d_k is the number of columns of the Q and K matrices, i.e. the vector dimension. The word list P_i is first converted into embedding vectors, and then Q = W_Q P_i, K = W_K P_i and V = W_V P_i are obtained by linear transformation. W_Q, W_K and W_V are trainable parameter matrices that are randomly initialized before model training.
(4) Z is input to the feedforward neural network layer (FFN), which outputs the spatially compressed feature vector FFN(Z). The FFN has two layers; the activation function of the first layer is ReLU and that of the second layer is linear, expressed as FFN(Z) = max(0, ZW_1 + b_1)W_2 + b_2. The output of the last feedforward neural network layer is the vector matrix P_i_V. P_i_V is taken as the input of the CRF model to obtain the conditional probability that the output label sequence for the samples w_i is y:

$$P(y \mid x) = \frac{1}{Z(x)}\exp\!\left(\sum_{i,k}\lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} u_l\, s_l(y_i, x, i)\right)$$

where

$$Z(x) = \sum_{y}\exp\!\left(\sum_{i,k}\lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} u_l\, s_l(y_i, x, i)\right)$$

Here $t_k$ and $s_l$ are feature functions and $\lambda_k$ and $u_l$ are the corresponding weights. The feature functions are predefined rule functions, and $\lambda_k$ and $u_l$ are parameters obtained as the model is trained.
(5) According to the conditional probability P(y|x) output by the CRF model, the label list P_i_L corresponding to the vector matrix P_i_V is output.
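The self-attention weighting in step (3) can be illustrated with a dependency-free sketch of scaled dot-product attention. It uses the row-vector convention (one row per word) and takes Q, K and V as given inputs; the learned projections W_Q, W_K and W_V, multiple heads and the six stacked blocks are omitted, so this is not the full encoder.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Z = softmax(Q K^T / sqrt(d_k)) V, with one row per word."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores every key equally, so the output is the mean of the value rows.
z = attention([[0.0, 0.0]], keys, values)
```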
Specific examples are as follows:
assume that the current paragraph is:
0435000002: the company and Shenzhen capital operation group Limited company (hereinafter referred to as "Shenzhen capital group"), shenzhen energy group Limited company (hereinafter referred to as "Shenzhen energy group") and the central lease have signed the stock right transfer agreement; the hong Kong and Shenzhen capital group, tianjin Kai Ruikang enterprises management consultation partner enterprises (limited partner) (hereinafter referred to as Tianjin Kai Ruikang) in the whole-resource control subsidiary of the company have set lease in the Shang Kong and Shenzhen capital group, and the Tianjin Kai Ruikang enterprises management consultation partner enterprises have set lease in the Shang Kong Shengzhen protocol. "]
After sentence splitting:
0435000002: the company and Shenzhen capital operation group Limited company (hereinafter referred to as "Shenzhen capital group"), shenzhen energy group Limited company (hereinafter referred to as "Shenzhen energy group") and the central lease have signed the stock right transfer agreement; the Hongkong and Shenzhen capital groups, tianjin Kai Ruikang enterprises management consultation partner enterprises (limited partner) (hereinafter referred to as Tianjin Kai Ruikang) in the whole-resource control subsidiary of the company on the same day set lease and signed the "increase resource agreement". "]
After word segmentation:
0435000002: the company and Shenzhen capital operation group Limited company (hereinafter referred to as "Shenzhen capital group"), shenzhen energy group Limited company (hereinafter referred to as "Shenzhen energy group") and the central lease have signed the stock right transfer agreement; on the same day, hong Kong and Shenzhen capital group, tianjin Kai Ruikang Enterprise management consultation partner (limited partner) (hereinafter referred to as Tianjin Kai Ruikang) in the whole-resource control subsidiary of the own company, and the lease-collection in the Tianjin Kai Ruikang have signed a "fund-increasing agreement". "]]
Tag list:
P_i_L = [o, B-SigningTime, I-SigningTime, I-SigningTime, o, …, B-Terms, I-Terms, I-Terms, I-Terms, o]
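To show how a BIO tag list maps back to event arguments, the following sketch groups tagged words into (role, text) spans. The function name and the sample words are illustrative assumptions; the tag names follow the example above.

```python
def decode_bio(words, tags):
    """Group words into (role, text) spans; 'o' marks words outside any argument."""
    spans, role, buf = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if role:
                spans.append((role, "".join(buf)))
            role, buf = tag[2:], [word]
        elif tag.startswith("I-") and role == tag[2:]:
            buf.append(word)
        else:
            # An 'o' tag (or an I- tag without a matching B-) closes the open span.
            if role:
                spans.append((role, "".join(buf)))
            role, buf = None, []
    if role:
        spans.append((role, "".join(buf)))
    return spans


words = ["于", "2021年", "11月", "23日", "签署", "股权", "转让", "协议"]
tags = ["o", "B-SigningTime", "I-SigningTime", "I-SigningTime",
        "o", "B-Terms", "I-Terms", "I-Terms"]
arguments = decode_bio(words, tags)
```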
S34: The vector representation T_i_V of the title and its hierarchy and the vector representations P_i_V of the words contained in event arguments are spliced into a vector matrix PE_i. PE_i retains only the word vectors whose label is an event argument. The first 16 dimensions of every word vector in the same paragraph carry the feature information of the same title and its hierarchy; when a paragraph corresponds to titles at several levels, the title vectors of all levels are averaged before splicing. That is,

$$PE_i = [\,\bar{\mathbf{t}}_1 \oplus \mathbf{v}_1, \dots, \bar{\mathbf{t}}_i \oplus \mathbf{v}_i, \dots, \bar{\mathbf{t}}_n \oplus \mathbf{v}_n\,]$$

where $\bar{\mathbf{t}}_i$ is the average of the title vectors of all levels corresponding to $w_i$, $\mathbf{v}_i$ is the word vector corresponding to $w_i$, and $\oplus$ denotes concatenation.
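The splicing in S34 can be sketched with plain lists. The function name and the tiny two-dimensional vectors are illustrative assumptions; in the method the title part has 16 dimensions.

```python
def build_pe(title_vectors, word_vectors, tags):
    """Prepend the averaged title vector to every event-argument word vector."""
    n = len(title_vectors)
    # Average the title vectors of all levels, dimension by dimension.
    avg_title = [sum(col) / n for col in zip(*title_vectors)]
    # Keep only words whose tag is not 'o', i.e. words inside an event argument.
    return [avg_title + vec for vec, tag in zip(word_vectors, tags) if tag != "o"]


pe = build_pe(
    title_vectors=[[1.0, 3.0], [3.0, 5.0]],      # two title levels
    word_vectors=[[0.1], [0.2], [0.3]],
    tags=["o", "B-Terms", "I-Terms"],
)
```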
S4: Construct a Transformer encoder and a softmax classifier, and determine the event category by learning the features of the event arguments and of the titles and their hierarchy.
S41: The vector matrix PE_i, which fuses the event arguments with the titles and their hierarchy, is input to a Transformer encoder for feature learning. This encoder has the same structure as the Transformer encoder in the event argument identification model. The input PE_i passes through 6 encoding blocks, each formed by a self-attention mechanism and a feedforward neural network layer, and the output is the vector matrix E = [e_1, e_2, …, e_i, …, e_n] fused with event argument information and title and hierarchy information, i.e. the enhanced event argument vector matrix, where e_i is the vector representation of the i-th event argument after fusing the title and its hierarchy information.
S42: The output layer of the Transformer encoder is connected to an event classifier to classify event categories. The event category triggered by the current paragraph is the label with the maximum probability, expressed as follows:
max(softmax(WE+b))=max([0,0.03,0,…0.05,0.89])=0.89
label (31) = "S32 major contract"
The event classifier consists of a softmax classifier. Specifically, E = [e_1, e_2, …, e_i, …, e_n] is input to the softmax classifier, which computes

$$\hat{y} = \mathrm{softmax}(WE + b)$$

where $W$ and $b$ are learnable parameter matrices and k is the number of event category labels; 32 event categories are predefined, so k = 32. The softmax function is

$$\mathrm{softmax}(z)_c = \frac{e^{z_c}}{\sum_{j=1}^{k} e^{z_j}}$$

The output of the softmax classifier is a k-dimensional vector containing the confidence of every event category.
Cross entropy measures the information difference between the label $y_i$ of the true event type and the model prediction $\hat{y}_i$. The optimization objective is to make the cross entropy $H(y_i, \hat{y}_i)$ as small as possible, and the loss function is defined as the average cross entropy

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{SE} y_{ic}\log p_{ic}$$

where N is the number of samples, SE is the total number of event types, $y_{ic}$ is an indicator that equals 1 when the true type of sample i is c and 0 otherwise, and $p_{ic}$ is the predicted probability that sample i is of type c. W and b are randomly initialized, and the optimization variables are updated by gradient descent.
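The classification step and its loss can be illustrated in a few lines. This sketch starts from the logits (the values of WE + b), which are made-up numbers here; the label names other than "S32 major contract" are hypothetical placeholders, and only three categories are shown instead of 32.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the label with the highest softmax probability, and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def mean_cross_entropy(prob_rows, true_indices):
    """Average of -log p_ic over samples, where c is the true class of sample i."""
    losses = [-math.log(p[c]) for p, c in zip(prob_rows, true_indices)]
    return sum(losses) / len(losses)


labels = ["S01 placeholder", "S02 placeholder", "S32 major contract"]
label, confidence = classify([0.0, 1.0, 3.0], labels)
loss = mean_cross_entropy([[0.5, 0.5]], [0])
```

With one-hot true labels, the double sum in the loss collapses to the log probability of the true class, which is what mean_cross_entropy computes.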
S43: Obtain the event category file from the knowledge base file, obtain the predefined event roles that make up the current event category, and form the column names of the event table in the order of the predefined event roles.
The event categories and event roles can be customized according to the requirements of the extraction task. For example, the event role list of the current event category is expressed as:
Major contract: ["TransactionSubject", "TransactionObject", "SigningTime", "Terms", "ContractAmount", "State", "ProjectDurate"]
S5: Construct a Transformer encoder and a linear binary classifier, and determine whether each event argument is filled into the current event role by learning the features of the event arguments, the titles and their hierarchy, and the memory vector. The specific steps are:
S51: Traverse the event role list and initialize the vector representation ER of the event roles and the memory vector M. Select the first event role, vectorize its label, splice it with the vectorized representation E of all event arguments of the current paragraph, and splice the memory vector M onto the vector representation E of the event arguments to obtain the input vector matrix.
S52: The input vector matrix is fed to a Transformer encoder with the same structure as the Transformer encoder in the event argument identification model. After 6 encoding blocks, each formed by a self-attention mechanism and a feedforward neural network layer, the encoder outputs a vector matrix fused with the information of the event arguments, the titles and their hierarchy, and the memory vector. This output matrix is input to the event table filling unit, which yields for every event argument the probability that it can be filled into the current event role, e.g. [0, 1, 0, …, 0].
S53: The output layer of the Transformer encoder is connected to the linear binary classifier to determine, for each event argument, whether it is filled into the event role. The linear binary classifier is computed with the logistic regression formula, i.e.

$$p_i = \frac{1}{1 + e^{-(W\tilde{e}_i + b)}}$$

where $\tilde{e}_i$ is the encoder output for the i-th event argument, $p_i$ is the probability that the i-th argument can be filled into the current event role, and W and b are learnable parameter matrices. The loss function of the model is

$$L = -\sum_{i}\big[\,y_i\log p_i + (1 - y_i)\log(1 - p_i)\,\big]$$

where $y_i$ is an indicator function that equals 1 when the i-th argument belongs to the current event role and 0 otherwise. W and b are randomly initialized, and the optimization variables are updated by gradient descent.
S54: Update the memory vector with the filled event arguments and the vectorized representation of the event role.
S55: Repeat steps S52, S53 and S54 until all event roles of the current event category have been filled. Event roles without an event argument are filled with NULL values, giving all event records contained in the current paragraph.
Specific examples are as follows:
Event record 1: ["TransactionSubject": "the Company", "TransactionObject": "Shenzhen Capital Operation Group Co., Ltd. (hereinafter referred to as 'Shenzhen Capital Group'), Shenzhen Energy Group Co., Ltd. (hereinafter referred to as 'Shenzhen Energy Group'), Zhen Sunj lease", "SigningTime": "November 23, 2021", "Terms": "equity transfer agreement", "ContractAmount": "null", "State": "null", "ProjectDurate": "null"]
Event record 2: ["TransactionSubject": "hong Kong in this company and its full-resource control subsidiary", "TransactionObject": "Shenzhen Capital Group, Tianjin Kaikang Enterprise Management Consulting Partnership (Limited Partnership) (hereinafter referred to as 'Tianjin Kaikang')", "SigningTime": "November 23, 2021", "Terms": "capital increase agreement", "ContractAmount": "null", "State": "null", "ProjectDurate": "null"];
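The filling loop of steps S52 to S55 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function fill_event_table, the toy_scorer stand-in for the Transformer encoder plus linear binary classifier, the 0.5 threshold, and the representation of the memory as a list of 0/1 flags are all assumptions made for illustration, and only a single event record is filled.

```python
from typing import Callable, Dict, List, Optional


def fill_event_table(
    arguments: List[str],
    roles: List[str],
    fill_probability: Callable[[str, str, List[int]], float],
    threshold: float = 0.5,
) -> Dict[str, Optional[List[str]]]:
    """Fill every event role in its predefined order (steps S52 to S55)."""
    # memory[i] becomes 1 once argument i has been placed into some role.
    memory = [0] * len(arguments)
    record: Dict[str, Optional[List[str]]] = {}
    for role in roles:
        # S52/S53: score every argument against the current role.
        probs = [fill_probability(arg, role, memory) for arg in arguments]
        chosen = [i for i, p in enumerate(probs) if p >= threshold]
        # S54: update the memory for the arguments that were just filled.
        for i in chosen:
            memory[i] = 1
        # S55: a role without any matching argument is recorded as None (NULL).
        record[role] = [arguments[i] for i in chosen] or None
    return record


def toy_scorer(argument: str, role: str, memory: List[int]) -> float:
    """Hypothetical stand-in for the encoder and the linear binary classifier."""
    lookup = {
        "November 23, 2021": "SigningTime",
        "equity transfer agreement": "Terms",
    }
    return 1.0 if lookup.get(argument) == role else 0.0


roles = ["SigningTime", "Terms", "ContractAmount"]
arguments = ["November 23, 2021", "equity transfer agreement"]
event_record = fill_event_table(arguments, roles, toy_scorer)
print(event_record)
```

Because the scorer is passed in as a function, a trained model can replace toy_scorer without changing the loop; the memory list is handed to the scorer so that earlier filling decisions can influence later ones, as the memory vector does in the method.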
According to the present invention, the output layer of the Transformer encoder is connected to a CRF model to construct the event argument identification model, so that the event argument identification task is completed by combining the feature extraction capability of the Transformer encoder with the classification capability of the CRF model. Conventional event extraction models, when applied to financial statement notes text, cannot capture the chapter semantic information of the document. The invention therefore adds the titles and their hierarchy information before the event argument identification model as the chapter semantic information of the document, which improves the accuracy of event classification for financial statement notes. The Transformer encoder learns the features of the event arguments, the titles and their levels, and the memory vector; the memory vector records the classification results of the event arguments; and a linear binary classifier layer is added on top to judge whether an event argument matches an event role. By adopting the Transformer encoder and the linear binary classifier, the event table filling problem is converted into the binary classification problem of whether an event argument matches the target event role. This solves the problem that the same event category may have several event records and further improves the accuracy of the events extracted from financial statement notes.
Example 2
As shown in fig. 5, the present embodiment provides an event extraction system for financial statement notes, comprising a document acquisition unit, a chapter structure identification unit, an event argument identification unit, an event category classification unit and an event table filling unit.
The document acquisition unit serves as the interface of the event extraction system and preprocesses the documents on which the event extraction task is to be performed: it acquires the PDF document of a financial report from a database file, converts the PDF document into a TXT document through data preprocessing, and deletes the text that does not belong to the financial statement notes.
The chapter structure identification unit works on the output of the document acquisition unit: it identifies and labels the titles, title levels and paragraphs in the TXT document of the financial statement notes to obtain a title set and a paragraph set. A Doc2vec model is used to represent the titles in the title set and their levels as vectors, which are input into the event category classification unit. This unit helps the event extraction system capture the chapter semantic information of the document.
The event argument identification unit encodes the semantic features of each paragraph into vectors through the learning task of event argument identification. First, the paragraph set output by the chapter structure identification unit is split into sentences and segmented into words to obtain a word segmentation list. A Transformer encoder then learns the semantics within each paragraph, and the vector matrix of the output layer of the Transformer encoder is input into a CRF model, which identifies and labels the event arguments of the financial events in the financial statement notes and yields the vector representations of the event arguments.
The event category classification unit uses a vector splicing mechanism: it splices the vector representations of the words contained in the event arguments with the vector representations of the title and its hierarchy into a vector matrix, which provides the data basis for event category classification and event table filling. The spliced vector matrix is input into a Transformer encoder to obtain a vector matrix that fuses the title, the title hierarchy and the paragraph semantics. This vector matrix is input into an event category classifier to obtain the probability of every event category for the current vector matrix, and the event category with the highest probability is selected as the event category of the current vector matrix. The event roles of the currently triggered event category are then looked up by index in the predefined event information and output in the predefined order.
The event table filling unit is based on a Transformer encoder and a linear binary classifier. Its inputs are the vector representation e of the event arguments output by the event argument identification unit and the vector representation of the event roles output by the event category classification unit. So that one event role can be filled with more than one event argument while accuracy remains good, the system adds a memory vector m that records the filling process of the event arguments. The vectors of the event arguments, the memory vector and the event role are spliced and input into the Transformer encoder, whose output layer is connected to the linear binary classifier to obtain the probability that each event argument fills the current event role. The event arguments with probability 1 are filled into the current event role of the event table, and these operations are repeated in the predefined order of the event roles until all event roles are filled.
Example 3
The present embodiment provides a computer-readable storage medium, which may be a ROM, a RAM, a magnetic disk, an optical disk or a similar storage medium, and which stores one or more programs that, when executed by a processor, implement the event extraction method for financial statement notes of embodiment 1.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (10)

1. An event extraction method for financial statement notes, characterized by comprising the following steps:
acquiring a PDF document of a financial report from a database file, converting the PDF document into a TXT document through data preprocessing, and matching the TXT document of the financial statement notes text using regular expressions in a knowledge base;
identifying and labeling, based on the knowledge base, the titles, title levels and paragraphs of the TXT document of the financial statement notes text to obtain a title set and a paragraph set;
performing sentence segmentation and word segmentation on the paragraph set to obtain a word segmentation list, learning the semantics within the paragraphs with a Transformer encoder, inputting the vector matrix of the output layer of the Transformer encoder into a CRF model, and identifying and labeling the event arguments of the financial events in the financial statement notes to obtain vector representations of the event arguments;
splicing the vector representations of the words contained in the event arguments with the vector representations of the title and its hierarchy into a vector matrix, inputting the spliced vector matrix into a Transformer encoder to obtain a vector matrix that fuses the title, the title hierarchy and the paragraph semantics, inputting this vector matrix into an event category classifier to obtain the probability of every event category for the current vector matrix, selecting the event category with the highest probability as the event category of the current vector matrix, looking up by index the event roles of the currently triggered event category in the predefined event information, and outputting the event roles in the predefined order to obtain vector representations of the event roles;
constructing a memory vector for recording the event argument filling process; splicing the vector representation of the event arguments, the vector representation of the event role and the memory vector and inputting the result into a Transformer encoder; connecting the output layer of the Transformer encoder to a linear binary classifier to obtain the probability that each event argument fills the current event role of the event table; selecting the event arguments whose probability equals a set value and filling them into the current event role of the event table; and iterating until all event roles of the current event category are filled, updating the memory vector of the filled event arguments and the vectorized representation of the event role, to obtain all event records contained in the current paragraph.
2. The method for event extraction of financial statement notes of claim 1 wherein the data preprocessing includes deleting headers, page numbers, tables and special symbols.
3. The event extraction method for financial statement notes according to claim 1, wherein the TXT document of the financial statement notes text is matched using regular expressions in the knowledge base file, the regular expressions being expressed as:
start_line = [.*financial statement notes | .*[company | own company | company | group | own group] basic condition] and end_line = [.*review file];
wherein [ ] represents a regular expression set, .* represents any other character string, | separates selectable items, start_line represents the regular expression for the first line of the financial statement notes text, and end_line represents the regular expression for the last line of the financial statement notes text;
re.search() returns a match object when the target character string matches the rule of the regular expression, and returns None when it does not;
the preprocessed TXT document is traversed with the current line held as a character string line; when the return value of re.search(start_line, line) is not None, that is, the line matches the start_line expression, lines are retained from the current line onward until the return value of re.search(end_line, line) is not None, at which point the traversal stops and the TXT document of the financial statement notes text is obtained.
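As a non-limiting illustration of the traversal described in claim 3, the following Python sketch keeps the lines between the first match of start_line and the first match of end_line. The English patterns, the function name extract_notes_text and the choice to exclude the end line itself are assumptions made for illustration; the actual patterns are stored in the knowledge base and would target the Chinese source text.

```python
import re
from typing import List

# Assumed English stand-ins for the knowledge-base patterns.
START_LINE = r".*notes to the financial statements|.*(company|group) basic information"
END_LINE = r".*documents available for inspection"


def extract_notes_text(lines: List[str]) -> List[str]:
    """Keep the lines from the first start_line match up to the end_line match."""
    kept: List[str] = []
    inside = False
    for line in lines:
        if not inside:
            # Look for the first line of the notes text.
            if re.search(START_LINE, line, re.IGNORECASE) is not None:
                inside = True
                kept.append(line)
            continue
        # Stop the traversal as soon as the end line is reached.
        if re.search(END_LINE, line, re.IGNORECASE) is not None:
            break
        kept.append(line)
    return kept


sample = [
    "Annual report 2021",
    "XII. Notes to the financial statements",
    "1. Company basic information",
    "The Company was founded ...",
    "XIII. Documents available for inspection",
    "Signed copies",
]
notes = extract_notes_text(sample)
print(notes)
```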
4. The event extraction method for financial statement notes according to claim 1, wherein identifying and labeling the titles, title levels and paragraphs of the TXT document of the financial statement notes text specifically comprises:
acquiring the TXT document of the financial statement notes text, acquiring from the knowledge base the regular expressions for identifying titles and the label pattern set of titles, and adding a mark symbol at the beginning of each line identified as a title;
traversing the TXT document of the financial statement notes text and judging, based on a bigram statistical language model, whether each line carrying the mark symbol is a semantically complete sentence; if so, keeping the mark symbol, otherwise deleting the mark symbol at the beginning of the line, to obtain candidate titles;
traversing the TXT document of the financial statement notes text, collecting the label patterns and label sequence numbers of the lines carrying the mark symbol, and loading the label pattern set of titles from the knowledge base, each label pattern corresponding to a unique code; constructing a hierarchical stack of labels to record the title level corresponding to each title, and labeling each title with a numeric code containing the title level information;
arranging the text between titles in the TXT document into one line per paragraph, marking each paragraph with a numeric code placed before it, the code carrying an encoded meaning, obtaining the title level corresponding to each paragraph from the label of its title, and re-dividing the paragraphs of the text based on the titles and their levels in combination with Chinese punctuation marks and sequence number labels.
5. The event extraction method for financial statement notes according to claim 4, wherein the regular expressions for identifying titles and the label pattern set of titles acquired from the knowledge base follow these rules:
first rule: the line contains Chinese characters, the Chinese characters being matched by their unicode encoding;
second rule: the line contains a label pattern such as a sequence number, the label pattern set being [ ({ 0 }) | ({ 1 }) | ({ 0 }) | ({ 1 }) | {0}, | {1}, | {0} {1} ({ 0 }) | ({ 1 }) | ({ 0 }), | (1), | (0), | (1), | (0), | (1), |0 < 1 > (1) th knot |1 < 1 > (0) th knot |1 < 1 > (1) th knot) ], wherein [ ] represents a set, | separates selectable items, {0} represents the numeral characters of all Chinese numerals, and {1} represents the digit characters of all positive integers;
third rule: the line has no sentence features, that is, it does not contain any of the symbols [。！？；];
fourth rule: the line has no data features, that is, it does not contain long numeric data, although it may contain a year;
fifth rule: the line does not end with a conjunction;
lines meeting the first to fifth rules are preliminarily identified as titles; the first and second rules are expressed as regular expressions and stored in a list denoted L1, and the third, fourth and fifth rules are expressed as regular expressions and stored in a list denoted L2; the TXT document of the financial statement notes text is traversed with the current line held as a character string line, and when the return value of re.search(L1, line) is not None and the return value of re.search(L2, line) is None, the current line is marked as a title, re.search being the search function of the re library in Python.
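The test of claim 5 can be sketched in Python as follows. The concrete patterns are assumptions: they cover only a small subset of the label pattern set, the conjunction list and the definition of long numeric data are illustrative, and each list is checked pattern by pattern rather than as one combined expression. Patterns in L1 must all match, while patterns in L2 describe violations of rules 3 to 5 and must not match.

```python
import re

CN_NUM = "一二三四五六七八九十"

# L1: patterns that must all match (rules 1 and 2).
L1 = [
    r"[\u4e00-\u9fa5]",  # rule 1: contains a Chinese character
    # rule 2: starts with a label such as 一、 （一） (1) 1. (assumed subset)
    rf"^\s*(（[{CN_NUM}]+）|\([{CN_NUM}]+\)|[{CN_NUM}]+、|（\d+）|\(\d+\)|\d+[、.])",
]

# L2: patterns that must not match (rules 3 to 5).
L2 = [
    r"[。！？；]",                 # rule 3: sentence punctuation
    r"\d{5,}|\d{1,3}(,\d{3})+",   # rule 4: long numeric data (a 4-digit year passes)
    r"(和|及|与|或|以及)\s*$",      # rule 5: ends with a conjunction
]


def is_title(line: str) -> bool:
    """Return True when the line passes rules 1 and 2 and violates none of rules 3 to 5."""
    return (
        all(re.search(p, line) is not None for p in L1)
        and all(re.search(p, line) is None for p in L2)
    )


for candidate in ["一、公司基本情况", "1、本公司于2021年11月23日签订协议。"]:
    print(candidate, is_title(candidate))
```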
6. The event extraction method for financial statement notes according to claim 4, wherein constructing a hierarchical stack of labels to record the title level corresponding to each title and labeling each title with a numeric code containing the title level information specifically comprises:
constructing a dictionary TP comprising a plurality of key-value pairs, each key-value pair consisting of a key and a value joined by a colon, wherein the key is the code of a label pattern and the value is the label pattern, keys and values being in one-to-one correspondence; and constructing a dictionary T for storing titles and their label levels;
acquiring the label pattern and sequence number of the current line and comparing the acquired label pattern with all label patterns in the dictionary TP; if a similar pattern exists, returning the key corresponding to that value, namely its code; if no similar pattern exists, judging whether the acquired sequence number n is 1: if the sequence number is 1, recording the current label pattern as a value in the dictionary TP and assigning it a new key, and if the sequence number is not 1, deleting the current title;
inputting the code and sequence number of the label pattern into the hierarchical stack and judging whether the stack is empty; if the stack is empty, judging whether the element to be pushed is a title with sequence number 1 and, if so, pushing it; if the stack is not empty, acquiring the top element of the stack, comparing the codes of the element to be pushed and the top element, judging the order relation between the titles according to the codes and sequence numbers, and pushing the element according to the relation between the titles;
outputting in order the codes and sequence numbers of the titles in the hierarchical stack as keys and storing the contents of the titles as values in the dictionary T, to complete the extraction of the titles and their levels;
representing the titles and their levels in the dictionary T by numeric codes, labeling the current title accordingly and saving it to the TXT file.
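One possible reading of the hierarchical stack in claim 6 is sketched below in Python. The sketch assumes that every title has already been reduced to a label pattern code (a key of the dictionary TP) and a sequence number, and it assumes a specific push rule that the claim does not spell out: a title whose pattern is already in the stack is a sibling and must continue the numbering, while a title with an unseen pattern opens a deeper level and must have sequence number 1. The dotted keys such as "1.2" are likewise an illustrative choice for the numeric code.

```python
from typing import Dict, List, Tuple


def assign_levels(titles: List[Tuple[int, int, str]]) -> Dict[str, str]:
    """Map a dotted level code such as "1.2" to the title text."""
    stack: List[Tuple[int, int]] = []  # (label pattern code, sequence number)
    T: Dict[str, str] = {}
    for pattern_code, seq, text in titles:
        codes_in_stack = [code for code, _ in stack]
        if pattern_code in codes_in_stack:
            depth = codes_in_stack.index(pattern_code)
            if seq != stack[depth][1] + 1:
                continue  # numbering is out of order: drop the candidate title
            # Sibling of an earlier title: discard deeper levels, then advance.
            del stack[depth:]
            stack.append((pattern_code, seq))
        elif seq == 1:
            # First title of a new, deeper level (or of an empty stack).
            stack.append((pattern_code, seq))
        else:
            continue  # unseen pattern that does not start at 1: drop it
        T[".".join(str(n) for _, n in stack)] = text
    return T


# Pattern code 0 could stand for "{0}、" and 1 for "（{1}）" in the dictionary TP.
titles = [
    (0, 1, "Company profile"),
    (1, 1, "History"),
    (1, 2, "Business scope"),
    (0, 2, "Accounting policies"),
    (2, 3, "Stray line"),
    (1, 1, "Accounting period"),
]
T = assign_levels(titles)
print(T)
```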
7. The event extraction method for financial statement notes according to claim 1, wherein performing sentence segmentation and word segmentation on the paragraph set to obtain a word segmentation list specifically comprises:
segmenting the current line T_i into words to obtain a list [w_1, …, w_i, …, w_n], where w_i represents the character string of the i-th word, and inputting the list T_i = [w_1, …, w_i, …, w_n] into a Doc2vec model to obtain the vector representation T_i-V of the current line;
performing sentence segmentation and word segmentation on the current paragraph and adding an SEP label between sentences to obtain the word segmentation list P_i = [w_1, …, w_i, …, SEP, …, w_n];
learning the semantics within the paragraph with the Transformer encoder: the self-attention mechanism of the Transformer encoder captures the context information of each word in the sentence, and the input word segmentation list passes through the self-attention module to obtain weighted feature vectors, yielding a vector matrix;
obtaining from the CRF model the conditional probability that each sample is output as the corresponding label, and outputting the label list corresponding to the vector matrix.
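The last step of claim 7 outputs one label per word. The claim does not name a decoding algorithm; a common choice for a linear-chain CRF is Viterbi decoding, sketched below in plain Python. The label set (O, B-ARG, I-ARG), the emission scores that stand in for the Transformer output, and the transition scores are all made-up values for illustration.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain CRF."""
    # best[t][y]: best score of any label path that ends in label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-ARG", "I-ARG"]
emissions = [  # one score per label for each of four words (assumed values)
    {"O": 2.0, "B-ARG": 0.5, "I-ARG": 0.1},
    {"O": 0.3, "B-ARG": 1.5, "I-ARG": 1.4},
    {"O": 0.2, "B-ARG": 0.4, "I-ARG": 1.6},
    {"O": 0.5, "B-ARG": 0.3, "I-ARG": 1.2},
]
transitions = {  # transitions[previous][current]; O -> I-ARG is strongly penalized
    "O": {"O": 0.0, "B-ARG": 0.0, "I-ARG": -10.0},
    "B-ARG": {"O": 0.0, "B-ARG": -1.0, "I-ARG": 1.0},
    "I-ARG": {"O": 0.0, "B-ARG": -1.0, "I-ARG": 1.0},
}
tags = viterbi_decode(emissions, transitions, labels)
print(tags)
```

The transition scores are what let the CRF reject label sequences that are locally plausible but globally invalid, such as an I-ARG label that directly follows O.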
8. The event extraction method for financial statement notes according to claim 1, wherein the spliced vector matrix is input into the Transformer encoder to obtain a vector matrix that fuses the title, the title hierarchy and the paragraph semantics, the vector matrix being expressed as E = [e_1, e_2, …, e_i, …, e_n], where e_i is the vector representation of the i-th event argument after the title and hierarchy information has been fused in;
the output layer of the Transformer encoder is connected to an event classifier to classify the event category and obtain the event category triggered by the current paragraph, the label of the triggered event category being the one with the maximum probability; the vector matrix E = [e_1, e_2, …, e_i, …, e_n] is input into a softmax classifier, which calculates
Figure FDA0004019269580000051
where W and b represent learnable parameter matrices and k represents the number of event category labels;
cross entropy is used to measure the information difference between the true event category label and the prediction output by the softmax classifier, the loss function is defined as the average cross entropy, W and b are randomly initialized, and they are updated by gradient descent.
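The classification step of claim 8 can be illustrated with a small Python sketch of a softmax classifier and the average cross-entropy loss. The sketch assumes that the vector matrix E has already been reduced to a single feature vector x for the paragraph (the claim does not state how), and the values of W, b and x are made up; the helper names linear, softmax and average_cross_entropy are illustrative.

```python
import math
from typing import List


def linear(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """Compute W x + b, giving one logit per event category."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(logits: List[float]) -> List[float]:
    """Turn k logits into k probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_cross_entropy(probs_batch: List[List[float]], gold: List[int]) -> float:
    """Average of -log p(true category) over a batch of paragraphs."""
    losses = [-math.log(p[y]) for p, y in zip(probs_batch, gold)]
    return sum(losses) / len(losses)


# k = 3 event categories, 2-dimensional paragraph feature (assumed values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [0.2, 1.5]

probs = softmax(linear(W, b, x))
predicted = probs.index(max(probs))  # category with the maximum probability
loss = average_cross_entropy([probs], [1])
print(predicted, round(loss, 4))
```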
9. An event extraction system for financial statement notes, comprising: a document acquisition unit, a chapter structure identification unit, an event argument identification unit, an event category classification unit and an event table filling unit;
the document acquisition unit is used for acquiring a PDF document of a financial report from a database file, converting the PDF document into a TXT document through data preprocessing, and matching the TXT document of the financial statement notes text using regular expressions in a knowledge base;
the chapter structure identification unit is used for identifying and labeling, based on the knowledge base, the titles, title levels and paragraphs of the TXT document of the financial statement notes text to obtain a title set and a paragraph set, representing the titles in the title set and their levels as vectors based on a Doc2vec model, and inputting them into the event category classification unit;
the event argument identification unit is used for performing sentence segmentation and word segmentation on the paragraph set to obtain a word segmentation list, learning the semantics within the paragraphs with a Transformer encoder, inputting the vector matrix of the output layer of the Transformer encoder into a CRF model, and identifying and labeling the event arguments of the financial events in the financial statement notes to obtain vector representations of the event arguments;
the event category classification unit is used for splicing the vector representations of the words contained in the event arguments with the vector representations of the title and its hierarchy into a vector matrix, inputting the spliced vector matrix into a Transformer encoder to obtain a vector matrix that fuses the title, the title hierarchy and the paragraph semantics, inputting this vector matrix into an event category classifier to obtain the probability of every event category for the current vector matrix, selecting the event category with the highest probability as the event category of the current vector matrix, looking up by index the event roles of the currently triggered event category in the predefined event information, and outputting the event roles in the predefined order to obtain vector representations of the event roles;
the event table filling unit is used for constructing a memory vector for recording the event argument filling process, splicing the vector representation of the event arguments, the vector representation of the event role and the memory vector and inputting the result into a Transformer encoder, connecting the output layer of the Transformer encoder to a linear binary classifier to obtain the probability that each event argument fills the current event role, selecting the event arguments whose probability equals a set value and filling them into the current event role of the event table, and iterating until all event roles of the current event category are filled, updating the memory vector of the filled event arguments and the vectorized representation of the event role, to obtain all event records contained in the current paragraph.
10. A computer-readable storage medium storing a program which when executed by a processor implements a method of event extraction for financial statement notes as claimed in any one of claims 1 to 8.
CN202211680822.XA 2022-12-27 2022-12-27 Event extraction method, system and storage medium for financial statement notes Pending CN116150361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211680822.XA CN116150361A (en) 2022-12-27 2022-12-27 Event extraction method, system and storage medium for financial statement notes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211680822.XA CN116150361A (en) 2022-12-27 2022-12-27 Event extraction method, system and storage medium for financial statement notes

Publications (1)

Publication Number Publication Date
CN116150361A true CN116150361A (en) 2023-05-23

Family

ID=86351799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211680822.XA Pending CN116150361A (en) 2022-12-27 2022-12-27 Event extraction method, system and storage medium for financial statement notes

Country Status (1)

Country Link
CN (1) CN116150361A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114013A (en) * 2023-10-12 2023-11-24 北京大学深圳研究生院 Semantic annotation method and device based on small sample
CN117114013B (en) * 2023-10-12 2024-02-02 北京大学深圳研究生院 Semantic annotation method and device based on small sample
CN117422061A (en) * 2023-12-19 2024-01-19 中南大学 Method and device for merging and labeling multiple segmentation results of text terms
CN117422061B (en) * 2023-12-19 2024-03-08 中南大学 Method and device for merging and labeling multiple segmentation results of text terms
CN117521606A (en) * 2024-01-04 2024-02-06 长春职业技术学院 Intelligent report generation system and method for financial data
CN117521606B (en) * 2024-01-04 2024-03-19 长春职业技术学院 Intelligent report generation system and method for financial data

Similar Documents

Publication Publication Date Title
WO2021147726A1 (en) Information extraction method and apparatus, electronic device and storage medium
CN111694924B (en) Event extraction method and system
CN109800437B (en) Named entity recognition method based on feature fusion
CN110263325B (en) Chinese word segmentation system
CN116150361A (en) Event extraction method, system and storage medium for financial statement notes
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
WO2023092960A1 (en) Labeling method and apparatus for named entity recognition in legal document
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
CN110188340B (en) Automatic recognition method for text noun
Venkataramana et al. Abstractive text summarization using bart
Jiang et al. Multilingual interoperation in cross-country industry 4.0 system for one belt and one road
CN115017144A (en) Method for identifying judicial writing case element entity based on graph neural network
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
CN115796280B (en) Efficient and controllable entity identification entity linking system applicable to financial field
Liu IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images
CN115455964B (en) Low-resource optimization method for machine translation in vertical field
US11868313B1 (en) Apparatus and method for generating an article
CN118070812B (en) Industry data analysis method based on NLP
CN114996407B (en) Remote supervision relation extraction method and system based on packet reconstruction
CN118095294B (en) Project document generation method and device based on artificial intelligence and electronic equipment
CN116976351B (en) Language model construction method based on subject entity and subject entity recognition device
Kumar et al. User identification in online social networks using graph transformer networks
CN117113986A (en) Intelligent muzzle key information extraction system based on deep learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination