CN114398853A

CN114398853A - Extraction method and device of document numerical index based on machine learning

Info

Publication number: CN114398853A
Application number: CN202111487097.XA
Authority: CN
Inventors: 赖文波; 柯学; 张汉林; 林康; 谭则涛
Original assignee: Gf Securities Co ltd
Current assignee: Gf Securities Co ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-04-26

Abstract

The invention discloses a method and a device for extracting document numerical indexes based on machine learning, wherein the method comprises the following steps: dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules; adopting natural language processing and deep learning to construct an index extraction model, respectively inputting the key paragraphs and the non-key paragraphs into the index extraction model, and respectively outputting key paragraph indexes and non-key paragraph indexes by the index extraction model; and integrating the indexes of the key paragraphs and the indexes of the non-key paragraphs, inputting a preset index feature scoring model to screen the indexes of the key paragraphs and the indexes of the non-key paragraphs, and outputting an index extraction result. The invention improves the efficiency and the accuracy of index extraction.

Description

Extraction method and device of document numerical index based on machine learning

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for extracting a document numerical index based on machine learning.

Background

With the advent of the data age, the financial industry must process exponentially growing volumes of data. Besides structured data that can be used directly, this includes data that cannot, such as the huge number of PDF and Word documents, which are voluminous and unstructured and whose key information is buried in lengthy text. Business departments have a strong demand for intelligent processing of such unstructured data, but because it is difficult to process, the work is still done mainly by hand, and efficiency is low.

Unstructured data cannot be applied directly to data analysis; it becomes usable only after structured extraction, and that extraction is difficult because expressions vary widely, forms are diverse, and document structures are complex. The prior art falls into two types. In the first, a person searches the document by keyword in a document reading tool, locates a series of hit positions, reads the text at each position, and judges whether it contains the index to be extracted. If it does, the numerical value and unit in the text are extracted and normalized; if not, the remaining hit positions are checked, and the search may even be repeated with several other keywords. In the second, all text paragraphs are extracted from the document, regular-expression or keyword matching is applied, and the recalled information is then processed manually.

Disclosure of Invention

The invention aims to provide a method and a device for extracting a document numerical index based on machine learning, which aim to solve the problem of low extraction efficiency of the document numerical index in the prior art.

In order to achieve the above object, the present invention provides a method for extracting document numerical indicators based on machine learning, including:

dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules;

adopting natural language processing and deep learning to construct an index extraction model, respectively inputting the key paragraphs and the non-key paragraphs into the index extraction model, and respectively outputting key paragraph indexes and non-key paragraph indexes by the index extraction model;

and integrating the indexes of the key paragraphs and the indexes of the non-key paragraphs, inputting a preset index feature scoring model to screen the indexes of the key paragraphs and the indexes of the non-key paragraphs, and outputting an index extraction result.

Preferably, the method for extracting a document numerical indicator based on machine learning further includes:

inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence degree, and sorting the results by the confidence degree, wherein each result comprises an index name, an index value, an index unit, an index confidence degree, the short sentence hitting the index, and the paragraph text.

Preferably, the dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules includes:

determining the format of a document to be input, converting the pdf document format into a word format, and generating an input document;

analyzing the input document to obtain a text and a table element set;

and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.

Preferably, the performing index integration on the key section indexes and the non-key section indexes, inputting a preset index feature scoring model to screen the key section indexes and the non-key section indexes, and outputting an index extraction result includes:

labeling the indexes in the key paragraph indexes and the non-key paragraph indexes, sorting and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model;

integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering the indexes lower than a first threshold value, and outputting the index extraction result.

The invention also provides a device for extracting the document numerical index based on machine learning, which comprises:

the dividing module is used for dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules;

the extraction module is used for constructing an index extraction model by adopting natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model respectively, and outputting key paragraph indexes and non-key paragraph indexes by the index extraction model respectively;

and the screening module is used for integrating the indexes of the key paragraphs and the indexes of the non-key paragraphs, inputting a preset index feature scoring model to screen the indexes of the key paragraphs and the indexes of the non-key paragraphs, and outputting an index extraction result.

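Preferably, the device further comprises a calculation module, wherein the calculation module is used for:

inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence degree, and sorting the results by the confidence degree, wherein each result comprises an index name, an index value, an index unit, an index confidence degree, the short sentence hitting the index, and the paragraph text.

Preferably, the extraction module is further configured to:

determining the format of a document to be input, converting the pdf document format into a word format, and generating an input document;

analyzing the input document to obtain a text and a table element set.

Preferably, the screening module is further configured to:

labeling the indexes in the key paragraph indexes and the non-key paragraph indexes, sorting and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model.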

The present invention also provides a terminal device, including:

one or more processors;

a memory coupled to the processor for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting a document numerical index based on machine learning as described in any of the above.

The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for extracting a document numerical indicator based on machine learning according to any one of the above items.

Compared with the prior art, the invention has the beneficial effects that:

dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules; adopting natural language processing and deep learning to construct an index extraction model, respectively inputting the key paragraphs and the non-key paragraphs into the index extraction model, and respectively outputting key paragraph indexes and non-key paragraph indexes by the index extraction model; and integrating the indexes of the key paragraphs and the indexes of the non-key paragraphs, inputting a preset index feature scoring model to screen the indexes of the key paragraphs and the indexes of the non-key paragraphs, and outputting an index extraction result. According to the invention, through a layered extraction method with separated key paragraphs, the focusing capacity of extraction is greatly improved, the workload of traditional index extraction is reduced, and the accuracy of traditional index extraction is effectively improved by building an index extraction model based on natural language processing and deep learning.

Further, historical sequence data are input into a preset prediction model to obtain a prediction result, the mean of the prediction result and the screened index result is calculated as a confidence degree, and the results are sorted by the confidence degree, wherein each result comprises an index name, an index value, an index unit, an index confidence degree, the short sentence hitting the index, and the paragraph text. By building the preset prediction model and combining its prediction result with the screened index result, erroneous indexes are further screened out and the stability of the index extraction effect is improved.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for extracting document numerical indicators based on machine learning according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating key section extraction according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process for index extraction according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating an example of an index extraction method according to another embodiment of the present invention;

FIG. 5 is a flowchart illustrating a method for extracting document numerical indicators based on machine learning according to another embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for extracting a document numerical indicator based on machine learning according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.

It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to fig. 1, an embodiment of the present invention provides a method for extracting a document numerical indicator based on machine learning. As shown in fig. 1, the extraction method of the document numerical index based on machine learning includes steps S10 to S30. The method comprises the following steps:

S10: Dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules.

Referring to fig. 2, specifically, the key paragraphs are extracted from the document: from a document with a large number of pages, a small number of key paragraphs are extracted and saved as a new document, which facilitates manual processing in subsequent stages.

The format of the document to be input is determined, and a pdf document is converted into word format to generate the input document. The input document is parsed to obtain a set of text and table elements. The key paragraphs are then located in that set by rule matching and output.

The document to be processed may be a word document or a pdf document. A pdf document is first converted into word format for subsequent processing; the conversion uses ilovepdf (an existing tool). After conversion, elements such as text, tables and charts are parsed with the docx package, and the key paragraphs are then located. Locating is done by rule matching, where a rule has the form: paragraph-feature regular expression + keyword, for example {any numeric heading} credit service. Here {any numeric heading} is the paragraph-feature regular expression designed for this purpose, for example: r '^ is [ (\[ [ d ] \\ d \ three, five, six, seven, eight ninety, chapter no ] {1,3} [) ] \\\ \ s \ t ] {0,4}', and the keyword is a key-paragraph keyword provided by the business side. The document is thereby divided into key paragraphs and non-key paragraphs.
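The rule form described above (paragraph-feature regular expression + keyword) can be sketched in Python as follows. This is only a minimal illustration and not the pattern used by the invention: the `HEADING_PREFIX` pattern, the function names `build_rule` and `split_sections`, and the assumption that the document has already been split into (heading, body) sections are all hypothetical.

```python
import re

# Assumed heading-prefix pattern (hypothetical, for illustration only):
# an optional opening bracket or "第", one to three Arabic or Chinese numerals,
# an optional closing bracket or "章"/"节", then up to four separator characters.
HEADING_PREFIX = r"^[(（\[【第]?[\d一二三四五六七八九十]{1,3}[)）\]】章节]?[、.．\s]{0,4}"


def build_rule(keyword):
    """Combine the heading-prefix pattern with one business keyword."""
    return re.compile(HEADING_PREFIX + re.escape(keyword))


def split_sections(sections, keywords):
    """Split (heading, body) pairs into key and non-key lists.

    A section is treated as key when its heading matches any rule.
    """
    rules = [build_rule(k) for k in keywords]
    key, non_key = [], []
    for heading, body in sections:
        target = key if any(r.search(heading) for r in rules) else non_key
        target.append((heading, body))
    return key, non_key


sections = [
    ("三、信用业务", "..."),
    ("4. Other business", "..."),
    ("(2) Credit business overview", "..."),
]
key, non_key = split_sections(sections, ["信用业务", "Credit business"])
print([heading for heading, _ in key])
```

Escaping the keyword with `re.escape` keeps business-supplied keywords from being interpreted as pattern syntax, which seems a reasonable precaution when keywords are maintained by non-developers.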

S20: Constructing an index extraction model by adopting natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model respectively, and outputting key paragraph indexes and non-key paragraph indexes respectively.

Referring to fig. 3 and 4, specifically, the index extraction model is constructed through natural language processing and deep learning. An intelligent method for constructing regular expressions is created so that business users can build and maintain index regular expressions more conveniently, with the expressions generated automatically. Fig. 4 shows an example of the generation method from the business input items (an index subject A and an index dimension U), where:

N: Neighbour, any characters up to a certain length; the length is configurable.

NUM: random_num, matches any number, which may contain the common characters that appear in numbers.

'|': the "or" character.

Arrow: represents a string of characters.
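As a rough sketch of how such a regular expression might be generated from the business input items, the Python below joins the index subject, a bounded gap (N), a number (NUM) and the index dimension. Fig. 4 is not reproduced here, so the ordering subject, N, NUM, dimension, the `NUM` pattern itself, the default gap length and the name `build_index_regex` are all assumptions made for illustration.

```python
import re

# Assumed numeric pattern: optional sign, digits with optional thousands
# separators, optional decimal part. The source does not give the exact form.
NUM = r"-?\d[\d,]*(?:\.\d+)?"


def build_index_regex(subjects, units, max_gap=10):
    """Build one pattern from alternative subject names and unit names.

    subjects and units are joined with '|' (the "or" character); the gap
    between subject and number is any run of up to max_gap characters (N).
    """
    subject_alt = "|".join(re.escape(s) for s in subjects)
    unit_alt = "|".join(re.escape(u) for u in units)
    neighbour = rf".{{0,{max_gap}}}?"
    return re.compile(
        rf"(?P<name>{subject_alt}){neighbour}(?P<value>{NUM})\s*(?P<unit>{unit_alt})"
    )


pattern = build_index_regex(["net profit", "net income"], ["million yuan", "%"])
match = pattern.search("net profit reached 1,234.5 million yuan")
print(match.group("name"), match.group("value"), match.group("unit"))
```

Keeping the gap bounded is what stops the pattern from pairing a subject with a number that appears much later in the paragraph.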

The text set is matched item by item against the regular expressions constructed as shown in fig. 4, and for each hit the text and the numerical value are recorded, with the hit-type feature marked as regular-expression. For keyword recall in the key paragraphs and the non-key paragraphs, the text set is recalled by keywords, and multiple keywords are supported (combined by AND or OR). An index candidate set is then built by syntactic analysis in the key paragraphs and the non-key paragraphs: the syntactic structure of each sentence recalled by keywords is analyzed with DDParser, Baidu's open-source syntactic analysis package, and the SVO, ATT_N and ADV_V relations are extracted programmatically, where ATT_N denotes an attributive (modifier) relation, ADV_V denotes an adverbial relation, and SVO denotes a subject-verb-object relation.
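The keyword recall step with AND/OR combination can be illustrated with the short Python sketch below. The function name `recall_sentences` and the `mode` parameter are hypothetical, and plain substring matching is assumed; the syntactic analysis with DDParser is not shown.

```python
def recall_sentences(sentences, keywords, mode="or"):
    """Return the sentences that contain the keywords.

    mode="and" requires every keyword to appear in the sentence;
    mode="or" requires at least one keyword to appear.
    """
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    combine = all if mode == "and" else any
    return [s for s in sentences if combine(k in s for k in keywords)]


sentences = [
    "net profit rose to 5 million yuan",
    "total assets were stable",
    "net profit margin and total assets both improved",
]
print(recall_sentences(sentences, ["net profit", "total assets"], mode="and"))
```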

Evaluation and ranking of the index candidate set: the similarity between an index name X recalled by syntactic analysis and the index name I provided by the business side is calculated as follows: jieba.cut_for_search(I) & jieba.cut_for_search(X). That is, both index names are segmented with jieba's search-engine segmentation mode, the intersection of the two segmentation results is taken, and the proportion of the shared words among the words of I is calculated. The higher the proportion, the higher the similarity, and the more likely X is the target index to be found.
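The similarity calculation can be sketched as below. To keep the example self-contained, the tokenizer is passed in as a parameter and defaults to whitespace splitting, which is an assumption made here; with jieba installed, `jieba.cut_for_search` could be passed as `tokenize` instead. The function name `name_similarity` is hypothetical.

```python
def name_similarity(target, candidate, tokenize=str.split):
    """Share of the target name's tokens that also occur in the candidate.

    target is the index name I supplied by the business side and candidate
    is the recalled index name X. tokenize is any callable returning an
    iterable of tokens (whitespace splitting here, as a stand-in).
    """
    target_tokens = set(tokenize(target))
    if not target_tokens:
        return 0.0
    shared = target_tokens & set(tokenize(candidate))
    return len(shared) / len(target_tokens)


print(name_similarity("net profit margin", "net profit"))
```

Note that the measure is asymmetric: it is normalized by the tokens of I only, so a long candidate name that contains every token of I still scores 1.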

The TextCNN model scores sentences at semantic granularity according to their features: a TextCNN model is built that outputs a score between 0 and 1, where 0 means the sentence is semantically not a target index sentence and 1 means it is a target index sentence.

Finally, the index names, index values, hit types, evaluation scores and other features produced by the two components are integrated (features belonging to the same sentence are merged) and passed to the next step for processing.
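One possible way to merge the features of the same sentence from the two components is sketched below. The record layout (dictionaries with a `sentence` key), the field names and the function name `merge_candidates` are assumptions for illustration; the source does not specify a data structure.

```python
def merge_candidates(regex_hits, syntax_hits):
    """Merge per-sentence features from the two extraction components.

    Each hit is a dict with a "sentence" key plus arbitrary feature keys.
    Hits for the same sentence are combined into one record; the first
    value seen for a feature is kept, and hit_types lists the sources.
    """
    merged = {}
    for source, hits in (("regex", regex_hits), ("syntax", syntax_hits)):
        for hit in hits:
            sentence = hit["sentence"]
            entry = merged.setdefault(sentence, {"sentence": sentence, "hit_types": []})
            entry["hit_types"].append(source)
            for key, value in hit.items():
                if key != "sentence":
                    entry.setdefault(key, value)
    return list(merged.values())


regex_hits = [{"sentence": "s1", "name": "net profit", "value": 5.0}]
syntax_hits = [
    {"sentence": "s1", "similarity": 0.8},
    {"sentence": "s2", "name": "revenue", "similarity": 0.5},
]
print(merge_candidates(regex_hits, syntax_hits))
```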

S30: Integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model for screening, and outputting an index extraction result.

The indexes in the key paragraph indexes and the non-key paragraph indexes are labeled, and the index features are sorted, normalized and input into a logistic regression model for training to obtain the preset index feature scoring model. The key paragraph indexes and the non-key paragraph indexes are then integrated and input into the preset index feature scoring model, indexes scoring below a first threshold are filtered out, and the index extraction result is output.

Specifically, the key paragraph indexes and the non-key paragraph indexes are integrated according to the index names, index values, hit types, evaluation scores and other features obtained in step S20. The recalled index names, index values and features are labeled as a data set, with a correct index labeled 1 and a wrong index labeled 0; the features are sorted and normalized and then input into a logistic regression model for training, which yields the preset index feature scoring model. The integrated indexes are input into the preset index feature scoring model for screening: the model predicts a score for each newly recalled index and its features, and indexes scoring below the first threshold of 0.5 are filtered out. At least the index with the highest prediction score is always retained; if its score is below 0.5, it is flagged so that business users can conveniently correct it manually. The index extraction result is then output.
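The screening rule (filter below the first threshold, but always keep the best-scoring index and flag it when it falls below the threshold) might look like the following in Python. Training is not shown: `logistic_score` simply applies already-learned weights, and the weights, function names and record layout are hypothetical.

```python
import math


def logistic_score(features, weights, bias):
    """Score one candidate with a logistic regression model (sigmoid of a linear sum)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def screen_candidates(scored, threshold=0.5):
    """Keep candidates scoring at or above the threshold.

    scored is a list of (candidate, score) pairs. If nothing reaches the
    threshold, the single best candidate is kept and flagged for review.
    """
    if not scored:
        return []
    kept = [
        {"candidate": c, "score": s, "needs_review": False}
        for c, s in scored
        if s >= threshold
    ]
    if not kept:
        best, best_score = max(scored, key=lambda pair: pair[1])
        kept = [{"candidate": best, "score": best_score, "needs_review": True}]
    return kept


print(screen_candidates([("net profit = 5", 0.9), ("net profit = 7", 0.3)]))
print(screen_candidates([("revenue = 1", 0.2), ("revenue = 2", 0.4)]))
```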

Making use of unstructured data has always been a difficult problem in the field of artificial intelligence. Current data analysis technology mainly targets structured data, whereas unstructured data varies widely in expression, takes diverse forms and has complex structure, so existing techniques use it with low efficiency and low accuracy. The invention aims to solve the problem of intelligently extracting indexes from unstructured data in the financial field.

Through the layered extraction method that separates key paragraphs, the invention combines expert experience with model capability and greatly improves the focus of extraction; compared with the large number of pages in the original document, the extracted key paragraphs greatly reduce the workload of business personnel. By constructing an index extraction model based on natural language processing and deep learning, the accuracy of traditional index extraction is effectively improved and extraction becomes more comprehensive and effective, while invalid recalls are reduced, which lowers the workload of business personnel and improves working efficiency.

Referring to fig. 5, in an embodiment, after the step S30 outputs the index extraction result, the method further includes the following steps:

S40: Inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence degree, and sorting the results by the confidence degree, wherein each result comprises an index name, an index value, an index unit, an index confidence degree, the short sentence hitting the index, and the paragraph text.

An ARIMA model is adopted: the historical sequence of each index is input to train the model, which predicts the next value P of the sequence (P is extremely unlikely to be zero, but for logical consistency it is taken as 0.01 if it is zero). P is then compared with the index value R returned by the index extraction model using the formula 1 - (R - P)^2 / P^2, and a result less than 0 is set to 0.
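The scoring formula above can be written directly in Python. In this sketch the predicted value P is simply passed in as a number; producing it with an ARIMA model is outside the example, and the function name `history_consistency` is hypothetical.

```python
def history_consistency(extracted, predicted):
    """Score how well an extracted value R agrees with the predicted value P.

    Implements 1 - (R - P)^2 / P^2, with P replaced by 0.01 when it is zero
    and negative results clamped to 0.
    """
    if predicted == 0:
        predicted = 0.01
    score = 1 - (extracted - predicted) ** 2 / predicted ** 2
    return max(score, 0.0)


print(history_consistency(110, 100))  # 10% above the prediction
print(history_consistency(300, 100))  # far from the prediction, clamped to 0
```

Because the deviation is squared and divided by P squared, an extracted value that differs from the prediction by 100% or more of P scores 0, while small relative deviations are penalized only lightly.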

Finally, the mean of the results of the index feature scoring model and the index historical sequence prediction model is used as the confidence degree for sorting, and the index extraction results are displayed to business users, showing the index name, index value, index unit, index confidence degree, the short sentence hitting the index and the paragraph text, so that they can be browsed quickly.
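A minimal sketch of the final ranking step follows, assuming each result is a dictionary that already carries the two model scores under the hypothetical keys `feature_score` and `history_score`; the other field names are likewise illustrative.

```python
def rank_results(results):
    """Attach a confidence (mean of the two model scores) and sort descending."""
    ranked = [
        {**r, "confidence": (r["feature_score"] + r["history_score"]) / 2}
        for r in results
    ]
    return sorted(ranked, key=lambda r: r["confidence"], reverse=True)


results = [
    {"name": "net profit", "value": 5.0, "unit": "million yuan",
     "feature_score": 0.6, "history_score": 0.4},
    {"name": "net profit", "value": 5.2, "unit": "million yuan",
     "feature_score": 0.9, "history_score": 0.7},
]
for row in rank_results(results):
    print(row["value"], row["confidence"])
```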

By further establishing a sequence prediction model based on the ARIMA model, the method as a whole gains the ability to judge whether an extracted value is plausible, filters out erroneous indexes, and improves the stability of the index extraction effect. The ARIMA-based index historical sequence prediction model allows the extracted value to be checked against a predicted value, which can improve the accuracy of index extraction.

Referring to fig. 6, another embodiment of the present invention provides an apparatus for extracting document numerical indicators based on machine learning, including:

the dividing module 11 is configured to divide the document to be processed into key paragraphs and non-key paragraphs according to preset rules.

The extraction module 12 is configured to construct an index extraction model by using natural language processing and deep learning, input the key paragraphs and the non-key paragraphs into the index extraction model, and output key paragraph indexes and non-key paragraph indexes by the index extraction model.

And the screening module 13 is configured to perform index integration on the key paragraph indexes and the non-key paragraph indexes, input a preset index feature scoring model to screen the key paragraph indexes and the non-key paragraph indexes, and output an index extraction result.

Preferably, the device further comprises a calculation module, wherein the calculation module is used for: inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence degree, and sorting the results by the confidence degree, wherein each result comprises an index name, an index value, an index unit, an index confidence degree, the short sentence hitting the index, and the paragraph text.

Preferably, the extraction module is further configured to: determine the format of a document to be input, convert the pdf document format into a word format, and generate an input document; and analyze the input document to obtain a text and a table element set.

Preferably, the screening module is further configured to: label the indexes in the key paragraph indexes and the non-key paragraph indexes, sort and normalize the index features, and input them into a logistic regression model for training to obtain the preset index feature scoring model.

For specific limitations of the extraction device for the document numerical indicator based on machine learning, reference may be made to the above limitations on the extraction method for the document numerical indicator based on machine learning, and details thereof are not repeated here. The modules in the extraction device of the document numerical indicators based on machine learning can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

Referring to fig. 7, an embodiment of the present invention provides a terminal device, including:

one or more processors;

a memory coupled to the processor for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting a document numerical index based on machine learning as described above.

The processor is used for controlling the overall operation of the terminal device so as to complete all or part of the steps of the extraction method of the document numerical index based on the machine learning. The memory is used to store various types of data to support operation at the terminal device, and these data may include, for example, instructions for any application or method operating on the terminal device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.

In an exemplary embodiment, the terminal device may be implemented by one or more Application Specific Integrated Circuits (ASICs), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to perform the method for extracting a document numerical index based on machine learning according to any one of the embodiments described above, so as to achieve the technical effects consistent with the method described above.

In another exemplary embodiment, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the steps of the method for extracting a document numerical index based on machine learning according to any one of the above embodiments. For example, the computer-readable storage medium may be the above-mentioned memory storing the computer program, and the computer program may be executed by a processor of a terminal device to implement the method for extracting a document numerical index based on machine learning according to any one of the above embodiments, and achieve the technical effects consistent with the above method.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A method for extracting document numerical indicators based on machine learning is characterized by comprising the following steps:

2. The extraction method of the document numerical index based on machine learning according to claim 1, further comprising:

3. The extraction method of document numerical indicators based on machine learning according to claim 2, wherein the dividing the document to be processed into the emphasized paragraphs and the non-emphasized paragraphs according to the preset rules comprises:

analyzing the input document to obtain a text and a table element set;

4. The method for extracting document numerical indicators based on machine learning according to claim 3, wherein the method for performing indicator integration on the key paragraph indicators and the non-key paragraph indicators, inputting a preset indicator feature scoring model to screen the key paragraph indicators and the non-key paragraph indicators, and outputting an indicator extraction result comprises:

5. An extraction device of a document numerical index based on machine learning, comprising:

6. The extraction apparatus of document numerical indicators based on machine learning according to claim 5, further comprising a calculation module, said calculation module is configured to:

7. The device for extracting document numerical indicators based on machine learning according to claim 6, wherein the extracting module is further configured to:

analyzing the input document to obtain a text and a table element set;

8. The extraction apparatus of document numerical indicators based on machine learning according to claim 7, wherein the filtering module is further configured to:

9. A terminal device, comprising:

one or more processors;

a memory coupled to the processor for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting a document numerical index based on machine learning according to any one of claims 1-4.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for extracting a document numerical indicator based on machine learning according to any one of claims 1 to 4.