CN113553852B

CN113553852B - Contract information extraction method, system and storage medium based on neural network

Info

Publication number: CN113553852B
Application number: CN202111016995.7A
Authority: CN
Inventors: 柴茂森
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2023-06-20
Anticipated expiration: 2041-08-31
Also published as: CN113553852A

Abstract

A neural-network-based contract information extraction method comprises the following steps: extracting original text data from the contract text and preprocessing it to obtain preprocessed text data; performing named entity recognition on the preprocessed text data with an NER entity recognition model to obtain the named entities in the preprocessed text data; processing the preprocessed text data with a part-of-speech recognition model to obtain the part of speech of every segmented word in the preprocessed text data; and completing each named entity according to the parts of speech of the segmented words before and after the position of that named entity in the original text data. The invention addresses the problem that a named entity recognition model cannot, under certain conditions, effectively recognize relatively long named entities; it improves the accuracy of named entity recognition in contract information extraction and the credibility of the named entity model.

Description

Contract information extraction method, system and storage medium based on neural network

Technical Field

The invention belongs to the field of computers, and particularly relates to a contract information extraction method, a contract information extraction system and a storage medium based on a neural network.

Background

With the continuous development of office automation, electronic offices are gradually replacing paper-based ones, which effectively reduces the use of paper documents in the workplace. Advances in artificial intelligence have also improved algorithms and computing speed. Now that contract information is available electronically, customers increasingly prefer to extract contract content by means of artificial intelligence, which reduces the workload of manual contract information processing and improves office efficiency.

At present, traditional contract content extraction relies on one of two approaches, manual extraction or rule-based extraction, each with an obvious disadvantage: 1. manual extraction is costly to maintain and inefficient; 2. rule matching generalizes poorly, is limited in scope, and requires frequent maintenance, because the rules must be modified whenever a different contract template is to be recognized, and some contracts run to hundreds of pages, so recognition efficiency is low. Contract information extraction therefore often produces unsatisfactory results and struggles to meet customers' expectations.

Therefore, a new and practical contract information extraction method is needed to overcome these defects of the prior art and genuinely reduce the workload of manually extracting contract information.

Disclosure of Invention

To remedy these shortcomings of the current state of the industry, the invention provides a contract information extraction method based on a neural network and rule matching. After a single round of training, the method can be applied directly to contracts that do not follow a universal template, automatically classifies the extracted content into the information fields of each party, and is well suited to cloud deployment with a high level of security.

In order to achieve the above object, an aspect of the present invention provides a method for extracting contract information based on a neural network, including:

extracting original text data from the contract text, and preprocessing the original text data to obtain preprocessed text data;

carrying out named entity recognition on the preprocessed text data through an NER entity recognition model to obtain named entities in the preprocessed text data;

identifying the preprocessed text data through a part-of-speech identification model to obtain parts of speech of all segmentation words in the preprocessed text data;

and completing the named entity in the preprocessed text data according to word segmentation parts of speech before and after the position of the named entity in the preprocessed text data in the original text data.

In some embodiments of the present invention, extracting original text data from the contract text and preprocessing the original text data to obtain preprocessed text data includes:

deleting the spaces and punctuation marks in the original text data.

In some embodiments of the invention, the NER entity recognition model is based on an LSTM neural network and is trained using the Keras and TensorFlow frameworks.

In some embodiments of the present invention, identifying the preprocessed text data by the part-of-speech recognition model to obtain part-of-speech of all the segmented words in the preprocessed text data includes:

and performing part-of-speech tagging on the segmented words while performing word segmentation on the preprocessed text data by using a jieba word segmentation tool.

In some embodiments of the present invention, identifying the preprocessed text data by the part-of-speech recognition model to obtain part-of-speech of all the segmented words in the preprocessed text data further includes:

intercepting continuous text paragraphs formed by a preset number of segmentation words before and after the named entity in the preprocessed text data according to the named entity in the preprocessed text data;

manually checking part-of-speech tagging results of the word segments in the continuous text paragraphs, and establishing a manual part-of-speech table for the word segments with incorrect part-of-speech tagging;

in response to the preprocessed text being segmented with the jieba word segmentation tool, adding the manual part-of-speech table to the part-of-speech table of the jieba word segmentation tool;

and intercepting continuous text paragraphs formed by a preset number of segmentation words before and after the named entity in the original text data according to the named entity in the preprocessed text data.

In some embodiments of the present invention, complementing the named entity in the pre-processed text data according to the part of speech of the word before and after the location of the named entity in the pre-processed text data in the original text data comprises:

judging whether the part of speech of the word segmentation before and after the position of the named entity in the original text data is the same as the part of speech of the named entity;

and, in response to the parts of speech of the segmented words before and after the position of the named entity in the original text data being the same as the part of speech of the named entity, combining those segmented words and the named entity into one named entity.

In some embodiments of the present invention, complementing the named entity in the pre-processed text data according to the part of speech of the word before and after the location of the named entity in the pre-processed text data in the original text data further comprises:

and, in response to a punctuation mark being present between the named entity and the segmented word before or after its position in the original text data, prohibiting that segmented word from being combined with the named entity into one named entity.

Another aspect of the present invention also provides a system for extracting contract information based on a neural network, including:

the text processing module is configured to extract original text data from the contract text and preprocess the original text data to obtain preprocessed text data;

the text recognition module is configured to perform named entity recognition on the preprocessed text data through the NER entity recognition model to obtain named entities in the preprocessed text data;

the text part-of-speech tagging module is configured to identify the preprocessed text data through a part-of-speech identification model to obtain part of speech of all the segmented words in the preprocessed text data;

and the text entity completion module is configured to complete the named entities in the preprocessed text data according to word segmentation parts of speech before and after the named entities in the preprocessed text data are positioned in the original text data.

Still another aspect of the present invention provides a computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of the above embodiments.

The neural-network-based contract information extraction method provided by the invention compares the part of speech of each named entity identified by the entity recognition model with the parts of speech of the segmented words before and after it in the preprocessed text data or the original text data, and merges those words into the named entity if the parts of speech are the same. This solves the problem that a named entity recognition model cannot, under certain conditions, effectively recognize very long named entities, improves the accuracy of named entity recognition in contract information extraction, and improves the credibility of the named entity model.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for extracting contract information based on a neural network according to an embodiment of the present invention;

Fig. 2 is a system structure diagram of a contract information extraction system based on a neural network provided by the invention;

Fig. 3 is a schematic structural diagram of a computer storage medium for extracting contract information based on a neural network according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, in a first aspect of the embodiment of the present invention, a method for extracting contract information based on a neural network is provided, including:

step S01, extracting original text data from contract text, and preprocessing the original text data to obtain preprocessed text data;

step S02, carrying out named entity recognition on the preprocessed text data through an NER entity recognition model to obtain named entities in the preprocessed text data;

step S03, recognizing the preprocessed text data through a part-of-speech recognition model to obtain parts of speech of all segmentation words in the preprocessed text data;

and step S04, completing the named entity in the preprocessed text data according to the word segmentation parts of speech before and after the position of the named entity in the preprocessed text data in the original text data.
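The data flow of steps S01 to S04 can be sketched as a small pipeline object. This is only an illustration of how the four stages hand data to one another: the class name `Pipeline`, its field names, and the stub stages in the usage lines are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (word, part-of-speech tag) pairs, in text order
TaggedTokens = List[Tuple[str, str]]


@dataclass
class Pipeline:
    """Wires steps S01-S04 together; every stage is a pluggable callable."""
    preprocess: Callable[[str], str]                    # S01: clean the raw text
    recognize: Callable[[str], List[str]]               # S02: NER model
    tag: Callable[[str], TaggedTokens]                  # S03: segmentation + POS
    complete: Callable[[str, str, TaggedTokens], str]   # S04: entity completion

    def run(self, raw_text: str) -> List[str]:
        clean = self.preprocess(raw_text)       # S01
        entities = self.recognize(clean)        # S02
        tagged = self.tag(clean)                # S03
        # S04: completion inspects the entity's neighbours in the original text
        return [self.complete(e, raw_text, tagged) for e in entities]


# Usage with trivial stand-in stages (not real models)
demo = Pipeline(
    preprocess=lambda s: s.replace(" ", ""),
    recognize=lambda s: ["B"] if "B" in s else [],
    tag=lambda s: [(ch, "x") for ch in s],
    complete=lambda entity, raw, tagged: entity,
)
result = demo.run("A B C")
```

Keeping each stage behind a callable makes it possible to swap the NER model or the tagger without touching the orchestration.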

In this embodiment, the neural-network-based contract information extraction method is deployed on a web server. A user uploads a contract file carrying contract information to the server through a client or a browser, and the server stores the file in binary form. When the user authenticates and calls the service with the API key of the corresponding tenant, the server extracts the contract information and performs named entity recognition according to the steps of the method.

In step S01, the content of the contract file is read with a suitable file-reading tool and then preprocessed. For a PDF contract file, for example, the file is read with the PDFminer tool for Python and its content is extracted as original text data. For a contract file of the Docx type, the text content is extracted with the Docx tool for Python and stored as original text data. After extraction, the original text data is preprocessed by removing useless characters, and the result is stored as preprocessed text data.
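A minimal sketch of the file-reading part of step S01 follows. It assumes that "PDFminer" refers to the `pdfminer.six` package and "Docx" to the `python-docx` package; the function names and the extension-based dispatch table are illustrative choices, not details given in the patent.

```python
from pathlib import Path


def read_pdf(path: str) -> str:
    # Assumption: pdfminer.six, whose high-level API offers extract_text()
    from pdfminer.high_level import extract_text
    return extract_text(path)


def read_docx(path: str) -> str:
    # Assumption: python-docx; only body paragraphs are read here,
    # so text inside tables or headers is not included
    from docx import Document
    return "\n".join(p.text for p in Document(path).paragraphs)


READERS = {".pdf": read_pdf, ".docx": read_docx}


def extract_raw_text(path: str) -> str:
    """Dispatch on the file extension and return the original text data."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported contract file type: {suffix!r}") from None
    return reader(path)
```

The third-party imports are placed inside the reader functions so that a deployment handling only one file type does not need both packages installed.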

In step S02, entity recognition is performed on the preprocessed text data, the trained NER entity recognition model is invoked, the preprocessed text data is input into the NER entity recognition model, and the output named entity of the entity recognition model is saved.

In step S03, word segmentation operation is performed on the preprocessed text data through the part-of-speech recognition model, and the part of speech of the segmented words is labeled, so as to obtain a word segmentation table and the part of speech of each segmented word in the word segmentation table.

In step S04, for each named entity obtained in step S02, the position of the named entity is located in the preprocessed text data, and the parts of speech of the segmented words before and after that position are obtained. The named entity is then merged with the preceding and following segmented words depending on whether their parts of speech are the same.

In this embodiment, to prevent invalid symbols or repeated meaningless characters from affecting the NER entity recognition model, such characters are deleted from the original text data. Specifically, preprocessing the original text data includes deleting the spaces and punctuation marks in it.
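The deletion of spaces and punctuation can be sketched as follows. The patent does not say how positions in the preprocessed text are related back to the original text, so the offset list returned here is an assumption added for illustration; it records, for every kept character, its index in the original text.

```python
import unicodedata
from typing import List, Tuple


def is_removable(ch: str) -> bool:
    """True for whitespace and for any Unicode punctuation character.

    Symbols such as '$' or '+' belong to Unicode category S and are kept.
    """
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def preprocess(raw: str) -> Tuple[str, List[int]]:
    """Drop spaces and punctuation; also return the original index of
    every kept character so an entity can be located in the raw text."""
    kept, offsets = [], []
    for i, ch in enumerate(raw):
        if not is_removable(ch):
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


clean, offsets = preprocess("甲方: 张三, 乙方: 李四。")
# clean   -> "甲方张三乙方李四"
# offsets -> [0, 1, 4, 5, 8, 9, 12, 13]
```

With the offset list, a span `[s, e)` found in `clean` corresponds to the range `offsets[s]` to `offsets[e - 1]` in the original text, which is where the punctuation needed for completion is still present.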

In this embodiment, the NER entity recognition model uses an LSTM neural network, is built with the Keras and TensorFlow tools, and is trained on the People's Daily Chinese corpus (about 70,000 sentences, about 2.5 million words). It is used to extract common entities in a contract, such as addresses, personal names, and place names.
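One possible shape for such a model is sketched below. The patent names only LSTM, Keras, and TensorFlow; the layer sizes, the character-level input, the BIO tagging scheme, and the function names `build_ner_model` and `decode_bio` are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_ner_model(vocab_size: int, num_tags: int,
                    embed_dim: int = 128, units: int = 128):
    """Character-level LSTM tagger; the hyperparameters are placeholders."""
    import tensorflow as tf  # imported lazily; decode_bio below needs no TensorFlow

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        tf.keras.layers.LSTM(units, return_sequences=True),  # one output per character
        tf.keras.layers.Dense(num_tags, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model


def decode_bio(chars: str, tags: List[str]) -> List[Tuple[str, str, int]]:
    """Turn per-character BIO tags into (entity_text, entity_type, start_index)."""
    entities, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # the sentinel flushes a trailing entity
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            entities.append((chars[start:i], kind, start))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return entities


spans = decode_bio("张三在北京", ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"])
# spans -> [("张三", "PER", 0), ("北京", "LOC", 3)]
```

The decoder returns the start index of each entity so that later steps can find the entity's neighbouring words.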

In this embodiment, the jieba word segmentation tool is used to segment the preprocessed text data and to generate the part-of-speech tag of each segmented word.
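Segmentation and tagging in a single pass can be done with jieba's `posseg` module, as in the sketch below. The `Token` record and the `with_offsets` helper are hypothetical additions; they attach character offsets so that tokens can be aligned with entity positions.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str      # jieba part-of-speech flag, e.g. "ns" for a place name
    start: int    # character offset of the token in the segmented text
    end: int      # exclusive end offset


def tag_with_jieba(text: str) -> List[Tuple[str, str]]:
    """Segment and POS-tag in one pass using jieba.posseg."""
    import jieba.posseg as pseg   # third-party; imported lazily
    return [(pair.word, pair.flag) for pair in pseg.cut(text)]


def with_offsets(pairs: Iterable[Tuple[str, str]]) -> List[Token]:
    """Attach character offsets, relying on the tokens covering the text in order."""
    tokens, cursor = [], 0
    for word, pos in pairs:
        tokens.append(Token(word, pos, cursor, cursor + len(word)))
        cursor += len(word)
    return tokens


# Hand-written pairs stand in for tagger output in this example
example = with_offsets([("北京市", "ns"), ("朝阳区", "ns")])
```

In practice the two helpers would be chained as `with_offsets(tag_with_jieba(text))`.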

In this embodiment, the NER entity recognition model may fail to recognize certain entities. For example, some personal names are unusual in that some of their characters are verbs, and some addresses likewise contain verbs or other kinds of words. Neither the NER entity recognition model nor the jieba word segmentation tool handles such words effectively, and recognition accuracy drops. To address the inaccurate part-of-speech judgments that the NER entity recognition model and the jieba word segmentation tool make for these specific words, the results of both are corrected.

Specifically, for each named entity output by the NER model, a continuous text passage consisting of at least 5 segmented words before and after the named entity is intercepted from the preprocessed text data, and the parts of speech of the words in that passage are looked up in the word segmentation table. The parts of speech and the true entities in the passage are then judged manually against the original text of the contract file, and the results are stored in a manual part-of-speech table. For example, consider a place name of the form "Fish-strike Village, a certain district, a certain city". During word segmentation, "fish-strike" may be tagged as a verb denoting an action, so "Fish-strike Village" may not be recognized as a single place name, and recognition becomes inaccurate. It is therefore necessary to identify such special words manually and to compile the results into a dedicated part-of-speech table, which is supplied to the NER entity recognition model and the jieba word segmentation model for training to improve their accuracy.
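Feeding a manual part-of-speech table into jieba can be sketched with `jieba.add_word`, which accepts a `tag` argument. The helper name `register_manual_pos`, the injectable `add_word` parameter, and the sample entry "打鱼村" (an illustrative guess at a village name of the kind described above) are assumptions, not details from the patent.

```python
from typing import Callable, Dict, Optional


def register_manual_pos(
    manual_table: Dict[str, str],
    add_word: Optional[Callable[..., None]] = None,
) -> int:
    """Push manually verified (word -> POS tag) pairs into the segmenter.

    By default this calls jieba.add_word(word, tag=...), after which jieba
    normally keeps the word in one piece and reports the given tag for it.
    """
    if add_word is None:
        import jieba              # third-party; imported lazily
        add_word = jieba.add_word
    for word, tag in manual_table.items():
        add_word(word, tag=tag)
    return len(manual_table)


# Demonstration with a recorder in place of jieba
recorded = []
count = register_manual_pos(
    {"打鱼村": "ns"},             # hypothetical entry: tag the whole word as a place name
    add_word=lambda word, tag=None: recorded.append((word, tag)),
)
```

Passing the registration function as a parameter keeps the helper usable with other segmenters and makes it easy to check without loading jieba's dictionary.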

In this embodiment, when the context of a named entity is determined, the continuous text passage is formed from at least 5 segmented words in the original text data. The positions of the paragraph marks and punctuation marks in the original text data allow the boundary of the named entity to be determined accurately. This prevents a recognition error in which, once punctuation has been removed, the end of one sentence and the beginning of the next happen to join into what looks like a single sentence or name.
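A context window of this kind might be computed as below. The sketch assumes the original text has already been segmented into (word, flag) pairs in which punctuation carries the flag "x", and that a paragraph mark is the newline character; the function name `context_window` and the choice to cut the window at the nearest break are illustrative.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (word, POS flag)
PUNCT_FLAG = "x"          # assumption: flag carried by punctuation tokens


def is_break(tok: Token) -> bool:
    """A punctuation token or a token containing a newline ends the context."""
    return tok[1] == PUNCT_FLAG or "\n" in tok[0]


def context_window(tokens: List[Token], first: int, last: int,
                   size: int = 5) -> Tuple[List[Token], List[Token]]:
    """Return (before, after): up to `size` tokens on each side of the entity
    occupying tokens[first..last], cut at the nearest break on each side."""
    before: List[Token] = []
    for tok in reversed(tokens[max(0, first - size):first]):
        if is_break(tok):
            break
        before.insert(0, tok)
    after: List[Token] = []
    for tok in tokens[last + 1:last + 1 + size]:
        if is_break(tok):
            break
        after.append(tok)
    return before, after


tokens = [("地址", "n"), (":", "x"), ("北京市", "ns"), ("朝阳区", "ns"),
          ("大屯路", "ns"), ("。", "x"), ("电话", "n")]
before, after = context_window(tokens, 3, 3)
# before -> [("北京市", "ns")]   after -> [("大屯路", "ns")]
```

Because the window is taken from the original text, the colon and the full stop are still present and keep "地址" and "电话" out of the entity's context.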

In this embodiment, if the segmented words before and after the named entity in the original text data have the same part of speech as the named entity identified by the NER entity recognition model, those words and the named entity are merged into a single named entity. For example, in an address, the segmented words before and after the named entity "Chaoyang District" are both tagged as geographical nouns, so "Beijing City, Chaoyang District, Datun Road" is treated as one complete named entity.

In this embodiment, if a punctuation mark or a paragraph segmentation mark "/n" lies between the named entity and the segmented word before or after it, the named entity is not merged with that word.
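The merging rule and the punctuation rule can be combined in one routine, sketched here under several assumptions: tokens are (word, flag) pairs from the original text, punctuation carries the flag "x", the paragraph mark is treated as the newline character, the entity's part of speech is taken from its first token, and expansion continues over more than one neighbour while the part of speech still matches. The Chinese address is a back-translation of the example above.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (word, POS flag) from the original text
STOP_FLAGS = {"x"}        # assumption: punctuation tokens carry flag "x"


def is_stop(tok: Token) -> bool:
    return tok[1] in STOP_FLAGS or "\n" in tok[0]


def complete_entity(tokens: List[Token], first: int, last: int) -> str:
    """Grow the entity tokens[first..last] over adjacent tokens that share its
    part of speech, never crossing punctuation or a paragraph break."""
    pos = tokens[first][1]
    lo, hi = first, last
    while lo > 0 and not is_stop(tokens[lo - 1]) and tokens[lo - 1][1] == pos:
        lo -= 1
    while (hi + 1 < len(tokens) and not is_stop(tokens[hi + 1])
           and tokens[hi + 1][1] == pos):
        hi += 1
    return "".join(word for word, _ in tokens[lo:hi + 1])


tokens = [("地址", "n"), (":", "x"), ("北京市", "ns"), ("朝阳区", "ns"),
          ("大屯路", "ns"), ("。", "x"), ("上海市", "ns")]
full = complete_entity(tokens, 3, 3)
# full -> "北京市朝阳区大屯路"
```

The place name after the full stop has the same part of speech but is not merged, because the punctuation mark stops the expansion.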

As shown in fig. 2, another aspect of the present invention further provides a system for extracting contract information based on a neural network, including:

the text processing module 1 is configured to extract original text data from contract text, and preprocess the original text data to obtain preprocessed text data;

the text recognition module 2 is configured to perform named entity recognition on the preprocessed text data through the NER entity recognition model to obtain named entities in the preprocessed text data;

the text part-of-speech tagging module 3 is configured to identify the preprocessed text data through a part-of-speech identification model to obtain part of speech of all the segmented words in the preprocessed text data;

and the text entity completion module 4 is configured to complete the named entities in the preprocessed text data according to word segmentation parts of speech before and after the named entities in the preprocessed text data are located in the original text data.

As shown in fig. 3, a further aspect of the present invention further provides a computer readable storage medium 401 storing a computer program 402, the computer program 402 implementing the steps of the method according to any of the above embodiments when being executed by a processor.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims

1. A method for extracting contract information based on a neural network, comprising:

the method for complementing the named entities in the preprocessed text data according to the word segmentation parts of speech before and after the positions of the named entities in the preprocessed text data in the original text data comprises the following steps:

2. The method of claim 1, wherein extracting the original text data from the contract text and preprocessing the original text data to obtain the preprocessed text data comprises:

deleting the spaces and punctuation marks in the original text data.

3. The method of claim 1, wherein the NER entity recognition model is based on an LSTM neural network and is trained using the Keras and TensorFlow frameworks.

4. The method of claim 1, wherein the identifying the pre-processed text data by the part-of-speech recognition model to obtain the part of speech of all the tokens in the pre-processed text data comprises:

5. The method of claim 4, wherein the identifying the pre-processed text data by the part-of-speech recognition model to obtain part-of-speech of all the tokens in the pre-processed text data further comprises:

6. The method of claim 5, wherein the identifying the pre-processed text data by the part-of-speech recognition model to obtain part-of-speech of all the tokens in the pre-processed text data further comprises:

7. The method of claim 1, wherein the complementing named entities in the pre-processed text data according to word parts of speech before and after the location of the named entities in the pre-processed text data in the original text data further comprises:

8. A neural network-based contract information extraction system, comprising:

the text entity completion module is configured to complete the named entities in the preprocessed text data according to word segmentation parts of speech before and after the named entities in the preprocessed text data are located in the original text data;

the text entity completion module is further configured to: judging whether the part of speech of the word segmentation before and after the position of the named entity in the original text data is the same as the part of speech of the named entity;

9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1-7.