CN114818718A - Contract text recognition method and device - Google Patents


Info

Publication number
CN114818718A
Authority
CN
China
Prior art keywords
text
entity
type
contract
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210582893.XA
Other languages
Chinese (zh)
Inventor
弓源
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co ltd, Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Priority to CN202210582893.XA priority Critical patent/CN114818718A/en
Publication of CN114818718A publication Critical patent/CN114818718A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content

Abstract

The application provides a contract text recognition method and a device, wherein the contract text recognition method comprises the following steps: extracting target text contents meeting preset conditions in the contract text by acquiring the contract text, wherein the preset conditions are set based on the specified type of feature information; performing type identification on the target text content to obtain the text type of the target text content; and under the condition that the text type of the target text content is a specified type, extracting entity information in the target text content, and determining the identification result of the contract text. By the method, the data processing amount of type identification can be greatly reduced, the efficiency of type identification is improved, and the precision of contract text identification is improved.

Description

Contract text recognition method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a contract text recognition method. The application also relates to a contract text recognition device, a computing device and a computer readable storage medium.
Background
Artificial Intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key technologies in the field of artificial intelligence include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, and virtual reality/augmented reality.
With the continuous development of artificial intelligence technology, the artificial intelligence technology has been deeply applied in the field of natural language processing, and particularly, the artificial intelligence technology is introduced aiming at the contract text, so that the automatic identification of the type of the contract text can be realized, and the efficiency of an enterprise can be greatly improved. In the traditional artificial intelligence method, the contract text is directly input into a pre-trained type recognition model based on deep learning, and the type of the contract text can be obtained.
However, contract texts are long and come in many types, and a deep-learning-based type recognition model is limited by its training samples, so recognition errors are hard to avoid when training samples are limited. Accordingly, a more efficient and accurate contract text recognition scheme is needed.
Disclosure of Invention
In view of this, the embodiment of the present application provides a method for recognizing a contract text, so as to solve technical defects in the prior art. The embodiment of the application also provides a contract text recognition device, a computing device and a computer readable storage medium.
According to a first aspect of the embodiments of the present application, there is provided a method for recognizing a contract text, including:
acquiring a contract text, and extracting target text contents which accord with preset conditions in the contract text, wherein the preset conditions are set based on the specified type of feature information;
performing type identification on the target text content to obtain the text type of the target text content;
and under the condition that the text type is a specified type, extracting entity information in the target text content, and determining the identification result of the contract text.
According to a second aspect of the embodiments of the present application, there is provided a contract text recognition apparatus including:
the screening module is configured to acquire a contract text and extract target text contents meeting preset conditions in the contract text, wherein the preset conditions are set based on the specified type of feature information;
the identification module is configured to identify the type of the target text content to obtain the text type of the target text content;
and the extraction module is configured to extract entity information in the target text content and determine the identification result of the contract text under the condition that the text type is the specified type.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor implements the contract text recognition method provided by the first aspect of the embodiment of the application when executing the computer-executable instructions.
According to a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the contract text recognition method provided by the first aspect of the embodiments of the present application.
According to a fifth aspect of the embodiments of the present application, there is provided a chip, which stores a computer program, and when the computer program is executed by the chip, the method for recognizing a contract text provided by the first aspect of the embodiments of the present application is implemented.
According to the contract text recognition method provided by the present application, a contract text is acquired and target text content meeting preset conditions is extracted from it, wherein the preset conditions are set based on feature information of the specified type; type identification is performed on the target text content to obtain its text type; and, when the text type is the specified type, entity information is extracted from the target text content and the recognition result of the contract text is determined. Because only the target text content meeting the preset conditions undergoes type identification, the amount of data processed during type identification is greatly reduced and its efficiency is improved. In addition, screening combined with type identification gives a preliminary identification of the type of the contract text, and the entity information extracted from target text content of the specified type is then used to determine the recognition result, which improves the precision of contract text recognition.
Drawings
FIG. 1 is a schematic structural diagram illustrating a contract text recognition system according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for identifying a contract text according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for identifying a contract text to extract target text content according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating training of a text classification model in a method for contract text recognition according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating entity recognition model training in a method for contract text recognition according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for identifying a contract text according to an embodiment of the present application;
FIG. 7 is a process flow diagram illustrating a method for contract text recognition applied to an account-type contract according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram illustrating a contract text recognition apparatus according to an embodiment of the present application;
FIG. 9 is a block diagram illustrating a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms involved in one or more embodiments of the present application are explained.
Information extraction: refers to techniques for extracting structured information from structured, semi-structured, or unstructured text.
BERT (Bidirectional Encoder Representations from Transformers): an open-source pre-trained language model.
Named Entity Recognition (NER): a technique for identifying entities with specific meanings in text, mainly including person names, place names, organization names, and proper nouns.
Text classification: assigning a text to one or several categories within a given classification system.
Entity: a word or phrase in the text that describes something with a specific meaning.
Account type: a text sentence that reflects the payee and payer role information of a contract.
In the present application, a contract text recognition method is provided. The present application also relates to a contract text recognition apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 1 shows a schematic structural diagram of a contract text recognition system provided according to an embodiment of the present application.
Taking the execution subject as the server 102 as an example, the terminal 104 uploads a contract text, the server 102 receives the contract text through the communication unit 102-1, and the obtaining unit 102-2 extracts target text content meeting preset conditions from the contract text, wherein the preset conditions are set based on the specified type of feature information; then the type of the target text content is identified by the identification unit 102-3 to obtain the text type of the target text content; and then, in the case that the text type of the target text content is the specified type, extracting entity information in the target text content by the extracting unit 102-4, and determining the identification result of the contract text. The recognition result may then be fed back to the terminal 104 by the communication unit 102-1.
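The flow through the obtaining, identification, and extraction units can be sketched as a single function. This is a minimal illustration only: the names `recognize_contract` and `RecognitionResult`, the callable parameters, and the toy stand-ins in the usage lines are assumptions introduced here and are not defined by the application.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecognitionResult:
    text: str                                   # target text content that passed screening
    text_type: str                              # type assigned by the classifier
    entities: List[str] = field(default_factory=list)


def recognize_contract(
    contract_text: str,
    split: Callable[[str], List[str]],
    meets_condition: Callable[[str], bool],
    classify: Callable[[str], str],
    extract_entities: Callable[[str], List[str]],
    specified_type: str = "account",
) -> List[RecognitionResult]:
    """Screen, classify, then extract entities (hypothetical interface)."""
    results = []
    for piece in split(contract_text):
        if not meets_condition(piece):          # screening: skip text that fails the preset condition
            continue
        text_type = classify(piece)             # type identification on the screened text only
        if text_type != specified_type:
            continue
        results.append(RecognitionResult(piece, text_type, extract_entities(piece)))
    return results


# Toy stand-ins for each stage, for illustration only.
results = recognize_contract(
    "Party A and Party B meet at the bank. Party A lends 500 yuan to Party B.",
    split=lambda t: [s.strip() for s in t.split(".") if s.strip()],
    meets_condition=lambda s: "lends" in s,
    classify=lambda s: "account",
    extract_entities=lambda s: [w for w in ("Party A", "Party B") if w in s],
)
```

Passing each stage in as a callable keeps the sketch independent of any particular screening rule, classifier, or entity extractor.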
Fig. 2 shows a flowchart of a method for recognizing a contract text according to an embodiment of the present application, which specifically includes steps 202 to 206, which will be described in detail below.
Step 202, acquiring a contract text, and extracting target text content meeting preset conditions in the contract text, wherein the preset conditions are set based on the specified type of feature information.
The execution subject of the present application can be any electronic device with a contract text recognition function, for example a smartphone, a smart watch, a desktop computer, a portable computer, or a server.
In the present embodiment, a contract text is first obtained, where the contract text refers to a written document used by a party to describe the contract content. In practical applications, besides a text format, the contract text may be in Portable Document Format (PDF), a picture format, and so on.
Correspondingly, the way the text content is acquired depends on the format of the contract text. For example, when the contract text is in a text format, its text content can be extracted directly by character recognition. When the contract text is in PDF or picture format, its text content may be parsed using Optical Character Recognition (OCR): the text regions are first detected, each text region is then segmented into rectangles, and the characters are classified, so that the text content of the contract text is recognized.
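The format-dependent acquisition can be sketched as a small dispatcher. The function name `load_contract_text`, the suffix sets, and the injected `run_ocr` callable are assumptions made for illustration; no particular OCR library is implied.

```python
from pathlib import Path
from typing import Callable

TEXT_SUFFIXES = {".txt"}                          # assumed text formats
OCR_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed formats that need OCR


def load_contract_text(path: str, run_ocr: Callable[[str], str]) -> str:
    """Return the text content of a contract file, choosing the method by format."""
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        # Text format: read the characters directly.
        return Path(path).read_text(encoding="utf-8")
    if suffix in OCR_SUFFIXES:
        # PDF or picture format: delegate to an OCR routine supplied by the caller.
        return run_ocr(path)
    raise ValueError(f"unsupported contract format: {suffix!r}")
```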
After the contract text is acquired, target text content meeting the preset condition needs to be extracted from it. The preset condition is set based on feature information of the specified type, and the specified type is a specific text type chosen according to the actual requirements of contract text recognition. The text type is the type of a piece of text content in the contract text; it is generally determined by the feature information in that text content, and the feature information characterizes what is unique to the specified type and may specifically be words. In practical applications, the feature information may be words in the text content that characterize the role relationship between the two parties to the contract, such as "pay", "buy", "borrow", "rent", and "transfer". When the specified type is the account type, the text content mainly expresses the transaction behavior relationship between the two parties, so the preset condition is set based on feature information of that relationship, for example words such as "payment" and "transfer". For different contract text recognition requirements, the preset condition can be set according to the corresponding feature information, so that it fits the corresponding specified type.
As in the above example, the words that represent the role relationship between the two parties in the text content of a contract text are generally behavior relationship words. In this embodiment, a preset behavior keyword library may be established in advance; it contains a plurality of behavior keywords of the specified type, where a behavior keyword is a word that represents a behavior relationship feature. Text content that matches the preset behavior keyword library is then extracted from the contract text as the target text content. The preset behavior keyword library may take the form of a table, in which each cell records one behavior keyword, or the form of a database, which is not specifically limited here.
Alternatively, a plurality of behavior text contents of the specified type can be created in advance. A behavior text content represents behavior characteristics, specifically describes how a behavior occurs, and can be a sentence written manually based on experience. Semantic similarity matching is then performed between each text content in the contract text and the behavior text contents created in advance, and any text content in the contract text whose semantic similarity reaches a preset threshold is taken as target text content. Specifically, semantic similarity matching may be based on a BERT model: each text content in the contract text and each behavior text content are input into a pre-trained BERT model, which has a semantic recognition function and outputs the semantic information of each of them. The semantic information is generally in vector form, so the similarity is calculated as the degree to which the elements of the two vectors coincide, that is, between the semantic information of a text content and the semantic information of a behavior text content; the higher the degree of coincidence, the higher the similarity. In this way the semantic similarity between each text content and the behavior text contents is obtained.
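The threshold-based selection over semantic vectors can be sketched as follows. The application does not fix a similarity formula beyond the degree of element coincidence, so cosine similarity is used here as a stand-in; the three-dimensional vectors are made-up placeholders rather than real BERT outputs, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_by_similarity(
    candidates: Dict[str, Sequence[float]],
    references: List[Sequence[float]],
    threshold: float,
) -> List[str]:
    """Keep each text whose vector is close enough to at least one reference vector."""
    return [
        text
        for text, vec in candidates.items()
        if any(cosine_similarity(vec, ref) >= threshold for ref in references)
    ]


# Placeholder vectors standing in for the semantic information of each text.
candidates = {
    "Party A transfers 500 yuan to Party B": [0.9, 0.1, 0.0],
    "Party A and Party B go to the bank": [0.1, 0.9, 0.2],
}
behavior_vectors = [[1.0, 0.0, 0.0]]   # vector of one pre-written behavior sentence
selected = select_by_similarity(candidates, behavior_vectors, threshold=0.8)
```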
In a possible implementation manner of the embodiment of the present application, the step of extracting the target text content meeting the preset condition in the contract text may be specifically implemented in the manner of fig. 3. Fig. 3 shows a flowchart for extracting target text content in a contract text recognition method according to an embodiment of the present application, which specifically includes the following steps:
Step 302, dividing the contract text to obtain each text content in the contract text.
Step 304, for any text content, matching words in the text content with a preset behavior keyword library, and if the matching result meets a preset matching condition, determining the text content as the target text content, wherein behavior keywords of the specified type are recorded in the preset behavior keyword library.
In this embodiment, after acquiring the contract text, the text content of the contract text is divided to obtain each text content in the contract text.
There are various ways to divide the contract text; it can be divided based on punctuation marks or on word count. Generally, a complete sentence can express complete contract content, so the division is usually performed at punctuation marks or characters that mark the end of a complete sentence, such as periods, exclamation marks, and line breaks; that is, the two pieces of text separated by such a mark are divided into different text contents. However, in some cases a complete sentence contains multiple clauses. Each clause may involve both parties to the contract, yet some clauses contain no behavior relationship between them. For example, in the sentence "Party A and Party B go to the bank together, Party A transfers 500,000 yuan to Party B through a bank account at the bank", the clause "Party A and Party B go to the bank together" contains no behavior relationship. Therefore, to improve the accuracy and effectiveness of subsequent type identification, in a preferred implementation the text may be divided into clauses according to punctuation marks or special characters. That is, when a punctuation mark such as ",", ".", "!", or "?" or a special character such as a line break appears in the contract text, it is treated as the end of a text content and the text is divided there, so that the two pieces of text separated by it become different text contents. For example, using "," as the dividing mark, the sentence "Party A and Party B go to the bank together, Party A transfers 500,000 yuan to Party B through a bank account at the bank." is divided into the two clauses "Party A and Party B go to the bank together" and "Party A transfers 500,000 yuan to Party B through a bank account at the bank". Using "." as the dividing mark, the same sentence is kept as a single text content.
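Punctuation-based division can be sketched with a regular expression. The delimiter sets and the name `split_text` are assumptions chosen for illustration; the amount is written without a thousands separator so that the comma delimiter does not split the number.

```python
import re
from typing import List

SENTENCE_END = "。！？.!?\n"             # marks that end a complete sentence
CLAUSE_END = SENTENCE_END + "，,；;"     # additionally split at clause-level marks


def split_text(text: str, delimiters: str = CLAUSE_END) -> List[str]:
    """Split text at any run of delimiter characters and drop empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


sentence = (
    "Party A and Party B go to the bank together, "
    "Party A transfers 500000 yuan to Party B through a bank account at the bank."
)
clauses = split_text(sentence)                    # clause-level division
whole = split_text(sentence, SENTENCE_END)        # sentence-level division
```

A plain character-class split also breaks decimals and numbers that contain separators, so a practical version would likely need extra rules for digits.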
The text content of the contract text may also be divided by using any neural network model, such as a Natural Language Processing (NLP) model, a Long Short-Term Memory (LSTM) network model, or a Convolutional Neural Network (CNN) model. To divide text with a neural network model, a large amount of sample text content is first obtained as training samples and then labeled, that is, the beginning and end of each sentence in the sample text content are marked with labels. The labeled sample text content is then input into the neural network model for iterative training, and training is complete when, after multiple iterations, the loss value of the model reaches a preset threshold. The contract text is then input into the trained neural network model to divide its text content.
After the contract text is divided into text contents, word segmentation is performed on each text content to obtain the words in it. There are various word segmentation methods, such as neural-network-based segmentation, Chinese word segmentation tools, and part-of-speech-based segmentation. In neural-network-based segmentation, a neural network model is trained in advance using a preset dictionary library, and a text content is then input into the trained model to obtain the segmentation result. Alternatively, a keyword table can be established in advance and the text content segmented by keyword matching. The word segmentation method is not limited here, as long as the words representing the role relationship between the two parties to the contract can be segmented out as words.
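Keyword-table segmentation can be sketched with forward maximum matching, a common dictionary-based technique used here as one possible reading of "keyword matching"; the application does not name a specific algorithm. The function name, the digit-grouping rule, and the small dictionary are assumptions.

```python
from typing import Iterable, List


def forward_max_match(text: str, dictionary: Iterable[str]) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isdigit():
            # Keep a run of digits together as one token (assumed convenience rule).
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Try the longest candidate first and shrink until a dictionary word matches;
        # a single character is always accepted so the loop makes progress.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


# "甲方借给乙方500元" means "Party A lends Party B 500 yuan".
tokens = forward_max_match("甲方借给乙方500元", {"甲方", "乙方", "借给", "元"})
```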
After obtaining each word in any text content, matching each word with a preset behavior keyword library, and if the matching result meets a preset matching condition, determining the text content as the target text content, wherein behavior keywords in a specified type are recorded in the preset behavior keyword library.
Contract texts commonly contain text content of the account type, but it is very uncommon for an account to be described in a single sentence such as "Party A lends 5,000 yuan to Party B". Contract content is usually expressed in a more complicated way: an earlier section may establish a borrowing relationship between the two parties, while later payment terms may not mention specific amounts, and may even describe a sharing arrangement without stating a specific amount. These situations make it harder to extract information from a contract and to interpret its meaning.
In a possible implementation of the embodiment of the present application, the specified type may be the account type, in which case behavior keywords of the account type are recorded in the preset behavior keyword library, for example keywords such as "borrow", "lend", "pay", and "repay". Suppose the text content is "Party A and Party B agree that Party A lends 500 yuan to Party B by bank transfer, and Party B is required to repay it in full within 10 working days." The words in this text content include "Party A", "lend", "Party B", "500", and "yuan". Each word of the text content is matched against the keywords in the preset behavior keyword library, and the matching result shows that the word "lend" matches a keyword in the library. The text content "Party A and Party B agree that Party A lends 500 yuan to Party B by bank transfer, and Party B is required to repay it in full within 10 working days." is therefore determined to be the target text content.
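The simplest matching condition, where one matched keyword is enough, can be sketched as below. The keyword set and the function names are illustrative assumptions, and the word lists are taken to be the output of an earlier word segmentation step.

```python
from typing import FrozenSet, List

# Assumed account-type behavior keywords, mirroring the example in the text.
ACCOUNT_KEYWORDS: FrozenSet[str] = frozenset({"borrow", "lend", "pay", "repay"})


def matched_keywords(words: List[str], library: FrozenSet[str] = ACCOUNT_KEYWORDS) -> List[str]:
    """Return the words of a text content that appear in the keyword library."""
    return [w for w in words if w in library]


def is_target_any(words: List[str], library: FrozenSet[str] = ACCOUNT_KEYWORDS) -> bool:
    """A text content is a target if at least one of its words is a behavior keyword."""
    return bool(matched_keywords(words, library))


loan_words = ["Party A", "lend", "Party B", "500", "yuan"]
visit_words = ["Party A", "Party B", "go", "bank"]
```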
In another possible implementation of the embodiment of the present application, a matching threshold may be preset, and when the number of words in the text content that successfully match keywords in the preset behavior keyword library reaches the preset matching threshold, the text content is determined to be the target text content.
For example, the preset matching threshold is 2; the words in the text content "Party A lends 500 yuan to Party B" are "Party A", "lend", "Party B", "500", and "yuan"; and the behavior keywords of the account type may be keywords such as "lend", "pay", and "yuan". Each word of the text content is matched against the keywords in the preset behavior keyword library, and the text content is found to share "lend" and "yuan" with the library. Since the preset matching threshold is reached, the text content "Party A lends 500 yuan to Party B" in the contract text is determined to be the target text content.
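The count-based condition can be sketched as follows, assuming the text content has already been segmented into words. The function names are hypothetical, and counting distinct matched words (rather than every occurrence) is a design choice made here, not something the application specifies.

```python
from typing import Iterable, List


def match_count(words: List[str], library: Iterable[str]) -> int:
    """Number of distinct words that also appear in the keyword library."""
    return len(set(words) & set(library))


def is_target_by_count(words: List[str], library: Iterable[str], threshold: int = 2) -> bool:
    """A text content is a target once enough distinct keywords are matched."""
    return match_count(words, library) >= threshold


library = {"lend", "pay", "yuan"}
loan_words = ["Party A", "lend", "Party B", "500", "yuan"]
```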
In another possible implementation manner of the embodiment of the present application, the preset matching condition may be that a term proportion of the text content that matches the preset behavior keyword library reaches a preset proportion threshold, that is, when a term proportion of the text content that matches the preset behavior keyword library successfully reaches the preset proportion threshold, the text content may be determined as the target text content. And the target text content is screened by using a proportion mode, so that the screening result is more accurate.
For example, the preset proportion threshold is 50%; the words in the text content "Party A lends 500 yuan to Party B" are "Party A", "lend", "Party B", and "500 yuan"; and the behavior keywords of the account type may be keywords such as "lend", "payment", and "yuan". Each word of the text content is matched against the keywords in the preset behavior keyword library, and the text content is found to share "lend" and "yuan" with the library, that is, two words match. The text content has four words in total, so the proportion of matched words is 50%. Since the preset proportion threshold is reached, the text content "Party A lends 500 yuan to Party B" in the contract text is determined to be the target text content.
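The proportion-based condition can be sketched in the same way. The function names and the word lists are illustrative assumptions; the result depends on how the text content was segmented, so the two lists below use a hypothetical segmentation rather than the one in the example above.

```python
from typing import Iterable, List


def match_ratio(words: List[str], library: Iterable[str]) -> float:
    """Share of the words in a text content that appear in the keyword library."""
    if not words:
        return 0.0
    keywords = set(library)
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)


def is_target_by_ratio(words: List[str], library: Iterable[str], threshold: float = 0.5) -> bool:
    """A text content is a target once the matched share reaches the threshold."""
    return match_ratio(words, library) >= threshold


library = {"lend", "pay", "yuan"}
four_words = ["Party A", "pay", "Party B", "yuan"]             # 2 of 4 words match
five_words = ["Party A", "lend", "Party B", "500", "yuan"]     # 2 of 5 words match
```

With a finer segmentation the same two matches fall below a 50% threshold, which shows how sensitive this condition is to the word segmentation step.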
In the above embodiment, the contract text is divided to obtain each text content in it; then, for any text content, the words in the text content are matched against the preset behavior keyword library, and if the matching result meets the preset matching condition, the text content is determined to be the target text content. In this way, the target text content meeting the preset condition can be screened out of the contract text and type identification performed only on it, which greatly reduces the computation required for text classification and improves the efficiency of classifying the contract text.
Step 204, performing type identification on the target text content to obtain the text type of the target text content.
After the target text content meeting the preset conditions is extracted from the contract text, type identification is performed on it to obtain its text type. The specific type identification method is not limited here: a neural-network-based scheme or a mapping-table-based scheme may be adopted. A neural-network-based type identification scheme is described next.
In a possible implementation manner of the embodiment of the present application, step 204 may be specifically implemented by the following steps:
and inputting the target text content into a text classification model to obtain the text type of the target text content, wherein the text classification model is obtained by pre-training based on a sample text carrying a type label.
In this embodiment, the text classification model is used to classify the target text content, and the text classification model may be any one of neural network models such as an NLP model, an LSTM model, and a CNN model. By adopting the type identification scheme based on the neural network, the text classification model is obtained by pre-training based on the sample text carrying the type label and is an end-to-end model, so that the text classification efficiency and accuracy are higher.
In a possible implementation of the embodiment of the present application, the text classification model needs to be trained before the target text content is input into it to obtain the text type of the target text content. The specific training method is shown in fig. 4, which is a flowchart of training the text classification model in a contract text recognition method provided by an embodiment of the present application, and specifically includes the following steps:
Step 402, obtaining a first training set and a first sample text to be labeled, wherein the first training set comprises a plurality of sample texts carrying type labels.
Step 404, training the text classification model by using the first training set.
Step 406, inputting the first sample text to be labeled into the trained text classification model to obtain a first prediction probability that the first sample text to be labeled is of the specified type.
Step 408, labeling the first sample text to be labeled based on the first prediction probability to obtain the labeled first sample text.
Step 410, adding the labeled first sample text into the first training set to obtain an updated first training set, and continuing to train the text classification model by using the updated first training set.
The method utilizes a weak supervision mode to train the model, can update the training set of the text classification model, greatly expands the training set of the text classification model and further effectively improves the accuracy of the text classification model.
In this embodiment, the first training set includes a plurality of sample texts carrying type labels. For example, the sample text "A lends B 500 yuan" carries an "account type" label, the sample text "refuse to pay 500 yuan to A" carries a "non-account type" label, and so on. The first sample text to be labeled refers to a sample text that does not carry a type label; for example, it may be "A donates 500 yuan to B", "A collects 500 yuan from B", "A does not donate to B", and so on.
In practical application, the text classification model may be any one of neural network models such as an NLP model, an LSTM model, and a CNN model, and after the first training set and the first to-be-labeled sample text are obtained, the text classification model is trained by using the first training set. Specifically, a plurality of sample texts carrying type labels are input into a text classification model for iterative training, and after a plurality of iterations, a trained text classification model is obtained after the loss value of the text classification model reaches a preset threshold value.
Then, the first sample text to be labeled is input into the trained text classification model, which predicts whether the text is of the specified type and outputs a first prediction probability that the text is of the specified type.
For example, the first sample text to be labeled "A donates 500 yuan to B" is input into the trained text classification model, which predicts whether the text is of the account type and outputs a first prediction probability that it is of the account type.
After the first prediction probability that whether the first sample text to be labeled is of the specified type is obtained, the type label of the first sample text to be labeled is determined based on the first prediction probability, and the first sample text to be labeled is labeled to obtain the labeled first sample text. And then adding the labeled first sample text into the first training set to obtain an updated first training set, and continuously training the text classification model by using the updated training set.
For example, when the specified type is the account type, the first sample text to be labeled "A donates 500 yuan to B" is labeled as the account type to obtain the labeled first sample text. The labeled first sample text "A donates 500 yuan to B", which carries the account type label, is then added to the first training set to obtain an updated first training set, and training of the text classification model continues with the updated first training set.
In the method, a text classification model is trained through a first training set, then a first sample text to be labeled is input into the trained text classification model to obtain a first prediction probability of whether the first sample text to be labeled is of a specified type, then the first sample text to be labeled is labeled based on the first prediction probability to obtain a labeled first sample text, finally the labeled first sample text is added into the first training set to obtain an updated first training set, and the text classification model is continuously trained by using the updated first training set. By the method, the training set of the text classification model can be updated, the training set of the text classification model is greatly expanded, and the accuracy of the text classification model is effectively improved. Meanwhile, the first sample text to be labeled is labeled based on the first prediction probability, so that the workload of manually labeling the sample text is reduced, and the efficiency of text classification of the target text content is improved.
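The training flow of steps 402 to 410 can be sketched as a single self-labeling round. This is only a sketch under stated assumptions: `ToyClassifier` is a hypothetical word-count stand-in for the neural text classification model, `self_train` is a hypothetical helper name, and labeling by comparing the probability with a threshold anticipates the threshold-based implementation described in the application.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for the text classification model."""

    def fit(self, samples):
        # samples: iterable of (text, is_specified_type) pairs
        self.pos, self.neg = Counter(), Counter()
        for text, is_specified in samples:
            target = self.pos if is_specified else self.neg
            target.update(text.lower().split())
        return self

    def predict_proba(self, text):
        # Smoothed share of word evidence seen in specified-type samples.
        words = text.lower().split()
        p = 1 + sum(self.pos[w] for w in words)
        n = 1 + sum(self.neg[w] for w in words)
        return p / (p + n)


def self_train(model, train_set, unlabeled, threshold):
    model.fit(train_set)                             # step 404
    labeled = []
    for text in unlabeled:
        prob = model.predict_proba(text)             # step 406
        labeled.append((text, prob >= threshold))    # step 408
    updated = list(train_set) + labeled              # step 410
    model.fit(updated)                               # continue training
    return model, updated


train_set = [("A lends B 500 yuan", True), ("A refuses to pay B", False)]
unlabeled = ["A donates 500 yuan to B"]
model, updated = self_train(ToyClassifier(), train_set, unlabeled, threshold=0.5)
```

A real implementation would replace `ToyClassifier` with the trained neural model; only the loop structure is meant to carry over.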
In a possible implementation manner of the embodiment of the application, labeling a first to-be-labeled sample text based on a first prediction probability may specifically be implemented in the following manner:
under the condition that the first prediction probability reaches a first preset threshold value, marking a first sample text to be marked as a first sample text of a specified type;
and under the condition that the first prediction probability does not reach a first preset threshold value, marking the first sample text to be marked as the first sample text of the non-specified type.
In this embodiment, first, a first sample text to be labeled needs to be input into a trained text classification model to obtain a first prediction probability of whether the first sample text to be labeled is of a specified type, then, the first prediction probability of whether the first sample text to be labeled is of the specified type is compared with a first preset threshold, and the first sample text to be labeled is labeled as the first sample text of the specified type under the condition that the first prediction probability reaches the first preset threshold; and under the condition that the first prediction probability does not reach a first preset threshold value, marking the first sample text to be marked as the first sample text of the non-specified type.
For example, when the specified type is the account type and the first preset threshold is 80%, the first sample text to be labeled "A donates 500 yuan to B" is input into the trained text classification model. If the first prediction probability that this text is of the account type is 90%, the text is labeled as a first sample text of the account type, that is, the labeled first sample text.
For another example, the first sample text to be labeled "A does not donate to B" is input into the trained text classification model. If the first prediction probability that this text is of the account type is 30%, the text is labeled as a first sample text of the non-account type, that is, the labeled first sample text.
In the above embodiment, under the condition that the first prediction probability reaches the first preset threshold, the first sample text to be labeled is labeled as the first sample text of the specified type; and under the condition that the first prediction probability does not reach a first preset threshold value, marking the first sample text to be marked as the first sample text of the non-specified type. By the method, the first sample text to be labeled can be accurately labeled, so that the training set of the text classification model is expanded, and the accuracy of the text classification model is improved.
Step 206, in the case that the text type of the target text content is the specified type, extracting entity information in the target text content, and determining the recognition result of the contract text.
If the type of the target text content is identified as the specified type, entity information can be further extracted from the target text content, wherein the entity information comprises an entity word (for example, a specific party of the contract text) with a specific meaning in the target text content and an entity tag aiming at the word, and the entity tag can represent the role, the attribute and the like of the entity word, so that the identification result of the contract text can be more accurately determined based on the extracted entity information. The specific manner of performing entity extraction is not limited herein, and an entity extraction scheme based on a neural network may be adopted, or an entity extraction scheme based on a mapping table may be adopted. Next, an entity extraction scheme based on a neural network will be described.
In a possible implementation manner of the embodiment of the present application, step 206 may be specifically implemented by the following steps:
and inputting the target text content into an entity recognition model, and extracting entity information in the target text content, wherein the entity recognition model is obtained by pre-training a sample text which carries entity label information and belongs to a specified type.
In this embodiment, the target text content is first input into the text classification model, and whether the target text content is of the specified type is identified. And under the condition that the target text content is of the specified type, inputting the target text content into the entity recognition model, and further extracting entity information in the target text content. By adopting the entity identification scheme based on the neural network, the entity identification model is obtained by pre-training a sample text which carries entity label information and belongs to a specified type, and is an end-to-end model, so that the entity identification efficiency and the accuracy are higher. The entity recognition model may adopt a model structure of LSTM plus Conditional Random Fields (CRFs), a model structure of BiLSTM plus CRFs, or a model structure of BERT plus CRFs, which is not specifically limited herein.
For example, when the specified type is the account type, the target text content "A lends B 500 yuan" of the account type is input into the entity recognition model, and the extracted entity information of the target text content is "A: payer; B: payee", where "payer" is the entity label of the entity word "A" and "payee" is the entity label of the entity word "B".
In a possible implementation of the embodiment of the present application, the entity recognition model needs to be trained before the target text content is input into it to extract the entity information in the target text content. The specific training method is shown in fig. 5, which is a flowchart of training the entity recognition model in a contract text recognition method provided by an embodiment of the present application, and specifically includes the following steps:
Step 502, obtaining a second training set and a second sample text to be labeled, wherein the second training set comprises a plurality of sample texts which carry entity label information and belong to the specified type.
Step 504, training the entity recognition model by using the second training set.
Step 506, inputting the second sample text to be labeled into the trained entity recognition model to obtain each entity word in the second sample text to be labeled and a second prediction probability of the entity label information corresponding to each entity word.
Step 508, labeling each entity word of the second sample text to be labeled based on the second prediction probability to obtain a labeled second sample text.
Step 510, adding the labeled second sample text into the second training set to obtain an updated second training set, and continuing to train the entity recognition model by using the updated second training set.
The model training is carried out by using a weak supervision mode, the training set of the entity recognition model can be updated, the training set of the entity recognition model is greatly expanded, and the accuracy of the entity recognition model is effectively improved.
In this embodiment, the second training set includes a plurality of sample texts that carry entity label information and belong to the specified type. Taking the account type as the specified type, the sample text "A lends B 500 yuan" belongs to the account type and carries entity label information (that is, the entity "A" is labeled "payer" and the entity "B" is labeled "payee"); the sample text "B pays 500 yuan to A" belongs to the account type and carries entity label information (that is, the entity "A" is labeled "payee" and the entity "B" is labeled "payer").
In practical application, the entity recognition model may be any one of neural network models such as an NLP model, an LSTM model, and a CNN model, and after the second training set and the second sample text to be labeled are obtained, the entity recognition model is trained by using the second training set. Specifically, a plurality of sample texts carrying entity label information and belonging to a specified type are input into an entity recognition model for iterative training, and after repeated iteration, a trained entity recognition model is obtained after a loss value of the entity recognition model reaches a preset threshold value.
And then, inputting the second sample text to be labeled into the trained entity recognition model to obtain each entity word in the second sample text to be labeled and a second prediction probability of the entity label information corresponding to each entity word. And labeling each entity word of the second sample text to be labeled based on the second prediction probability to obtain a labeled second sample text. And then adding the labeled second sample text into a second training set to obtain an updated second training set, and continuing training the entity recognition model by using the updated training set.
For example, when the specified type is the account type, a second sample text to be labeled that describes a 500-yuan donation between A and B is input into the trained entity recognition model to obtain each entity word in the text and a second prediction probability of the corresponding entity label information; based on the second prediction probability, the entity word "A" is labeled as the payee and the entity word "B" is labeled as the payer. The labeled second sample text is then added to the second training set to obtain an updated second training set, and training of the entity recognition model continues with the updated second training set.
In this way, the training set of the entity recognition model can be updated and greatly expanded, which effectively improves the accuracy of the entity recognition model. Meanwhile, the second sample text to be labeled is labeled based on the second prediction probability, which reduces the workload of manually labeling sample texts and improves the efficiency of entity recognition on the target text content.
In a possible implementation manner of the embodiment of the present application, labeling a second sample text to be labeled based on a second prediction probability includes:
and under the condition that the second prediction probability reaches a second preset threshold value, determining the entity label information as target entity label information, and labeling the entity words corresponding to the target entity label information.
In this embodiment, first, the second to-be-labeled sample text needs to be input into the trained entity recognition model, so as to obtain each entity word in the second to-be-labeled sample text and a second prediction probability of the entity label information corresponding to each entity word. And then comparing the second prediction probability with a second preset threshold value, and labeling each entity word of the second sample text to be labeled based on the comparison result to obtain a labeled second sample text.
For example, when the specified type is the account type and the second preset threshold is 80%, the second sample text to be labeled "A donates 500 yuan to B" is input into the trained entity recognition model, which finds the entity words "A" and "B" in the text. The second prediction probabilities corresponding to the entity word "A" are: payee 10%, payer 90%; the second prediction probabilities corresponding to the entity word "B" are: payee 90%, payer 10%. Among the second prediction probabilities for the entity word "A", that of the label "payer" reaches the second preset threshold; among the second prediction probabilities for the entity word "B", that of the label "payee" reaches the second preset threshold. Based on the second prediction probabilities, the entity word "A" in the second sample text to be labeled "A donates 500 yuan to B" is labeled as the payer, and the entity word "B" is labeled as the payee.
In the above embodiment, when the second prediction probability of the entity label information reaches the second preset threshold, it is determined that the entity label information is the target entity label information, and the entity words corresponding to the target entity label information are labeled. By the method, the second sample text to be labeled can be accurately labeled, so that the training set of the entity recognition model is expanded, and the accuracy of the entity recognition model is effectively improved.
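The threshold rule for entity labels can be sketched with the numbers from the example. This is only an illustrative sketch: `label_entities` is a hypothetical name, the nested-dictionary layout of the predictions is an assumption, and so are the choices to consider only the highest-probability label for each entity word and to leave a word unlabeled when no label reaches the threshold, since the application does not describe that case.

```python
def label_entities(predictions, threshold=0.8):
    """Pick, for each entity word, the label whose probability reaches the threshold.

    predictions: {entity_word: {entity_label: probability}} (assumed layout).
    Words whose best label stays below the threshold are left unlabeled
    (an assumption; the application does not cover this case).
    """
    labeled = {}
    for word, label_probs in predictions.items():
        label, prob = max(label_probs.items(), key=lambda item: item[1])
        if prob >= threshold:
            labeled[word] = label
    return labeled


# Second prediction probabilities for "A donates 500 yuan to B".
predictions = {
    "A": {"payee": 0.10, "payer": 0.90},
    "B": {"payee": 0.90, "payer": 0.10},
}
labels = label_entities(predictions, threshold=0.8)
# labels == {"A": "payer", "B": "payee"}
```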
In a possible implementation of the embodiment of the present application, there are multiple pieces of target text content of the specified type, that is, multiple pieces of target text content in the contract text are recognized as being of the specified type. Accordingly, the step of determining the recognition result of the contract text may be implemented in the manner shown in fig. 6, which is a flowchart of determining the recognition result of a contract text in a contract text recognition method provided by an embodiment of the present application, and specifically includes the following steps:
Step 602, performing information fusion processing on the entity information in the multiple pieces of target text content of the specified type to obtain an entity fusion result.
Step 604, correcting each piece of entity information by using the entity fusion result to obtain corrected entity information.
Step 606, integrating the text type of the target text content and the corrected entity information to obtain the recognition result of the contract text.
In this embodiment, when there are multiple pieces of target text content of the specified type, entity information may be extracted from each piece, and information fusion processing is then performed on the entity information of the multiple pieces to obtain an entity fusion result. Information fusion processing refers to analyzing the multiple pieces of entity information to find the unified rule or regularity that they follow, and fusing them into one piece of entity information based on the analysis result. The entity fusion result can then be used to correct each piece of entity information to obtain corrected entity information, and the text type of the target text content and the corrected entity information are integrated to obtain the recognition result of the contract text, that is, the entity extraction result of the contract text. The entity extraction result of the contract text comprises the text type of the target text content and the corrected entity information. Correcting each piece of entity information means processing, based on the entity fusion result, the entity information that does not conform to the unified rule or regularity, specifically by operations such as deletion and modification, so that all entity information conforms to it. In this way, the accuracy of extracting entity information from the contract text can be effectively improved.
In a possible implementation of the embodiment of the present application, the step of extracting entity information in the target text content may be implemented as follows: selecting a preset number of pieces of target text content of the specified type, and sequentially extracting the entity information in the preset number of pieces of target text content of the specified type. Accordingly, step 602 may be implemented as follows: performing information fusion processing on the entity information in the preset number of pieces of target text content of the specified type to obtain an entity fusion result.
When there are multiple pieces of target text content of the specified type, a preset number of pieces may be selected from them for entity information extraction in order to further improve processing efficiency. In a specific implementation, type recognition may be performed on the pieces of target text content in sequence, and once the preset number of pieces (for example, 2) of target text content of the specified type have been recognized, type recognition on the remaining target text content stops. Alternatively, type recognition may be performed on all the target text content, and after the text types of all pieces are obtained, the preset number of pieces of target text content of the specified type are selected from them. Information fusion processing is then performed on the entity information in the preset number of pieces of target text content of the specified type to obtain an entity fusion result.
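The first variant, which stops type recognition once the preset number of specified-type pieces has been found, can be sketched as follows. This is only a sketch: `select_specified` is a hypothetical name, and the `is_specified_type` callback stands in for the text classification model.

```python
def select_specified(contents, is_specified_type, limit=2):
    """Classify contents in order and stop after `limit` specified-type pieces."""
    selected = []
    if limit <= 0:
        return selected
    for text in contents:
        if is_specified_type(text):
            selected.append(text)
            if len(selected) == limit:
                break  # remaining contents are never classified
    return selected


contents = [
    "A lends B 500 yuan",
    "this contract is made in two copies",
    "B pays 200 yuan to A",
    "A donates 500 yuan to B",
]
# Hypothetical classifier stub: treats any text mentioning "yuan" as account type.
picked = select_specified(contents, lambda text: "yuan" in text, limit=2)
```

With `limit=2`, the fourth piece is never passed to the classifier, which is where the efficiency gain described above comes from.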
In a possible implementation manner of the embodiment of the application, the entity information includes entity words in the target text content of the specified type, where the entity words in the target text content of the specified type carry corresponding entity labels. Accordingly, the step 602 may be specifically implemented as follows: counting the entity words with the same entity labels in the target text contents of the specified types; and obtaining an entity fusion result according to the statistical result.
In this embodiment, the entity information includes an entity word in the target text content of the specified type, and the entity word carries a corresponding entity tag. After the entity information in the target text contents of the multiple specified types is obtained, the entity words with the same entity labels in the target text contents of the multiple specified types need to be counted according to the entity words and the corresponding entity labels in the entity information, and then an entity fusion result is obtained according to the counting result.
For example, for the target text contents "A lends B 500 yuan", "B pays 200 yuan to A", and "A donates 500 yuan to B", the entity information is obtained as follows: in the entity information of the target text content "A lends B 500 yuan", the entity words are "A" and "B", where the entity word "A" carries the entity label "payer" and the entity word "B" carries the entity label "payee". In the entity information of the target text content "B pays 200 yuan to A", the entity words are "A" and "B", where the entity word "A" carries the entity label "payee" and the entity word "B" carries the entity label "payer". In the entity information of the target text content "A donates 500 yuan to B", the entity words are "A" and "B", where the entity word "A" carries the entity label "payer" and the entity word "B" carries the entity label "payee".
Counting the entity words with the same entity label across the target text contents gives the following statistical result: the entity word "A" carries the entity label "payee" 1 time and "payer" 2 times; the entity word "B" carries the entity label "payee" 2 times and "payer" 1 time. According to the statistical result, the entity fusion result is "A: payer; B: payee".
In the above embodiment, the entity words having the same entity label in the target text contents of a plurality of specified types are counted, and an entity fusion result is further obtained according to the statistical result. By the method, the accuracy of extracting the contract text entity information can be effectively improved.
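The counting in this example can be sketched as a per-word majority vote. This is only an illustrative sketch: `fuse_entities` is a hypothetical name, and choosing the most frequent label is an assumption that is consistent with the example but not stated as the only way to derive the fusion result from the statistics. Tie-breaking is not specified in the application; here the label counted first wins.

```python
from collections import Counter, defaultdict


def fuse_entities(entity_infos):
    """Count labels per entity word and keep the most frequent label."""
    counts = defaultdict(Counter)
    for info in entity_infos:
        for word, label in info.items():
            counts[word][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


entity_infos = [
    {"A": "payer", "B": "payee"},   # "A lends B 500 yuan"
    {"A": "payee", "B": "payer"},   # "B pays 200 yuan to A"
    {"A": "payer", "B": "payee"},   # "A donates 500 yuan to B"
]
fusion = fuse_entities(entity_infos)
# fusion == {"A": "payer", "B": "payee"}
```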
In a possible implementation manner of the embodiment of the present application, step 604 may be specifically implemented by: matching the entity fusion result with each entity information; and determining the entity information which is successfully matched as the corrected entity information, and deleting the entity information which is failed to be matched.
In this embodiment, the entity fusion result of the contract text includes the target entity words and the entity labels corresponding to the target entity words, which are obtained by counting the entity words with the same entity label across the multiple pieces of target text content of the specified type. Matching the entity fusion result with each piece of entity information means matching the target entity words and their corresponding entity labels against the entity words and corresponding entity labels of each piece of entity information to obtain a matching result. Accordingly, the entity information that is successfully matched is determined to be the corrected entity information, and the entity information that fails to match is deleted.
For example, the entity information of each piece of target text content of the specified type is as follows: in the entity information of the target text content "A lends B 500 yuan", the entity words are "A" and "B", where the entity word "A" carries the entity label "payer" and the entity word "B" carries the entity label "payee". In the entity information of the target text content "B pays 200 yuan to A", the entity words are "A" and "B", where the entity word "A" carries the entity label "payee" and the entity word "B" carries the entity label "payer". In the entity information of the target text content "A donates 500 yuan to B", the entity words are "A" and "B", where the entity word "A" carries the entity label "payer" and the entity word "B" carries the entity label "payee".
Counting the entity words with the same entity label across the target text contents and deriving the entity fusion result from the statistical result gives "A is the payer and B is the payee". That is, in the entity extraction result of the contract text, the target entity words are "A" and "B", the entity label corresponding to the target entity word "A" is "payer", and the entity label corresponding to the target entity word "B" is "payee".
Matching the target entity words and their corresponding entity labels against the entity words and corresponding entity labels of each piece of entity information gives the following matching result: matching succeeds for the target text contents "A lends B 500 yuan" and "A donates 500 yuan to B", and fails for the target text content "B pays 200 yuan to A". Finally, according to the matching result, the corrected entity information is determined to be "A is the payer and B is the payee".
In the above embodiment, the entity fusion result is matched with each entity information, the entity information that is successfully matched is determined as the corrected entity information according to the matching result, and the entity information that is failed in matching is deleted. By the method, the entity information which is not consistent with the entity fusion result in the entity information can be deleted, so that the accuracy of extracting the entity information by the entity identification model is improved.
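The match-and-delete correction can be sketched as a filter over the per-text entity information. This is only a sketch: `correct_entities` is a hypothetical name, and treating a piece of entity information as matched only when every one of its entity words carries the same label as in the fusion result is an assumption about what "successfully matched" means.

```python
def correct_entities(entity_infos, fusion):
    """Keep entity information that agrees with the fusion result; drop the rest."""
    return [
        info
        for info in entity_infos
        if all(fusion.get(word) == label for word, label in info.items())
    ]


fusion = {"A": "payer", "B": "payee"}
entity_infos = [
    {"A": "payer", "B": "payee"},   # "A lends B 500 yuan"      -> matches
    {"A": "payee", "B": "payer"},   # "B pays 200 yuan to A"    -> deleted
    {"A": "payer", "B": "payee"},   # "A donates 500 yuan to B" -> matches
]
corrected = correct_entities(entity_infos, fusion)
```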
According to the contract text identification method, the contract text is obtained, and the target text content meeting the preset conditions in the contract text is extracted, wherein the preset conditions are set based on the specified type of feature information; performing type identification on the target text content to obtain the text type of the target text content; and under the condition that the text type of the target text content is a specified type, extracting entity information in the target text content, and determining the identification result of the contract text. By the method, the target text content meeting the preset conditions is extracted from the contract text and the type of the target text content is identified, so that the data processing amount of type identification can be greatly reduced, and the efficiency of type identification is improved. And under the condition that the text type of the target text content is the specified type, extracting entity information in the target text content to determine the identification result of the contract text, extracting the target text content meeting the preset condition in the contract text, combining type identification, primarily identifying the type of the contract text, extracting the entity information in the target text content of the specified type, and determining the identification result of the contract text by combining the entity information, thereby improving the precision of the contract text identification.
In the following, the contract text recognition method provided by the present application is further described by taking its application to entity extraction in an account-type contract as an example. Here, contract text recognition identifies from the contract the roles played by each party in the payment transaction, or the relationship between the parties, and determines which parties are borrowers (payees, debtors) and which parties are lenders (payers, creditors) in the meaning expressed by the contract as a whole. Fig. 7 shows a processing flow chart of a contract text recognition method applied to an account-type contract according to an embodiment of the present application, which specifically includes the following steps:
Step 702, inputting a contract text.
A text description extracted from the contract text is input. The text description comprises a large number of sentences and paragraphs; some of the sentences express a lending relationship and some do not. For example, take a contract containing the sentence "Party A and Party B agree, RMB 500 is lent to Party B by bank transfer, and Party B is required to repay in full within 10 working days". In this sentence, "Party A and Party B agree" and "Party B is required to repay in full within 10 working days" are clauses without a lending relationship, while "RMB 500 is lent to Party B by bank transfer" is a clause with a lending relationship. The sentence is therefore divided into clauses using punctuation marks as the dividing points, which yields three clauses: "Party A and Party B agree", "RMB 500 is lent to Party B by bank transfer", and "Party B is required to repay in full within 10 working days".
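Punctuation-based clause division of this kind can be sketched as below. The delimiter set and the function name `split_clauses` are assumptions for illustration; the source only states that punctuation marks are used as the dividing points.

```python
import re
from typing import List

# Assumed delimiter set: common English and Chinese clause punctuation.
# Note: a plain "." would also split decimal amounts such as "500.00".
CLAUSE_DELIMITERS = r"[,.;:!?，。；：！？]"


def split_clauses(text: str) -> List[str]:
    """Divide a sentence into clauses at punctuation marks, dropping empty pieces."""
    parts = re.split(CLAUSE_DELIMITERS, text)
    return [part.strip() for part in parts if part.strip()]


sentence = (
    "Party A and Party B agree, RMB 500 is lent to Party B by bank transfer, "
    "Party B is required to repay in full within 10 working days."
)
clauses = split_clauses(sentence)
```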
Step 704, filtering with the payment behavior keyword table.
The text sentences are filtered through the payment behavior keyword table, which reduces the amount of text data that the model has to predict on. The payment behavior keyword table can be obtained by manual labeling and records words that represent payment behavior, such as "collect", "borrow", "pay", and "lend". In the scenario of recognizing account-type contract text, payment behavior is the main behavioral feature of the account type, so the contract text is filtered using the payment behavior keyword table and the target text content matching the table is screened out. For example, based on the above example, the filtering removes "Party A and Party B agree" and "Party B is required to repay in full within 10 working days", and retains "RMB 500 is lent to Party B by bank transfer" as the target text content.
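A minimal sketch of keyword-table filtering follows. The English keyword set is a hypothetical stand-in for the manually labeled table, and whole-word matching is an assumption chosen so that "repay" does not match "pay"; the source does not specify the matching rule at this step.

```python
import re
from typing import Iterable, List, Set

# Hypothetical English stand-ins for the manually labeled payment behavior keywords.
PAYMENT_KEYWORDS: Set[str] = {
    "pay", "pays", "paid",
    "borrow", "borrows", "borrowed",
    "lend", "lends", "lent",
    "collect", "collects", "collected",
}


def contains_payment_keyword(clause: str, keywords: Set[str] = PAYMENT_KEYWORDS) -> bool:
    """Return True if any whole word of the clause is in the keyword table."""
    words = re.findall(r"[a-z]+", clause.lower())
    return any(word in keywords for word in words)


def filter_clauses(clauses: Iterable[str]) -> List[str]:
    """Keep only clauses that match the payment behavior keyword table."""
    return [clause for clause in clauses if contains_payment_keyword(clause)]


clauses = [
    "Party A and Party B agree",
    "RMB 500 is lent to Party B by bank transfer",
    "Party B is required to repay in full within 10 working days",
]
targets = filter_clauses(clauses)
```

Only the second clause survives, so the classifier in the next step sees one candidate instead of three.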
Step 706, text classification and recognition.
In this embodiment, the text classification model is used to perform text classification and identification on the target text content, and the text classification model may be trained in a weak supervision iterative training manner based on sample data.
The trained text classification model judges, in parallel, whether the input contract text belongs to the account type, which is a typical binary classification task. Here, "in parallel" means batch processing, that is, a plurality of text sentences are processed and recognized at a time (batch inference prediction), which improves the efficiency of processing and recognition. For training, a certain number of account-type text sentences are first labeled as training data and a text classification model is trained on them. The model then predicts on unlabeled sample data, sample data with a prediction score higher than a certain threshold is selected and given pseudo labels (that is, the labels predicted by the model are assigned to the sample data), and the training data is expanded for iterative model training. This is the weakly supervised training process: a small amount of training data is labeled first, and the model is then used to pseudo-label unlabeled sample data and is retrained iteratively, which reduces the manual labeling cost of sample data. The acquisition of training data and the training process for the entity recognition model are basically the same.
Specifically, the target text contents may be input into the text classification model in parallel to determine whether each of them belongs to the account type; this is a typical binary classification task. After a certain number of account-type text contents have been labeled as training data, a text classification model is trained and used to predict on unlabeled sample data; sample data with prediction scores higher than a certain threshold is selected and assigned pseudo labels (that is, the labels predicted by the model), the training data is expanded, and the model is trained iteratively, which reduces the manual labeling cost of sample data.
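The pseudo-labeling loop can be sketched as follows. The `train` function is a deliberately trivial word-overlap scorer standing in for a real text classification model, and the threshold value and sample texts are invented for illustration; only the train, predict, threshold, expand, retrain cycle reflects the description.

```python
from typing import Callable, List, Tuple

Labeled = Tuple[str, int]  # (text, label); 1 = account type, 0 = other


def train(data: List[Labeled]) -> Callable[[str], float]:
    """Toy stand-in for classifier training: learn words seen only in positive samples."""
    positive, negative = set(), set()
    for text, label in data:
        (positive if label == 1 else negative).update(text.lower().split())
    vocabulary = positive - negative

    def predict(text: str) -> float:
        # Score = fraction of words that belong to the positive-only vocabulary.
        words = text.lower().split()
        return sum(w in vocabulary for w in words) / len(words) if words else 0.0

    return predict


def pseudo_label_round(labeled: List[Labeled], unlabeled: List[str], threshold: float):
    """Train, pseudo-label confident unlabeled samples, and return the expanded sets."""
    model = train(labeled)
    expanded, remaining = list(labeled), []
    for text in unlabeled:
        if model(text) >= threshold:
            expanded.append((text, 1))  # assign the label predicted by the model
        else:
            remaining.append(text)
    return expanded, remaining


labeled = [("party a lends party b money", 1), ("party a and party b agree", 0)]
unlabeled = ["party c lends money", "both parties agree"]
for _ in range(2):  # iterative training: each round retrains on the expanded set
    labeled, unlabeled = pseudo_label_round(labeled, unlabeled, threshold=0.5)
```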
Step 708, determine whether the contract text contains the target text content of the account type. If so, steps 710-714 are performed, otherwise step 716 is performed.
For a contract whose text contains no account-type content, the account type and the payer/payee extraction result are directly output as null. In this way, the technical scheme can help a human reader determine that no lending relationship exists in the contract.
Step 710, acquiring a preset number of pieces of account-type target text content.
The preset number is denoted N. Assume the preset number is 10. In the above example, one piece of target text content, "RMB 500 is lent to Party B by bank transfer", is obtained, which does not exceed the limit of 10. If the number of pieces exceeds the preset number, only the first 10 are taken and recognition of the remaining text stops. Generally, the first page of a contract states the main purpose of the whole contract, including the contact information of Party A and Party B and the "whereas" clauses, which describe what each party has prepared for the contract; the payment clauses then state the specific amount that one party pays to the other. Based on this characteristic of contracts, in one embodiment the first 10 pieces of target text content are obtained, through the operations of steps 704 to 710, from the paragraph following the word "whereas" and from the paragraph that follows a numbered heading containing the word "payment" and precedes the next numbered heading.
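Taking only the first N account-type pieces and then stopping recognition can be sketched with a lazy generator. The function name `take_first_n` and the toy classifier are assumptions for illustration; in the described method the classifier would be the text classification model.

```python
from itertools import islice
from typing import Callable, Iterable, List


def take_first_n(
    candidates: Iterable[str],
    is_account_type: Callable[[str], bool],
    n: int = 10,
) -> List[str]:
    """Return the first n candidates classified as account type, then stop classifying."""
    matches = (text for text in candidates if is_account_type(text))
    return list(islice(matches, n))  # lazy: later candidates are never classified


calls: List[str] = []


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for the text classification model; records each call."""
    calls.append(text)
    return "lends" in text


candidates = ["A lends B 500", "A and B agree", "C lends D 200", "E lends F 100"]
selected = take_first_n(candidates, toy_classifier, n=2)
```

With n set to 2, the fourth candidate is never passed to the classifier, which mirrors the statement that recognition stops once the preset number is reached.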
Step 712, entity recognition and extraction.
After account-type texts have been identified, N contract text descriptions (N being the preset number of pieces) are taken from all the account-type texts for prediction, and they are sent to the entity recognition model to extract the roles of the payer and the payee. In this embodiment, the entity recognition model is used to perform entity extraction on the target text content, and the entity recognition model may be trained in a weakly supervised iterative manner based on sample data; the specific training process is similar to that of the text classification model. For example, in the above example "RMB 500 is lent to Party B by bank transfer", "Party B" appears after the word "lent", so "Party B" is extracted as the payee role; neither "Party A" nor any payer role is extracted from this clause.
Step 714, voting fusion.
For the plurality of pieces of extracted entity information, this embodiment uses voting fusion: the extraction results of the N contract text descriptions are voted on and fused, and the payer/payee recognition and extraction result for the contract is output, which more effectively ensures the accuracy of extraction and recognition. For example, across the N pieces of text content, "Party A" is recognized as a payer entity 8 times in total and "Party B" is recognized as a payee entity 3 times in total; voting selects the most frequently predicted result, so the payer is "Party A". Voting fusion improves overall accuracy; moreover, because model prediction may be affected by the text content, particularly negative-sample text content, and may produce some wrong results, the benefit of voting fusion is more pronounced. Data acquisition and model training for this step also use the weakly supervised training process. Only two kinds of entity information are extracted here, the payee role and the payer role, and no other type of entity; for example, from "Party A pays Party B 500 yuan", Party A is extracted as the payer and Party B as the payee, without detailed information or anything else.
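Voting fusion over per-text extraction results can be sketched as a majority count per entity label. The data shape, the function name `vote_fusion`, and the sample votes are assumptions for illustration; ties are broken by first occurrence here, which the source does not specify.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def vote_fusion(extractions: List[Dict[str, str]]) -> Dict[str, str]:
    """For each entity label, pick the entity word predicted most often."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for extraction in extractions:
        for label, word in extraction.items():
            votes[label][word] += 1
    # most_common(1) returns the top-voted word; ties fall back to first-seen order.
    return {label: counter.most_common(1)[0][0] for label, counter in votes.items()}


# Illustrative extraction results from four pieces of account-type text.
extractions = [
    {"payer": "Party A", "payee": "Party B"},
    {"payer": "Party A"},
    {"payer": "Party B", "payee": "Party B"},  # a wrong payer prediction
    {"payer": "Party A", "payee": "Party C"},  # a wrong payee prediction
]
fused = vote_fusion(extractions)
```

The two erroneous predictions are outvoted, so the fused result names Party A as payer and Party B as payee.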
After the payer/payee entity information is obtained through voting fusion, the extracted account-type text can be post-processed and filtered in combination with the payment behavior keywords and/or the payer/payee entity information, so as to eliminate text sentences that may contain erroneous information (for example, the payer/payee entity information in the sentence does not conform to the entity information obtained through voting fusion) or that were wrongly recognized in step 706 (the sentence contains neither payer/payee entity information nor a payment behavior keyword, but was recognized as account-type text). After the wrongly recognized text sentences have been removed by this post-processing, step 718 is executed, which improves the accuracy of the output type extraction result.
Step 716, outputting a null result.
In the case that the contract text is identified as containing no account-type text content, the account type and the payer/payee extraction result are directly output as null.
Step 718, outputting the type extraction result.
Finally, the extracted account type of the contract and the payer/payee entity information are output. For the above example, the output content is: account type: yes; payee: Party B; payer: Party A.
For a contract text identified as containing no account-type text content, the output extraction result is: account type: no; payee: null; payer: null.
For identified account-type text content where there are several signing parties, the output extraction result may be: account type: yes; payee: Party B and Party C; payer: Party A and Party D.
If the contract text contains account-type target text content, voting fusion is applied to the plurality of pieces of extracted entity information, and the output type extraction result is the payer/payee recognition and extraction result for the contract; if the contract text does not contain account-type target text content, the type extraction result is directly output as null.
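The three output cases can be represented by one small record type, sketched below. The class name `RecognitionResult`, its fields, and the exact rendered string are assumptions for illustration, not a format defined by the application.

```python
from dataclasses import dataclass, field
from typing import List


def _join(names: List[str]) -> str:
    """Render a list of parties, or 'null' when nothing was extracted."""
    return " and ".join(names) if names else "null"


@dataclass
class RecognitionResult:
    """Hypothetical container for the type extraction result of one contract."""
    is_account_type: bool
    payees: List[str] = field(default_factory=list)
    payers: List[str] = field(default_factory=list)

    def render(self) -> str:
        flag = "yes" if self.is_account_type else "no"
        return f"account type: {flag}; payee: {_join(self.payees)}; payer: {_join(self.payers)}"


single = RecognitionResult(True, payees=["Party B"], payers=["Party A"])
empty = RecognitionResult(False)  # no account-type content: everything is null
multi = RecognitionResult(True, payees=["Party B", "Party C"], payers=["Party A", "Party D"])
```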
By the method, the target text content matched with the payment behavior keyword list is extracted from the contract text and the type of the target text content is identified, so that the data processing amount of type identification can be greatly reduced, and the efficiency of account type identification is improved. And under the condition that the contract text is determined to contain the target text content of the account type, extracting entity information in the target text content, and outputting an account type extraction result in a voting fusion mode, so that the precision of the identification of the contract text of the account type is improved.
Corresponding to the above method embodiment, the present application further provides an embodiment of a contract text recognition apparatus, and fig. 8 shows a schematic structural diagram of a contract text recognition apparatus provided in an embodiment of the present application. As shown in fig. 8, the apparatus includes:
the screening module 802 is configured to acquire a contract text and extract target text content meeting preset conditions in the contract text, wherein the preset conditions are set based on feature information of a specified type;
the identification module 804 is configured to perform type identification on the target text content to obtain a text type of the target text content;
and the extracting module 806 is configured to extract the entity information in the target text content and determine the identification result of the contract text in the case that the text type of the target text content is a specified type.
Optionally, the screening module 802 is further configured to divide the contract text to obtain each text content in the contract text; and, for any text content, match words in the text content against a preset behavior keyword library and, if the matching result meets a preset matching condition, determine the text content as the target text content, wherein the preset behavior keyword library records behavior keywords of the specified type.
Optionally, the preset matching condition is that the ratio of words in the text content that match the preset behavior keyword library reaches a preset ratio threshold.
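The ratio-based matching condition can be sketched as follows. The keyword set, the whitespace tokenization, and the threshold values are assumptions for illustration; the source states only that the ratio of matched words must reach a preset ratio threshold.

```python
from typing import List, Set


def keyword_ratio(words: List[str], keywords: Set[str]) -> float:
    """Fraction of words in the text content that appear in the keyword library."""
    if not words:
        return 0.0
    return sum(word in keywords for word in words) / len(words)


def meets_matching_condition(words: List[str], keywords: Set[str], ratio_threshold: float) -> bool:
    """True when the matched-word ratio reaches the preset ratio threshold."""
    return keyword_ratio(words, keywords) >= ratio_threshold


# Hypothetical keyword library and a tokenized clause (1 of 9 words matches).
keywords = {"lent", "pay", "borrow", "collect"}
words = "rmb 500 is lent to party b by transfer".split()
ratio = keyword_ratio(words, keywords)
```

A low threshold admits any clause with at least one keyword, while a higher threshold keeps only clauses dominated by payment vocabulary.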
Optionally, the target text content of the specified type is multiple pieces;
the extracting module 806 is further configured to perform information fusion processing on entity information in the target text contents of the multiple specified types to obtain an entity fusion result; correcting each entity information by using the entity fusion result to obtain corrected entity information; and integrating the text type of the target text content and the corrected entity information to obtain the identification result of the contract text.
Optionally, the extracting module 806 is further configured to select a preset number of pieces of target text content of the specified type, and sequentially extract entity information in the preset number of pieces of target text content of the specified type; and carrying out information fusion processing on entity information in a preset number of pieces of specified types of target text content to obtain an entity fusion result.
Optionally, the entity information includes entity words in the target text content of the specified type, where the entity words in the target text content of the specified type carry corresponding entity labels;
the extracting module 806 is further configured to count entity words with the same entity tag in the plurality of pieces of specified types of target text content; and obtaining an entity fusion result according to the statistical result.
Optionally, the extracting module 806 is further configured to match the entity fusion result with each entity information; and determining the entity information which is successfully matched as the corrected entity information, and deleting the entity information which is failed in matching.
Optionally, the identifying module 804 is further configured to input the target text content into a text classification model, so as to obtain a text type of the target text content, where the text classification model is obtained by pre-training based on a sample text carrying a type tag.
Optionally, the apparatus further comprises a first training module;
the first training module is configured to obtain a first training set and a first sample text to be labeled, wherein the first training set comprises a plurality of sample texts carrying type labels; train the text classification model by using the first training set; input the first sample text to be labeled into the trained text classification model to obtain a first prediction probability of whether the first sample text to be labeled is of the specified type; label the first sample text to be labeled based on the first prediction probability to obtain a labeled first sample text; and add the labeled first sample text to the first training set to obtain an updated first training set, and continue training the text classification model by using the updated first training set.
Optionally, the first training module is further configured to label the first sample text to be labeled as the first sample text of the specified type if the first prediction probability reaches a first preset threshold; and under the condition that the first prediction probability does not reach a first preset threshold value, marking the first sample text to be marked as the first sample text of the non-specified type.
Optionally, the extracting module 806 is further configured to input the target text content into an entity identification model, and extract entity information in the target text content, where the entity identification model is obtained by pre-training based on a sample text carrying entity label information and belonging to a specified type.
Optionally, the apparatus further comprises a second training module;
the second training module is configured to obtain a second training set and a second sample text to be labeled, wherein the second training set comprises a plurality of sample texts which carry entity label information and belong to a specified type; training the entity recognition model by utilizing a second training set; inputting the second sample text to be labeled into the trained entity recognition model to obtain each entity word in the second sample text to be labeled and a second prediction probability of entity label information corresponding to each entity word; labeling each entity word of the second sample text to be labeled based on the second prediction probability to obtain a labeled second sample text; and adding the labeled second sample text into a second training set to obtain an updated second training set, and continuously training the entity recognition model by using the updated second training set.
Optionally, the second training module is further configured to determine that the entity label information is target entity label information and label the entity word corresponding to the target entity label information when the second prediction probability reaches a second preset threshold.
The contract text recognition apparatus obtains a contract text and extracts target text content meeting a preset condition from it, wherein the preset condition is set based on feature information of a specified type; performs type recognition on the target text content to obtain its text type; and, in the case that the text type of the target text content is the specified type, extracts entity information in the target text content and determines the recognition result of the contract text. Because only the target text content meeting the preset condition is extracted from the contract text and submitted to type recognition, the amount of data processed by type recognition can be greatly reduced and its efficiency improved. In addition, extracting the target text content and combining it with type recognition gives a preliminary identification of the type of the contract text; entity information is then extracted only from target text content of the specified type, and the recognition result is determined in combination with that entity information, thereby improving the precision of contract text recognition.
The above is an illustrative scheme of a contract text recognition apparatus of the present embodiment. It should be noted that the technical solution of the contract text recognition apparatus and the technical solution of the contract text recognition method belong to the same concept, and details of the technical solution of the contract text recognition apparatus, which are not described in detail, can be referred to the description of the technical solution of the contract text recognition method. Further, the components in the device embodiment should be understood as functional blocks that must be created to implement the steps of the program flow or the steps of the method, and each functional block is not actually divided or separately defined. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
Fig. 9 shows a block diagram of a computing device according to an embodiment of the present application. Components of the computing device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is coupled to the memory 910 via a bus 930, and a database 950 is used to store data.
Computing device 900 also includes access device 940, which enables computing device 900 to communicate via one or more networks 960. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 940 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 900 and other components not shown in FIG. 9 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 9 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 900 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 900 may also be a mobile or stationary server.
Wherein processor 920 is configured to execute the computer-executable instructions of the contract text recognition method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned contract text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned contract text recognition method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the contract text recognition method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned contract text recognition method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned contract text recognition method.
An embodiment of the present application further provides a chip, in which a computer program is stored, and the computer program implements the steps of the contract text recognition method when executed by the chip.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (16)

1. A contract text recognition method is characterized by comprising the following steps:
acquiring a contract text, and extracting target text content meeting preset conditions in the contract text, wherein the preset conditions are set based on specified types of feature information;
performing type identification on the target text content to obtain a text type of the target text content;
and under the condition that the text type is the specified type, extracting entity information in the target text content, and determining the identification result of the contract text.
2. The method according to claim 1, wherein the extracting of the target text content meeting the preset condition in the contract text comprises:
dividing the contract text to obtain each text content in the contract text;
and, for any text content, matching words in the text content against a preset behavior keyword library, and if the matching result meets a preset matching condition, determining the text content as the target text content, wherein the preset behavior keyword library records behavior keywords of the specified type.
3. The method according to claim 2, wherein the preset matching condition is that a ratio of words in the text content matching the preset behavior keyword library reaches a preset ratio threshold.
4. The method of claim 1, wherein the specified type of target text content is a plurality of pieces;
the determining of the recognition result of the contract text comprises:
carrying out information fusion processing on entity information in the target text contents of the specified types to obtain an entity fusion result;
correcting the entity information by using the entity fusion result to obtain corrected entity information;
and integrating the text type and the corrected entity information to obtain the identification result of the contract text.
5. The method of claim 4, wherein the extracting entity information from the target text content comprises:
selecting a preset number of pieces of target text contents of the specified type, and sequentially extracting entity information in the preset number of pieces of target text contents of the specified type;
the performing information fusion processing on the entity information in the plurality of pieces of target text content of the specified type to obtain an entity fusion result comprises:
and carrying out information fusion processing on entity information in a preset number of pieces of the specified type of target text content to obtain an entity fusion result.
6. The method according to claim 4 or 5, wherein the entity information includes entity words in the specified type of target text content, wherein the entity words in the specified type of target text content carry corresponding entity labels;
the performing information fusion processing on the entity information in the plurality of pieces of target text content of the specified type to obtain an entity fusion result comprises:
counting the entity words with the same entity labels in the target text contents of the specified types;
and obtaining an entity fusion result according to the statistical result.
7. The method according to claim 4 or 5, wherein the correcting each entity information by using the entity fusion result to obtain corrected entity information comprises:
matching the entity fusion result with each entity information;
and determining the entity information which is successfully matched as the corrected entity information, and deleting the entity information which is failed in matching.
8. The method according to any one of claims 1 to 5, wherein the type recognition of the target text content to obtain the text type of the target text content comprises:
and inputting the target text content into a text classification model to obtain the text type of the target text content, wherein the text classification model is obtained by pre-training based on a sample text carrying a type label.
9. The method of claim 8, wherein, before the inputting the target text content into a text classification model to obtain the text type of the target text content, the method further comprises:
acquiring a first training set and a first sample text to be labeled, wherein the first training set comprises a plurality of sample texts carrying type labels;
training a text classification model by using the first training set;
inputting the first sample text to be labeled into a trained text classification model to obtain a first prediction probability of whether the first sample text to be labeled is of a specified type;
labeling the first sample text to be labeled based on the first prediction probability to obtain a labeled first sample text;
and adding the marked first sample text into the first training set to obtain an updated first training set, and continuously training the text classification model by using the updated first training set.
10. The method of claim 9, wherein labeling the first sample text to be labeled based on the first prediction probability comprises:
when the first prediction probability reaches a first preset threshold, labeling the first sample text to be labeled as a first sample text of the specified type;
and when the first prediction probability does not reach the first preset threshold, labeling the first sample text to be labeled as a first sample text of a non-specified type.
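The train, predict, label, and retrain loop of claims 9 and 10 can be sketched as one self-training round. The keyword-overlap "model" below is a toy stand-in chosen only to keep the example runnable; it is not the text classification model of the claims. The function names and the threshold value 0.5 are likewise assumptions.

```python
def train_keyword_model(training_set):
    """Toy stand-in for training: collect words seen in specified-type samples."""
    positive_words = set()
    for text, is_specified in training_set:
        if is_specified:
            positive_words.update(text.lower().split())
    return positive_words


def keyword_proba(model, text):
    """Toy prediction probability: share of words known from positive samples."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in model for w in words) / len(words)


def self_training_round(training_set, unlabeled_texts, threshold=0.5):
    """Train, pseudo-label by threshold, extend the training set, retrain."""
    model = train_keyword_model(training_set)
    # Claim 10: reaching the threshold -> specified type, otherwise non-specified.
    labeled = [(t, keyword_proba(model, t) >= threshold) for t in unlabeled_texts]
    updated_set = training_set + labeled
    return train_keyword_model(updated_set), updated_set


# Hypothetical first training set and first sample texts to be labeled.
first_training_set = [
    ("party a signs the contract", True),
    ("weather is nice today", False),
]
model, updated_set = self_training_round(
    first_training_set, ["party b signs", "nice weather"]
)
```

Swapping the two toy functions for a real classifier's fit and probability calls would leave `self_training_round` unchanged in structure.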
11. The method according to any one of claims 1 to 5, wherein extracting the entity information in the target text content comprises:
inputting the target text content into an entity recognition model and extracting the entity information in the target text content, wherein the entity recognition model is pre-trained on sample texts that carry entity label information and belong to the specified type.
12. The method of claim 11, further comprising, before inputting the target text content into the entity recognition model and extracting the entity information in the target text content:
acquiring a second training set and a second sample text to be labeled, wherein the second training set comprises a plurality of sample texts that carry entity label information and belong to the specified type;
training the entity recognition model by using the second training set;
inputting the second sample text to be labeled into the trained entity recognition model to obtain each entity word in the second sample text to be labeled and a second prediction probability of the entity label information corresponding to each entity word;
labeling each entity word of the second sample text to be labeled based on the second prediction probability to obtain a labeled second sample text;
and adding the labeled second sample text to the second training set to obtain an updated second training set, and continuing to train the entity recognition model by using the updated second training set.
13. The method of claim 12, wherein labeling the second sample text to be labeled based on the second prediction probability comprises:
when the second prediction probability reaches a second preset threshold, determining the entity label information as target entity label information, and labeling the entity word corresponding to the target entity label information.
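The threshold rule of claim 13, together with the training-set update of claim 12, can be sketched as follows. The entity recognition model is not shown; its output is assumed to be a list of (entity word, entity label, probability) triples, and the threshold 0.9, the function name, and the sample data are hypothetical.

```python
def extend_training_set(training_set, sample_text, predictions, threshold=0.9):
    """Label entity words whose probability reaches the threshold, then append.

    training_set: list of (text, [(entity_word, entity_label), ...]) samples.
    predictions: assumed model output, a list of (word, label, probability).
    Returns a new training set; the input list is left unmodified.
    """
    # Only labels whose probability reaches the threshold become target labels.
    labeled = [(w, lab) for w, lab, p in predictions if p >= threshold]
    return training_set + [(sample_text, labeled)]


# Hypothetical second training set and model output for one unlabeled sample.
second_training_set = [
    ("Party A: Acme Ltd", [("Acme Ltd", "party_a")]),
]
predictions = [
    ("Beta Co", "party_b", 0.97),
    ("Monday", "sign_date", 0.41),
]
updated_second_set = extend_training_set(
    second_training_set, "Party B: Beta Co, signed Monday", predictions
)
```

The claim only states what happens when the threshold is reached; leaving low-probability entity words unlabeled, as done here, is one possible reading.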
14. A contract text recognition apparatus, comprising:
a screening module configured to acquire a contract text and extract, from the contract text, target text content that meets a preset condition, wherein the preset condition is set based on feature information of a specified type;
an identification module configured to perform type recognition on the target text content to obtain the text type of the target text content;
and an extraction module configured to, when the text type is the specified type, extract entity information in the target text content and determine the recognition result of the contract text.
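How the three modules of claim 14 fit together can be sketched as a small pipeline. The claim only says the preset condition is based on feature information of the specified type; treating that as a keyword filter is an assumption, and the classifier and extractor are passed in as hypothetical callables rather than real models.

```python
def recognize_contract(paragraphs, feature_keywords, classify, extract_entities,
                       specified_type):
    """Screen, classify, then extract entities from passages of the specified type.

    paragraphs: list of text passages from one contract.
    feature_keywords: assumed form of the preset condition (keyword match).
    classify / extract_entities: stand-ins for the two models.
    """
    # Screening module: keep passages that contain any feature keyword.
    targets = [p for p in paragraphs if any(k in p for k in feature_keywords)]
    results = []
    for passage in targets:
        # Identification module: only passages of the specified type go on.
        if classify(passage) == specified_type:
            # Extraction module: collect the entity information.
            results.append((passage, extract_entities(passage)))
    return results


# Hypothetical contract passages and trivial stand-in models.
paragraphs = [
    "Payment: Party A shall pay 1000 yuan.",
    "Payment methods are described elsewhere.",
    "This agreement is governed by local law.",
]
results = recognize_contract(
    paragraphs,
    feature_keywords=["Payment"],
    classify=lambda p: "payment_clause" if "shall pay" in p else "other",
    extract_entities=lambda p: [("1000 yuan", "amount")] if "1000 yuan" in p else [],
    specified_type="payment_clause",
)
```

The second passage passes the screening step but is rejected by the classifier, which shows why the method runs type recognition after the cheaper screening.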
15. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the contract text recognition method of any one of claims 1 to 13.
16. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the contract text recognition method of any one of claims 1 to 13.
CN202210582893.XA 2022-05-26 2022-05-26 Contract text recognition method and device Pending CN114818718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210582893.XA CN114818718A (en) 2022-05-26 2022-05-26 Contract text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210582893.XA CN114818718A (en) 2022-05-26 2022-05-26 Contract text recognition method and device

Publications (1)

Publication Number Publication Date
CN114818718A (en) 2022-07-29

Family

ID=82519559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210582893.XA Pending CN114818718A (en) 2022-05-26 2022-05-26 Contract text recognition method and device

Country Status (1)

Country Link
CN (1) CN114818718A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270797A (en) * 2022-09-23 2022-11-01 山东省计算中心(国家超级计算济南中心) Text entity extraction method and system based on self-training semi-supervised learning

Similar Documents

Publication Publication Date Title
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN112001177A (en) Electronic medical record named entity identification method and system integrating deep learning and rules
CN109685056B (en) Method and device for acquiring document information
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN111581961A (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
CN114580382A (en) Text error correction method and device
CN113961685A (en) Information extraction method and device
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and Bi-LSTM
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN116070632A (en) Informal text entity tag identification method and device
CN111242710A (en) Business classification processing method and device, service platform and storage medium
CN110929015A (en) Multi-text analysis method and device
CN114240672A (en) Method for identifying green asset proportion and related product
CN114818718A (en) Contract text recognition method and device
CN109635289B (en) Entry classification method and audit information extraction method
CN115640401B (en) Text content extraction method and device
CN115906835A (en) Chinese question text representation learning method based on clustering and contrast learning
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN111736804B (en) Method and device for identifying App key function based on user comment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination