CN117494719A - Contract text processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117494719A
Authority
CN
China
Prior art keywords
text
contract
character
text segment
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310641191.9A
Other languages
Chinese (zh)
Inventor
夏志超
马超
肖冰
夏粉
蒋宁
吴海英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202310641191.9A
Publication of CN117494719A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a contract text processing method and device, electronic equipment and a storage medium. The method comprises the following steps: performing semantic coding on a text segment of a target contract text to obtain a first coding sequence; determining an absolute position coding result for each character based on the contract position corresponding to that character in the text segment, and introducing the absolute position coding result of each character into the first coding sequence; coding the first coding sequence based on an attention mechanism over the relative positions among the characters to obtain a second coding sequence; and, based on the second coding sequence, predicting the contract entity identifier of each character in the text segment, and extracting the entities corresponding to the contract elements of the text segment according to the contract entity identifiers of the characters. The method and the device can improve the accuracy of entity identification in contracts.

Description

Contract text processing method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a contract text processing method and device, electronic equipment, and a storage medium.
Background
With the development of artificial intelligence technology, the task of extracting text information has gradually come to be performed by machines.
The current mainstream way for machines to extract text information is to split the whole text into sentence-level text fragments, identify the entities in each text fragment, and then extract the information of the required entities from each fragment. In this way, the text fragments obtained by splitting lose their structural association within the whole text, resulting in low accuracy of entity recognition.
The above problem is particularly prominent in scenarios where information is extracted from contract text. Compared with ordinary text, contract text often contains the same text information corresponding to different entities. For example, the machine-printed Party A name at the beginning of a contract and the hand-signed Party A signature at the end of the contract carry identical text information, but the entities differ: one entity is the Party A signature and the other is the Party A name. For this reason, how to improve the accuracy of entity identification in contracts is a hot topic of current research.
Disclosure of Invention
The application aims to provide a method and a device for processing contract text, electronic equipment and a storage medium, which can improve the accuracy of entity identification in a contract.
In order to achieve the above object, embodiments of the present application are implemented as follows:
in a first aspect, a method for processing contract text is provided, including:
performing semantic coding on a text segment in a target contract text to obtain a first coding sequence of the text segment, wherein the first coding sequence of the text segment comprises the semantic coding results of the characters in the text segment, and the text segment comprises contract elements and their corresponding entities;
determining an absolute position coding result for each character in the text segment based on the contract position corresponding to that character, adding the absolute position coding result corresponding to each character into the first coding sequence of the text segment, and coding the first coding sequence with the absolute position coding results added based on an attention mechanism over the relative positions among the characters, so as to obtain a second coding sequence of the text segment;
and predicting, based on the second coding sequence of the text segment, the contract entity identifier of each character in the text segment, and extracting the entities corresponding to the contract elements of the text segment according to the contract entity identifiers of the characters.
In a second aspect, there is provided a contract text processing apparatus including:
a coding unit, used for performing semantic coding on the text fragments in the target contract text to obtain a first coding sequence of each text fragment, wherein the first coding sequence of a text fragment comprises the semantic coding results of the characters in the text fragment, and the text fragment comprises contract elements and their corresponding entities;
the coding unit being further configured to determine an absolute position coding result for each character in the text fragment based on the contract position corresponding to that character, add the absolute position coding result corresponding to each character into the first coding sequence of the text fragment, and code the first coding sequence with the absolute position coding results added based on an attention mechanism over the relative positions among the characters, so as to obtain a second coding sequence of the text fragment;
and an extraction unit, used for predicting the contract entity identifier of each character in the text fragment based on the second coding sequence of the text fragment, and extracting the entities corresponding to the contract elements of the text fragment according to the contract entity identifiers of the characters.
In a third aspect, there is provided an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of the first aspect.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of the first aspect.
In the present application, a text segment of the target contract text is first subjected to conventional semantic coding to obtain a first coding sequence of the text segment, where the first coding sequence comprises the semantic coding results of the characters in the text segment. Then, the absolute position coding result of each character is determined according to the contract position corresponding to that character and introduced on the basis of the first coding sequence, so that coding results with the same text information but different entities are distinguished in the first coding sequence. Next, the first coding sequence with the absolute position coding results added is context-coded based on an attention mechanism over the relative positions among the characters, yielding a second coding sequence, which is thus a coding result that reflects both the contract position and the context. Finally, the contract entity identifier of each character is predicted from the second coding sequence; that is, the entity corresponding to text information in the contract is identified by combining local context semantics with the global contract position, so that even identical text information in the target contract text can be resolved to different entities, thereby improving the accuracy of entity identification in contract text.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person having ordinary skill in the art.
Fig. 1 is a schematic diagram of a conventional flow in which a machine extracts and verifies the Party A signature information in contract text.
Fig. 2 is a flow chart of a method for processing contract text according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a target contract text according to an embodiment of the present application.
Fig. 4 is a schematic diagram of the OCR software scanning sub-texts from the target contract text.
Fig. 5 is a schematic diagram of the OCR software merging the scanned sub-texts.
Fig. 6 is a schematic diagram of the OCR software segmenting the target contract text into multiple text segments.
Fig. 7 is a schematic diagram of the sub-coding sequence output by the BERT model for a single sentence.
Fig. 8 is a schematic diagram of introducing the contract positions corresponding to characters into the first coding sequence.
Fig. 9 is a schematic diagram of the OCR software marking multiple text segments of the target contract text.
Fig. 10 is a schematic diagram of the target contract text filtered by the OCR software.
Fig. 11 is a schematic diagram of masking the characters of unnecessary sentences in the first coding sequence.
Fig. 12 is a schematic diagram of a sliding-window self-attention mechanism selecting characters within a sliding window.
Fig. 13 is a schematic diagram of the OCR software marking contract entity identifiers in the target contract text.
Fig. 14 is a schematic diagram of a first structure of a contract entity extraction model according to an embodiment of the present application.
Fig. 15 is a schematic diagram of a second structure of the entity extraction model of the contract according to the embodiment of the application.
Fig. 16 is a flow chart of a model training method according to an embodiment of the present application.
Fig. 17 is a third structural diagram of a contract entity extraction model according to an embodiment of the present application.
Fig. 18 is a schematic structural diagram of a processing apparatus for contract text according to an embodiment of the present application.
Fig. 19 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present specification, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
As mentioned above, with the development of artificial intelligence technology, the task of extracting text information has gradually come to be performed by machines. Currently, mainstream machine extraction of text information is based on Natural Language Processing (NLP) technology. NLP techniques can semantically understand text content to identify the entities present in the text. For a machine, once the entities in a text are determined, information can be extracted from those entities.
The Bidirectional Encoder Representations from Transformers (BERT) model is the language model most commonly used for NLP tasks. The BERT model can encode text based on context semantics, and the resulting coding sequence can be used to predict the entities in the text.
The maximum number of characters such a language model supports as input is 512, and in practice preferably no more than 256. As a result, when the model is applied to a text with a large number of characters, the text must be split into sentence-level text fragments, and each text fragment is then input to the BERT model separately for entity recognition. In this way, the text fragments obtained by splitting lose their structural association within the whole text, so that entity recognition loses the analysis factor of global text position, which in turn affects accuracy.
The above problem is particularly prominent in scenarios where information is extracted from contract text. Compared with ordinary text, contract text often contains the same text information corresponding to different entities. For example, the machine-printed Party A name at the beginning of a contract and the hand-signed Party A signature at the end of the contract are the same text information but correspond to different entities: one corresponding entity is the Party A name, and the other is the Party A signature.
Assume here that the application scenario is a machine check of whether the Party A signature in a contract text is erroneous. Referring to fig. 1, a conventional flow in which a machine checks the Party A signature in contract text includes:
1) The user initiates a verification request for the "Party A signature" of the contract text to the machine, where the verification request carries the contract text. It should be noted that the machine may be a signature verification device for performing signature verification, such as a terminal or a server.
2) The machine verifies the text information of the "Party A signature". The specific verification steps include:
2.1, the machine inputs the split text fragments into a BERT model, and the BERT model encodes each text fragment based on context semantics to obtain the coding sequence of each text fragment;
2.2, the machine inputs the coding sequence of each text fragment into a trained classifier capable of identifying contract entities, and the classifier predicts the contract entity identifier sequence of each text fragment; typically, the contract entity identifier sequence of a text fragment comprises the entity identifiers corresponding to the characters in the fragment, where an entity identifier is a machine-language label of entity information on a character, indicating the entity to which the character belongs and the character's position within that entity (if the entity consists of multiple characters);
2.3, the machine extracts, as indicated by the contract entity identifier sequence, the text information whose entity is the "Party A signature" from the contract for verification.
3) The machine feeds back the verification results to the user.
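The contract entity identifier sequence in step 2.2 labels each character with the entity it belongs to and its position within that entity. A common concrete realization of such a scheme is BIO tagging; the following is a minimal sketch of decoding BIO-style tags into entity spans (the BIO scheme and the tag names are assumptions for illustration, not stated in the source):

```python
def decode_entities(chars, tags):
    """Group characters into (entity_name, entity_text) spans from
    BIO-style tags such as "B-Name", "I-Name", "O" (assumed scheme)."""
    entities, current = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):          # a new entity begins here
            if current:
                entities.append(current)
            current = [tag[2:], ch]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1] += ch              # continue the current entity
        else:                             # "O" or an inconsistent tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(name, text) for name, text in entities]
```

For example, a hand-signed name at the end of the contract would be decoded into a single span labeled with its signature entity, distinct from an identically spelled name span elsewhere.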
Let Party A in the above flow be the target user; that is, the machine-printed Party A name at the beginning of the contract and the hand-signed Party A signature at the end of the contract are both the target user's name in the contract text. In the traditional mode, after the machine-printed Party A name "Zhang Zhen" at the beginning of the contract and the hand-signed Party A signature "Zhang Zhen" at the end of the contract are input into the BERT model, even though the BERT model has a certain understanding capability based on context semantics, it is difficult for the model to encode the two occurrences of "Zhang Zhen" differently. Obviously, once the coding results of the two occurrences on the BERT model side are identical, the classifier cannot identify the potential difference between them, and the machine-printed "Zhang Zhen" at the beginning of the contract and the hand-signed "Zhang Zhen" at the end are wrongly classified as the same entity, for example both filed as the Party A signature, or both filed as the Party A name. As a result, the machine may finally mistake the machine-printed "Zhang Zhen" at the beginning of the contract for the "Party A signature" and extract it for verification.
It can be seen that the traditional approach is not well suited to recognition scenarios whose object is contract text.
For this reason, the present application aims to propose a solution that can improve the accuracy of entity recognition in contracts. Specifically, considering that the same text information may correspond to different entities at different positions of a contract, the position of a character in the contract is used as a dimension for identifying entities. After the position information corresponding to each character is introduced, coding results with the same text information but different entities can be distinguished; the entities can then be accurately identified according to this distinction, and the corresponding content extracted from the contract text according to the identified entities.
In the present application, a text segment of the target contract text is subjected to conventional semantic coding to obtain a first coding sequence of the text segment, where the first coding sequence comprises the semantic coding results of the characters in the text segment. Then, the absolute position coding result of each character is determined according to the contract position corresponding to that character and introduced on the basis of the first coding sequence, so that coding results with identical text information but different entities (i.e., belonging to different contract elements) are distinguished in the first coding sequence. Next, the first coding sequence is further coded based on an attention mechanism over the relative positions among the characters to obtain a second coding sequence, which is thus a coding result that reflects both the contract position and the context of the text. Finally, the contract entity identifier of each character is predicted from the second coding sequence; that is, the entity corresponding to text information in the contract is identified by combining local context semantics with the global contract position, so that even identical text information in the target contract text can be resolved to different entities, thereby improving the accuracy of entity identification in contract text.
The solution of the present application may be executed by an electronic device, and in particular may be executed by a processor of the electronic device. So-called electronic devices may include terminal devices such as smartphones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, smart appliances, smart watches, car terminals, aircraft, etc.; alternatively, the electronic device may further include a server, such as an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides a cloud computing service.
Based on the foregoing, an embodiment of the present application provides a method for processing contract text. Fig. 2 is a flow chart of the method, which comprises the following steps:
s202, performing semantic coding on text fragments in a target contract text to obtain a first coding sequence of the text fragments, wherein the first coding sequence of the text fragments comprises semantic coding results of all characters in the text fragments; the text segment contains contract elements and their corresponding entities.
In the present application, a character refers to the minimum text unit in the target contract text, such as a Chinese character or an English word. In the NLP field, characters are generally represented as Tokens. The semantic coding of the target contract text is performed at Token granularity; that is, the first coding sequence consists of the semantic coding results of individual Tokens.
As an exemplary introduction, an existing language model (e.g., the BERT model) may be employed for semantic coding. The BERT model has context-based coding capability: it performs semantic coding at Token granularity but combines the semantics of adjacent Tokens when coding each Token, so that the semantic coding result of a Token has a certain accuracy.
As mentioned above, the number of characters supported by language models such as BERT preferably does not exceed 256. In the present application, if the number of characters of the target contract text exceeds 256, the target contract text needs to be segmented into multiple text fragments, and each text fragment is then input into the language model separately for semantic coding.
As an exemplary introduction, the present application may use existing Optical Character Recognition (OCR) segmentation techniques to segment the target contract text after it has been converted into picture form, thereby obtaining multiple sub-texts, i.e., multiple segments of contract content with granularity smaller than sentence level. These segments of contract content are then spliced via preset symbols (such as commas and other intra-sentence punctuation), and the spliced content is cut based on preset cutting rules to obtain multiple sentence-level sub-texts. Then, based on the text segment length requirement (fewer than 256 characters), the sentence-level sub-texts are spliced in their order of appearance in the target contract text, so as to obtain multiple text segments of the target contract text that satisfy the length requirement.
This is briefly illustrated here. Fig. 3 shows a target contract text after information desensitization. In the present application, after the initial image of the target contract text shown in fig. 3 is input into OCR software, the OCR software, implemented in the programming language Python, may first convert it into a PDF-format image.
The OCR software then uses a deep learning-based image recognition technique to further convert the PDF-format image into the machine-recognizable base64 representation shown in fig. 4. In fig. 4, the base64 representation describes each scanned sub-text in terms of the text structure of rows ("line") and columns ("column").
Next, the OCR software merges the scanned sub-texts according to the position indications of "line" and "column" to obtain the contract content cntract_content of the target contract text shown in fig. 5. In fig. 5, the sub-texts recognized by the OCR software are separated by "#".
Then, the OCR software segments the contract content cntract_content in fig. 5 according to the "#" separators and sentence-level pause symbols (such as commas) to obtain multiple sentence-level sub-texts, which are then spliced in order.
During splicing, suppose the n-th and (n+1)-th sentence-level sub-texts are selected for splicing. If the combined number of characters of the n-th and (n+1)-th sentence-level sub-texts does not exceed 256, they are spliced into a new sentence-level sub-text; the new sentence-level sub-text is then spliced with the (n+2)-th sentence-level sub-text. If the combined number of characters of the new sentence-level sub-text and the (n+2)-th sentence-level sub-text exceeds 256, the new sentence-level sub-text is output as a text segment, and splicing restarts from the (n+2)-th sentence-level sub-text by selecting the (n+3)-th sentence-level sub-text to splice with it. It should be appreciated that, based on this splicing manner, each text segment ultimately ends with a sentence-level pause symbol; that is, the segmentation into text segments does not break the integrity of any sentence.
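The splicing loop described above can be sketched as follows (a minimal illustration; the function name and the 256-character budget constant are introduced here, and segments always end on a sentence boundary because only whole sentence-level sub-texts are appended):

```python
MAX_LEN = 256  # assumed per-segment character budget for the BERT encoder

def build_segments(sentences, max_len=MAX_LEN):
    """Greedily splice sentence-level sub-texts into text segments.

    Keep appending the next sentence while the combined length stays
    within max_len; otherwise emit the current segment and start a new
    one, so every segment ends on a sentence boundary.
    """
    segments, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            segments.append(current)
            current = sent
        else:
            current += sent
    if current:
        segments.append(current)
    return segments
```

A single sentence longer than the budget becomes its own segment, which matches the requirement that sentence integrity is never broken.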
After the text fragments text shown in fig. 6 are obtained, each can be input into the BERT model for semantic coding, so as to obtain the sub-coding sequence of each text fragment text.
It should be noted that the existing BERT model adds some flag bits with a calibration function to the generated coding sequence. The flag bits of the BERT model mainly comprise:
The [CLS] flag, used to mark the first position of a sentence. The [CLS] flag is a sentence-level semantic coding result obtained by the BERT model, and the [CLS] flag of a sentence is determined from the semantic coding results of all Tokens in the sentence. The [CLS] flag is therefore defined herein as the sentence-level semantic flag.
The [SEP] flag, used to separate two adjacent sentences.
The [MASK] flag, used to replace a Token in a sentence; a Token covered by the [MASK] flag loses its actual meaning.
Here, a text fragment containing sentences A and B is taken as an example. Leaving aside the special case in which a Token is masked by the [MASK] flag, after the text fragment is input into the BERT model, the corresponding sub-coding sequence is as shown in fig. 7, where the [CLS] flag is located at the head of the sentence, the [SEP] flag separates sentence A from sentence B, and E denotes the semantic coding result of one Token. The number of Tokens in a sentence equals the number of E entries for that sentence in the sub-coding sequence; for example, if sentence A has 30 Tokens, it has 30 corresponding E entries in the sub-coding sequence, in one-to-one correspondence with the 30 Tokens.
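The flag-bit layout just described can be sketched as follows (the function name is introduced here for illustration; a production BERT tokenizer typically also appends a trailing [SEP], which is omitted to match the description above):

```python
def bert_input_tokens(sentence_a, sentence_b):
    """Lay out the flag bits described above for a two-sentence fragment:
    [CLS] at the head of the fragment, [SEP] separating sentence A from
    sentence B; each remaining position is one Token (character)."""
    return ["[CLS]"] + list(sentence_a) + ["[SEP]"] + list(sentence_b)
```

After encoding, each non-flag position yields one E entry, so the one-to-one Token/E correspondence holds by construction.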
It should be understood that, after the sub-coding sequences of each text segment text are sequentially combined, the first coding sequence of the target contract text can be obtained.
S204, determining an absolute position coding result for each character in the text segment based on the contract position corresponding to that character, adding the absolute position coding result corresponding to each character into the first coding sequence of the text segment, and coding the first coding sequence with the absolute position coding results added based on an attention mechanism over the relative positions among the characters, so as to obtain a second coding sequence of the text segment.
As an exemplary introduction, the present application may use the "row" and "column" of a Token in the target contract text to represent the Token's corresponding contract position. That is, the absolute position coding result of a Token can be represented by the Token's row value and column value in the target contract text.
Assuming that the target contract text has M rows and N columns, referring to fig. 8, after the corresponding absolute position coding results are added, any Token in the first coding sequence changes from the original E to E_mn, where m denotes the m-th row and n the n-th column. It can be seen that, after the absolute position coding results are introduced into the first coding sequence, Tokens at different contract positions carry different coding values, which forms a distinction for the subsequent prediction.
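The source only states that each Token's absolute position coding result is determined from its row and column and added to its semantic code E to yield E_mn; the sinusoidal form below is an assumption borrowed from standard Transformer position encodings, and the function names are introduced here for illustration:

```python
import math

def position_code(row, col, dim=8):
    """Deterministic code for a token's contract position (row m, col n).
    Sinusoidal form is an assumption; the patent fixes only that the
    code is a function of (row, column)."""
    half = dim // 2
    code = []
    for i in range(half):
        freq = 1.0 / (10000 ** (2 * i / dim))
        code.append(math.sin(row * freq))  # row component
        code.append(math.cos(col * freq))  # column component
    return code

def add_absolute_positions(first_code_sequence, positions, dim=8):
    """Turn each semantic code E into E_mn by element-wise addition of
    its (row, column) position code."""
    return [
        [e + p for e, p in zip(vec, position_code(m, n, dim))]
        for vec, (m, n) in zip(first_code_sequence, positions)
    ]
```

Two Tokens with identical semantic codes but different (row, column) positions thus receive different E_mn values, which is exactly the distinction the prediction step relies on.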
It should be appreciated that, in practical applications, when there are multiple text segments in the target contract text, this embodiment encodes the first coding sequence of each of them. To simplify the coding of the first coding sequences, text segments containing contract elements may be selected from the multiple text segments as segments to be identified before coding, and only the first coding sequences of the segments to be identified are coded.
For example, this embodiment may pre-train a classifier that predicts, from the first coding sequence of a text fragment, whether the fragment contains a contract element. After the first coding sequences of the text fragments in the target contract text are obtained, each may be input into the classifier, which predicts whether the corresponding text fragment contains contract elements, where contract elements refer to the necessary information in a contract text, such as the Party A name, the Party B name, the Party A signature, or the Party B signature. Text fragments containing no contract elements can be regarded as unnecessary information and need not be considered when performing recognition on contract-like text. Then, according to the prediction results, a first identifier is added to the text fragments containing contract elements and a second identifier to those without. Each text fragment carrying the first identifier is treated as a to-be-identified fragment of the target contract text; the absolute position coding results of the characters of these fragments are determined, and the second coding sequences are then determined.
Here, taking the text segments shown in fig. 6 as an example, after determining whether contract elements exist in each text segment of fig. 6, the segments can be annotated through a "labels" field to obtain the annotation result shown in fig. 9. In fig. 9, "labels": [1] indicates that the segment contains contract elements, and "labels": [0] indicates that it contains none. Thereafter, the text segments annotated with "labels": [0] are filtered out, yielding the target contract text shown in fig. 10.
Thereafter, for text segments in which no contract element exists, all of their tokens can be marked with the [MASK] flag introduced above. As an exemplary introduction, referring to fig. 11, assume that the text segment corresponding to E_21 to E_2n contains no contract element; then the coding results E_21 to E_2n of all tokens of that segment in the first coding sequence are replaced by the [MASK] flag. Tokens carrying the [MASK] flag lose their actual meaning in the first coding sequence and are automatically ignored by the machine, which achieves the same effect as encoding only the first coding sequences of the text segments that contain contract elements.
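A minimal sketch of this masking step, assuming a hypothetical mask_invalid_segments helper and a Python sentinel standing in for the [MASK] flag:

```python
MASK = object()  # sentinel standing in for the [MASK] flag

def mask_invalid_segments(first_seq, labels):
    """Replace the encodings of every text segment whose label is 0 (no
    contract element) with the [MASK] flag, so later stages skip them.
    first_seq: list of segments, each a list of token encodings;
    labels: one 0/1 flag per segment, as in the "labels" annotation."""
    masked = []
    for segment, label in zip(first_seq, labels):
        if label == 1:
            masked.append(list(segment))
        else:
            masked.append([MASK] * len(segment))  # E_21..E_2n -> [MASK]..[MASK]
    return masked
```

Downstream stages can then test each token against the sentinel and ignore masked ones.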
Furthermore, for the first coding sequence, the present embodiment performs encoding with a self-attention mechanism, which has three parameters:

Q (query): content corresponding to the decoder, used to match other units;

K (key): content corresponding to the encoder, used to be matched by other units;

V (value): content corresponding to the encoder, from which information is to be extracted.
Self-attention mechanisms are typically implemented by an attention function, whose nature can be described as a mapping from a query matrix (Q) to a series of key (K)-value (V) pairs. Computing the Attention mainly involves three steps: first, a similarity calculation between the query and each key to obtain the weights, where common similarity functions include the dot product, concatenation, a perceptron and the like; second, normalizing the weights, typically with a softmax function; and third, taking the weighted sum of the weights and the corresponding values to obtain the final Attention.
In this application, the Attention thus obtained is the finally output second coding sequence.
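The three steps above can be sketched in a few lines of numpy; the scaled dot product is used here as the similarity function, and the function names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """The three Attention steps: (1) similarity of each query with each key
    (dot product here), (2) softmax normalisation of the weights,
    (3) weighted sum over the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # step 1: similarity -> raw weights
    weights = softmax(scores, axis=-1)  # step 2: normalise with softmax
    return weights @ V                  # step 3: weighted sum of values
```

Each output row is a convex combination of the rows of V, weighted by how well the corresponding query matches each key.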
Here, the attention mechanisms concerning the relative positions between characters may include:

an attention mechanism between the relative position of the first character to the second character and the relative position of the second character to the first character, for any two characters; an attention mechanism between the first character and the relative position of the first character to the second character; and an attention mechanism between the relative position of the second character to the first character and the second character.
As an exemplary introduction, denote the position code of the self-attention mechanism by P and the relative distance by the function δ(i, j), with the position code added to the input before projection: Q = (X + P)W_q and K = (X + P)W_k, where W_q is the projection matrix of Q, W_k is the projection matrix of K, i is the index of the first character, and j is the index of the second character.
Let X = (x_1, …, x_n) represent the input data, with X ∈ R^(n×d), where n is the number of Tokens and d is the hidden dimension of the BERT model. Correspondingly, the position codes are embedded with a matrix P ∈ R^(n×d) of the same shape. Then:
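The expansion of Att_{i,j} appears as an image in the original publication; under the definitions above, a reconstruction consistent with the four-term description in the surrounding text (offered as an assumption, not the patent's exact formula) is:

```latex
\mathrm{Att}_{i,j}
  = x_i W_q \,(x_j W_k)^{\top}             % character--character
  + P_{i|j} W_q \,(P_{j|i} W_k)^{\top}     % relative position--relative position
  + x_i W_q \,(P_{i|j} W_k)^{\top}         % character--relative position
  + P_{j|i} W_q \,(x_j W_k)^{\top}         % relative position--character
```

where P_{i|j} denotes the relative position code of character i with respect to character j, taken from P via δ(i, j).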
In the expansion of Att_{i,j} above, the first term, x_i W_q (x_j W_k)^T, is the attention mechanism between the characters themselves. The second to fourth terms are the attention mechanisms concerning the relative positions between the characters: the second term, P_{i|j} W_q (P_{j|i} W_k)^T, is the attention between the relative position of the first character to the second character and the relative position of the second character to the first character; the third term, x_i W_q (P_{i|j} W_k)^T, is the attention between the first character and the relative position of the first character to the second character; and the fourth term, P_{j|i} W_q (x_j W_k)^T, is the attention between the relative position of the second character to the first character and the second character itself.
Further, define Q, K and V through three randomly initialized matrices, corresponding to query, key and value respectively; the attention function then follows.

In the formula for A_{i,j}, Q, K and V denote the three randomly initialized matrices corresponding to query, key and value, and A_{i,j} denotes the correlation score between the i-th input vector and the j-th input vector; the greater the score, the higher the correlation.
It should be noted that the above is only an exemplary description of the attention mechanisms used to encode the first coding sequence; other attention mechanisms may be introduced to achieve different encoding effects. For example, the first coding sequence of a text segment may further be encoded based on the attention mechanisms concerning the relative positions between characters, the attention mechanism between characters, and the attention mechanism between each character and the sentence-level semantic tag [CLS], to obtain the second coding sequence of the text segment. It should be understood that the coding result of the [CLS] flag in the first coding sequence is a sentence-level semantic coding result; introducing the attention mechanism between each character and the sentence-level semantic tag [CLS] therefore allows the characters in a text segment to also be encoded according to the sentence-level semantics, thereby improving the coding performance.
In addition, to reduce the difficulty of computing the attention, the attention mechanisms concerning the relative positions between characters and the attention mechanism between characters can adopt a local attention mechanism based on a sliding window. That is, for any character, the attention operation range of the local attention mechanism is the several characters inside a fixed window containing that character.
Referring to fig. 12, assume that each sentence of the target text contains j Tokens and that the sliding window frames 3 Tokens. When the k-th Token of a sentence in the first coding sequence is encoded, the Tokens framed by the sliding window are the (k-1)-th, the k-th and the (k+1)-th Token; that is, when encoding the k-th Token, only its context relationship with the adjacent (k-1)-th and (k+1)-th Tokens is considered.
Similarly, when the (k+1)-th Token in the first coding sequence is encoded, the Tokens framed by the sliding window are the k-th, the (k+1)-th and the (k+2)-th Token; that is, when encoding the (k+1)-th Token, only its context relationship with the adjacent k-th and (k+2)-th Tokens is considered.
For the present application, the number of Tokens selected by the sliding window should be smaller than the number of Tokens in a single sentence, so that the context relationship originally considered between a Token and the whole sentence is reduced to the context relationship between the Token and its adjacent Tokens only, thereby reducing the amount of attention computation.
In addition, in this embodiment, the attention mechanism between each character and the sentence-level semantic tags may employ a global attention mechanism. That is, for any character, the attention operation range of the global attention mechanism is all the sentence-level semantic tags in the text segment where the character is located (or all the sentence-level semantic tags in all text segments), so that the first coding sequence is further encoded into the second coding sequence with additional reference to the character's semantics relative to the global context.
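Combining the two mechanisms, a hedged sketch of the resulting allowed-attention mask (Longformer-style; the token-type encoding and function name are assumptions for illustration):

```python
import numpy as np

def build_attention_mask(token_types, window=3):
    """Boolean mask over token pairs: True where attention is allowed.
    Ordinary characters use a local sliding window of `window` tokens;
    sentence-level [CLS] tags attend (and are attended to) globally.
    token_types: list with "CLS" for sentence-level tags, "TOK" otherwise."""
    n = len(token_types)
    half = window // 2
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if token_types[i] == "CLS" or token_types[j] == "CLS":
                mask[i, j] = True   # global attention via the [CLS] tag
            elif abs(i - j) <= half:
                mask[i, j] = True   # local window around the character
    return mask
```

With a window of 3, the k-th character attends only to the (k-1)-th, k-th and (k+1)-th characters plus the sentence-level tags, matching the description around fig. 12.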
S206, predicting the contract entity identifier of each character in the text segment based on the second coding sequence of the text segment, and extracting the entities corresponding to the contract elements of the text segment according to the contract entity identifiers of the characters.
It should be appreciated that entity identification is an established NLP technique. Typically, NLP defines entity identifiers as follows: the beginning of an entity is denoted by B; the remainder of an entity is denoted by I; and non-entity characters are denoted by O.

When each Token in a text is represented by its entity identifier, the text can be presented as an entity identification sequence.
Here, as an exemplary introduction, assume that the text is "The university Dane graduated from is Tsinghua University, which sits in Beijing, the capital of China". If person names are taken as the entities, the corresponding entity identification sequence is of the form: "O, O, B, I, I, O, …, O".
In this entity identification sequence, "B, I, I" identifies "Dane" as a person-name entity. It can be seen that the information of the desired entity can be extracted from the text by means of the entity identification sequence: for example, when extracting name information, the text corresponding to "B, I, I" is located to determine "Dane".
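A minimal decode of such a B/I/O sequence might look like the following (illustrative only; each letter of "Dane" is treated as one character here):

```python
def extract_entities(tokens, tags):
    """Recover entity spans from a B/I/O identification sequence:
    a 'B' opens an entity, following 'I' tokens extend it, 'O' closes it."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities
```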
For the purposes of this application, the contract entity identifier builds on this definition of entity identifiers and introduces a fine-grained classification for the contract scenario. For example, for the contract scenario, the entity identifier B is subdivided into: B-1 for a machine-printed first party name, B-2 for a hand-signed first party signature, and so on. It should be appreciated that the contract entity identifiers may be set according to specific needs, and this application does not specifically limit them here.
Specifically, the present application pre-trains a classifier that predicts entity identification sequences from coding sequences. After the second coding sequence corresponding to the target contract text is obtained, it may be input into the classifier, which predicts the contract entity identifier of each character in the text segment, that is, the contract entity identification sequence.
Then, the target contract entity identifier corresponding to the contract information extraction requirement is determined, and the contract information corresponding to the target contract entity identifier is extracted from the target contract text according to the contract entity identification sequence.
By way of exemplary introduction, assume that the beginning of a machine-printed first party name entity is identified as B-1, the beginning of a hand-signed first party signature entity is identified as B-2, and so on; that the contract information extraction requirement is to extract the information of the first party signature; and that the contract entity identification sequence of the target contract text is "O, O, B-1, I, O, B-2, I, I, O".

Correspondingly, the target contract entity identifier can be determined as B-2 according to the contract information extraction requirement. Then, in the sequence "O, O, B-1, I, O, B-2, I, I, O", the Token identified as B-2 and the adjacent Tokens identified as "I" after it are found, and the information corresponding to these three Tokens ("B-2, I, I") is extracted from the text, yielding the information of the first party signature.
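A hedged sketch of this requirement-driven extraction (the helper name and placeholder tokens are assumptions), finding the Token tagged with the target identifier and the adjacent "I" Tokens after it:

```python
def extract_by_target(tokens, tags, target):
    """Extract the contract information for one target contract entity
    identifier (e.g. "B-2" for a hand-signed first party signature):
    locate the Token tagged with the target and append the adjacent
    Tokens tagged "I" that follow it."""
    for idx, tag in enumerate(tags):
        if tag == target:
            span = [tokens[idx]]
            k = idx + 1
            while k < len(tags) and tags[k] == "I":
                span.append(tokens[k])
                k += 1
            return "".join(span)
    return None  # target identifier not present in the sequence
```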
For ease of understanding, taking the filtered target contract text shown in FIG. 10 as an example, if the contract entity in the target contract text of FIG. 10 is labeled with a contract entity identification, the result shown in FIG. 13 may be obtained. In fig. 13, "buyer name", "buyer signature", "seller name", "seller signature", and the like all belong to entities in the contract.
It should be noted that fig. 13 is only used to visually introduce the entity identifiers in the target contract text. Because the contract entity identification sequence is annotated at Token granularity, and the information corresponding to each Token in the target contract text is already determined, the machine only needs the contract entity identification sequence to accurately extract a specific contract entity's information from the target contract text; the annotation result shown in fig. 13 need not actually be generated.
In practical applications, the processing method of the embodiment of the present application may be implemented by a machine, and the following description describes a machine implementation manner.
Specifically, based on the above principle, the present application may train a contract entity extraction model, and the contract entity extraction model is responsible for predicting a contract entity identification sequence corresponding to the target contract text.
Corresponding to the method shown in fig. 1, the entity extraction model of the contract according to the embodiment of the present application is shown in fig. 14, and includes a sub-language model, an encoder, and a first classifier. Wherein:
the sub-language model is used for semantically encoding the text segments in the target contract text to obtain their first coding sequences. Optionally, the sub-language model of the present application may be a BERT model; the principles of BERT semantic coding are described above and are not repeated here.
The encoder is used for determining an absolute position coding result of each character in the text segment based on the contract position corresponding to each character in the text segment, adding the absolute position coding result corresponding to each character in the first coding sequence of the text segment, and coding the first coding sequence of the text segment based on the attention mechanism of relative positions among each character to obtain a second coding sequence of the text segment.
Referring specifically to fig. 15, the encoder mainly includes a valid Token embedding layer and an attention layer.
In the encoder, the valid Token embedding layer is used to eliminate, from the first coding sequence output by the BERT model, the invalid sentences marked by [MASK] flags, together with their [CLS] and [SEP] marks.
In FIG. 15, take [CLS], E_11, …, E_1N, [SEP], [CLS], [MASK], …, [MASK], [SEP], …, [CLS], E_M1, …, E_MN, [SEP] as an example of the first coding sequence. It can be seen that all Tokens of the second sentence have been replaced by the [MASK] flag, so that sentence is invalid. Correspondingly, the valid Token embedding layer can entirely remove the second sentence "[CLS], [MASK], …, [MASK], [SEP]" from the first coding sequence, yielding the filtered first coding sequence: [CLS], E_11, …, E_1N, [SEP], …, [CLS], E_M1, …, E_MN, [SEP].
In the encoder, the attention layer is configured to add the position coding results of the characters in the target contract text (i.e. the position matrix shown in fig. 15) to the filtered first coding sequence and re-encode it based on the self-attention mechanism introduced above, obtaining the second coding sequence corresponding to the target contract text: H_[CLS], H_11, …, H_1N, H_[SEP], …, H_[CLS], H_M1, …, H_MN, H_[SEP].
The first classifier is used for predicting contract entity identification of each character in the text segment based on the second coding sequence of the text segment.
Based on the contract entity extraction model shown in fig. 15, the embodiment of the application also provides a corresponding model training method. Fig. 16 is a schematic flow chart of the model training method, which specifically includes the following steps:
S1602, carrying out semantic coding on a sample text segment in a sample contract text based on a sub-language model to obtain a first sample coding sequence of the sample text segment, wherein the first sample coding sequence comprises semantic coding results of all characters in the sample text segment; the sample text segment contains contract elements and their corresponding contract entities.
Specifically, the sample contract text is converted into picture form, and optical character recognition segmentation is performed on the picture-form sample contract text to obtain multiple sections of contract content. The sections of contract content are then spliced through preset symbols, and the spliced content is segmented based on preset segmentation rules to obtain multiple sentence-level sub-texts. Then, based on the text segment length requirement, the sentence-level sub-texts are spliced in their order of appearance in the sample contract text, so as to obtain multiple text segments of the sample contract text that meet the length requirement.
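The splicing of sentence-level sub-texts into length-limited segments can be sketched as follows; the separator symbol and the greedy packing strategy are assumptions, since the patent leaves the preset symbols and rules open:

```python
def pack_sentences(sentences, max_len, sep="\u3002"):
    """Splice sentence-level sub-texts, in their original order, into text
    segments that respect a maximum segment length; a sentence longer than
    max_len becomes its own (over-long) segment."""
    segments, current = [], ""
    for sent in sentences:
        candidate = current + sep + sent if current else sent
        if current and len(candidate) > max_len:
            segments.append(current)   # current segment is full; start a new one
            current = sent
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```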
Correspondingly, the sub-language model in the contract entity extraction model can respectively carry out semantic coding on a plurality of text fragments of the sample contract text to obtain first coding sequences corresponding to the text fragments, and splice the first coding sequences corresponding to the text fragments according to the sequence of the text fragments in the sample contract text, so as to obtain the first coding sequences corresponding to the whole sample contract text.
S1604, based on the encoder, determining an absolute position encoding result of each character in the sample text segment according to the contract position corresponding to each character in the sample text segment, adding the absolute position encoding result corresponding to each character in the first sample encoding sequence, and encoding the first sample encoding sequence of the sample text segment based on the attention mechanism related to the relative position between each character, to obtain a second sample encoding sequence of the sample text segment.
S1606, based on the first classifier, the contract entity identification of each character in the sample text segment is predicted according to the second sample coding sequence.
S1608, training at least one of the sub-language model, the encoder and the first classifier based on the contract entity identifiers of the characters in the sample text fragments obtained through prediction and training labels corresponding to the sample text fragments, wherein the training labels are marked with the truth value contract entity identifiers corresponding to the characters in the sample text fragments.
In this embodiment, the training label is labeled with the true value contract entity identifier corresponding to each character in the sample text segment. It should be appreciated that the training labels with the truth contract entity identification as characters can be used to supervise training the contract entity extraction models (e.g., the sub-language model, the encoder, and the first classifier) via sample contract text.
The purpose of the supervised training is to make the contract entity identifiers of the characters predicted by the contract entity extraction model consistent with the truth-value contract entity identifiers annotated in the training label. Therefore, during training, after the first classifier outputs the contract entity identification sequence of the whole sample contract text, a loss function can be determined from the difference between the predicted contract entity identification sequence and the truth-value contract entity identification sequence, and the parameters of the sub-language model, the encoder and the first classifier can be adjusted with the loss function along the gradient direction that reduces this difference.
It should be understood that, by performing the supervised training of multiple iterations in the above manner, the contract entity identification sequence of the sample contract text predicted by the contract entity extraction model can be gradually converged on the truth value contract entity identification sequence marked by the training label.
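The loss described above might be sketched as a token-level cross-entropy between the predicted identifier distribution and the truth-value identifiers; the function name and the absence of any CRF layer are assumptions, not the patent's stated design:

```python
import numpy as np

def token_tag_loss(logits, truth_ids):
    """Cross-entropy between the classifier's predicted contract entity
    identification sequence and the truth-value identifiers from the
    training label. The gradient of this loss is what adjusts the
    sub-language model, encoder and first classifier.
    logits: (seq_len, num_tags); truth_ids: (seq_len,) integer tag ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(truth_ids)), truth_ids].mean()
```

As the predicted sequence converges on the truth-value sequence, this loss tends toward zero.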
Further, if the present application requires the contract entity extraction model to mark the Tokens of segments without contract elements as [MASK] in the first coding sequence, then, as shown in fig. 17, the contract entity extraction model further includes:
and the second classifier is used for predicting whether the text fragments contain contract elements according to the first coding sequences corresponding to the text fragments in the target contract text before the encoder determines the position coding result of the characters.
and the annotator, used for annotating the first coding sequence corresponding to any text segment of the target contract text that contains no contract elements as an invalid first coding sequence.
In this embodiment, referring to fig. 15, the valid Token embedding layer in the encoder may directly ignore the invalid first coding sequences in its input.
In addition, based on the contract entity extraction model shown in fig. 17, in S1608, the second classifier may be trained according to the predicted contract entity identifier of the character and the training label of the character. Since the training principle is described in the foregoing, it is not repeated here.
In summary, based on the contract entity extraction model of the present application, the machine only needs to convert the target contract text into several text segments of the type shown in fig. 6 through OCR software, and then import the text segments into the contract entity extraction model to obtain the contract entity identification sequence corresponding to the target contract text. In a specific application, as long as the machine is provided with the data of the target contract (such as image-format or text-format data) and the target contract entity identifier corresponding to the contract information extraction requirement, it can automatically extract the contract information corresponding to the target contract entity identifier from the target contract text.
Corresponding to the method shown in fig. 2, the embodiment of the application also provides a device for processing the contract text. Fig. 18 is a schematic structural diagram of a processing apparatus 1800 for contract text according to an embodiment of the application, including:
the coding unit 1810 is configured to perform semantic coding on a text segment in a target contract text to obtain a first coding sequence of the text segment, where the first coding sequence of the text segment includes semantic coding results of each character in the text segment; the text segment includes contract elements and corresponding contract entities.
The encoding unit 1810 is configured to determine an absolute position encoding result of each character in the text segment based on a contract position corresponding to each character in the text segment, add the absolute position encoding result corresponding to each character in a first encoding sequence of the text segment, and encode the first encoding sequence of the text segment based on a mechanism of attention about relative positions between each character, so as to obtain a second encoding sequence of the text segment.
And an extraction unit 1820, configured to predict, based on the second coding sequence of the text segment, the contract entity identifier of each character in the text segment, and extract, according to the contract entity identifiers of the characters, the entities corresponding to the contract elements of the text segment.
Optionally, the attention mechanisms concerning the relative positions between the characters include: an attention mechanism between the relative position of a first character to a second character and the relative position of the second character to the first character, for any two characters; an attention mechanism between the first character and the relative position of the first character to the second character; and an attention mechanism between the relative position of the second character to the first character and the second character.
Optionally, the text segment is composed of multiple sentence-level texts, the first coding sequence further includes the coding results of the sentence-level semantic tags corresponding to the sentence-level texts, and the coding result of each sentence-level semantic tag is determined based on the semantic coding results of all characters in that sentence. The encoding unit 1810 encoding the first coding sequence of the text segment based on the attention mechanisms concerning the relative positions between the characters to obtain the second coding sequence of the text segment includes: encoding the first coding sequence of the text segment based on the attention mechanisms concerning the relative positions between the characters, the attention mechanism between the characters, and the attention mechanism between each character and the sentence-level semantic tags, to obtain the second coding sequence of the text segment.
Optionally, the attention mechanism of the relative positions among the characters and the attention mechanism among the characters are local attention mechanisms, and the attention operation range of the local attention mechanisms for any character is a plurality of characters in a fixed window containing the any character; the attention mechanism of each character and sentence-level semantic tags is a global attention mechanism, and the attention operation range of the global attention mechanism aiming at any character is all sentence-level semantic tags in the text segment.
Optionally, the text segment is any one of a plurality of text segments to be identified included in the target contract text; the device of the embodiment further comprises:
the labeling unit is configured to convert the target contract text into a picture form before the encoding unit 1810 determines an absolute position encoding result of each character in the text segment based on the contract position corresponding to each character in the text segment, and perform optical character recognition segmentation on the target contract text in the picture form to obtain a plurality of segments of contract contents; splicing the multiple sections of contract contents through preset symbols, and splitting the spliced multiple sections of contract contents based on preset splitting rules to obtain multiple sentence-level sub-texts; based on the text segment length requirement, splicing the sentence-level sub-texts according to the sequence of the sentence-level sub-texts in the target contract text to obtain a plurality of text segments meeting the text segment length requirement in the target contract text; and screening the text fragments containing the contract elements from the text fragments to be identified.
Optionally, the labeling unit screening out, from the multiple text segments, the text segments containing contract elements as the segments to be identified includes: predicting, based on the first coding sequence of each text segment, whether a contract element exists in that segment; adding a first identifier to the text segments in which contract elements exist and a second identifier to the text segments in which none exist; and taking each text segment carrying the first identifier as a to-be-identified segment of the target contract text.
Optionally, the method for processing the contract text is completed based on a contract entity extraction model, wherein the contract entity extraction model comprises a sub-language model, an encoder and a first classifier.
Wherein said semantically encoding text segments in the target contract text is performed by the sub-language model; determining an absolute position coding result of each character in the text segment based on the contract position corresponding to each character in the text segment, adding the absolute position coding result corresponding to each character in a first coding sequence of the text segment, and coding the first coding sequence of the text segment based on an attention mechanism of relative positions among each character to obtain a second coding sequence of the text segment, wherein the second coding sequence is executed by the encoder; the predicting the contract entity identification for each character in the text segment based on the second coding sequence of the text segment is performed by the first classifier.
Optionally, the entity extraction model of the contract further comprises a second classifier and a annotator.
Wherein said predicting whether a contract element exists for said each text segment based on said first coding sequence of said each text segment is performed by said second classifier; the adding the first identifier to the text snippet in which the contract element exists and the adding the second identifier to the text snippet in which the contract element does not exist are performed by the annotator.
According to another embodiment of the present application, each unit in the processing apparatus for contract text shown in fig. 18 may be separately or completely combined into one or several other units, or some unit(s) thereof may be further split into a plurality of units with smaller functions, which may achieve the same operation without affecting the implementation of the technical effects of the embodiments of the present application. The above units are divided based on logic functions, and in practical applications, the functions of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the contract text processing apparatus may also include other units, and in practical applications, these functions may also be implemented with assistance by other units, and may be implemented by cooperation of multiple units.
It should be understood that the processing device of the contract text shown in fig. 18 of the present application may be used as an execution subject of the method shown in fig. 2, and thus can implement the steps and functions in the method shown in fig. 2.
Fig. 19 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 19, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The Memory may include a Memory, such as a Random-Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least 1 disk Memory. Of course, the electronic device may also include hardware required for other services.
The processor, network interface, and memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture ) bus, a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus, or EISA (Extended Industry Standard Architecture ) bus, among others. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 19, but not only one bus or one type of bus.
And a memory for storing a computer program. In particular, the computer program may comprise program code comprising computer operating instructions. The memory may include memory and non-volatile memory and provide the processor with a computer program.
Alternatively, the processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the same, and the processing device for contract text shown in fig. 18 is formed on a logic level. Correspondingly, the processor executes the program stored in the memory and is specifically configured to perform the following operations:
carrying out semantic encoding on a text segment in a target contract text to obtain a first coding sequence of the text segment, wherein the first coding sequence comprises the semantic encoding result of each character in the text segment; the text segment includes contract elements and their corresponding contract entities.
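As an illustration of this first step, the sketch below maps each character of a segment to a fixed-size vector so that the segment becomes a "first coding sequence" of per-character encoding results. The hash-like embedding is purely a stand-in: the patent's actual semantic encoding would come from a trained sub-language model, which is not reproduced here.

```python
import math

DIM = 8  # toy embedding width; the real model's hidden size is unspecified here

def char_embedding(ch: str) -> list[float]:
    # Deterministic pseudo-embedding derived from the character's code point,
    # standing in for a language model's semantic encoding of the character.
    seed = ord(ch)
    return [math.sin(seed * (k + 1)) for k in range(DIM)]

def semantic_encode(segment: str) -> list[list[float]]:
    # One semantic encoding result per character -> the "first coding sequence".
    return [char_embedding(ch) for ch in segment]

first_coding_sequence = semantic_encode("Party A: Acme Ltd.")
print(len(first_coding_sequence))  # one entry per character
```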
determining an absolute position encoding result for each character in the text segment based on the contract position corresponding to that character; adding the corresponding absolute position encoding result to each character's entry in the first coding sequence of the text segment; and encoding the first coding sequence to which the absolute position encoding results have been added, based on an attention mechanism over the relative positions between characters, to obtain a second coding sequence of the text segment.
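A minimal sketch of this second step, under assumptions the patent does not fix: sinusoidal absolute position encodings (indexed by each character's offset in the full contract) are added to the first coding sequence, and a single self-attention pass whose score carries a relative-distance bias produces the second coding sequence. The fixed decay bias stands in for the model's learned relative-position attention.

```python
import math

DIM = 8

def sinusoidal_pe(pos: int) -> list[float]:
    # Standard sinusoidal absolute position encoding; `pos` is the character's
    # "contract position" (its offset within the whole contract text).
    return [math.sin(pos / 10000 ** (k / DIM)) if k % 2 == 0
            else math.cos(pos / 10000 ** ((k - 1) / DIM))
            for k in range(DIM)]

def add_absolute_pe(first_seq, contract_positions):
    # Element-wise addition of each character's absolute position encoding.
    return [[x + p for x, p in zip(vec, sinusoidal_pe(pos))]
            for vec, pos in zip(first_seq, contract_positions)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def relative_attention(seq, max_rel=4):
    # Self-attention whose score includes a relative-position term b(i - j);
    # here b is a fixed decay over the clipped distance (illustrative only).
    n = len(seq)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            rel = max(-max_rel, min(max_rel, i - j))
            scores.append(dot(seq[i], seq[j]) / math.sqrt(DIM) - 0.1 * abs(rel))
        w = softmax(scores)
        out.append([sum(w[j] * seq[j][k] for j in range(n)) for k in range(DIM)])
    return out

first_seq = [[0.1 * (i + k) for k in range(DIM)] for i in range(5)]
with_pe = add_absolute_pe(first_seq, contract_positions=[100, 101, 102, 103, 104])
second_coding_sequence = relative_attention(with_pe)
```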
predicting a contract entity identifier for each character in the text segment based on the second coding sequence of the text segment, and extracting the entity corresponding to the contract element of the text segment according to the contract entity identifiers of the characters.
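For the final step, the sketch below decodes per-character contract-entity identifiers into entity strings. BIO-style tags are an assumption for illustration: the patent only states that each character receives a contract entity identifier, without fixing the tagging scheme.

```python
# Decode per-character entity tags (assumed BIO scheme) into entity strings.
def extract_entities(chars, tags):
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(ch)
        else:                              # outside any entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

chars = list("Lender: Acme Bank")
tags = (["O"] * 8) + ["B-lender"] + ["I-lender"] * 8
print(extract_entities(chars, tags))  # ['Acme Bank']
```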
The contract text processing method or the model training method disclosed in the embodiments shown in this specification may be applied to, and implemented by, a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
Of course, in addition to the software implementation, the electronic device in this specification does not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the above processing flow is not limited to individual logic units but may also be hardware or a logic device.
Furthermore, embodiments of the present application also provide a computer-readable storage medium storing one or more computer programs, the one or more computer programs comprising instructions.
Optionally, when the instructions are executed by a portable electronic device comprising a plurality of applications, they enable the portable electronic device to perform the steps of the method shown in fig. 2, including:
carrying out semantic encoding on a text segment in a target contract text to obtain a first coding sequence of the text segment, wherein the first coding sequence comprises the semantic encoding result of each character in the text segment; the text segment includes contract elements and their corresponding contract entities.
determining an absolute position encoding result for each character in the text segment based on the contract position corresponding to that character; adding the corresponding absolute position encoding result to each character's entry in the first coding sequence of the text segment; and encoding the first coding sequence to which the absolute position encoding results have been added, based on an attention mechanism over the relative positions between characters, to obtain a second coding sequence of the text segment.
predicting a contract entity identifier for each character in the text segment based on the second coding sequence of the text segment, and extracting the entity corresponding to the contract element of the text segment according to the contract entity identifiers of the characters.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing is merely an example of the present specification and is not intended to limit the present specification. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description. Moreover, all other embodiments obtained by those skilled in the art without making any inventive effort shall fall within the scope of protection of this document.

Claims (11)

1. A method for processing contract text, comprising:
carrying out semantic encoding on a text segment in a target contract text to obtain a first coding sequence of the text segment, wherein the first coding sequence comprises the semantic encoding result of each character in the text segment; the text segment comprises contract elements and corresponding entities;
determining an absolute position encoding result for each character in the text segment based on the contract position corresponding to that character, adding the corresponding absolute position encoding result to each character's entry in the first coding sequence of the text segment, and encoding the first coding sequence to which the absolute position encoding results have been added, based on an attention mechanism over the relative positions between characters, to obtain a second coding sequence of the text segment; and
predicting a contract entity identifier for each character in the text segment based on the second coding sequence of the text segment, and extracting the entity corresponding to the contract element of the text segment according to the contract entity identifiers of the characters.
2. The method of claim 1, wherein the attention mechanism over the relative positions between characters comprises:
an attention mechanism between the relative position of a first character with respect to a second character and the relative position of the second character with respect to the first character, for any two characters; an attention mechanism between the first character and the relative position of the first character with respect to the second character; and an attention mechanism between the relative position of the second character with respect to the first character and the second character.
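Outside the claim language itself, the three terms in claim 2 resemble the position-position, content-position, and position-content interactions of disentangled relative-position attention. A toy sketch of such a score, with all vectors illustrative values rather than the patented model's trained parameters:

```python
import math

DIM, MAX_REL = 4, 3  # toy dimensions; real values are unspecified

def rel_embed(d: int) -> list[float]:
    # Embedding of the clipped relative distance between two characters.
    d = max(-MAX_REL, min(MAX_REL, d))
    return [math.cos(d * (k + 1)) for k in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_score(c_i, c_j, i, j):
    # c_i, c_j: content vectors of the first and second characters.
    r_ij, r_ji = rel_embed(i - j), rel_embed(j - i)
    return (dot(r_ij, r_ji)    # relative position <-> relative position
            + dot(c_i, r_ij)   # first character <-> its relative position
            + dot(r_ji, c_j))  # relative position <-> second character

c = [[0.1 * (i + k) for k in range(DIM)] for i in range(3)]
score = attention_score(c[0], c[2], 0, 2)
```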
3. The method of claim 1, wherein:
the text segment consists of a plurality of sentence-level texts; the first coding sequence further comprises encoding results of the sentence-level semantic tags corresponding to the sentence-level texts, and the encoding result of each sentence-level semantic tag is determined based on the semantic encoding results of the characters in the corresponding sentence; and
the encoding of the first coding sequence of the text segment based on the attention mechanism over the relative positions between characters, to obtain the second coding sequence of the text segment, comprises:
encoding the first coding sequence of the text segment based on the attention mechanism over the relative positions between characters, the attention mechanism between characters, and the attention mechanism between each character and the sentence-level semantic tags, to obtain the second coding sequence of the text segment.
4. The method of claim 3, wherein:
the attention mechanism over the relative positions between characters and the attention mechanism between characters are local attention mechanisms; the attention operation range of the local attention mechanism for any character is the plurality of characters within a fixed window containing that character; the attention mechanism between each character and the sentence-level semantic tags is a global attention mechanism; and the attention operation range of the global attention mechanism for any character is all sentence-level semantic tags in the text segment in which that character is located.
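The mixed local/global pattern in claim 4 can be pictured as an attention mask, in the style of sparse-attention encoders: ordinary characters attend within a fixed window, while sentence-level semantic tag positions attend, and are attended to, globally. The window size and the tag layout below are illustrative assumptions, not values fixed by the claim.

```python
# mask[i][j] == True means position i may attend to position j.
def attention_mask(n: int, tag_positions: set[int], window: int = 2):
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window            # fixed-window local attention
            global_ = i in tag_positions or j in tag_positions  # tags are global
            mask[i][j] = local or global_
    return mask

mask = attention_mask(n=8, tag_positions={0})  # position 0: a sentence-level tag
print(mask[7][0], mask[7][3])  # tag is reachable globally; a far character is not
```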
5. The method of claim 1, wherein the text segment is any one of a plurality of text segments to be identified included in the target contract text; before determining the absolute position encoding result of each character in the text segment based on the contract position corresponding to each character in the text segment, the method further comprises:
converting the target contract text into picture form, and performing optical character recognition and segmentation on the target contract text in picture form to obtain a plurality of sections of contract content;
splicing the plurality of sections of contract content with preset symbols, and cutting the spliced contract content based on a preset cutting rule to obtain a plurality of sentence-level sub-texts;
splicing the sentence-level sub-texts, in the order in which they appear in the target contract text and according to a text segment length requirement, to obtain a plurality of text segments of the target contract text that satisfy the text segment length requirement; and
screening, from the plurality of text segments, the text segments containing contract elements as the text segments to be identified.
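The pre-processing chain of claim 5 (minus the OCR step itself) can be sketched as follows. The joining symbol, cutting rule (sentence-ending punctuation), and length limit below are illustrative assumptions; `sections` stands in for the output of optical character recognition.

```python
import re

def build_segments(sections, max_len=30, joiner="|"):
    # 1) Splice OCR'd sections with a preset symbol.
    joined = joiner.join(sections)
    # 2) Cut into sentence-level sub-texts at sentence punctuation or the symbol.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])|\|", joined) if s]
    # 3) Splice sub-texts, in order, into segments within the length requirement.
    segments, cur = [], ""
    for sent in sentences:
        if cur and len(cur) + len(sent) > max_len:
            segments.append(cur)
            cur = ""
        cur += sent
    if cur:
        segments.append(cur)
    return segments

sections = ["Article 1. The lender is Acme Bank.", "Article 2. Term: 12 months."]
segments = build_segments(sections)
```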
6. The method of claim 5, wherein the screening, from the plurality of text segments, of the text segments containing contract elements as the text segments to be identified comprises:
predicting, for each text segment, whether a contract element exists in the text segment based on its first coding sequence; and
adding a first identifier to the text segments in which a contract element exists, adding a second identifier to the text segments in which no contract element exists, and taking each text segment carrying the first identifier as a text segment to be identified of the target contract text.
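The gating step of claim 6 can be sketched as below, where a simple keyword test stands in for the trained second classifier: each segment is marked with identifier 1 or 2, and only segments marked 1 go on to entity extraction. The cue list and identifiers are illustrative assumptions.

```python
# Stand-in for the second classifier: does the segment mention a contract element?
ELEMENT_CUES = ("lender", "borrower", "amount", "term")

def has_contract_element(segment: str) -> bool:
    return any(cue in segment.lower() for cue in ELEMENT_CUES)

def filter_segments(segments):
    # Mark 1 (element present) or 2 (absent); keep only the marked-1 segments.
    labeled = [(seg, 1 if has_contract_element(seg) else 2) for seg in segments]
    return [seg for seg, mark in labeled if mark == 1]

to_identify = filter_segments(["The lender is Acme Bank.", "Signed in duplicate."])
print(to_identify)  # only the segment with a contract element survives
```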
7. The method of claim 6, wherein:
the contract text processing method is performed based on a contract entity extraction model, and the contract entity extraction model comprises a sub-language model, an encoder, and a first classifier;
wherein the semantic encoding of the text segments in the target contract text is performed by the sub-language model; the determining of the absolute position encoding result of each character in the text segment based on the contract position corresponding to that character, the adding of the corresponding absolute position encoding result to each character's entry in the first coding sequence of the text segment, and the encoding of the first coding sequence to which the absolute position encoding results have been added based on the attention mechanism over the relative positions between characters, to obtain the second coding sequence of the text segment, are performed by the encoder; and the predicting of the contract entity identifier for each character in the text segment based on the second coding sequence of the text segment is performed by the first classifier.
8. The method of claim 7, wherein:
the contract entity extraction model further comprises a second classifier and an annotator;
wherein the predicting of whether a contract element exists in each text segment based on the first coding sequence of that text segment is performed by the second classifier; and the adding of the first identifier to the text segments in which a contract element exists and the adding of the second identifier to the text segments in which no contract element exists are performed by the annotator.
9. A contract text processing apparatus, comprising:
an encoding unit, configured to carry out semantic encoding on a text segment in a target contract text to obtain a first coding sequence of the text segment, wherein the first coding sequence comprises the semantic encoding result of each character in the text segment, and the text segment comprises contract elements and corresponding entities;
the encoding unit being further configured to determine an absolute position encoding result for each character in the text segment based on the contract position corresponding to that character, add the corresponding absolute position encoding result to each character's entry in the first coding sequence of the text segment, and encode the first coding sequence to which the absolute position encoding results have been added, based on an attention mechanism over the relative positions between characters, to obtain a second coding sequence of the text segment; and
an extraction unit, configured to predict a contract entity identifier for each character in the text segment based on the second coding sequence of the text segment, and to extract the entity corresponding to the contract element of the text segment according to the contract entity identifiers of the characters.
10. An electronic device, comprising:
a memory for storing one or more computer programs;
a processor for loading the one or more computer programs to perform the method of any of claims 1-8.
11. A computer readable storage medium having one or more computer programs stored thereon, which when executed by a processor, implement the method of any of claims 1-8.
CN202310641191.9A 2023-06-01 2023-06-01 Contract text processing method and device, electronic equipment and storage medium Pending CN117494719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310641191.9A CN117494719A (en) 2023-06-01 2023-06-01 Contract text processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310641191.9A CN117494719A (en) 2023-06-01 2023-06-01 Contract text processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117494719A 2024-02-02

Family

ID=89667821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310641191.9A Pending CN117494719A (en) 2023-06-01 2023-06-01 Contract text processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117494719A (en)

Similar Documents

Publication Publication Date Title
US20180267956A1 (en) Identification of reading order text segments with a probabilistic language model
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
EP3029607A1 (en) Method for text recognition and computer program product
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN110866402A (en) Named entity identification method and device, storage medium and electronic equipment
CN114298035A (en) Text recognition desensitization method and system thereof
CN114818891A (en) Small sample multi-label text classification model training method and text classification method
CN116070632A (en) Informal text entity tag identification method and device
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN115983271A (en) Named entity recognition method and named entity recognition model training method
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115862040A (en) Text error correction method and device, computer equipment and readable storage medium
CN115099233A (en) Semantic analysis model construction method and device, electronic equipment and storage medium
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
US20120197894A1 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN115906855A (en) Word information fused Chinese address named entity recognition method and device
CN114298032A (en) Text punctuation detection method, computer device and storage medium
CN114842982A (en) Knowledge expression method, device and system for medical information system
CN117494719A (en) Contract text processing method and device, electronic equipment and storage medium
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN115577680B (en) Ancient book text sentence-breaking method and device and ancient book text sentence-breaking model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination