CN110705225A - Contract marking method and device - Google Patents

Publication number
CN110705225A
Authority
CN
China
Prior art keywords
contract
sample
model
labeling
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910752445.8A
Other languages
Chinese (zh)
Inventor
郭于丹
肖丰阳
陈卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Trust Co Ltd
Original Assignee
Ping An Trust Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Trust Co Ltd filed Critical Ping An Trust Co Ltd
Priority to CN201910752445.8A priority Critical patent/CN110705225A/en
Publication of CN110705225A publication Critical patent/CN110705225A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a contract marking method and a contract marking device in the technical field of artificial intelligence. The method comprises the following steps: extracting at least one contract sample from the contracts of each service type to obtain an initial sample set; constructing and training an initial labeling model based on the contract samples of each service type; acquiring a plurality of pre-stored contracts of each service type and dividing them into a sample expansion set and a test set; labeling the contract elements in the contracts of the sample expansion set with the initial labeling model; merging the labeled sample expansion set and the initial sample set into a training sample set, and further training the initial labeling model on the training sample set to obtain a labeling model; inputting the test set into the labeling model and acquiring the labeling results that the labeling model outputs for the contracts in the test set; and judging from the labeling results of the test set whether the labeling model needs further optimization. The technical scheme provided by the embodiments of the invention addresses the low accuracy of contract element labeling in the prior art.

Description

Contract marking method and device
[ technical field ]
The invention relates to the technical field of artificial intelligence, in particular to a contract marking method and device.
[ background of the invention ]
At present, enterprises handle more and more contracts. Reviewing a contract mainly means checking whether its contract elements are filled in correctly, but searching for the contract elements manually often consumes a large amount of labor. The distribution of contract elements is also complex, and the reviewer needs a clear and accurate view of the association relations among them. Improving the labeling accuracy of contract elements, so that they can be obtained quickly during contract review, has therefore become a problem to be solved urgently.
[ summary of the invention ]
In view of this, embodiments of the present invention provide a contract annotation method and apparatus, so as to solve the problem in the prior art that the annotation accuracy of contract elements is low.
In order to achieve the above object, according to one aspect of the present invention, there is provided a contract annotation method, including:
extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein the contract sample comprises a plurality of manually labeled contract elements; constructing and training an initial labeling model based on the contract sample of each service type; acquiring a plurality of pre-stored contracts of each service type, and dividing the contracts into a sample expansion set and a test set according to a preset proportion; labeling contract elements in the contracts in the sample expansion set by using the initial labeling model; merging the labeled sample expansion set and the initial sample set into a training sample set, and further training the initial labeling model by using the training sample set to obtain a labeling model; inputting the test set into the labeling model, and acquiring a labeling result of the contracts in the test set output by the labeling model; and judging whether the labeling model needs to be further optimized according to the labeling result of the test set, until the labeling accuracy of the labeling model is greater than a preset value.
Further, the constructing and training an initial labeling model based on the contract sample of each service type includes: constructing an initial labeling model, wherein the initial labeling model is a long short-term memory (LSTM) neural network model; inputting the initial sample set into the initial labeling model, wherein the LSTM neural network learns a vector sequence of the manually labeled contract elements and a category vector of the labels associated with the contract elements in each contract sample; and training the initial labeling model through an error minimization strategy.
Further, the contract elements comprise a first-level element, a second-level element and a third-level element which are distributed hierarchically, the labels also present a hierarchical relationship, and the labels comprise a first-level label, a second-level label under the first-level label and a third-level label under the second-level label; the constructing and training of the initial labeling model based on the contract sample of each service type comprises the following steps: constructing an initial labeling model; inputting the initial sample set into the initial labeling model, wherein the deep convolutional neural network extracts a vector sequence of a primary element associated with the primary label, a vector sequence of a secondary element associated with the secondary label, and a vector sequence of a tertiary element associated with the tertiary label in each contract sample; and training the initial labeling model through a strategy of error minimization based on the vector sequence of the primary elements, the vector sequence of the secondary elements, the vector sequence of the tertiary elements and the type vector of the label.
Further, the labeling contract elements in the contracts in the sample expansion set by using the initial labeling model comprises: the initial labeling model labels the contracts in the sample expansion set according to the first-level labels to obtain first-level elements; acquiring at least one secondary label according to the primary label, and labeling the contract according to the secondary label to obtain at least one secondary element associated with the primary element; and acquiring at least one third-level label according to the second-level label, and labeling the contract according to the third-level label to obtain at least one third-level element associated with the second-level element.
Further, the determining whether the annotation model needs to be continuously optimized according to the annotation result of the test set until the annotation accuracy of the annotation model is greater than a preset value includes: comparing the manual labeling result of the contract in the test set with the labeling result output by the labeling model to obtain the labeling accuracy of the contract of each service type of the labeling model; judging whether the marking accuracy of each service type is greater than the preset value; eliminating the service types with the marking accuracy rate larger than the preset value from the plurality of service types to obtain target service types needing to be continuously optimized; modifying the labeling result of the contract of the target service type based on the modification instruction of the user; and optimally training the labeling model by using the revised contract of the target service type until the labeling accuracy of the labeling model is greater than a preset value.
Further, before labeling the contract elements in the contracts in the sample expansion set with the initial labeling model, the method further comprises: screening out the contract samples to be processed in the sample expansion set, wherein the format of a contract sample to be processed is an image file; finding the inclination angle of each contract sample to be processed by a Hough transform method, and performing rotation correction on the contract sample to be processed by bilinear interpolation based on the inclination angle; performing text recognition on the rotation-corrected contract sample to obtain a contract text; and replacing the contract sample to be processed in the sample expansion set with the contract text.
In order to achieve the above object, according to one aspect of the present invention, there is provided a contract annotation apparatus, comprising: an extraction unit, used for extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein the contract sample comprises a plurality of manually labeled contract elements; a construction unit, used for constructing and training an initial labeling model based on the contract sample of each service type; an acquisition unit, used for acquiring a plurality of pre-stored contracts of each service type and dividing the contracts into a sample expansion set and a test set according to a preset proportion; a labeling unit, used for labeling contract elements in the contracts in the sample expansion set by using the initial labeling model; an optimization training unit, used for merging the labeled sample expansion set and the initial sample set into a training sample set, and further training the initial labeling model by using the training sample set to obtain a labeling model; an input unit, used for inputting the test set into the labeling model and acquiring a labeling result of the contracts in the test set output by the labeling model; and a judging unit, used for judging whether the labeling model needs to be further optimized according to the labeling result of the test set, until the labeling accuracy of the labeling model is greater than a preset value.
Further, the construction unit includes: a building subunit, used for building an initial labeling model, wherein the initial labeling model is a long short-term memory (LSTM) neural network model; an input subunit, configured to input the initial sample set to the initial labeling model, where the LSTM neural network learns a vector sequence of the manually labeled contract elements in each contract sample and a category vector of the labels associated with the contract elements; and a training subunit, used for training the initial labeling model through an error minimization strategy.
In order to achieve the above object, according to one aspect of the present invention, there is provided a computer nonvolatile storage medium including a stored program that, when executed, controls an apparatus in which the storage medium is located to execute the above contract annotation method.
To achieve the above object, according to one aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the contract annotation method described above when executing the computer program.
In the scheme, at least one contract sample is extracted from the contract of each service type, an initial labeling model is established by using the contract sample, then a training set is expanded by using more historical contracts of each service type, the initial labeling model is trained by using the expanded training set, and the labeling capacity of the model is improved. And then, testing the marking accuracy of the model by using the test set, judging whether the marking model needs to be continuously optimized according to the marking result of the test set until the marking accuracy of the marking model is greater than a preset value, and continuously optimizing the model so as to improve the efficiency and accuracy of marking the contract elements.
[ description of the drawings ]
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a flow chart of an alternative contract annotation method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative contract annotation apparatus provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative computer device provided by an embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe the terminals in the embodiments of the present invention, the terminals should not be limited by these terms. These terms are only used to distinguish one terminal from another. For example, a first terminal may also be referred to as a second terminal, and similarly, a second terminal may also be referred to as a first terminal, without departing from the scope of embodiments of the present invention.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
FIG. 1 is a flowchart of a contract annotation method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step S101: extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein the contract sample comprises a plurality of manually labeled contract elements.
Step S102: constructing and training an initial labeling model based on the contract sample of each service type.
Step S103: acquiring a plurality of pre-stored contracts of each service type, and dividing the contracts into a sample expansion set and a test set according to a preset proportion.
Step S104: labeling contract elements in the contracts in the sample expansion set by using the initial labeling model.
Step S105: merging the labeled sample expansion set and the initial sample set into a training sample set, and further training the initial labeling model by using the training sample set to obtain a labeling model.
Step S106: inputting the test set into the labeling model, and acquiring the labeling result of the contracts in the test set output by the labeling model.
Step S107: judging whether the labeling model needs to be further optimized according to the labeling result of the test set, until the labeling accuracy of the labeling model is greater than a preset value.
It is understood that the service types include house sale contracts, house lease contracts, loan contracts, borrowing contracts, and the like, and the contract elements may be party information (name, residence, contact details, and the like), contract term, contract price, performance period, liability for breach, and the like.
In the scheme, at least one contract sample is extracted from the contract of each service type, an initial labeling model is established by using the contract sample, then a training set is expanded by using more historical contracts of each service type, the initial labeling model is trained by using the expanded training set, and the labeling capacity of the model is improved. And then, testing the marking accuracy of the model by using the test set, judging whether the marking model needs to be continuously optimized according to the marking result of the test set until the marking accuracy of the marking model is greater than a preset value, and continuously optimizing the model so as to improve the accuracy of marking the contract elements.
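One round of steps S102 and S104 to S107 can be sketched as follows. This is a minimal illustration, not the patented implementation: `KeywordLabeler` is a hypothetical toy stand-in for the labeling model (the embodiments below use an LSTM), and the split of step S103 is assumed to have been done by the caller.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (contract text fragment, manually assigned element label)


class KeywordLabeler:
    """Hypothetical stand-in for the labeling model: remembers word -> label."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}

    def fit(self, samples: List[Sample]) -> None:
        for text, label in samples:
            for word in text.lower().split():
                # The first label seen for a word wins.
                self.table.setdefault(word, label)

    def predict(self, text: str) -> str:
        for word in text.lower().split():
            if word in self.table:
                return self.table[word]
        return "UNKNOWN"


def accuracy(model: KeywordLabeler, labeled: List[Sample]) -> float:
    """Fraction of samples whose predicted label equals the manual label."""
    hits = sum(model.predict(text) == label for text, label in labeled)
    return hits / len(labeled)


def run_round(initial: List[Sample], expansion: List[str],
              test: List[Sample], preset: float):
    model = KeywordLabeler()
    model.fit(initial)                                   # S102: initial model
    pseudo = [(t, model.predict(t)) for t in expansion]  # S104: label expansion set
    model.fit(initial + pseudo)                          # S105: train on merged set
    acc = accuracy(model, test)                          # S106: label the test set
    needs_more = acc <= preset                           # S107: keep optimizing?
    return model, acc, needs_more
```

In practice the third return value would drive a loop in which the labels of poorly performing service types are corrected and the model is trained again.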
In one embodiment, 5 contract samples are extracted for each service type, and the contract elements in the contract text are labeled manually, for example by adding annotations to the text.
As can be appreciated, training the initial annotation model with contract samples for each business type can enable the initial annotation model to have preliminary annotation capabilities for contracts of various business types.
In one embodiment, when the initial labeling model is a long short-term memory (LSTM) neural network model, the LSTM neural network model may be a multi-layer LSTM neural network model or a bidirectional LSTM neural network model. The LSTM neural network model is a recurrent neural network machine learning algorithm whose input is a vector; given a certain amount of data, a model can be obtained that completes tasks such as classification, labeling, and prediction. The GRU neural network model is a recurrent neural network model similar to the LSTM. It can be understood that, before the initial labeling model is constructed and trained, the contract samples need to be serialized; the serialized contract samples are then imported into the initial labeling model, and the LSTM neural network model further extracts the sequence vectors.
Optionally, step S102 includes: constructing an initial labeling model, wherein the initial labeling model is an LSTM neural network model; inputting the initial sample set to the initial labeling model, wherein the LSTM neural network learns a vector sequence of the manually labeled contract elements and a category vector of the labels associated with the contract elements in each contract sample; and training the initial labeling model through an error minimization strategy.
Of course, the LSTM neural network model may have various model structures, such as a two-layer LSTM neural network model, a three-layer LSTM neural network model, a bidirectional LSTM neural network model, and the like, which is not limited in this embodiment of the present invention. In addition, the above neural network model is only an example; when the embodiment of the present invention is implemented, other types of neural network models may be used according to the actual situation, and the model structure and parameters of the neural network model may also be adjusted.
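The text does not say how contract samples are serialized before being fed to the model. One common choice, shown here purely as an assumption, is character-level BIO tagging: each character becomes one sequence position, tagged `B-` at the start of a labeled element, `I-` inside it, and `O` elsewhere. The function names and the `PRICE` label are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset exclusive, element label)


def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Pair every character with a BIO tag derived from the manual annotations."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(text, tags))


def build_vocab(items: List[str]) -> Dict[str, int]:
    """Assign consecutive integer ids in order of first appearance."""
    vocab: Dict[str, int] = {}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab


def encode(pairs: List[Tuple[str, str]],
           char_vocab: Dict[str, int],
           tag_vocab: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Turn (character, tag) pairs into two parallel id sequences."""
    chars = [char_vocab[c] for c, _ in pairs]
    tags = [tag_vocab[t] for _, t in pairs]
    return chars, tags
```

The two id sequences are the kind of input and target that a sequence labeling network consumes after an embedding layer.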
Optionally, the contract elements include first-level elements, second-level elements, and third-level elements that are hierarchically distributed. The labels likewise form a hierarchy: first-level labels, second-level labels under the first-level labels, and third-level labels under the second-level labels. For example, the first-level label is a product name element (home property insurance contract). Under this element there are three benefit categories, which are the second-level elements: house body, indoor property, and house decoration. Under each benefit category there are further subdivided elements (the third-level elements), which may be the same or different across categories, such as location information and effective date. While identifying the contract elements, the labeling model can use the labels to identify the hierarchical relationship between them.
Optionally, constructing and training an initial labeling model based on the contract sample of each service type includes: constructing an initial labeling model; inputting the initial sample set to the initial labeling model, wherein the LSTM neural network extracts, in each contract sample, a vector sequence of the first-level elements associated with the first-level label, a vector sequence of the second-level elements associated with the second-level label, and a vector sequence of the third-level elements associated with the third-level label; and training the initial labeling model through an error minimization strategy based on the vector sequence of the first-level elements, the vector sequence of the second-level elements, the vector sequence of the third-level elements, and the category vector of the labels.
In one embodiment, the contracts are divided between the sample expansion set and the test set in a ratio of 8:2 or 7:3. The number of contract samples in the initial sample set is small by comparison, for example one tenth of the combined size of the sample expansion set and the test set. Because the labels on the contract samples in the initial sample set are assigned manually, their accuracy can be guaranteed.
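The split by a preset proportion can be sketched as below. Shuffling before the split and the fixed random seed are assumptions made for reproducibility; the text only specifies the ratio. The function name `split_contracts` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_contracts(contracts: Sequence[T],
                    expansion_ratio: float = 0.8,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Return (sample expansion set, test set) in the given proportion."""
    if not 0.0 < expansion_ratio < 1.0:
        raise ValueError("expansion_ratio must be strictly between 0 and 1")
    shuffled = list(contracts)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = round(len(shuffled) * expansion_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_contracts(contracts, 0.7)` gives the 7:3 variant.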
Optionally, labeling contract elements in the contracts in the sample expansion set by using the initial labeling model includes: the initial labeling model labels the contracts in the sample expansion set according to the first-level label to obtain first-level elements; acquiring at least one second-level label according to the first-level label, and labeling the contract according to the second-level label to obtain at least one second-level element associated with the first-level element; and acquiring at least one third-level label according to the second-level label, and labeling the contract according to the third-level label to obtain at least one third-level element associated with the second-level element.
For example, the label may be A-A1-a1, where A represents the home property insurance, A1 represents the house body, and a1 represents the location information of the house body; or the label may be A-A1-a2, where A represents the product name element, A1 represents the house body, and a2 represents the building age of the house body. When the first-level element is identified as home property insurance, the labeling model can look up which second-level elements exist under it, for example house body, indoor property, and house decoration. When the second-level element is determined to be indoor property, the labeling model can look up the third-level elements under it, such as the amount of property and the estimated value of precious metals.
It can be understood that after the initial labeling model is trained, it has a contract labeling capability for each service type, so it can label the contract elements of the contracts in the sample expansion set according to the hierarchically configured labels.
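The hierarchical label lookup described above can be pictured as a small tree. The sketch below is an illustration only: the codes A, A1, a1, and a2 come from the example in the text, while A2, A3, b1, and b2 are hypothetical placeholders for the other elements it names.

```python
from typing import Dict, List, Tuple

# first-level label -> second-level label -> third-level labels
LABEL_TREE: Dict[str, Dict[str, List[str]]] = {
    "A": {                   # home property insurance
        "A1": ["a1", "a2"],  # house body: location information, age of the house body
        "A2": ["b1", "b2"],  # indoor property (hypothetical codes): amount, precious metal value
        "A3": [],            # house decoration (hypothetical code, no third level listed)
    }
}


def parse_label(label: str) -> Tuple[str, str, str]:
    """Split a label such as 'A-A1-a1' into its three levels."""
    first, second, third = label.split("-")
    return first, second, third


def second_level_labels(first: str) -> List[str]:
    """Second-level labels available once the first-level label is known."""
    return list(LABEL_TREE[first])


def third_level_labels(first: str, second: str) -> List[str]:
    """Third-level labels available once the second-level label is known."""
    return list(LABEL_TREE[first][second])


def is_valid(label: str) -> bool:
    """True if the third-level label really sits under the given parents."""
    first, second, third = parse_label(label)
    return third in LABEL_TREE.get(first, {}).get(second, [])
```

Restricting each level's candidates to the children of the level above is what lets the labels carry the association between elements.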
In one embodiment, the contracts in the sample expansion set may be, for example, signed contracts for each business type, which may be collected from each business segment.
Of course, these contracts may be in various forms, such as PDF versions, photo versions, scanned versions, and the like. They need to be preprocessed before they can be used as the sample expansion set.
Specifically, before labeling the contract elements in the contracts in the sample expansion set with the initial labeling model, the method further comprises: screening out the contract samples to be processed in the sample expansion set, wherein the format of a contract sample to be processed is an image file; finding the inclination angle of each contract sample to be processed by a Hough transform method, and performing rotation correction on the contract sample to be processed by bilinear interpolation based on the inclination angle; performing text recognition on the rotation-corrected contract sample to obtain a contract text; and replacing the contract sample to be processed in the sample expansion set with the contract text.
In another embodiment, the contracts in the sample expansion set may also be subjected to binarization processing and filtering processing. Binarization maps the value of every pixel of the image file to one of two values, 255 or 0, where 255 is white and 0 is black, so that the characters stand out more clearly from the background. The filtering may use mean filtering, adaptive Wiener filtering, wavelet filtering, and the like. Understandably, after binarization and filtering the characters in the contract are clearer, which improves accuracy when the characters are subsequently recognized.
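Binarization and mean filtering can be illustrated on a grayscale image stored as nested lists. This is a minimal pure-Python sketch: the threshold of 128 and the 3x3 window are assumptions, since the text fixes neither, and a real pipeline would more likely use an image processing library.

```python
from typing import List

Image = List[List[int]]  # grayscale pixel values in the range 0..255


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map every pixel to 255 (white) or 0 (black)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def mean_filter(gray: Image) -> Image:
    """Replace each pixel by the integer mean of its 3x3 neighborhood.

    Border pixels use only the neighbors that exist inside the image.
    """
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = sum(window) // len(window)
    return out
```

Smoothing first and thresholding afterwards, as in `binarize(mean_filter(img))`, is one plausible ordering; the text does not state which comes first.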
Further, judging whether the labeling model needs to be further optimized according to the labeling result of the test set, until the labeling accuracy of the labeling model is greater than a preset value, includes: comparing the manual labeling result of the contracts in the test set with the labeling result output by the labeling model to obtain the labeling accuracy of the labeling model for the contracts of each service type; judging whether the labeling accuracy of each service type is greater than the preset value; removing from the plurality of service types those whose labeling accuracy is greater than the preset value, to obtain the target service types that need further optimization; correcting the labeling result of the contracts of the target service types based on a correction instruction from the user; and further training the labeling model with the corrected contracts of the target service types until the labeling accuracy of the labeling model is greater than the preset value.
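The per-service-type comparison and the selection of target service types can be sketched as follows. The record layout and function names are hypothetical; each record pairs one manual label with the label the model produced for the same element.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (service type, manual label, model label)


def accuracy_by_type(records: Iterable[Record]) -> Dict[str, float]:
    """Labeling accuracy of the model for each service type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for service_type, manual, predicted in records:
        totals[service_type] += 1
        hits[service_type] += int(manual == predicted)
    return {t: hits[t] / totals[t] for t in totals}


def target_service_types(acc: Dict[str, float], preset: float) -> List[str]:
    """Service types whose accuracy is not greater than the preset value."""
    return sorted(t for t, a in acc.items() if a <= preset)
```

Only the contracts of the returned service types would then be corrected by the user and used for the next training round.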
Specifically, calculating the labeling accuracy includes:
calculating the labeling accuracy of the first-level label, the labeling accuracy of the second-level label, and the labeling accuracy of the third-level label according to the manual labeling result of the contract and the labeling result output by the labeling model;
and calculating the total labeling accuracy according to preset weights of the label levels. For example: $\Psi_1 Q_1 + \Psi_2 Q_2 + \Psi_3 Q_3 = Q_{\text{total}}$, where $\Psi_1$ is the weight of the first-level labeling accuracy $Q_1$, $\Psi_2$ is the weight of the second-level labeling accuracy $Q_2$, and $\Psi_3$ is the weight of the third-level labeling accuracy $Q_3$, with $\Psi_1 > \Psi_2 > \Psi_3$.
For example: $\Psi_1$, $\Psi_2$, and $\Psi_3$ are 50%, 30%, and 20% respectively, the manual label is A-A1-a1, and the labeling result output by the model is A-A2-b1. The accuracy of the first-level label is then 100%, the accuracy of the second-level label is 0%, and the accuracy of the third-level label is 0%, so the total labeling accuracy is 50% × 100% + 30% × 0% + 20% × 0% = 50%.
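The worked example can be reproduced with a few lines of code, assuming the total accuracy is the weighted sum of the per-level accuracies. Comparing the labels position by position is a further assumption; the text does not say whether a lower level can count as correct when its parent level is wrong.

```python
from typing import Sequence, Tuple


def level_accuracy(manual: str, predicted: str) -> Tuple[float, ...]:
    """Per-level accuracy for one label pair, e.g. 'A-A1-a1' vs 'A-A2-b1'."""
    return tuple(
        1.0 if m == p else 0.0
        for m, p in zip(manual.split("-"), predicted.split("-"))
    )


def total_accuracy(levels: Sequence[float],
                   weights: Sequence[float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of first-, second-, and third-level accuracies."""
    return sum(w * q for w, q in zip(weights, levels))


levels = level_accuracy("A-A1-a1", "A-A2-b1")
print(levels, total_accuracy(levels))
```

Over a whole test set, each level's accuracy would be averaged across all labels before the weights are applied.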
An embodiment of the present invention provides a contract annotation apparatus, which is configured to execute the contract annotation method described above, and as shown in fig. 2, the apparatus includes: the system comprises an extraction unit 10, a construction unit 20, an acquisition unit 30, a labeling unit 40, an optimization training unit 50, an input unit 60 and a judgment unit 70.
The extraction unit 10 is configured to extract at least one contract sample from the contracts of each service type to obtain an initial sample set, where the contract sample includes a plurality of manually labeled contract elements;
the construction unit 20 is configured to construct and train an initial annotation model based on the contract sample of each service type;
the acquiring unit 30 is configured to acquire a plurality of pre-stored contracts of each service type, and divide the contracts into a sample expansion set and a test set according to a preset proportion;
the labeling unit 40 is used for labeling contract elements in the contracts in the sample expansion set by using the initial labeling model;
an optimization training unit 50, configured to combine the labeled sample expansion set and the initial sample set into a training sample set, and optimize and train the initial labeling model by using the training sample set to obtain a labeling model;
an input unit 60, configured to input the test set into the annotation model, and obtain an annotation result of the contract in the test set output by the annotation model;
and the judging unit 70 is configured to judge whether the annotation model needs to be optimized continuously according to the annotation result of the test set until the annotation accuracy of the annotation model is greater than a preset value.
It can be understood that the service types include house sale contracts, house rental contracts, loan contracts, borrowing contracts, and the like, and that the contract elements may be party information (name, residence, contact details, and the like), contract term, contract price, fulfillment terms, liability for breach, and the like.
In one embodiment, 5 contract samples are extracted for each service type, and the elements in the contract text are labeled manually, for example by adding annotations to the contract text.
As can be appreciated, training the initial labeling model with contract samples of every service type gives the model a preliminary labeling capability for contracts of all service types.
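One possible in-memory form of a manually labeled contract sample is sketched below. The representation (character spans plus a label) and all class and function names are assumptions made for illustration; the source only states that elements are marked manually, for example by annotation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ElementAnnotation:
    start: int   # index of the first character of the element
    end: int     # index one past the last character
    label: str   # label assigned by the human annotator


@dataclass
class ContractSample:
    service_type: str
    text: str
    annotations: List[ElementAnnotation]

    def elements(self) -> List[Tuple[str, str]]:
        """Return (element text, label) pairs recovered from the spans."""
        return [(self.text[a.start:a.end], a.label) for a in self.annotations]


def annotate(text: str, service_type: str,
             marks: List[Tuple[str, str]]) -> ContractSample:
    """Build a sample from (element substring, label) pairs chosen by a person."""
    annotations = []
    for element, label in marks:
        start = text.find(element)
        if start == -1:
            raise ValueError(f"element not found in contract text: {element!r}")
        annotations.append(ElementAnnotation(start, start + len(element), label))
    return ContractSample(service_type, text, annotations)
```

For instance, `annotate("Lessor: Zhang San. Monthly rent: 3000 yuan.", "house rental contract", [("Zhang San", "party name")])` records the span of the party name together with its label.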
In one embodiment, when the initial labeling model is a long short-term memory (LSTM) neural network model, it may be a multi-layer LSTM neural network model or a bidirectional LSTM neural network model. The LSTM neural network model is a recurrent neural network; its input is a sequence of vectors, and given a certain amount of data it can be trained to complete tasks such as classification, labeling, and prediction. The GRU neural network model is a recurrent neural network model similar to LSTM. It can be understood that, before the initial labeling model is constructed and trained, the contract samples need to be serialized; the serialized contract samples are then fed into the initial labeling model, and the LSTM neural network model extracts the sequence vectors from them.
The construction unit 20 comprises a building subunit, an input subunit, and a training subunit.
The building subunit is used for building an initial labeling model, where the initial labeling model is a long short-term memory (LSTM) neural network model; the input subunit is used for inputting the initial sample set to the initial labeling model, where the LSTM neural network learns the vector sequence of the manually labeled contract elements and the category vectors of the labels associated with the contract elements in each contract sample; and the training subunit is used for training the initial labeling model through an error minimization strategy.
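The serialization step mentioned above can be illustrated with a minimal sketch that turns contract text into a fixed-length sequence of integer ids. Whitespace tokenization, the `<pad>`/`<unk>` entries, and the function names are assumptions for illustration; the source does not specify how contracts are serialized, and Chinese contract text would need a word segmenter instead of `split()`.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids for padding and unknown tokens (assumed convention)


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def serialize(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map a contract text to a sequence of exactly max_len token ids."""
    ids = [vocab.get(token, UNK) for token in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

In a full model, each id would then be mapped to an embedding vector before being fed to the recurrent network.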
Of course, the LSTM neural network model may have various model structures, such as a two-layer LSTM neural network model, a three-layer LSTM neural network model, a bidirectional LSTM neural network model, and the like, which is not limited in this embodiment of the present invention. In addition, the above neural network models are only examples; when the embodiment of the present invention is implemented, other types of neural network models may be used according to the actual situation, and the model structure and parameters of the neural network model may also be adjusted.
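All of the LSTM structures listed above are built from the same basic cell. The following is a minimal sketch of a single time step of a standard LSTM cell in plain Python, given only to make the recurrence concrete; the parameter layout (`params` maps each gate name to a weight matrix and bias) is an assumption, and a practical implementation would use a deep learning framework.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x: Vector, h: Vector, c: Vector,
              params: Dict[str, Tuple[Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One LSTM time step.

    params maps gate names "i", "f", "o", "g" to (W, b), where W has
    len(h) rows and len(h) + len(x) columns.
    """
    z = h + x  # concatenate previous hidden state and current input
    i = [sigmoid(a) for a in affine(*params["i"], z)]     # input gate
    f = [sigmoid(a) for a in affine(*params["f"], z)]     # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], z)]     # output gate
    g = [math.tanh(a) for a in affine(*params["g"], z)]   # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new
```

A multi-layer model feeds `h_new` of one layer as the input `x` of the next, and a bidirectional model runs a second cell over the sequence in reverse order.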
Optionally, the contract elements include first-level elements, second-level elements, and third-level elements that are distributed hierarchically. The labels have the same hierarchical relationship and include first-level labels, second-level labels under the first-level labels, and third-level labels under the second-level labels. For example, the first-level label corresponds to a product name element (a home property insurance contract); under this element there are three benefit categories (that is, the second-level elements, which are house main body, indoor property, and house decoration); and under each benefit category there are different subdivision elements (that is, third-level elements), such as location information and effective date, which may be the same or different across categories. While identifying the contract elements, the labeling model can use the labels to identify the hierarchical relationship between them.
Optionally, the input subunit is further configured to input the initial sample set to the initial labeling model, where the LSTM neural network extracts a vector sequence of the first-level element associated with the first-level label, a vector sequence of the second-level element associated with the second-level label, and a vector sequence of the third-level element associated with the third-level label in each contract sample;
and the training subunit is also used for training the initial labeling model through a strategy of error minimization based on the vector sequence of the primary element, the vector sequence of the secondary element, the vector sequence of the tertiary element and the type vector of the label.
In one embodiment, the contracts are divided between the sample expansion set and the test set in a ratio of 8:2 or 7:3. The number of contract samples in the initial sample set is small, for example one tenth of the combined size of the sample expansion set and the test set. Because the labels on the contract samples in the initial sample set are added manually, their accuracy can be guaranteed.
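The division by a preset proportion can be sketched as below. Shuffling before the split and the `seed` parameter are assumptions added so that the example is reproducible; the source only specifies the ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_contracts(contracts: Sequence[str],
                    expansion_ratio: float = 0.8,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Divide contracts into (sample expansion set, test set)."""
    shuffled = list(contracts)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * expansion_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_contracts(contracts, 0.7)` gives the 7:3 division; applying the function separately to the contracts of each service type keeps every type represented in both sets.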
Optionally, when the initial labeling model is used for labeling contract elements in the contracts in the sample expansion set, the initial labeling model labels the contracts in the sample expansion set according to the first-level label to obtain a first-level element; acquiring at least one secondary label according to the primary label, and labeling the contract according to the secondary label to obtain at least one secondary element associated with the primary element; and acquiring at least one third-level label according to the second-level label, and labeling the contract according to the third-level label to obtain at least one third-level element associated with the second-level element.
For example, the label may be A-A1-a1, where A represents home property insurance, A1 represents the house main body, and a1 represents the location information of the house main body; or the label may be A-A1-a2, where A represents the product name element, A1 represents the house main body, and a2 represents the building age of the house main body. When the first-level element is identified as home property insurance, the labeling model can obtain the second-level elements under it, for example house main body, indoor property, and house decoration; when a second-level element is labeled and determined to be indoor property, the labeling model can obtain the third-level elements under it, such as the amount of property and the estimated value of precious metals.
It can be understood that, after the initial labeling model is trained, it is able to label contracts of each service type, so the initial labeling model labels the contract elements of the contracts in the sample expansion set according to the hierarchically configured labels.
In one embodiment, the contracts in the sample expansion set may be, for example, signed contracts of each service type, which may be collected from each business segment.
Of course, these contracts may come in various forms, such as PDF versions, photographed versions, and scanned versions, and they need to be preprocessed before they can be used as the sample expansion set.
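The hierarchical label lookup described above can be sketched with a nested dictionary. The codes A, A1, a1 and a2 follow the example in the text; the codes assigned to indoor property (A2, b1, b2) and house decoration (A3) are assumptions made only for this illustration.

```python
from typing import List

# code -> (name, children); third-level entries have no children.
TAXONOMY = {
    "A": ("home property insurance", {
        "A1": ("house main body", {
            "a1": ("location information", {}),
            "a2": ("building age", {}),
        }),
        "A2": ("indoor property", {           # assumed code
            "b1": ("amount of property", {}),  # assumed code
            "b2": ("estimated value of precious metals", {}),  # assumed code
        }),
        "A3": ("house decoration", {}),       # assumed code
    }),
}


def child_labels(*path: str) -> List[str]:
    """Return the label codes directly under the given label path."""
    level = TAXONOMY
    for code in path:
        level = level[code][1]
    return list(level)


def describe(label: str) -> List[str]:
    """Expand a full label such as 'A-A1-a1' into element names, level by level."""
    names, level = [], TAXONOMY
    for code in label.split("-"):
        name, level = level[code]
        names.append(name)
    return names
```

Once a first-level label is fixed, `child_labels("A")` yields the candidate second-level labels, and `child_labels("A", "A2")` yields the third-level labels under the chosen second-level label.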
Specifically, the device further comprises a screening unit, a processing unit, an identification unit and a replacement unit.
The screening unit is used for screening out the contract samples to be processed in the sample expansion set, where a contract sample to be processed is one whose format is an image file; the processing unit is used for finding the inclination angle of each contract sample to be processed by the Hough transform and, based on the inclination angle, performing rotation correction on the contract sample using bilinear interpolation; the identification unit is used for recognizing the rotation-corrected contract sample to obtain a contract text; and the replacement unit is used for replacing the contract samples to be processed in the sample expansion set with the corresponding contract texts.
In another embodiment, the contracts in the sample expansion set may also be subjected to binarization and filtering. Binarization sets the value of every pixel of the image file to one of two values, 255 or 0, where 255 is white and 0 is black, so that the characters stand out more clearly from the background. The filtering may use mean filtering, adaptive Wiener filtering, wavelet filtering, and the like. Understandably, after binarization and filtering the characters in the contract are clearer, which improves accuracy in the subsequent character recognition.
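The tilt estimation and rotation correction can be illustrated with a small pure-Python sketch. It assumes the foreground (text) pixels have already been extracted as `(x, y)` coordinates, uses a 1-degree angular resolution, and represents an image as a list of rows of gray values; these choices and the function names are assumptions, and a production system would more likely rely on an image-processing library.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def estimate_skew_degrees(points: Sequence[Tuple[int, int]]) -> int:
    """Hough transform: vote for lines rho = x*cos(theta) + y*sin(theta).

    Returns the tilt of the dominant line relative to the x axis, in degrees.
    """
    votes: Counter = Counter()
    for theta_deg in range(180):
        theta = math.radians(theta_deg)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for x, y in points:
            votes[(theta_deg, round(x * cos_t + y * sin_t))] += 1
    (best_theta, _rho), _count = votes.most_common(1)[0]
    return best_theta - 90  # theta is the angle of the line's normal


def bilinear_sample(image: List[List[float]], x: float, y: float) -> float:
    """Interpolate the gray value at a real-valued position inside the image."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, len(image[0]) - 1), min(y0 + 1, len(image) - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def rotate_to_correct(image: List[List[float]], skew_deg: float,
                      fill: float = 255) -> List[List[float]]:
    """Undo a skew of skew_deg degrees by inverse mapping about the image center."""
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    a = math.radians(skew_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for v in range(h):
        row = []
        for u in range(w):
            # Source position that lands on output pixel (u, v).
            x = cos_a * (u - cx) - sin_a * (v - cy) + cx
            y = sin_a * (u - cx) + cos_a * (v - cy) + cy
            inside = 0 <= x <= w - 1 and 0 <= y <= h - 1
            row.append(bilinear_sample(image, x, y) if inside else fill)
        out.append(row)
    return out
```

Pixels that map outside the source image are filled with white (255) here, which is a design choice suited to scanned documents with a light background.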
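A minimal sketch of the two preprocessing operations follows, using a 3x3 mean filter and a fixed threshold. The threshold value of 128, the window size, and the border handling are assumptions for illustration; the source names the techniques but not their parameters.

```python
from typing import List


def mean_filter(image: List[List[float]]) -> List[List[float]]:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def binarize(image: List[List[float]], threshold: float = 128) -> List[List[int]]:
    """Map every pixel to 255 (white) or 0 (black)."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]
```

A typical use would be `binarize(mean_filter(gray_image))`, smoothing noise first so that isolated specks are less likely to survive thresholding.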
Furthermore, the judgment unit 70 comprises a calculation subunit, a judgment subunit, an elimination subunit, a correction subunit, and an optimization subunit.
The calculation subunit is used for comparing the manual labeling results of the contracts in the test set with the labeling results output by the labeling model to obtain the labeling accuracy of the labeling model for the contracts of each service type; the judgment subunit is used for judging whether the labeling accuracy of each service type is greater than a preset value; the elimination subunit is used for eliminating, from the plurality of service types, the service types whose labeling accuracy is greater than the preset value, to obtain the target service types that need further optimization; the correction subunit is used for correcting the labeling results of the contracts of the target service types based on a correction instruction from the user; and the optimization subunit is used for optimizing and training the labeling model by using the corrected contracts of the target service types until the labeling accuracy of the labeling model is greater than the preset value.
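The per-service-type check performed by these subunits can be sketched as follows. For simplicity the example counts a label as correct only when the whole label matches exactly, which is an assumption; the record layout and function names are likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (service type, manual label, label output by the model)
Result = Tuple[str, str, str]


def accuracy_by_service_type(results: List[Result]) -> Dict[str, float]:
    """Compare manual and model labels and compute accuracy per service type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for service_type, manual, predicted in results:
        total[service_type] += 1
        correct[service_type] += manual == predicted
    return {t: correct[t] / total[t] for t in total}


def target_service_types(accuracies: Dict[str, float], preset: float) -> List[str]:
    """Eliminate types above the preset value; the rest need further optimization."""
    return [t for t, acc in accuracies.items() if not acc > preset]
```

Only the contracts of the returned target service types would then be corrected by the user and used for the next round of training.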
Specifically, calculating the labeling accuracy includes: calculating the labeling accuracy of the first-level label, the labeling accuracy of the second-level label, and the labeling accuracy of the third-level label according to the manual labeling result of the contract and the labeling result output by the labeling model;
and calculating the total labeling accuracy according to the preset weights of the label levels. For example: $Q_{\text{total}} = \Psi_1 Q_1 + \Psi_2 Q_2 + \Psi_3 Q_3$, where $\Psi_1$ is the weight of the first-level labeling accuracy $Q_1$, $\Psi_2$ is the weight of the second-level labeling accuracy $Q_2$, and $\Psi_3$ is the weight of the third-level labeling accuracy $Q_3$, with $\Psi_1 > \Psi_2 > \Psi_3$.
For example, if the weights $\Psi_1$, $\Psi_2$, $\Psi_3$ are 50%, 30% and 20%, the manual label is A-A1-a1, and the model outputs the label A-A2-b1, then the accuracy of the first-level label is 100%, the accuracy of the second-level label is 0%, and the accuracy of the third-level label is 0%, giving a weighted total accuracy of 50%.
In this scheme, at least one contract sample is extracted from the contracts of each service type and used to build an initial labeling model. The training set is then expanded with more historical contracts of each service type, and the initial labeling model is trained on the expanded training set to improve its labeling capability. The test set is then used to measure the labeling accuracy of the model, and the labeling results on the test set determine whether the labeling model needs further optimization. Optimization continues until the labeling accuracy of the labeling model is greater than a preset value, which improves the accuracy of contract element labeling.
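The weighted accuracy in this example can be reproduced with a short sketch. Comparing the labels position by position after splitting on "-" is an assumption about how the per-level accuracies are obtained, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def level_accuracies(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """pairs holds (manual label, model label), e.g. ("A-A1-a1", "A-A2-b1").

    Returns the accuracies of the first-, second- and third-level labels.
    """
    hits = [0, 0, 0]
    for manual, predicted in pairs:
        m, p = manual.split("-"), predicted.split("-")
        for level in range(3):
            hits[level] += m[level] == p[level]
    return [h / len(pairs) for h in hits]


def total_accuracy(q: Sequence[float],
                   weights: Sequence[float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the per-level accuracies."""
    return sum(w * qk for w, qk in zip(weights, q))
```

With the single pair from the example, the per-level accuracies are 1.0, 0.0 and 0.0, and the weighted total is 0.5.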
An embodiment of the present invention provides a computer non-volatile storage medium. The storage medium includes a stored program, and when the program runs, the device on which the storage medium is located is controlled to execute the following steps:
extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein each contract sample comprises a plurality of manually labeled contract elements; constructing and training an initial labeling model based on the contract samples of each service type; acquiring a plurality of pre-stored contracts of each service type, and dividing the contracts into a sample expansion set and a test set according to a preset proportion; labeling contract elements in the contracts in the sample expansion set by using the initial labeling model; merging the labeled sample expansion set and the initial sample set into a training sample set, and optimizing and training the initial labeling model by using the training sample set to obtain a labeling model; inputting the test set into the labeling model, and acquiring the labeling results, output by the labeling model, of the contracts in the test set; and judging, according to the labeling results of the test set, whether the labeling model needs further optimization, until the labeling accuracy of the labeling model is greater than a preset value.
Optionally, when the program runs, the device on which the storage medium is located is controlled to execute the following steps: constructing an initial labeling model, wherein the initial labeling model is a long short-term memory (LSTM) neural network model; inputting the initial sample set to the initial labeling model, wherein the LSTM neural network learns a vector sequence of the manually labeled contract elements and a category vector of the labels associated with the contract elements in each contract sample; and training the initial labeling model through an error minimization strategy.
Optionally, the program controls the apparatus in which the storage medium is located to perform the following steps when running: constructing an initial labeling model; inputting an initial sample set to an initial labeling model, wherein a vector sequence of a primary element associated with a primary label, a vector sequence of a secondary element associated with a secondary label and a vector sequence of a tertiary element associated with a tertiary label in each contract sample are extracted by a deep convolutional neural network; and training an initial labeling model through a strategy of error minimization based on the vector sequence of the primary elements, the vector sequence of the secondary elements, the vector sequence of the tertiary elements and the type vector of the label.
Optionally, the program controls the apparatus in which the storage medium is located to perform the following steps when running: the initial labeling model labels the contracts in the sample expansion set according to the first-level labels to obtain first-level elements; acquiring at least one secondary label according to the primary label, and labeling the contract according to the secondary label to obtain at least one secondary element associated with the primary element; and acquiring at least one third-level label according to the second-level label, and labeling the contract according to the third-level label to obtain at least one third-level element associated with the second-level element.
Optionally, when the program runs, the device on which the storage medium is located is controlled to execute the following steps: comparing the manual labeling results of the contracts in the test set with the labeling results output by the labeling model to obtain the labeling accuracy of the labeling model for the contracts of each service type; judging whether the labeling accuracy of each service type is greater than a preset value; eliminating, from the plurality of service types, the service types whose labeling accuracy is greater than the preset value, to obtain the target service types that need further optimization; correcting the labeling results of the contracts of the target service types based on a correction instruction from the user; and optimizing and training the labeling model by using the corrected contracts of the target service types until the labeling accuracy of the labeling model is greater than the preset value.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention. As shown in Fig. 3, the computer device 100 of this embodiment includes a processor 101, a memory 102, and a computer program 103 stored in the memory 102 and executable on the processor 101. When the processor 101 executes the computer program 103, the contract annotation method of the embodiments is implemented; to avoid repetition, the details are not described again here. Alternatively, when executed by the processor 101, the computer program implements the functions of each module/unit of the contract annotation apparatus of the embodiments, which, to avoid repetition, are likewise not described again here.
The computer device 100 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, the processor 101 and the memory 102. Those skilled in the art will appreciate that Fig. 3 is merely an example of the computer device 100 and does not limit it; the computer device 100 may include more or fewer components than shown, combine some components, or use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The Processor 101 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 102 may be an internal storage unit of the computer device 100, such as a hard disk or internal memory of the computer device 100. The memory 102 may also be an external storage device of the computer device 100, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 100. Further, the memory 102 may include both an internal storage unit and an external storage device of the computer device 100. The memory 102 is used for storing the computer program and other programs and data required by the computer device. The memory 102 may also be used to temporarily store data that has been output or is to be output.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for contract annotation, the method comprising:
extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein each contract sample comprises a plurality of manually labeled contract elements;
constructing and training an initial labeling model based on the contract sample of each service type;
acquiring a plurality of pre-stored contracts of each service type, and dividing the contracts into a sample expansion set and a test set according to a preset proportion;
labeling contract elements in the contracts in the sample expansion set by using the initial labeling model;
merging the labeled sample expansion set and the initial sample set into a training sample set, and optimally training the initial labeling model by using the training sample set to obtain a labeling model;
inputting the test set into the labeling model, and acquiring a labeling result of the contract in the test set output by the labeling model;
and judging whether the labeling model needs to be continuously optimized according to the labeling result of the test set until the labeling accuracy of the labeling model is greater than a preset value.
2. The method of claim 1, wherein the constructing and training an initial labeling model based on the contract sample of each service type comprises:
constructing an initial labeling model, wherein the initial labeling model is a long short-term memory (LSTM) neural network model;
inputting the initial sample set into the initial labeling model, wherein the LSTM neural network learns a vector sequence of the manually labeled contract elements and a category vector of the labels associated with the contract elements in each contract sample;
and training the initial labeling model through an error minimization strategy.
3. The method of claim 2, wherein the contract elements comprise a first-level element, a second-level element and a third-level element which are distributed hierarchically, and the labels also present a hierarchical relationship, and the labels comprise a first-level label, a second-level label under the first-level label and a third-level label under the second-level label; the constructing and training of the initial labeling model based on the contract sample of each service type comprises the following steps:
constructing an initial labeling model;
inputting the initial sample set to the initial labeling model, wherein the LSTM neural network extracts a vector sequence of the first-level element associated with the first-level label, a vector sequence of the second-level element associated with the second-level label, and a vector sequence of the third-level element associated with the third-level label in each contract sample;
and training the initial labeling model through a strategy of error minimization based on the vector sequence of the primary elements, the vector sequence of the secondary elements, the vector sequence of the tertiary elements and the type vector of the label.
4. The method of claim 3, wherein said labeling contract elements in contracts in said sample expansion set with said initial labeling model comprises:
the initial labeling model labels the contracts in the sample expansion set according to the first-level labels to obtain first-level elements;
acquiring at least one secondary label according to the primary label, and labeling the contract according to the secondary label to obtain at least one secondary element associated with the primary element;
and acquiring at least one third-level label according to the second-level label, and labeling the contract according to the third-level label to obtain at least one third-level element associated with the second-level element.
5. The method of claim 1, wherein the judging whether the labeling model needs to be continuously optimized according to the labeling result of the test set until the labeling accuracy of the labeling model is greater than a preset value comprises:
comparing the manual labeling result of the contract in the test set with the labeling result output by the labeling model to obtain the labeling accuracy of the labeling model for the contracts of each service type;
judging whether the labeling accuracy of each service type is greater than the preset value;
eliminating, from the plurality of service types, the service types whose labeling accuracy is greater than the preset value, to obtain target service types that need further optimization;
correcting the labeling result of the contract of the target service type based on a correction instruction from the user;
and optimally training the labeling model by using the corrected contract of the target service type until the labeling accuracy of the labeling model is greater than the preset value.
6. The method of any of claims 1-5, wherein prior to said labeling contract elements in contracts in said sample expansion set with said initial labeling model, the method further comprises:
screening out a contract sample to be processed in the sample expansion set, wherein the format of the contract sample to be processed is an image file;
finding the inclination angle of each contract sample to be processed by a Hough transform method, and performing rotation correction on the contract sample to be processed by adopting bilinear interpolation based on the inclination angle;
identifying the contract sample after the rotation correction to obtain a contract text;
replacing the pending contract sample in the sample expansion set with the contract text.
7. A contract annotation apparatus, characterized in that the apparatus comprises:
the extraction unit is used for extracting at least one contract sample from the contracts of each service type to obtain an initial sample set, wherein each contract sample comprises a plurality of manually labeled contract elements;
the construction unit is used for constructing and training an initial labeling model based on the contract sample of each service type;
the acquisition unit is used for acquiring a plurality of pre-stored contracts of each service type and dividing the contracts into a sample expansion set and a test set according to a preset proportion;
the labeling unit is used for labeling contract elements in the contracts in the sample expansion set by using the initial labeling model;
the optimization training unit is used for merging the labeled sample expansion set and the initial sample set into a training sample set, and optimally training the initial labeling model by using the training sample set to obtain a labeling model;
the input unit is used for inputting the test set into the labeling model and acquiring a labeling result of the contract in the test set output by the labeling model;
and the judgment unit is used for judging whether the labeling model needs to be continuously optimized according to the labeling result of the test set until the labeling accuracy of the labeling model is greater than a preset value.
8. The apparatus of claim 7, wherein the construction unit comprises:
the building subunit is used for building an initial labeling model, and the initial labeling model is a long short-term memory (LSTM) neural network model;
an input subunit, configured to input the initial sample set to the initial labeling model, where the LSTM neural network learns a vector sequence of the manually labeled contract elements in each contract sample and a category vector of the labels associated with the contract elements;
and the training subunit is used for training the initial labeling model through a strategy of error minimization.
9. A computer non-volatile storage medium, the storage medium comprising a stored program, wherein when the program runs, the apparatus on which the storage medium is located is controlled to execute the contract annotation method according to any one of claims 1 to 6.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the contract annotation method according to any one of claims 1 to 6 when executing the computer program.
CN201910752445.8A 2019-08-15 2019-08-15 Contract marking method and device Pending CN110705225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752445.8A CN110705225A (en) 2019-08-15 2019-08-15 Contract marking method and device

Publications (1)

Publication Number Publication Date
CN110705225A true CN110705225A (en) 2020-01-17

Family

ID=69194055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752445.8A Pending CN110705225A (en) 2019-08-15 2019-08-15 Contract marking method and device

Country Status (1)

Country Link
CN (1) CN110705225A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723870A (en) * 2020-06-22 2020-09-29 中国平安人寿保险股份有限公司 Data set acquisition method, device, equipment and medium based on artificial intelligence
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112214595A (en) * 2020-08-21 2021-01-12 中国建设银行股份有限公司 Category determination method, device, equipment and medium
CN113239205A (en) * 2021-06-10 2021-08-10 阳光保险集团股份有限公司 Data annotation method and device, electronic equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951925A (en) * 2017-03-27 2017-07-14 成都小多科技有限公司 Data processing method, device, server and system
CN109902157A (en) * 2019-01-10 2019-06-18 平安科技(深圳)有限公司 Training sample validity detection method and device
CN110110086A (en) * 2019-05-13 2019-08-09 湖南星汉数智科技有限公司 Chinese semantic role labeling method, apparatus, computer device and computer-readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723870A (en) * 2020-06-22 2020-09-29 中国平安人寿保险股份有限公司 Data set acquisition method, device, equipment and medium based on artificial intelligence
CN111723870B (en) * 2020-06-22 2024-04-09 中国平安人寿保险股份有限公司 Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112214595A (en) * 2020-08-21 2021-01-12 中国建设银行股份有限公司 Category determination method, device, equipment and medium
CN113239205A (en) * 2021-06-10 2021-08-10 阳光保险集团股份有限公司 Data annotation method and device, electronic equipment and computer readable storage medium
CN113239205B (en) * 2021-06-10 2023-09-01 阳光保险集团股份有限公司 Data labeling method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110705225A (en) Contract marking method and device
CN107798299B (en) Bill information identification method, electronic device and readable storage medium
CN108304775B (en) Remote sensing image recognition method and device, storage medium and electronic equipment
CN112016438B (en) Method and system for identifying certificate based on graph neural network
CN108399386A (en) Information extracting method in pie chart and device
CN109635718B (en) Text region dividing method, device, equipment and storage medium
CN106980856B (en) Formula identification method and system and symbolic reasoning calculation method and system
CN110059750A (en) House type shape recognition process, device and equipment
CN110705952A (en) Contract auditing method and device
CN110807491A (en) License plate image definition model training method, definition detection method and device
CN110503103B (en) Character segmentation method in text line based on full convolution neural network
CN112001406B (en) Text region detection method and device
CN110909618A (en) Pet identity recognition method and device
US11023720B1 (en) Document parsing using multistage machine learning
CN113449046A (en) Model training method, system and related device based on enterprise knowledge graph
CN109815480B (en) Data processing method and device and storage medium
CN110796210A (en) Method and device for identifying label information
CN109710788A (en) Image pattern mark and management method and equipment
CN113128536A (en) Unsupervised learning method, system, computer device and readable storage medium
CN113592886A (en) Method and device for examining architectural drawings, electronic equipment and medium
CN110766460A (en) User portrait drawing method and device, storage medium and computer equipment
CN113223011B (en) Small sample image segmentation method based on guide network and full-connection conditional random field
CN113011961B (en) Method, device, equipment and storage medium for monitoring risk of company-related information
CN112329735B (en) Training method of face recognition model and online education system
CN115294561A (en) Text recognition method and device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination