CN116843175A

CN116843175A - Contract term risk checking method, system, equipment and storage medium

Info

Publication number: CN116843175A
Application number: CN202310648556.0A
Authority: CN
Inventors: 周红; 白世超; 高滨玮; 汤世隆; 王书钰
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2023-06-02
Filing date: 2023-06-02
Publication date: 2023-10-03

Abstract

The invention provides a contract term risk checking method, system, equipment and storage medium. The method comprises the following steps: constructing a clause semantic similarity calculation model; acquiring a contract to be checked, and inputting the clauses to be checked in the contract into the clause semantic similarity calculation model; and obtaining the semantic similarity between each clause to be checked and the standard clause, wherein a clause whose semantic similarity is lower than a set threshold is determined to be a risk clause. The scheme decomposes the task of detecting missing contract terms into the problems of contract text classification and clause similarity. Taking linguistics, NLP technology, deep learning technology and construction engineering contract management as its theoretical basis, it uses computer technology to realize automatic classification of contract texts, multi-label classification of clauses and calculation of clause semantic similarity, thereby addressing the problems of the traditional experience-based contract review mode, such as reliance on the subjective experience of personnel, low detection efficiency and incomplete risk detection.

Description

Contract term risk checking method, system, equipment and storage medium

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a contract term risk checking method, system, equipment and storage medium.

Background

A contract is an agreement between civil subjects to establish, alter or terminate civil legal relations. In business activities, contracts, particularly construction engineering contracts, are the basis of project management and the written agreement on the rights and obligations of the contracting parties, who enjoy rights and fulfill the corresponding obligations according to the contract. Because construction engineering contracts have long performance periods and complex clause content, contract risk factors such as irregular clause formats and incomplete content can induce legal disputes during performance and cause huge economic losses. Over the past six years, the number of construction engineering contract dispute cases in China has grown to as many as 467,000, and the total amount involved in dispute judgment documents in 2019 reached 437.641 billion yuan, which shows that experience-based risk management of construction engineering contracts has many problems.

Contract risk management refers to the identification, estimation, evaluation and control of contract risk factors over the whole life cycle of a contract, from establishment through performance to termination. Contract risk factors comprise contract defect risks and performance process risks. The former are risks caused by deficiencies in the substantive content of terms at the contract establishment stage, such as missing terms, ambiguous content and an unbalanced relationship of responsibilities and rights; such risks can be avoided in time through contract checking. The latter are risks caused by events at the contract performance stage, such as engineering changes, construction delays, payment defaults and quality defects, and are controlled by means of claims and dispute resolution. Among the many contract defect risk factors, the risk of missing contract terms refers to the parties failing, at the contract establishment stage, to write part of the agreed content into the contract, so that one party cannot exercise its rights or raise claims according to the terms. If risk detection is done thoroughly at the contract establishment stage, the parties' possible opportunistic tendencies at the later performance stage can be reduced as far as possible.

Risk detection at the establishment stage of a construction project contract is usually carried out by each department of the company auditing the contract separately and then reaching a joint decision. On the one hand, this conventional approach relies on the accumulated experience and subjective judgment of personnel, so the level of contract risk management differs between departments. On the other hand, it cannot meet the need to check large batches of contracts within a short bidding period, so contract risks keep emerging during performance and cause a large number of disputes. Contract risk detection has long suffered from three problems, including incomplete detection of clause defects and low review efficiency: (1) Contract term defects caused by human omission occur frequently, such as incorrect wording, incorrect subjects and errors in the written form of amounts. In particular, after a contract has been passed between departments and checked several times, the likelihood of manual omissions increases. (2) The auditing mode is tedious and inefficient. Although the reviewer is already very familiar with the contract content, all or part of the terms must still be formally checked multiple times to prevent intentional or unintentional modification by the other party during negotiation. A contract often needs to circulate among departments for more than ten hours, which affects the efficiency of enterprise contract management. (3) The audit result is strongly influenced by subjective evaluation. Because reviewers differ in knowledge systems and education levels, the same clause may receive completely different evaluations, and reviewers' fixed ways of thinking can easily cause a 'cognitive boundary' problem in contract risk detection.

Disclosure of Invention

In view of the foregoing, a first aspect of the present invention provides a contract term risk checking method, including the steps of: constructing a clause semantic similarity calculation model; acquiring a contract to be checked, and inputting the clause to be checked in the contract into a clause semantic similarity calculation model; obtaining the semantic similarity of the clause to be checked and the standard clause, wherein the clause with the semantic similarity lower than the set threshold value is determined as the risk clause.
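The final step of the first aspect, flagging clauses whose similarity falls below the set threshold, can be sketched as follows. This is only an illustrative sketch: the function name `flag_risk_clauses`, the clause identifiers and the threshold value 0.8 are assumptions and do not come from the source, which leaves the threshold unspecified.

```python
from typing import Dict, List


def flag_risk_clauses(similarities: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the ids of clauses whose similarity to the standard clause
    is lower than the threshold (hypothetical default of 0.8)."""
    return [clause_id for clause_id, sim in similarities.items() if sim < threshold]


# Hypothetical similarity scores produced by a clause similarity model.
scores = {"payment": 0.93, "warranty": 0.41, "dispute_resolution": 0.78}
risky = flag_risk_clauses(scores, threshold=0.8)
print(risky)
```

In practice the threshold would presumably be tuned on labelled contracts rather than fixed in advance.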

Preferably, the method further comprises the step of: constructing a clause multi-label classification model and a standard clause label set, wherein the standard clause label comprises basic elements of a contract; inputting the clauses to be inspected in the contract to be inspected into a clause multi-label classification model to obtain a clause label set to be inspected; comparing the to-be-inspected clause label set with the standard clause label set, determining whether the to-be-inspected contract lacks a clause label, and marking the to-be-inspected contract with the missing clause label as a risk contract.
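The label comparison described above amounts to a set difference between the standard clause label set and the union of labels predicted for the clauses of the contract. A minimal sketch follows; the label names and the function `missing_labels` are hypothetical, and the multi-label classifier itself is assumed to exist elsewhere.

```python
from typing import Iterable, Set


def missing_labels(standard_labels: Set[str], clause_label_sets: Iterable[Set[str]]) -> Set[str]:
    """Labels required by the standard set but predicted for no clause."""
    found: Set[str] = set()
    for labels in clause_label_sets:
        found |= labels
    return standard_labels - found


# Hypothetical standard label set and per-clause predictions.
standard = {"parties", "price", "duration", "quality", "liability"}
predicted = [{"parties"}, {"price", "duration"}, {"quality"}]

missing = missing_labels(standard, predicted)
is_risk_contract = bool(missing)  # any missing label marks the contract as a risk contract
print(missing, is_risk_contract)
```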

Preferably, before inputting the clauses to be checked in the contract into the clause semantic similarity calculation model, the method further comprises the steps of: presetting a plurality of contract fields, and designing a corresponding standard clause label set for each contract field; and constructing a contract text classification model, and classifying the contract to be checked into one of the preset contract fields. Thus, for each category given by the contract text classification model, the output of the clause multi-label classification model is matched against the corresponding standard clause label set. Contracts can be efficiently archived by category according to the extracted word vector features of the contract text, and the efficiency and accuracy of the clause semantic similarity algorithm are improved.

Preferably, the standard clause label further comprises contract high frequency words obtained by statistics in advance from the training set.

Preferably, before inputting the clauses to be checked in the contract into the clause multi-label classification model, the method further comprises the step of: comparing the same-level headings of the contract to be checked with those of the standard contract, and directly marking a contract to be checked whose headings are inconsistent as a risk contract.

Preferably, the clause semantic similarity calculation model is constructed based on a BERT pre-training model, and a MatchPyramid structure is introduced after the input token layer, so that the context encoding vectors from the token layer are expressed as a two-dimensional matching matrix. With the MatchPyramid structure, matching matrices can be constructed at the word, phrase and sentence levels, which reduces the influence of irrelevant content on the calculation and achieves accurate matching of contract terms.
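To make the idea of a two-dimensional matching matrix concrete, the sketch below builds one from two sequences of token vectors, with entry (i, j) holding the similarity between token i of one clause and token j of the other. The use of cosine similarity as the interaction function and the toy two-dimensional vectors are assumptions for illustration; the source does not state which interaction function is used, and in the described model the vectors would come from BERT and the matrix would then be passed to convolution layers.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matching_matrix(clause_a: List[Vector], clause_b: List[Vector]) -> List[List[float]]:
    """Matrix M with M[i][j] = similarity of token i in A and token j in B."""
    return [[cosine(u, v) for v in clause_b] for u in clause_a]


# Toy token vectors standing in for contextual encodings (hypothetical values).
clause_a = [[1.0, 0.0], [0.0, 1.0]]
clause_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
M = matching_matrix(clause_a, clause_b)
for row in M:
    print([round(x, 3) for x in row])
```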

Preferably, before inputting the clause to be inspected in the contract into the clause multi-label classification model, the method further comprises the step of performing segmentation processing on long clauses exceeding character limits of the clause multi-label classification model, wherein the segmentation points are determined by adopting the following steps: taking the natural segment interval of the long clause as a potential segmentation point; and comparing the similarity of texts at two sides of each potential segmentation point, and taking the potential segmentation point with the highest similarity as a final segmentation point.

A second aspect of the present invention provides a contract term risk inspection system, comprising:

the model training module is configured to construct a clause semantic similarity calculation model;

the similarity calculation module is configured to acquire a contract to be checked, and input terms to be checked in the contract into a term semantic similarity calculation model;

and a risk clause confirming module configured to obtain the semantic similarity of the clause to be checked and the standard clause, wherein the clause with the semantic similarity lower than the set threshold value is determined as the risk clause.

A third aspect of the present invention provides a contract term risk inspection apparatus, comprising: a processor and a memory coupled to the processor; the memory has stored thereon a contract term risk inspection program executable on the processor, which when executed by the processor performs the steps of the contract term risk inspection method as in any of the first aspects.

A fourth aspect of the present invention provides a storage medium having stored thereon a contract term risk inspection program which, when executed by a processor, implements the steps of the contract term risk inspection method according to any one of the first aspects.

According to the scheme provided by the invention, the task of detecting missing contract terms is decomposed into the problems of contract text classification and clause similarity. With linguistics, NLP technology, deep learning technology and construction engineering contract management as the theoretical basis, computer technology is used to realize automatic classification of contract texts, multi-label classification of clauses and calculation of clause semantic similarity; deep learning models such as CNN, LSTM and BERT can implement the corresponding algorithms. The method improves on the traditional experience-based contract review mode, achieves intelligent detection of the risk of missing contract clauses, and provides a feasible solution to long-standing problems in contract risk detection such as reliance on the subjective experience of personnel, low detection efficiency and incomplete risk detection.

Drawings

For convenience of description, only parts related to the related invention are shown in the drawings.

FIG. 1 is an illustration of the problem and solution for which the present invention is directed;

FIG. 2 is an exemplary text standard contract structural form of China;

FIG. 3 is a diagram illustrating text similarity before and after segmentation points corresponding to each potential segmentation point according to an embodiment of the present invention;

FIG. 4 is a graph showing the similarity of potential segmentation points according to an embodiment of the present invention;

FIG. 5 is a diagram of a BERT semantic similarity calculation model based on a MatchPyramid structure in another embodiment of the present invention;

FIG. 6 is a diagram of a contract term detection system architecture in another embodiment of the application;

FIG. 7 is a flowchart illustrating operation of the contract term leak detection system in another embodiment of the application;

FIG. 8 is a schematic diagram of a contract term risk inspection system in accordance with another embodiment of the application.

Detailed Description

The application is described in further detail below with reference to the drawings and examples. The specific embodiments described herein are offered by way of illustration only, and not by way of limitation. Embodiments of the application and features of the embodiments may be combined with each other without conflict.

To improve contract risk detection methods, leading information technologies such as big data, natural language processing (NLP) and deep learning (DL) have in recent years been gradually applied to compliance checking, decision support and other areas of construction engineering projects, and the development of emerging data science provides technical support for breaking away from the inherent mode of contract risk management. As a computer technology for automatically structuring unstructured text, NLP has become a key tool for intelligent contract risk detection. NLP is a comprehensive discipline integrating linguistics and computer science and is also a main research direction of artificial intelligence at present. DL further improves the ability of NLP to extract information and learn corpus features. With computer and AI technology advancing rapidly, using intelligent means to improve contract checking efficiency and reduce the risk of clause defects not only meets the practical need for contract risk control in construction engineering but has also become one of the frontier research directions in contract management. Scholars and companies have begun to use computer technology to manage contracts and mine the knowledge implicit in them: scholars have proposed an NLP-based model for extracting risk items from construction contracts to identify risk items automatically, and an Israeli company has developed the first AI contract review platform, LawGeex, which preliminarily realizes intelligent understanding of contract semantics and automatic risk detection.

The inventors arrived at the present invention through research on the contract terms of Chinese construction engineering. Construction engineering contracts involve many risk factors. The main task of missing-term risk detection is to identify whether contract terms that strongly affect the performance stage are missing, which is also the basis for subsequent analysis of the responsibilities, rights and interests in the terms. Therefore, targeting the risk of missing terms in construction engineering contracts, the invention designs an intelligent missing-term risk detection scheme based on NLP and deep learning, providing a mode that does not depend on manual participation and uses computer technology to detect the risk of missing contract terms intelligently.

FIG. 1 illustrates the problems addressed by the present invention and the corresponding solutions. To solve the prior-art problems shown in FIG. 1, the proposed solution introduces the linguistic theory involved in NLP into construction engineering contract risk management. In a specific embodiment, first, web crawlers and Easy Data Augmentation are used to obtain a large corpus of published contract texts, and NLP techniques such as Chinese word segmentation, stop word removal and text representation are used to convert contract terms into a form that a computer can recognize. Second, a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network are used to extract contract text features automatically, and the two models are fused to improve the performance of the classification model. Finally, a similarity algorithm for construction engineering contract terms is studied: a multidimensional label system for contract terms is designed from a taxonomic perspective; a multi-label classification algorithm for contract terms is designed based on the BERT model to reduce the amount of semantic similarity computation; and the MatchPyramid structure is introduced into the BERT model to design a semantic similarity algorithm for contract terms, which obtains a word feature matching matrix, performs convolution operations on it, and calculates the semantic similarity of terms from the text interaction information. On this technical basis, a prototype system for detecting missing terms in construction engineering contracts is designed.

For preprocessing of the contract text format, multiple NLP techniques are used to convert and process the text of contract clauses. The open-source library Jieba is used for Chinese word segmentation, stop word removal and word frequency statistics on contract clauses; the distributed representation technique Word2vec is used to map contract terms into a word vector space; and, because the BERT model can only process texts of 512 characters or fewer, the TextTiling algorithm is used to compute the cosine similarity of clause text and segment long texts. The contract terms are thereby converted into structured text that a computer can process.

Contract text classification algorithm

This embodiment builds a construction engineering contract classification model. A large corpus of published contract texts is obtained using web crawlers and Easy Data Augmentation. Because the raw contract corpus still consists of large blocks of descriptive content, the Chinese text must be preprocessed to remove invalid words that do not affect the meaning of the text. Preprocessing precedes the text representation of the clauses, and three NLP methods are selected for it: Chinese word segmentation, stop word removal and word frequency statistics.

Chinese text such as contract clauses takes the character as its smallest writing unit and, unlike English, has no explicit delimiters between words. Chinese word segmentation therefore aims to split the character string of a Chinese text automatically into a reasonable word sequence according to certain rules. Stop word removal takes the segmented word set of a sentence and discards conjunctions, pronouns and prepositions, such as 'so' and 'and so', that occur frequently in contract texts but have little influence on text processing. Word frequency statistics computes the frequency or probability with which adjacent words occur together, which measures how credible a candidate word is; the resulting keyword frequency statistics also serve as a useful reference for selecting labels in the subsequent multi-label classification. The system uses Jieba, a third-party Chinese word segmentation library for Python, as the text preprocessing tool.
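The stop word removal and word frequency steps can be sketched with the Python standard library alone. The sketch assumes the clause has already been segmented into words (the source uses Jieba for that step, which is not called here); the tiny stop word list, the sample tokens and the function names are all hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS: Set[str] = {"的", "和", "所以", "因此"}


def remove_stop_words(tokens: List[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop stop words and empty or whitespace-only tokens."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def word_frequencies(tokens: List[str]) -> Counter:
    """Count how often each word occurs."""
    return Counter(tokens)


def adjacent_pair_frequencies(tokens: List[str]) -> Counter:
    """Count how often each pair of adjacent words occurs together."""
    pairs: List[Tuple[str, str]] = list(zip(tokens, tokens[1:]))
    return Counter(pairs)


# Assumed output of a word segmenter for one clause.
tokens = ["承包人", "应", "按照", "合同", "的", "约定", "完成", "工程", "和", "合同", "义务"]
clean = remove_stop_words(tokens)
freq = word_frequencies(clean)
pairs = adjacent_pair_frequencies(clean)
print(freq.most_common(3))
```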

After Chinese NLP preprocessing, the contract is still unstructured text in natural language form. To enable a computer to process the contract text, the system converts it into structured numeric vectors through natural language modeling. Natural language modeling methods fall into three types: rule-based, statistics-based and deep-learning-based. The system selects a deep-learning-based text representation method, which is mainly realized through pre-training. Word2vec has a stable model structure and high training efficiency, performs well at unsupervised training of word vectors on large-scale text corpora, and is the most widely used distributed representation technique among pre-training techniques. The system therefore selects the Word2vec model as the tool for contract text representation.

In a preferred embodiment, a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network are used to extract contract text features automatically, and the two models are fused to improve classification performance. Classifying construction engineering contract texts is essentially a long-text classification task: the text of a contract clause contains not only the local information of the words themselves but also sequence information relating the words to other clauses in the context. The CNN model convolves along the one-dimensional direction of the word vectors with several different convolution kernels; it readily captures local key information in the text and handles short-text classification well. The LSTM model, by contrast, learns contextual sequence information more easily and is more sensitive to the overall context. Because the feature types extracted by the two deep learning models may differ, combining the advantages of the CNN and LSTM models can improve the classification of construction engineering contract texts. Model fusion, also called ensemble learning or a multi-classifier system, is mainly applied to classification and regression tasks. It combines the results of several classification models as required, making the most of each model's strengths, effectively improving generalization and reducing the risk of overfitting.

The system adopts linear weighting, a form of the averaging method. During training, the CNN and LSTM models each output a probability prediction matrix; the probability matrices are averaged with per-model weighting coefficients, and the weights are adjusted dynamically according to the classification results during training. In the model experiments, the fusion model achieved its highest classification accuracy at a CNN-to-LSTM weight ratio of 3:7.
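The linear weighting step can be illustrated with a short sketch that combines two probability matrices using the 3:7 ratio reported above. The probability values are made up for illustration, the function names are hypothetical, and the dynamic weight adjustment during training is not modelled.

```python
from typing import List

Matrix = List[List[float]]


def fuse_probabilities(p_cnn: Matrix, p_lstm: Matrix,
                       w_cnn: float = 0.3, w_lstm: float = 0.7) -> Matrix:
    """Weighted average of two (samples x classes) probability matrices."""
    total = w_cnn + w_lstm
    return [
        [(w_cnn * a + w_lstm * b) / total for a, b in zip(row_cnn, row_lstm)]
        for row_cnn, row_lstm in zip(p_cnn, p_lstm)
    ]


def predict(probabilities: Matrix) -> List[int]:
    """Index of the most probable class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up predictions for two contracts over two classes.
p_cnn = [[0.6, 0.4], [0.2, 0.8]]
p_lstm = [[0.3, 0.7], [0.1, 0.9]]

fused = fuse_probabilities(p_cnn, p_lstm)
labels = predict(fused)
print(fused, labels)
```

For the first sample the CNN alone would choose class 0, but the fused prediction follows the more heavily weighted LSTM and chooses class 1.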

This embodiment analyzes, from the viewpoint of algorithm principles, how the two deep learning models differ in the way they extract text features and in the features they attend to. Linear weighting is used to fuse the CNN and LSTM models, which improves the accuracy of the contract text classification model. Experimental results show that the accuracy, recall and F1 value of the fused classification model reach 0.882, 0.921 and 0.898 respectively, and that, among the construction engineering contract categories, the classification results for construction contracts and for survey and design contracts are better.

Embodiment 2 Multi-pattern multi-tag classification algorithm

In this embodiment, a contract term multidimensional labelling system is designed from a taxonomic perspective through construction engineering contract term similarity algorithm research. A contract term multi-label classification algorithm is designed based on the BERT model, so that the calculated amount of semantic similarity data is reduced.

The whole contract text is split through text segmentation, and whether the contract belongs to standard contract conditions is judged according to the chapter structure.

(1) Determining text segmentation primitives

Sentences, paragraphs or chapters can all serve as the basic unit of text segmentation in NLP tasks. A sentence carries relatively little semantic information, so its role in the context is easily overlooked. A chapter carries a large amount of information, but current conventional NLP models cannot accurately process chapter-level semantics, so the model struggles to capture the details of the text corpus. A paragraph lies between the two in length and balances the amount of information per semantic unit against the feasibility of model implementation. In a construction engineering contract, contract terms are divided by structural level into five parts: chapter (章), section (节), article (条), paragraph (款) and item (项). Chapters and sections group contract clauses of the same type and use first-level and second-level headings numbered with Arabic numerals. The article is the basic unit of a contract: a contract consists of several articles, which use third-level headings numbered with Arabic numerals. The paragraph is the content of an article and takes the form of a natural paragraph of the contract term; each natural paragraph is one paragraph and is not preceded by an Arabic numeral. Items describe the natural-paragraph text of a paragraph in detail by enumeration. Accordingly, this embodiment selects the article as the basic unit for text segmentation and for the subsequent multi-label classification and semantic similarity computation.

(2) Judging contract text structure

By structural characteristics, construction engineering contract texts are divided into standard contract conditions and non-standard contract conditions, which differ greatly in the arrangement of chapters and sections, in language and in other respects. When dividing the basic units, the first-level heading numbers and heading contents of the contract are identified, from which it can be judged whether the contract belongs to the standard contract conditions.

Standard contract conditions consist, in order, of three parts: the contract agreement, the general terms and the special terms. The contract agreement only briefly introduces the two parties and the engineering project and states the objective of the contract; it has little substantive content and mainly comprises basic information about the contracting parties, the project name, the order of interpretation of the contract documents, the parties' signatures and the like. The general terms are the most important part of the contract text: universally applicable terms that reflect engineering management conditions are extracted to form a standardized, independent contract text module. The special terms reflect the particularities of the engineering project and supplement, delete or specially explain the corresponding general terms. The standard contract conditions commonly used in international engineering projects, namely the FIDIC contract conditions, the model texts and the AIA contracts, all adopt this composition; the contract structure is shown in FIG. 2.

(3) Contract long term segmentation

When the BERT model was built, its developers set the maximum length of BERT input data to 512 characters in the position embedding layer to balance model computation against operating efficiency. Text longer than 512 must be truncated, and text shorter than that length must be zero-padded. The system uses the word segmentation function of the Jieba tool to compute the length distribution of the contract text and finds that clause lengths are concentrated mainly in the range of 200 to 300 and that clauses longer than 512 characters account for 1.43% of the training set. Contract clauses exceeding 512 characters must be segmented. Because a contract clause usually contains one or more parallel items, the segmentation position must be chosen so that each of the two resulting parts contains paragraphs (款) and items (项) of similar meaning. The implementation of the algorithm is described below.
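The length handling described here can be sketched as two small helpers: one that decides whether a clause exceeds the 512 limit and therefore needs segmentation, and one that truncates or zero-pads a sequence of token ids to a fixed length. The function names and the pad id of 0 are assumptions for illustration, and special tokens that a real BERT tokenizer adds are deliberately ignored.

```python
from typing import List

MAX_LEN = 512  # input length limit stated in the text


def needs_segmentation(clause: str, max_len: int = MAX_LEN) -> bool:
    """True if the clause is longer than the model's input limit."""
    return len(clause) > max_len


def pad_or_truncate(token_ids: List[int], max_len: int = MAX_LEN, pad_id: int = 0) -> List[int]:
    """Cut sequences that are too long; zero-pad those that are too short."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short = pad_or_truncate([101, 2769, 102])
long_ = pad_or_truncate(list(range(600)))
print(len(short), len(long_), needs_segmentation("条" * 513))
```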

Dividing text units (tokens): most long clauses enumerate in turn how liability is assumed and agreed under each circumstance, and the paragraphs are generally delimited by "period or semicolon + carriage return + serial number". The system uses these delimiters as potential segmentation points and splits the original contract clause text into several pseudo-sentences.
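A regular expression is one way to locate the "period or semicolon + carriage return + serial number" delimiters and split a clause into pseudo-sentences. The pattern below is only a sketch: the serial-number formats it accepts, such as `(1)`, `（1）`, `1.` and `1、`, are assumptions, since the source does not specify the numbering style, and the sample clause is invented.

```python
import re
from typing import List

# A boundary is a line break that follows a period or semicolon (Chinese or
# Western) and is followed by an assumed serial-number format.
BOUNDARY = re.compile(r"(?<=[。；;.])\s*\n(?=\s*(?:[（(]\d+[）)]|\d+[.、]))")


def split_pseudo_sentences(clause: str) -> List[str]:
    """Split a clause at the potential segmentation points."""
    return [part.strip() for part in BOUNDARY.split(clause) if part.strip()]


# Invented example clause with three enumerated paragraphs.
clause = (
    "The contractor shall:\n"
    "1. finish on time;\n"
    "2. ensure quality.\n"
    "3. pay damages."
)
parts = split_pseudo_sentences(clause)
for p in parts:
    print(repr(p))
```

The line break after the introductory colon is not treated as a boundary, so the lead-in stays attached to the first enumerated paragraph.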

(1) Cosine similarity calculation: let the text vectors on the two sides of a potential segmentation point be x = {x₁, x₂, …, x_n} and y = {y₁, y₂, …, y_n}, where n is the number of words obtained after Chinese word segmentation of the contract clause. The cosine of the two vectors is calculated as:

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} = \frac{\sum_{t} w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_{t} w_{t,b_1}^2 \,\sum_{t} w_{t,b_2}^2}}$$

where x_i is the count of word i in the text vector on the left side of the potential segmentation point; y_i is the count of word i in the text vector on the right side of the potential segmentation point; b₁ is the left text block; b₂ is the right text block; t ranges over the words of the text; w_{t,b₁} is the weight of word t in block b₁; and w_{t,b₂} is the weight of word t in block b₂.
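The cosine computation over the two text blocks can be sketched as follows, using raw word counts as the weights. Using plain counts rather than some other weighting is an assumption, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(left_tokens: List[str], right_tokens: List[str]) -> float:
    """Cosine of the word-count vectors of two text blocks (0.0 if one is empty)."""
    left, right = Counter(left_tokens), Counter(right_tokens)
    # Only words present on both sides contribute to the dot product.
    dot = sum(left[w] * right[w] for w in left.keys() & right.keys())
    norm = math.sqrt(sum(c * c for c in left.values())) * \
        math.sqrt(sum(c * c for c in right.values()))
    return dot / norm if norm else 0.0


sim = cosine_similarity(["payment", "payment", "delay"], ["payment", "delay", "delay"])
print(round(sim, 3))
```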

(2) Boundary identification: the text similarity before and after the spacing point corresponding to each potential segmentation point is expressed by a depth value (depth score). Taking FIG. 3 as an example, spacing point g₁ has a highest point on both its left and its right, whereas spacing point g₂ has a highest point only on its right. By the same reasoning, spacing point g₃ is descending on both its left and right sides and has no peak. The depth value corresponding to spacing point gap_i is calculated as:

$$\mathrm{depth}_i = \max\{\mathrm{cosSim}_{i,\mathrm{left}} - \mathrm{cosSim}_i,\ 0\} + \max\{\mathrm{cosSim}_{i,\mathrm{right}} - \mathrm{cosSim}_i,\ 0\}$$

where cosSim_{i,left} is the similarity peak to the left of the spacing point and cosSim_{i,right} is the similarity peak to its right. When the similarity cosSim of a spacing point is at its minimum, the corresponding depth is, conversely, at its maximum. Potential segmentation points with similarity values higher than a given threshold are selected as the text segmentation positions of the original guarantee liability description. In this embodiment, the threshold definition given by Hearst in "Multi-Paragraph Segmentation of Expository Text" is applied:

f(μ,σ)＝μ-σ/2

Where μ and σ are the mean and standard deviation, respectively, of the sequence of depth values.
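A minimal sketch of the depth value and threshold calculation follows. It assumes that the left and right peaks are found by climbing from the interval point while the similarity keeps rising, that σ is the population standard deviation, and that the depth values are the quantities compared against f(μ, σ), as in Hearst's procedure; the function names and the sample similarity sequence are hypothetical.

```python
import statistics

def depth_scores(sims):
    """Depth value of every interval point in a list of similarity values."""
    depths = []
    for i, s in enumerate(sims):
        # Climb left while the similarity does not fall, to find the left peak.
        left = s
        for v in reversed(sims[:i]):
            if v < left:
                break
            left = v
        # Climb right in the same way, to find the right peak.
        right = s
        for v in sims[i + 1:]:
            if v < right:
                break
            right = v
        depths.append(max(left - s, 0) + max(right - s, 0))
    return depths

def select_boundaries(sims):
    """Indices of interval points whose depth exceeds f(mu, sigma) = mu - sigma / 2."""
    depths = depth_scores(sims)
    threshold = statistics.mean(depths) - statistics.pstdev(depths) / 2
    return [i for i, d in enumerate(depths) if d > threshold]

# Hypothetical similarity values at five consecutive interval points.
sims = [0.8, 0.3, 0.7, 0.6, 0.9]
boundaries = select_boundaries(sims)
```

An interval point with no peak on one side contributes 0 for that side, which is what the max{…, 0} terms in the depth formula express.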

Finally, a text segmentation similarity curve of the contract clause is obtained, in which the x-axis represents the sequence of interval points; the values at points c_0 and c_5 are 0, representing the similarity values at the beginning and the end of the construction contract clause. According to the similarity calculation result, if the similarity of the potential division point g_2 of the clause meets the threshold requirement, g_2 is used as a text segmentation point. The resulting similarity curve is shown in Fig. 4.

A multi-dimensional label system for contract clauses is then constructed according to taxonomy and knowledge of the construction engineering contract field, and each chapter clause is labeled with the corresponding labels.

The method combines two approaches, one based on word frequency statistics and one based on contract-domain knowledge, to construct a multi-dimensional label system suitable for construction engineering contracts. High-frequency clause keywords obtained from word frequency statistics are generally the most characteristic of the contract text content. In the TF-IDF word frequency statistics module of the system, stop words without substantive meaning are removed, and part of the clause label keywords are then screened out.
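The keyword screening step can be sketched as follows. This assumes the plain TF-IDF variant (term frequency multiplied by the logarithm of the inverse document frequency); the stop-word list, the function name and the sample clauses are hypothetical and are not taken from the specification.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "shall"}  # hypothetical stop-word list

def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest-weighted words of each tokenised clause."""
    docs = [[w for w in d if w not in STOP_WORDS] for d in docs]
    n = len(docs)
    # Document frequency: number of clauses in which each word appears.
    df = Counter(w for d in docs for w in set(d))
    result = []
    for d in docs:
        tf = Counter(d)
        scores = {w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        result.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return result

clauses = [
    ["the", "contractor", "shall", "pay", "the", "penalty"],
    ["the", "contractor", "shall", "deliver", "materials"],
    ["the", "supervisor", "shall", "inspect", "quality"],
]
keywords = tfidf_keywords(clauses)
```

Words that occur in many clauses, such as "contractor" here, receive a lower weight than words specific to one clause, which is why they are less useful as distinguishing labels.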

Labels obtained from word frequency alone cannot fully and comprehensively represent the characteristics of contract clauses; contract labels also include content such as contract participants, legal relationships and the subject matter of the contract. Considering that construction engineering contract labels are highly related to project management theory and to construction laws and regulations, a method based on contract-domain knowledge is introduced as a conceptual supplement: additional labels are drawn from the general terms of the contract, construction project management knowledge and parts of the Construction Law of the People's Republic of China, thereby enlarging the coverage of the labels.

The subheadings of each chapter and article of the general terms are essentially noun phrases summarizing the clause, and the chapter keywords extracted from these subheadings cover the whole construction process and the matters involved in each stage, which effectively improves the breadth and depth of the label system framework. The system also introduces construction project management knowledge and important provisions of the Construction Law of the People's Republic of China as conceptual supplements. On this basis, the label keywords are grouped into superordinate and subordinate categories using the classification method of taxonomy, yielding a multi-dimensional label system for construction engineering contract clauses with a unified classification hierarchy.

The multi-dimensional clause label system constructed by the system covers the elements of concern in construction engineering contract management and the three project management targets of quality, cost and construction period. It is divided into three levels: 6 first-level labels, 36 second-level labels and 231 third-level labels. A first-level label describes the parties and the related items and behaviors involved in the stage to which the contract clause belongs; a second-level label refines a first-level label, for example, the first-level label 'contract parties and participants' can be subdivided into different roles such as 'employer', 'contractor', 'supervisor' and 'supplier'; a third-level label further specifies the content of a second-level label, and one second-level label typically corresponds to multiple third-level labels.

The first-level labels are defined too broadly, and the third-level labels are so numerous that the computing capacity of the classification model cannot meet the requirements. The system therefore selects the second-level labels as the basis of classification and labels each clause according to the content it describes.

Finally, the training corpus is input in sequence into the word embedding layer, the segment (paragraph) embedding layer and the position embedding layer of the BERT input characterization module, and word-level and sentence-level training is performed in the pre-training layer to acquire the semantic information of the clauses. The training process of the BERT-based multi-label classification algorithm for contract clauses is divided into an input characterization layer, a pre-training layer and a fine-tuning layer.

The input characterization layer (Input Representation Layer) consists of three embedding layers. In the word embedding layer, the word vectors are produced using Chinese character-level segmentation; a position embedding layer is added alongside the segment embedding layer to take word order information into account; and the parameter values of the embedding layers are adjusted continuously during training.

The BERT pre-training layer (Pre-training Layer) comes in two model sizes, BERT-Base and BERT-Large, which differ in the number of Transformer encoder layers (L), the number of heads in the multi-head attention mechanism (A) and the hidden state dimension (H). For the BERT-Base model, L = 12, A = 12 and H = 768, and the total number of parameters is 110M. For the BERT-Large model, L = 24, A = 16 and H = 1024, and the total number of parameters is 340M. The pre-training layer captures word-level and sentence-level text representations through the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, respectively.

The fine-tuning layer (Fine-tuning Layer) aims to further improve the performance of the BERT model on the specified downstream task. The BERT model employs a fine-tuning strategy that converts the multi-label classification problem into binary relevance problems. In a multi-label semantic indexing task with a label space of size q, the text data are input in turn into q base classifiers, and the outputs of the base classifiers are then combined according to a given rule into the label set prediction for the instance. The probability value P output by a classifier is calculated as follows:

P＝Softmax(CW)

where C is the text vector value and W is a new weight parameter introduced in the fine-tuning process. The original weights of the BERT model and the new weight parameter W are updated continuously according to the fine-tuning results, so as to optimize the multi-label classification result of the BERT model. The objective function in the optimization process is the categorical cross-entropy (Categorical Cross Entropy, CCE), calculated as follows:

CCE = -log(P_i)

where P_i is the Softmax output value corresponding to the correct class.
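The two formulas and the binary relevance strategy can be illustrated with a small sketch. It assumes that each base classifier is a two-class Softmax over C·W and that a label is kept when class 1 has probability above 0.5; this combination rule, the label names and the weight values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(c, w):
    """P = Softmax(C W) for a text vector c (length d) and weights w (d x k)."""
    logits = [sum(c[i] * w[i][j] for i in range(len(c))) for j in range(len(w[0]))]
    return softmax(logits)

def cce(p, correct):
    """Categorical cross-entropy: -log of the probability of the correct class."""
    return -math.log(p[correct])

def predict_labels(c, classifiers):
    """Binary relevance: one two-class base classifier per label."""
    return [name for name, w in classifiers.items() if classify(c, w)[1] > 0.5]

c = [1.0, 2.0]  # hypothetical text vector
classifiers = {  # hypothetical weights, one d x 2 matrix per label
    "payment": [[0.0, 1.0], [0.0, 1.0]],
    "quality": [[1.0, 0.0], [1.0, 0.0]],
}
labels = predict_labels(c, classifiers)
loss = cce(classify(c, classifiers["payment"]), 1)
```

The loss approaches 0 as the probability assigned to the correct class approaches 1, which is the behavior the fine-tuning step relies on.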

In this embodiment, to address the current lack of a unified classification standard for multi-label classification algorithms, a multi-dimensional label system for construction engineering contract clauses is constructed on the basis of taxonomy, using word frequency statistics and contract-domain knowledge, and multi-label classification of clauses is realized with the BERT pre-training model. Experimental results show that the accuracy, recall and F1 value of the model reach 0.805, 0.772 and 0.782, respectively. The clause multi-label classification model predicts contract labels of different frequencies well, which verifies the effectiveness of the BERT model in multi-label classification tasks and enables preliminary screening of missing clause labels.

[Embodiment 3] Clause semantic similarity algorithm

The MatchPyramid structure is introduced into the BERT model to design a contract clause semantic similarity algorithm: a word feature matching matrix is obtained, a convolution operation is performed on it, and the semantic similarity of clauses is calculated from the text interaction information.

The MatchPyramid structure embodies the idea of hierarchical matching: a two-dimensional matching matrix is constructed from the dot product or cosine similarity of the word vectors of corresponding words in two texts, which converts the text matching problem into an image recognition task. The MatchPyramid-based BERT model first refines contract text matching into three levels: word-level matching, phrase-level matching and sentence-level matching. After judging the word similarity of two contract clauses, it goes on to judge whether phrases formed by several words have the same meaning, and finally judges whether the semantics of the whole sentences are similar.

Fig. 5 is a block diagram of the BERT semantic similarity calculation model based on the MatchPyramid structure. It comprises the following layers:

(1) Input characterization layer

In the input representation layer (Input Representation Layer), the sentence P = {p_1, p_2, …, p_m} and the sentence Q = {q_1, q_2, …, q_n} are spliced into one sequence D using the special symbols [CLS] and [SEP]. The calculation formula is as follows:

D = {[CLS], p_1, p_2, …, p_m, [SEP], q_1, q_2, …, q_n, [SEP]}

Each character in D passes in turn through Token Embedding, Segment Embedding and Positional Embedding to obtain the corresponding vector information; the three vectors are combined into the embedded representation S of the character, which is then encoded by the BERT model to obtain the context encoding vector h ∈ R^{l×d} of the sequence. The calculation formula is as follows:

h = {h_0, h_1, …, h_{l-1}}

where l is the length of sequence D; h_i is the contextual representation of the i-th character of D; and h_0 is the vector representation of the special symbol [CLS].
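The construction of sequence D and of the segment and position indices that feed the three embeddings can be sketched as follows. The segment numbering (0 for [CLS], P and the first [SEP], 1 for Q and the final [SEP]) follows the usual BERT convention and is an assumption here; the function name and the sample clauses are hypothetical.

```python
def build_input(p_tokens, q_tokens):
    """Splice two token lists into D and derive segment and position indices."""
    tokens = ["[CLS]"] + p_tokens + ["[SEP]"] + q_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], sentence P and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(p_tokens) + 2) + [1] * (len(q_tokens) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

# Hypothetical clauses, split into single Chinese characters.
tokens, segment_ids, position_ids = build_input(list("工期"), list("竣工日"))
```

The length of D is m + n + 3, so h_0 always corresponds to [CLS] regardless of the clause lengths.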

(2) Matching layer

The MatchPyramid structure is mainly applied in the matching layer (Matching Layer) and the matching information extraction layer (Information Extraction Layer). The text-matching input is represented as a two-dimensional matching matrix (Matching Matrix); a spatial vector representation of the sentences is obtained by two-dimensional convolution, and the similarity is calculated by a multi-layer perceptron. The context vectors of the encoding layer are matched to obtain the matching matrix E, in which the element E_{i,j} in row i and column j represents the similarity between the i-th word p_i in P and the j-th word q_j in Q. The matching matrix E is calculated as follows:

E = ξ(h^(p), h^(q))

where ξ is the matching matrix calculation function, which can be computed by three methods: indicator function, cosine similarity and dot product. Experiments show that, compared with the indicator function and cosine similarity methods, the dot product method additionally takes the norms of the word vectors into account and better represents the relationship between word vectors, so the system calculates the matching matrix by dot product. The matching matrix formula for the dot product method is as follows:

$$E_{i,j} = h_i^{(p)} \cdot h_j^{(q)}$$

where h_i^(p) and h_j^(q) are the text vectors of the words p_i and q_j, respectively, and their scalar product is the result of the dot product method.
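A minimal sketch of the dot product matching matrix is given below. The vectors are small hypothetical examples standing in for the BERT context vectors of the two clauses, and the function name is illustrative.

```python
def matching_matrix(h_p, h_q):
    """E[i][j] is the dot product of the i-th vector of P and the j-th vector of Q."""
    return [[sum(a * b for a, b in zip(u, v)) for v in h_q] for u in h_p]

h_p = [[1, 0], [0, 2]]          # hypothetical context vectors of clause P (m = 2)
h_q = [[1, 1], [2, 0], [0, 3]]  # hypothetical context vectors of clause Q (n = 3)
E = matching_matrix(h_p, h_q)
```

The result is an m × n matrix, so it can be treated like a single-channel image by the convolution layers that follow.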

(3) Matching information extraction layer

The way the MatchPyramid structure extracts information from the texts to be matched can be regarded as an image similarity comparison algorithm from the field of computer image processing. In image similarity comparison, shallow convolution kernels capture local information of the picture and deep convolution kernels capture global information; the local information is the signal source of the global information, much as many pixels together display the complete image. By analogy, in semantic similarity calculation, word-level matching information determines phrase-level matching information, and phrase-level matching information determines sentence-level matching information.

(4) Output layer

The output layer is formed by stacking two convolution layers and pooling layers in sequence. The convolution kernel size of the first convolution layer is set to (3, 3, 64) and that of the second convolution layer to (3, 3, 128); the first pooling layer is set to (2, 2), and the second pooling layer is set to global max pooling, generating a vector r_2 whose length equals the number of channels. The convolution calculation formula for extracting matching information at different levels using the multi-layer CNN is as follows:

In the multi-layer perceptron, a_{i,j} is combined with the [CLS] vector, and the similarity probability value of each combined vector is calculated with a Softmax function. The calculation formulas are as follows:

r = [r_2; h_0]

P = softmax(W_r r + b_r)

where W_r and b_r are the model weights obtained in earlier training, and P is the similarity probability value of the combined vector. Finally, the values of the model weights are updated through a loss function L, calculated as follows:

where y_i is the true label class and p_i is the predicted label class.
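The stacking order of the output layer can be illustrated with a single-channel sketch. It assumes unpadded ("valid") convolutions with a ReLU activation and uses one hand-written 3 × 3 kernel in place of the 64 and 128 learned kernels; the input matrix, the kernel, the vector h0 and all function names are hypothetical.

```python
def conv2d_relu(matrix, kernel, bias=0.0):
    """Unpadded 2D convolution followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(matrix) - kh + 1):
        row = []
        for j in range(len(matrix[0]) - kw + 1):
            s = bias + sum(kernel[a][b] * matrix[i + a][j + b]
                           for a in range(kh) for b in range(kw))
            row.append(max(s, 0.0))
        out.append(row)
    return out

def max_pool(matrix, size=2):
    """Non-overlapping size x size max pooling."""
    return [[max(matrix[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(matrix[0]) - size + 1, size)]
            for i in range(0, len(matrix) - size + 1, size)]

def global_max_pool(feature_maps):
    """One value per channel: the maximum of each feature map."""
    return [max(max(row) for row in fm) for fm in feature_maps]

# Hypothetical 8 x 8 matching matrix with a perfect diagonal match.
E = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
kernel = [[1.0 if a == b else 0.0 for b in range(3)] for a in range(3)]

c1 = conv2d_relu(E, kernel)     # first convolution layer
p1 = max_pool(c1)               # first pooling layer, (2, 2)
c2 = conv2d_relu(p1, kernel)    # second convolution layer
r2 = global_max_pool([c2])      # global max pooling, one value per channel
h0 = [0.1, -0.2]                # hypothetical [CLS] vector
r = r2 + h0                     # r = [r2; h0]
```

With real models there would be one feature map per kernel, so r_2 would have 128 entries after the second convolution layer rather than one.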

The clause to be detected and the standard clause set are input into the BERT input characterization layer to obtain the character embedding vectors and perform context vector encoding. The system introduces the MatchPyramid structure into BERT and constructs matching matrices at the word, phrase and sentence levels, which reduces the influence of irrelevant content on the calculation and achieves accurate matching of contract clauses. The final model outputs the clauses whose semantics differ from those of the standard contract clauses, and contract reviewers further judge whether there are omissions.

In this embodiment, the MatchPyramid structure is introduced to replace the Softmax function, which addresses the poor prediction performance of the BERT model in similarity calculation tasks. Matching matrices of contract clauses are formed at the word, phrase and sentence levels, and the text semantic similarity of the contract clauses is obtained by fusing the extracted information with the encoding of the special symbol [CLS]. Experimental results show that the accuracy, recall and F1 value in model testing reach 0.709, 0.698 and 0.703, respectively; the model can complete the semantic similarity calculation between clauses well, thereby realizing intelligent detection of missing clauses. The results also indicate that the information extraction performance of the semantic similarity calculation model is affected by clause length: the longer the clause, the less stable the similarity calculation.

[Embodiment 4] Clause missing detection system

This embodiment designs a prototype system for detecting missing clauses in construction engineering contracts, with the following functional requirements:

(1) Data acquisition function: acquiring contract text data through web crawler technology and other channels, for training the contract classification algorithm and the clause similarity algorithm modules;

(2) Text preprocessing function: performing data augmentation, Chinese word segmentation, text representation and other processing on the contract text, converting the unstructured contract text into structured data;

(3) Contract text classification function: efficiently classifying and archiving contracts of different categories, and training the corresponding text classification models according to the characteristics of the contract types to be processed;

(4) Multi-tag classification function: marking corresponding labels for contract clauses, and preliminarily judging whether label missing exists in the contract clauses;

(5) Semantic similarity calculation function: inputting the clauses to be detected into a semantic similarity algorithm module, and completing contract clause missing risk detection according to semantic information matching results;

(6) Management and usage functions: the users of the system are mainly administrators and individual users. The administrator's scope covers daily updating and maintenance of the underlying data and algorithms of the system, and management of the basic information and operating permissions of individual users. The individual user's scope covers operations such as running contract clause missing risk detection and modifying basic information in the system.

For the architecture of the construction engineering contract clause missing detection prototype system, a browser/server mode is chosen to implement data storage, access and business processing; users access the detection system through a browser, which supports querying problems as well as maintaining and upgrading the system. The overall architecture of the system is shown in Fig. 6. From bottom to top, the detection system is divided into five functional modules: an infrastructure layer, a data resource layer, an algorithm model layer, a business application layer and an interactive presentation layer. Each functional module is described in turn below.

(1) Infrastructure layer

The infrastructure layer is the lowest layer of the contract clause detection system architecture and provides the other layers with basic resources and services such as hardware, software, storage systems and network services. The local computer and the server are placed on the same local area network; after a user stores contract data on the local computer, the data are synchronized to the server via FTP for text processing and file storage.

(2) Data resource layer

The data resource layer consists of a local database, a network database and a clause label library. It is mainly used for adding, deleting, modifying and querying the stored data, and provides real-time data support services for the algorithm model layer and the business application layer. The data in the three databases must be classified and stored so as to meet the detection requirements for clauses from different regions and of different types. The local database is built from the contract texts of actual construction projects in which the research group has participated. The network database is built from construction engineering contracts captured with the Requests crawler library from domestic public legal document websites such as Beida Fabao (PKULaw), from which information such as contract titles and text contents is obtained.

(3) Algorithm model layer

The algorithm model layer is the foundation on which the business application layer implements its processing functions. Its main purpose is to realize data acquisition, contract text preprocessing, text classification and the semantic similarity algorithm by means of web crawlers, NLP technology and deep learning technology. By optimizing the deep learning computation, the algorithm model layer can effectively improve processing efficiency, so that the computer hardware is used to the fullest extent.

(4) Business application layer

The business application layer is the core of the detection system. Its main purpose is to read the contract data stored in the data resource layer according to the operation instructions sent by the interactive presentation layer and to input the data into the corresponding algorithm module. The business application layer mainly comprises three functional modules: text preprocessing, contract classification and similarity calculation. Through the text preprocessing module, the system administrator modifies, maintains and updates the local database and network database information of the data resource layer. When an individual user performs clause missing detection, the contract text to be detected is input into the detection system, and the text preprocessing module applies EDA techniques to augment the text data through random word-level operations. After the CNN and LSTM models of the text classification module have each generated a text feature map, the two sets of feature maps are combined according to the set weights to obtain the contract text classification result. The text segmentation module cuts the contract into multiple contract clauses, taking the article as the basic unit, and performs long-text segmentation on contract clauses exceeding 512 characters. The clauses are then input into the clause multi-label classification model for multi-label prediction and label missing detection. The contract clauses to be detected are further input into the similarity calculation module to judge whether they carry a missing risk caused by unclear semantic expression. After passing through this complete series of system modules, the final clause missing detection result is output in the interface of the interactive presentation layer.
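Two of the steps in this layer, the weighted combination of the CNN and LSTM outputs and the 512-character check that triggers long-text segmentation, can be sketched as follows. As a simplification, the sketch fuses per-class scores rather than the feature maps themselves; the weight value, the scores and the function names are hypothetical.

```python
def fuse(cnn_scores, lstm_scores, cnn_weight=0.5):
    """Weighted combination of two per-class score lists; returns (class index, scores)."""
    fused = [cnn_weight * c + (1 - cnn_weight) * l
             for c, l in zip(cnn_scores, lstm_scores)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused

def needs_segmentation(clause, limit=512):
    """True when a clause exceeds the character limit of the classification model."""
    return len(clause) > limit

# Hypothetical scores for the classes [construction contract, other contract].
best, fused = fuse([0.7, 0.3], [0.4, 0.6])
```

In this example the CNN and the LSTM disagree, and the equal weighting resolves the tie in favor of the first class.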

(5) Interactive presentation layer

The interactive presentation layer is the top layer of the system framework and is mainly used for exchanging business application layer data with administrators and individual users and for outputting the results of operation instructions. It comprises a system management module, a data maintenance module, an algorithm maintenance module and an archiving management module used by the system administrator, and a contract input module and a contract clause missing detection module used by individual users.

Fig. 7 is a flowchart showing the operation of the contract term absence detection system in the present embodiment, specifically:

first, an individual user inputs the contract text to be detected; after text preprocessing and text representation, the text is input into the fused deep learning classification model to obtain the classification result of the contract text, which in this embodiment determines whether the contract text is a construction contract;

then, clause text segmentation is performed on the construction contract text, it is judged whether the contract is a standard contract, and long clauses are segmented. The clauses are input into the multi-label classification model, and the output is compared with the contract clause label library to judge whether any contract clause labels are missing, and the label detection result is output;

finally, the clauses are input into the semantic similarity calculation model, matching matrices are constructed at the word, phrase and sentence levels, and the text similarity is calculated. If the semantic similarity meets the threshold, the contract clause passes detection; if not, the contract clause is output and handed to contract reviewers for further risk evaluation.
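The two decision steps at the end of this flow reduce to a set comparison and a threshold test, sketched below. The label names, clause identifiers, similarity values and the threshold of 0.6 are hypothetical; the specification only states that a set threshold is used.

```python
def missing_labels(predicted, standard):
    """Labels required by the standard label library but absent from the contract."""
    return sorted(set(standard) - set(predicted))

def risk_clauses(similarities, threshold):
    """Clauses whose similarity to the standard clause falls below the threshold."""
    return [clause for clause, sim in similarities.items() if sim < threshold]

standard = {"payment", "quality", "duration"}   # hypothetical standard label set
predicted = {"payment", "duration"}             # labels predicted for the contract
missing = missing_labels(predicted, standard)

similarities = {"clause 3": 0.82, "clause 7": 0.41}  # hypothetical model outputs
flagged = risk_clauses(similarities, threshold=0.6)
```

Both outputs are intended for a human reviewer: a missing label suggests an absent clause, and a flagged clause suggests wording that departs from the standard text.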

This embodiment designs a construction engineering contract clause missing detection prototype system based on NLP and DL technologies, with system function modules and operation flows designed according to the application requirements of the system. The system test results show that the construction engineering contract clause missing detection system designed in this specification is feasible and can identify missing contract clauses.

FIG. 8 is a schematic diagram of a contract term risk inspection system 800 in another embodiment, including:

a model training module 801 configured to construct a clause semantic similarity calculation model;

a similarity calculation module 802 configured to obtain a contract to be inspected, and input terms to be inspected in the contract into a term semantic similarity calculation model;

a risk clause confirmation module 803 configured to obtain a semantic similarity of the clause to be inspected with the standard clause, wherein a clause having a semantic similarity below a set threshold is determined as a risk clause.

The invention designs an overall scheme, based on NLP and deep learning, for detecting the risk of missing clauses in construction engineering contracts. The linguistic theory underlying NLP technology is applied to the field of construction engineering contract risk management, the contract clause missing detection task is decomposed into contract text classification and clause similarity problems, and the deep learning models CNN, LSTM and BERT are used to implement the corresponding algorithms, providing a feasible scheme for improving the traditional, experience-dependent mode of contract review.

While the present application has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application as defined by the appended claims.

Claims

1. A contract term risk inspection method, comprising the steps of:

constructing a clause semantic similarity calculation model;

acquiring a contract to be checked, and inputting the clauses to be checked in the contract into the clause semantic similarity calculation model;

obtaining the semantic similarity of the clause to be checked and the standard clause, wherein the clause with the semantic similarity lower than the set threshold value is determined as the risk clause.

2. The contract term risk inspection method according to claim 1, characterized by further comprising the step of:

constructing a clause multi-label classification model and a standard clause label set, wherein the standard clause label comprises basic elements of a contract;

inputting the clauses to be inspected in the contract to be inspected into the clause multi-label classification model to obtain a clause label set to be inspected;

comparing the to-be-inspected clause label set with the standard clause label set, determining whether the to-be-inspected contract lacks a clause label, and marking a to-be-inspected contract that lacks a clause label as a risk contract.

3. The contract term risk inspection method according to claim 2, characterized by the further step of, before entering the terms to be inspected in the contract into the term semantic similarity calculation model:

presetting a plurality of contract fields, and designing a corresponding standard clause label set for each contract field;

and constructing a contract text classification model, and classifying the contracts to be checked into preset contract fields.

4. The contract term risk inspection method of claim 3, characterized in that the standard clause labels further include contract high-frequency words obtained in advance by statistics on a training set.

5. The contract term risk inspection method according to claim 3, characterized by the further step of, before entering the terms to be inspected in the contract into the term multi-label classification model:

the same-level headings of the contract to be checked and of the standard contract are compared, and a contract to be checked whose headings are inconsistent is directly marked as a risk contract.

6. The contract term risk inspection method according to claim 1, characterized in that the clause semantic similarity calculation model is built on a BERT pre-training model, a MatchPyramid structure is introduced after the input characterization layer, and the context encoding vector output by the input characterization layer is represented as a two-dimensional matching matrix.

7. The contract term risk inspection method according to claim 3, characterized by further comprising, before inputting the terms to be inspected in the contract into the term multi-label classification model, performing a segmentation process on long terms exceeding character limits of the term multi-label classification model, wherein the segmentation points are determined by:

taking the natural segment interval of the long clause as a potential segmentation point;

and comparing the similarity of texts at two sides of each potential segmentation point, and taking the potential segmentation point with the highest similarity as a final segmentation point.

8. A contract term risk inspection system, comprising:

the similarity calculation module is configured to acquire a contract to be checked, and input the clause to be checked in the contract into the clause semantic similarity calculation model;

9. A contract term risk inspection apparatus, characterized by comprising: a processor and a memory coupled to the processor; stored on the memory is a contract term risk inspection program executable on the processor, which when executed by the processor, implements the steps of the contract term risk inspection method of any one of claims 1 to 7.

10. A storage medium having stored thereon a contract term risk inspection program, which when executed by a processor, implements the steps of the contract term risk inspection method of any one of claims 1 to 7.