CN109857993A - An intelligent contract text cleaning system - Google Patents

An intelligent contract text cleaning system

Info

Publication number
CN109857993A
Authority
CN
China
Prior art keywords
contract
word vector
text
vector model
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910002030.9A
Other languages
Chinese (zh)
Inventor
尚宏金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dazhuang Forensic Technology Co Ltd
Original Assignee
Shenzhen Dazhuang Forensic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dazhuang Forensic Technology Co Ltd filed Critical Shenzhen Dazhuang Forensic Technology Co Ltd
Priority to CN201910002030.9A priority Critical patent/CN109857993A/en
Publication of CN109857993A publication Critical patent/CN109857993A/en
Pending legal-status Critical Current

Links

Abstract

The invention discloses an intelligent contract text cleaning system. The method comprises the following steps: first, train a contract word vector model and a general word vector model; then process a new contract text, performing text analysis, cleaning, and annotation on the contract using the contract word vector model and the general word vector model. The present invention interprets the contract text, recognizes its semantics, and highlights non-contract language and suspected customized contract clauses, assisting lawyers to efficiently and quickly complete contract proofreading, review, and other work.

Description

An intelligent contract text cleaning system
Technical field
The present invention relates to the field of artificial intelligence, and in particular to an intelligent contract text cleaning system.
Background technique
As is well known, contract review and approval work is very dull and monotonous. During drafting, the text of many contracts is written very rigorously to avoid ambiguity, and this excessive rigor produces many verbose sentences. When reviewing contracts, legal staff face large amounts of text in such formats and must read all of it, while the truly valuable content (clauses customized for a particular transaction, or data filled into a form contract) is scarce. This wastes a great deal of time and effort.
Computer programs are well suited to procedural work with fixed logic. With the rapid development of artificial intelligence and big data in recent years, it has become increasingly feasible for programs to handle fuzzy tasks. Intelligent contract review on the domestic market is still at the rule-based stage; such rule-based schemes can handle common correct and common incorrect contract wordings, but are helpless against uncommon wordings such as rare errors or customized content.
A more common prior-art approach is text filtering by classification: all contract texts serve as positive samples, while news, novels, magazines, and random text serve as negative samples. After all the data is segmented into words, paragraph-level bag-of-words features are formed and a regression or classification model is trained. When new test data — a new contract — needs analysis, the bag-of-words features of each paragraph are likewise obtained and classified or scored with the model, much like spam filtering. However, this method has a significant defect: if the words of a normal sentence are shuffled and rearranged after segmentation, the resulting bag of words is identical, yet the sentence has become disordered garbage text, against which the model is powerless.
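The order-blindness defect just described is easy to demonstrate: a shuffled sentence produces exactly the same bag of words as the original (an illustrative sketch; the token list is invented for the example):

```python
from collections import Counter
import random

def bag_of_words(tokens):
    """Paragraph-level bag of words: word -> count, with order discarded."""
    return Counter(tokens)

# A normal, already word-segmented sentence ...
sentence = ["the", "buyer", "shall", "pay", "the", "seller", "within", "thirty", "days"]

# ... and the same tokens shuffled into disordered garbage text.
shuffled = sentence[:]
random.seed(0)
random.shuffle(shuffled)

# Both yield the identical bag of words, so any bag-of-words
# classifier assigns them the same score.
assert bag_of_words(sentence) == bag_of_words(shuffled)
```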
Summary of the invention
To overcome the above drawbacks of the prior art, the present invention provides an intelligent contract text cleaning system. The method interprets the contract text, recognizes its semantics, and highlights non-contract language and suspected customized contract clauses, assisting legal workers to efficiently and quickly complete contract proofreading, review, and other work.
The technical scheme adopted by the invention is as follows: an intelligent contract text cleaning system, the method comprising the following steps:
A) train a contract word vector model;
B) train a general word vector model;
C) process a new contract text, performing text analysis, cleaning, and annotation on the contract using the contract word vector model and the general word vector model, comprising the following steps:
(i) convert the full contract text into the target encoding;
(ii) clean each paragraph of the contract: first, decompose the whole paragraph at full stops "。", question marks "？", and exclamation marks "！" into individual sentences; then compute the general generating probability of each sentence with the general word vector model — if the probability is below a certain threshold, the sentence is considered abnormal (it either contains typos or is a pile of disordered text) and is flagged with the prompt "note: possible textual error"; finally, compute the contract generating probability of each sentence with the contract word vector model — if the probability is below a certain threshold, the sentence is considered non-standard contract language (it either contains data filled into a blank, or is a customized contract clause) and is flagged with the prompt "customized clause or specific content; a lawyer needs to review it in detail";
(iii) compute the general generating probability of the entire paragraph with the general word vector model, taken as the mean of the three smallest sentence-level generating probabilities within the paragraph; if the paragraph's general generating probability is below a certain threshold, the whole paragraph is considered to need a lawyer's close attention, and the sentence-level prompts within the paragraph can be removed;
(iv) compute the contract generating probability of the entire paragraph with the contract word vector model, taken as the mean of the three smallest sentence-level contract generating probabilities within the paragraph; if the paragraph's contract generating probability is below a certain threshold, the whole paragraph is considered to need a lawyer's close attention, and the sentence-level prompts within the paragraph can be removed.
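The sentence-level cleaning of step (ii) can be sketched roughly as follows. This is only an illustration: the two probability functions are invented stand-ins for the trained word vector models, and the thresholds follow the illustrative values given in the detailed embodiment:

```python
import re

# Hypothetical stand-ins for the trained general and contract word vector
# models; a real system would score sentences with the trained language models.
def general_prob(sentence):
    return 0.05 if "qzx" in sentence else 0.8   # gibberish scores low

def contract_prob(sentence):
    return 0.2 if "特别约定" in sentence else 0.7  # customized clause scores low

GENERAL_T, CONTRACT_T = 0.1, 0.3  # illustrative thresholds

def clean_paragraph(paragraph):
    # Decompose the paragraph at 。？！ into individual sentences.
    sentences = [s for s in re.split(r"[。？！]", paragraph) if s.strip()]
    results = []
    for s in sentences:
        if general_prob(s) < GENERAL_T:
            results.append((s, "note: possible textual error"))
        elif contract_prob(s) < CONTRACT_T:
            results.append((s, "customized clause or specific content; lawyer review needed"))
        else:
            results.append((s, "ok"))
    return results
```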
As a preferred solution of the present invention, the step of training the contract word vector model comprises: first, obtain a large amount of contract text data and an equal amount of news text data to form a training set; then segment the training data into words and remove stop words; finally, train a multilayer perceptron neural network to obtain the contract word vector model.
As a preferred solution of the present invention, the step of training the general word vector model comprises: first, obtain a large amount of news, novel, and magazine text to form a training set; then segment the training data into words and remove stop words; finally, train a multilayer perceptron neural network to obtain the general word vector model.
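The preprocessing shared by both training steps — word segmentation followed by stop-word removal — might look like the following. The whitespace tokenizer and the stop-word list are placeholders; a real system would use a proper Chinese word segmenter:

```python
# Illustrative stop-word list (a real list would be much larger).
STOP_WORDS = {"的", "了", "和", "是"}

def preprocess(text):
    """Segment into words, then remove stop words."""
    tokens = text.split()  # placeholder for a real Chinese segmenter
    return [t for t in tokens if t not in STOP_WORDS]

# Toy corpus, pre-segmented with spaces for the example.
corpus = ["甲方 和 乙方 签订 本 合同", "本 合同 的 条款 如下"]
training_set = [preprocess(doc) for doc in corpus]
```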
As a preferred solution of the present invention, the encoding used is GB2312.
Compared with the prior art, the present invention has the following technical effects:
The present invention is an artificial-intelligence system for analyzing and cleaning contract text. The system uses a large amount of contract data to train a contract word vector model and a general word vector model with deep learning methods, then interprets the contract text, recognizes its semantics, and highlights non-contract language and suspected customized contract clauses. This can greatly improve the efficiency of subsequent manual review by lawyers, or provide accuracy and a better user experience for automated review.
Detailed description of the invention
Fig. 1 is a flow diagram of training the general word vector model in the intelligent contract text cleaning system of the present invention;
Fig. 2 is a flow diagram of training the contract word vector model in the intelligent contract text cleaning system of the present invention;
Fig. 3 is a flow diagram of sentence-by-sentence cleaning of a contract paragraph in the intelligent contract text cleaning system of the present invention;
Fig. 4 is a flow diagram of paragraph-level cleaning in the intelligent contract text cleaning system of the present invention.
Specific embodiment
Specific embodiments of the present invention are further explained below with reference to the accompanying drawings. It should be noted that the description of these embodiments is intended to help understand the present invention, not to limit it. In addition, the technical features involved in the embodiments disclosed below can be combined with each other as long as they do not conflict.
An intelligent contract text cleaning system, characterized in that the method comprises the following steps:
A) train the contract word vector model: first, obtain a large amount of contract text data and an equal amount of news text data to form a training set; then segment the training data into words and remove stop words; finally, train a multilayer perceptron neural network to obtain the contract word vector model.
B) train the general word vector model: first, obtain a large amount of news, novel, and magazine text to form a training set; then segment the training data into words and remove stop words; finally, train a multilayer perceptron neural network to obtain the general word vector model.
C) process a new contract text, performing text analysis, cleaning, and annotation on the contract using the contract word vector model and the general word vector model, comprising the following steps:
(i) convert the full contract text into the target encoding;
(ii) clean each paragraph of the contract: first, decompose the whole paragraph at full stops "。", question marks "？", and exclamation marks "！" into individual sentences; then compute the general generating probability of each sentence with the general word vector model — if the probability is below a certain threshold (usually small, e.g. 0.1), the sentence is considered abnormal (it either contains typos or is a pile of disordered text) and is flagged with the prompt "note: possible textual error"; for sentences that pass this check, a further judgment is made: compute the contract generating probability of each sentence with the contract word vector model — if the probability is below a certain threshold (usually small, e.g. 0.3), the sentence is considered non-standard contract language (it either contains data filled into a blank, or is a customized contract clause) and is flagged with the prompt "customized clause or specific content; a lawyer needs to review it in detail";
(iii) compute the general generating probability of the entire paragraph with the general word vector model, taken as the mean of the three smallest sentence-level generating probabilities within the paragraph; if the paragraph's general generating probability is below a certain threshold (usually small, e.g. 0.2), the whole paragraph is considered to need a lawyer's close attention, and the sentence-level prompts within the paragraph can be removed;
(iv) compute the contract generating probability of the entire paragraph with the contract word vector model, taken as the mean of the three smallest sentence-level contract generating probabilities within the paragraph; if the paragraph's contract generating probability is below a certain threshold (usually small, e.g. 0.5), the whole paragraph is considered to need a lawyer's close attention, and all the sentence-level prompts within the paragraph can be removed.
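The paragraph-level scoring of steps (iii) and (iv) — the mean of the three smallest sentence probabilities, compared against a threshold — can be sketched as follows (the probability values are invented for the example):

```python
def paragraph_score(sentence_probs):
    """Mean of the (up to) three smallest sentence-level generating
    probabilities in a paragraph, per steps (iii) and (iv)."""
    smallest = sorted(sentence_probs)[:3]
    return sum(smallest) / len(smallest)

# Five sentence-level general generating probabilities for one paragraph.
probs = [0.9, 0.05, 0.8, 0.1, 0.15]
score = paragraph_score(probs)           # (0.05 + 0.1 + 0.15) / 3 = 0.1

PARAGRAPH_THRESHOLD = 0.2                # illustrative general-model threshold
needs_lawyer = score < PARAGRAPH_THRESHOLD   # True: flag the whole paragraph
```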
As a preferred solution of the present invention, the encoding used is GB2312.
The models used above to compute generating probabilities are generally the N-gram model, the Word2vec model, and the ELMo model.
N-gram model: an n-gram is a statistical language model that uses the preceding (n-1) items to predict the n-th item. In applications, these items can be phonemes (speech recognition), characters (input methods), or words (word segmentation). In general, an n-gram model is built from a large text or speech corpus. Traditionally, a 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram; four-grams, five-grams, and so on also exist, but n > 5 is rarely used. Because the computation and data requirements are huge, the Markov assumption is introduced: the probability of an item depends only on the preceding m items — m = 0 gives the unigram model, m = 1 the bigram model. P(T) can then be computed; for example, with the bigram model, P(T) = P(A1)P(A2|A1)P(A3|A2)…P(An|An-1), where each conditional probability P(An|An-1) is obtained by maximum-likelihood estimation as Count(An-1, An)/Count(An-1).
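A minimal bigram model with maximum-likelihood counts, matching the formula above (the toy corpus is invented; a real model would also need smoothing for unseen bigrams):

```python
from collections import Counter

# Toy pre-segmented corpus; "<s>" marks sentence start.
corpus = [["<s>", "a", "b", "a", "b"], ["<s>", "a", "b", "b"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_cond(w, prev):
    """P(w | prev) = Count(prev, w) / Count(prev), by maximum likelihood."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(sent):
    """P(T) = P(A2|A1) * P(A3|A2) * ... * P(An|An-1)."""
    p = 1.0
    for i in range(1, len(sent)):
        p *= p_cond(sent[i], sent[i - 1])
    return p
```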
Word2vec model: it rests on three assumptions: 1. each word in the vocabulary corresponds to a continuous feature vector; 2. there exists a continuous, smooth probability model that, given a sequence of word vectors, outputs the joint probability of that sequence; 3. the word vector weights and the parameters of the probability model are learned simultaneously.
A simple feed-forward neural network f(wt-n+1, …, wt) is used to fit the conditional probability p(wt | w1, w2, …, wt-1) of a word sequence.
The neural network can be understood in two parts. First is a linear embedding layer, which maps the N-1 input one-hot word vectors to N-1 distributed word vectors through a shared D × V matrix C, where V is the vocabulary size and D is the embedding dimension (a hyperparameter); the word vectors to be learned are stored in the matrix C. Next is a simple feed-forward neural network g, consisting of a tanh hidden layer and a softmax output layer, which maps the N-1 word vectors output by the embedding layer to a probability-distribution vector of length V, thereby estimating the conditional probability of each word in the vocabulary given the input context: p(wi | w1, w2, …, wt-1) ≈ f(wi, wt-1, …, wt-n+1) = g(wi, C(wt-n+1), …, C(wt-1)). When word2vec is used to compute the probability model, the negative-sampling algorithm cannot be used; hierarchical softmax must be used instead.
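The shape of this computation — embedding lookup, tanh hidden layer, softmax over the vocabulary — can be sketched with random weights. This is a toy illustration only: nothing is trained, and the sizes are invented; it just shows that the output is a valid probability distribution over the vocabulary:

```python
import math
import random

random.seed(1)
V, D, H, N_CTX = 5, 3, 4, 2   # vocab size, embedding dim, hidden units, context words

# Embedding matrix C (one D-dim vector per vocabulary word) and layer weights.
C = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
W_h = [[random.uniform(-1, 1) for _ in range(N_CTX * D)] for _ in range(H)]
W_o = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(V)]

def next_word_distribution(context_ids):
    x = [v for wid in context_ids for v in C[wid]]        # concatenate embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_h]  # tanh layer
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W_o]
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]                          # softmax over vocabulary

dist = next_word_distribution([0, 3])
```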
ELMo model: ELMo's biggest improvement over word2vec is that it disambiguates words based on their context. ELMo combines the multi-layer representations of a bidirectional language model (biLM). Trained on large amounts of text, the ELMo model is learned from the internal states of a deep bidirectional language model, which consists of a forward and a backward language model; the objective function is to maximize the joint likelihood of the two directional language models. Once such a language model is pre-trained, it can likewise compute the generating probability of a sentence.
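Under the biLM objective described above, a sentence's score is the sum of the forward and backward log-likelihoods. Given per-token conditional probabilities from each directional model (hard-coded stand-ins here; a real biLM would produce them), the combination is simply:

```python
import math

def bilm_log_likelihood(forward_probs, backward_probs):
    """Sum of forward and backward per-token log-probabilities,
    mirroring the biLM joint-likelihood objective."""
    return (sum(math.log(p) for p in forward_probs) +
            sum(math.log(p) for p in backward_probs))

# Invented per-token conditional probabilities for a two-token sentence.
ll = bilm_log_likelihood([0.5, 0.25], [0.5, 0.5])
```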
Finally, it should be noted that the foregoing are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or make equivalent replacements of some of the technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its protection scope.

Claims (4)

1. An intelligent contract text cleaning system, characterized in that the method comprises the following steps:
A) train a contract word vector model;
B) train a general word vector model;
C) process a new contract text, performing text analysis, cleaning, and annotation on the contract using the contract word vector model and the general word vector model, comprising the following steps:
(i) convert the full contract text into the target encoding;
(ii) clean each paragraph of the contract: first, decompose the whole paragraph at full stops "。", question marks "？", and exclamation marks "！" into individual sentences; then compute the general generating probability of each sentence with the general word vector model — if the probability is below a certain threshold, the sentence is considered abnormal (it either contains typos or is a pile of disordered text) and is flagged with the prompt "note: possible textual error"; finally, compute the contract generating probability of each sentence with the contract word vector model — if the probability is below a certain threshold, the sentence is considered non-standard contract language (it either contains data filled into a blank, or is a customized contract clause) and is flagged with the prompt "customized clause or specific content; a lawyer needs to review it in detail";
(iii) compute the general generating probability of the entire paragraph with the general word vector model, taken as the mean of the three smallest sentence-level generating probabilities within the paragraph; if the paragraph's general generating probability is below a certain threshold, the whole paragraph is considered to need a lawyer's close attention, and the sentence-level prompts within the paragraph can be removed;
(iv) compute the contract generating probability of the entire paragraph with the contract word vector model, taken as the mean of the three smallest sentence-level contract generating probabilities within the paragraph; if the paragraph's contract generating probability is below a certain threshold, the whole paragraph is considered to need a lawyer's close attention, and the sentence-level prompts within the paragraph can be removed.
2. The intelligent contract text cleaning system according to claim 1, characterized in that the step of training the contract word vector model comprises: first, obtaining a large amount of contract text data and an equal amount of news text data to form a training set; then segmenting the training data into words and removing stop words; finally, training a multilayer perceptron neural network to obtain the contract word vector model.
3. The intelligent contract text cleaning system according to claim 1, characterized in that the step of training the general word vector model comprises: first, obtaining a large amount of news, novel, and magazine text to form a training set; then segmenting the training data into words and removing stop words; finally, training a multilayer perceptron neural network to obtain the general word vector model.
4. The intelligent contract text cleaning system according to claim 1, characterized in that the encoding used is GB2312.
CN201910002030.9A 2019-01-02 2019-01-02 An intelligent contract text cleaning system Pending CN109857993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910002030.9A CN109857993A (en) 2019-01-02 2019-01-02 An intelligent contract text cleaning system


Publications (1)

Publication Number Publication Date
CN109857993A true CN109857993A (en) 2019-06-07

Family

ID=66893646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910002030.9A Pending CN109857993A (en) An intelligent contract text cleaning system

Country Status (1)

Country Link
CN (1) CN109857993A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688847A (en) * 2019-08-23 2020-01-14 上海市研发公共服务平台管理中心 Technical contract determination method, device, computer equipment and storage medium
CN110705280A (en) * 2019-08-23 2020-01-17 上海市研发公共服务平台管理中心 Technical contract approval model creation method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination