CN109857993A - An intelligent contract text cleaning system - Google Patents
An intelligent contract text cleaning system
- Publication number
- CN109857993A CN201910002030.9A
- Authority
- CN
- China
- Prior art keywords
- contract
- word vector
- text
- vector model
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses an intelligent contract text cleaning system. The method comprises the following steps: first, a contract word vector model and a general word vector model are trained; then a new contract text is processed, using both models to analyze, clean, and annotate the contract. The invention interprets contract text, recognizes its semantics, and highlights non-contract language and sentences suspected of being custom-drafted, helping lawyers proofread and review contracts quickly and efficiently.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to an intelligent contract text cleaning system.
Background technique
It is well known that contract review and approval is tedious work. To avoid ambiguity, many contracts are drafted very rigorously, and this rigor produces many verbose boilerplate sentences. When reviewing a contract, legal staff face large amounts of such formulaic text, yet must read all of it, while the genuinely valuable content (clauses customized for a particular transaction, or data filled into a standard-form contract) is scarce. A great deal of time and effort is therefore wasted.
Computer programs are well suited to procedural work with fixed logic, and the rapid development of artificial intelligence and big data in recent years has made it increasingly feasible for programs to handle fuzzy tasks. On the domestic market, however, intelligent contract review is still at the rule-based stage. Rule-based schemes can handle common correct and common incorrect contract phrasings, but they are helpless against uncommon phrasings such as rare errors or customized content.
A more common prior-art approach is text filtering by classification: all contract texts serve as positive samples, while news articles, novels, magazines, and random text serve as negative samples. After all the data is segmented into words, paragraph-level bag-of-words features are formed and a regression or classification model is trained. When new test data, that is, a new contract, needs analysis, the bag-of-words features of each paragraph are computed and fed to the model, similar to spam filtering. This method has a major defect, however: if the words of a normal sentence are shuffled and rearranged after segmentation, the resulting bag of words is unchanged, yet the sentence has become disordered garbage text, and the model is powerless to detect this.
Summary of the invention
To overcome the above drawbacks of the prior art, the present invention provides an intelligent contract text cleaning system. The method interprets contract text, recognizes its semantics, and highlights non-contract language and sentences suspected of being custom-drafted, helping legal staff proofread and review contracts quickly and efficiently.
The technical scheme adopted by the invention is as follows. An intelligent contract text cleaning system, the method comprising the following steps:
A) training a contract word vector model;
B) training a general word vector model;
C) processing a new contract text: using the contract word vector model and the general word vector model to analyze, clean, and annotate the contract, comprising the following steps:
(i) converting the full contract text into a target encoding;
(ii) cleaning each paragraph of the contract: first, the paragraph is split into individual sentences at full stops "。", question marks "？", and exclamation marks "！"; then the general generation probability of each sentence is computed with the general word vector model. If the probability falls below a threshold, the sentence is considered abnormal — it either contains typos or is a pile of disordered text — and is flagged "possible textual error". Finally, the contract generation probability of each sentence is computed with the contract word vector model. If that probability falls below a threshold, the sentence is considered non-standard contract language — either filled-in data or a customized clause — and is flagged "customized clause or specific content; requires detailed review by a lawyer";
(iii) computing the general generation probability of an entire paragraph with the general word vector model, defined as the mean of the three smallest sentence generation probabilities within the paragraph; if the paragraph's general generation probability is below a threshold, the whole paragraph is considered to need a lawyer's attention, and the sentence-level prompts within it can be removed;
(iv) computing the contract generation probability of an entire paragraph with the contract word vector model, defined as the mean of the three smallest sentence contract-generation probabilities within the paragraph; if the paragraph's contract generation probability is below a threshold, the whole paragraph is considered to need a lawyer's attention, and the sentence-level prompts within it can be removed.
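The paragraph-level scoring in steps (iii) and (iv) can be sketched as follows; the probability values and the threshold are illustrative only, not values fixed by the invention:

```python
# A minimal sketch of the paragraph-level score: the paragraph's generation
# probability is defined as the mean of the three smallest sentence
# probabilities, and a paragraph below the threshold is flagged for a lawyer.
def paragraph_score(sentence_probs, k=3):
    """Mean of the k smallest sentence generation probabilities."""
    smallest = sorted(sentence_probs)[:k]
    return sum(smallest) / len(smallest)

probs = [0.9, 0.05, 0.8, 0.1, 0.15]   # per-sentence probabilities (toy values)
score = paragraph_score(probs)         # mean of 0.05, 0.1, 0.15 = 0.1
THRESHOLD = 0.2                        # illustrative threshold
print(score, score < THRESHOLD)        # flagged: whole paragraph needs review
```

Taking the mean of only the lowest-scoring sentences makes the paragraph score sensitive to a few anomalous sentences even inside an otherwise normal paragraph.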
As a preferred solution of the present invention, the step of training the contract word vector model comprises: first obtaining a large quantity of contract text data and a comparable quantity of news text data to form a training set; then segmenting the training set into words and removing stop words; and finally training a multilayer perceptron neural network to obtain the contract word vector model.
As a preferred solution of the present invention, the step of training the general word vector model comprises: first obtaining a large quantity of news, novel, and magazine text to form a training set; then segmenting the training set into words and removing stop words; and finally training a multilayer perceptron neural network to obtain the general word vector model.
As a preferred solution of the present invention, the encoding used is GB2312.
Compared with the prior art, the present invention has the following technical effects:
The present invention is an artificial-intelligence system for analyzing and cleaning contract text. The system uses deep learning on a large contract corpus to train a contract word vector model and a general word vector model, then interprets contract text, recognizes its semantics, and highlights non-contract language and sentences suspected of being custom-drafted. This can greatly improve the efficiency of subsequent manual review by lawyers, or improve the accuracy and user experience of automated review.
Brief description of the drawings
Fig. 1 is a flow diagram of training the general word vector model in the intelligent contract text cleaning system of the present invention;
Fig. 2 is a flow diagram of training the contract word vector model in the intelligent contract text cleaning system of the present invention;
Fig. 3 is a flow diagram of sentence-by-sentence cleaning of a contract paragraph in the intelligent contract text cleaning system of the present invention;
Fig. 4 is a flow diagram of paragraph-level cleaning in the intelligent contract text cleaning system of the present invention.
Specific embodiment
Specific embodiments of the present invention are further explained below with reference to the accompanying drawings. It should be noted that the description of these embodiments is intended to aid understanding of the invention and does not limit it. Moreover, the technical features of the embodiments disclosed below can be combined with each other as long as they do not conflict.
An intelligent contract text cleaning system, characterized in that the method comprises the following steps:
A) training a contract word vector model: first, a large quantity of contract text data and a comparable quantity of news text data are obtained to form a training set; the training set is then segmented into words and stop words are removed; finally, a multilayer perceptron neural network is trained to obtain the contract word vector model.
B) training a general word vector model: first, a large quantity of news, novel, and magazine text is obtained to form a training set; the training set is then segmented into words and stop words are removed; finally, a multilayer perceptron neural network is trained to obtain the general word vector model.
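The preprocessing step shared by both training pipelines (word segmentation followed by stop-word removal) can be sketched as follows. A real system would segment Chinese with a tokenizer such as jieba; here the corpus is assumed to be pre-segmented, and the stop-word list is a tiny hypothetical sample:

```python
# Sketch of the preprocessing shared by steps A) and B): stop-word removal
# over a pre-segmented corpus. The stop-word list here is hypothetical; a
# production system would use a full Chinese stop-word list.
STOPWORDS = {"的", "了", "和", "是"}

def preprocess(segmented_sentences):
    """Remove stop words from pre-segmented sentences, dropping empties."""
    cleaned = []
    for tokens in segmented_sentences:
        kept = [t for t in tokens if t not in STOPWORDS]
        if kept:
            cleaned.append(kept)
    return cleaned

corpus = [
    ["甲方", "的", "权利", "和", "义务"],   # "rights and obligations of Party A"
    ["的", "了"],                           # only stop words: dropped entirely
]
print(preprocess(corpus))
```

The cleaned token lists would then be fed to whatever word vector trainer the system uses (the text names a multilayer perceptron; libraries such as gensim's Word2Vec are a common stand-in).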
C) processing a new contract text: the contract word vector model and the general word vector model are used to analyze, clean, and annotate the contract, in the following steps:
(i) the full contract text is converted into the target encoding;
(ii) each paragraph of the contract is cleaned: first, the paragraph is split into individual sentences at full stops "。", question marks "？", and exclamation marks "！"; the general generation probability of each sentence is then computed with the general word vector model. If the probability is below a threshold (usually small, e.g. 0.1), the sentence is considered abnormal — it either contains typos or is a pile of disordered text — and is flagged "possible textual error". Sentences that pass this check are judged further: the contract generation probability of each is computed with the contract word vector model, and if that probability is below a threshold (usually small, e.g. 0.3), the sentence is considered non-standard contract language — either filled-in data or a customized clause — and is flagged "customized clause or specific content; requires detailed review by a lawyer";
(iii) the general generation probability of an entire paragraph is computed with the general word vector model as the mean of the three smallest sentence generation probabilities within the paragraph; if the paragraph's general generation probability is below a threshold (usually small, e.g. 0.2), the whole paragraph is considered to need a lawyer's attention, and the sentence-level prompts inside the paragraph can be removed;
(iv) the contract generation probability of an entire paragraph is computed with the contract word vector model as the mean of the three smallest sentence contract-generation probabilities within the paragraph; if the paragraph's contract generation probability is below a threshold (usually small, e.g. 0.5), the whole paragraph is considered to need a lawyer's attention, and the sentence-level prompts inside the paragraph can all be removed.
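Steps (i) and (ii) above can be sketched as follows. The two probability functions stand in for the trained general and contract word vector models, and the thresholds are the illustrative values from the text (0.1 and 0.3):

```python
import re

# Sketch of the per-sentence cleaning of step (ii): split a paragraph into
# sentences at 。？！ and flag each sentence against the two thresholds.
def split_sentences(paragraph):
    """Split on Chinese sentence-final punctuation, keeping non-empty parts."""
    return [s for s in re.split("[。？！]", paragraph) if s.strip()]

def review(paragraph, general_prob, contract_prob,
           general_thresh=0.1, contract_thresh=0.3):
    flags = []
    for sent in split_sentences(paragraph):
        if general_prob(sent) < general_thresh:
            flags.append((sent, "possible textual error"))
        elif contract_prob(sent) < contract_thresh:
            flags.append((sent, "customized clause, lawyer review"))
        else:
            flags.append((sent, "ok"))
    return flags

# Toy stand-in models: pretend very short sentences are garbled text.
demo = review("甲方应按时付款。xk qz！",
              general_prob=lambda s: 0.9 if len(s) > 5 else 0.05,
              contract_prob=lambda s: 0.8)
print(demo)
```

In the real system the lambdas would be replaced by sentence-probability queries against the two trained models.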
As a preferred solution of the present invention, the encoding used is GB2312.
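As a minimal illustration of step (i), Python's standard codecs include gb2312, so the conversion can be done directly; characters outside the GB2312 repertoire would raise UnicodeEncodeError, so real input may need an error-handling policy such as errors="replace":

```python
# Converting contract text to GB2312 bytes and back, using Python's
# built-in gb2312 codec. The sample string is illustrative.
text = "合同条款"                 # "contract clauses"
encoded = text.encode("gb2312")   # bytes in the GB2312 encoding
decoded = encoded.decode("gb2312")
print(decoded == text)            # round-trips losslessly
```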
The models used above to compute generation probabilities are generally the N-gram model, the Word2vec model, and the ELMo model.
N-gram model: an n-gram is a statistical language model that uses the preceding (n-1) items to predict the n-th item. In applications, these items can be phonemes (speech recognition), characters (input methods), or words (word segmentation). N-gram models are usually built from large-scale text or audio corpora. Traditionally, a 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram; 4-grams, 5-grams, and beyond exist, but models with n > 5 are rarely used. Because the computational and data requirements are enormous, a Markov assumption is introduced: the probability of an item depends only on the m items before it. With m = 0 this is the unigram model; with m = 1, the bigram model. P(T) can then be computed; for example, with a bigram model, P(T) = P(A1) P(A2|A1) P(A3|A2) ... P(An|An-1), where the conditional probability P(An|An-1) is obtained by maximum-likelihood estimation as Count(An-1, An) / Count(An-1).
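The maximum-likelihood estimate above can be sketched as follows, with a toy two-sentence corpus (the tokens are illustrative):

```python
from collections import Counter

# A minimal bigram model: conditional probabilities by maximum-likelihood
# estimation, P(b|a) = Count(a, b) / Count(a).
def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def cond_prob(unigrams, bigrams, a, b):
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0

# Toy corpus: "Party A pays the price" / "Party A pays the deposit"
corpus = [["甲方", "支付", "价款"], ["甲方", "支付", "定金"]]
uni, bi = train_bigram(corpus)
print(cond_prob(uni, bi, "甲方", "支付"))  # 2/2 = 1.0
print(cond_prob(uni, bi, "支付", "价款"))  # 1/2 = 0.5
```

A sentence probability P(T) would then be the product of these bigram terms; a production model would also need smoothing for unseen bigrams.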
Word2vec model: 1. assume each word in the vocabulary corresponds to a continuous feature vector; 2. assume a continuous, smooth probabilistic model that, given a sequence of word vectors, outputs the joint probability of the sequence; 3. learn the word vector weights and the parameters of the probabilistic model simultaneously.
A simple feed-forward neural network f(wt-n+1, ..., wt) is used to fit the conditional probability p(wt | w1, w2, ..., wt-1).
The network can be understood in two parts. The first is a linear embedding layer, which maps the N-1 input one-hot word vectors to N-1 distributed word vectors through a shared D × V matrix C, where V is the size of the dictionary and D is the dimension of the embedding vectors (a hyperparameter). The word vectors to be learned are stored in the matrix C. The second part is a simple feed-forward network g, consisting of a tanh hidden layer and a softmax output layer. It maps the N-1 word vectors output by the embedding layer to a probability distribution vector of length V, estimating the conditional probability of each word in the dictionary given the input context: p(wi | w1, w2, ..., wt-1) ≈ f(wi, wt-1, ..., wt-n+1) = g(wi, C(wt-n+1), ..., C(wt-1)). When using word2vec to compute a probabilistic model, the negative sampling algorithm cannot be used; hierarchical softmax is used instead.
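The two-part network described above can be sketched with toy, untrained weights; the point is only the shape of the computation (embedding lookup, tanh hidden layer, softmax over the vocabulary), and all dimensions and values here are illustrative:

```python
import math

# Toy forward pass of the feed-forward language model described above.
V, D, H = 4, 3, 2          # vocab size, embedding dim, hidden units (toy)
C = [[0.1 * (i + j) for j in range(D)] for i in range(V)]   # V x D embeddings

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(context_ids, W_h, W_o):
    # embedding lookup: concatenate the context word vectors
    x = [v for i in context_ids for v in C[i]]
    # tanh hidden layer
    h = [math.tanh(sum(xi * w for xi, w in zip(x, col))) for col in W_h]
    # softmax output layer: a distribution over the V dictionary words
    return softmax([sum(hi * w for hi, w in zip(h, row)) for row in W_o])

W_h = [[0.1] * (2 * D) for _ in range(H)]   # toy hidden weights
W_o = [[0.2] * H for _ in range(V)]         # toy output weights
probs = forward([0, 1], W_h, W_o)
print(len(probs), round(sum(probs), 6))     # a valid distribution over V words
```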
ELMo model: ELMo's biggest improvement over word2vec is that it uses surrounding context to disambiguate polysemous words. ELMo is a combination of the multilayer representations of a bidirectional language model (biLM). Trained on large amounts of text, the ELMo representations are learned from the internal states of a deep bidirectional language model, which is composed of a forward and a backward language model; the objective function is the maximum likelihood of the two directional language models. Once this language model is pre-trained, it can likewise compute the generation probability of a sentence.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to restrict it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described there or make equivalent replacements of some of the technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included within its scope of protection.
Claims (4)
1. An intelligent contract text cleaning system, characterized in that the method comprises the following steps:
A) training a contract word vector model;
B) training a general word vector model;
C) processing a new contract text: using the contract word vector model and the general word vector model to analyze, clean, and annotate the contract, comprising the following steps:
(i) converting the full contract text into a target encoding;
(ii) cleaning each paragraph of the contract: first splitting the paragraph into individual sentences at full stops "。", question marks "？", and exclamation marks "！"; then computing the general generation probability of each sentence with the general word vector model — if the probability is below a threshold, the sentence is considered abnormal, either containing typos or consisting of disordered text, and is flagged "possible textual error"; finally computing the contract generation probability of each sentence with the contract word vector model — if that probability is below a threshold, the sentence is considered non-standard contract language, either filled-in data or a customized clause, and is flagged "customized clause or specific content; requires detailed review by a lawyer";
(iii) computing the general generation probability of an entire paragraph with the general word vector model as the mean of the three smallest sentence generation probabilities within the paragraph; if the paragraph's general generation probability is below a threshold, the whole paragraph is considered to need a lawyer's attention and the sentence-level prompts within it can be removed;
(iv) computing the contract generation probability of an entire paragraph with the contract word vector model as the mean of the three smallest sentence contract-generation probabilities within the paragraph; if the paragraph's contract generation probability is below a threshold, the whole paragraph is considered to need a lawyer's attention and the sentence-level prompts within it can be removed.
2. The intelligent contract text cleaning system according to claim 1, characterized in that the step of training the contract word vector model comprises: first obtaining a large quantity of contract text data and a comparable quantity of news text data to form a training set; then segmenting the training set into words and removing stop words; and finally training a multilayer perceptron neural network to obtain the contract word vector model.
3. The intelligent contract text cleaning system according to claim 1, characterized in that the step of training the general word vector model comprises: first obtaining a large quantity of news, novel, and magazine text to form a training set; then segmenting the training set into words and removing stop words; and finally training a multilayer perceptron neural network to obtain the general word vector model.
4. The intelligent contract text cleaning system according to claim 1, characterized in that the encoding used is GB2312.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910002030.9A CN109857993A (en) | 2019-01-02 | 2019-01-02 | An intelligent contract text cleaning system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109857993A true CN109857993A (en) | 2019-06-07 |
Family
ID=66893646
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910002030.9A Pending CN109857993A (en) | 2019-01-02 | 2019-01-02 | A kind of contract text system for washing intelligently |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857993A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688847A (en) * | 2019-08-23 | 2020-01-14 | 上海市研发公共服务平台管理中心 | Technical contract determination method, device, computer equipment and storage medium |
CN110705280A (en) * | 2019-08-23 | 2020-01-17 | 上海市研发公共服务平台管理中心 | Technical contract approval model creation method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||