CN110442842A - Method and device for extracting contract content, computer equipment, and storage medium
- Publication number
- CN110442842A (CN201910534911.5A)
- Authority
- CN
- China
- Prior art keywords
- contract
- text
- participle
- type
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention provide a method and device for extracting contract content, computer equipment, and a storage medium. In one aspect, the method comprises: determining a target contract text to be identified; identifying the contract type of the target contract text using an identification model; and extracting the structural clause content in the target contract text according to the contract type. The invention solves the technical problem in the prior art of low efficiency when extracting contract content on a large scale.
Description
[Technical field]
The present invention relates to the computer field, and in particular to a method and device for extracting contract content, computer equipment, and a storage medium.
[Background]
Text recognition is a common operation in artificial intelligence. It can replace manual screening of documents and improve working efficiency.
In the prior art, there is as yet no product for automatically identifying and classifying contract clauses. Existing approaches handle only contract texts in a standard format and classify them by that fixed format, and intelligent contract classification products are rare. This requires every text to be identified to follow a unified format, which is nearly impossible in complex big-data processing and analysis. For contract texts of different or unknown types, the text can only be divided manually into text blocks one by one, after which content is extracted from the known text blocks. This requires a large amount of manual intervention and seriously affects working efficiency.
No effective solution to the above problem in the related art has yet been found.
[Summary of the invention]
In view of this, embodiments of the invention provide a method and device for extracting contract content, computer equipment, and a storage medium.
In one aspect, an embodiment of the invention provides a method for extracting contract content, the method comprising: determining a target contract text to be identified; identifying the contract type of the target contract text using an identification model; and extracting the structural clause content in the target contract text according to the contract type.
Optionally, before the contract type of the target contract text is identified using the identification model, the method further comprises: segmenting each contract to be classified in a sample set into tokens, setting the type attribute of each token, and calculating the feature vector of each token; calculating the prior probability of each contract to be classified in the sample set; calculating the posterior probability of each contract to be classified using the prior probability; and establishing, in the identification model, the correspondence between each contract type and its posterior probability.
Optionally, after each contract to be classified in the sample set is segmented, the method further comprises: obtaining the usage frequency of each token in the contract domain; and selecting the tokens whose usage frequency is greater than a preset threshold and determining them to be qualifying tokens.
Optionally, before the usage frequency of each token in the contract domain is obtained, the method further comprises: rejecting tokens whose part of speech is adjective, adverb, or modal particle.
Optionally, calculating the prior probability of each contract to be classified in the sample set comprises: searching for $s_1, \ldots, s_n$ in training text set $D_i$ and calculating the set of occurrence counts $N(y_1, \ldots, y_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; dividing $N(y_1, \ldots, y_n)$ by the total number of tokens in $D_i$ to obtain the set of occurrence probabilities $Q(w_1, \ldots, w_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; and determining $Q(w_1, \ldots, w_n)$ to be the prior probability $P(w \mid D_i)$ with which each token attribute $w_n$ of $P(w_1, \ldots, w_n)$ occurs in $D_i$, where $P(w_n)$ denotes the tokens in $D_i$ whose attribute is $w_n$, $N(y_n)$ is the number of times attribute $w_n$ occurs in $D_i$, and $Q(w_n)$ is the probability with which attribute $w_n$ occurs in $D_i$.
Optionally, calculating the posterior probability of each contract to be classified using the prior probability comprises: computing a weighted sum of the prior probabilities of all tokens to obtain the prior probability $P(D_i)$ of all texts to be classified; and determining $P(w_1, \ldots, w_n)$, obtained as $P(D_i) \cdot P(x_i \mid D_i)$, to be the posterior probability $P(D_i \mid w)$ in training text set $D_i$, where $P(x_i \mid D_i)$ is the probability that $x_i$ occurs given that $D_i$ occurs, and $x_i$ is a contract text whose contract type is $i$.
Optionally, extracting the structural clause content in the target contract text according to the contract type comprises: looking up, in a preset database, the text layout template corresponding to the contract type; and extracting clause content at the designated positions of the target contract text according to the layout pattern of the text layout template.
In another aspect, an embodiment of the invention provides a device for extracting contract content, the device comprising: a determining module for determining a target contract text to be identified; an identification module for identifying the contract type of the target contract text using an identification model; and an extraction module for extracting the structural clause content in the target contract text according to the contract type.
Optionally, the device further comprises: a word segmentation module for, before the identification module identifies the contract type of the target contract text using the identification model, segmenting each contract to be classified in a sample set into tokens, setting the type attribute of each token, and calculating the feature vector of each token; a first computing module for calculating the prior probability of each contract to be classified in the sample set; a second computing module for calculating the posterior probability of each contract to be classified using the prior probability; and a construction module for establishing, in the identification model, the correspondence between each contract type and its posterior probability.
Optionally, the word segmentation module further comprises: an acquiring unit for obtaining, after each contract to be classified in the sample set is segmented, the usage frequency of each token in the contract domain; and a determination unit for selecting the tokens whose usage frequency is greater than a preset threshold and determining them to be qualifying tokens.
Optionally, the word segmentation module further comprises: a rejecting unit for rejecting, before the acquiring unit obtains the usage frequency of each token in the contract domain, tokens whose part of speech is adjective, adverb, or modal particle.
Optionally, the first computing module comprises: a first computing unit for searching for $s_1, \ldots, s_n$ in training text set $D_i$ and calculating the set of occurrence counts $N(y_1, \ldots, y_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; a second computing unit for dividing $N(y_1, \ldots, y_n)$ by the total number of tokens in $D_i$ to obtain the set of occurrence probabilities $Q(w_1, \ldots, w_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; and a determination unit for determining $Q(w_1, \ldots, w_n)$ to be the prior probability $P(w \mid D_i)$ with which each token attribute $w_n$ of $P(w_1, \ldots, w_n)$ occurs in $D_i$, where $P(w_n)$ denotes the tokens in $D_i$ whose attribute is $w_n$, $N(y_n)$ is the number of times attribute $w_n$ occurs in $D_i$, and $Q(w_n)$ is the probability with which attribute $w_n$ occurs in $D_i$.
Optionally, the second computing module comprises: a computing unit for computing a weighted sum of the prior probabilities of all tokens to obtain the prior probability $P(D_i)$ of all texts to be classified; and a determination unit for determining $P(w_1, \ldots, w_n)$, obtained as $P(D_i) \cdot P(x_i \mid D_i)$, to be the posterior probability $P(D_i \mid w)$ in training text set $D_i$, where $P(x_i \mid D_i)$ is the probability that $x_i$ occurs given that $D_i$ occurs, and $x_i$ is a contract text whose contract type is $i$.
Optionally, the extraction module comprises: a searching unit for looking up, in a preset database, the text layout template corresponding to the contract type; and an extraction unit for extracting clause content at the designated positions of the target contract text according to the layout pattern of the text layout template.
According to yet another embodiment of the invention, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the invention, an electronic device is also provided, comprising a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.
With the invention, after the target contract text to be identified is determined, the identification model identifies its contract type, and the structural clause content in the target contract text is then extracted according to that type. This solves the technical problem in the prior art of low efficiency when extracting contract content on a large scale. The artificial-intelligence-based identification model can identify contracts of multiple types and can learn and adapt to contract texts of arbitrary format, which saves human-resource costs and makes machine classification more efficient and more accurate.
[Brief description of the drawings]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a hardware block diagram of a terminal for extracting contract content according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for extracting contract content according to an embodiment of the present invention;
Fig. 3 is a flowchart of training the identification model according to an embodiment of the present invention;
Fig. 4 is a structural block diagram of the device for extracting contract content according to an embodiment of the present invention.
[Detailed description of embodiments]
The present invention is described in detail below with reference to the accompanying drawings and in combination with embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with each other.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware block diagram of a terminal for extracting contract content according to an embodiment of the present invention. As shown in Fig. 1, the terminal 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the terminal may further include a transmission device 106 for communication functions and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the terminal. For example, the terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 may be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the method for extracting contract content in the embodiment of the present invention. By running the computer programs stored in the memory 104, the processor 102 performs various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the terminal 10 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network equipment through a base station and thereby communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the internet wirelessly.
This embodiment provides a method for extracting contract content. Fig. 2 is a flowchart of the method for extracting contract content according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:
Step S202: determine the target contract text to be identified;
A contract in this embodiment is an agreement between parties to establish, change, or terminate a civil relationship, and the contract text is the written or electronic text in which the agreement is recorded.
Step S204: identify the contract type of the target contract text using the identification model;
Contract type refers to the industry or legal provisions that the contract concerns. Contracts of different types differ in their agreed content and in their clauses, while contract texts of the same type share the same text format. Contract types in this embodiment include labor contracts, sales contracts, gift contracts, loan contracts, lease contracts, construction project contracts, and so on.
Step S206: extract the structural clause content in the target contract text according to the contract type.
With the scheme of this embodiment, after the target contract text to be identified is determined, the identification model identifies its contract type, and the structural clause content in the target contract text is then extracted according to that type. This solves the technical problem in the prior art of low efficiency when extracting contract content on a large scale. The artificial-intelligence-based identification model can identify contracts of multiple types and can learn and adapt to contract texts of arbitrary format, which saves human-resource costs and makes machine classification more efficient and more accurate.
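As an illustration only, steps S202 to S206 might be sketched in Python as follows. The function `extract_contract_content`, the toy classifier `toy_identify_type`, and the table `TOY_EXTRACTORS` are hypothetical names introduced here; the embodiment does not prescribe any particular interface.

```python
from typing import Callable, Dict

Clauses = Dict[str, str]


def extract_contract_content(
    contract_text: str,
    identify_type: Callable[[str], str],
    extractors: Dict[str, Callable[[str], Clauses]],
) -> Clauses:
    """Run S204 (type identification) and S206 (type-specific extraction).

    `contract_text` is the target contract text determined in S202.
    """
    contract_type = identify_type(contract_text)  # S204
    if contract_type not in extractors:
        raise KeyError(f"no extractor registered for type {contract_type!r}")
    return extractors[contract_type](contract_text)  # S206


# Toy stand-ins (assumptions): a keyword rule instead of a trained model,
# and extractors that simply return the first line of the text.
def toy_identify_type(text: str) -> str:
    return "loan" if "borrower" in text.lower() else "sale"


TOY_EXTRACTORS: Dict[str, Callable[[str], Clauses]] = {
    "loan": lambda text: {"type": "loan", "first_line": text.splitlines()[0]},
    "sale": lambda text: {"type": "sale", "first_line": text.splitlines()[0]},
}
```

Passing the classifier and the extractors in as arguments keeps the three steps separate, so a trained identification model can replace the keyword rule without changing the extraction step.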
The identification model of this embodiment may be obtained through training or may be preset. In the sample set used for training, each sample is a contract text whose contract type is known and has been manually labeled in advance. During training, the input of the identification model is a contract text and the output is the contract type of that contract text.
Before the identification model is used to identify the contract type of the target contract text, the identification model also needs to be trained locally using samples. Fig. 3 is a flowchart of training the identification model according to an embodiment of the present invention. As shown in Fig. 3, the process includes:
S302: segment each contract to be classified in the sample set into tokens, set the type attribute of each token, and calculate the feature vector of each token;
Optionally, after each contract to be classified in the sample set is segmented, the method further includes: obtaining the usage frequency of each token in the contract domain; and selecting the tokens whose usage frequency is greater than a preset threshold and determining them to be qualifying tokens. Usage frequency reflects how popular a token is in use: the more popular the token, the higher its usage frequency.
In a preferred implementation of this embodiment, meaningless tokens also need to be removed from the texts to be classified. These words are used frequently but carry no practical meaning and are common to contract texts of multiple types. Rejecting them does not affect the performance of the identification model but reduces the amount of sample data to process and improves training efficiency. Accordingly, before the usage frequency of each token in the contract domain is obtained, the method further includes: rejecting tokens whose part of speech is adjective, adverb, or modal particle.
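The two filtering steps above might be sketched as follows. This is a minimal sketch under stated assumptions: tokens arrive already tagged with a part of speech, the tag names in `EXCLUDED_POS` are hypothetical, and the usage frequency "in the contract domain" is approximated by a raw count over the supplied corpus, which the embodiment does not specify.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical tag names; a real tagger would use its own tag set.
EXCLUDED_POS = {"adjective", "adverb", "modal_particle"}


def select_tokens(tagged_tokens: Iterable[Tuple[str, str]], threshold: int) -> List[str]:
    """Reject tokens by part of speech, then keep those used more than `threshold` times."""
    # Step 1: reject adjectives, adverbs, and modal particles.
    kept = [token for token, pos in tagged_tokens if pos not in EXCLUDED_POS]
    # Step 2: count usage (assumption: raw count stands in for domain frequency).
    counts = Counter(kept)
    # Step 3: keep only tokens whose count is strictly greater than the threshold.
    return sorted(token for token, count in counts.items() if count > threshold)
```

Rejecting by part of speech first means the frequency counts are computed only over tokens that could still qualify, which matches the order described above.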
After the set of qualifying tokens is obtained, each token (character or word) $s_i$ appearing in the texts to be classified is classified according to type attribute $w$: a token belonging to $w_n$ is denoted $s_n$, where $w_n$ is the type attribute of the token. Specifically, information entropy is used to quantize each token into a feature vector.
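The embodiment does not state how information entropy is turned into a feature vector. As one possible reading, offered purely as an assumption, the Shannon entropy of a token's distribution over contract types could serve as one feature: a token concentrated in a single type has entropy 0 and is highly discriminative, while a token spread evenly over all types has maximal entropy. The function name `token_entropy` is hypothetical.

```python
import math
from typing import Dict


def token_entropy(counts_by_type: Dict[str, int]) -> float:
    """Shannon entropy (in bits) of a token's occurrence counts across contract types."""
    total = sum(counts_by_type.values())
    if total == 0:
        raise ValueError("token never occurs")
    entropy = 0.0
    for count in counts_by_type.values():
        if count > 0:
            p = count / total
            entropy += p * math.log2(1.0 / p)  # p * log2(1/p) is never negative
    return entropy
```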
S304: calculate the prior probability of each contract to be classified in the sample set;
In one implementation of this embodiment, calculating the prior probability of each contract to be classified in the sample set includes: searching for $s_1, \ldots, s_n$ in training text set $D_i$ and calculating the set of occurrence counts $N(y_1, \ldots, y_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; dividing $N(y_1, \ldots, y_n)$ by the total number of keywords in $D_i$ that remain after the preprocessing that rejects meaningless words, to obtain the set of occurrence probabilities $Q(w_1, \ldots, w_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; and determining $Q(w_1, \ldots, w_n)$ to be the prior probability $P(w \mid D_i)$ with which each token attribute $w_n$ of $P(w_1, \ldots, w_n)$ occurs in $D_i$, where $P(w_n)$ denotes the tokens in $D_i$ whose attribute is $w_n$, $N(y_n)$ is the number of times attribute $w_n$ occurs in $D_i$, and $Q(w_n)$ is the probability with which attribute $w_n$ occurs in $D_i$.
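The count-and-divide step above might be sketched as follows. The sketch assumes that training text set $D_i$ has already been reduced to a list of type attributes, one per remaining keyword; the function name `attribute_priors` and this input shape are assumptions, not part of the embodiment.

```python
from collections import Counter
from typing import Dict, Sequence


def attribute_priors(attributes_in_di: Sequence[str]) -> Dict[str, float]:
    """Return Q(w_n) = N(y_n) / total for every attribute w_n occurring in D_i."""
    total = len(attributes_in_di)  # total keywords left after preprocessing
    if total == 0:
        raise ValueError("training text set is empty")
    counts = Counter(attributes_in_di)  # N(y_n): occurrences of each attribute
    return {attribute: n / total for attribute, n in counts.items()}
```

By construction the returned values sum to 1 over the attributes that occur in $D_i$.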
S306: calculate the posterior probability of each contract to be classified using the prior probability;
In one implementation of this embodiment, calculating the posterior probability of each contract to be classified using the prior probability includes: computing a weighted sum of the prior probabilities of all tokens to obtain the prior probability $P(D_i)$ of all texts to be classified; and determining $P(w_1, \ldots, w_n)$, obtained as $P(D_i) \cdot P(x_i \mid D_i)$, to be the posterior probability $P(D_i \mid w)$ in training text set $D_i$, where $P(x_i \mid D_i)$ is the probability that $x_i$ occurs given that $D_i$ occurs, and $x_i$ is a contract text whose contract type is $i$.
When some feature item never appears under some class, $P(x \mid D_i) = 0$, which greatly reduces the quality of the classifier. To solve this problem, Laplace calibration (Laplace smoothing) is introduced: the count of items (contract texts) under every class is incremented by 1. When the training sample set is sufficiently large, this does not affect the result, and it avoids the situation in which the above frequency is 0.
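A minimal sketch of the add-one correction follows. The embodiment only says that each count is incremented by 1; the denominator used here (the class total plus the number of distinct feature items) is the usual textbook form of Laplace smoothing and is an assumption, as is the function name `smoothed_likelihood`.

```python
def smoothed_likelihood(count: int, class_total: int, num_features: int) -> float:
    """Estimate P(x | D_i) with add-one smoothing so that it is never zero.

    count        -- times feature x was seen under class D_i
    class_total  -- total feature occurrences seen under class D_i
    num_features -- number of distinct feature items (assumed denominator term)
    """
    return (count + 1) / (class_total + num_features)
```

Adding `num_features` to the denominator keeps the smoothed estimates for a class summing to 1, and the correction becomes negligible as `class_total` grows, which is consistent with the remark that a large training set is not affected.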
The scheme of this embodiment is based on the naive Bayes principle: for a given item to be classified, compute the probability of each class under the condition that this item occurs, and assign the item to the class with the greatest probability. Put informally, the reasoning is as follows: you see a Black person on the street and are asked to guess where the person comes from, and you would most likely guess Africa. Why? Because Africans make up the highest proportion of Black people. The person could of course also be American or Asian, but in the absence of other available information we choose the class with the greatest conditional probability. This is the basic idea of naive Bayes.
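For reference, the standard textbook form of the naive Bayes decision rule can be written in the notation of this embodiment as shown below. It is given as background and is not asserted to be the exact formula of the embodiment; the conditional-independence factorization over $w_1, \ldots, w_n$ is the usual naive Bayes assumption.

```latex
P(D_i \mid w_1, \ldots, w_n)
  = \frac{P(D_i)\,\prod_{k=1}^{n} P(w_k \mid D_i)}{P(w_1, \ldots, w_n)},
\qquad
\hat{D} = \arg\max_{D_i}\; P(D_i)\,\prod_{k=1}^{n} P(w_k \mid D_i)
```

The denominator $P(w_1, \ldots, w_n)$ is the same for every class, so it can be dropped when only the most probable class is needed.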
S308: establish, in the identification model, the correspondence between each contract type and its posterior probability.
In this embodiment, identifying the contract type of the target contract text using the identification model means classifying the text automatically with the trained identification model. Each contract text is semantically segmented and converted into feature vectors, and the feature vectors are input to the identification model. The identification model yields the probability that each contract text belongs to each class and outputs the type identifier of each type, and the type with the highest probability is selected as the final result.
In one example, the type identifiers of the sales contract, the gift contract, and the loan contract are 00, 01, and 02 respectively. If the probabilities calculated by the identification model are 45%, 47%, and 86% respectively, the output is 02. The contract types are not limited to these and may also include lease contracts, construction project contracts, and so on.
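The selection in this example might be sketched as follows, using the identifiers and probabilities given above. The function name `pick_type` is hypothetical.

```python
from typing import Dict


def pick_type(probabilities: Dict[str, float]) -> str:
    """Return the type identifier whose probability is highest."""
    if not probabilities:
        raise ValueError("no candidate types")
    return max(probabilities, key=lambda type_id: probabilities[type_id])


# Values from the example: sales = 00, gift = 01, loan = 02.
example = {"00": 0.45, "01": 0.47, "02": 0.86}
selected = pick_type(example)
```

Note that the three values are per-type scores and need not sum to 1; only their ordering matters for the selection.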
Optionally, extracting the structural clause content in the target contract text according to the contract type includes: looking up, in a preset database, the text layout template corresponding to the contract type; and extracting clause content at the designated positions of the target contract text according to the layout pattern of the text layout template. That is, based on the type identifier, clause content is extracted from the designated positions. Contract texts of different types contain different clauses, and even when they contain the same clause, its position in the contract text differs.
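A minimal sketch of template-driven extraction follows. The embodiment does not describe the template format, so everything here is an assumption: the in-memory dictionary `LAYOUT_TEMPLATES` stands in for the preset database, and each clause position is described by a start marker and an optional end marker.

```python
from typing import Dict, Optional, Tuple

# Hypothetical "preset database": type identifier -> clause name -> (start, end) markers.
LAYOUT_TEMPLATES: Dict[str, Dict[str, Tuple[str, Optional[str]]]] = {
    "02": {  # loan contract, using the identifier from the example above
        "amount": ("Article 1", "Article 2"),
        "term": ("Article 2", None),  # None means: runs to the end of the text
    },
}


def extract_clauses(text: str, contract_type: str) -> Dict[str, str]:
    """Extract each clause named in the template for `contract_type`."""
    template = LAYOUT_TEMPLATES[contract_type]
    clauses: Dict[str, str] = {}
    for name, (start_marker, end_marker) in template.items():
        start = text.find(start_marker)
        if start == -1:
            continue  # this clause does not appear in the text
        end = -1
        if end_marker is not None:
            end = text.find(end_marker, start + len(start_marker))
        stop = end if end != -1 else len(text)
        clauses[name] = text[start:stop].strip()
    return clauses
```

Because the markers live in the template rather than in the code, two contract types that place the same clause in different positions only need different template entries.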
From the above description of the embodiments, a person skilled in the art can clearly understand that the method of the above embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferable implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Embodiment 2
This embodiment also provides a device for extracting contract content. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a structural block diagram of the device for extracting contract content according to an embodiment of the present invention. As shown in Fig. 4, the device includes:
a determining module 40 for determining the target contract text to be identified;
an identification module 42 for identifying the contract type of the target contract text using the identification model;
an extraction module 44 for extracting the structural clause content in the target contract text according to the contract type.
Optionally, the device further includes: a word segmentation module for, before the identification module identifies the contract type of the target contract text using the identification model, segmenting each contract to be classified in a sample set into tokens, setting the type attribute of each token, and calculating the feature vector of each token; a first computing module for calculating the prior probability of each contract to be classified in the sample set; a second computing module for calculating the posterior probability of each contract to be classified using the prior probability; and a construction module for establishing, in the identification model, the correspondence between each contract type and its posterior probability.
Optionally, the word segmentation module further includes: an acquiring unit for obtaining, after each contract to be classified in the sample set is segmented, the usage frequency of each token in the contract domain; and a determination unit for selecting the tokens whose usage frequency is greater than a preset threshold and determining them to be qualifying tokens.
Optionally, the word segmentation module further includes: a rejecting unit for rejecting, before the acquiring unit obtains the usage frequency of each token in the contract domain, tokens whose part of speech is adjective, adverb, or modal particle.
Optionally, the first computing module includes: a first computing unit for searching for $s_1, \ldots, s_n$ in training text set $D_i$ and calculating the set of occurrence counts $N(y_1, \ldots, y_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; a second computing unit for dividing $N(y_1, \ldots, y_n)$ by the total number of tokens in $D_i$ to obtain the set of occurrence probabilities $Q(w_1, \ldots, w_n)$ of $P(w_1, \ldots, w_n)$ in $D_i$; and a determination unit for determining $Q(w_1, \ldots, w_n)$ to be the prior probability $P(w \mid D_i)$ with which each token attribute $w_n$ of $P(w_1, \ldots, w_n)$ occurs in $D_i$, where $P(w_n)$ denotes the tokens in $D_i$ whose attribute is $w_n$, $N(y_n)$ is the number of times attribute $w_n$ occurs in $D_i$, and $Q(w_n)$ is the probability with which attribute $w_n$ occurs in $D_i$.
Optionally, the second computing module includes: a computing unit for dividing the number of documents in training text set $D_i$ by the total number of documents in the entire training text set to obtain the prior probability $P(D_i)$; and a determination unit for determining $P(w_1, \ldots, w_n)$, obtained as $P(D_i) \cdot P(x_i \mid D_i)$, to be the posterior probability $P(D_i \mid w)$ in training text set $D_i$, where $P(x_i \mid D_i)$ is the probability that $x_i$ occurs given that $D_i$ occurs, and $x_i$ is a contract text whose contract type is $i$.
Optionally, the extraction module includes: a searching unit for looking up, in a preset database, the text layout template corresponding to the contract type; and an extraction unit for extracting clause content at the designated positions of the target contract text according to the layout pattern of the text layout template.
It should be noted that each of the above modules can be implemented in software or hardware. In the latter case, this can be done in the following ways, among others: the above modules are all located in the same processor; or the above modules are distributed, in any combination, across different processors.
Embodiment 3
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
An integrated unit implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present invention further provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: determine a target contract text to be identified;
S2: identify the contract type of the target contract text using an identification model;
S3: extract the constructive clause content in the target contract text according to the contract type.
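The three steps above can be sketched as a small pipeline. This is only an illustrative outline, not the patented implementation: the function name `extract_contract_content`, the `Classifier` and `Extractor` aliases, and the toy classifier and extractor are all hypothetical stand-ins introduced here.

```python
from typing import Callable, Dict

# Hypothetical type aliases; the source does not define concrete interfaces.
Classifier = Callable[[str], str]
Extractor = Callable[[str], Dict[str, str]]


def extract_contract_content(
    text: str,
    classify: Classifier,
    extractors: Dict[str, Extractor],
) -> Dict[str, str]:
    """Run steps S1-S3 on one contract text."""
    # S1: determine the target contract text (here: just reject empty input).
    target = text.strip()
    if not target:
        raise ValueError("empty contract text")
    # S2: identify the contract type with the identification model.
    contract_type = classify(target)
    # S3: extract clause content with the extractor registered for that type.
    extractor = extractors.get(contract_type)
    if extractor is None:
        raise KeyError(f"no extractor for contract type {contract_type!r}")
    return extractor(target)


# Toy stand-ins for the identification model and a per-type extractor.
def toy_classify(text: str) -> str:
    return "lease" if "rent" in text.lower() else "sale"


def toy_lease_extractor(text: str) -> Dict[str, str]:
    return {"first_line": text.splitlines()[0]}


result = extract_contract_content(
    "Lease Agreement\nThe monthly rent is 100.",
    toy_classify,
    {"lease": toy_lease_extractor},
)
```

Keeping the classifier and the per-type extractors as injected callables mirrors the separation between steps S2 and S3, so either part could be replaced without touching the other.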
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
An embodiment of the present invention further provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by means of the computer program:
S1: determine a target contract text to be identified;
S2: identify the contract type of the target contract text using an identification model;
S3: extract the constructive clause content in the target contract text according to the contract type.
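Step S3 is stated only at a high level here; claim 7 adds that a text layout template is looked up for the contract type and clause content is taken from designated positions. A minimal sketch of that idea follows, under the assumption that a template is simply an ordered list of clause headings and that each clause starts with `Heading:` at the beginning of a line. The `TEMPLATES` table, the heading names and the sample text are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical template database: contract type -> ordered clause headings.
TEMPLATES: Dict[str, List[str]] = {
    "lease": ["Parties", "Rent", "Term"],
}


def extract_by_template(text: str, contract_type: str) -> Dict[str, str]:
    """Return the text that follows each template heading."""
    headings = TEMPLATES[contract_type]
    # Match any template heading at the start of a line, followed by a colon.
    pattern = re.compile(
        r"^(%s):" % "|".join(re.escape(h) for h in headings), re.MULTILINE
    )
    matches = list(pattern.finditer(text))
    clauses: Dict[str, str] = {}
    for i, m in enumerate(matches):
        # A clause runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clauses[m.group(1)] = text[m.end():end].strip()
    return clauses


sample = "Parties: A and B\nRent: 100 per month\nTerm: 12 months\n"
clauses = extract_by_template(sample, "lease")
```

Real contracts would need a richer notion of position (page, paragraph index, numbering style); heading matching is used here only because it keeps the example short.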
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A method for extracting contract content, characterized in that the method comprises:
determining a target contract text to be identified;
identifying the contract type of the target contract text using an identification model;
extracting the constructive clause content in the target contract text according to the contract type.
2. The method according to claim 1, characterized in that, before identifying the contract type of the target contract text using the identification model, the method further comprises:
segmenting each contract to be classified in a sample set into words, setting a type attribute for each segmented word, and calculating a feature vector for each segmented word;
calculating the prior probability of each contract to be classified in the sample set;
calculating the posterior probability of each contract to be classified using the prior probability;
establishing, in the identification model, the correspondence between each contract type and its posterior probability.
3. The method according to claim 2, characterized in that, after segmenting each contract to be classified in the sample set into words, the method further comprises:
obtaining the frequency of use of each segmented word in the contract domain;
selecting the segmented words whose frequency of use is greater than a preset threshold and determining them to be qualified segmented words.
4. The method according to claim 2, characterized in that, before obtaining the frequency of use of each segmented word in the contract domain, the method further comprises:
removing the segmented words whose part of speech is adjective, adverb or modal particle.
5. The method according to claim 2, characterized in that calculating the prior probability of each contract to be classified in the sample set comprises:
looking up $s_1, \ldots, s_n$ in the training text set $D_i$ and counting the set of numbers of times $N(y_1, \ldots, y_n)$ that $P(w_1, \ldots, w_n)$ appear in $D_i$; dividing $N(y_1, \ldots, y_n)$ by the total number of segmented words in $D_i$ to obtain the set of probabilities $Q(w_1, \ldots, w_n)$ with which $P(w_1, \ldots, w_n)$ appear in $D_i$; and determining $Q(w_1, \ldots, w_n)$ as the prior probability $P(w \mid D_i)$ of each segmented word $w_n$ of $P(w_1, \ldots, w_n)$ in $D_i$, where $P(w_n)$ is the segmented word with attribute $w_n$ in $D_i$, $N(y_n)$ is the number of times attribute $w_n$ appears in $D_i$, and $Q(w_n)$ is the probability with which attribute $w_n$ appears in $D_i$.
6. The method according to claim 2, characterized in that calculating the posterior probability of each contract to be classified using the prior probability comprises:
computing a weighted sum of the prior probabilities of all segmented words to obtain the prior probability $P(D_i)$ of all texts to be classified; and determining the $P(w_1, \ldots, w_n)$ obtained from $P(D_i) \cdot P(x_i \mid D_i)$ as the posterior probability $P(D_i \mid w)$ in the training text set $D_i$, where $P(x_i \mid D_i)$ is the probability that $x_i$ occurs given that $D_i$ occurs, and $x_i$ is a contract text whose contract type is $i$.
7. The method according to claim 1, characterized in that extracting the constructive clause content in the target contract text according to the contract type comprises:
searching a preset database for a text layout template corresponding to the contract type;
extracting clause content at designated positions in the target contract text according to the layout pattern of the text layout template.
8. An apparatus for extracting contract content, characterized in that the apparatus comprises:
a determining module, configured to determine a target contract text to be identified;
an identification module, configured to identify the contract type of the target contract text using an identification model;
an extraction module, configured to extract the constructive clause content in the target contract text according to the contract type.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
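Claims 2, 3, 5 and 6 describe estimating per-type word probabilities from segmented training contracts and combining them with a type prior. The sketch below uses a standard multinomial naive Bayes classifier as a stand-in; it is not the exact procedure recited in the claims. In particular, the add-one smoothing, the class prior taken as the share of training contracts per type, and the names `train`, `classify` and `threshold` are assumptions introduced for illustration, and the training data is invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train(samples: List[Tuple[List[str], str]], threshold: int = 0):
    """samples: (segmented words, contract type) pairs."""
    total = Counter(w for words, _ in samples for w in words)
    # Keep only words used more often than the threshold (cf. claim 3).
    vocab = {w for w, c in total.items() if c > threshold}
    word_counts: Dict[str, Counter] = {}
    doc_counts: Counter = Counter()
    for words, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(
            w for w in words if w in vocab
        )
    return vocab, word_counts, doc_counts


def classify(words: List[str], model) -> str:
    vocab, word_counts, doc_counts = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        # log P(D_i): share of training contracts of this type (assumption).
        score = math.log(doc_counts[label] / n_docs)
        for w in words:
            if w in vocab:
                # log P(w | D_i) with add-one smoothing (assumption).
                score += math.log((counts[w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    (["rent", "tenant", "landlord"], "lease"),
    (["rent", "deposit", "tenant"], "lease"),
    (["buyer", "seller", "price"], "sale"),
    (["price", "delivery", "buyer"], "sale"),
])
predicted = classify(["tenant", "rent", "price"], model)
```

Scores are accumulated in log space to avoid underflow on long contracts; the unsmoothed ratio of word count to total word count in claim 5 corresponds to the term inside the logarithm without the `+ 1` and `+ len(vocab)` corrections.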
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910534911.5A CN110442842A (en) | 2019-06-20 | 2019-06-20 | The extracting method and device of treaty content, computer equipment, storage medium |
PCT/CN2020/093511 WO2020253506A1 (en) | 2019-06-20 | 2020-05-29 | Contract content extraction method and apparatus, and computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910534911.5A CN110442842A (en) | 2019-06-20 | 2019-06-20 | The extracting method and device of treaty content, computer equipment, storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110442842A (en) | 2019-11-12 |
Family
ID=68428235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910534911.5A Pending CN110442842A (en) | 2019-06-20 | 2019-06-20 | The extracting method and device of treaty content, computer equipment, storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110442842A (en) |
WO (1) | WO2020253506A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046629A (en) * | 2019-12-16 | 2020-04-21 | 北大方正集团有限公司 | Outline display method, device and equipment |
CN111078871A (en) * | 2019-11-21 | 2020-04-28 | 深圳前海环融联易信息科技服务有限公司 | Method and system for automatically classifying contracts based on artificial intelligence |
CN111274782A (en) * | 2020-02-25 | 2020-06-12 | 平安科技(深圳)有限公司 | Text auditing method and device, computer equipment and readable storage medium |
CN111814457A (en) * | 2020-05-30 | 2020-10-23 | 国网上海市电力公司 | Power grid engineering contract text generation method |
WO2020253506A1 (en) * | 2019-06-20 | 2020-12-24 | 平安科技(深圳)有限公司 | Contract content extraction method and apparatus, and computer device and storage medium |
CN116306573A (en) * | 2023-03-15 | 2023-06-23 | 广联达科技股份有限公司 | Intelligent analysis method, device and equipment for engineering practice and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391772A (en) * | 2017-09-15 | 2017-11-24 | 国网四川省电力公司眉山供电公司 | A kind of file classification method based on naive Bayesian |
CN108830443A (en) * | 2018-04-19 | 2018-11-16 | 出门问问信息科技有限公司 | A kind of contract review method and device |
CN109190594A (en) * | 2018-09-21 | 2019-01-11 | 广东蔚海数问大数据科技有限公司 | Optical Character Recognition system and information extracting method |
CN109739985A (en) * | 2018-12-26 | 2019-05-10 | 斑马网络技术有限公司 | Automatic document classification method, equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045825B (en) * | 2015-06-29 | 2018-05-01 | 中国地质大学(武汉) | A kind of multinomial naive Bayesian file classification method of structure extension |
JP6776805B2 (en) * | 2016-10-24 | 2020-10-28 | 富士通株式会社 | Character recognition device, character recognition method, character recognition program |
CN110442842A (en) * | 2019-06-20 | 2019-11-12 | 平安科技(深圳)有限公司 | The extracting method and device of treaty content, computer equipment, storage medium |
- 2019-06-20: CN application CN201910534911.5A (CN110442842A), status: Pending
- 2020-05-29: WO application PCT/CN2020/093511 (WO2020253506A1), status: Application Filing
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020253506A1 (en) * | 2019-06-20 | 2020-12-24 | 平安科技(深圳)有限公司 | Contract content extraction method and apparatus, and computer device and storage medium |
CN111078871A (en) * | 2019-11-21 | 2020-04-28 | 深圳前海环融联易信息科技服务有限公司 | Method and system for automatically classifying contracts based on artificial intelligence |
CN111046629A (en) * | 2019-12-16 | 2020-04-21 | 北大方正集团有限公司 | Outline display method, device and equipment |
CN111046629B (en) * | 2019-12-16 | 2022-03-01 | 北大方正集团有限公司 | Outline display method, device and equipment |
CN111274782A (en) * | 2020-02-25 | 2020-06-12 | 平安科技(深圳)有限公司 | Text auditing method and device, computer equipment and readable storage medium |
WO2021169208A1 (en) * | 2020-02-25 | 2021-09-02 | 平安科技(深圳)有限公司 | Text review method and apparatus, and computer device, and readable storage medium |
CN111274782B (en) * | 2020-02-25 | 2023-10-20 | 平安科技(深圳)有限公司 | Text auditing method and device, computer equipment and readable storage medium |
CN111814457A (en) * | 2020-05-30 | 2020-10-23 | 国网上海市电力公司 | Power grid engineering contract text generation method |
CN116306573A (en) * | 2023-03-15 | 2023-06-23 | 广联达科技股份有限公司 | Intelligent analysis method, device and equipment for engineering practice and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020253506A1 (en) | 2020-12-24 |
Similar Documents
Publication | Title |
---|---|
CN110442842A (en) | The extracting method and device of treaty content, computer equipment, storage medium |
CN108629413B (en) | Neural network model training and transaction behavior risk identification method and device |
CN109189901B (en) | Method for automatically discovering new classification and corresponding corpus in intelligent customer service system |
CN107835496B (en) | Spam short message identification method and device and server |
CN109561322A (en) | A kind of method, apparatus, equipment and the storage medium of video audit |
CN108090508A (en) | A kind of classification based training method, apparatus and storage medium |
CN110399490A (en) | A kind of barrage file classification method, device, equipment and storage medium |
CN108874921A (en) | Extract method, apparatus, terminal device and the storage medium of text feature word |
CN106777232A (en) | Question and answer abstracting method, device and terminal |
CN106228389A (en) | Network potential usage mining method and system based on random forests algorithm |
CN110069627A (en) | Classification method, device, electronic equipment and the storage medium of short text |
CN113590764B (en) | Training sample construction method and device, electronic equipment and storage medium |
CN107145516A (en) | A kind of Text Clustering Method and system |
CN110069630B (en) | Improved mutual information feature selection method |
CN111159404B (en) | Text classification method and device |
CN113254643B (en) | Text classification method and device, electronic equipment and text classification program |
CN107766860A (en) | Natural scene image Method for text detection based on concatenated convolutional neutral net |
CN109739985A (en) | Automatic document classification method, equipment and storage medium |
CN107067022B (en) | Method, device and equipment for establishing image classification model |
CN108334895A (en) | Sorting technique, device, storage medium and the electronic device of target data |
CN109446300A (en) | A kind of corpus preprocess method, the pre- mask method of corpus and electronic equipment |
CN107229614A (en) | Method and apparatus for grouped data |
CN110516937A (en) | A kind of demand based on topic model is intelligent to be turned to do method and apparatus |
CN114428854A (en) | Variable-length text classification method based on length normalization and active learning |
CN106569996A (en) | Chinese-microblog-oriented emotional tendency analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||