CN106933787A - Method for calculating the similarity of judgement documents, search device, and computer equipment - Google Patents
- Publication number
- CN106933787A CN201710165953.7A
- Authority
- CN
- China
- Prior art keywords
- judgement
- similarity
- document
- defendant
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for calculating the similarity of judgement documents, a search device, and computer equipment. The method for calculating the similarity of judgement documents includes: obtaining at least two judgement documents; extracting the judgement keywords of one or more defendants in each judgement document; and determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents. The technical solution of the present invention improves the accuracy of judgement document similarity calculation.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a method for calculating the similarity of judgement documents, a search device, and computer equipment.
Background
In the prior art, one way of calculating text similarity is as follows: two long texts are first segmented into words based on a dictionary, and the frequency of each word obtained from each text is calculated; the similarity between the words of the two texts is then determined according to the dictionary, and the overall similarity of the two texts is calculated from the resulting word similarities and word frequencies. Another way of calculating text similarity is to train a text similarity calculation model in advance by machine learning and then determine the similarity between texts according to that model.
Documents in different fields have language features peculiar to their fields. Therefore, whether text similarity is calculated from a dictionary and word frequencies or determined by a pre-trained similarity calculation model, the calculation is carried out only at the word level, and its accuracy is poor.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of judgement document similarity calculation.
To solve the above technical problem, an embodiment of the present invention provides a method for calculating the similarity of judgement documents, including: obtaining at least two judgement documents; extracting the judgement keywords of one or more defendants in each judgement document; and determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents.
Optionally, after the at least two judgement documents are obtained and before the judgement keywords of one or more defendants in each judgement document are extracted, the method further includes: performing sentence splitting on each judgement document to obtain a plurality of sentences.
Optionally, extracting the judgement keywords of one or more defendants in each judgement document includes: performing word segmentation on the sentences in each judgement document according to a word segmentation dictionary to obtain a word segmentation result; performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including the defendants; performing, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and combining the entity relationships and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
Optionally, performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant includes: establishing a basic trigger vocabulary that includes one or more trigger words, the trigger words being used to represent event information in the judgement document; obtaining, according to a synonym thesaurus, near-synonyms of at least one trigger word in the basic trigger vocabulary; adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
Optionally, determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents includes: calculating the similarity between the judgement keywords of each pair of defendants taken from different judgement documents, the maximum of these pairwise similarities being the similarity between the corresponding judgement documents.
Optionally, calculating the similarity between the judgement keywords of each pair of defendants in different judgement documents includes: building a vector from the judgement keywords corresponding to each defendant; and calculating the similarity between the vectors corresponding to each pair of defendants.
Optionally, the judgement keywords include one or more of the following: judgement fact keywords, judgement reason keywords, and judgement result keywords.
Optionally, before word segmentation is performed on the sentences in the judgement document according to the word segmentation dictionary, the method further includes: extracting new-word candidates from the judgement document; and filtering the new-word candidates according to grammar and/or word order, and adding the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
To solve the above technical problem, an embodiment of the present invention further discloses a method for searching for similar judgement documents, including: obtaining a pending judgement document; searching a judgement document database according to the pending judgement document, the judgement document database including a plurality of first judgement documents; determining, using the described method for calculating the similarity of judgement documents, the similarity between the pending judgement document and each first judgement document; and taking as the search result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity is greater than a preset threshold.
Optionally, the judgement document database stores the vector corresponding to each defendant in each first judgement document.
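The selection of search results described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_similar_documents`, the dictionary-shaped database, and the pluggable `similarity` callable are all hypothetical; the patent specifies only that results are the highest-ranked preset number of documents or those above a preset threshold.

```python
def find_similar_documents(query, database, similarity, top_k=None, threshold=None):
    """Rank stored documents by similarity to the query document.

    database:   mapping of document id -> stored representation
                (for example, the per-defendant vectors kept in the database).
    similarity: callable(query, stored) -> float; assumed to implement the
                judgement document similarity calculation.
    top_k:      keep only the top_k most similar documents, if given.
    threshold:  keep only documents scoring strictly above it, if given.
    """
    scored = [(doc_id, similarity(query, stored)) for doc_id, stored in database.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] > threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored
```

Storing the per-defendant vectors in the database, as the optional feature above suggests, means only the pending document needs keyword extraction at query time.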
To solve the above technical problem, an embodiment of the present invention further discloses a device for calculating the similarity of judgement documents, including: an acquisition module for obtaining at least two judgement documents; a keyword extraction module for extracting the judgement keywords of one or more defendants in each judgement document; a keyword similarity calculation module for obtaining the similarity between the judgement keywords corresponding to the defendants in different judgement documents; and a text similarity calculation module for determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents.
Optionally, the device further includes: a sentence splitting module for performing sentence splitting on each judgement document to obtain a plurality of sentences.
Optionally, the keyword extraction module includes: a word segmentation unit for performing word segmentation on the sentences in each judgement document according to a word segmentation dictionary to obtain a word segmentation result; an entity recognition unit for performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including the defendants; an entity relation extraction unit for performing, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; a feature extraction unit for performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and a combination unit for combining the entity relationships and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
Optionally, the feature extraction unit includes: a basic trigger vocabulary establishing subunit for establishing a basic trigger vocabulary that includes one or more trigger words, the trigger words being used to represent event information in the judgement document; a near-synonym obtaining subunit for obtaining, according to a synonym thesaurus, near-synonyms of at least one trigger word in the basic trigger vocabulary; an extended trigger vocabulary establishing subunit for adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and a defendant feature value extraction subunit for extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
Optionally, the keyword similarity calculation module is specifically configured to calculate the similarity between the judgement keywords of each pair of defendants taken from different judgement documents; and the text similarity calculation module is specifically configured to take the maximum of these pairwise similarities as the similarity between the corresponding judgement documents.
Optionally, the keyword similarity calculation module includes: a vector construction unit for building a vector from the judgement keywords corresponding to each defendant; and a vector similarity calculation unit for calculating the similarity between the vectors corresponding to each pair of defendants.
Optionally, the judgement keywords include one or more of the following: judgement fact keywords, judgement reason keywords, and judgement result keywords.
Optionally, the device further includes: a candidate word extraction module for extracting new-word candidates from the judgement document; and a filtering module for filtering the new-word candidates according to grammar and/or word order and adding the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
To solve the above technical problem, an embodiment of the present invention further discloses a device for searching for similar judgement documents, including: an input module for obtaining a pending judgement document; a search module for searching a judgement document database according to the pending judgement document, the judgement document database including a plurality of first judgement documents; the described device for calculating the similarity of judgement documents, for determining the similarity between the pending judgement document and each first judgement document; and an output module for taking as the search result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity is greater than a preset threshold.
To solve the above technical problem, an embodiment of the present invention further discloses a system for searching for similar judgement documents, including the search device and a judgement document database, the judgement document database storing a plurality of first judgement documents and the vector corresponding to each defendant in each first judgement document.
To solve the above technical problem, an embodiment of the present invention further discloses computer equipment including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method for calculating the similarity of judgement documents described above when executing the computer program.
To solve the above technical problem, an embodiment of the present invention further discloses computer equipment including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method for searching for similar judgement documents described above when executing the computer program.
To solve the above technical problem, an embodiment of the present invention further discloses a computer-readable storage medium on which a computer program is stored, the computer program implementing the method for calculating the similarity of judgement documents described above when executed by a processor.
To solve the above technical problem, an embodiment of the present invention further discloses a computer-readable storage medium on which a computer program is stored, the computer program implementing the method for searching for similar judgement documents described above when executed by a processor.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:
The technical solution of the present invention obtains at least two judgement documents; extracts the judgement keywords of one or more defendants in each judgement document; and determines the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents. The technical solution is tailored to the characteristics of judgement documents: it extracts the judgement keywords of one or more defendants in each judgement document and bases the similarity between judgement documents on the similarity between the judgement keywords corresponding to the defendants in different documents. This avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity calculation model. Judgement keywords, which may have little lexical connection with the defendant but are of key significance to the defendant at the judgement level, become the core of the similarity calculation between judgement documents, which improves the accuracy of judgement document similarity calculation.
The technical solution of the present invention obtains a pending judgement document; searches a judgement document database according to the pending judgement document, the judgement document database including a plurality of first judgement documents; determines, using the described method for calculating the similarity of judgement documents, the similarity between the pending judgement document and each first judgement document; and takes as the search result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity is greater than a preset threshold. The technical solution is tailored to the characteristics of judgement documents: the similarity between the pending judgement document and each first judgement document is determined according to the similarity between the judgement keywords corresponding to the defendants in the pending judgement document and in the first judgement document. This avoids calculating similarity only at the word level with a preset dictionary or a pre-trained text similarity calculation model. Judgement keywords, which may have little lexical connection with the defendant but are of key significance to the defendant at the judgement level, become the core of the similarity calculation between the pending judgement document and each first judgement document, which improves the accuracy of the search for similar judgement documents.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flow chart of a method for calculating the similarity of judgement documents according to an embodiment of the present invention;
Fig. 2 is a flow chart of another method for calculating the similarity of judgement documents according to an embodiment of the present invention;
Fig. 3 is a flow chart of a method for searching for similar judgement documents according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for calculating the similarity of judgement documents according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for searching for similar judgement documents according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of computer equipment provided by an embodiment of the present invention.
Detailed description of embodiments
As described in the background, in the prior art, whether text similarity is calculated from a dictionary and word frequencies or determined by a pre-trained similarity calculation model, the calculation for documents in different fields is carried out only at the word level, and its accuracy is poor.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for calculating the similarity of judgement documents according to an embodiment of the present invention.
The method for calculating the similarity of judgement documents shown in Fig. 1 may include the following steps:
Step S101: Obtain at least two judgement documents.
Step S102: Extract the judgement keywords of one or more defendants in each judgement document.
Step S103: Determine the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents.
In a specific implementation, there are at least two judgement documents. When there are two judgement documents, the similarity between the two can be calculated; when there are more than two, the similarities among them can also be calculated. A judgement document is a document written by a court according to its judgement, including but not limited to civil judgements, criminal judgements, administrative judgements, and judgements in incidental civil actions. A judgement document generally hands down judgement on one or more defendants. For example, in the second-instance criminal judgement "Zeng Wentai and Li Yunming: crime of smuggling, trafficking, transporting, and manufacturing drugs", defendant Zeng Wentai smuggled 50.35g of heroin below standard purity, is a first offender, was found to be a prime culprit, showed a good attitude in admitting guilt, and performed major meritorious service; defendant Li Yunming trafficked 40.12g of methamphetamine below standard purity, is a first offender, was found to be a prime culprit, showed a good attitude in admitting guilt, and performed meritorious service.
Specifically, the judgement keywords include one or more of the following: judgement fact keywords, judgement reason keywords, and judgement result keywords. An example of a judgement fact keyword is "50.35g of heroin below standard purity"; an example of a judgement reason keyword is "the plaintiff's request to vacate the house is supported by evidence"; examples of judgement result keywords are "first offender" and "prime culprit". First, in step S101, at least two judgement documents are obtained. Then, in step S102, the judgement keywords of one or more defendants in each judgement document are extracted. Finally, in step S103, the similarity between the corresponding judgement documents is determined according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents.
The embodiment of the present invention is tailored to the characteristics of judgement documents: it extracts the judgement keywords of one or more defendants in each judgement document and bases the similarity between judgement documents on the similarity between the judgement keywords corresponding to the defendants in different documents. Judgement keywords, which may have little lexical connection with the defendant but are of key significance to the defendant at the judgement level, thus become the core of the similarity calculation between judgement documents. This avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity calculation model, and so improves the accuracy of judgement document similarity calculation.
In a specific implementation, after step S101 and before step S102, the method may include the following step: performing sentence splitting on each judgement document to obtain a plurality of sentences. Specifically, each judgement document can be split at the punctuation marks that end a sentence, such as question marks, exclamation marks, and full stops, with each sentence saved on its own line. More specifically, before sentence splitting, each judgement document can be converted to text format, and invalid content produced by the conversion, such as pictures and garbled characters, can be filtered out. Sentence splitting is then performed on the filtered judgement document. Splitting the judgement document into sentences in this embodiment facilitates the subsequent steps.
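The sentence splitting step can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `split_sentences` is hypothetical, and the set of terminators is an assumption (full-width 。！？ plus ASCII ! and ?). The ASCII period is deliberately left out so that decimals such as 50.35g are not cut in two.

```python
import re

# One sentence = a run of non-terminator characters plus an optional
# sentence-ending punctuation mark (assumed set: 。！？ and ASCII ! ?).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Split already-cleaned judgement text into sentences, one per list item."""
    sentences = (match.strip() for match in _SENTENCE.findall(text))
    # Drop fragments that contained only whitespace.
    return [s for s in sentences if s]
```

Each returned item keeps its terminating punctuation mark, so the sentences can be written out one per line as the step describes.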
In a specific implementation, step S102 may include the following steps: performing word segmentation on the sentences in each judgement document according to a word segmentation dictionary to obtain a word segmentation result; performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including the defendants; performing, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and combining the entity relationships and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
Specifically, entity names are person names, organization names, place names, and all other entities identified by a name, including numbers, dates, currencies, addresses, and so on. They can be obtained using entity recognition methods such as CRF. Examples are "Zhang San" and "Li Si".
An entity relationship is the relation between two entity names, for example person name + drug name + trafficking, or person name + date + birth. Entity relationships can be obtained using existing entity relation extraction methods. Examples are "Zhang San + drug name + trafficking" and "Li Si + date + birth".
Feature values represent the basis on which the defendant is judged. They can be obtained by feature extraction methods such as trigger vocabulary extraction, for example "Zhang San + prime culprit" and "Li Si + major meritorious service".
Specifically, the word segmentation dictionary can be trained as follows: extracting new-word candidates from the judgement document; filtering the new-word candidates according to grammar and/or word order; and adding the candidates that conform to the grammar and/or word order to the word segmentation dictionary. Training the word segmentation dictionary with new-word candidates from judgement documents improves its completeness, and filtering the candidates according to grammar and/or word order improves its accuracy.
Specifically, the feature values of the defendant can be obtained as follows: establishing a basic trigger vocabulary that includes one or more trigger words, the trigger words being used to represent event information in the judgement document; obtaining, according to a synonym thesaurus, near-synonyms of at least one trigger word in the basic trigger vocabulary; adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant. Because the extended trigger vocabulary contains the near-synonyms of the trigger words, the feature values of the defendant extracted with it are more accurate.
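The trigger vocabulary expansion and the trigger-based extraction can be sketched as follows. Everything named here is an assumption for illustration: the thesaurus is stood in for by a plain dictionary, sentences are assumed to be already segmented into token lists, and a trigger word is attributed to a defendant simply when both occur in the same sentence. The patent does not prescribe these details.

```python
def expand_trigger_vocabulary(base_triggers, thesaurus):
    """Return the base trigger words plus their near-synonyms.

    thesaurus: mapping of word -> iterable of near-synonyms; a hypothetical
    stand-in for a real synonym thesaurus.
    """
    expanded = set(base_triggers)
    for word in base_triggers:
        expanded.update(thesaurus.get(word, ()))
    return expanded


def extract_feature_values(defendant, segmented_sentences, triggers):
    """Collect trigger words found in sentences that mention the defendant.

    segmented_sentences: list of token lists (the word segmentation result).
    Returns the feature values in order of first appearance, without repeats.
    """
    features = []
    for tokens in segmented_sentences:
        if defendant not in tokens:
            continue
        for token in tokens:
            if token in triggers and token not in features:
                features.append(token)
    return features
```

With a base vocabulary of "prime culprit", "accessory", and "meritorious service", the expansion lets a sentence that says only "auxiliary role" still produce a feature value for the defendant.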
In a specific implementation, the following example is given. It serves only to aid understanding of the technical solution of the present invention and does not limit it. For example, a preprocessed judgement document includes the following content:
"Appellant (defendant at first instance) Ding Ya, formerly named Ding Xiya, male, born on November 13, 1981, Han nationality, unemployed, registered residence in Dafang County, Guizhou Province, temporarily residing in Tong'an District, Xiamen City, Fujian Province.
This court holds that appellant Ding Ya, knowing full well that they were drugs, still helped another person traffic 61 grams of crystal methamphetamine; his conduct constitutes a drug offense and shall be punished according to law.
This case is a joint crime. Appellant Ding Ya played a secondary, auxiliary role in the crime and is an accessory, and shall be given a lighter or mitigated punishment according to law.
After being brought to justice, Ding Ya truthfully confessed his crime and may be given a lighter punishment according to law.
Ding Ya is a recidivist and shall be given a heavier punishment according to law."
In a specific implementation, one word segmentation result of the above judgement document is shown in Table 1, where each cell represents one segmented word.
Table 1: Example word segmentation result
In a specific implementation, one entity recognition result obtained by an entity recognition algorithm is shown in Table 2, where each cell represents one entity name.
Table 2: Example entity recognition result
In a specific implementation, applying the entity relation extraction method to "Appellant (defendant at first instance) Ding Ya, formerly named Ding Xiya, male, born on November 13, 1981, Han nationality, unemployed, registered residence in Dafang County, Guizhou Province, temporarily residing in Tong'an District, Xiamen City, Fujian Province." yields the entity relationship "Ding Ya -- birth -- November 13, 1981" and/or the entity relationship "Ding Ya -- registered residence -- Dafang County, Guizhou Province". Applying it to "This court holds that appellant Ding Ya, knowing full well that they were drugs, still helped another person traffic 61 grams of crystal methamphetamine; his conduct constitutes a drug offense and shall be punished according to law." yields the entity relationship "Ding Ya -- trafficking -- 61 grams -- crystal methamphetamine".
In a specific implementation, for example, the trigger vocabulary includes: prime culprit, accessory, meritorious service. The extended trigger vocabulary obtained after expansion with the synonym thesaurus includes: prime culprit, main role, accessory, secondary role, auxiliary role, meritorious service, major meritorious service. Extracting from "This case is a joint crime. Appellant Ding Ya played a secondary, auxiliary role in the crime and is an accessory, and shall be given a lighter or mitigated punishment according to law." with the extended trigger vocabulary yields the feature value "Ding Ya -- accessory"; extracting from "After being brought to justice, Ding Ya truthfully confessed his crime and may be given a lighter punishment according to law." yields the feature value "Ding Ya -- truthful confession"; and extracting from "Ding Ya is a recidivist and shall be given a heavier punishment according to law." yields the feature value "Ding Ya -- recidivist".
Through step S102, the embodiment of the present invention extracts entity names from the word segmentation result of the judgement document and thereby obtains the defendant information in the judgement document. It extracts the entity relationships between entity names from the same sentence and/or adjacent sentences, and performs feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant. The calculation is convenient and fast, which speeds up obtaining the defendant's judgement keywords.
In a specific implementation, step S103 may include the following step: calculating the similarity between the judgement keywords of each pair of defendants taken from different judgement documents, the maximum of these pairwise similarities being the similarity between the corresponding judgement documents.
Specifically, the similarity between the judgement keywords of each pair of defendants in different judgement documents may be calculated as follows: build a vector from the judgement keywords of each defendant; calculate the similarity between the vectors of each pair of defendants. The similarity between vectors may be calculated from the Euclidean distance between the vectors, by a cosine algorithm, or in a similar way. By taking the maximum of the similarities between the judgement keywords of each pair of defendants as the similarity between the corresponding judgement documents, the embodiment of the present invention improves the accuracy of judgement document similarity calculation.
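The pairwise-maximum rule above can be sketched as follows, using cosine similarity as one of the options the text names. The function names and the toy vectors are assumptions made for illustration; each document is represented as a list of defendant vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def document_similarity(defendants_a, defendants_b):
    """Maximum similarity over every pair of defendants from the two documents."""
    return max(
        (cosine_similarity(u, v) for u in defendants_a for v in defendants_b),
        default=0.0,
    )

# Document A has two defendants, document B has one (toy vectors).
doc_a = [[1, 0, 1, 0], [0, 1, 0, 1]]
doc_b = [[1, 0, 1, 1]]
similarity = document_similarity(doc_a, doc_b)
```

Taking the maximum means two documents count as similar as soon as one defendant in each has closely matching judgement keywords, even if their other defendants differ.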
In a specific implementation, taking the exemplary judgement document above as an example, the feature values of defendant Ding Ya are as follows (the feature values may be ordered by grammar, word order, or the like): sell, methamphetamine, 61 g, accessory, recidivist, and so on. A judgement keyword vector is then built, for example with a bag-of-words model. The preset standard vector is [sell, transport, methamphetamine, marijuana, 1-10g, 11-50g, 50-100g, accessory, principal offender, recidivist, first offender ...]. When the feature values are digitized, all features have the same weight by default, i.e. every feature is equally important: a feature value that is present maps to 1, and a feature value that is absent maps to 0. The vector for defendant Ding Ya is then: (1 0 1 0 0 0 1 1 0 1 0......).
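The bag-of-words encoding above can be sketched as follows. The term list reproduces the standard vector from the text, reading its second weight range as "11-50g" (an assumption, since that entry is garbled in the source); `weight_bin` and `build_vector` are hypothetical helper names introduced for this example.

```python
# Standard vector terms from the example; "11-50g" is an assumed reading.
STANDARD_TERMS = [
    "sell", "transport", "methamphetamine", "marijuana",
    "1-10g", "11-50g", "50-100g",
    "accessory", "principal offender", "recidivist", "first offender",
]

# Hypothetical weight ranges in grams matching the three range terms above.
WEIGHT_BINS = [("1-10g", 1, 10), ("11-50g", 11, 50), ("50-100g", 50, 100)]

def weight_bin(grams):
    """Map a drug weight in grams to the first range term that contains it."""
    for label, low, high in WEIGHT_BINS:
        if low <= grams <= high:
            return label
    return None

def build_vector(feature_values, terms=STANDARD_TERMS):
    """Binary bag-of-words vector: 1 if the term is present, 0 otherwise."""
    present = set(feature_values)
    return [1 if term in present else 0 for term in terms]

ding_ya_features = ["sell", "methamphetamine", weight_bin(61), "accessory", "recidivist"]
ding_ya_vector = build_vector(ding_ya_features)
```

With these inputs the 61 g weight falls into the "50-100g" term, which gives the same eleven-element vector as the text.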
In a preferred embodiment, the calculation method for judgement document similarity is shown in Fig. 2, which is a flow chart of another calculation method for judgement document similarity according to an embodiment of the present invention.
The calculation method for judgement document similarity shown in Fig. 2 may include the following steps:
Step S201: Obtain at least two judgement documents.
Step S202: Split each judgement document into sentences, obtaining multiple sentences.
Step S203: Extract new-word candidates from the judgement documents.
Step S204: Filter the new-word candidates by grammar and/or word order, and add the candidates that satisfy the grammar and/or word order to the word segmentation dictionary.
Step S205: Segment the sentences of each judgement document into words according to the word segmentation dictionary, obtaining a word segmentation result.
Step S206: Perform entity recognition on the word segmentation result to obtain the entity names in it; the entity names include defendants.
Step S207: According to the entity names, perform entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences, obtaining the entity relationships between the entity names.
Step S208: Build a basic trigger vocabulary containing one or more trigger words; a trigger word represents event information in the judgement document.
Step S209: Obtain near-synonyms of at least one trigger word in the basic trigger vocabulary from the synonym thesaurus.
Step S210: Add the near-synonyms to the basic trigger vocabulary, obtaining the extended trigger vocabulary.
Step S211: Extract from the word segmentation result according to the extended trigger vocabulary, obtaining the feature values of the defendant.
Step S212: Combine the entity relationships and feature values corresponding to the same defendant, obtaining the judgement keywords of each defendant.
Step S213: Build a vector from the judgement keywords of each defendant.
Step S214: Calculate the similarity between the vectors of each pair of defendants; the maximum of the similarities between the judgement keywords of each pair of defendants is the similarity between the corresponding judgement documents.
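As an illustration of step S212, the sketch below groups entity relationships and feature values by defendant. The source does not specify data formats, so the (defendant, relation, target) triples, the (defendant, value) pairs, and the name `combine_keywords` are all assumptions made for this example.

```python
from collections import defaultdict

def combine_keywords(relations, features):
    """Group relation terms and feature values by defendant, keeping first-seen order."""
    keywords = defaultdict(list)
    # Each relation is an assumed (defendant, relation, target) triple.
    for defendant, relation, target in relations:
        for term in (relation, target):
            if term not in keywords[defendant]:
                keywords[defendant].append(term)
    # Each feature is an assumed (defendant, value) pair.
    for defendant, value in features:
        if value not in keywords[defendant]:
            keywords[defendant].append(value)
    return dict(keywords)

relations = [("Ding Ya", "sell", "methamphetamine")]
features = [("Ding Ya", "accessory"), ("Ding Ya", "recidivist")]
keywords = combine_keywords(relations, features)
```

The resulting per-defendant keyword list is the input that step S213 would turn into a vector.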
In a specific implementation, in step S202, each judgement document is converted to text format, and invalid content produced during the conversion, such as pictures and garbled characters, is filtered out. The filtered judgement document is then split into lines at the punctuation marks that indicate the end of a sentence, such as question marks, exclamation marks, and full stops, and saved.
In another specific embodiment of the present invention, steps S203 and S204 may also be performed after step S201 and before step S202; that is, the word segmentation dictionary is built in advance, before the judgement documents are split into sentences, to reduce the workload of the subsequent steps.
In a specific implementation, in step S206, an entity recognition algorithm may be used to perform entity recognition on the word segmentation result. Specifically, the entity recognition algorithm may be a Conditional Random Field (CRF) algorithm or the like.
In a specific implementation, in step S207, an entity relation extraction algorithm may be used to perform entity relation extraction on the word segmentation result. Specifically, the entity relation extraction algorithm may also be a CRF algorithm or the like.
Those skilled in the art will understand that the entity recognition algorithm and the entity relation extraction algorithm may be any practicable algorithms, and the embodiment of the present invention places no limitation on this.
In a specific implementation, in step S208, a trigger word is, for example, "principal offender".
In another specific embodiment of the present invention, steps S208, S209, and S210 may also be performed after step S201 and before step S202; that is, the extended trigger vocabulary is built in advance, before the judgement documents are split into sentences, to reduce the workload of the subsequent steps.
In a specific implementation, in step S214, the similarity between vectors may be calculated from the Euclidean distance between the vectors, by a cosine algorithm, or in a similar way.
Those skilled in the art will understand that any practicable algorithm may be used to calculate the similarity between vectors, and the embodiment of the present invention places no limitation on this.
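A Euclidean-distance variant of the vector comparison can be sketched as follows. The source only names the distance, so the mapping from distance to similarity used here, 1 / (1 + d), is an assumption chosen so that identical vectors score 1.0 and the score decreases as the distance grows; the function name is likewise hypothetical.

```python
import math

def euclidean_similarity(u, v):
    """Similarity in (0, 1] derived from Euclidean distance via the assumed mapping 1 / (1 + d)."""
    distance = math.dist(u, v)  # requires Python 3.8 or later
    return 1.0 / (1.0 + distance)

same = euclidean_similarity([1, 0, 1], [1, 0, 1])
different = euclidean_similarity([1, 0, 0], [0, 1, 0])
```

Unlike cosine similarity, this measure is sensitive to vector length, which matters once feature values carry weights other than 0 and 1.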
Targeting the characteristics of judgement documents, the embodiment of the present invention extracts the judgement keywords of one or more defendants in each judgement document and, when determining the similarity between judgement documents, relies on the similarity between the judgement keywords corresponding to defendants in different judgement documents. Judgement keywords that may have little lexical connection with the defendant, but are of critical significance to the defendant at the judgement level, thus become the core of the similarity calculation between judgement documents. This avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity model, and thereby improves the accuracy of judgement document similarity calculation.
Fig. 3 is a flow chart of a lookup method for similar judgement documents according to an embodiment of the present invention.
The lookup method for similar judgement documents shown in Fig. 3 may include the following steps:
Step S301: Obtain a pending judgement document.
Step S302: Search a judgement document database according to the pending judgement document; the judgement document database includes multiple first judgement documents.
Step S303: Using the calculation method for judgement document similarity described in the above embodiments, determine the similarity between the pending judgement document and each first judgement document.
Step S304: Take as the lookup result either a preset number of first judgement documents with the highest similarity, or the first judgement documents whose similarity exceeds a preset threshold.
In a specific implementation, the vector corresponding to each defendant in each first judgement document may be stored in advance in the judgement document database of step S302, to reduce the workload of the subsequent steps.
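Storing defendant vectors ahead of time can be sketched with a small in-memory store. The class `JudgementVectorStore`, its methods, and the document identifiers are hypothetical; a real system would presumably persist the vectors in a database, and cosine similarity is used here only as one of the measures the text allows.

```python
import math

def _cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

class JudgementVectorStore:
    """In-memory stand-in for a judgement document database with stored vectors."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, defendant_vectors):
        """Store the vectors of every defendant in one first judgement document."""
        self._vectors[doc_id] = [list(v) for v in defendant_vectors]

    def similarities(self, query_vectors):
        """Best pairwise similarity between the pending document and each stored one."""
        return {
            doc_id: max(
                (_cosine(q, v) for q in query_vectors for v in vectors),
                default=0.0,
            )
            for doc_id, vectors in self._vectors.items()
        }

store = JudgementVectorStore()
store.add("doc_a", [[1, 0, 1, 0], [0, 1, 0, 0]])
store.add("doc_b", [[0, 1, 0, 1]])
scores = store.similarities([[1, 0, 1, 0]])
```

Because the stored vectors are computed once at ingestion, a query only needs to vectorise the pending judgement document before comparing.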
In a specific implementation, for the calculation method for judgement document similarity used in step S303, reference may be made to the calculation method shown in Fig. 1 or Fig. 2, which is not repeated here.
In a specific implementation, in step S304, after the similarity between the pending judgement document and each first judgement document has been obtained, the documents may be sorted by similarity from high to low, and the top N judgement documents taken as the lookup result and output, where N is the preset number; alternatively, the first judgement documents whose similarity exceeds the preset threshold may be taken as the lookup result and output. The preset number of highest-similarity first judgement documents to output, or the preset similarity threshold, may be configured and adapted to the actual application environment, and the embodiment of the present invention places no limitation on this.
After the vector corresponding to a defendant has been obtained with the calculation method for judgement document similarity shown in Fig. 1, in practice each feature value in the vector may be weighted according to the purpose of the query. For example, if the drug methamphetamine is what matters in the current query, the weight of methamphetamine may be set to 2.0 and the weights of the other feature values to 1.0, giving the following defendant vector: Ding Ya (1.0*1 1.0*0 2.0*1 1.0*0 1.0*0 1.0*0 1.0*1 1.0*1 1.0*0 1.0*1 1.0*0......).
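The query-dependent weighting can be sketched as follows. The term list mirrors the example standard vector, with its second weight range read as "11-50g" (an assumption about a garbled entry), and `apply_weights` is a hypothetical helper name.

```python
TERMS = [
    "sell", "transport", "methamphetamine", "marijuana",
    "1-10g", "11-50g", "50-100g",
    "accessory", "principal offender", "recidivist", "first offender",
]

def apply_weights(vector, terms, weights, default=1.0):
    """Multiply each binary feature by its query weight (default weight 1.0)."""
    return [weights.get(term, default) * value for term, value in zip(terms, vector)]

ding_ya = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
# Emphasise methamphetamine for this query; every other term keeps weight 1.0.
weighted = apply_weights(ding_ya, TERMS, {"methamphetamine": 2.0})
```

Both the pending document's vectors and the stored vectors would need the same weights applied before comparison for the scores to stay comparable.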
In a specific implementation, the query results may be sorted from high to low by the calculated text similarity, and the top N judgement documents taken as the lookup result and output; N may be set to 5-20 according to ordinary browsing habits, or to some other value. The retrieval results may also be screened by a threshold; for example, the preset threshold may be set to 0.8-0.99. In general, when the similarity is greater than 0.8, the two judgement documents are considered highly similar and can be output as a query result.
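The two selection rules above, top N and threshold screening, can be sketched as follows. The function name `select_results` and the sample scores are invented for illustration; the text presents the two rules as alternatives, while this sketch also allows combining them.

```python
def select_results(similarities, top_n=None, threshold=None):
    """Rank documents by similarity, then keep the top N and/or those above a threshold."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(doc, score) for doc, score in ranked if score > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

scores = {"doc_a": 0.92, "doc_b": 0.75, "doc_c": 0.85}
top_two = select_results(scores, top_n=2)
above_threshold = select_results(scores, threshold=0.8)
```

With these sample scores both rules return the same two documents; on a larger database the top N rule bounds the output size, while the threshold rule bounds its quality.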
Targeting the characteristics of judgement documents, the technical solution of the present invention determines the similarity between the pending judgement document and each first judgement document according to the similarity between the judgement keywords corresponding to the defendants in the pending judgement document and in the first judgement document, and takes as the lookup result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity exceeds a preset threshold. This avoids calculating similarity only at the word level with a preset dictionary or a pre-trained text similarity model. Instead, judgement keywords that may have little lexical connection with the defendant, but are of critical significance to the defendant at the judgement level, become the core of the similarity calculation between the pending judgement document and each first judgement document, which improves the accuracy of similar judgement document lookup.
Fig. 4 is a schematic structural diagram of a computing device for judgement document similarity according to an embodiment of the present invention.
The computing device 40 for judgement document similarity shown in Fig. 4 may include an acquisition module 401, a keyword extraction module 402, a keyword similarity calculation module 403, and a text similarity calculation module 404.
The acquisition module 401 is used to obtain at least two judgement documents.
The keyword extraction module 402 is used to extract the judgement keywords of one or more defendants in each judgement document.
The keyword similarity calculation module 403 is used to obtain the similarity between the judgement keywords corresponding to defendants in different judgement documents.
The text similarity calculation module 404 is used to determine the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents.
Targeting the characteristics of judgement documents, the embodiment of the present invention extracts the judgement keywords of one or more defendants in each judgement document and, when determining the similarity between judgement documents, relies on the similarity between the judgement keywords corresponding to defendants in different judgement documents. Judgement keywords that may have little lexical connection with the defendant, but are of critical significance to the defendant at the judgement level, thus become the core of the similarity calculation between judgement documents. This avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity model, and thereby improves the accuracy of judgement document similarity calculation.
In a specific implementation, the judgement keywords include one or more of the following: judgement fact keywords, judgement reason keywords, and judgement result keywords.
In a specific implementation, the computing device 40 for judgement document similarity may further include a sentence splitting module (not shown), which is used to split each judgement document into sentences, obtaining multiple sentences. Specifically, the sentence splitting module may convert each judgement document to text format and filter out invalid content produced during the conversion, such as pictures and garbled characters. It then splits the filtered judgement document into lines at the punctuation marks that indicate the end of a sentence, such as question marks, exclamation marks, and full stops, and saves the result. Splitting the judgement documents into sentences facilitates the operations of the subsequent steps.
In a specific implementation, the keyword extraction module 402 may include a word segmentation unit (not shown), an entity recognition unit (not shown), an entity relation extraction unit (not shown), a feature extraction unit (not shown), and a combination unit (not shown). The word segmentation unit is used to segment the sentences of each judgement document into words according to the word segmentation dictionary, obtaining a word segmentation result; the entity recognition unit is used to perform entity recognition on the word segmentation result, obtaining the entity names in it, the entity names including defendants; the entity relation extraction unit is used to perform, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences, obtaining the entity relationships between the entity names; the feature extraction unit is used to perform feature extraction on the word segmentation result according to the defendant, obtaining the feature values of the defendant; the combination unit is used to combine the entity relationships and feature values corresponding to the same defendant, obtaining the judgement keywords of each defendant. Entity names are extracted from the word segmentation result of the judgement document to obtain the defendant information; the entity relationships between entity names are extracted from the same sentence and/or adjacent sentences, and feature extraction is performed on the word segmentation result according to the defendant to obtain the feature values of the defendant. The calculation is convenient and fast, which speeds up obtaining the judgement keywords of the defendant.
Specifically, the feature extraction unit may include a basic trigger vocabulary building subunit (not shown), a near-synonym obtaining subunit (not shown), an extended trigger vocabulary building subunit (not shown), and a defendant feature value extraction subunit (not shown). The basic trigger vocabulary building subunit is used to build a basic trigger vocabulary containing one or more trigger words, a trigger word representing event information in the judgement document; the near-synonym obtaining subunit is used to obtain near-synonyms of at least one trigger word in the basic trigger vocabulary from the synonym thesaurus; the extended trigger vocabulary building subunit is used to add the near-synonyms to the basic trigger vocabulary, obtaining the extended trigger vocabulary; the defendant feature value extraction subunit is used to extract from the word segmentation result according to the extended trigger vocabulary, obtaining the feature values of the defendant. Because the extended trigger vocabulary contains the near-synonyms of the trigger words, the defendant feature values extracted with it are highly accurate.
Specifically, the keyword similarity calculation module 403 may be used to calculate the similarity between the judgement keywords of each pair of defendants in different judgement documents; the text similarity calculation module 404 may be used to take the maximum of the similarities between the judgement keywords of each pair of defendants as the similarity between the corresponding judgement documents. Taking this maximum as the similarity between the corresponding judgement documents improves the accuracy of judgement document similarity calculation.
In a specific implementation, the keyword similarity calculation module 403 may include a vector building unit (not shown) and a vector similarity calculation unit (not shown). The vector building unit is used to build a vector from the judgement keywords of each defendant; the vector similarity calculation unit is used to calculate the similarity between the vectors of each pair of defendants.
In a specific implementation, the computing device 40 for judgement document similarity may further include a candidate word extraction module (not shown) and a filtering module (not shown). The candidate word extraction module is used to extract new-word candidates from the judgement documents; the filtering module is used to filter the new-word candidates by grammar and/or word order, and to add the candidates that satisfy the grammar and/or word order to the word segmentation dictionary.
For specific implementations of this embodiment, reference may be made to the calculation method for judgement document similarity shown in Fig. 1 or Fig. 2, which is not repeated here.
Fig. 5 is a schematic structural diagram of a lookup device for similar judgement documents according to an embodiment of the present invention.
The lookup device 50 for similar judgement documents shown in Fig. 5 may include an input module 501, a lookup module 502, a computing device 503 for judgement document similarity, and an output module 504. The input module 501 is used to obtain a pending judgement document; the lookup module 502 is used to search a judgement document database according to the pending judgement document, the judgement document database including multiple first judgement documents; the computing device 503 for judgement document similarity is used to determine the similarity between the pending judgement document and each first judgement document; the output module 504 is used to take as the lookup result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity exceeds a preset threshold.
It should be noted that the preset number of highest-similarity first judgement documents to output, or the preset similarity threshold, may be configured and adapted to the actual application environment, and the embodiment of the present invention places no limitation on this.
For specific implementations of the computing device 503 for judgement document similarity, reference may be made to the calculation method for judgement document similarity shown in Fig. 1 or Fig. 2, which is not repeated here.
For specific implementations of this embodiment, reference may be made to the lookup method for similar judgement documents shown in Fig. 3, which is not repeated here.
The technical solution of the present invention obtains a pending judgement document; searches a judgement document database according to the pending judgement document, the judgement document database including multiple first judgement documents; determines the similarity between the pending judgement document and each first judgement document; and takes as the lookup result either a preset number of first judgement documents with the highest similarity or the first judgement documents whose similarity exceeds a preset threshold. Targeting the characteristics of judgement documents, the technical solution determines the similarity between the pending judgement document and each first judgement document according to the similarity between the judgement keywords corresponding to the defendants in the two documents. This avoids calculating similarity only at the word level with a preset dictionary or a pre-trained text similarity model. Instead, judgement keywords that may have little lexical connection with the defendant, but are of critical significance to the defendant at the judgement level, become the core of the similarity calculation between the pending judgement document and each first judgement document, which improves the accuracy of similar judgement document lookup.
The embodiment of the present invention also discloses a lookup system for similar judgement documents. The lookup system may include the lookup device 50 shown in Fig. 5 and a judgement document database, which stores multiple first judgement documents and the vector corresponding to each defendant in each first judgement document. The lookup device 50 may be integrated inside the lookup system or coupled to it externally.
Fig. 6 is a schematic structural diagram of a computer equipment provided by an embodiment of the present invention. Fig. 6 shows a block diagram of a computer equipment 12 suitable for implementing embodiments of the present invention. The computer equipment 12 shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
The computer equipment 12 shown in Fig. 6 takes the form of a general-purpose computing device. The components of the computer equipment 12 may include, but are not limited to: one or more processors 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processor 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus, a processor bus, or a local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer equipment 12 typically includes a variety of computer-readable storage media. These media may be any available media that can be accessed by the computer equipment 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-readable storage media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer equipment 12 may further include other removable/non-removable, volatile/non-volatile computer-readable storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 6, commonly called a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product (i.e. computer program) having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer equipment 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer equipment 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer equipment 12 to communicate with one or more other computing devices. Such communication may take place through the input/output (I/O) interface 22. The computer equipment 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer equipment 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in combination with the computer equipment 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running programs stored in the memory 28, the processor 16 performs various functional applications and data processing, for example implementing the calculation method for judgement document similarity shown in Fig. 1 or Fig. 2, or the lookup method for similar judgement documents shown in Fig. 3.
The embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the calculation method for judgement document similarity shown in Fig. 1 or Fig. 2; alternatively, when executed by a processor, the program implements the lookup method for similar judgement documents shown in Fig. 3.
The computer-readable storage medium of the embodiment of the present invention may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination of them, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.
Claims (24)
1. A calculation method for judgement document similarity, characterized by comprising:
Obtaining at least two judgement documents;
Extracting the judgement keywords of one or more defendants in each judgement document;
Determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents.
2. The calculation method according to claim 1, characterized in that, after obtaining the at least two judgement documents and before extracting the judgement keywords of one or more defendants in each judgement document, the method further comprises:
Splitting each judgement document into sentences, obtaining multiple sentences.
3. The calculation method according to claim 2, characterized in that extracting the judgement keywords of one or more defendants in each judgement document comprises:
segmenting the sentences in each judgement document into words according to a word segmentation dictionary to obtain a word segmentation result;
performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including the defendant;
performing, according to the entity names, entity relationship extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names;
performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant;
combining the entity relationships and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
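The final combining step of claim 3 can be illustrated with a short sketch. It assumes that entity relationship extraction and feature extraction have already run and produced simple tuples; the tuple layouts, the `relation:entity` keyword format, and the name `combine_keywords` are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def combine_keywords(
    relations: Iterable[Tuple[str, str, str]],  # assumed shape: (defendant, relation, other entity)
    features: Iterable[Tuple[str, str]],        # assumed shape: (defendant, feature value)
) -> Dict[str, List[str]]:
    """Group entity relationships and feature values by defendant."""
    keywords: Dict[str, List[str]] = defaultdict(list)
    for defendant, relation, other in relations:
        keywords[defendant].append(f"{relation}:{other}")
    for defendant, value in features:
        keywords[defendant].append(value)
    return dict(keywords)

per_defendant = combine_keywords(
    [("Zhang", "stole_from", "Li")],
    [("Zhang", "theft"), ("Wang", "fraud")],
)
```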
4. The calculation method according to claim 3, characterized in that performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant comprises:
establishing a basic trigger vocabulary, the basic trigger vocabulary including one or more trigger words, the trigger words being used to represent event information in the judgement document;
obtaining near-synonyms of at least one trigger word in the basic trigger vocabulary according to a synonym thesaurus (Cilin);
adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary;
extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
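The trigger vocabulary expansion of claim 4 might look like the sketch below. The thesaurus is modelled as a plain dictionary from a word to its near-synonyms, which is an assumption standing in for whatever synonym resource is actually used; the function names and the sample words are likewise illustrative.

```python
from typing import Dict, Iterable, List, Set

def expand_trigger_vocabulary(basic: Iterable[str], thesaurus: Dict[str, List[str]]) -> Set[str]:
    """Add the near-synonyms of each basic trigger word to the vocabulary."""
    basic_words = list(basic)
    expanded = set(basic_words)
    for word in basic_words:
        expanded.update(thesaurus.get(word, []))
    return expanded

def extract_feature_values(tokens: Iterable[str], vocabulary: Set[str]) -> List[str]:
    """Keep the segmented words that appear in the extended trigger vocabulary."""
    return [t for t in tokens if t in vocabulary]

# Hypothetical thesaurus entry and trigger words.
thesaurus = {"盗窃": ["偷窃", "窃取"]}
vocabulary = expand_trigger_vocabulary(["盗窃", "抢劫"], thesaurus)
features = extract_feature_values(["被告人", "窃取", "手机"], vocabulary)
```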
5. The calculation method according to any one of claims 1 to 4, characterized in that determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents comprises:
calculating the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents, the maximum of the similarities between the judgement keywords corresponding to the pairs of defendants being the similarity between the corresponding judgement documents.
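The rule in claim 5 (take the maximum over all cross-document defendant pairs) can be sketched as below. The claim does not fix the keyword similarity measure at this point, so Jaccard overlap is used here purely as a stand-in; `document_similarity` and `jaccard` are illustrative names.

```python
from itertools import product
from typing import Callable, Dict, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in keyword similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def document_similarity(
    doc_a: Dict[str, Set[str]],  # defendant -> judgement keywords
    doc_b: Dict[str, Set[str]],
    keyword_similarity: Callable[[Set[str], Set[str]], float] = jaccard,
) -> float:
    """Document similarity = maximum similarity over all pairs of defendants."""
    pairs = product(doc_a.values(), doc_b.values())
    return max((keyword_similarity(ka, kb) for ka, kb in pairs), default=0.0)

doc_a = {"A1": {"theft", "phone"}, "A2": {"fraud"}}
doc_b = {"B1": {"theft", "wallet"}}
score = document_similarity(doc_a, doc_b)
```

Taking the maximum means two documents count as similar as soon as one defendant in each has closely matching keywords, regardless of how the other defendants compare.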
6. The calculation method according to claim 5, characterized in that calculating the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents comprises:
building a vector according to the judgement keywords corresponding to each defendant;
calculating the similarity between the vectors corresponding to each pair of defendants.
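One possible reading of claim 6 is sketched below, assuming a bag-of-words vector over a fixed keyword vocabulary and cosine similarity. Neither the vector construction nor the cosine measure is specified by the claim itself; both are assumptions chosen because they are common defaults.

```python
import math
from collections import Counter
from typing import List, Sequence

def build_vector(keywords: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Count how often each vocabulary term occurs among a defendant's keywords."""
    counts = Counter(keywords)
    return [float(counts[term]) for term in vocabulary]

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vocabulary = ["theft", "fraud", "phone"]
u = build_vector(["theft", "phone"], vocabulary)  # [1.0, 0.0, 1.0]
v = build_vector(["theft"], vocabulary)           # [1.0, 0.0, 0.0]
similarity = cosine_similarity(u, v)
```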
7. The calculation method according to claim 1, characterized in that the judgement keywords include one or more of the following: keywords of the judgement facts, keywords of the judgement grounds, and keywords of the judgement result.
8. The calculation method according to claim 3, characterized in that, before segmenting the sentences in the judgement document into words according to the word segmentation dictionary, the method further comprises:
extracting new-word candidates from the judgement document;
filtering the new-word candidates according to grammar and/or word order, and adding the filtering results that satisfy the grammar and/or word order to the word segmentation dictionary.
9. A method for searching for similar judgement documents, characterized by comprising:
obtaining a judgement document to be processed;
searching a judgement document database according to the judgement document to be processed, the judgement document database including a plurality of first judgement documents;
determining the similarity between the judgement document to be processed and each first judgement document using the method for calculating judgement document similarity according to any one of claims 1 to 8;
taking, as the search result, a preset number of first judgement documents with the highest similarity, or the first judgement documents whose similarity is greater than a preset threshold.
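The result selection in the last step of claim 9 could be sketched as follows, assuming the similarity of the document to be processed against every first judgement document has already been computed. The function name `select_results`, its parameters, and the sample scores are illustrative.

```python
from typing import Dict, List, Optional

def select_results(
    scores: Dict[str, float],          # first judgement document id -> similarity
    top_n: Optional[int] = None,       # preset number of most similar documents
    threshold: Optional[float] = None, # preset similarity threshold
) -> List[str]:
    """Return either the top_n most similar documents or those above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        return [doc_id for doc_id, _ in ranked[:top_n]]
    if threshold is not None:
        return [doc_id for doc_id, score in ranked if score > threshold]
    raise ValueError("either top_n or threshold must be given")

scores = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}
best_two = select_results(scores, top_n=2)
above_half = select_results(scores, threshold=0.5)
```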
10. The search method according to claim 9, characterized in that the judgement document database stores the vector corresponding to each defendant in each first judgement document.
11. A device for calculating judgement document similarity, characterized by comprising:
an acquisition module, for obtaining at least two judgement documents;
a keyword extraction module, for extracting the judgement keywords of one or more defendants in each judgement document;
a keyword similarity calculation module, for obtaining the similarity between the judgement keywords corresponding to defendants in different judgement documents;
a document similarity calculation module, for determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents.
12. The calculation device according to claim 11, characterized by further comprising:
a sentence segmentation module, for performing sentence segmentation on each judgement document to obtain a plurality of sentences.
13. The calculation device according to claim 12, characterized in that the keyword extraction module comprises:
a word segmentation unit, for segmenting the sentences in each judgement document into words according to a word segmentation dictionary to obtain a word segmentation result;
an entity recognition unit, for performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including the defendant;
an entity relationship extraction unit, for performing, according to the entity names, entity relationship extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names;
a feature extraction unit, for performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant;
a combining unit, for combining the entity relationships and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
14. The calculation device according to claim 13, characterized in that the feature extraction unit comprises:
a basic trigger vocabulary establishing subunit, for establishing a basic trigger vocabulary, the basic trigger vocabulary including one or more trigger words, the trigger words being used to represent event information in the judgement document;
a near-synonym obtaining subunit, for obtaining near-synonyms of at least one trigger word in the basic trigger vocabulary according to a synonym thesaurus (Cilin);
an extended trigger vocabulary establishing subunit, for adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary;
a defendant feature value extraction subunit, for extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
15. The calculation device according to any one of claims 11 to 14, characterized in that the keyword similarity calculation module is specifically used for calculating the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents;
the document similarity calculation module is specifically used for taking the maximum of the similarities between the judgement keywords corresponding to the pairs of defendants as the similarity between the corresponding judgement documents.
16. The calculation device according to claim 15, characterized in that the keyword similarity calculation module comprises:
a vector building unit, for building a vector according to the judgement keywords corresponding to each defendant;
a vector similarity calculation unit, for calculating the similarity between the vectors corresponding to each pair of defendants.
17. The calculation device according to claim 11, characterized in that the judgement keywords include one or more of the following: keywords of the judgement facts, keywords of the judgement grounds, and keywords of the judgement result.
18. The calculation device according to claim 13, characterized by further comprising:
a candidate word extraction module, for extracting new-word candidates from the judgement document;
a filtering module, for filtering the new-word candidates according to grammar and/or word order, and adding the filtering results that satisfy the grammar and/or word order to the word segmentation dictionary.
19. A device for searching for similar judgement documents, characterized by comprising:
an input module, for obtaining a judgement document to be processed;
a search module, for searching a judgement document database according to the judgement document to be processed, the judgement document database including a plurality of first judgement documents;
the device for calculating judgement document similarity according to any one of claims 11 to 18, for determining the similarity between the judgement document to be processed and each first judgement document;
an output module, for taking, as the search result, a preset number of first judgement documents with the highest similarity, or the first judgement documents whose similarity is greater than a preset threshold.
20. A system for searching for similar judgement documents, characterized by comprising the search device according to claim 19 and a judgement document database, the judgement document database storing a plurality of first judgement documents and the vector corresponding to each defendant in each first judgement document.
21. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for calculating judgement document similarity according to any one of claims 1 to 8.
22. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for searching for similar judgement documents according to claim 9 or 10.
23. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method for calculating judgement document similarity according to any one of claims 1 to 8.
24. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method for searching for similar judgement documents according to claim 9 or 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710165953.7A CN106933787A (en) | 2017-03-20 | 2017-03-20 | Adjudicate the computational methods of document similarity, search device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106933787A true CN106933787A (en) | 2017-07-07 |
Family
ID=59432440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710165953.7A Pending CN106933787A (en) | 2017-03-20 | 2017-03-20 | Adjudicate the computational methods of document similarity, search device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106933787A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107807962A (en) * | 2017-10-11 | 2018-03-16 | 中国软件与技术服务股份有限公司 | A kind of method for carrying out similarity mode to legal decision document using LDA topic models |
CN107818175A (en) * | 2017-11-17 | 2018-03-20 | 厦门能见易判信息科技有限公司 | A kind of law class case problem intelligently prejudges system and method |
CN109657227A (en) * | 2018-10-08 | 2019-04-19 | 平安科技(深圳)有限公司 | Contract feasibility determination method, equipment, storage medium and device |
CN110019655A (en) * | 2017-07-21 | 2019-07-16 | 北京国双科技有限公司 | Precedent case acquisition methods and device |
CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Adjudicate document information retrieval method, device, computer equipment and storage medium |
WO2019170015A1 (en) * | 2018-03-09 | 2019-09-12 | 北京国双科技有限公司 | Judicial document searching method and device |
CN110246064A (en) * | 2018-03-09 | 2019-09-17 | 北京国双科技有限公司 | A kind of relations of fact determines method and device |
CN110674308A (en) * | 2019-08-23 | 2020-01-10 | 上海科技发展有限公司 | Scientific and technological word list expansion method, device, terminal and medium based on grammar mode |
CN110866095A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text similarity determination method and related equipment |
CN111104790A (en) * | 2018-10-10 | 2020-05-05 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for extracting key relation and computer readable medium |
CN111177332A (en) * | 2019-11-27 | 2020-05-19 | 中证信用增进股份有限公司 | Method and device for automatically extracting referee document case-related mark and referee result |
CN111382769A (en) * | 2018-12-29 | 2020-07-07 | 阿里巴巴集团控股有限公司 | Information processing method, device and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101295307A (en) * | 2007-04-27 | 2008-10-29 | 株式会社日立制作所 | Document retrieval system and document retrieval method |
CN103218432A (en) * | 2013-04-15 | 2013-07-24 | 北京邮电大学 | Named entity recognition-based news search result similarity calculation method |
CN106095737A (en) * | 2016-06-07 | 2016-11-09 | 杭州凡闻科技有限公司 | Documents Similarity computational methods and similar document the whole network retrieval tracking |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019655A (en) * | 2017-07-21 | 2019-07-16 | 北京国双科技有限公司 | Precedent case acquisition methods and device |
CN107807962B (en) * | 2017-10-11 | 2018-11-30 | 中国软件与技术服务股份有限公司 | A method of similarity mode being carried out to legal decision document using LDA topic model |
CN107807962A (en) * | 2017-10-11 | 2018-03-16 | 中国软件与技术服务股份有限公司 | A kind of method for carrying out similarity mode to legal decision document using LDA topic models |
CN107818175A (en) * | 2017-11-17 | 2018-03-20 | 厦门能见易判信息科技有限公司 | A kind of law class case problem intelligently prejudges system and method |
WO2019170015A1 (en) * | 2018-03-09 | 2019-09-12 | 北京国双科技有限公司 | Judicial document searching method and device |
CN110246064A (en) * | 2018-03-09 | 2019-09-17 | 北京国双科技有限公司 | A kind of relations of fact determines method and device |
CN110246064B (en) * | 2018-03-09 | 2021-11-23 | 北京国双科技有限公司 | Method and device for determining fact relationship |
CN109657227A (en) * | 2018-10-08 | 2019-04-19 | 平安科技(深圳)有限公司 | Contract feasibility determination method, equipment, storage medium and device |
CN111104790A (en) * | 2018-10-10 | 2020-05-05 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for extracting key relation and computer readable medium |
CN111104790B (en) * | 2018-10-10 | 2024-03-22 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer readable medium for extracting key relation |
CN111382769A (en) * | 2018-12-29 | 2020-07-07 | 阿里巴巴集团控股有限公司 | Information processing method, device and system |
CN111382769B (en) * | 2018-12-29 | 2023-09-22 | 阿里巴巴集团控股有限公司 | Information processing method, device and system |
CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Adjudicate document information retrieval method, device, computer equipment and storage medium |
CN110674308A (en) * | 2019-08-23 | 2020-01-10 | 上海科技发展有限公司 | Scientific and technological word list expansion method, device, terminal and medium based on grammar mode |
CN110866095A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text similarity determination method and related equipment |
CN111177332A (en) * | 2019-11-27 | 2020-05-19 | 中证信用增进股份有限公司 | Method and device for automatically extracting referee document case-related mark and referee result |
CN111177332B (en) * | 2019-11-27 | 2023-11-24 | 中证信用增进股份有限公司 | Method and device for automatically extracting judge document case-related label and judge result |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106933787A (en) | Adjudicate the computational methods of document similarity, search device and computer equipment | |
CN107102989B (en) | Entity disambiguation method based on word vector and convolutional neural network | |
CN107193803B (en) | Semantic-based specific task text keyword extraction method | |
US9613024B1 (en) | System and methods for creating datasets representing words and objects | |
US8892420B2 (en) | Text segmentation with multiple granularity levels | |
CN110188168A (en) | Semantic relation recognition methods and device | |
US20170169008A1 (en) | Method and electronic device for sentiment classification | |
Guo et al. | A graph-based method for entity linking | |
US8577882B2 (en) | Method and system for searching multilingual documents | |
Shouzhong et al. | Mining microblog user interests based on TextRank with TF-IDF factor | |
CN106940726B (en) | Creative automatic generation method and terminal based on knowledge network | |
WO2023071118A1 (en) | Method and system for calculating text similarity, device, and storage medium | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
Wick et al. | A unified approach for schema matching, coreference and canonicalization | |
CN110516036A (en) | Legal documents information extracting method, device, computer equipment and storage medium | |
CN105868366A (en) | Concept space navigation method based on concept association | |
CN106933824A (en) | The method and apparatus that the collection of document similar to destination document is determined in multiple documents | |
CN109614478A (en) | Construction method, key word matching method and the device of term vector model | |
CN107315735B (en) | Method and equipment for note arrangement | |
Gupta et al. | Text analysis and information retrieval of text data | |
JP2021501387A (en) | Methods, computer programs and computer systems for extracting expressions for natural language processing | |
CN109918661B (en) | Synonym acquisition method and device | |
Yuan et al. | Task-specific word identification from short texts using a convolutional neural network | |
WO2016000511A1 (en) | Method and apparatus for mining rare resource of internet | |
CN113505196B (en) | Text retrieval method and device based on parts of speech, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170707 |