CN106933787A - Method for calculating judgment document similarity, search device and computer equipment - Google Patents

Method for calculating judgment document similarity, search device and computer equipment

Info

Publication number
CN106933787A
Authority
CN
China
Prior art keywords
judgement
similarity
document
defendant
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710165953.7A
Other languages
Chinese (zh)
Inventor
谢瑜
张昊
林涵
王洪远
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Original Assignee
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority to CN201710165953.7A
Publication of CN106933787A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for calculating judgment document similarity, a search device and computer equipment. The method includes: obtaining at least two judgment documents; extracting the judgment keywords of one or more defendants in each judgment document; and determining the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents. The technical solution of the present invention improves the accuracy of judgment document similarity calculation.

Description

Method for calculating judgment document similarity, search device and computer equipment
Technical field
The present invention relates to the technical field of data processing, and in particular to a method for calculating judgment document similarity, a search device and computer equipment.
Background
In the prior art, one way of calculating text similarity is as follows: two long texts are first segmented into words based on a dictionary, and the frequency of each word obtained from each text is calculated; the similarity between the words of the two texts is then determined according to the dictionary, and the overall similarity of the two texts is calculated from these similarities and the word frequencies. Another way is to train a text similarity calculation model in advance by machine learning and then determine the similarity between texts according to that model.
Documents in different fields have language features specific to their field. Therefore, whether text similarity is calculated from a dictionary and word frequencies or determined by a pre-trained similarity calculation model, the calculation is carried out only at the word level, and its accuracy is poor.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of judgment document similarity calculation.
To solve the above technical problem, an embodiment of the present invention provides a method for calculating judgment document similarity, including: obtaining at least two judgment documents; extracting the judgment keywords of one or more defendants in each judgment document; and determining the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents.
Optionally, after obtaining the at least two judgment documents and before extracting the judgment keywords of one or more defendants in each judgment document, the method further includes: splitting each judgment document into sentences to obtain multiple sentences.
Optionally, extracting the judgment keywords of one or more defendants in each judgment document includes: segmenting the sentences in each judgment document into words according to a word segmentation dictionary to obtain a word segmentation result; performing entity recognition on the word segmentation result to obtain the entity names in it, the entity names including defendants; performing entity relation extraction, according to the entity names, on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and combining the entity relationships and feature values corresponding to the same defendant to obtain the judgment keywords of each defendant.
Optionally, performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant includes: establishing a basic trigger vocabulary that includes one or more trigger words, the trigger words representing event information in the judgment document; obtaining near-synonyms of at least one trigger word in the basic trigger vocabulary according to a synonym thesaurus; adding the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and extracting from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
Optionally, determining the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents includes: calculating the similarity between the judgment keywords of each pair of defendants drawn from different judgment documents, the maximum of these pairwise similarities being the similarity between the corresponding judgment documents.
Optionally, calculating the similarity between the judgment keywords of each pair of defendants in different judgment documents includes: constructing a vector from the judgment keywords corresponding to each defendant; and calculating the similarity between the vectors of each pair of defendants.
Optionally, the judgment keywords include one or any combination of the following: keywords of judgment facts, keywords of judgment reasons and keywords of judgment results.
Optionally, before segmenting the sentences in the judgment document according to the word segmentation dictionary, the method further includes: extracting new-word candidates from the judgment document; and filtering the new-word candidates according to grammar and/or word order, and adding the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
To solve the above technical problem, an embodiment of the present invention further discloses a search method for similar judgment documents, including: obtaining a judgment document to be processed; searching a judgment document database according to the judgment document to be processed, the database including multiple first judgment documents; determining, using the above method for calculating judgment document similarity, the similarity between the judgment document to be processed and each first judgment document; and taking as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold.
Optionally, the judgment document database stores the vector corresponding to each defendant in each first judgment document.
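The search step can be pictured with a minimal Python sketch. It assumes each stored document is already reduced to a list of per-defendant vectors and that a pairwise vector similarity function is supplied by the caller. The names `find_similar`, `matching_fraction`, `top_k` and `threshold` are hypothetical and do not come from the patent, and `matching_fraction` is only a stand-in for whichever vector similarity is actually used.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def matching_fraction(u: Vector, v: Vector) -> float:
    """Stand-in similarity (an assumption): share of positions where u and v agree."""
    if not u:
        return 0.0
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)


def find_similar(
    query: List[Vector],
    database: Dict[str, List[Vector]],
    similarity: Callable[[Vector, Vector], float],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Rank stored documents against the query document.

    The document-level score is the maximum similarity over all pairs of
    defendant vectors (one from the query, one from the stored document).
    """
    scored = []
    for doc_id, vectors in database.items():
        score = max((similarity(q, v) for q in query for v in vectors), default=0.0)
        scored.append((doc_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Keep only documents whose similarity exceeds the preset threshold.
        scored = [(d, s) for d, s in scored if s > threshold]
    if top_k is not None:
        # Keep only the preset number of most similar documents.
        scored = scored[:top_k]
    return scored


# Hypothetical database: document id -> per-defendant vectors.
database = {"A": [[1, 0, 1, 0]], "B": [[0, 1, 0, 1]], "C": [[1, 0, 0, 0]]}
query = [[1, 0, 1, 0]]
top_two = find_similar(query, database, matching_fraction, top_k=2)
```

Storing the per-defendant vectors in the database, as the optional feature describes, means only the query document has to be processed at search time.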
To solve the above technical problem, an embodiment of the present invention further discloses a device for calculating judgment document similarity, including: an acquisition module, configured to obtain at least two judgment documents; a keyword extraction module, configured to extract the judgment keywords of one or more defendants in each judgment document; a keyword similarity calculation module, configured to obtain the similarity between the judgment keywords corresponding to defendants in different judgment documents; and a document similarity calculation module, configured to determine the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents.
Optionally, the device further includes: a sentence splitting module, configured to split each judgment document into sentences to obtain multiple sentences.
Optionally, the keyword extraction module includes: a word segmentation unit, configured to segment the sentences in each judgment document into words according to a word segmentation dictionary to obtain a word segmentation result; an entity recognition unit, configured to perform entity recognition on the word segmentation result to obtain the entity names in it, the entity names including defendants; an entity relation extraction unit, configured to perform entity relation extraction, according to the entity names, on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; a feature extraction unit, configured to perform feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and a combination unit, configured to combine the entity relationships and feature values corresponding to the same defendant to obtain the judgment keywords of each defendant.
Optionally, the feature extraction unit includes: a basic trigger vocabulary establishing subunit, configured to establish a basic trigger vocabulary that includes one or more trigger words, the trigger words representing event information in the judgment document; a near-synonym obtaining subunit, configured to obtain near-synonyms of at least one trigger word in the basic trigger vocabulary according to a synonym thesaurus; an extended trigger vocabulary establishing subunit, configured to add the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and a defendant feature value extraction subunit, configured to extract from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
Optionally, the keyword similarity calculation module is specifically configured to calculate the similarity between the judgment keywords of each pair of defendants drawn from different judgment documents; the document similarity calculation module is specifically configured to take the maximum of these pairwise similarities as the similarity between the corresponding judgment documents.
Optionally, the keyword similarity calculation module includes: a vector construction unit, configured to construct a vector from the judgment keywords corresponding to each defendant; and a vector similarity calculation unit, configured to calculate the similarity between the vectors of each pair of defendants.
Optionally, the judgment keywords include one or any combination of the following: keywords of judgment facts, keywords of judgment reasons and keywords of judgment results.
Optionally, the device further includes: a candidate word extraction module, configured to extract new-word candidates from the judgment document; and a filtering module, configured to filter the new-word candidates according to grammar and/or word order and add the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
To solve the above technical problem, an embodiment of the present invention further discloses a search device for similar judgment documents, including: an input module, configured to obtain a judgment document to be processed; a search module, configured to search a judgment document database according to the judgment document to be processed, the database including multiple first judgment documents; the above device for calculating judgment document similarity, configured to determine the similarity between the judgment document to be processed and each first judgment document; and an output module, configured to take as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold.
To solve the above technical problem, an embodiment of the present invention further discloses a search system for similar judgment documents, including the above search device and a judgment document database, the database storing multiple first judgment documents and the vector corresponding to each defendant in each first judgment document.
To solve the above technical problem, an embodiment of the present invention further discloses computer equipment, including a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above method for calculating judgment document similarity when executing the computer program.
To solve the above technical problem, an embodiment of the present invention further discloses computer equipment, including a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above search method for similar judgment documents when executing the computer program.
To solve the above technical problem, an embodiment of the present invention further discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for calculating judgment document similarity.
To solve the above technical problem, an embodiment of the present invention further discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above search method for similar judgment documents.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
The technical solution of the present invention obtains at least two judgment documents, extracts the judgment keywords of one or more defendants in each judgment document, and determines the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents. The solution is tailored to the characteristics of judgment documents: when determining the similarity between judgment documents, it relies on the similarity between the judgment keywords of the defendants in the different documents. This avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity calculation model. Judgment keywords may have little semantic connection with the defendant as words, yet they are of key significance to the judgment on the defendant; making them the core of the similarity calculation improves the accuracy of judgment document similarity calculation.
The technical solution of the present invention obtains a judgment document to be processed; searches a judgment document database according to it, the database including multiple first judgment documents; determines, using the above method for calculating judgment document similarity, the similarity between the judgment document to be processed and each first judgment document; and takes as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold. The solution is tailored to the characteristics of judgment documents: the similarity between the judgment document to be processed and each first judgment document is determined from the similarity between the judgment keywords of the defendants in the two documents. This avoids calculating similarity only at the word level with a preset dictionary or a pre-trained text similarity calculation model. Judgment keywords, which may have little semantic connection with the defendant as words yet are of key significance to the judgment on the defendant, become the core of the similarity calculation between the judgment document to be processed and each first judgment document, which improves the accuracy of the search for similar judgment documents.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flow chart of a method for calculating judgment document similarity according to an embodiment of the present invention;
Fig. 2 is a flow chart of another method for calculating judgment document similarity according to an embodiment of the present invention;
Fig. 3 is a flow chart of a search method for similar judgment documents according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for calculating judgment document similarity according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a search device for similar judgment documents according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of computer equipment according to an embodiment of the present invention.
Detailed description of the embodiments
As described in the background, in the prior art, whether text similarity is calculated from a dictionary and word frequencies or determined by a pre-trained similarity calculation model, for documents in different fields the calculation is carried out only at the word level, and its accuracy is poor.
To make the above objects, features and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for calculating judgment document similarity according to an embodiment of the present invention.
The method for calculating judgment document similarity shown in Fig. 1 may include the following steps:
Step S101: obtain at least two judgment documents.
Step S102: extract the judgment keywords of one or more defendants in each judgment document.
Step S103: determine the similarity between the corresponding judgment documents according to the similarity between the judgment keywords corresponding to defendants in different judgment documents.
In a specific implementation, there are at least two judgment documents. When there are two, the similarity between the two is calculated; when there are more than two, the similarities among them can likewise be calculated. A judgment document is a document written by a court according to its judgment, including but not limited to civil judgments, criminal judgments, administrative judgments and judgments on incidental civil actions. A judgment document generally passes judgment on one or more defendants. For example, in "Zeng Wentai and Li Yunming: second-instance criminal judgment for the crime of smuggling, trafficking, transporting and manufacturing drugs", defendant Zeng Wentai smuggled 50.35 g of heroin below standard purity, was a first offender, was found to be a principal offender, showed a good attitude in admitting guilt and performed major meritorious service; defendant Li Yunming sold 40.12 g of methamphetamine below standard purity, was a first offender, was found to be a principal offender, showed a good attitude in admitting guilt and performed meritorious service.
Specifically, the judgment keywords include one or any combination of the following: keywords of judgment facts, keywords of judgment reasons and keywords of judgment results. A keyword of judgment facts is, for example, "50.35 g of heroin below standard purity"; a keyword of judgment reasons is, for example, "the plaintiff's request that the house be vacated is supported by evidence"; keywords of judgment results are, for example, "first offender" and "principal offender". First, in step S101, at least two judgment documents are obtained. Then, in step S102, the judgment keywords of one or more defendants in each judgment document are extracted. Finally, in step S103, the similarity between the corresponding judgment documents is determined according to the similarity between the judgment keywords corresponding to defendants in different judgment documents.
The embodiment of the present invention is tailored to the characteristics of judgment documents. It extracts the judgment keywords of one or more defendants in each judgment document and, when determining the similarity between judgment documents, relies on the similarity between the judgment keywords of the defendants in the different documents. Judgment keywords may have little semantic connection with the defendant as words, yet they are of key significance to the judgment on the defendant; making them the core of the similarity calculation avoids calculating only at the word level with a preset dictionary or a pre-trained text similarity calculation model, and thus improves the accuracy of judgment document similarity calculation.
In a specific implementation, the following step may be included after step S101 and before step S102: splitting each judgment document into sentences to obtain multiple sentences. Specifically, each judgment document can be split according to the punctuation marks that indicate the end of a sentence, such as question marks, exclamation marks and full stops, with each sentence saved on its own line. More specifically, before sentence splitting, each judgment document can be converted into text format and the invalid content produced during conversion, such as pictures and garbled characters, filtered out; sentence splitting is then performed on the filtered document. Splitting judgment documents into sentences facilitates the subsequent steps.
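As a rough illustration of splitting on sentence-ending punctuation, the following Python sketch uses a regular expression. The function name `split_sentences` and the exact punctuation set are assumptions; the Western period is deliberately left out so that decimals such as "50.35" are not cut apart.

```python
import re

# Sentence-ending marks (an assumed set): Chinese full stop, exclamation mark
# and question mark, plus the Western "!" and "?".
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> list:
    """Split plain text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


sentences = split_sentences("第一句。第二句！第三句？")
```

Each returned item can then be written to its own line before word segmentation.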
In a specific implementation, step S102 may include the following steps: segmenting the sentences in each judgment document into words according to a word segmentation dictionary to obtain a word segmentation result; performing entity recognition on the word segmentation result to obtain the entity names in it, the entity names including defendants; performing entity relation extraction, according to the entity names, on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relationships between the entity names; performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant; and combining the entity relationships and feature values corresponding to the same defendant to obtain the judgment keywords of each defendant.
Specifically, entity names are person names, organization names, place names and all other entities identified by a name, including numbers, dates, currency amounts, addresses and so on. They can be obtained using entity recognition methods such as CRF (conditional random fields). Examples are "Zhang San" and "Li Si".
An entity relationship is a relationship between entity names, for example, person name + drug name + sell, or person name + date + birth. Entity relationships can be obtained using existing entity relation extraction methods. Examples are "Zhang San + drug name + sell" and "Li Si + date + birth".
A feature value represents a basis for the judgment on the defendant. It can be obtained by feature extraction methods such as trigger vocabulary extraction. Examples are "Zhang San + principal offender" and "Li Si + major meritorious service".
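The final combination step of S102 can be sketched in Python as a simple grouping by defendant. This is a minimal sketch under stated assumptions: entity relationships are represented as tuples whose first element is an entity name, feature values as (defendant, value) pairs, and `combine_keywords` is a hypothetical name. The patent does not prescribe these data structures.

```python
from collections import defaultdict


def combine_keywords(defendants, relations, features):
    """Group entity relationships and feature values by defendant.

    defendants: set of defendant names found by entity recognition.
    relations:  tuples such as ("Zhang San", "sell", "heroin"), subject first.
    features:   pairs such as ("Zhang San", "principal offender").
    """
    keywords = defaultdict(list)
    for relation in relations:
        subject = relation[0]
        if subject in defendants:
            keywords[subject].extend(relation[1:])
    for name, value in features:
        if name in defendants:
            keywords[name].append(value)
    return dict(keywords)


# Hypothetical inputs for illustration only.
keywords = combine_keywords(
    {"Zhang San"},
    [("Zhang San", "sell", "heroin"), ("Li Si", "birth", "1980")],
    [("Zhang San", "principal offender")],
)
```

Relationships whose subject is not a recognized defendant (here "Li Si") are simply ignored in this sketch.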
Specifically, the word segmentation dictionary can be trained as follows: extract new-word candidates from the judgment document; filter the new-word candidates according to grammar and/or word order, and add the candidates that conform to the grammar and/or word order to the word segmentation dictionary. Training the dictionary on new-word candidates from judgment documents makes it more complete, and filtering the candidates by grammar and/or word order makes it more accurate.
Specifically, the feature values of the defendant can be obtained as follows: establish a basic trigger vocabulary that includes one or more trigger words, the trigger words representing event information in the judgment document; obtain near-synonyms of at least one trigger word in the basic trigger vocabulary according to a synonym thesaurus; add the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary; and extract from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant. Because the extended trigger vocabulary contains the near-synonyms of the trigger words, the feature values extracted with it are more accurate.
In a specific implementation, the following example is given. It is intended only to aid understanding of the technical solution of the present invention and does not limit it. For example, a preprocessed judgment document includes the following content:
"Appellant (defendant at first instance) Ding Ya, former name Ding Xiya, male, born November 13, 1981, Han ethnicity, unemployed, registered residence in Dafang County, Guizhou Province, temporarily residing in Tong'an District, Xiamen, Fujian Province.
The court holds that appellant Ding Ya, knowing full well that they were drugs, still helped another person sell 61 grams of crystal methamphetamine; his conduct constitutes the crime of drug trafficking and should be punished according to law.
This case is a joint crime. Appellant Ding Ya played a secondary, auxiliary role in the crime and is an accessory, and should be given a lighter or mitigated punishment according to law.
After being brought to justice, Ding Ya truthfully confessed his crime and may be given a lighter punishment according to law.
Ding Ya is a recidivist and should be given a heavier punishment according to law."
In a specific implementation, one possible word segmentation result for the above judgment document is shown in Table 1; each cell in Table 1 represents one word.
Table 1: Example word segmentation result
In a specific implementation, one entity recognition result obtained by the entity recognition algorithm is shown in Table 2; each cell in Table 2 represents one entity name.
Table 2: Example entity recognition result
In a specific implementation, applying the entity relation extraction method to "Appellant (defendant at first instance) Ding Ya, former name Ding Xiya, male, born November 13, 1981, Han ethnicity, unemployed, registered residence in Dafang County, Guizhou Province, temporarily residing in Tong'an District, Xiamen, Fujian Province." yields the entity relationship "Ding Ya -- birth -- November 13, 1981" and/or the entity relationship "Ding Ya -- registered residence -- Dafang County, Guizhou Province". Applying it to "The court holds that appellant Ding Ya, knowing full well that they were drugs, still helped another person sell 61 grams of crystal methamphetamine; his conduct constitutes the crime of drug trafficking and should be punished according to law." yields the entity relationship "Ding Ya -- helped another person sell -- crystal methamphetamine -- 61 grams".
In a specific implementation, for example, the basic trigger vocabulary includes: principal offender, accessory, meritorious service. The extended trigger vocabulary obtained through the synonym thesaurus includes: principal offender, primary role, accessory, secondary role, auxiliary role, meritorious service, major meritorious service. Extracting from "This case is a joint crime. Appellant Ding Ya played a secondary, auxiliary role in the crime and is an accessory, and should be given a lighter or mitigated punishment according to law." with the extended trigger vocabulary yields the feature value "Ding Ya -- accessory". Extracting from "After being brought to justice, Ding Ya truthfully confessed his crime and may be given a lighter punishment according to law." yields the feature value "Ding Ya -- truthful confession". Extracting from "Ding Ya is a recidivist and should be given a heavier punishment according to law." yields the feature value "Ding Ya -- recidivist".
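The trigger vocabulary expansion and the feature extraction it drives can be sketched in Python as follows. The synonym table here is a hand-written stand-in for a real synonym thesaurus, sentences are assumed to be already segmented into token lists, and the names `expand_triggers` and `extract_features` are hypothetical.

```python
def expand_triggers(basic, synonyms):
    """Return the basic trigger words followed by their near-synonyms."""
    expanded = list(basic)
    for word in basic:
        for near in synonyms.get(word, []):
            if near not in expanded:
                expanded.append(near)
    return expanded


def extract_features(defendant, sentences, triggers):
    """Collect (defendant, trigger) pairs from sentences that mention the defendant.

    sentences: list of token lists produced by word segmentation.
    """
    trigger_set = set(triggers)
    features = []
    for tokens in sentences:
        if defendant not in tokens:
            continue
        for token in tokens:
            pair = (defendant, token)
            if token in trigger_set and pair not in features:
                features.append(pair)
    return features


# Hand-written stand-in for a synonym thesaurus (an assumption).
synonyms = {"accessory": ["secondary role", "auxiliary role"]}
basic = ["principal offender", "accessory", "meritorious service"]
expanded = expand_triggers(basic, synonyms)

sentences = [
    ["Ding Ya", "played", "secondary role", "is", "accessory"],
    ["the court", "holds", "accessory"],  # defendant not mentioned, so skipped
]
features = extract_features("Ding Ya", sentences, expanded)
```

A real system would likely normalize near-synonyms back to their basic trigger word; this sketch keeps the matched token as the feature value.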
In the embodiment of the present invention, step S102 extracts entity names from the word segmentation result of a judgment document, thereby obtaining the defendant information in the document. It then extracts the entity relations between entity names from the same sentence and/or adjacent sentences, and performs feature extraction on the word segmentation result for each defendant to obtain that defendant's feature values. The computation is simple and fast, which speeds up extraction of the defendant's judgment keywords.
In a specific implementation, step S103 may include the following step: calculate the similarity between the judgment keywords of each pair of defendants drawn from different judgment documents; the maximum of these pairwise similarities is taken as the similarity between the corresponding judgment documents.
Specifically, the similarity between the judgment keywords of a pair of defendants from different judgment documents may be calculated as follows: build a vector from the judgment keywords of each defendant, then calculate the similarity between the vectors of the two defendants. The similarity between vectors may be calculated, for example, from the Euclidean distance between them or by a cosine algorithm. By taking the maximum pairwise similarity between defendants' judgment keywords as the similarity between the corresponding judgment documents, the embodiment of the present invention improves the accuracy of the judgment document similarity calculation.
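The maximum-over-pairs rule can be sketched as below, using cosine similarity as one of the two measures the text mentions. The function names and the convention of returning 0.0 for zero vectors or empty documents are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(doc_a_vectors, doc_b_vectors):
    """Similarity of two documents: the maximum similarity over all cross-document defendant pairs."""
    return max(
        (cosine_similarity(a, b) for a in doc_a_vectors for b in doc_b_vectors),
        default=0.0,
    )


# Two defendants in document A, one defendant in document B.
doc_a = [[1, 0, 1], [0, 1, 0]]
doc_b = [[1, 0, 1]]
score = document_similarity(doc_a, doc_b)
```

Taking the maximum means that two documents count as similar as soon as one defendant in each has closely matching judgment keywords, regardless of the other defendants.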
In a specific implementation, taking the exemplary judgment document above, the feature values of defendant Ding Ya are as follows (they may be ordered by grammar, word order or the like): sells, crystal methamphetamine, 61 grams, accessory, recidivist, ... A judgment keyword vector is then built, for example using a bag-of-words model with a preset standard vector [sell, transport, crystal methamphetamine, cannabis, 1-10g, 1,1-50g, 50-100g, accessory, principal offender, recidivist, first offender, ...]. When the feature values are digitized, all features are given the same weight by default, i.e. every feature is equally important: a feature value that is present maps to 1 and one that is absent maps to 0. The vector for defendant Ding Ya is then (1 0 1 0 0 0 1 1 0 1 0 ...).
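The bag-of-words encoding in this example can be sketched as follows. The feature labels are English renderings chosen for illustration, the label of the middle weight range ("10-50g") is an assumption because the source text is unclear at that position, and mapping "61 grams" to the "50-100g" range is assumed to have happened before encoding.

```python
# Preset standard vector; labels are illustrative, and "10-50g" is an assumed label.
STANDARD_FEATURES = [
    "sell", "transport", "crystal methamphetamine", "cannabis",
    "1-10g", "10-50g", "50-100g",
    "accessory", "principal offender", "recidivist", "first offender",
]


def build_vector(present_features, standard=STANDARD_FEATURES):
    """Encode a defendant's feature values as 1 (present) or 0 (absent) per standard feature."""
    present = set(present_features)
    return [1 if feature in present else 0 for feature in standard]


# Feature values of defendant Ding Ya, with 61 grams already mapped to its weight range.
ding_ya = ["sell", "crystal methamphetamine", "50-100g", "accessory", "recidivist"]
vector = build_vector(ding_ya)
```

The resulting list matches the first eleven components of the vector given in the text.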
In a preferred embodiment, the method for calculating judgment document similarity is shown in Fig. 2, which is a flow chart of another method for calculating judgment document similarity according to an embodiment of the present invention.
The method for calculating judgment document similarity shown in Fig. 2 may include the following steps:
Step S201: Obtain at least two judgment documents.
Step S202: Perform sentence segmentation on each judgment document to obtain multiple sentences.
Step S203: Extract new-word candidates from the judgment documents.
Step S204: Filter the new-word candidates according to grammar and/or word order, and add the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
Step S205: Segment the sentences of each judgment document into words according to the word segmentation dictionary to obtain a word segmentation result.
Step S206: Perform entity recognition on the word segmentation result to obtain the entity names in it, the entity names including defendants.
Step S207: According to the entity names, perform entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relations between the entity names.
Step S208: Establish a basic trigger vocabulary comprising one or more trigger words, the trigger words representing event information in the judgment documents.
Step S209: Obtain near-synonyms of at least one trigger word in the basic trigger vocabulary from a synonym thesaurus.
Step S210: Add the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary.
Step S211: Extract from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant.
Step S212: Combine the entity relations and feature values of the same defendant to obtain the judgment keywords of each defendant.
Step S213: Build a vector from the judgment keywords of each defendant.
Step S214: Calculate the similarity between the vectors of each pair of defendants; the maximum similarity between the judgment keywords of the pairs of defendants is the similarity between the corresponding judgment documents.
In a specific implementation, in step S202, each judgment document is converted to text format, and invalid content produced during conversion, such as pictures and garbled characters, is filtered out. The filtered document is then split into lines at sentence-ending punctuation, such as question marks, exclamation marks and full stops, and saved.
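Splitting at sentence-ending punctuation can be sketched with a regular expression. This is an assumption-laden illustration: the function name is invented, the terminator set covers both Chinese and ASCII marks, and treating every "." as a sentence end would wrongly split decimals and abbreviations in real text.

```python
import re

# A run of non-terminator characters, optionally followed by one sentence-ending mark.
_SENTENCE = re.compile(r"[^。？！?!.]+[。？！?!.]?")


def split_sentences(text):
    """Split text into sentences, keeping each sentence's closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


text = "This case is a joint crime. Ding Ya is a recidivist! Should the sentence be reduced?"
sentences = split_sentences(text)
```

Each returned sentence can then be written on its own line, as the step describes.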
In another specific embodiment of the present invention, steps S203 and S204 may also be performed after step S201 and before step S202; that is, the word segmentation dictionary is built before the judgment documents are split into sentences, which reduces the workload of subsequent steps.
In a specific implementation, in step S206, an entity recognition algorithm may be used to perform entity recognition on the word segmentation result. Specifically, the entity recognition algorithm may be a conditional random field (CRF) algorithm or the like.
In a specific implementation, in step S207, an entity relation extraction algorithm may be used to perform entity relation extraction on the word segmentation result. Specifically, the entity relation extraction algorithm may also be a CRF algorithm or the like.
Those skilled in the art will understand that the entity recognition algorithm and the entity relation extraction algorithm may be any implementable algorithms; the embodiment of the present invention places no limitation on this.
In a specific implementation, in step S208, a trigger word may be, for example, "principal offender".
In another specific embodiment of the present invention, steps S208, S209 and S210 may also be performed after step S201 and before step S202; that is, the extended trigger vocabulary is built before the judgment documents are split into sentences, which reduces the workload of subsequent steps.
In a specific implementation, in step S214, the similarity between vectors may be calculated, for example, from the Euclidean distance between them or by a cosine algorithm.
Those skilled in the art will understand that any implementable algorithm may be used to calculate the similarity between vectors; the embodiment of the present invention places no limitation on this.
The embodiment of the present invention is tailored to the characteristics of judgment documents: it extracts the judgment keywords of one or more defendants in each judgment document and determines the similarity between judgment documents from the similarity between the judgment keywords of defendants in different documents. Judgment keywords may have little lexical relation to the defendant yet be decisive for the defendant at the level of the judgment. Making them the core of the similarity calculation avoids computing similarity purely at the word level from a preset dictionary or a pre-trained text similarity model, and thereby improves the accuracy of the judgment document similarity calculation.
Fig. 3 is a flow chart of a method for searching for similar judgment documents according to an embodiment of the present invention.
The method for searching for similar judgment documents shown in Fig. 3 may include the following steps:
Step S301: Obtain a pending judgment document.
Step S302: Search a judgment document database according to the pending judgment document, the judgment document database including multiple first judgment documents.
Step S303: Using the method for calculating judgment document similarity described in the above embodiments, determine the similarity between the pending judgment document and each first judgment document.
Step S304: Take as the search result either a preset number of first judgment documents with the highest similarity, or the first judgment documents whose similarity exceeds a preset threshold.
In a specific implementation, the vector of each defendant in every first judgment document may be stored in advance in the judgment document database of step S302, to reduce the workload of subsequent steps.
In a specific implementation, for the method for calculating judgment document similarity used in step S303, refer to the method shown in Fig. 1 or Fig. 2; details are not repeated here.
In a specific implementation, in step S304, after the similarity between the pending judgment document and each first judgment document is obtained, the documents may be sorted by similarity from high to low, and the top N judgment documents are output as the search result, where N is the preset number. Alternatively, the first judgment documents whose similarity exceeds the preset threshold may be output as the search result. The preset number of most similar first judgment documents to output, and the preset similarity threshold, can be configured and adjusted to suit the actual application environment; the embodiment of the present invention places no limitation on this.
After the vector of a defendant is obtained using the method for calculating judgment document similarity shown in Fig. 1, in practical applications each feature value in the vector may be weighted according to the purpose of the query. For example, if the drug crystal methamphetamine is important in the current query, the weight of the other feature values may be set to 1.0 and the weight of crystal methamphetamine to 2.0, giving the following defendant vector: Ding Ya (1.0*1 1.0*0 2.0*1 1.0*0 1.0*0 1.0*0 1.0*1 1.0*1 1.0*0 1.0*1 1.0*0 ...).
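Query-specific weighting can be sketched by multiplying each 0/1 component by a per-feature weight. The feature labels, the assumed "10-50g" range label, and the function name are illustrative assumptions; only the weights 1.0 and 2.0 come from the example above.

```python
# Illustrative standard vector; "10-50g" is an assumed label for the middle weight range.
STANDARD_FEATURES = [
    "sell", "transport", "crystal methamphetamine", "cannabis",
    "1-10g", "10-50g", "50-100g",
    "accessory", "principal offender", "recidivist", "first offender",
]


def weighted_vector(present_features, weights, default_weight=1.0):
    """Multiply each presence flag (1 or 0) by the weight assigned to that feature."""
    present = set(present_features)
    return [
        weights.get(feature, default_weight) * (1 if feature in present else 0)
        for feature in STANDARD_FEATURES
    ]


ding_ya = ["sell", "crystal methamphetamine", "50-100g", "accessory", "recidivist"]
# Emphasize crystal methamphetamine for this query; every other feature keeps weight 1.0.
vector = weighted_vector(ding_ya, {"crystal methamphetamine": 2.0})
```

With cosine similarity, the same weights would need to be applied to the vectors on both sides of the comparison for the emphasis to be meaningful.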
In a specific implementation, the query results may be sorted from high to low by the calculated text similarity, and the top N judgment documents output as the search result; N may be set to 5-20 according to typical browsing habits, or to another value. The retrieval results may also be filtered by a threshold; for example, the preset threshold may be set to 0.8-0.99. In general, if the similarity is greater than 0.8, the two judgment documents are considered highly similar and can be output as a query result.
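Both selection strategies can be sketched as below. The document identifiers and scores are made-up illustration data, and the function names are assumptions.

```python
def top_n(scores, n):
    """Return the n (document, similarity) pairs with the highest similarity, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def above_threshold(scores, threshold):
    """Return (document, similarity) pairs whose similarity exceeds the threshold, best first."""
    kept = [(doc, score) for doc, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical similarities between a pending document and three stored documents.
scores = {"doc_a": 0.91, "doc_b": 0.42, "doc_c": 0.85}
best_two = top_n(scores, 2)
similar = above_threshold(scores, 0.8)
```

A fixed N always returns something to browse, while a threshold returns nothing when no stored document is close enough; which behavior is preferable depends on the application.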
The technical solution of the present invention is tailored to the characteristics of judgment documents: it determines the similarity between the pending judgment document and each first judgment document from the similarity between the judgment keywords of the defendants in the two documents, and takes as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold. This avoids computing similarity purely at the word level from a preset dictionary or a pre-trained text similarity model. Instead, judgment keywords, which may have little lexical relation to the defendant yet are decisive for the defendant at the level of the judgment, form the core of the similarity calculation between the pending judgment document and each first judgment document, which improves the accuracy of the search for similar judgment documents.
Fig. 4 is a schematic structural diagram of a device for calculating judgment document similarity according to an embodiment of the present invention.
The device 40 for calculating judgment document similarity shown in Fig. 4 may include an acquisition module 401, a keyword extraction module 402, a keyword similarity calculation module 403 and a text similarity calculation module 404.
The acquisition module 401 is used to obtain at least two judgment documents.
The keyword extraction module 402 is used to extract the judgment keywords of one or more defendants in each judgment document.
The keyword similarity calculation module 403 is used to obtain the similarity between the judgment keywords of defendants in different judgment documents.
The text similarity calculation module 404 is used to determine the similarity between the corresponding judgment documents from the similarity between the judgment keywords of defendants in different judgment documents.
The embodiment of the present invention is tailored to the characteristics of judgment documents: it extracts the judgment keywords of one or more defendants in each judgment document and determines the similarity between judgment documents from the similarity between the judgment keywords of defendants in different documents. Judgment keywords may have little lexical relation to the defendant yet be decisive for the defendant at the level of the judgment. Making them the core of the similarity calculation avoids computing similarity purely at the word level from a preset dictionary or a pre-trained text similarity model, and thereby improves the accuracy of the judgment document similarity calculation.
In a specific implementation, the judgment keywords include one or more of the following: judgment fact keywords, judgment reason keywords and judgment result keywords.
In a specific implementation, the device 40 for calculating judgment document similarity may also include a sentence segmentation module (not shown), which is used to perform sentence segmentation on each judgment document to obtain multiple sentences. Specifically, the sentence segmentation module may convert each judgment document to text format and filter out invalid content produced during conversion, such as pictures and garbled characters. It then splits the filtered document into lines at sentence-ending punctuation, such as question marks, exclamation marks and full stops, and saves them. Splitting the judgment documents into sentences facilitates the operation of subsequent steps.
In a specific implementation, the keyword extraction module 402 may include a word segmentation unit (not shown), an entity recognition unit (not shown), an entity relation extraction unit (not shown), a feature extraction unit (not shown) and a combination unit (not shown). The word segmentation unit is used to segment the sentences of each judgment document into words according to the word segmentation dictionary to obtain a word segmentation result. The entity recognition unit is used to perform entity recognition on the word segmentation result to obtain the entity names in it, the entity names including defendants. The entity relation extraction unit is used to perform, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relations between the entity names. The feature extraction unit is used to perform feature extraction on the word segmentation result for each defendant to obtain the feature values of the defendant. The combination unit is used to combine the entity relations and feature values of the same defendant to obtain the judgment keywords of each defendant. Entity names are extracted from the word segmentation result of a judgment document, thereby obtaining the defendant information in the document; entity relations between entity names are extracted from the same sentence and/or adjacent sentences, and feature extraction on the word segmentation result for each defendant yields that defendant's feature values. The computation is simple and fast, which speeds up extraction of the defendant's judgment keywords.
Specifically, the feature extraction unit may include a basic trigger vocabulary establishing subunit (not shown), a near-synonym obtaining subunit (not shown), an extended trigger vocabulary establishing subunit (not shown) and a defendant feature value extraction subunit (not shown). The basic trigger vocabulary establishing subunit is used to establish a basic trigger vocabulary comprising one or more trigger words, the trigger words representing event information in the judgment documents. The near-synonym obtaining subunit is used to obtain near-synonyms of at least one trigger word in the basic trigger vocabulary from a synonym thesaurus. The extended trigger vocabulary establishing subunit is used to add the near-synonyms to the basic trigger vocabulary to obtain an extended trigger vocabulary. The defendant feature value extraction subunit is used to extract from the word segmentation result according to the extended trigger vocabulary to obtain the feature values of the defendant. Because the extended trigger vocabulary includes the near-synonyms of the trigger words, the defendant feature values extracted with it are highly accurate.
Specifically, the keyword similarity calculation module 403 may be used to calculate the similarity between the judgment keywords of each pair of defendants from different judgment documents; the text similarity calculation module 404 may be used to take the maximum of these pairwise similarities as the similarity between the corresponding judgment documents. Taking the maximum pairwise similarity between defendants' judgment keywords as the similarity between the corresponding judgment documents improves the accuracy of the judgment document similarity calculation.
In a specific implementation, the keyword similarity calculation module 403 may include a vector construction unit (not shown) and a vector similarity calculation unit (not shown). The vector construction unit is used to build a vector from the judgment keywords of each defendant; the vector similarity calculation unit is used to calculate the similarity between the vectors of each pair of defendants.
In a specific implementation, the device 40 for calculating judgment document similarity may also include a candidate word extraction module (not shown) and a filtering module (not shown). The candidate word extraction module is used to extract new-word candidates from the judgment documents; the filtering module is used to filter the new-word candidates according to grammar and/or word order and to add the candidates that conform to the grammar and/or word order to the word segmentation dictionary.
For specific implementations of this embodiment, refer to the method for calculating judgment document similarity shown in Fig. 1 or Fig. 2; details are not repeated here.
Fig. 5 is a schematic structural diagram of a device for searching for similar judgment documents according to an embodiment of the present invention.
The device 50 for searching for similar judgment documents shown in Fig. 5 may include an input module 501, a search module 502, a device 503 for calculating judgment document similarity, and an output module 504. The input module 501 is used to obtain a pending judgment document. The search module 502 is used to search a judgment document database according to the pending judgment document, the judgment document database including multiple first judgment documents. The device 503 for calculating judgment document similarity is used to determine the similarity between the pending judgment document and each first judgment document. The output module 504 is used to take as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold.
It should be noted that the preset number of most similar first judgment documents to output, and the preset similarity threshold, can be configured and adjusted to suit the actual application environment; the embodiment of the present invention places no limitation on this.
For specific implementations of the device 503 for calculating judgment document similarity, refer to the method for calculating judgment document similarity shown in Fig. 1 or Fig. 2; details are not repeated here.
For specific implementations of this embodiment, refer to the method for searching for similar judgment documents shown in Fig. 3; details are not repeated here.
The technical solution of the present invention obtains a pending judgment document; searches a judgment document database according to the pending judgment document, the database including multiple first judgment documents; determines the similarity between the pending judgment document and each first judgment document; and takes as the search result either a preset number of first judgment documents with the highest similarity or the first judgment documents whose similarity exceeds a preset threshold. The solution is tailored to the characteristics of judgment documents: the similarity between the pending judgment document and each first judgment document is determined from the similarity between the judgment keywords of the defendants in the two documents. This avoids computing similarity purely at the word level from a preset dictionary or a pre-trained text similarity model. Instead, judgment keywords, which may have little lexical relation to the defendant yet are decisive for the defendant at the level of the judgment, form the core of the similarity calculation, which improves the accuracy of the search for similar judgment documents.
The embodiment of the present invention also discloses a system for searching for similar judgment documents. The system may include the search device 50 shown in Fig. 5 and a judgment document database, which stores multiple first judgment documents and the vector of each defendant in every first judgment document. The search device 50 may be integrated inside the search system or coupled to it externally.
Fig. 6 is a schematic structural diagram of computer equipment provided by an embodiment of the present invention. Fig. 6 shows a block diagram of computer equipment 12 suitable for implementing embodiments of the present invention. The computer equipment 12 shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
The computer equipment 12 shown in Fig. 6 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processors 16).
Bus 18 represents one or more of several types of bus structure, including a memory bus, a processor bus, or a local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
Computer equipment 12 typically includes a variety of computer-readable storage media. These media may be any available media that can be accessed by computer equipment 12, including volatile and non-volatile media, and removable and non-removable media.
Memory 28 may include computer-readable storage media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer equipment 12 may further include other removable/non-removable, volatile/non-volatile computer-readable storage media. By way of example only, storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 6, commonly called a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g. a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g. a CD-ROM, DVD-ROM or other optical medium) may be provided. In these cases, each drive may be connected to bus 18 through one or more data media interfaces. Memory 28 may include at least one program product (i.e. a computer program) having a set of (e.g. at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
Computer equipment 12 may also communicate with one or more external devices 14 (e.g. a keyboard, a pointing device, display 24), with one or more devices that enable a user to interact with computer equipment 12, and/or with any device (e.g. a network card, a modem) that enables computer equipment 12 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interface 22. Computer equipment 12 may also communicate with one or more networks (e.g. a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through network adapter 20. As shown, network adapter 20 communicates with the other modules of computer equipment 12 through bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with computer equipment 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
The processor 16 runs programs stored in memory 28 to execute various functional applications and data processing, for example to implement the method for calculating judgment document similarity shown in Fig. 1 or Fig. 2, or the method for searching for similar judgment documents shown in Fig. 3.
The embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for calculating judgment document similarity shown in Fig. 1 or Fig. 2, or the method for searching for similar judgment documents shown in Fig. 3.
The computer-readable storage medium of the embodiment of the present invention may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus or device.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (24)

1. it is a kind of adjudicate document similarity computational methods, it is characterised in that including:
Obtain at least two judgement documents;
Extract the judgement keyword of one or more defendants in every judgement document;
According to it is different judgement documents in defendant it is corresponding judgement keyword between similarity determine it is corresponding judgement document between Similarity.
2. The calculation method according to claim 1, characterized in that, after the at least two judgement documents are obtained and before the judgement keywords of one or more defendants in each judgement document are extracted, the method further comprises:
Performing sentence segmentation on each judgement document to obtain a plurality of sentences.
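The sentence segmentation step of claim 2 can be sketched as follows. This is a minimal illustration, not the method of the application: the delimiter set and the function name `split_sentences` are assumptions, since the claim does not say how sentences are delimited.

```python
import re

# Assumed delimiter set: full-width Chinese and ASCII sentence-ending marks.
# The claim itself does not specify which punctuation ends a sentence.
_SENTENCE = re.compile(r"[^。！？；!?;]+[。！？；!?;]?")

def split_sentences(document: str) -> list[str]:
    """Split a judgement document into sentences, keeping the end mark."""
    return [s.strip() for s in _SENTENCE.findall(document) if s.strip()]

sentences = split_sentences("被告人张三盗窃手机一部。法院判处有期徒刑六个月！")
```

Each returned sentence can then be passed to word segmentation and entity recognition, as in claim 3.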
3. The calculation method according to claim 2, characterized in that extracting the judgement keywords of one or more defendants in each judgement document comprises:
Performing word segmentation on the sentences in each judgement document according to a word segmentation dictionary to obtain a word segmentation result;
Performing entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including defendants;
Performing, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relations between the entity names;
Performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant;
Combining the entity relations and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
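The final combining step of claim 3 can be illustrated with a short sketch. The data shapes here are assumptions: the claim does not define how entity relations or feature values are represented, so the tuples and the function name `combine_judgement_keywords` are hypothetical.

```python
from collections import defaultdict

def combine_judgement_keywords(relations, feature_values):
    """Group entity relations and feature values by defendant.

    relations: iterable of (defendant, relation, other_entity) tuples (assumed shape).
    feature_values: iterable of (defendant, value) pairs (assumed shape).
    """
    keywords = defaultdict(list)
    for defendant, relation, other in relations:
        keywords[defendant].append((relation, other))
    for defendant, value in feature_values:
        keywords[defendant].append(value)
    return dict(keywords)

keywords = combine_judgement_keywords(
    [("Zhang", "accomplice_of", "Li")],
    [("Zhang", "theft"), ("Li", "fraud")],
)
```

The result maps each defendant to one keyword list, which is the unit later compared across documents.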
4. The calculation method according to claim 3, characterized in that performing feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant comprises:
Establishing a basic trigger word list, the basic trigger word list including one or more trigger words, the trigger words being used to represent event information in the judgement document;
Obtaining near-synonyms of at least one trigger word in the basic trigger word list according to a synonym thesaurus;
Adding the near-synonyms to the basic trigger word list to obtain an extended trigger word list;
Extracting from the word segmentation result according to the extended trigger word list to obtain the feature values of the defendant.
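The trigger word expansion of claim 4 can be sketched as below. The thesaurus is modelled as a plain dictionary from a word to its near-synonyms, which is an assumption; the function names are likewise hypothetical, and the extraction step is reduced to simple membership matching.

```python
def expand_trigger_words(basic, thesaurus):
    """Return the extended trigger word list: basic words plus their near-synonyms."""
    expanded = set(basic)
    for word in basic:
        # thesaurus is an assumed mapping: trigger word -> iterable of near-synonyms
        expanded.update(thesaurus.get(word, ()))
    return expanded

def extract_feature_values(tokens, trigger_words):
    """Keep the segmented tokens that appear in the extended trigger word list."""
    return [t for t in tokens if t in trigger_words]

extended = expand_trigger_words({"steal"}, {"steal": ["theft", "pilfer"]})
features = extract_feature_values(["Zhang", "pilfer", "phone"], extended)
```

Expanding the list before matching lets a token such as "pilfer" be caught even though only "steal" was in the basic list.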
5. The calculation method according to any one of claims 1 to 4, characterized in that determining the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents comprises:
Calculating the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents, the maximum of these pairwise similarities being the similarity between the corresponding judgement documents.
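The maximum-over-pairs rule of claim 5 can be sketched as follows. The pairwise measure is passed in as a parameter; the Jaccard overlap used here is only a placeholder assumption, not the measure of the application (claim 6 builds vectors instead).

```python
from itertools import product

def jaccard(a: set, b: set) -> float:
    """Placeholder pairwise similarity (an assumption for illustration)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def document_similarity(defendants_a, defendants_b, sim=jaccard) -> float:
    """Similarity of two documents: the maximum similarity over all defendant pairs.

    Each argument holds one keyword collection per defendant in that document.
    Returns 0.0 when either document has no defendants.
    """
    return max((sim(a, b) for a, b in product(defendants_a, defendants_b)), default=0.0)

score = document_similarity(
    [{"theft", "phone"}, {"fraud"}],  # two defendants in document A
    [{"theft", "wallet"}],            # one defendant in document B
)
```

Taking the maximum means two documents count as similar as soon as any one defendant in each has closely matching keywords.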
6. The calculation method according to claim 5, characterized in that calculating the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents comprises:
Building a vector according to the judgement keywords corresponding to each defendant;
Calculating the similarity between the vectors corresponding to each pair of defendants.
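Claim 6 does not say how the vector is built or which similarity is computed on it. The sketch below assumes a bag-of-words count vector over a shared vocabulary and cosine similarity; both choices, and the function names, are illustrative assumptions.

```python
import math
from collections import Counter

def build_vector(keywords, vocabulary):
    """Count vector of a defendant's keywords over a fixed, shared vocabulary (assumed scheme)."""
    counts = Counter(keywords)
    return [float(counts[term]) for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

vocabulary = ["theft", "phone", "fraud"]
u = build_vector(["theft", "phone"], vocabulary)
v = build_vector(["theft"], vocabulary)
similarity = cosine_similarity(u, v)
```

Because the vectors depend only on a single document, they can be computed once and stored, which is what claims 10 and 20 rely on.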
7. The calculation method according to claim 1, characterized in that the judgement keywords include one or more of the following: keywords of the facts of the judgement, keywords of the grounds of the judgement, and keywords of the judgement result.
8. The calculation method according to claim 3, characterized in that, before word segmentation is performed on the sentences in the judgement document according to the word segmentation dictionary, the method further comprises:
Extracting new-word candidates from the judgement document;
Filtering the new-word candidates according to grammar and/or word order, and adding the filtered results that conform to the grammar and/or word order to the word segmentation dictionary.
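A rough sketch of the new-word discovery step of claim 8 follows. The claim does not define how candidates are extracted or what the grammar and word-order rules are, so everything concrete here is an assumption: candidates are repeated character bigrams, and the filter is a stand-in rule that rejects candidates starting or ending with a function character.

```python
from collections import Counter

# Hypothetical function characters used by the stand-in filter rule.
STOP_CHARS = set("的了在是和")

def extract_candidates(text, dictionary, n=2, min_count=2):
    """Character n-grams that repeat in the text and are not yet in the dictionary."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, c in grams.items() if c >= min_count and g not in dictionary}

def passes_filter(candidate):
    """Stand-in for the grammar/word-order check: no function character at either edge."""
    return candidate[0] not in STOP_CHARS and candidate[-1] not in STOP_CHARS

def update_dictionary(text, dictionary):
    """Return the dictionary extended with the candidates that pass the filter."""
    accepted = {c for c in extract_candidates(text, dictionary) if passes_filter(c)}
    return set(dictionary) | accepted

new_dictionary = update_dictionary("张三的车张三的车", {"车"})
```

Here the repeated bigrams are "张三", "三的" and "的车"; only "张三" survives the filter and is added to the dictionary.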
9. A method for searching for similar judgement documents, characterized by comprising:
Obtaining a judgement document to be processed;
Searching a judgement document database according to the judgement document to be processed, the judgement document database including a plurality of first judgement documents;
Determining the similarity between the judgement document to be processed and each first judgement document using the method for calculating the similarity of judgement documents according to any one of claims 1 to 8;
Taking as the search result either a preset number of first judgement documents with the highest similarity, or the first judgement documents whose similarity is greater than a preset threshold.
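The two result-selection modes of claim 9 (a preset number of best matches, or all matches above a preset threshold) can be sketched as below. The database is modelled as an in-memory dictionary and the similarity function is supplied by the caller; both are assumptions for illustration.

```python
import heapq

def find_similar(query, database, similarity, top_k=None, threshold=None):
    """Return the ids of first judgement documents selected by top_k or by threshold.

    database: assumed mapping of document id -> document representation.
    similarity: caller-supplied function (query, document) -> float.
    """
    scored = [(similarity(query, doc), doc_id) for doc_id, doc in database.items()]
    if top_k is not None:
        best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
        return [doc_id for _, doc_id in best]
    if threshold is not None:
        return [doc_id for score, doc_id in scored if score > threshold]
    raise ValueError("either top_k or threshold must be given")

overlap = lambda a, b: len(a & b) / len(a | b)
database = {"A": {"theft", "phone"}, "B": {"fraud"}, "C": {"theft"}}
hits = find_similar({"theft", "phone"}, database, overlap, top_k=2)
```

In practice the per-document similarity would be the maximum over defendant pairs described in claim 5 rather than the simple set overlap used in this usage example.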
10. The search method according to claim 9, characterized in that the judgement document database stores the vector corresponding to each defendant in each first judgement document.
11. A device for calculating the similarity of judgement documents, characterized by comprising:
An acquisition module, configured to obtain at least two judgement documents;
A keyword extraction module, configured to extract the judgement keywords of one or more defendants in each judgement document;
A keyword similarity calculation module, configured to obtain the similarity between the judgement keywords corresponding to defendants in different judgement documents;
A text similarity calculation module, configured to determine the similarity between the corresponding judgement documents according to the similarity between the judgement keywords corresponding to defendants in different judgement documents.
12. The calculation device according to claim 11, characterized by further comprising:
A sentence segmentation module, configured to perform sentence segmentation on each judgement document to obtain a plurality of sentences.
13. The calculation device according to claim 12, characterized in that the keyword extraction module comprises:
A word segmentation unit, configured to perform word segmentation on the sentences in each judgement document according to a word segmentation dictionary to obtain a word segmentation result;
An entity recognition unit, configured to perform entity recognition on the word segmentation result to obtain the entity names in the word segmentation result, the entity names including defendants;
An entity relation extraction unit, configured to perform, according to the entity names, entity relation extraction on the word segmentation result within the same sentence and/or adjacent sentences to obtain the entity relations between the entity names;
A feature extraction unit, configured to perform feature extraction on the word segmentation result according to the defendant to obtain the feature values of the defendant;
A combination unit, configured to combine the entity relations and feature values corresponding to the same defendant to obtain the judgement keywords of each defendant.
14. The calculation device according to claim 13, characterized in that the feature extraction unit comprises:
A basic trigger word list establishing subunit, configured to establish a basic trigger word list, the basic trigger word list including one or more trigger words, the trigger words being used to represent event information in the judgement document;
A near-synonym obtaining subunit, configured to obtain near-synonyms of at least one trigger word in the basic trigger word list according to a synonym thesaurus;
An extended trigger word list establishing subunit, configured to add the near-synonyms to the basic trigger word list to obtain an extended trigger word list;
A defendant feature value extraction subunit, configured to extract from the word segmentation result according to the extended trigger word list to obtain the feature values of the defendant.
15. The calculation device according to any one of claims 11 to 14, characterized in that the keyword similarity calculation module is specifically configured to calculate the similarity between the judgement keywords corresponding to each pair of defendants in different judgement documents;
The text similarity calculation module is specifically configured to take the maximum of the pairwise similarities between the judgement keywords of the defendants as the similarity between the corresponding judgement documents.
16. The calculation device according to claim 15, characterized in that the keyword similarity calculation module comprises:
A vector construction unit, configured to build a vector according to the judgement keywords corresponding to each defendant;
A vector similarity calculation unit, configured to calculate the similarity between the vectors corresponding to each pair of defendants.
17. The calculation device according to claim 11, characterized in that the judgement keywords include one or more of the following: keywords of the facts of the judgement, keywords of the grounds of the judgement, and keywords of the judgement result.
18. The calculation device according to claim 13, characterized by further comprising:
A candidate word extraction module, configured to extract new-word candidates from the judgement document;
A filtering module, configured to filter the new-word candidates according to grammar and/or word order, and to add the filtered results that conform to the grammar and/or word order to the word segmentation dictionary.
19. A device for searching for similar judgement documents, characterized by comprising:
An input module, configured to obtain a judgement document to be processed;
A search module, configured to search a judgement document database according to the judgement document to be processed, the judgement document database including a plurality of first judgement documents;
The device for calculating the similarity of judgement documents according to any one of claims 11 to 18, configured to determine the similarity between the judgement document to be processed and each first judgement document;
An output module, configured to take as the search result either a preset number of first judgement documents with the highest similarity, or the first judgement documents whose similarity is greater than a preset threshold.
20. A system for searching for similar judgement documents, characterized by comprising the search device according to claim 19 and a judgement document database, the judgement document database storing a plurality of first judgement documents and the vector corresponding to each defendant in each first judgement document.
21. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for calculating the similarity of judgement documents according to any one of claims 1 to 8.
22. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for searching for similar judgement documents according to claim 9 or 10.
23. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method for calculating the similarity of judgement documents according to any one of claims 1 to 8.
24. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method for searching for similar judgement documents according to claim 9 or 10.
CN201710165953.7A 2017-03-20 2017-03-20 Adjudicate the computational methods of document similarity, search device and computer equipment Pending CN106933787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710165953.7A CN106933787A (en) 2017-03-20 2017-03-20 Adjudicate the computational methods of document similarity, search device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710165953.7A CN106933787A (en) 2017-03-20 2017-03-20 Adjudicate the computational methods of document similarity, search device and computer equipment

Publications (1)

Publication Number Publication Date
CN106933787A true CN106933787A (en) 2017-07-07

Family

ID=59432440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710165953.7A Pending CN106933787A (en) 2017-03-20 2017-03-20 Adjudicate the computational methods of document similarity, search device and computer equipment

Country Status (1)

Country Link
CN (1) CN106933787A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295307A (en) * 2007-04-27 2008-10-29 株式会社日立制作所 Document retrieval system and document retrieval method
CN103218432A (en) * 2013-04-15 2013-07-24 北京邮电大学 Named entity recognition-based news search result similarity calculation method
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019655A (en) * 2017-07-21 2019-07-16 北京国双科技有限公司 Precedent case acquisition methods and device
CN107807962B (en) * 2017-10-11 2018-11-30 中国软件与技术服务股份有限公司 A method of similarity mode being carried out to legal decision document using LDA topic model
CN107807962A (en) * 2017-10-11 2018-03-16 中国软件与技术服务股份有限公司 A kind of method for carrying out similarity mode to legal decision document using LDA topic models
CN107818175A (en) * 2017-11-17 2018-03-20 厦门能见易判信息科技有限公司 A kind of law class case problem intelligently prejudges system and method
WO2019170015A1 (en) * 2018-03-09 2019-09-12 北京国双科技有限公司 Judicial document searching method and device
CN110246064A (en) * 2018-03-09 2019-09-17 北京国双科技有限公司 A kind of relations of fact determines method and device
CN110246064B (en) * 2018-03-09 2021-11-23 北京国双科技有限公司 Method and device for determining fact relationship
CN109657227A (en) * 2018-10-08 2019-04-19 平安科技(深圳)有限公司 Contract feasibility determination method, equipment, storage medium and device
CN111104790A (en) * 2018-10-10 2020-05-05 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting key relation and computer readable medium
CN111104790B (en) * 2018-10-10 2024-03-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable medium for extracting key relation
CN111382769A (en) * 2018-12-29 2020-07-07 阿里巴巴集团控股有限公司 Information processing method, device and system
CN111382769B (en) * 2018-12-29 2023-09-22 阿里巴巴集团控股有限公司 Information processing method, device and system
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN110674308A (en) * 2019-08-23 2020-01-10 上海科技发展有限公司 Scientific and technological word list expansion method, device, terminal and medium based on grammar mode
CN110866095A (en) * 2019-10-10 2020-03-06 重庆金融资产交易所有限责任公司 Text similarity determination method and related equipment
CN111177332A (en) * 2019-11-27 2020-05-19 中证信用增进股份有限公司 Method and device for automatically extracting referee document case-related mark and referee result
CN111177332B (en) * 2019-11-27 2023-11-24 中证信用增进股份有限公司 Method and device for automatically extracting judge document case-related label and judge result

Similar Documents

Publication Publication Date Title
CN106933787A (en) Adjudicate the computational methods of document similarity, search device and computer equipment
CN107102989B (en) Entity disambiguation method based on word vector and convolutional neural network
CN107193803B (en) Semantic-based specific task text keyword extraction method
US9613024B1 (en) System and methods for creating datasets representing words and objects
US8892420B2 (en) Text segmentation with multiple granularity levels
CN110188168A (en) Semantic relation recognition methods and device
US20170169008A1 (en) Method and electronic device for sentiment classification
Guo et al. A graph-based method for entity linking
US8577882B2 (en) Method and system for searching multilingual documents
Shouzhong et al. Mining microblog user interests based on TextRank with TF-IDF factor
CN106940726B (en) Creative automatic generation method and terminal based on knowledge network
WO2023071118A1 (en) Method and system for calculating text similarity, device, and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
Wick et al. A unified approach for schema matching, coreference and canonicalization
CN110516036A (en) Legal documents information extracting method, device, computer equipment and storage medium
CN105868366A (en) Concept space navigation method based on concept association
CN106933824A (en) The method and apparatus that the collection of document similar to destination document is determined in multiple documents
CN109614478A (en) Construction method, key word matching method and the device of term vector model
CN107315735B (en) Method and equipment for note arrangement
Gupta et al. Text analysis and information retrieval of text data
JP2021501387A (en) Methods, computer programs and computer systems for extracting expressions for natural language processing
CN109918661B (en) Synonym acquisition method and device
Yuan et al. Task-specific word identification from short texts using a convolutional neural network
WO2016000511A1 (en) Method and apparatus for mining rare resource of internet
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170707