CN109543178A - Method and system for constructing a label system for judicial texts - Google Patents

Publication number
CN109543178A
Authority
CN
China
Prior art keywords
label
vocabulary
text
judicial
accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811294777.8A
Other languages
Chinese (zh)
Other versions
CN109543178B (en
Inventor
丁锴
李建元
陈涛
王开红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Enjoyor Co Ltd
Original Assignee
Enjoyor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Probability & Statistics with Applications (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This application provides a method and system for constructing a label system for judicial texts. Judicial vocabulary text is obtained with a word segmentation tool, and a primary label system is constructed from word frequency statistics. Semantically similar labels in the primary label system are merged and obscure labels are extended, yielding an extended label system. Using a text test set, the accuracy with which the extended label system searches text is measured to verify whether construction of the current extended label system is complete; if not, the label system is optimized further. The method builds a targeted label system for each law and greatly improves the search precision for judicial texts.

Description

Method and system for constructing a label system for judicial texts
Technical field
This application relates to the field of natural language processing, and in particular to a method and system for constructing a label system for judicial texts.
Background
As the legal field becomes more open and transparent, more and more judgment documents are placed under public supervision. According to statistics from China Judgements Online, more than 50 million documents are currently available online, and the number grows by about 30,000 per day. However, the growth of legal text resources also brings a series of problems: storage requirements keep increasing, searches become slower, and search results often do not contain the desired information. These problems reduce the efficiency with which legal text resources can be used, so legal texts need to be processed. A common way of processing massive Internet data is to label the data, that is, to apply the vector space model (Vector Space Model): data are processed into a series of keywords (terms) or labels, and these keywords are then used to generate index codes. Legal text processing uses the same model; the difference lies in how the labels are defined.
A large amount of work already exists on text label extraction. Patent CN201510697001 proposes mining notification-type short messages from existing short message texts by writing regular expressions; the mined XX is used as the identity label information of the short message text; among the mined notification-type short message identities, the identity label with the highest frequency is selected, by applying a threshold, as the final identity label information of the service number. This identity label can be updated in real time when new messages arrive. Patent CN201710541481 proposes a text label generation method: for a target text, keywords are extracted using the strategy corresponding to each label type to obtain candidate labels of each label type; the candidate labels are then cross-validated between different label types, and the target labels of the target text are determined from the candidate labels that pass validation. Because labels are extracted separately for different label types, including entity words, text segments and/or topics, and then cross-validated, the accuracy of label extraction is improved, which addresses the low accuracy of label extraction in the prior art. Patent CN201711213971 proposes a method for generating text label words. First, label words are extracted from the text, and mutually associated grouped label words are generated from the extracted label words and preset label-word relationships; next, the grouped label words are aggregated according to the associations between them, and a preset label-word dictionary is searched for aggregated grouped label words that completely cover the text, yielding combined label words; finally, mapped label words are generated for the text from the combined label words and the preset label-word relationships. Label words can thus be generated for a text quickly and autonomously according to actual needs, without intervention by professionals. CN201510197328 proposes a text label extraction method: first the text category is predicted, then the topic is predicted by a topic clustering model, then text keywords are extracted, and finally the target category, target topic, and target keywords are used as the labels of the text. The labels have different levels, which satisfies search needs of different granularity and also allows articles to be recommended at different granularity according to different labels.
Because legal texts contain many specialized terms and the points of dispute in cases overlap heavily, the text label extraction methods above cannot meet the accuracy requirements. We therefore propose a new label system: a label dictionary is constructed through a series of rules, and the label dictionary is verified and optimized using the correspondence between case facts and legal provisions, which improves the search precision for legal texts.
Summary of the invention
To address the large amount of specialized vocabulary in legal texts and the heavy overlap of points of dispute between cases, the present invention proposes a method and system for constructing a label system for judicial texts. By combining the advantages of machine learning and manual rechecking, it can significantly improve the precision of legal text retrieval while reducing manual intervention.
A method for constructing a label system for judicial texts, characterized by comprising:
Obtaining vocabulary text, where vocabulary text is text represented in the form of words;
Selecting candidate labels according to the word frequency and/or combination word frequency of the vocabulary text to obtain a primary label system;
Merging and/or extending labels according to the similarity of labels in the primary label system to obtain an extended label system;
Determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete.
Further, obtaining vocabulary text comprises: constructing a judicial vocabulary table, adding the judicial vocabulary table to the custom dictionary of a word segmentation tool, and segmenting the judicial text to obtain vocabulary text;
Wherein constructing the judicial vocabulary table comprises:
Adding entries from legal dictionaries, legal professional dictionaries, and the like to a preliminary vocabulary table;
Counting the combination word frequency of common words, and adding common-word combinations whose combination word frequency meets set threshold I to the preliminary vocabulary table as new terms;
Adding specialized terms that manual rechecking finds were not segmented correctly to the preliminary vocabulary table;
Obtaining the judicial vocabulary table.
Further, selecting candidate labels according to the word frequency and combination word frequency of the vocabulary text to obtain a primary label system comprises:
Defining a window length K; counting, by traversing the text with the window, the number of occurrences of any M-word combination; taking the words in the N most frequent combinations as keywords; counting the word frequency of each single word among the keywords; and adding the words whose word frequency meets set threshold II to the primary label system as candidate labels.
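The window co-occurrence step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `candidate_labels`, its parameter names, and the choice to count each unordered combination once per window position are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_labels(tokens, k=4, m=2, n=3, threshold_ii=2):
    """Select candidate labels by window co-occurrence.

    k: window length K, m: combination size M, n: number of top
    combinations N, threshold_ii: minimum single-word frequency.
    All names and defaults are illustrative assumptions.
    """
    combo_counts = Counter()
    # Slide a window of length k over the token sequence.
    for start in range(max(1, len(tokens) - k + 1)):
        window = sorted(set(tokens[start:start + k]))
        # Count every unordered m-word combination seen in this window.
        combo_counts.update(combinations(window, m))

    # Words in the n most frequent combinations become keywords.
    keywords = {w for combo, _ in combo_counts.most_common(n) for w in combo}

    # Keep the keywords whose own frequency meets threshold II.
    word_freq = Counter(tokens)
    return {w for w in keywords if word_freq[w] >= threshold_ii}
```

Counting combinations per window position means that words that stay close together across the text accumulate higher counts than words that co-occur only once.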
Further, the method for calculating the similarity of labels comprises:
Setting a character-based label similarity weight p and a semantic-based label similarity weight q;
Obtaining the character-based label similarity sim(W1, W2) of labels W1 and W2, where sim(W1, W2) = (number of identical characters in label W1 and label W2) / (the larger of the character lengths of label W1 and label W2);
Obtaining the semantic-based label similarity score(W1, W2) of labels W1 and W2, where score(W1, W2) is the relevance value of label W1 and label W2, obtained from a semantic model trained on a corpus of judicial texts;
Calculating the label similarity = p*sim(W1, W2) + q*score(W1, W2).
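A minimal Python sketch of the weighted similarity follows. It assumes that "identical characters" means the multiset intersection of the two labels' characters, which the source does not specify; `semantic_score` is a hypothetical callable standing in for the trained semantic model, and the two-character labels 夫妇 and 夫妻 are purely illustrative inputs.

```python
from collections import Counter


def char_similarity(w1, w2):
    """sim(W1, W2): shared characters divided by the longer label length."""
    if not w1 or not w2:
        return 0.0
    # Assumption: shared characters are counted as a multiset intersection.
    shared = sum((Counter(w1) & Counter(w2)).values())
    return shared / max(len(w1), len(w2))


def label_similarity(w1, w2, semantic_score, p=0.5, q=0.5):
    """Weighted similarity p*sim(W1, W2) + q*score(W1, W2)."""
    return p * char_similarity(w1, w2) + q * semantic_score(w1, w2)


# Two labels of length 2 that share one character give sim = 1/2.
print(char_similarity("夫妇", "夫妻"))
# The lambda is a placeholder for a real semantic model lookup.
print(label_similarity("夫妇", "夫妻", lambda a, b: 0.8))
```

The weights p and q are left as parameters because the source treats them as configurable; the defaults of 0.5 are placeholders.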
Further,
Merging labels specifically means: when the similarity of two labels meets set threshold III, or the similarity of the two labels ranks in the top R of the label similarity values of the primary label system, the two labels are merged, one of them is retained, and the other is removed from the primary label system;
Extending labels specifically means: when the similarity between several words in the semantic model or thesaurus and a label word meets set threshold IV, these words are taken as expansion words of that label word, and the expansion words are added to the primary label system.
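The threshold-based merge can be sketched as a greedy pass over the labels. This is only an illustration: the source does not say which of two similar labels is retained, so the sketch assumes the earlier one in the list is kept, and `similarity` is a hypothetical callable returning the label similarity.

```python
def merge_labels(labels, similarity, threshold_iii):
    """Keep a label only if it is not too similar to one already kept.

    Assumption: of two similar labels, the one listed first is retained.
    """
    kept = []
    for label in labels:
        # A label is merged away when its similarity to a kept label
        # meets threshold III.
        if all(similarity(label, other) < threshold_iii for other in kept):
            kept.append(label)
    return kept
```

Sorting the input by descending word frequency before calling the function would make the more common label the one that survives a merge.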
Further, the method for calculating the accuracy of searching text comprises:
Establishing a test set comprising a sample set and a search target set. Each sample in the sample set comprises a question together with the n case facts and m legal provisions most relevant to the question. The search target set comprises the set of all case facts and legal provisions;
Extracting the text labels of the questions, case facts, and legal provisions in the sample set to form label vectors;
Using vector matching to recommend, from the search target set, the case facts similar to the question and the applicable legal provisions, where vector similarity is calculated using Euclidean distance;
Comparing the recommended case facts and legal provisions with the corresponding case facts and legal provisions in the sample set and calculating the accuracy, where accuracy is expressed as the average of recall and precision. Recall = number of correct samples found / number of all correct samples in the data set; precision = number of correct samples found / number of all samples found.
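The matching and scoring steps above can be sketched as follows. The function names, the dictionary of label vectors, and the `top_k` cut-off are assumptions made for illustration; only the Euclidean distance and the recall and precision formulas come from the text.

```python
import math


def recommend(query_vec, target_vecs, top_k):
    """Rank search targets by Euclidean distance to the query label vector."""
    ranked = sorted(target_vecs,
                    key=lambda name: math.dist(query_vec, target_vecs[name]))
    return ranked[:top_k]


def search_accuracy(recommended, relevant):
    """Return (recall, precision, accuracy), accuracy being their average."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision, (recall + precision) / 2
```

`math.dist` requires Python 3.8 or later; a smaller distance is treated as a closer match.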
Further, the method for calculating the accuracy of searching text comprises:
Presetting a sample set and a search target set, where the sample set SS comprises NC samples; a sample $S_i$ comprises a search question $Q_i$ and a set $X_i$ of vocabulary texts relevant to the search question; the set $X_i$ comprises $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \dots, x_{iH_i}\}$; the search target set Y comprises NS vocabulary texts, $Y = \{y_1, y_2, \dots, y_{NS}\}$;
Using the extended label system to obtain the extended labels Z of the search target set Y, $Z = \{z_1, z_2, \dots, z_{NS}\}$;
Extracting samples $S_i$ from the sample set in order and obtaining the label vector $T_i$ of the search question $Q_i$;
Calculating the similarity between the label vector $T_i$ and each extended label $z_j$, and taking the vocabulary texts corresponding to the $H_i$ extended labels with the highest similarity as the control group T;
Calculating the single-search accuracy = (number of vocabulary texts in the control group T that are identical to vocabulary texts in the set $X_i$) / $H_i$;
Traversing the entire sample set and calculating the average accuracy as the accuracy of searching text.
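The per-sample accuracy and its average over the sample set can be sketched as below. The `search` callable is hypothetical: it is assumed to return text identifiers already ranked by label similarity, which stands in for the label-vector comparison described above.

```python
def single_search_accuracy(retrieved, relevant):
    """Share of the top-H_i retrieved texts (control group T) found in X_i."""
    h = len(relevant)
    return len(set(retrieved[:h]) & set(relevant)) / h


def mean_search_accuracy(samples, search):
    """Average single-search accuracy over the whole sample set.

    samples: list of (question, relevant_texts) pairs
    search:  hypothetical callable returning text ids ranked by similarity
    """
    scores = [single_search_accuracy(search(q), x) for q, x in samples]
    return sum(scores) / len(scores)
```

Because exactly as many texts are retrieved as are relevant, recall and precision coincide in this variant, so a single ratio is enough.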
Further, determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete comprises:
When the accuracy of searching text meets set threshold V, the current extended label system is the final label system; otherwise, the values of thresholds I, II, III, and IV are adjusted and the current extended label system is updated until the search accuracy of the updated extended label system meets set threshold V, yielding the final label system.
Further, determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete comprises: when the accuracy of searching text meets set threshold V, the current extended label system is the final label system; otherwise, the search accuracy after removing a given label is calculated, and if that accuracy is unchanged or higher than the accuracy obtained before removing the label, the label is removed from the extended label system; all labels are traversed in this way to obtain the final label system.
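The label-removal variant can be sketched as a single pass over the labels. This is an illustrative reading of the text, not a definitive implementation: `evaluate` is a hypothetical callable that returns the search accuracy obtained with a given list of labels.

```python
def prune_labels(labels, evaluate):
    """Drop every label whose removal does not lower search accuracy."""
    current = list(labels)
    best = evaluate(current)
    for label in list(labels):
        trial = [item for item in current if item != label]
        score = evaluate(trial)
        if score >= best:
            # Accuracy is unchanged or improved, so the label is removed.
            current, best = trial, score
    return current
```

Each trial is compared with the accuracy obtained just before the removal, so the baseline rises as uninformative labels are dropped.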
A system for constructing a label system for judicial texts, comprising a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, a label extension module, a label verification module, and a label optimization module, wherein:
The legal vocabulary module stores a legal vocabulary table containing specialized judicial terms;
The data acquisition module collects judicial texts and preprocesses them;
The word segmentation module adds the legal vocabulary table to a general-purpose word segmentation tool and segments the judicial texts provided by the data acquisition module to obtain judicial vocabulary text;
The primary label construction module obtains the judicial vocabulary text provided by the word segmentation module, counts word frequency and combination word frequency, and extracts the words and word combinations whose word frequency and combination word frequency meet set threshold II as the primary label system;
The label extension module stores an extension label dictionary, computes the similarity of labels in the primary label system, merges the labels that meet set threshold III, extracts the corresponding expansion words from the extension label dictionary, and adds them to the primary label system to obtain the extended label system;
The label verification module stores a sample set and a search target set; the sample set comprises several question labels and the set X of judicial vocabulary texts relevant to the questions, and the search target set comprises a set Y of several judicial vocabulary texts; the module uses the extended label system to obtain the labels of set Y, extracts the question labels from the sample set, and measures how accurately the vocabulary texts found in set Y by searching with the question labels match the vocabulary texts in set X;
The label optimization module judges whether the accuracy provided by the label verification module meets set threshold V; if so, the current label system is the final label system; if not, it adjusts set threshold II in the primary label construction module and set thresholds III and IV in the label extension module.
At least one of the technical solutions above can achieve the following beneficial effects:
A judicial vocabulary table is constructed by combining legal vocabulary from multiple sources, which improves the word segmentation precision for legal texts; high-precision segmentation results are the basis of subsequent text processing.
A primary label system is established using automatic keyword extraction and part-of-speech tagging.
Following a layered approach, different label dictionaries are established for different laws to build the label system, which effectively eliminates cross-interference between laws.
The label dictionary is extended using multiple semantic relatedness methods to fill out the label system, which effectively eliminates the semantic ambiguity introduced by nonstandard terms such as colloquial language.
Using a large number of case facts as the test set, the label system is optimized by a subtraction-based verification method, which also verifies the effectiveness of the label system.
Brief description of the drawings
Fig. 1 is a flow chart of the embodiment of this specification.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the application clearer, the technical solutions of the application are described clearly and completely below with reference to specific embodiments and the corresponding drawing. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification, without creative effort, shall fall within the protection scope of the application.
Embodiment one provides a method for constructing a label system for judicial texts, which specifically comprises:
One, collecting and preprocessing judicial text data.
Collect judicial text data, for example judicial documents, including fields such as case title, plaintiff and defendant information, cause of action, case details, applicable law, and specific legal provisions; also collect laws, legal provisions, and their interpretations, corresponding to the applicable laws and specific legal provisions in the judgment documents.
Preprocess the judicial text data: remove records whose case details or applicable law field is empty, remove records whose case details are shorter than a set case-details length threshold, and remove duplicate records. For each common major category of law, such as marriage and family or traffic safety, enough cases must be collected to guarantee the diversity and comprehensiveness of the data.
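The three preprocessing filters can be sketched as follows. The record layout (dictionaries with `case_details` and `applicable_law` keys), the default length threshold, and the definition of a duplicate as an identical pair of those two fields are all assumptions made for illustration.

```python
def preprocess(records, min_details_len=20):
    """Drop incomplete, too-short, and duplicate judicial records."""
    seen = set()
    cleaned = []
    for rec in records:
        details = (rec.get("case_details") or "").strip()
        law = (rec.get("applicable_law") or "").strip()
        if not details or not law:
            continue  # a required field is empty
        if len(details) < min_details_len:
            continue  # case details shorter than the set threshold
        key = (details, law)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

The function preserves the order of the surviving records, so the first copy of a duplicate is the one that is kept.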
Two, obtaining vocabulary text, where vocabulary text is text represented in the form of words.
Vocabulary text may be a judicial document after word segmentation, or the text of a particular field of a judicial document after word segmentation. Vocabulary text may be obtained using one or more of the following methods.
A, Obtain vocabulary text directly: obtain it from another system or input it directly.
In one embodiment, a marriage law provision appears in the vocabulary text as, for example: 'bigamy', 'spouse', 'live together', 'two', 'implement', 'family', 'front yard', 'violence', 'maltreat', 'abandon', 'family', 'front yard', 'member', 'three', 'gambling', 'take drugs', 'bad habit', 'refuse to mend one's ways despite repeated admonition', 'four', 'emotion', 'discord', 'separation', 'full', 'two', 'year', 'five', 'cause', 'man and wife', 'emotion', 'rupture', 'situation', 'side', 'declaration', 'missing'.
B, Obtain judicial text and segment it with a word segmentation tool to obtain vocabulary text. Existing word segmentation tools include jieba, thulac from Tsinghua University, hanltp from Harbin Institute of Technology, and funltp from Fudan University. These tools offer the same segmentation functionality: each uses a default vocabulary and a fast segmentation algorithm, and each can successfully segment everyday words and general professional terms.
In one embodiment, judicial text is obtained; a marriage law provision appears in the judicial text as, for example: "(1) bigamy, or a person with a spouse living together with another person; (2) committing domestic violence, or maltreating or abandoning family members; (3) having bad habits such as gambling or drug taking and refusing to mend one's ways despite repeated admonition; (4) having lived separately for two full years because of marital discord; (5) other situations leading to the breakdown of mutual affection. Where one party is declared missing and the other party files for divorce, the divorce shall be granted."
Segmenting the judicial text with the word segmentation tool thulac yields vocabulary text; the marriage law provision appears in the vocabulary text as, for example: 'bigamy', 'spouse', 'live together', 'two', 'implement', 'family', 'front yard', 'violence', 'maltreat', 'abandon', 'family', 'front yard', 'member', 'three', 'gambling', 'take drugs', 'bad habit', 'refuse to mend one's ways despite repeated admonition', 'four', 'emotion', 'discord', 'separation', 'full', 'two', 'year', 'five', 'cause', 'man and wife', 'emotion', 'rupture', 'situation', 'side', 'declaration', 'missing'.
Existing word segmentation tools cannot correctly segment highly specialized legal terms, such as 'person with limited capacity for civil conduct' or 'disease that makes a person unfit for marriage'. To segment these terms correctly, a custom legal vocabulary is needed.
C, Construct a judicial vocabulary table, add it to the custom dictionary of the word segmentation tool in place of the tool's default vocabulary, and segment the judicial text to obtain vocabulary text. The judicial vocabulary table is constructed as follows:
C.1) Add entries from legal dictionaries, legal professional dictionaries, and the like to the vocabulary table;
C.2) Form new terms from combinations of common words using a combination word frequency statistics algorithm, and add the new terms whose combination word frequency exceeds the set threshold to the vocabulary table; combination word frequency is the frequency with which two or more words occur together;
C.3) Add the vocabulary table to the custom dictionary of the word segmentation tool in place of the tool's default vocabulary, segment the judicial text to obtain vocabulary text, and recheck manually, both by checking the segmentation results one by one and by reviewing word frequency statistics of the segmentation results; add the specialized terms that were not segmented correctly to the vocabulary table;
C.4) Use the rechecked vocabulary table as the judicial vocabulary table.
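Step C.2 can be sketched by counting adjacent word pairs across segmented texts. This is a simplified illustration: it considers only two-word combinations, joins them with a configurable separator, and treats the threshold as inclusive, none of which is fixed by the source. The function name `new_terms` is hypothetical.

```python
from collections import Counter


def new_terms(token_lists, threshold_i, joiner=" "):
    """Propose new terms from adjacent word pairs that co-occur often."""
    pair_counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its right-hand neighbour.
        pair_counts.update(zip(tokens, tokens[1:]))
    return {joiner.join(pair)
            for pair, count in pair_counts.items()
            if count >= threshold_i}
```

For Chinese text the separator would typically be the empty string, so that two adjacent fragments are concatenated into a single dictionary entry.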
In one embodiment, segmenting the judicial text using the judicial vocabulary table yields vocabulary text; a marriage law provision appears as, for example: 'bigamy', 'person with a spouse living together with another person', 'two', 'implement', 'domestic violence', 'maltreat', 'abandon family members', 'three', 'gambling', 'take drugs', 'bad habit', 'refuse to mend one's ways despite repeated admonition', 'four', 'be on bad terms with each other', 'separation', 'full', 'two', 'year', 'five', 'cause', 'man and wife', 'break emotionally', 'situation', 'side', 'declaration', 'missing', 'side', 'propose', 'divorce', 'lawsuit', 'shall', 'grant', 'divorce'.
Compared with using the word segmentation tool thulac directly, segmenting the judicial text with the judicial vocabulary table correctly segments legal terms such as 'domestic violence' and 'break emotionally'. Combining legal vocabulary from multiple sources to construct the judicial vocabulary table improves the segmentation precision for legal texts, and high-precision segmentation results are the basis of subsequent text processing.
Further, the parts of speech of the words in the obtained vocabulary text are checked; nouns, verbs, and adjectives are retained, and other words are removed.
Three, select candidate labels according to the word frequency and/or combination word frequency of the vocabulary text to obtain a primary label system. Word frequency is the frequency or number of occurrences of a single word; combination word frequency is the frequency or number of times two or more words occur together. One or more of the following approaches may be used.
A) Count the word frequency of each single word in the vocabulary text; when the word frequency exceeds the set threshold, add the word to the primary label system as a candidate label, until all words have been counted;
B) Treat each pair of adjacent words as a combination, count the combination word frequencies in the vocabulary text, sort them from high to low, and add the top-ranked word combinations, up to a set number, to the primary label system as new terms;
C) Use the window co-occurrence method: define a window length K; count, by traversing with the window, the number of occurrences of any M-word combination; take the words in the N most frequent combinations as keywords; count the word frequency of each single word among the keywords; and add the words whose word frequency exceeds the set threshold to the primary label system as candidate labels.
Further, the labels in the primary label system are screened by rules, that is, non-universal words and non-label words are removed from the primary label system. Non-universal words are words in a preset non-universal vocabulary, such as personal names; non-label words are words in a preset non-label vocabulary, such as isolated verbs.
Because legal wording differs from law to law and the law is highly specialized, the same object plays different roles under different laws. For example, an 'automobile' is a kind of property under the marriage law, whereas under traffic law it represents the legal subject 'motor vehicle'. Different laws therefore need different label dictionaries, and the label dictionaries of multiple laws together form a label system.
A primary label system is established using automatic keyword extraction and part-of-speech tagging; following a layered approach, different label dictionaries are established for different laws to build the label system, which effectively eliminates cross-interference between laws.
Four, merge and/or extend labels according to the similarity of labels in the primary label system to obtain an extended label system. The similarity of labels may be calculated in one or more of the following ways.
In one embodiment, a character-based label similarity calculation is used. Let W1 and W2 denote two labels, $W1 = \{w_{11}, w_{12}, \dots, w_{1e_1}\}$ and $W2 = \{w_{21}, w_{22}, \dots, w_{2e_2}\}$, where $e_1$ and $e_2$ are the character lengths of label W1 and label W2, $w_{11}$, $w_{12}$, $w_{1e_1}$ are the 1st, 2nd, and $e_1$-th characters of label W1, and $w_{21}$, $w_{22}$, $w_{2e_2}$ are the 1st, 2nd, and $e_2$-th characters of label W2.
The label similarity sim(W1, W2) = (number of identical characters in label W1 and label W2) / (the larger of the character lengths of label W1 and label W2).
For example, if label 1 is 'Mr. and Mrs' and label 2 is 'man and wife', each has a character length of 2 in the original Chinese and the two share the character 'husband'; the number of identical characters is 1, so the label similarity is 0.5.
In one embodiment, a semantic-based label similarity calculation is used: a semantic model is built using a language model such as Word2Vec or GloVe; a large number of judicial texts of various types are obtained as the corpus to train the semantic model; two labels are input to the semantic model to obtain their correlation score(W1, W2); and the correlation of the two labels is used as the label similarity.
For example, for the two word pairs ('elder brother', 'younger brother') and ('elder brother', 'motor vehicle'), after the semantic model is trained the correlation of the first pair is clearly greater than that of the second.
In one embodiment, a label similarity calculation based on both characters and semantics is used: set the character-based and semantic-based label similarity weights p and q, obtain the character-based label similarity sim(W1, W2) of labels W1 and W2, obtain the semantic-based label similarity score(W1, W2) of labels W1 and W2, and compute the combined label similarity: p*sim(W1, W2) + q*score(W1, W2).
The primary label system is a fairly simple word list; some words in the list may be semantically similar and need to be merged. In addition, the words in the list cannot adequately cover the semantic diversity of real-life language and need to be extended.
Merging and/or extending labels to obtain the extended label system may be done in one or more of the following ways.
In one embodiment, when the similarity of two labels exceeds threshold III, or the similarity of the two labels ranks in the top R of the label similarity values of all primary label systems, the two labels are merged: one of them is retained and the other is removed from the primary label system. When the similarity between several words in the semantic model or thesaurus and a label word meets set threshold IV, these words are taken as expansion words of that label word and added to the primary label system.
For example: the semantic model or thesaurus contains the two words 'Mr. and Mrs' and 'object', and the label word in the primary label system is 'man and wife'. The similarity between each word and the label word is calculated and checked against threshold IV; 'Mr. and Mrs' satisfies the condition and becomes an expansion word of 'man and wife'.
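The extension check in this example can be sketched as a simple filter. The similarity values and the threshold below are invented for illustration only; the source gives no numbers, and `similarity` is a hypothetical callable standing in for the semantic model or thesaurus lookup.

```python
def extend_label(label, candidates, similarity, threshold_iv):
    """Return the candidate words similar enough to act as expansion words."""
    return [word for word in candidates
            if word != label and similarity(label, word) >= threshold_iv]


# Hypothetical similarity values, chosen only to reproduce the outcome
# described in the example; they are not taken from the source.
assumed_scores = {"Mr. and Mrs": 0.85, "object": 0.30}
expansions = extend_label(
    "man and wife",
    ["Mr. and Mrs", "object"],
    lambda label, word: assumed_scores[word],
    threshold_iv=0.6,
)
print(expansions)
```

With these assumed values only the first candidate passes the threshold, matching the outcome described in the example.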
Label extension produces tables such as those below. Such a table is used for disambiguation: different expressions with the same meaning are unified into the same word, which completes text normalization.
Table 1: Example of the marriage-class label dictionary
Table 2: Example of the traffic-class label dictionary
In one embodiment, the extension words corresponding to the words in the primary label system are extracted from an extension label dictionary and added to the primary label system; when the similarity of two labels in the primary label system exceeds threshold III, or ranks among the top R of all label similarity values in the primary label system, the two labels are merged: one of them is retained and the other is removed from the primary label system.
Five, according to the accuracy with which the extended label system searches text, it is determined that construction of the final label system is complete.
The basic purpose of a text label system is text search. The effectiveness of a label system can be verified by comparing the search accuracy achieved with different versions of the label system.
In one embodiment, a method for calculating the accuracy of text search is provided.
5.1) Judicial texts are obtained, and the text of the case-details and law-article fields is extracted from them; candidate labels are selected according to the word frequency and/or combination word frequency of the case-details vocabulary text and the law-article vocabulary text to obtain the primary label system; labels are merged and/or extended according to the similarity of the labels in the primary label system to obtain the extended label system.
5.2) A test set is established, comprising a sample set and a search object set. Each sample in the sample set contains one question together with the n case details and the m law articles most relevant to that question. The search object set contains all case details and the set of applicable law articles.
For example, a question in the sample set is 'While driving on the road, an accident occurred and my rear tail light was broken by a non-motor vehicle; how is compensation handled?', with the 3 most relevant case details and the 6 most relevant law articles.
5.3) The text labels of the questions, case details and law articles in the sample set are extracted to form label vectors.
5.4) Using vector matching, the case details similar to the question and the applicable law articles in the search object set are recommended. Vector similarity is calculated with the Euclidean distance, that is, the norm of the difference of two vectors, which is the most commonly used vector distance.
5.5) The recommended case details and law articles are compared with the corresponding case details and law articles in the sample set to calculate the accuracy, which is expressed as the average of recall and precision. Recall = number of correct samples found / number of all correct samples in the data set; precision = number of correct samples found / number of all samples found.
For example, if there are 5 recommendation results in total and 2 of them are correct, the recall is 40%; if the test set has 10 samples and the recommendation results for 5 of them are identical to the true values, the accuracy is 50%.
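Steps 5.4) and 5.5) can be sketched as follows. The sketch assumes that each label vector is a binary vector over a fixed label vocabulary, which the text does not specify; the function names and the toy data are hypothetical.

```python
import math

def euclidean(u, v):
    """Norm of the difference of two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(query_vec, target_vecs, k):
    """Return the ids of the k targets closest to the query vector."""
    ranked = sorted(target_vecs,
                    key=lambda tid: euclidean(query_vec, target_vecs[tid]))
    return ranked[:k]

def evaluate(recommended, relevant):
    """Recall, precision, and their average for one sample."""
    hits = len(set(recommended) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision, (recall + precision) / 2

# Toy search object set: binary label vectors over a 4-label vocabulary
# (an assumed encoding, for illustration only).
targets = {
    "case1": [1, 1, 0, 0],
    "case2": [1, 0, 0, 0],
    "case3": [0, 0, 1, 1],
    "case4": [0, 1, 1, 0],
}
top = recommend([1, 1, 0, 0], targets, k=2)
recall, precision, accuracy = evaluate(top, ["case1", "case4", "case3"])
```

In this toy run the two nearest targets are returned, one of which is among the three relevant ones, so recall and precision differ and the reported accuracy is their mean.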
Table 3: Example of the search object set
Case 1 label | Case 1 | Applicable law article | ×× Law, Article 1 | Article 1 label
Case 2 label | Case 2 | Applicable law article | ×× Law, Article 2 | Article 2 label
Case N label | Case N | Applicable law article | Other Law, Article × | Article N label
Table 4: Example comparison of search results and true values
In one embodiment, another method for calculating the accuracy of text search is provided.
A sample set and a search object set are preset. The sample set SS contains NC samples; a sample S_i contains a search question Q_i and a set X_i of vocabulary texts related to the search question; the set X_i contains H_i vocabulary texts, X_i = {x_i1, x_i2, …, x_iH_i}. The search object set Y contains NS vocabulary texts, Y = {y_1, y_2, …, y_NS};
Using the extended label system, the extension labels Z of the search object set Y are obtained, Z = {z_1, z_2, …, z_NS};
A sample S_i is extracted from the sample set in order, and the label vector T_i of the search question Q_i is obtained;
The similarity between the label vector T_i and each extension label z_j is calculated, and the vocabulary texts corresponding to the H_i extension labels with the highest similarity are taken as the control group T;
The single-search accuracy is calculated as the number of vocabulary texts in the control group T that are identical to vocabulary texts in the set X_i, divided by H_i;
The entire sample set is traversed, and the average accuracy is calculated as the accuracy of text search.
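The procedure above can be sketched as follows. The similarity between a question's labels and a target's labels is not specified in the text, so the sketch assumes labels are sets and uses Jaccard similarity; `jaccard`, `search_accuracy`, and the toy data are hypothetical, and every sample is assumed to have at least one relevant text.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def search_accuracy(samples, target_labels, similarity=jaccard):
    """Average over samples of |top-H_i results intersected with X_i| / H_i."""
    per_sample = []
    for question_labels, relevant_ids in samples:
        h = len(relevant_ids)  # H_i: number of relevant vocabulary texts
        ranked = sorted(
            target_labels,
            key=lambda tid: similarity(question_labels, target_labels[tid]),
            reverse=True,
        )
        control_group = ranked[:h]  # the control group T
        hits = len(set(control_group) & set(relevant_ids))
        per_sample.append(hits / h)
    return sum(per_sample) / len(per_sample)

# Toy extension labels Z for a search object set Y of four texts.
target_labels = {
    "y1": {"divorce", "property"},
    "y2": {"traffic", "injury"},
    "y3": {"divorce", "custody"},
    "y4": {"traffic", "insurance"},
}
samples = [
    ({"divorce", "custody"}, ["y3", "y1"]),
    ({"traffic", "injury"}, ["y2", "y3"]),
]
accuracy = search_accuracy(samples, target_labels)
```

The first toy sample retrieves both relevant texts and the second retrieves one of two, so the averaged accuracy sits between the two single-search values.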
Further, when the accuracy of text search is greater than threshold V, the current extended label system is the final label system; otherwise the label system is optimized.
The label system may be optimized using one or a combination of the following methods:
1) Adjust the values of thresholds I, II, III and IV and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, obtaining the final label system.
2) Adjust the law vocabulary, the label similarity calculation method, or the search accuracy calculation method, and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, obtaining the final label system.
3) Taking the current extended label system as the object, calculate the search accuracy after a given label is removed; if that accuracy is unchanged or higher than the accuracy obtained before the label was removed, remove the label from the extended label system. Traverse all labels to obtain the final label system.
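Method 3) amounts to a label ablation loop, sketched below. The accuracy function is supplied by the caller; whether the comparison baseline is updated after each removal is not stated in the text, and this sketch updates it. The names `prune_labels` and `toy_accuracy` and the toy scoring rule are hypothetical.

```python
def prune_labels(labels, accuracy_of):
    """Drop each label whose removal does not lower the search accuracy."""
    current = list(labels)
    baseline = accuracy_of(current)
    for label in list(current):
        trial = [l for l in current if l != label]
        trial_accuracy = accuracy_of(trial)
        # Remove the label if accuracy is unchanged or improved.
        if trial_accuracy >= baseline:
            current, baseline = trial, trial_accuracy
    return current

def toy_accuracy(label_set):
    """Invented scoring rule: useful labels help, a noisy label hurts."""
    useful = {"divorce", "custody"}
    noisy = {"the court"}
    present = set(label_set)
    return len(present & useful) / len(useful) - 0.1 * len(present & noisy)

final_labels = prune_labels(
    ["divorce", "the court", "custody", "hearing"], toy_accuracy
)
```

In practice `accuracy_of` would rerun the search accuracy calculation with the reduced label system, which makes the loop cost one full evaluation per label.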
Embodiment two provides a judicial text label system construction system, comprising a law vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, a label extension module, a label verification module and a label optimization module, wherein
The law vocabulary module stores the law vocabulary, which contains judicial professional terms;
The data acquisition module collects judicial texts and preprocesses them;
The word segmentation module adds the law vocabulary to a general-purpose word segmentation tool and segments the judicial texts provided by the data acquisition module to obtain judicial vocabulary texts;
The primary label construction module obtains the judicial vocabulary texts provided by the word segmentation module, counts word frequency and combination word frequency, and extracts the words and word combinations whose word frequency and combination word frequency meet the set threshold II as the primary label system;
The label extension module stores an extension label dictionary, computes the similarity of the labels in the primary label system, merges the labels that meet the set threshold III, and extracts the corresponding extension words from the extension label dictionary and adds them to the primary label system to obtain the extended label system;
The label verification module stores a sample set and a search object set; the sample set contains several question labels and the set X of judicial vocabulary texts related to the questions, and the search object set contains a set Y of several judicial vocabulary texts. The module obtains the labels of set Y using the extended label system, extracts the question labels from the sample set, and computes the accuracy with which the vocabulary texts retrieved from set Y using the question labels match the vocabulary texts in set X;
The label optimization module judges whether the accuracy provided by the label verification module meets the set threshold V. If it does, the current label system is the final label system; if not, the set threshold II in the primary label construction module and the set thresholds III and IV in the label extension module are adjusted.
Referring to Fig. 1, the data processing flow of the judicial text label system construction system is as follows:
About 160,000 civil judgment documents from the past 10 years are collected, including marriage and traffic judgment documents, and the data are preprocessed: judicial text records whose case-details or applicable-law field is empty are removed; records whose case-details text is shorter than a set case-details length threshold are removed; duplicate records are removed; and the text of the case-details, applicable-law and specific law-article fields is extracted separately. More than 170 commonly used civil laws are also collected, and the text of the two fields, legal clause and specific provision, is extracted.
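The filtering rules in this preprocessing step can be sketched as follows. The record layout, the field names `case_details` and `applicable_law`, and the definition of a duplicate (same case details and same applicable law) are assumptions made for illustration.

```python
def preprocess(records, min_details_len):
    """Drop records with empty fields, short case details, or duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        details = (rec.get("case_details") or "").strip()
        law = (rec.get("applicable_law") or "").strip()
        if not details or not law:
            continue  # an empty case-details or applicable-law field
        if len(details) < min_details_len:
            continue  # case details shorter than the length threshold
        key = (details, law)
        if key in seen:
            continue  # duplicate record (assumed definition)
        seen.add(key)
        cleaned.append({"case_details": details, "applicable_law": law})
    return cleaned

# Toy records with placeholder law names.
records = [
    {"case_details": "The parties registered their marriage and later separated.",
     "applicable_law": "×× Law, Article 1"},
    {"case_details": "", "applicable_law": "×× Law, Article 1"},
    {"case_details": "Too short", "applicable_law": "×× Law, Article 2"},
    {"case_details": "The parties registered their marriage and later separated.",
     "applicable_law": "×× Law, Article 1"},
    {"case_details": "A vehicle collided with a bicycle at an intersection.",
     "applicable_law": None},
]
cleaned = preprocess(records, min_details_len=20)
```

Only the first toy record survives: the others are dropped for an empty field, a too-short text, or duplication.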
Word segmentation: the word segmentation module adds the law vocabulary to a general-purpose word segmentation tool and segments the preprocessed judicial texts to obtain judicial vocabulary texts.
Primary label construction: the words whose word frequency meets the set threshold are extracted as primary labels.
Label extension: the corresponding extension words are extracted from the extension label dictionary.
Label verification: using the correspondence between case details and law articles, the difference in search accuracy between different versions of the extended label system is compared.
Label optimization: it is judged whether the verified labels meet the requirement. If so, construction of the label system is complete; if not, the result is fed back to the label verification module.
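The primary label construction step in this flow can be sketched as below, following the window-based procedure described in the claims (window length K, M-word combinations, top N combinations, threshold II). The input is assumed to be already segmented into word lists; counting each combination once per window position, so that overlapping windows may count a pair more than once, is an assumption, as are the function name and toy documents.

```python
from collections import Counter
from itertools import combinations

def primary_labels(tokenized_docs, k, m, n, threshold_ii):
    """Select candidate labels from windowed combination and word frequency."""
    combo_counts = Counter()
    word_counts = Counter()
    for tokens in tokenized_docs:
        word_counts.update(tokens)
        # Slide a window of length k over the document.
        for start in range(max(1, len(tokens) - k + 1)):
            window = sorted(set(tokens[start:start + k]))
            combo_counts.update(combinations(window, m))
    # Words appearing in the n most frequent m-word combinations.
    keywords = {w for combo, _ in combo_counts.most_common(n) for w in combo}
    # Keep keywords whose own word frequency meets threshold II.
    return sorted(w for w in keywords if word_counts[w] >= threshold_ii)

docs = [
    ["divorce", "custody", "child", "support"],
    ["divorce", "custody", "property"],
    ["divorce", "custody", "child"],
]
labels = primary_labels(docs, k=3, m=2, n=2, threshold_ii=3)
```

With these toy documents the two most frequent pairs involve 'child', 'custody' and 'divorce', and only the latter two reach the word frequency threshold.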
Although the present application has been described by way of embodiments, those of ordinary skill in the art will appreciate that it admits many variations and changes that do not depart from its spirit, and it is intended that the appended embodiments cover such variations and changes.

Claims (10)

1. A method for constructing a judicial text label system, characterized by comprising:
obtaining vocabulary text, the vocabulary text being text expressed in the form of words;
selecting candidate labels according to the word frequency and/or combination word frequency of the vocabulary text to obtain a primary label system;
merging and/or extending labels according to the similarity of the labels in the primary label system to obtain an extended label system;
determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete.
2. The method for constructing a judicial text label system according to claim 1, characterized in that obtaining the vocabulary text comprises: constructing a judicial vocabulary, adding the judicial vocabulary to the custom dictionary of a word segmentation tool, and segmenting judicial texts to obtain the vocabulary text; wherein constructing the judicial vocabulary comprises:
adding the words of law dictionaries, legal-profession dictionaries and the like to a preliminary vocabulary;
counting the combination word frequency of common words, and adding the common-word combinations whose combination word frequency meets the set threshold I to the preliminary vocabulary as new words;
rechecking, and adding the specialized words that were not segmented correctly to the preliminary vocabulary;
obtaining the judicial vocabulary.
3. The method for constructing a judicial text label system according to claim 1, characterized in that selecting candidate labels according to the word frequency and combination word frequency of the vocabulary text to obtain the primary label system comprises:
defining a window length K; counting, by window traversal, the number of occurrences of any M-word combination; taking the words in the N most frequent combinations as keywords; counting the word frequency of each single word among the keywords; and adding the words whose word frequency meets the set threshold II to the primary label system as candidate labels.
4. The method for constructing a judicial text label system according to claim 1, characterized in that the method for calculating the similarity of the labels comprises:
setting a character-based label similarity weight p and a semantic-based label similarity weight q;
obtaining the character-based label similarity sim(W1, W2) of labels W1 and W2, wherein sim(W1, W2) = the number of identical characters in label W1 and label W2 / the larger of the character lengths of label W1 and label W2;
obtaining the semantic-based label similarity score(W1, W2) of labels W1 and W2, wherein score(W1, W2) is the correlation value of label W1 and label W2, obtained from a semantic model trained with judicial texts as the corpus;
calculating the similarity of the labels = p*sim(W1, W2) + q*score(W1, W2).
5. The method for constructing a judicial text label system according to claim 1, characterized in that
merging labels specifically comprises: when the similarity of two labels meets the set threshold III, or ranks among the top R of the label similarity values of the primary label system, merging the two labels by retaining one of them and removing the other from the primary label system;
extending labels specifically comprises: when the similarity between a label word and several words in a semantic model or a thesaurus meets the set threshold IV, taking these words as extension words of the label word and adding the extension words to the primary label system.
6. The method for constructing a judicial text label system according to claim 1, characterized in that the method for calculating the accuracy of text search comprises:
establishing a test set comprising a sample set and a search object set, wherein each sample in the sample set contains one question together with the n case details and the m law articles most relevant to the question, and the search object set contains all case details and the set of law articles;
extracting the text labels of the questions, case details and law articles in the sample set to form label vectors;
recommending, by vector matching, the case details similar to the question and the applicable law articles in the search object set, wherein vector similarity is calculated with the Euclidean distance;
comparing the recommended case details and law articles with the corresponding case details and law articles in the sample set to calculate the accuracy, wherein the accuracy is expressed as the average of recall and precision; recall = number of correct samples found / number of all correct samples in the data set; precision = number of correct samples found / number of all samples found.
7. The method for constructing a judicial text label system according to claim 1, characterized in that the method for calculating the accuracy of text search comprises:
presetting a sample set and a search object set, wherein the sample set SS contains NC samples; a sample S_i contains a search question Q_i and a set X_i of vocabulary texts related to the search question; the set X_i contains H_i vocabulary texts, X_i = {x_i1, x_i2, …, x_iH_i}; and the search object set Y contains NS vocabulary texts, Y = {y_1, y_2, …, y_NS};
obtaining, using the extended label system, the extension labels Z of the search object set Y, Z = {z_1, z_2, …, z_NS};
extracting a sample S_i from the sample set in order, and obtaining the label vector T_i of the search question Q_i;
calculating the similarity between the label vector T_i and each extension label z_j, and taking the vocabulary texts corresponding to the H_i extension labels with the highest similarity as the control group T;
calculating the single-search accuracy = the number of vocabulary texts in the control group T that are identical to vocabulary texts in the set X_i / H_i;
traversing the entire sample set and calculating the average accuracy as the accuracy of text search.
8. The method for constructing a judicial text label system according to claim 1, characterized in that determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete comprises: when the accuracy of text search meets the set threshold V, taking the current extended label system as the final label system; otherwise, adjusting the values of thresholds I, II, III and IV and updating the current extended label system until the search accuracy of the updated extended label system meets the set threshold V, thereby obtaining the final label system.
9. The method for constructing a judicial text label system according to claim 1, characterized in that determining, according to the accuracy with which the extended label system searches text, that construction of the final label system is complete comprises: when the accuracy of text search meets the set threshold V, taking the current extended label system as the final label system; otherwise, calculating the search accuracy after a given label is removed and, if that accuracy is unchanged or higher than the accuracy obtained before the label was removed, removing the label from the extended label system; and traversing all labels to obtain the final label system.
10. A judicial text label system construction system, comprising a law vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, a label extension module, a label verification module and a label optimization module, wherein
the law vocabulary module stores the law vocabulary, which contains judicial professional terms;
the data acquisition module collects judicial texts and preprocesses them;
the word segmentation module adds the law vocabulary to a general-purpose word segmentation tool and segments the judicial texts provided by the data acquisition module to obtain judicial vocabulary texts;
the primary label construction module obtains the judicial vocabulary texts provided by the word segmentation module, counts word frequency and combination word frequency, and extracts the words and word combinations whose word frequency and combination word frequency meet the set threshold II as the primary label system;
the label extension module stores an extension label dictionary, computes the similarity of the labels in the primary label system, merges the labels that meet the set threshold III, and extracts the corresponding extension words from the extension label dictionary and adds them to the primary label system to obtain the extended label system;
the label verification module stores a sample set and a search object set, the sample set containing several question labels and the set X of judicial vocabulary texts related to the questions, and the search object set containing a set Y of several judicial vocabulary texts; the module obtains the labels of set Y using the extended label system, extracts the question labels from the sample set, and computes the accuracy with which the vocabulary texts retrieved from set Y using the question labels match the vocabulary texts in set X;
the label optimization module judges whether the accuracy provided by the label verification module meets the set threshold V; if it does, the current label system is the final label system; if not, the set threshold II in the primary label construction module and the set thresholds III and IV in the label extension module are adjusted.
CN201811294777.8A 2018-11-01 2018-11-01 Method and system for constructing judicial text label system Active CN109543178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811294777.8A CN109543178B (en) 2018-11-01 2018-11-01 Method and system for constructing judicial text label system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811294777.8A CN109543178B (en) 2018-11-01 2018-11-01 Method and system for constructing judicial text label system

Publications (2)

Publication Number Publication Date
CN109543178A true CN109543178A (en) 2019-03-29
CN109543178B CN109543178B (en) 2023-02-28

Family

ID=65846358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811294777.8A Active CN109543178B (en) 2018-11-01 2018-11-01 Method and system for constructing judicial text label system

Country Status (1)

Country Link
CN (1) CN109543178B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004318381A (en) * 2003-04-15 2004-11-11 National Institute Of Advanced Industrial & Technology Similarity computing method, similarity computing program, and computer-readable storage medium storing it
JP2017078919A (en) * 2015-10-19 2017-04-27 日本電信電話株式会社 Word expansion device, classification device, machine learning device, method, and program
CN106682149A (en) * 2016-12-22 2017-05-17 湖南科技学院 Label automatic generation method based on meta-search engine
CN107577785A (en) * 2017-09-15 2018-01-12 南京大学 A kind of level multi-tag sorting technique suitable for law identification

Also Published As

Publication number Publication date
CN109543178B (en) 2023-02-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province
Applicant after: Yinjiang Technology Co.,Ltd.
Address before: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province
Applicant before: ENJOYOR Co.,Ltd.
GR01 Patent grant