CN110737781A

CN110737781A - Law article and fact relation calculation method based on a multi-layer knowledge gate

Info

Publication number: CN110737781A
Application number: CN201911003330.5A
Authority: CN
Inventors: 李传艺; 葛季栋; 李中月; 冯奕; 周筱羽; 骆斌
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-10-21
Filing date: 2019-10-21
Publication date: 2020-01-31

Abstract

The invention discloses a method for calculating the relation between law articles and facts based on a multi-layer knowledge gate. The method comprises the following steps: extracting a judgment document set and preprocessing the text; constructing a law-specific stop-word dictionary; training a word vector model; preprocessing user input; and outputting a prediction of the relation between the fact and the law article. The method is intended for judges to sort out or check cases during a trial or after it has concluded. It uses a multi-layer knowledge gate mechanism to filter noise in the fact text and the law article text based on prior knowledge of the input text, producing new vector representations of the fact and the law article, which can effectively improve the accuracy of the model's predictions. In addition, exploiting the hierarchical relation between the upper-level directories of a law article considerably improves the filtering effect of the gate mechanism.

Description

Law article and fact relation calculation method based on a multi-layer knowledge gate

Technical Field

The invention relates to a text-pair relation calculation method, in particular to a method for calculating the relation between law articles and facts, and belongs to the technical field of big data mining.

Background

Judgment documents are an important class of judicial data. As the carrier that records judicial adjudication, a judgment document fully reflects the objective process in which the parties make claims, present evidence and cross-examine, and comprehensively sets out the legal basis, the factual evidence and the reasoning behind the judgment. As of October 2019, judgment documents had been collected and published on China Judgments Online.

Research based on this judicial big data has followed, and "artificial intelligence + law" has become a popular research topic. Semantic retrieval, legal question answering, legal assistance, online courts and similar applications built on natural language processing and machine learning are making the legal industry more intelligent and efficient.

However, as public legal awareness grows, the number of cases the courts must handle increases substantially year by year, and the workload of judges and other staff keeps growing.

Citing a law article requires facts to support it, so calculating the relation between the law articles and the facts of a document helps to sort out the citation relations between them. The value of this calculation lies in two aspects. On the one hand, it can help a judge check whether the cited law articles are reasonable during or after the trial, so that errors in the trial are found in time and misjudgments are reduced. On the other hand, it can help a party review the reasonableness of the court's judgment after the case is closed; if an error is found, the party can appeal in time, and the cost of legal consultation is saved.

Disclosure of Invention

The invention discloses a method for calculating the relation between law articles and facts based on a multi-layer knowledge gate. It provides a knowledge gate mechanism that filters text information according to multi-level prior knowledge, and builds a mirrored framework on this mechanism to generate cross representations of the input texts. Words that are useful for calculating the relation between the texts are thereby emphasized and irrelevant information is filtered out, so the relation between facts and law articles can be predicted more accurately. The method matches the real scenario in which a judge needs to check the reasonableness of cited law articles when reviewing a case.

The law article and fact relation calculation method based on a multi-layer knowledge gate is characterized by comprising the following steps:

step (1) extracting a judgment document set from a judgment document database by cause of action and preprocessing the text;

step (2) establishing a law-specific stop-word dictionary;

step (3) training a word vector model;

step (4) preprocessing user input;

step (5) predicting the relation between the fact and the law article;

1. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (1), extracting the judgment document set from the judgment document database by cause of action and preprocessing the text, comprises the following sub-steps:

Step (1.1) downloading the judgment document set for the specified cause of action.

Step (1.2) extracting the found-fact paragraph and the list of cited law articles. The found-fact paragraph and the list of cited law articles are extracted from each judgment document by using regular expressions.

Step (1.3) standardizing law article names. First, a name standardization mapping is constructed: regular expressions split each citation into the law title and the article number, symbols in the law title are removed, the citation frequency of each law title is counted, the titles are sorted in descending order of frequency, high-frequency titles are selected as standardization targets, and a mapping between titles is built with the Levenshtein edit distance algorithm. Second, the names are standardized: regular expressions split each citation into the law title and the article number, symbols in the law title are removed, the standard title is obtained from the mapping, Arabic numerals in the article number are converted to Chinese numerals, and the standardized title and the article number are joined with an underscore.

Step (1.4) establishing an upper-level directory library for law articles. The upper-level directories of a law article are queried by the law article name and used as input to the knowledge gate for filtering noise in the text.

2. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (2), establishing the law-specific stop-word dictionary, comprises the following sub-steps:

Step (2.1) downloading a general stop-word dictionary.

Step (2.2) low-frequency and high-frequency word statistics. The corpus is traversed to find the words with a frequency below 20 and the 10 most frequent words. These words are combined with the general stop-word dictionary to construct the law-specific stop-word dictionary.

3. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (3), training the word vector model, comprises the following sub-steps:

Step (3.1) establishing a law article text library. The full text of all laws is obtained from a law database, and a law text library is established.

Step (3.2) establishing the training corpus. The full texts of the documents in the law text library and in the judgment document set are merged and segmented into Chinese words; only words whose part of speech is 'n', 'v' or 'a' are kept, and the words are filtered against the stop-word list. The resulting word lists are separated by spaces and stored in txt files to form the training corpus.

Step (3.4) training the word vector model. The word vector model is trained on the training corpus.

4. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (4), preprocessing user input, comprises the following sub-steps:

Step (4.1) acquiring the body text and the upper-level directories of the law article. For an input law article, the body text and the upper-level directories are retrieved by law article name from the law database and the upper-level directory library respectively.

Step (4.2) word segmentation and filtering. The fact, the body text of the law article and the upper-level directories of the law article are segmented into words; only words whose part of speech is 'n', 'v' or 'a' are kept, and the words are filtered against the stop-word list.

Step (4.3) text vectorization. The fact text, the law article text and the upper-level directory texts are vectorized with the trained word vector model.

5. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (5) outputs the predicted relation between the fact and the law article, the prediction being evaluated with the F1 value and accuracy, and comprises the following sub-steps:

Step (5.1) calculating the new representation of the fact. The upper-level directories of the law article are taken as prior knowledge and input into the knowledge gate together with the fact to calculate the new representation of the fact, so that information useful for matching is enhanced and useless information is filtered out.

Step (5.2) calculating the prior knowledge of the fact. The prior knowledge vector of the fact is calculated from the new representation matrix of the fact.

Step (5.3) calculating the new representation of the law article. The prior knowledge vector of the fact and the law article are input into the knowledge gate to calculate the new representation of the law article text.

Step (5.4) extracting text features with a CNN. Feature vectors are extracted from the new representations of the fact and the law article using a CNN.

Step (5.5) calculating the relation between the fact and the law article.

Compared with the prior art, the invention has the following advantages. Considering the particularity of legal text, a law-specific stop-word list is established by screening words according to their frequency in the text, which avoids the interference caused by using the entire input text and reduces the training cost to a certain extent. Words in the fact that are irrelevant to the law article are filtered out using the upper-level directories of the law article, and the filtered fact then guides the filtering of words in the law article text; this produces interactive representation vectors of the fact and the law article, greatly reduces noise in the text and improves prediction accuracy. The multi-layer knowledge gate mechanism uses the hierarchical relation of the upper-level directories of the law article to assist the calculation of prior knowledge at each level, which effectively improves the filtering effect of the gate mechanism. Because interactive representation vectors of the fact and the law article have to be generated, the fact corpus and the law article corpus are combined when training the word vector model, so that facts and law articles are mapped into the same vector space and more reasonable and accurate interactive text vectors can be generated. In addition, when new text is input, it only needs to be vectorized with the trained word vector model and fed into the trained prediction model to obtain the prediction result.

Drawings

FIG. 1 is a flow chart of the law article and fact relation calculation method based on the multi-layer knowledge gate mechanism

FIG. 2 shows the regular expressions for extracting specific paragraphs of judgment documents

FIG. 3 is an example of the facts and law articles extracted from a judgment document

FIG. 4 is an example of the law-specific stop-word list

FIG. 5 is an example of GM-LPK processing based on three levels of prior knowledge

FIG. 6 is a sixty-seven example of the item splitting in the national Community of the people's republic of China

FIG. 7 is a comparison of the results of the fact and law article relation calculation experiment on traffic accident cases

FIG. 8 is a flow chart of how a user uses the calculation method of the present invention

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The invention mainly comprises the following steps:

step (1) extracting a judgment document set from a judgment document database by cause of action and preprocessing the text;

step (2) establishing a law-specific stop-word dictionary;

step (3) training a word vector model;

step (4) preprocessing user input;

step (5) outputting the prediction of the relation between the fact and the law article.

The detailed workflow of the law article and fact relation calculation method based on a multi-layer knowledge gate is shown in FIG. 1.

1. The calculation method provided by the invention is based on a model obtained by neural network training, so a corpus must be established for model training. The specific steps are as follows:

Step (1.1) downloading the judgment document set for the specified cause of action. There are many types of litigation, and cases with different causes of action are described with different vocabulary. Taking the cause of action into account therefore effectively narrows the range of law articles to be matched and improves the training of the model. Judgment documents with a determined cause of action are therefore selected as the original corpus.

Step (1.2) extracting the found-fact paragraph and the list of cited law articles. The information used by the problem studied in the invention is the case facts and the list of law articles cited in the case. A judgment document generally consists of sections such as basic case information, the plaintiff's claims, the defendant's defense, evidence, found facts, the judgment result, the reasons for judgment and the cited law articles. The required paragraphs are extracted from the judgment document with regular expressions (FIG. 2).

Taking traffic accident documents as an example, the found facts and the list of cited law articles are extracted from each judgment document. An example of the extraction result is shown in FIG. 3.
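Step (1.2) can be illustrated with a minimal sketch. The patent's actual regular expressions are those of FIG. 2 and are not reproduced in the text, so the section markers used here (`经审理查明`, `本院认为`), both patterns and the sample document are assumptions chosen only for illustration.

```python
import re

# Hypothetical patterns; the patent's real expressions are in FIG. 2.
# Found facts: text between "经审理查明" and the start of "本院认为".
FACT_PATTERN = re.compile(r"经审理查明[，,：:]?(?P<facts>.+?)(?=本院认为)", re.S)
# Citation: a law title in 《》 followed directly by an article number.
CITATION_PATTERN = re.compile(
    r"《(?P<law>[^》]+)》(?P<article>第[零一二三四五六七八九十百千0-9]+条)"
)

def extract_facts_and_articles(document):
    """Return (found_fact_paragraph, [(law_title, article_number), ...])."""
    match = FACT_PATTERN.search(document)
    facts = match.group("facts").strip() if match else ""
    articles = [(m.group("law"), m.group("article"))
                for m in CITATION_PATTERN.finditer(document)]
    return facts, articles

# Made-up sample document, not taken from any real judgment.
sample = (
    "原告诉称被告驾车将其撞伤。"
    "经审理查明，2018年3月1日，被告驾驶小型轿车与原告发生碰撞，致原告受伤。"
    "本院认为，被告应承担赔偿责任。"
    "依照《中华人民共和国侵权责任法》第六条、"
    "《中华人民共和国道路交通安全法》第76条之规定，判决如下。"
)
facts, articles = extract_facts_and_articles(sample)
print(facts)
print(articles)
```

The sketch only captures citations in which the article number directly follows the bracketed law title; a production pattern would also need to handle enumerations such as several articles of the same law.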

Step (1.3) standardizing law article names. Because of regional differences, personal writing habits, writing errors and the like, the same law article name appears in documents in several forms; for example, Article 10 of the Marriage Law of the People's Republic of China is written in more than one variant in actual documents. These irregular forms increase the difficulty of training to a certain degree, so unified and standardized law article names are needed.
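The standardization in step (1.3) could be sketched as follows. The patent names the Levenshtein edit distance, the conversion of Arabic numerals to Chinese numerals and the underscore join; the citation pattern, the `max_distance` threshold, the list of standard titles and the misspelled sample title are assumptions made for this illustration.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

DIGITS = "零一二三四五六七八九"

def to_chinese_numeral(n):
    """Convert 1..999 to Chinese numerals (enough for this sketch)."""
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    out = DIGITS[hundreds] + "百" if hundreds else ""
    if tens:
        # Write "十" rather than "一十" for 10..19.
        out += ("" if tens == 1 and not hundreds else DIGITS[tens]) + "十"
    elif hundreds and ones:
        out += "零"
    if ones:
        out += DIGITS[ones]
    return out

# Hypothetical citation pattern: title, then "第<number>条".
CITATION = re.compile(r"(?P<title>.+?)第(?P<num>[0-9]+|[零一二三四五六七八九十百]+)条")

def normalize_citation(citation, standard_titles, max_distance=3):
    """Map a raw citation to '<standard title>_第<Chinese numeral>条'."""
    m = CITATION.match(citation)
    if m is None:
        raise ValueError(f"not a recognizable citation: {citation!r}")
    title = re.sub(r"[《》〈〉<>\s]", "", m.group("title"))  # strip symbols
    num = m.group("num")
    if re.fullmatch(r"[0-9]+", num):
        num = to_chinese_numeral(int(num))
    best = min(standard_titles, key=lambda t: levenshtein(title, t))
    if levenshtein(title, best) <= max_distance:  # threshold is an assumption
        title = best
    return f"{title}_第{num}条"

standard = ["中华人民共和国婚姻法", "中华人民共和国道路交通安全法"]
print(normalize_citation("《中华人民共和国婚烟法》第10条", standard))
```

In the patent the standard titles are the high-frequency titles found in the corpus; here they are passed in directly to keep the sketch short.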

Step (1.4) establishing the upper-level directory library for law articles. According to Article 61 of the Legislation Law, a law may be divided, as its content requires, into parts, chapters, sections, articles, paragraphs, items and sub-items. The law itself is the first-level directory of an article, so the upper-level directories of an article comprise the law title, the part, the chapter and the section. Apart from the part, the other three upper-level directory names summarize the content of their lower-level directories and are written by legal professionals. Therefore, all law articles cited in the downloaded judgment documents are first counted and stored in a txt file; the upper-level directory names of each article are then looked up in the legal text by law article name and written after the article name in the txt file for later queries.
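A minimal sketch of the upper-level directory library described above is given below. The patent only states that the directory names are written after the law article name in a txt file; the tab-separated layout, the `Directory` type, the use of an empty string for a missing level and the placeholder entries are all assumptions.

```python
from typing import NamedTuple

class Directory(NamedTuple):
    """Three upper-level directories of a law article (hypothetical layout)."""
    law: str
    chapter: str
    section: str

def parse_directory_lines(lines):
    """Parse lines of the form 'article_key<TAB>law<TAB>chapter<TAB>section'.

    Missing trailing levels are filled with an empty string.
    """
    library = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, *levels = line.split("\t")
        levels = (levels + ["", "", ""])[:3]
        library[key] = Directory(*levels)
    return library

# Placeholder entries, not real law text.
lines = [
    "ExampleLaw_第十条\tExampleLaw\tChapter 2 General Rules\tSection 1 Scope\n",
    "OtherLaw_第三条\tOtherLaw\tChapter 1 Principles\n",
]
library = parse_directory_lines(lines)
print(library["OtherLaw_第三条"])
```

Keeping the three levels in a fixed order matters later, because the levels are fed to the gate mechanism sequentially.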

2. Legal texts contain words that are used very frequently across the legal field, such as "people's court". Documents under the same cause of action also describe similar cases, so some words, such as "vehicle" in traffic accident cases, occur very frequently; once the cause of action is determined, such words carry no additional information. In addition, low-frequency noise words such as names of people and places are useless for most documents and add to the training burden. A law-specific stop-word list is therefore established in this step, as follows:

Step (2.1) downloading a general stop-word list. A complete general stop-word list is downloaded from the Internet; it contains basic stop words such as common symbols and modal particles.

Step (2.2) low-frequency and high-frequency word statistics. The full text of every document in the downloaded judgment document set is traversed; the 10 words with the highest frequency and the words with a frequency below 20 are added to the general stop-word dictionary, and after manual screening the law-specific stop-word list is formed, as shown in FIG. 4.
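The frequency screening of step (2.2) can be sketched with a simple counter. The thresholds (top 10, frequency below 20) come from the text; the function name, the toy corpus and the smaller `top_n` used in the demonstration are assumptions, and the manual screening that follows in the patent is not modeled.

```python
from collections import Counter

def build_legal_stopwords(tokenized_docs, general_stopwords, top_n=10, min_freq=20):
    """Combine a general stop-word list with corpus high- and low-frequency words."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    high = {word for word, _ in counts.most_common(top_n)}       # most frequent words
    low = {word for word, c in counts.items() if c < min_freq}   # rare noise words
    return set(general_stopwords) | high | low

# Toy corpus: one "document" with made-up token counts.
docs = [["vehicle"] * 30 + ["court"] * 25 + ["collision"] * 22 + ["Zhang"]]
stopwords = build_legal_stopwords(docs, {"the"}, top_n=1, min_freq=20)
print(sorted(stopwords))
```

With the toy settings, "vehicle" is removed as the single most frequent word and "Zhang" as a rare word, while "court" and "collision" are kept.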

3. In the invention, the text input by the user is fed into the model in vector form, so a fixed word vector model must be trained for text vectorization. The specific steps are as follows:

Step (3.1) establishing a law article text library. To train a word vector model that covers law article text, the body text of all laws of the People's Republic of China is obtained from a law database and stored in TXT format, and a law text library is established.

Step (3.2) establishing the training corpus. To map fact text and law article text into the same vector space, so that relations between word pairs from the two texts are easier to find, a single word vector model shared by fact text and law article text is trained. Key information in legal documents is observed to be carried mainly by words with specific parts of speech, namely 'n', 'v' and 'a'. The full texts of the law text library and of the documents in the judgment document set are therefore merged and segmented into Chinese words; only words with these parts of speech are kept, and the words are filtered against the stop-word list. The resulting word lists are separated by spaces and stored in txt files to form the training corpus. Several mature word segmentation tools currently exist, such as JIEBA, ICTS, SCWS, LTP and NLPIR; JIEBA is the most widely used, so JIEBA is adopted in the invention.

Step (3.4) training the word vector model. The word vector model is trained on the training corpus. Because text is vectorized with fixed word vectors, the word2vec model is chosen. The TXT-format corpus is used to train a word vector for each word in the corpus. The vectorization tool used in this step is the gensim library, whose word2vec implementation makes it convenient to train a word vector model.
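The corpus construction of step (3.2) can be sketched on already tagged tokens, which keeps the example free of third-party dependencies. With JIEBA, the `(word, flag)` pairs could be obtained from `jieba.posseg.cut(text)`, whose items expose `.word` and `.flag`. Matching the part-of-speech tags exactly (rather than by prefix), the function names and the sample tokens are assumptions.

```python
KEPT_FLAGS = {"n", "v", "a"}  # parts of speech kept, as stated in the text

def filter_tagged(tagged, stopwords):
    """Keep words whose tag is 'n', 'v' or 'a' and that are not stop words."""
    return [word for word, flag in tagged
            if flag in KEPT_FLAGS and word not in stopwords]

def to_corpus_line(tagged, stopwords):
    """Join the kept words with spaces: one line of the training corpus."""
    return " ".join(filter_tagged(tagged, stopwords))

# Hand-tagged sample standing in for segmenter output, e.g.
#   tagged = [(p.word, p.flag) for p in jieba.posseg.cut(text)]
tagged = [("被告", "n"), ("驾驶", "v"), ("的", "uj"),
          ("车辆", "n"), ("张三", "nr"), ("严重", "a")]
line = to_corpus_line(tagged, stopwords={"车辆"})
print(line)
```

Each such line would then be written to the txt corpus that the word2vec training in step (3.4) reads.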

4. In the invention, the fact, the law article and the prior knowledge of the law article are all input as word vectors. The main steps are as follows:

Step (4.1) acquiring the body text and the upper-level directories of the law article. The calculation method matches the body text of the law article against the fact text and uses the upper-level directories of the law article as prior knowledge to filter the fact. The system therefore first queries the law database and the upper-level directory library by the input law article name to obtain the body text and the upper-level directories of the law article. The upper-level directories comprise the law title, the chapter and the section; a missing directory level is replaced with the "" character.

Step (4.2) word segmentation and filtering. For the input fact, the body text of the law article and the three upper-level directory texts, JIEBA is used for word segmentation; words whose part of speech is 'n', 'v' or 'a' are kept and filtered against the law-specific stop-word list.

Step (4.3) text vectorization. The fact text, the law article text and the upper-level directory texts are vectorized with the trained word vector model. No existing data set is available for the fact and law article relation calculation problem, and collecting and labeling data consumes considerable human resources. Therefore only 500 judgment documents from traffic accident cases were selected for manual labeling, yielding 13000 data items.

5. Based on the preprocessed vectors of the fact, the law article text and the upper-level directories of the law article, the matching relation between the fact and the law article is calculated by the model. The specific steps are as follows:

Step (5.1) calculating the new representation of the fact. A fact text contains many words that are useless for matching the current law article. Such words not only increase the training overhead but also interfere with the model's locating of, and computation on, the key words, so the fact text needs to be filtered. A gate mechanism can use prior knowledge to filter noise and emphasize useful information; in the invention, prior knowledge is used to weight the words in the text according to their importance. An RNN is a neural network for modeling sequence data, in which information from the previous unit flows into the current unit to guide the current unit's handling of data. Combining the ideas of the gate mechanism and the RNN, a multi-layer knowledge gate mechanism, GM-LPK (Gate Mechanism based on Layered Prior Knowledge), is designed, which filters and enhances the input text using multi-level prior knowledge.

Suppose the input text is $T = (w_0, w_1, w_2, \ldots, w_m)$ and the multi-level prior knowledge is $KW = (kw_1, kw_2, \ldots, kw_n)$, where each level of prior knowledge $kw_i$ is represented by a $d$-dimensional vector. GM-LPK acts on each word in $T$, and the process can be expressed as follows:

$$pw_1 = \delta_1(\delta_2(e_w W kw_1))$$

$$pw_i = \delta_1(\delta_2(e_w W [kw_i; pw_{i-1}])), \quad n \ge i > 1$$

where $e_w$ is the vector of word $w$ and $kw_i$ is the vector of the $i$-th level of prior knowledge. In GM-LPK, words flow sequentially along the hierarchy of prior knowledge, and each level of prior knowledge corresponds to one sub-gate. $pw_i$ and $pw_{i-1}$ are the results of processing the word $e_w$ by the sub-gates corresponding to the $i$-th and $(i-1)$-th levels of prior knowledge. The larger the values of the entries of $pw_i$, the more of the original information is preserved. As the equations show, the first sub-gate depends only on the first level of prior knowledge, while each later sub-gate also receives the result of the preceding sub-gate. To improve the filtering and enhancing effect of the gate mechanism, two layers of activation functions are used in the equations: the first layer is a ReLU function and the second layer is a sigmoid function. After the output of each sub-gate is obtained, the overall retention of a word's information under the multi-level prior knowledge is computed as the sum $\sum_{i=1}^{n} pw_i$, which is multiplied element-wise with the initial word vector $e_w$ to obtain the new word vector of the word. Summing the outputs of the sub-gates integrates the result of every sub-gate more comprehensively: key information is effectively enhanced by the superposition of the sub-gate results, and the situation in which key parts are filtered out by mistake, because a prior knowledge text is short and expresses the prior knowledge one-sidedly, is largely avoided.

For the input fact, the three levels of upper-level directories of the law article are used as multi-level prior knowledge. The fact and the three upper-level directories are input into GM-LPK to obtain a new representation vector of the fact, so that the parts of the fact related to the law article are retained and emphasized and the unrelated parts are filtered out, which can greatly improve the matching effect.
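The GM-LPK recurrence of step (5.1) can be made concrete with a small dependency-free sketch. The text does not give tensor shapes, so several choices here are assumptions: the product of $e_w$ with the projected prior knowledge is taken element-wise, each sub-gate has its own weight matrix (d x d for the first level, d x 2d for later levels because of the concatenation), ReLU is applied first and sigmoid second, and the weights are hand-set rather than trained.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gm_lpk_word(e_w, prior_levels, weights):
    """Filter one word vector with n levels of prior knowledge.

    prior_levels: list of n vectors of length d.
    weights: weights[0] is d x d, weights[i > 0] is d x 2d (assumed shapes).
    """
    total = [0.0] * len(e_w)
    pw_prev = None
    for i, (kw, W) in enumerate(zip(prior_levels, weights)):
        k = kw if i == 0 else kw + pw_prev        # [kw_i; pw_{i-1}]
        proj = matvec(W, k)
        pw = [sigmoid(relu(e * p)) for e, p in zip(e_w, proj)]
        total = [t + g for t, g in zip(total, pw)]  # sum of sub-gate outputs
        pw_prev = pw
    return [e * t for e, t in zip(e_w, total)]      # element-wise with e_w

def gm_lpk(text_vectors, prior_levels, weights):
    """Apply the gate to every word vector of a text."""
    return [gm_lpk_word(e, prior_levels, weights) for e in text_vectors]

# Tiny example with d = 2 and two levels of prior knowledge.
e_w = [1.0, -2.0]
prior = [[1.0, 0.0], [0.0, 1.0]]
weights = [[[1.0, 0.0], [0.0, 1.0]],                # first sub-gate: identity
           [[0.0] * 4, [0.0] * 4]]                  # second sub-gate: zeros
new_vec = gm_lpk_word(e_w, prior, weights)
print(new_vec)
```

Because every sub-gate output passes through a sigmoid after a ReLU, each gate entry lies in [0.5, 1) under this ordering, and the summed gate scales each component of the original word vector.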

Step (5.2) calculating the prior knowledge of the fact. The body text of a law article usually contains several sub-items, but the facts related to the law article usually correspond to only one of them, and the other sub-items interfere with matching. The body text of the law article therefore needs to be filtered so that the relevant sub-item is emphasized and irrelevant sub-items are filtered out, which requires the prior knowledge of the fact as input to GM-LPK. The calculation consists of two sub-steps: (1) k-max pooling and (2) mean pooling. Let the new fact representation obtained in step (5.1) be $F'$. Since prior knowledge represents the global information of a text, k-max pooling is first used to extract the $k$ largest values in each dimension, giving a matrix $M_1$ that stores the main information of $F'$. Mean pooling is then applied to $M_1$ to obtain the prior knowledge vector of the fact, FPK. Because $F'$ has been filtered by the gate mechanism, it contains less noise and more information related to the law article, so the FPK derived from $F'$ represents the main fact information related to the law article.
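The two pooling sub-steps of step (5.2) can be sketched as follows. The function names and the sample matrix are assumptions; the k largest values are returned in descending order, which does not affect the subsequent mean.

```python
def k_max_pooling(matrix, k):
    """For an m x d matrix, keep the k largest values in each dimension.

    Returns a matrix with min(k, m) rows and d columns.
    """
    d = len(matrix[0])
    cols = [sorted((row[j] for row in matrix), reverse=True)[:k] for j in range(d)]
    return [[cols[j][r] for j in range(d)] for r in range(len(cols[0]))]

def mean_pooling(matrix):
    """Average each dimension over the rows."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

def fact_prior_knowledge(f_new, k):
    """FPK = mean pooling of the k-max pooled fact representation."""
    return mean_pooling(k_max_pooling(f_new, k))

# Made-up 4-word, 2-dimensional fact representation.
F_new = [[1.0, 5.0], [3.0, 2.0], [2.0, 8.0], [0.0, 1.0]]
M1 = k_max_pooling(F_new, k=2)
FPK = fact_prior_knowledge(F_new, k=2)
print(M1, FPK)
```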

Step (5.3) calculating the new representation of the law article. After the prior knowledge FPK of the fact is obtained, FPK and the body text of the law article are input into GM-LPK to obtain a new vector representation of the law article, so that sub-items in the body text that are irrelevant to the fact are filtered out and important sub-items are retained and emphasized.

Step (5.4) extracting text features with a CNN. After the new vector representations of the fact and the law article are obtained, denoted $F'$ and $S'$ respectively, each is input into a one-layer CNN and text features are extracted by max pooling, giving the feature vectors $V_f$ and $V_s$. The two feature vectors are concatenated to obtain $P = [V_f; V_s]$.
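A minimal illustration of step (5.4) is a one-layer convolution over word windows followed by max pooling over positions. The filter widths, the hand-set filter values, the ReLU activation and the reuse of the same filters for both texts are assumptions made to keep the sketch short; the patent does not specify them.

```python
def conv_max_pool(seq, filters):
    """One convolution layer with max pooling over positions.

    seq: list of m word vectors of length d.
    filters: list of filters, each a list of h rows of length d.
    Assumes m >= h for every filter.
    """
    features = []
    for filt in filters:
        h = len(filt)
        activations = []
        for start in range(len(seq) - h + 1):
            window = seq[start:start + h]
            s = sum(w * x
                    for f_row, x_row in zip(filt, window)
                    for w, x in zip(f_row, x_row))
            activations.append(max(0.0, s))   # ReLU (assumed activation)
        features.append(max(activations))     # max pooling over positions
    return features

# Hypothetical filters: one of width 2 and one of width 1.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]
F_new = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # new fact representation F'
S_new = [[-1.0, -1.0], [0.0, 3.0]]             # new article representation S'
V_f = conv_max_pool(F_new, filters)
V_s = conv_max_pool(S_new, filters)
P = V_f + V_s                                  # concatenation [V_f; V_s]
print(P)
```

Max pooling makes each feature independent of the text length, so the fact and the law article yield vectors of the same size that can be concatenated directly.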

Step (5.5) calculating the relation between the fact and the law article. $P$ is input into a two-layer fully connected neural network to obtain the probabilities of the two classes, relevant and irrelevant, and the class with the higher probability is taken as the final classification result. Since the relation between a fact and a law article is a binary classification problem, the relation calculation is evaluated with the F1 value and the accuracy (ACC).
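The final decision and the two evaluation measures of step (5.5) can be sketched as follows. The fully connected network itself is omitted; the logits below are made-up stand-ins for its outputs, and treating index 1 as the "relevant" class is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the index of the class with the higher probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda i: probs[i])

def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 value for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1

# Hypothetical network outputs as [irrelevant, relevant] scores.
logits = [[0.2, 1.5], [1.0, 0.3], [2.0, -1.0], [0.1, 0.4], [-0.5, 0.9]]
y_pred = [classify(l) for l in logits]
y_true = [1, 1, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(y_pred, acc, f1)
```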

When the user inputs a document, the flow for building the fact and law article relation network of the document is shown in FIG. 8. The resulting relation network helps judges and the general public understand the whole case and the adjudication process more intuitively and clearly.

The law article and fact relation calculation method based on a multi-layer knowledge gate implemented by the invention has been described in detail above with reference to the accompanying drawings. The invention has the following advantages. Considering the particularity of legal text, a law-specific stop-word list is established by screening words according to their frequency in the text, which avoids the interference caused by using the entire input text and reduces the training cost to a certain extent. Words in the fact that are irrelevant to the law article are filtered out using the upper-level directories of the law article, and the filtered fact then guides the filtering of words in the law article; this produces interactive representation vectors of the fact and the law article, greatly reduces noise in the text and improves prediction accuracy. At the same time, the multi-layer knowledge gate mechanism uses the hierarchical relation of the upper-level directories of the law article to assist the calculation of prior knowledge at each level, which effectively improves the filtering effect of the gate mechanism. Because interactive representation vectors of the fact and the law article have to be generated, the fact corpus and the law article corpus are combined when training the word vector model, so that facts and law articles are mapped into the same vector space, which helps to generate more reasonable and accurate interactive text vectors. In addition, when new text is input, it only needs to be vectorized with the trained word vector model and fed into the trained prediction model to obtain the prediction result.

It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. Also, a detailed description of known process techniques is omitted herein for the sake of brevity. The present embodiments are to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A law article and fact relation calculation method based on a multi-layer knowledge gate, characterized by comprising the following steps:

step (1) extracting a judgment document set from a judgment document database by cause of action and preprocessing the text;

step (2) establishing a law-specific stop-word dictionary;

step (3) training a word vector model;

step (4) preprocessing user input;

step (5) predicting the relation between the fact and the law article.

2. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (1), extracting the judgment document set from the judgment document database by cause of action and preprocessing the text, comprises the following sub-steps:

step (1.1) downloading the judgment document set for the specified cause of action.

3. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (2), establishing the law-specific stop-word dictionary, comprises the following sub-steps:

step (2.1) downloading a general stop-word dictionary;

4. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (3), training the word vector model, comprises the following sub-steps:

5. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (4), preprocessing user input, comprises the following sub-steps:

6. The law article and fact relation calculation method based on a multi-layer knowledge gate as claimed in claim 1, wherein step (5) outputs the predicted relation between the fact and the law article, the prediction being evaluated with the F1 value and accuracy, and comprises the following sub-steps:

step (5.5) calculating the relation between the fact and the law article.