CN112100212A

CN112100212A - Case scenario extraction method based on machine learning and rule matching

Info

Publication number: CN112100212A
Application number: CN202010920756.3A
Authority: CN
Inventors: 梁鸿翔; 胡潇; 时子威; 陈放; 颉明明; 杨帅; 张博羿
Original assignee: Second Research Institute Of Casic
Current assignee: Second Research Institute Of Casic
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2020-12-18

Abstract

The invention relates to a case scenario extraction method based on machine learning and rule matching, comprising two processes. The keyword matching and regular-expression matching process comprises: extracting, as features, the descriptive sentences in the paragraphs of a judgment document that contain specified keywords or match a regular expression; and looking up the scenario corresponding to each feature in a pre-constructed scenario library. The deep learning process comprises: performing word segmentation on the text to obtain a word sequence; vectorizing the word sequence to obtain a text vector of the text to be extracted; and inputting the text vector into a pre-constructed deep learning extraction model and obtaining a result from the model's output. The invention can extract not only explicit scenarios with strong interpretability but also implicit scenarios with weaker interpretability. In addition, using different deep neural networks for different crime names improves the analysis accuracy for case facts of low-frequency crime names.

Description

Case scenario extraction method based on machine learning and rule matching

Technical Field

The invention relates to the electronic processing of legal documents, and in particular to a case scenario extraction method based on machine learning and rule matching.

Background

The legal scenario extraction task aims to automatically extract the most important scenarios from the case fact description section of a legal document. On the one hand, this helps people without a legal background understand the key scenarios; on the other hand, it provides legal references for legal professionals. In recent years, China has been steadily advancing the construction of an intelligent judicial system, and the extraction of legal scenarios is an important link in that effort. Keyword matching algorithms were used to study case scenario extraction as early as the last century, and there have also been studies on legal scenario extraction in recent years. With the rapid development of deep learning, some scholars have used deep neural networks to extract case scenarios from legal documents, with good results.

The Chinese patent "109285094 Legal document processing method and device" provides a method that extracts crime keywords from a target legal document and, according to those keywords, determines the crime scenarios of the crime names in the target legal document from a pre-constructed crime database.

The Chinese patent "110032721 Judgment document pushing method and device" provides a method and device that obtain case scenario features through keyword matching and regular-expression matching and push judgment documents by searching for similar features.

The Chinese patent "110263323 Keyword extraction method and system based on a lattice-type long short-term memory neural network" provides a neural-network-based keyword extraction method and system: the legal text corpus from which keywords are to be extracted is input into a text encoding model of the neural network to obtain a sequence of text semantic feature vectors, and this sequence is input into a keyword recognition model to obtain the keyword extraction result.

Keyword matching and regular-expression matching can simply and effectively extract obvious scenarios with high confidence, but the results are error-prone because many semantic subtleties are ignored. Recall is also low, because expressions that contain no keyword or do not match any regular expression are missed. Once a certain level of performance has been reached, even a small improvement requires enormous manual effort to design tighter regular expressions. Deep learning methods can learn some scenarios that are difficult to match with regular expressions, but they typically require a large amount of labeled training data. Furthermore, because of data imbalance, deep learning methods have poor analysis accuracy on cases of low-frequency crime names. Deep learning methods also lack interpretability.

In summary, keyword matching and regular-expression matching are simple and efficient, have good interpretability, and can extract explicit scenarios, but suffer from low recall and high labor cost. Deep learning can extract some implicit scenarios from implicit expressions, but it has poor interpretability, needs a large amount of data, and has low analysis accuracy and coverage for case facts of some low-frequency crime names.

Disclosure of Invention

The invention aims to provide a case scenario extraction method based on machine learning and rule matching that uses different deep neural networks for different crime names, in order to solve the problem that deep learning methods have poor analysis accuracy when processing cases of low-frequency crime names.

The invention relates to a case scenario extraction method based on machine learning and rule matching, which comprises two processes. The keyword matching and regular-expression matching process comprises: extracting, as features, the descriptive sentences in the paragraphs of a judgment document that contain specified keywords or match a regular expression; and looking up the scenario corresponding to each feature in a pre-constructed scenario library. The deep learning process comprises: performing word segmentation on the text to obtain a word sequence; vectorizing the word sequence to obtain a text vector of the text to be extracted; and inputting the text vector into a pre-constructed deep learning extraction model and obtaining a result from the model's output.

According to an embodiment of the case scenario extraction method based on machine learning and rule matching, the method further comprises: constructing a scenario library in advance.

According to an embodiment of the case scenario extraction method based on machine learning and rule matching, the method further comprises constructing a deep learning extraction model in advance by: collecting judgment documents for different crime names; cleaning the judgment documents, and segmenting and extracting the case fact description section according to keywords; manually annotating the scenarios corresponding to the case facts; and training the model.

According to an embodiment of the case scenario extraction method based on machine learning and rule matching, pre-constructing the scenario library comprises: (1) determining the general scenarios and the crime-specific scenarios of each crime name; (2) formulating regular expressions and matching rules for the determined general and crime-specific scenarios; (3) testing each crime name against a large number of actual cases, and revising the regular expressions and matching rules according to the test results.

According to an embodiment of the case scenario extraction method based on machine learning and rule matching, constructing the deep learning extraction model further comprises: dividing the judgment documents by crime name, and splitting the documents for each crime name into a training set, a test set, and a development set in a certain proportion, wherein the training set is used to train the model, the development set is used to tune the model parameters, and the test set is used for the final evaluation of model performance.

According to an embodiment of the case scenario extraction method based on machine learning and rule matching, the trained model comprises an input layer, a hidden layer, and an output layer, wherein: input layer: the input is a two-dimensional matrix of the word vectors of a training text; let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in a sentence, so that a sentence of length n is represented as:

$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$

where $\oplus$ is the concatenation operator; hidden layer: a convolutional neural network abstracts the text input vector matrix to obtain deeper text information; scenario extraction is treated as a binary classification task for each crime name, and the network extracts features for classification by performing convolution operations on the word vectors with convolution kernels of different sizes, applying max pooling, and concatenating the resulting features to obtain the final features; output layer: the resulting feature vector is passed through one or more fully connected layers and activation function layers, and then through a Sigmoid activation function to obtain a classification result predicted from the text features.

According to one embodiment of the case scenario extraction method based on machine learning and rule matching, the experimental setup is as follows: the jieba word segmentation component is used for Chinese word segmentation, and the pre-trained word vectors of Tencent AI Lab are used; the hidden layer is a convolutional neural network using convolution kernels with window sizes of 1, 2, 3, and 4, with 64 kernels per window size; the output layer consists of two linear layers, where the feature size is $S_f$, the hidden layer size is $S_h$, and the number of labels is $S_l$; the features first pass through a linear layer of size $S_f \times S_h$, then through the Tanh function:

$\mathrm{Tanh}(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$

and finally through a linear layer of size $S_h \times S_l$, where the hidden layer size $S_h$ is 256 and dropout with a probability of 0.5 is applied to the linear layer; the learning rate for training is set to 0.001, and training uses the Adam optimizer with BCELoss as the loss function and a Sigmoid activation function.

The case scenario extraction method based on machine learning and rule matching can extract not only explicit scenarios with strong interpretability but also some implicit scenarios with weaker interpretability. In addition, using different deep neural networks for different crime names improves the analysis accuracy for case facts of low-frequency crime names.

Drawings

FIG. 1 is a main flow chart of a case scenario extraction method based on machine learning and rule matching according to the present invention;

FIG. 2 is a diagram of the steps of pre-building a deep learning extraction model;

fig. 3 is a specific topology structure diagram of the deep learning extraction model.

Detailed Description

In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

Fig. 1 is the main flow chart of the case scenario extraction method based on machine learning and rule matching according to the present invention. As shown in Fig. 1, after the text from which scenarios are to be extracted is input, the method comprises two modules: the keyword matching and regular-expression matching process on the left side of Fig. 1, and the deep learning process on the right side of Fig. 1.

The keyword matching and regular-expression matching process comprises the following steps:

(1) In the paragraphs of the judgment document, descriptive sentences that contain a specified keyword or match a regular expression are extracted as features.

(2) The scenario corresponding to each feature is looked up in a pre-constructed scenario library.

For example, a feature matching the expression "(not | not) {0,4} antecedent" corresponds to the scenario "no prior criminal record".
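The lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `SCENARIO_LIBRARY`, `split_sentences`, and `extract_scenarios` are hypothetical names, and the two Chinese patterns are assumed examples of what such rules might look like (the first is a guess at the kind of pattern the translated expression above stands for).

```python
import re

# Hypothetical scenario library: compiled pattern -> scenario label.
# Both patterns are illustrative assumptions, not rules from the source.
SCENARIO_LIBRARY = {
    re.compile(r"(没有|无).{0,4}前科"): "no prior criminal record",
    re.compile(r"自首"): "voluntary surrender",
}


def split_sentences(paragraph):
    """Split a paragraph into sentences on common sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。；;！？!?]", paragraph) if s.strip()]


def extract_scenarios(paragraph):
    """Return (scenario, matching sentence) pairs for every rule that fires."""
    hits = []
    for sentence in split_sentences(paragraph):
        for pattern, scenario in SCENARIO_LIBRARY.items():
            if pattern.search(sentence):
                hits.append((scenario, sentence))
    return hits


hits = extract_scenarios("被告人张某无犯罪前科。案发后主动投案自首。")
```

Because every hit carries the sentence that triggered it, the result stays traceable to the text, which is the interpretability advantage attributed to rule matching.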

The deep learning process comprises the following steps:

(1) Word segmentation is performed on the text to obtain a word sequence.

(2) The word sequence is vectorized to obtain the text vector of the text to be extracted.

(3) The text vector of the text to be extracted is input into a pre-constructed deep learning extraction model, and the result is obtained from the model's output.
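The first two steps above can be sketched as follows. This is a toy illustration under assumptions: `segment` is a whitespace stand-in for a real Chinese segmenter such as jieba, and `TOY_EMBEDDINGS`, `vectorize`, and the tiny dimensions are hypothetical. Padding and truncating to a fixed length is one common way to obtain a fixed-size input matrix; it is assumed here rather than taken from the source.

```python
def segment(text):
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def vectorize(words, embeddings, dim, max_len):
    """Map words to vectors, then truncate or zero-pad to max_len rows."""
    zero = [0.0] * dim
    # Unknown words are mapped to the zero vector in this sketch.
    rows = [list(embeddings.get(word, zero)) for word in words[:max_len]]
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows


# Hypothetical 3-dimensional embedding table for illustration only.
TOY_EMBEDDINGS = {
    "defendant": [0.1, 0.2, 0.3],
    "stole": [0.4, 0.5, 0.6],
}

matrix = vectorize(segment("defendant stole phone"), TOY_EMBEDDINGS, dim=3, max_len=4)
```

The resulting `matrix` has `max_len` rows and `dim` columns, which is the shape the extraction model would receive in step (3).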

To help those skilled in the art better understand the embodiments of the invention, the embodiments are described in further detail below.

Legal professionals construct the scenario library in advance according to the Criminal Law of the People's Republic of China, the Criminal Procedure Law of the People's Republic of China, and the sentencing guidance opinions of the various regions, as follows:

(1) According to the Criminal Law of the People's Republic of China, the Criminal Procedure Law of the People's Republic of China, and the sentencing guidance opinions of the various regions, legal professionals determine the general scenarios and the crime-specific scenarios of each crime name.

(2) Under the guidance of legal professionals, regular expressions and matching rules are formulated for the determined general scenarios and crime-specific scenarios.

(3) Each crime name is tested against a large number of actual cases, and the regular expressions and matching rules are revised under the guidance of legal professionals.

Fig. 2 is a diagram of steps for constructing a deep learning extraction model in advance, and the steps shown in fig. 2 are as follows:

(1) A large number of judgment documents for different crime names are collected. A total of 2,426,402 judgment documents from across the country, dating from 2000 onward, were collected.

(2) The judgment documents are cleaned, and the case fact description section is segmented and extracted according to keywords. 889,690 judgment documents remained after cleaning.

(3) The scenarios corresponding to the case facts are manually annotated according to the Criminal Law of the People's Republic of China, the Criminal Procedure Law of the People's Republic of China, and the sentencing guidance opinions of the various regions.

(4) The judgment documents are divided by crime name, and the documents for each crime name are split into a training set, a test set, and a development set in a certain proportion. The training set is used to train the model, the development set is used to tune the model parameters, and the test set is used for the final evaluation of model performance.

(5) The model is trained.
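The per-crime split in step (4) can be sketched as below. The source only says "a certain proportion", so the 8:1:1 ratio is an assumption, as are the function name `split_by_crime`, the `"crime"` key, and the fixed random seed.

```python
import random
from collections import defaultdict


def split_by_crime(documents, ratios=(8, 1, 1), seed=0):
    """Group documents by crime name, then split each group into train/dev/test."""
    by_crime = defaultdict(list)
    for doc in documents:
        by_crime[doc["crime"]].append(doc)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    total = sum(ratios)
    splits = {}
    for crime, docs in by_crime.items():
        docs = docs[:]  # copy so the caller's list is not shuffled
        rng.shuffle(docs)
        n_train = len(docs) * ratios[0] // total
        n_dev = len(docs) * ratios[1] // total
        splits[crime] = {
            "train": docs[:n_train],
            "dev": docs[n_train:n_train + n_dev],
            "test": docs[n_train + n_dev:],
        }
    return splits


# Hypothetical documents: 10 theft cases and 20 fraud cases.
documents = [{"id": i, "crime": "theft"} for i in range(10)]
documents += [{"id": i, "crime": "fraud"} for i in range(10, 30)]
splits = split_by_crime(documents)
```

Splitting inside each crime name, rather than over the pooled corpus, keeps low-frequency crime names represented in all three sets, which matches the per-crime modelling described above.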

The specific topology of the deep learning extraction model is shown in fig. 3, and the model includes: input layer, hidden layer, output layer, wherein:

an input layer: the input is a two-dimensional matrix of word vectors of training text, for example, if the maximum length of the text is defined as 500 words, and the dimension of each word vector is set to 200, then the input should be a two-dimensional matrix of 500 × 200.

Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in a sentence. A sentence of length n is then represented as

$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$

where $\oplus$ is the concatenation operator.

Hidden layer: this layer abstracts the text input vector matrix to obtain deeper text information. A conventional convolutional neural network (CNN) or long short-term memory network (LSTM) can be used for this layer. Taking the CNN as an example: a convolutional neural network is a deep feedforward artificial neural network that has achieved remarkable results in computer vision and speech recognition. Scenario extraction is treated as a binary classification task for each crime name, and the CNN extracts the features used for classification. Convolution operations are performed on the word vectors with convolution kernels of different sizes, max pooling is applied, and the resulting features are concatenated to obtain the final features.

Specifically, for the CNN, a convolution kernel $w \in \mathbb{R}^{hk}$ is applied to a window of h words to generate a feature. For example, the feature $c_i$ is generated from the window of words $x_{i:i+h-1}$:

$c_i = f(w \cdot x_{i:i+h-1} + b)$

where $b \in \mathbb{R}$ is a bias term and f is a non-linear function such as the rectified linear unit, $\mathrm{ReLU}(x) = \max(0, x)$. Applying the kernel to every window of the sentence, $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$, generates the feature map

$c = [c_1, c_2, \dots, c_{n-h+1}]$

where $c \in \mathbb{R}^{n-h+1}$. Max pooling is then applied to the feature map, $\hat{c} = \max\{c\}$. The pooled value is regarded as capturing the most important feature of each feature map, and this pooling allows sentences of variable length to be processed.
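The convolution and max pooling above can be traced by hand with a small pure-Python sketch. The helper names, the toy sentence, and the kernel values are hypothetical; a real model would use a deep learning framework and learned kernels. The sketch assumes every sentence has at least h words (for example after padding), since an empty feature map cannot be pooled.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def conv_feature_map(sentence, w, b, h):
    """Compute c_i = relu(w . x_{i:i+h-1} + b) for every window of h words.

    sentence: list of n word vectors, each of dimension k
    w: flat kernel of length h * k
    """
    feature_map = []
    for i in range(len(sentence) - h + 1):
        # Concatenate the h word vectors of the window into one flat vector.
        window = [value for word in sentence[i:i + h] for value in word]
        feature_map.append(relu(sum(wi * xi for wi, xi in zip(w, window)) + b))
    return feature_map


def extract_features(sentence, kernels):
    """Max-pool each kernel's feature map and concatenate the pooled values."""
    return [max(conv_feature_map(sentence, w, b, h)) for w, b, h in kernels]


# Toy sentence: n = 4 words, k = 2 dimensions.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
feature_map = conv_feature_map(sentence, w=[1, -1, 1, 1], b=0.0, h=2)
features = extract_features(sentence, [([1, -1, 1, 1], 0.0, 2), ([0.5, 0.5], 0.0, 1)])
```

Each kernel contributes exactly one pooled value regardless of sentence length, which is why the concatenated feature vector has a fixed size.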

Output layer: the resulting feature vector is passed through one or more fully connected layers and activation function layers, and then through the Sigmoid activation function:

$\mathrm{Sigmoid}(x) = \sigma(x) = (1 + e^{-x})^{-1}$

to obtain a classification result predicted from the text features.

For a linear layer with input c and output y, the layer can be expressed as $y = cA^{T} + b$, where A is the weight matrix and b is a bias term.

The model is lightweight, does not occupy excessive time cost, and has good robustness.

(6) Experimental setup

The experiment uses the jieba word segmentation component for Chinese word segmentation and the 200-dimensional pre-trained word vectors of Tencent AI Lab.

The hidden layer in the experiment is a CNN using convolution kernels with window sizes of 1, 2, 3, and 4, with 64 kernels per window size. The output layer consists of two linear layers. Let the feature size be $S_f$, the hidden layer size be $S_h$, and the number of labels be $S_l$. The features first pass through a linear layer of size $S_f \times S_h$, then through the Tanh function:

$\mathrm{Tanh}(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$

and finally through a linear layer of size $S_h \times S_l$. The hidden layer size $S_h$ is 256, and dropout with a probability of 0.5 is applied to the linear layer.
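The two-linear-layer output head can be sketched at inference time as follows. The function names and the tiny weights are hypothetical; in the described experiment $S_h$ is 256, and $S_f$ would presumably be 4 × 64 = 256 if each kernel contributes one pooled value (an inference, not stated in the source). Dropout is omitted because it is only active during training.

```python
import math


def linear(x, weight, bias):
    """y = x A^T + b, where each row of `weight` holds one output unit's weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def output_head(features, w1, b1, w2, b2):
    """Linear (S_f x S_h) -> Tanh -> Linear (S_h x S_l) -> Sigmoid."""
    hidden = [math.tanh(v) for v in linear(features, w1, b1)]
    logits = linear(hidden, w2, b2)
    return [sigmoid(z) for z in logits]


# Toy sizes: S_f = 2, S_h = 3, S_l = 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
b2 = [0.0, 0.0]

probs_zero = output_head([0.0, 0.0], w1, b1, w2, b2)
probs_pos = output_head([1.0, 1.0], w1, b1, w2, b2)
```

Each output is an independent probability for one scenario label, so a text can be assigned several scenarios at once by thresholding each value separately.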

The training batch size is set to 100 and the maximum number of training epochs to 50. An early-stopping mechanism is used: training stops when the F1 score on the development set has not improved for 10 epochs, where the F1 score is:

$\mathrm{F1} = 2 \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall})$
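The F1 score and the early-stopping rule can be sketched as below. The names `f1_score` and `early_stopping` are hypothetical, and the list of per-epoch development-set scores stands in for a real training loop that would compute each score after training one epoch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def early_stopping(dev_f1_per_epoch, max_epochs=50, patience=10):
    """Return (best_epoch, last_epoch_run) for a sequence of dev-set F1 scores.

    Stops once the dev F1 has not improved for `patience` consecutive epochs,
    or after `max_epochs` epochs, whichever comes first.
    """
    best_f1, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch, f1 in enumerate(dev_f1_per_epoch[:max_epochs]):
        last_epoch = epoch
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch


# Hypothetical dev-set scores: improvement stops after epoch 2.
scores = [0.5, 0.6, 0.7] + [0.65] * 20
best_epoch, last_epoch = early_stopping(scores)
```

In practice the model weights from `best_epoch` would be the ones kept for the final evaluation on the test set.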

The learning rate for training is set to 0.001. Training uses the Adam optimizer and BCELoss as the loss function, which can be expressed as $l_n = -w_n [y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n)]$, where $y_n$ is the ground-truth label and $x_n$ is the network output. To ensure that $x_n$ is a number between 0 and 1, the Sigmoid activation function is used.
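The loss formula above can be checked numerically with a few lines of Python. This is a hand-written sketch for illustration, not a call into any framework: `bce_loss` is a hypothetical helper, averaging over elements is an assumed reduction, and the sketch assumes every output lies strictly between 0 and 1, as the Sigmoid guarantees.

```python
import math


def bce_loss(outputs, targets, weights=None):
    """Mean of l_n = -w_n * [y_n * log(x_n) + (1 - y_n) * log(1 - x_n)]."""
    weights = weights or [1.0] * len(outputs)
    losses = [
        -w * (y * math.log(x) + (1 - y) * math.log(1 - x))
        for x, y, w in zip(outputs, targets, weights)
    ]
    return sum(losses) / len(losses)


uncertain = bce_loss([0.5, 0.5], [1, 0])   # the network is undecided
confident = bce_loss([0.9, 0.1], [1, 0])   # the network is right and confident
```

An undecided output of 0.5 costs log 2 per label, and the loss shrinks as the predicted probabilities move towards the true labels.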

By combining scenario extraction based on keyword matching and regular-expression matching with scenario extraction based on deep learning, the invention can extract not only explicit scenarios with strong interpretability but also some implicit scenarios with weaker interpretability. The invention pre-constructs a deep learning extraction model and a scenario library, which help the system extract scenarios.

Compared with the prior art, the technical scheme provided by the invention combines the advantages of keyword matching and regular-expression matching with those of deep-learning-based scenario extraction. It can extract not only explicit scenarios with strong interpretability but also some implicit scenarios with weaker interpretability. Meanwhile, different deep neural networks are used for different crime names, which addresses the low analysis accuracy for low-frequency crime names. According to the experimental results, the method achieves a good extraction effect and basically meets the requirements of scenario extraction. The invention is simple to implement, extracts scenarios effectively, and meets application requirements.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A case scenario extraction method based on machine learning and rule matching is characterized by comprising the following steps:

the keyword matching and regular-expression matching process comprises:

extracting, as features, descriptive sentences in the paragraphs of a judgment document that contain specified keywords or match a regular expression;

looking up the scenario corresponding to the features in a pre-constructed scenario library;

the deep learning process comprises the following steps:

performing word segmentation processing on the text to obtain a word sequence after word segmentation;

vectorizing the word sequence after word segmentation to obtain a text vector of the text to be extracted;

and inputting the text vector of the text to be extracted into a pre-constructed deep learning extraction model, and obtaining a result according to the output of the extraction model.

2. The case scenario extraction method based on machine learning and rule matching as claimed in claim 1, further comprising: constructing a scenario library in advance.

3. The case scenario extraction method based on machine learning and rule matching as claimed in claim 1, further comprising: a deep learning extraction model is constructed in advance:

collecting judgment documents for different crime names;

cleaning the judgment documents, and segmenting and extracting the case fact description section according to keywords;

manually annotating the scenarios corresponding to the case facts;

and training the model.

4. The case scenario extraction method based on machine learning and rule matching as claimed in claim 1, wherein pre-constructing a scenario library comprises:

(1) determining the general scenarios and the crime-specific scenarios of each crime name;

(2) formulating regular expressions and matching rules for the determined general scenarios and crime-specific scenarios;

(3) testing each crime name against a large number of actual cases, and revising the regular expressions and matching rules according to the test results.

5. The case scenario extraction method based on machine learning and rule matching as claimed in claim 3, wherein the building of deep learning extraction model further comprises:

dividing the judgment documents by crime name, and splitting the documents for each crime name into a training set, a test set, and a development set in a certain proportion, wherein the training set is used to train the model, the development set is used to tune the model parameters, and the test set is used for the final evaluation of model performance.

6. The case scenario extraction method based on machine learning and rule matching as claimed in claim 3, wherein the trained model comprises an input layer, a hidden layer, and an output layer, wherein:

input layer: the input is a two-dimensional matrix of the word vectors of a training text;

let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in a sentence, so that a sentence of length n is represented as

$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$

where $\oplus$ is the concatenation operator;

hidden layer: a convolutional neural network abstracts the text input vector matrix to obtain deeper text information; scenario extraction is treated as a binary classification task for each crime name, and the network extracts features for classification by performing convolution operations on the word vectors with convolution kernels of different sizes, applying max pooling, and concatenating the resulting features to obtain the final features;

output layer: the resulting feature vector is passed through one or more fully connected layers and activation function layers, and then through a Sigmoid activation function to obtain a classification result predicted from the text features.

7. The case scenario extraction method based on machine learning and rule matching as claimed in claim 3, further comprising an experimental setup in which the jieba word segmentation component is used for Chinese word segmentation and the pre-trained word vectors of Tencent AI Lab are used;

the hidden layer is a convolutional neural network using convolution kernels with window sizes of 1, 2, 3, and 4, with 64 kernels per window size; the output layer consists of two linear layers, where the feature size is $S_f$, the hidden layer size is $S_h$, and the number of labels is $S_l$; the features first pass through a linear layer of size $S_f \times S_h$, then through the Tanh function:

$\mathrm{Tanh}(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$

and finally through a linear layer of size $S_h \times S_l$, wherein the hidden layer size $S_h$ is 256 and dropout with a probability of 0.5 is applied to the linear layer;

the learning rate for training is set to 0.001, and training uses the Adam optimizer with BCELoss as the loss function and a Sigmoid activation function.