CN111460100A

CN111460100A - Criminal legal document and criminal name recommendation method and system

Info

Publication number: CN111460100A
Application number: CN202010236444.0A
Authority: CN
Inventors: 李芳芳; 陈可道; 张健
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-28

Abstract

The invention relates to a method and a system for recommending criminal legal document and criminal names. The method comprises the following steps: obtaining a criminal law document, and performing word segmentation processing on the criminal law document to obtain a text set with entries as units; obtaining a word2vec word vector model and a text separable convolutional neural network model; establishing a criminal name recommendation model by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network model; and obtaining a criminal name tag matrix according to the text set by using the criminal name recommendation model, and further determining the criminal name corresponding to the criminal legal document according to the criminal name tag matrix. The criminal legal document and criminal name recommending method and system provided by the invention have the characteristics of low labor cost, high criminal name acquiring efficiency, high criminal name recommending accuracy and the like.

Description

Criminal legal document and criminal name recommendation method and system

Technical Field

The invention relates to the technical field of text processing, in particular to a method and a system for recommending criminal legal document and criminal names.

Background

Predicting criminal names from the case description and facts of a criminal legal document means predicting, from a specific passage of text describing a criminal case, the criminal name under which the defendant is prosecuted for the criminal conduct.

The following methods exist for determining the criminal name of a criminal legal document:

Manual judgment. This is the most primitive method: professionals determine the criminal name entirely by hand, relying on their relevant knowledge. It requires personnel with rich knowledge and experience of criminal law, its accuracy depends on their level of expertise, and it is time-consuming and inefficient.

Automatic prediction based on keyword matching. This method collects and customizes a dictionary for each type of crime, and matches the text of each case against the custom dictionaries by keyword to determine the corresponding criminal names. It is fast but has low precision, needs a large amount of prior knowledge to construct the matching dictionaries, and has a high labor cost.

Classification algorithms based on machine learning models, such as support vector machines and random forests. Features are manually extracted and constructed from observation of the case descriptions and factual text of legal documents, then used as input, and the classification result for the criminal name is obtained by the machine learning algorithm. When facing the multi-label, multi-class problem, these algorithms have limited accuracy, their training time is not short, and they need a large amount of prior knowledge to construct complex features by hand.

General deep learning algorithms based on sequence models or recurrent neural networks. These do not require manually constructed features, but place high demands on the original training data. For example, when the criminal legal document data are highly imbalanced, the deep learning model alone cannot adequately capture the semantic information of the case descriptions and factual text, and usually achieves good accuracy only when combined with an input representation that extracts the semantic information of the original text. These methods also suffer from long training times because of their complex network structures.

Therefore, providing a criminal legal document and criminal name recommendation method with low labor cost and high criminal name acquisition efficiency is a technical problem to be solved in the field.

Disclosure of Invention

The invention aims to provide a criminal legal document and criminal name recommendation method and system with low labor cost and high criminal name acquisition efficiency.

In order to achieve the purpose, the invention provides the following scheme:

a method of recommending criminal legal document and criminal names, comprising:

obtaining a criminal law document, and performing word segmentation processing on the criminal law document to obtain a text set with entries as units;

obtaining a word2vec word vector model and a text separable convolutional neural network model;

establishing a criminal name recommendation model by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network model;

and obtaining a criminal name tag matrix according to the text set by using the criminal name recommendation model, and further determining the criminal name corresponding to the criminal legal document according to the criminal name tag matrix.

Preferably, after obtaining the criminal law document and performing word segmentation processing on it to obtain the text set with entries as units, the method further includes:

preprocessing the text set; the preprocessing comprises: removing stop words and punctuation.

Preferably, the recommendation method further includes:

selecting the criminal law documents meeting preset conditions from the criminal law documents to perform calibration sampling;

acquiring a text set of the criminal law documents subjected to calibration sampling and a criminal name label matrix corresponding to the text set of the criminal law documents subjected to calibration sampling as a training sample pair;

and training the criminal name recommendation model by adopting the training sample pair.

Preferably, training the criminal name recommendation model by adopting the training sample pair specifically includes:

acquiring a text set in the training sample pair;

obtaining a word vector matrix according to the text set in the training sample pair by using the word2vec word vector model;

obtaining a first criminal name tag matrix according to the word vector matrix by using the text separable convolutional neural network model;

judging whether the first criminal name tag matrix is the criminal name tag matrix corresponding to the text set in the training sample pair; if so, directly outputting the first criminal name tag matrix to obtain a trained criminal name recommendation model; otherwise, adjusting the filling parameters of the text separable convolutional neural network model until the first criminal name tag matrix output by the text separable convolutional neural network model is that criminal name tag matrix, and then stopping the adjustment of the filling parameters to obtain the trained criminal name recommendation model.

A system for recommending criminal legal document and criminal names, comprising:

a text set determining module, used for acquiring criminal legal documents and performing word segmentation on the criminal legal documents to obtain a text set with entries as units;

the acquisition module is used for acquiring a word2vec word vector model and a text separable convolutional neural network model;

a criminal name recommendation model building module, which is used for building a criminal name recommendation model by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network model;

and the criminal name determining module is used for obtaining a criminal name label matrix according to the text set by utilizing the criminal name recommending model and further determining the criminal name corresponding to the criminal legal document according to the criminal name label matrix.

Preferably, the system further comprises:

a preprocessing module, used for preprocessing the text set; the preprocessing comprises: removing stop words and punctuation.

Preferably, the recommendation system further comprises:

a calibration sampling module, used for selecting criminal law documents meeting preset conditions from the criminal law documents to perform calibration sampling;

a training sample pair acquisition module, used for acquiring a text set of the criminal legal documents subjected to calibration sampling and the criminal name label matrix corresponding to that text set as a training sample pair;

and a training module, used for training the criminal name recommendation model by adopting the training sample pair.

Preferably, the training module specifically includes:

a text set obtaining unit, configured to obtain a text set in the training sample pair;

a word vector matrix determining unit, configured to obtain a word vector matrix according to the text set in the training sample pair by using the word2vec word vector model;

a first criminal name tag matrix determining unit, configured to obtain a first criminal name tag matrix according to the word vector matrix by using the text separable convolutional neural network model;

a criminal name recommendation model training unit, configured to judge whether the first criminal name tag matrix is the criminal name tag matrix corresponding to the text set in the training sample pair; if so, to directly output the first criminal name tag matrix to obtain a trained criminal name recommendation model; otherwise, to adjust the filling parameters of the text separable convolutional neural network model until the first criminal name tag matrix output by the text separable convolutional neural network model is that criminal name tag matrix, and then to stop the adjustment of the filling parameters to obtain the trained criminal name recommendation model.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

According to the criminal legal document and criminal name recommendation method and system provided by the invention, a criminal name recommendation model is built by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network model, so that the criminal name of a criminal legal document can be obtained simply by inputting the text set of that document. This greatly reduces labor cost and improves the efficiency of acquiring criminal names.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for recommending criminal legal document and criminal names provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of a criminal name recommendation model according to an embodiment of the invention;

FIG. 3 is a schematic structural diagram of a criminal legal document and criminal name recommendation system according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

FIG. 1 is a flowchart of a method for recommending criminal legal document and criminal names according to an embodiment of the present invention. As shown in FIG. 1, the method for recommending criminal legal document and criminal names includes:

S1, obtaining the criminal law documents, and performing word segmentation processing on the criminal law documents to obtain a text set with entries as units.

S2, acquiring a word2vec word vector model and a text separable convolutional neural network model.

S3, constructing a criminal name recommendation model by taking the word2vec word vector model as an embedding layer and combining the text separable convolutional neural network model.

S4, obtaining a criminal name label matrix according to the text set by using the criminal name recommendation model, and further determining the criminal name corresponding to the criminal legal document according to the criminal name label matrix.
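
As a hedged illustration of step S4, the sketch below shows one way a criminal name label matrix could be mapped back to criminal names. It assumes that each row of the matrix holds one score per label and that a label is recommended when its score reaches a cut-off of 0.5; the cut-off, the fallback to the best-scoring label, the label list, and the function name `decode_label_matrix` are all illustrative assumptions that the source does not specify.

```python
from typing import List, Sequence


def decode_label_matrix(
    scores: Sequence[Sequence[float]],
    label_names: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off; the source does not give one
) -> List[List[str]]:
    """Map each row of per-label scores to the names whose score passes the threshold."""
    results = []
    for row in scores:
        if len(row) != len(label_names):
            raise ValueError("row width must match the number of labels")
        names = [name for name, s in zip(label_names, row) if s >= threshold]
        if not names:
            # Assumed fallback: recommend the single best-scoring label.
            best = max(range(len(row)), key=lambda i: row[i])
            names = [label_names[best]]
        results.append(names)
    return results


# Hypothetical labels and scores for two documents.
LABELS = ["theft", "robbery", "fraud"]
SCORES = [[0.91, 0.62, 0.03], [0.10, 0.20, 0.40]]
DECODED = decode_label_matrix(SCORES, LABELS)
print(DECODED)
```

The first document receives two names because two scores pass the cut-off, which is what makes the task multi-label rather than single-label.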

In order to improve the accuracy with which the criminal name recommendation model recommends criminal names, after step S1 (obtaining the criminal legal document and performing word segmentation on it to obtain a text set with entries as units), the recommendation method further comprises: preprocessing the text set, where the preprocessing comprises removing stop words and punctuation.

In order to further improve the accuracy of acquiring the criminal name, as another embodiment of the present invention, the recommendation method further includes:

selecting, from the criminal law documents, those meeting preset conditions for calibration sampling;

acquiring a text set of the criminal legal documents subjected to calibration sampling and the criminal name label matrix corresponding to that text set as a training sample pair;

and training the criminal name recommendation model by adopting the training sample pair. The training process specifically comprises:

acquiring a text set in the training sample pair;

and obtaining a word vector matrix according to the text set in the training sample pair by using the word2vec word vector model.

As another embodiment of the present invention, the training of the criminal name recommendation model may further include:

Step one: dividing the criminal legal documents into a text training set and a test set, and performing word segmentation on them to form text sets with entries as units. This specifically comprises the following steps:

A. Extracting the set of texts describing criminal cases in the criminal legal documents as the training set.

B. Extracting the criminal name label corresponding to each criminal case description in the criminal legal documents as the training set labels.

C. Performing word segmentation on the text set from A by using a Chinese word segmentation tool to obtain an entry set with entries as units.

D. Constructing an integer criminal name label matrix from the criminal name labels in B.
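
Sub-step D can be pictured with a small sketch. The code below assumes the integer criminal name label matrix is a multi-hot matrix (one row per document, one column per criminal name, 1 where the name applies); the source only says "integer", so this encoding, the function name `build_label_matrix`, and the sample labels are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def build_label_matrix(
    doc_labels: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Return (sorted label vocabulary, multi-hot matrix with one row per document)."""
    vocab = sorted({name for labels in doc_labels for name in labels})
    index: Dict[str, int] = {name: i for i, name in enumerate(vocab)}
    matrix = []
    for labels in doc_labels:
        row = [0] * len(vocab)
        for name in labels:
            row[index[name]] = 1  # mark every criminal name attached to this document
        matrix.append(row)
    return vocab, matrix


# Hypothetical labels for three case descriptions; the second carries two names.
DOC_LABELS = [["theft"], ["robbery", "theft"], ["fraud"]]
VOCAB, LABEL_MATRIX = build_label_matrix(DOC_LABELS)
print(VOCAB)
print(LABEL_MATRIX)
```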

Step two: constructing a targeted stop word list according to the statistical information of the criminal legal documents, and then preprocessing the entry text set from step one. The preprocessing mainly comprises removing stop words and punctuation marks according to the specific stop word list, then randomly dividing the training set to obtain an input entry set, and at the same time converting the labels of the training set into a criminal name label matrix. This specifically comprises the following steps:

A. Observing and computing statistics on the entry set obtained in step one. This includes obtaining the maximum and average entry lengths, and identifying common words for times, places, and persons, logically irrelevant words, and the like.

B. Designing a stop word list for criminal legal documents from the words identified in A together with common stop words, and removing the stop words and punctuation marks.

C. Numbering the entry sets in their original order, and randomly dividing them by number to obtain the input entry set.
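
The three sub-steps of step two can be sketched roughly as follows. The stop word list, the sample entries, the 80/20 split fraction, and all function names are hypothetical; a real stop word list would be derived from corpus statistics as described in sub-steps A and B.

```python
import random
import string
from typing import List, Sequence, Set, Tuple

# Hypothetical stop words; not taken from the source.
STOP_WORDS: Set[str] = {"the", "on", "at", "of"}
PUNCTUATION: Set[str] = set(string.punctuation) | set("，。、；：？！（）")


def entry_length_stats(docs: Sequence[Sequence[str]]) -> Tuple[int, float]:
    """Sub-step A: maximum and average number of entries per document."""
    lengths = [len(doc) for doc in docs]
    return max(lengths), sum(lengths) / len(lengths)


def remove_stop_words(entries: Sequence[str]) -> List[str]:
    """Sub-step B: drop stop words and punctuation entries."""
    return [e for e in entries if e not in STOP_WORDS and e not in PUNCTUATION]


def numbered_random_split(docs: Sequence[Sequence[str]],
                          train_fraction: float = 0.8,  # assumed ratio
                          seed: int = 0):
    """Sub-step C: number documents in original order, then split the numbers at random."""
    numbered = list(enumerate(docs))
    random.Random(seed).shuffle(numbered)
    cut = int(len(numbered) * train_fraction)
    return numbered[:cut], numbered[cut:]


DOCS = [
    ["the", "defendant", "stole", "a", "phone", "."],
    ["victim", "reported", "fraud", "at", "night", "，"],
]
CLEANED = [remove_stop_words(d) for d in DOCS]
print(CLEANED)
print(entry_length_stats(CLEANED))
```

Keeping the original number next to each document after shuffling makes it possible to line the entries up with their rows in the criminal name label matrix.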

Step three: acquiring a Chinese text corpus and training on it to generate a word2vec word vector model, and converting the input entry set obtained in step two into a word vector matrix by using the word vector model. This specifically comprises the following steps:

A. Collecting a public Chinese short-text corpus from the Internet as the text training set for word2vec.

B. Inputting the external corpus training set obtained in A into a word2vec model for training, to obtain a word2vec word vector model capable of converting entries into word vectors.

C. Converting the input entry set from step two into a word vector matrix through the word2vec model.
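
Sub-step C amounts to a table lookup once a word vector model exists. The sketch below stands in for the trained word2vec model with a tiny hand-written table, and assumes that documents are truncated or zero-padded to a fixed length and that unknown entries map to a zero vector; the table values, the padding scheme, and the function name are assumptions, and training the word2vec model itself is not shown.

```python
from typing import Dict, List, Sequence


def to_word_vector_matrix(
    entries: Sequence[str],
    table: Dict[str, List[float]],
    dim: int,
    max_len: int,
) -> List[List[float]]:
    """Look up each entry, then truncate or zero-pad to exactly max_len rows."""
    zero = [0.0] * dim
    rows = [list(table.get(e, zero)) for e in entries[:max_len]]
    rows.extend(list(zero) for _ in range(max_len - len(rows)))
    return rows


# Toy 2-dimensional vectors standing in for a trained word2vec model.
TABLE = {"defendant": [0.1, 0.3], "stole": [0.7, -0.2], "phone": [0.0, 0.5]}
WORD_MATRIX = to_word_vector_matrix(
    ["defendant", "stole", "a", "phone"], TABLE, dim=2, max_len=5
)
print(WORD_MATRIX)
```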

Step four: extracting the under-represented (imbalanced) data items, and generating additional data items by oversampling according to the word vector similarity between samples, so as to balance the sample data size.

Counting the imbalanced data label categories, and collecting the corresponding word vectors into a set.

For each category, every word vector in the word vector matrix is initially regarded as a semantic set. All word vectors of all samples are traversed in sequence, and the similarity between each word vector and the mean word vector of each existing semantic set is computed. If the similarity is high enough, the word vector is added to that set; otherwise it is regarded as a new semantic set. The semantic sets are then sorted by the number of word vectors they contain, a fixed percentage of them is kept, and data items of the label category are randomly generated from the mean word vector of each kept semantic set. The specific steps are as follows:

A. Assume that the word vector matrix of a first category with sample size n is:

$\{w_1, w_2, w_3, \ldots, w_n\}$

B. Each word vector in the word vector matrix is regarded as a semantic set:

$s_1, s_2, s_3, \ldots, s_n$

C. Take in turn each word vector $w_x$ of the remaining samples in the first category, and compute its similarity $I$ with the mean word vector of each semantic set:

$I = \mathrm{Sim}(w_x, \mathrm{Avg}(s_i))$

D. If $I$ is above a certain threshold, add $w_x$ to $s_i$; otherwise, regard $w_x$ as a new semantic set $s_x$.

E. Repeat steps A-D until all word vectors contained in the samples of the category have been traversed.

F. Sort all semantic sets by number of word vectors, take the top x percent (the specific value of x is determined according to actual needs), and finally randomly generate new data items of the category from the mean word vectors of these semantic sets, so as to achieve data balance.
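
The oversampling procedure of step four can be sketched as a greedy grouping followed by sampling around group means. This is a simplified reading, not the patented procedure itself: it assumes Sim is cosine similarity, starts from an empty list of semantic sets instead of one set per vector, assigns each vector to the most similar set above the threshold, and adds small Gaussian noise to a mean vector to generate a new item. The threshold, the kept fraction, the noise level, and all names are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed choice for Sim."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean, playing the role of Avg(s_i)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_semantic_sets(vectors: Sequence[Vector], threshold: float) -> List[List[Vector]]:
    """Add each vector to the most similar set above the threshold, else open a new set."""
    sets: List[List[Vector]] = []
    for w in vectors:
        best_set, best_sim = None, threshold
        for s in sets:
            sim = cosine(w, mean(s))
            if sim >= best_sim:
                best_set, best_sim = s, sim
        if best_set is None:
            sets.append([list(w)])
        else:
            best_set.append(list(w))
    return sets


def oversample(vectors: Sequence[Vector], n_new: int, threshold: float = 0.9,
               top_fraction: float = 0.5, noise: float = 0.01, seed: int = 0) -> List[Vector]:
    """Keep the largest sets and sample new items around their mean vectors."""
    rng = random.Random(seed)
    sets = sorted(build_semantic_sets(vectors, threshold), key=len, reverse=True)
    keep = max(1, int(len(sets) * top_fraction))
    centers = [mean(s) for s in sets[:keep]]
    return [[x + rng.gauss(0.0, noise) for x in rng.choice(centers)] for _ in range(n_new)]


MINORITY = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
SETS = build_semantic_sets(MINORITY, threshold=0.9)
NEW_ITEMS = oversample(MINORITY, n_new=3)
print([len(s) for s in SETS])
```

In this toy input the first two vectors point in nearly the same direction and end up in one semantic set, while the third forms its own set and is dropped when only the top half of the sets is kept.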

Step five: designing and implementing a deep learning network based on text-separable-cnn (the text separable convolutional neural network model), setting the filling parameters, and selecting and implementing a classifier layer.

The text-separable-cnn structure implemented for criminal name prediction of criminal legal documents comprises: separable convolution layers with 128 convolution kernels and window sizes of 2, 3, 4, and 5, a BatchNorm layer, a ReLU activation function layer, a max pooling layer, and a sigmoid output layer.
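
To give a feel for why a separable convolution is lighter than a standard one, the following sketch counts weights for the window sizes 2, 3, 4, and 5 named above. It assumes 128 kernels per window size, a depth multiplier of 1, no bias terms, and a word vector dimension of 300; the dimension in particular is a placeholder, since the source does not state it.

```python
from typing import Tuple


def standard_conv1d_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """A standard 1-D convolution mixes positions and channels in one kernel."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Depthwise step (one kernel per input channel) plus a 1x1 pointwise step."""
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


EMBED_DIM = 300  # assumed word vector dimension
FILTERS = 128    # kernels per window size (assumed reading of the source)
WINDOWS: Tuple[int, ...] = (2, 3, 4, 5)

STANDARD_TOTAL = sum(standard_conv1d_weights(EMBED_DIM, FILTERS, k) for k in WINDOWS)
SEPARABLE_TOTAL = sum(separable_conv1d_weights(EMBED_DIM, FILTERS, k) for k in WINDOWS)
print(STANDARD_TOTAL, SEPARABLE_TOTAL)
```

Under these assumptions the separable layers use 157,800 weights against 537,600 for standard convolutions, which is consistent with the shorter training time the document aims for.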

Step six: training the criminal name recommendation model, which is formed by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network. This specifically comprises the following steps:

The word2vec word vector model generated in step three is taken as the word embedding layer and combined with the text-separable-cnn network implemented in step five. With the word vector matrix of the training set from step three as input and the criminal name label matrix from step two as training output, the criminal name recommendation model is trained to obtain a deep learning model for predicting the criminal names of criminal legal documents (the specific structure of the criminal name recommendation model is shown in FIG. 2).

To improve how well the model fits, a stochastic gradient descent algorithm is used during training to adjust the parameters of the criminal name recommendation model. When the loss function of the criminal name recommendation model falls below a set threshold, the criminal name label matrix output by the currently trained model is the same as or similar to the criminal name label matrix calibrated in the training sample pair.
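
The stopping rule described here (adjust parameters by gradient descent until the loss drops below a set threshold) can be illustrated on a deliberately tiny model. The sketch below trains a single sigmoid output layer with per-sample stochastic gradient descent and binary cross-entropy; it is not the text separable convolutional network, and the learning rate, the threshold of 0.05, the toy data, and the function names are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(p: float, y: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one label."""
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))


def train_sgd(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]],
              lr: float = 0.5, loss_threshold: float = 0.05,
              max_epochs: int = 5000, seed: int = 0) -> Tuple[List[List[float]], List[float], float]:
    """Update weights one sample at a time; stop once the mean loss is under the threshold."""
    rng = random.Random(seed)
    n_feat, n_lab = len(xs[0]), len(ys[0])
    w = [[0.0] * n_feat for _ in range(n_lab)]
    b = [0.0] * n_lab
    order = list(range(len(xs)))
    loss = float("inf")

    def predict(x: Sequence[float], j: int) -> float:
        return sigmoid(sum(wk * xk for wk, xk in zip(w[j], x)) + b[j])

    for _ in range(max_epochs):
        rng.shuffle(order)
        for i in order:
            for j in range(n_lab):
                grad = predict(xs[i], j) - ys[i][j]  # d(BCE)/d(pre-activation)
                for k in range(n_feat):
                    w[j][k] -= lr * grad * xs[i][k]
                b[j] -= lr * grad
        loss = sum(bce(predict(x, j), y[j]) for x, y in zip(xs, ys)
                   for j in range(n_lab)) / (len(xs) * n_lab)
        if loss < loss_threshold:
            break
    return w, b, loss


# Toy multi-label data: label j is present exactly when feature j is present.
XS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
YS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WEIGHTS, BIASES, FINAL_LOSS = train_sgd(XS, YS)
print(FINAL_LOSS < 0.05)
```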

In addition, corresponding to the above method for recommending criminal legal document and criminal names, the invention also provides a system for recommending criminal legal document and criminal names, the structure of which is shown in FIG. 3. The system comprises: a text set determining module 1, an obtaining module 2, a criminal name recommendation model building module 3 and a criminal name determining module 4.

The text set determining module 1 is used for acquiring criminal legal documents, and performing word segmentation processing on the criminal legal documents to obtain a text set with entries as units; the obtaining module 2 is used for obtaining a word2vec word vector model and a text separable convolutional neural network model; the criminal name recommendation model building module 3 is used for building a criminal name recommendation model by taking the word2vec word vector model as an embedded layer and combining the text separable convolutional neural network model; and the criminal name determining module 4 is used for obtaining a criminal name label matrix according to the text set by utilizing the criminal name recommending model, and further determining the criminal name corresponding to the criminal legal document according to the criminal name label matrix.
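
One possible way to picture how these modules fit together is a small composition of interchangeable parts. In the sketch below the segmenter, the embedding function, and the classifier are trivial stand-ins (whitespace splitting, keyword indicators, and a column-wise maximum), so it shows only the data flow between modules; the class name, the field names, and the 0.5 cut-off are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Matrix = List[List[float]]


@dataclass
class CriminalNameRecommender:
    segment: Callable[[str], List[str]]        # role of the text set determining module 1
    embed: Callable[[List[str]], Matrix]       # stand-in for the word2vec embedding layer
    classify: Callable[[Matrix], List[float]]  # stand-in for the text separable CNN
    label_names: Sequence[str]
    threshold: float = 0.5                     # assumed decision cut-off

    def recommend(self, document: str) -> List[str]:
        entries = self.segment(document)
        scores = self.classify(self.embed(entries))
        # Role of the criminal name determining module 4: scores to names.
        return [n for n, s in zip(self.label_names, scores) if s >= self.threshold]


KEYWORDS = ("stole", "deceived")  # hypothetical indicator words


def toy_embed(entries: List[str]) -> Matrix:
    """One row per entry, one indicator column per keyword."""
    return [[1.0 if e == k else 0.0 for k in KEYWORDS] for e in entries]


def toy_classify(matrix: Matrix) -> List[float]:
    """Column-wise maximum, loosely analogous to max pooling over positions."""
    return [max(col) for col in zip(*matrix)]


RECOMMENDER = CriminalNameRecommender(str.split, toy_embed, toy_classify, ["theft", "fraud"])
print(RECOMMENDER.recommend("the defendant stole a phone"))
```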

In order to improve the accuracy of criminal name recommendation, the system further comprises a preprocessing module for preprocessing the text set.

As a further optimization of the system, the recommendation system further comprises: a calibration sampling module, a training sample pair acquisition module and a training module.

The calibration sampling module is used for selecting criminal legal documents meeting preset conditions from the criminal legal documents to perform calibration sampling; the training sample pair acquisition module is used for acquiring a text set of the criminal legal documents subjected to calibration sampling and the criminal name label matrix corresponding to that text set as a training sample pair; the training module is used for training the criminal name recommendation model by adopting the training sample pair.

The training module specifically includes: a text set acquisition unit, a word vector matrix determination unit, a first criminal name tag matrix determination unit and a criminal name recommendation model training unit.

The text set acquisition unit is used for acquiring a text set in the training sample pair.

The word vector matrix determination unit is used for obtaining a word vector matrix according to the text set in the training sample pair by utilizing the word2vec word vector model.

The first criminal name tag matrix determination unit is used for obtaining a first criminal name tag matrix according to the word vector matrix by utilizing the text separable convolutional neural network model.

The criminal name recommendation model training unit is used for judging whether the first criminal name tag matrix is the criminal name tag matrix corresponding to the text set in the training sample pair; if so, the first criminal name tag matrix is directly output to obtain the trained criminal name recommendation model; otherwise, the filling parameters of the text separable convolutional neural network model are adjusted until the first criminal name tag matrix output by the text separable convolutional neural network model is that criminal name tag matrix, and the adjustment of the filling parameters is then stopped to obtain the trained criminal name recommendation model.

Compared with the prior art, the criminal legal document and criminal name recommendation method and system provided by the invention reduce noise by constructing a stop word list for criminal legal documents. By computing word vectors that carry semantics and are adapted to criminal legal documents with a word2vec word vector model, oversampling imbalanced data according to similarity within each category, constructing a specific text-separable-cnn structure, and applying the other technical means described above, the accuracy of criminal name recommendation can be remarkably improved. At the same time, the problems of the prior art, in which manual criminal name determination is time-consuming, inefficient, and inaccurate, can be solved.

The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for recommending criminal legal document and criminal names, comprising:

2. The method for recommending criminal legal document and criminal names according to claim 1, wherein said obtaining criminal legal document and performing word segmentation process on said criminal legal document to obtain text collection in terms of entries further comprises:

3. The method for recommending criminal legal document and criminal names according to claim 1, wherein said recommendation method further comprises:

4. The criminal legal document criminal name recommendation method according to claim 3, wherein the training of the criminal name recommendation model by the training sample specifically comprises:

acquiring a text set in the training sample pair;

5. A system for recommending criminal legal document and criminal names, comprising:

6. A system for recommending criminal legal document and criminal names according to claim 5, characterized in that said system further comprises:

7. A recommendation system for criminal legal document and criminal names according to claim 5, characterized in that said recommendation system further comprises:

8. The system for recommending criminal legal document and criminal names of claim 7, wherein said training module specifically comprises: