CN109614606B - Document embedding-based long text case penalty range classification prediction method and device - Google Patents


Info

Publication number
CN109614606B
Authority
CN
China
Prior art keywords
document
vector
word
label
predicted
Prior art date
Legal status
Active
Application number
CN201811237399.XA
Other languages
Chinese (zh)
Other versions
CN109614606A (en)
Inventor
郑子彬
庄业广
周晓聪
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201811237399.XA
Publication of CN109614606A
Application granted
Publication of CN109614606B
Legal status: Active

Classifications

    • G06F40/216 Parsing using statistical methods
    • G06F18/24 Classification techniques
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention discloses a document embedding-based classification prediction method for the penalty range of long text cases, which comprises the following steps. S1: discretize the penalty amount and assign different labels according to the penalty amount. S2: perform word segmentation and stop-word removal. S3: concatenate the judgment documents under the same label into one document, calculate the TFIDF value of each word in the documents under different labels, and rank the importance of the words by the variance of their TFIDF values across those documents. S4: take the top k ranked words as keywords and keep only the words that are keywords. S5: train a doc2vec model with the filtered judgment documents and obtain the corresponding doc2vec center vectors. S6: calculate the doc2vec vector of the document to be predicted and its distance to the doc2vec center vector of each label, and take the nearest label as the predicted label, giving the penalty range of the document to be predicted. Using machine learning and big data, the invention automatically gives the penalty range of the responsible person in a case, which improves case handling efficiency and reduces the human resources required.

Description

Document embedding-based long text case penalty range classification prediction method and device
Technical Field
The invention relates to the field of text classification, in particular to a document embedding-based long text case penalty range classification prediction method and device.
Background
With the development of artificial intelligence technology and the broad application of judicial big data, business personnel hope that machines can read large numbers of cases and automatically give the penalty range of the responsible persons, so as to improve case handling efficiency. It also helps citizens quickly learn the penalties they may face based on the relevant case facts.
Disclosure of Invention
The invention provides a document embedding-based method and device for classifying and predicting the penalty range of long text cases, aiming to overcome at least one defect of the prior art.
The present invention aims to solve the above technical problem at least to some extent.
The invention aims to use a large amount of public case data, let a machine read a large number of case facts, learn the legal professional experience hidden behind the data, and automatically give the penalty range of the responsible person in a case, so as to improve case handling efficiency.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a document embedding-based long text case penalty range classification prediction method comprises the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts, and marking the judgment documents with different labels according to the penalty amount;
s2: performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the words by the variance of their TFIDF values across the different documents;
s4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label of the judgment document to be predicted, thereby obtaining its penalty range.
The method uses public judgment documents with known penalty amounts, classifies and segments them according to the penalty amount, and then trains a doc2vec model. The trained model then predicts the penalty amount of a judgment document to be predicted, which saves a large amount of manpower and material resources, enables fast penalty judgment, and speeds up the case trial process.
Preferably, the step S1 discretizes the penalty amount of the decision document with the known penalty amount by:
the penalty amounts are divided into eight ranges (0,2000 ], (2000,3000 ], (3000,4000 ], (4000,5000 ], (5000,6000 ], (6000,10000 ], (10000,500000 ], and 500000+ with label labeled 1 through 8, respectively.
Preferably, the specific steps of word segmentation and stop-word removal for the judgment documents with known penalty amounts in step S2 are as follows:
Chinese word segmentation is performed on the judgment documents with known penalty amounts using a word segmentation tool, and after segmentation the stop words that appear are filtered out using a Chinese stop-word list.
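A minimal sketch of the stop-word removal in step S2, assuming the document has already been segmented into tokens (the patent relies on an external Chinese segmentation tool and stop-word list, so the tiny stop-word set here is purely illustrative):

```python
# Illustrative stop-word set; a real pipeline would load a full Chinese stop-word list.
STOP_WORDS = {"的", "了", "在", "是"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list (step S2)."""
    return [t for t in tokens if t not in STOP_WORDS]
```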
Preferably, the specific steps of calculating TFIDF values of each word in the documents under different labels in step S3 are as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d.
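The TFIDF computation of steps S3.1 to S3.3, together with the variance-based ranking of step S3, can be sketched as follows. The helper names are hypothetical, and the IDF uses log(N/N_w) exactly as written above:

```python
import math
from collections import Counter

def tfidf_by_label(label_docs):
    """label_docs maps each label to the token list of its concatenated document.
    Returns {word: {label: tfidf}} using TF(w,d) = Count(w|d)/Count(d) and
    idf(w,D) = log(N / N_w)."""
    N = len(label_docs)                                  # number of label documents
    counts = {d: Counter(toks) for d, toks in label_docs.items()}
    df = Counter(w for c in counts.values() for w in c)  # N_w: documents containing w
    scores = {}
    for w in df:
        scores[w] = {
            d: (c[w] / sum(c.values())) * math.log(N / df[w])
            for d, c in counts.items()
        }
    return scores

def rank_by_variance(scores):
    """Order words by the variance of their TFIDF values across labels, descending."""
    def variance(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    return sorted(scores, key=lambda w: variance(list(scores[w].values())), reverse=True)
```

A word that is frequent in one label's document but absent from the others gets a high variance and therefore ranks as a discriminative keyword, while a word spread evenly across labels ranks low.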
Preferably, the specific steps of training a doc2vec model in step S5 using the filtered decision document with known penalty amount are as follows:
s5.1: adding a paragraph id to each filtered judgment document with a known penalty amount and mapping it to a vector, namely the paragraph vector; during training the paragraph id stays unchanged and shares the same paragraph vector, which is equivalent to using the semantics of the whole judgment document each time the probability of a word is predicted;
s5.2: mapping each word to a vector, namely the word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as the input of the softmax output layer;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount.
Preferably, the specific step of calculating the corresponding doc2vec center vector according to label in step S5 is as follows:
s5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label.
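Step S5.5 reduces to an element-wise mean; a minimal sketch (with a hypothetical function name) assuming every paragraph vector has the same dimension:

```python
def center_vector(paragraph_vectors):
    """doc2vec center vector of a label: the element-wise average of all
    paragraph vectors under that label (step S5.5)."""
    n = len(paragraph_vectors)
    dim = len(paragraph_vectors[0])
    return [sum(v[i] for v in paragraph_vectors) / n for i in range(dim)]
```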
Preferably, in step S6, the specific steps of calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distances between it and the center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label, are as follows:
s6.1: allocating a new paragraph id to the judgment document to be predicted, keeping the word vectors and the parameters of the softmax output layer fixed at the values obtained in the training stage, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
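The prediction of step S6 is nearest-centroid classification under Euclidean distance; a sketch with hypothetical helper names:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2), the distance of step S6.2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_label(doc_vec, center_vectors):
    """Return the label whose doc2vec center vector is closest to doc_vec
    (steps S6.2 and S6.3). center_vectors maps each label to its center vector."""
    return min(center_vectors, key=lambda lab: euclidean(doc_vec, center_vectors[lab]))
```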
A document embedding-based long text case penalty range classification prediction device comprises:
a marking module;
a preprocessing module;
an importance calculation module, which comprises a term frequency calculation module, an inverse document frequency calculation module and a TFIDF value calculation module;
a keyword filtering module;
the model processing module comprises a document processing module, a word processing module, an input module, an output module and a mean vector calculating module;
a prediction module, which comprises a module for calculating the doc2vec vector of the document to be predicted, a Euclidean distance calculation module and a penalty acquisition module;
the device is used for carrying out the above document embedding-based long text case penalty range classification prediction method.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, through a machine learning method, a large amount of published case data is utilized, a machine is enabled to read a large amount of case facts, the legal professional experience hidden behind the data is learned, and the penalty range of a responsible person in the case is automatically given, so that the case handling efficiency is improved, and a large amount of manpower and material resources are saved. Meanwhile, common citizens can quickly know the range of penalties to face according to the fact of cases involved by the citizens.
Drawings
FIG. 1 is a flowchart of a document embedding-based long text case penalty range classification prediction method provided by the invention.
FIG. 2 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S3.
FIG. 3 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S5.
FIG. 4 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S6.
FIG. 5 is a diagram of a doc2vec c-bow training model provided by the present invention.
Fig. 6 is a diagram illustrating the results of sorting the variance of TFIDF values in example 1.
FIG. 7 is a schematic diagram of a document embedding-based penalty range classification prediction device for long-text cases.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described with reference to the drawings and the embodiments.
Example 1
A document embedding-based long text case penalty range classification prediction method, as shown in FIG. 1, comprises the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts and marking them with different labels according to the penalty amount; the penalty amounts are divided into eight ranges, (0, 2000], (2000, 3000], (3000, 4000], (4000, 5000], (5000, 6000], (6000, 10000], (10000, 500000], and 500000+, whose labels are marked 1 through 8, respectively;
s2: performing Chinese word segmentation on the judgment documents with known penalty amounts using a word segmentation tool, and filtering the stop words that appear using a Chinese stop-word list after segmentation;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, as shown in FIG. 3, and ranking the importance of the words by the variance of their TFIDF values across the different documents, specifically as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d;
S4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
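The keyword filtering of step S4 can be sketched as follows; `filter_keywords` is a hypothetical name, and `ranked_words` is the variance-sorted word list produced in step S3:

```python
def filter_keywords(tokens, ranked_words, k):
    """Keep only tokens that are among the top-k ranked keywords (step S4)."""
    keywords = set(ranked_words[:k])
    return [t for t in tokens if t in keywords]
```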
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label, as shown in FIG. 2, with the following specific steps:
s5.1: adding a paragraph id to each filtered judgment document with a known penalty amount and mapping it to a vector, namely the paragraph vector; during training the paragraph id stays unchanged and shares the same paragraph vector, which is equivalent to using the semantics of the whole judgment document each time the probability of a word is predicted;
s5.2: mapping each word to a vector, namely the word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as the input of the softmax output layer;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount;
S5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label to obtain the penalty range of the judgment document to be predicted, specifically as follows:
s6.1: allocating a new paragraph id to the judgment document to be predicted, keeping the parameters obtained in the training stage unchanged, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
Example 2
A document embedding-based long text case penalty range classification prediction device comprises:
a marking module;
a preprocessing module;
an importance calculation module, which comprises a term frequency calculation module, an inverse document frequency calculation module and a TFIDF value calculation module;
a keyword filtering module;
the model processing module comprises a document processing module, a word processing module, an input module, an output module and a mean vector calculating module;
a prediction module, which comprises a module for calculating the doc2vec vector of the document to be predicted, a Euclidean distance calculation module and a penalty acquisition module;
the device is used for carrying out the above document embedding-based long text case penalty range classification prediction method.
In a specific implementation, the marking module discretizes the penalty amount of the judgment documents with known penalty amounts and marks them with different labels according to the penalty amount; the preprocessing module performs word segmentation and stop-word removal on the judgment documents with known penalty amounts; the term frequency calculation module calculates the term frequency (TF) of word w as follows:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
the inverse document frequency calculation module calculates the inverse document frequency (IDF) of word w as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
the TFIDF value calculation module multiplies the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d;
the importance calculation module also ranks the importance of words by the variance of each word's TFIDF value across the different documents;
the keyword filtering module takes the top k ranked words as keywords, filters the words of the judgment documents with known penalty amounts, and keeps only the words that are keywords;
the document processing module adds a paragraph id1 to each filtered judgment document with a known penalty amount and maps it to a paragraph vector; during training the paragraph id1 stays unchanged and shares the same paragraph vector;
the word processing module maps each word to a word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
the input module accumulates or concatenates the paragraph vector and the word vectors as input;
the output module outputs the paragraph vector of each filtered judgment document with a known penalty amount;
the mean vector calculation module, for the filtered judgment documents with known penalty amounts under each label, takes the average of all paragraph vectors under that label as the doc2vec center vector of the label;
the module for calculating the doc2vec vector of the document to be predicted allocates a new paragraph id2 to the judgment document to be predicted, keeps the word vectors and the output parameters fixed at the values obtained in the training stage, trains on the judgment document to be predicted with gradient descent, and obtains its doc2vec vector after convergence;
the Euclidean distance calculation module calculates the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector, and the calculation mode is as follows:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
and the penalty acquisition module takes the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (6)

1. A document embedding-based long text case penalty range classification prediction method is characterized by comprising the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts, and marking the judgment documents with different labels according to the penalty amount;
s2: performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the words by the variance of their TFIDF values across the different documents;
s4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label of the judgment document to be predicted, thereby obtaining its penalty range;
the step S5 comprises the following specific steps:
s5.1: adding a paragraph id1 to each filtered judgment document with a known penalty amount and mapping it to a paragraph vector; during training the paragraph id1 stays unchanged and shares the same paragraph vector;
s5.2: mapping each word to a word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as input;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount;
s5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label.
2. The document embedding-based long text case penalty range classification prediction method according to claim 1, characterized in that the step S1 specifically comprises:
the penalty amounts are divided into eight ranges (0,2000 ], (2000,3000 ], (3000,4000 ], (4000,5000 ], (5000,6000 ], (6000,10000 ], (10000,500000 ], and 500000+ with label labeled 1 through 8, respectively.
3. The document embedding-based long text case penalty range classification prediction method according to claim 2, characterized in that the step S2 is specifically:
segmenting the judgment documents with known penalty amounts using a word segmentation tool (such as the jieba or Pangu segmenter), and filtering the stop words that appear using the stop-word list after segmentation.
4. The document embedding-based long text case penalty range classification prediction method according to claim 1, wherein the specific steps of calculating the TFIDF value of each word in the documents under different labels in step S3 are as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d.
5. The document embedding-based long text case penalty range classification prediction method according to claim 4, wherein, in step S6, the Euclidean distance between the doc2vec vector of the judgment document to be predicted and the doc2vec center vector of each label is calculated, and the label with the smallest Euclidean distance is taken as the predicted label, with the following specific steps:
s6.1: allocating a new paragraph id2 to the judgment document to be predicted, keeping the word vectors and the output parameters fixed at the values obtained in the training stage, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
6. A document embedding-based classification and prediction device for penalty range of long text cases comprises:
the marking module is used for discretizing the penalty amounts of judgment documents with known penalty amounts and attaching different labels to the judgment documents according to their penalty amounts;
the preprocessing module is used for performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
the importance calculation module is used for splicing the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the importance of words according to the variance of each word's TFIDF values across the different documents; it comprises a word frequency calculation module, used for calculating the word frequency (TF) of the word w in the following way:

TF(w, d) = Count(w|d) / Count(d)

in the formula, TF(w, d) represents the word frequency (TF) of the word w in the document classified by label d, Count(w|d) represents the number of times the word w appears in the document classified by label d, and Count(d) represents the number of words in the document classified by label d; the inverse document frequency calculation module is used for calculating the inverse document frequency (IDF) of the word w in the following way:
idf(w, D) = log( N / N_w )

in the formula, idf(w, D) represents the inverse document frequency of the word w in the whole document set D, N represents the number of documents, and N_w represents the number of documents in which the word w appears; the TFIDF value calculation module is used for multiplying the word frequency (TF) of the word w by the inverse document frequency (IDF) to obtain the TFIDF value of the word w in the document classified by label d;
the keyword filtering module is used for taking the top k ranked words as keywords and filtering the judgment documents with known penalty amounts so that only the keywords are kept;
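The variance-based importance ranking and the keyword filtering described by these modules can be sketched as follows; this assumes the per-label TFIDF values have already been computed, and the function names are our own illustrations:

```python
def top_k_keywords(tfidf_by_label, k):
    """tfidf_by_label: dict label -> {word: TFIDF}.  Ranks words by the
    variance of their TFIDF values across labels; returns the top k."""
    labels = list(tfidf_by_label)
    vocab = set().union(*(d.keys() for d in tfidf_by_label.values()))
    def variance(word):
        # Words absent from a label's document have TFIDF 0 there.
        vals = [tfidf_by_label[lbl].get(word, 0.0) for lbl in labels]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    return sorted(vocab, key=variance, reverse=True)[:k]

def filter_to_keywords(tokens, keywords):
    """Keep only tokens that are keywords, preserving order."""
    kw = set(keywords)
    return [t for t in tokens if t in kw]
```

High variance means a word's weight differs sharply between penalty-range labels, so the top-k words are those most useful for separating the classes.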
the model processing module is used for training a doc2vec model with the filtered judgment documents with known penalty amounts and calculating the corresponding doc2vec center vector for each label; it comprises a document processing module, used for adding a paragraph id1 to each filtered judgment document with a known penalty amount and mapping the paragraph id1 to a paragraph vector, where the paragraph id1 is kept unchanged and shares the same paragraph vector throughout training; the word processing module is used for mapping each word to a word vector, where the paragraph vector has the same dimension as the word vectors but comes from a different vector space; the input module is used for summing or concatenating the paragraph vector and the word vectors as input; the output module is used for outputting the paragraph vector of each filtered judgment document with a known penalty amount; and the average vector calculation module is used for taking, for the judgment documents with known penalty amounts under each label, the average of all paragraph vectors under that label as the doc2vec center vector of the label;
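The average vector calculation module's step, computing each label's doc2vec center vector as the mean of its paragraph vectors, can be sketched as follows; this assumes the paragraph vectors have already been produced by a trained doc2vec model, and the function name is illustrative:

```python
def label_centers(paragraph_vecs, labels):
    """paragraph_vecs: list of paragraph vectors (lists of floats);
    labels: the penalty-range label of each document, same order.
    Returns label -> component-wise mean vector (the doc2vec center)."""
    sums, counts = {}, {}
    for vec, lbl in zip(paragraph_vecs, labels):
        if lbl not in sums:
            sums[lbl] = [0.0] * len(vec)
            counts[lbl] = 0
        sums[lbl] = [s + v for s, v in zip(sums[lbl], vec)]
        counts[lbl] += 1
    return {lbl: [s / counts[lbl] for s in sums[lbl]] for lbl in sums}
```

These per-label centers are exactly the reference points that the prediction module later compares against by Euclidean distance.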
the prediction module is used for calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distances between this vector and the doc2vec center vectors of all labels, taking the label with the nearest Euclidean distance as the predicted label value and thus obtaining the penalty range of the judgment document to be predicted; the Euclidean distance calculation module is used for calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
d(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )

wherein d(x, y) represents the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector, x_i is the i-th component of the doc2vec vector of the judgment document to be predicted, y_i is the i-th component of each label center vector, i = 1, 2, ..., n, and n is the dimension of the doc2vec vector of the judgment document to be predicted and of each label center vector; and the penalty obtaining module is used for taking the label value of the cluster with the minimum Euclidean distance as the predicted label value of the document to be predicted, obtaining the document's penalty range.
CN201811237399.XA 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device Active CN109614606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811237399.XA CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811237399.XA CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Publications (2)

Publication Number Publication Date
CN109614606A CN109614606A (en) 2019-04-12
CN109614606B true CN109614606B (en) 2023-02-03

Family

ID=66002042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811237399.XA Active CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Country Status (1)

Country Link
CN (1) CN109614606B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472011B (en) * 2019-07-19 2023-07-14 平安科技(深圳)有限公司 Litigation cost prediction method and device and terminal equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001014992A1 (en) * 1999-08-25 2001-03-01 Kent Ridge Digital Labs Document classification apparatus
JP2006293616A (en) * 2005-04-08 2006-10-26 Nippon Telegr & Teleph Corp <Ntt> Document aggregating method, and device and program
CN105447750A (en) * 2015-11-17 2016-03-30 小米科技有限责任公司 Information identification method, apparatus, terminal and server
CN107133283A (en) * 2017-04-17 2017-09-05 北京科技大学 A kind of Legal ontology knowledge base method for auto constructing
CN107578270A (en) * 2017-08-03 2018-01-12 中国银联股份有限公司 A kind of construction method, device and the computing device of financial label


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Judicial Intelligence Based on Deep Learning; Deng Wenchao; China Master's Theses Full-text Database, Social Sciences I; 2018-02-15 (No. 2); pp. 10-22, 38-47 *

Also Published As

Publication number Publication date
CN109614606A (en) 2019-04-12

Similar Documents

Publication Publication Date Title
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN104834686B (en) A kind of video recommendation method based on mixing semantic matrix
CN109783818A (en) A kind of enterprises &#39; industry multi-tag classification method
Bhatia et al. End-to-end resume parsing and finding candidates for a job description using bert
CN106776538A The information extracting method of enterprise's noncanonical format document
CN109241285A (en) A kind of device of the judicial decision in a case of auxiliary based on machine learning
CN109271521A (en) A kind of file classification method and device
CN101877064B (en) Image classification method and image classification device
CN111967302A (en) Video tag generation method and device and electronic equipment
CN109241297B (en) Content classification and aggregation method, electronic equipment, storage medium and engine
CN112632980A (en) Enterprise classification method and system based on big data deep learning and electronic equipment
US11429810B2 (en) Question answering method, terminal, and non-transitory computer readable storage medium
CN107832287A (en) A kind of label identification method and device, storage medium, terminal
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN113420145B (en) Semi-supervised learning-based bid-bidding text classification method and system
CN109446423B (en) System and method for judging sentiment of news and texts
CN108460098A (en) Information recommendation method, device and computer equipment
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN109271516A (en) Entity type classification method and system in a kind of knowledge mapping
CN112836509A (en) Expert system knowledge base construction method and system
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers
CN108764302A (en) A kind of bill images sorting technique based on color characteristic and bag of words feature
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zheng Zibin

Inventor after: Zhuang Yeguang

Inventor after: Zhou Xiaocong

Inventor before: Zhuang Yeguang

GR01 Patent grant