CN109614606B - Document embedding-based long text case penalty range classification prediction method and device - Google Patents


Info

Publication number
CN109614606B
Authority
CN
China
Prior art keywords
document
vector
word
label
predicted
Prior art date
Legal status
Active
Application number
CN201811237399.XA
Other languages
Chinese (zh)
Other versions
CN109614606A (en)
Inventor
郑子彬
庄业广
周晓聪
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201811237399.XA
Publication of CN109614606A
Application granted
Publication of CN109614606B
Legal status: Active

Classifications

    • G06F40/216 Parsing using statistical methods
    • G06F18/24 Classification techniques
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention discloses a document embedding-based classification prediction method for the penalty range of long text cases, which comprises the following steps. S1: discretize the penalty amount and assign different labels according to the penalty amount. S2: perform word segmentation and stop-word removal. S3: concatenate the judgment documents under the same label into one document, calculate the TFIDF value of each word in the documents under different labels, and rank the importance of the words by the variance of their TFIDF values across those documents. S4: take the top k ranked words as keywords and keep only the words that are keywords. S5: train a doc2vec model with the filtered judgment documents and obtain the corresponding doc2vec center vectors. S6: calculate the doc2vec vector of the document to be predicted and its distance to the doc2vec center vector of each label, and take the nearest label as the predicted label, giving the penalty range of the document to be predicted. Using machine learning and big data, the invention automatically gives the penalty range of the responsible person in a case, which improves case handling efficiency and reduces the human resources required.

Description

Document embedding-based long text case penalty range classification prediction method and device
Technical Field
The invention relates to the field of text classification, in particular to a document embedding-based long text case penalty range classification prediction method and device.
Background
With the development of artificial intelligence technology and the broad application of judicial big data, business personnel hope that machines can read large numbers of cases and automatically give the penalty range of the responsible persons, so as to improve case handling efficiency. It also helps citizens quickly learn the penalties they may face based on the relevant case facts.
Disclosure of Invention
The invention provides a document embedding-based method and device for classifying and predicting the penalty range of long text cases, aiming to overcome at least one defect of the prior art.
The present invention aims to solve the above technical problem at least to some extent.
The invention aims to use a large amount of public case data, let a machine read a large number of case facts, learn the legal professional experience hidden behind the data, and automatically give the penalty range of the responsible person in a case, so as to improve case handling efficiency.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a document embedding-based long text case penalty range classification prediction method comprises the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts, and marking the judgment documents with different labels according to the penalty amount;
s2: performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the words by the variance of their TFIDF values across the different documents;
s4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label of the judgment document to be predicted, thereby obtaining its penalty range.
The method uses public judgment documents with known penalty amounts, classifies and segments them according to the penalty amount, and then trains a doc2vec model. The trained model then predicts the penalty amount of a judgment document to be predicted, which saves a large amount of manpower and material resources, enables fast penalty judgment, and speeds up the case trial process.
Preferably, the step S1 discretizes the penalty amount of the decision document with the known penalty amount by:
the penalty amounts are divided into eight ranges (0,2000 ], (2000,3000 ], (3000,4000 ], (4000,5000 ], (5000,6000 ], (6000,10000 ], (10000,500000 ], and 500000+ with label labeled 1 through 8, respectively.
Preferably, the specific steps of word segmentation and stop-word removal for the judgment documents with known penalty amounts in step S2 are as follows:
Chinese word segmentation is performed on the judgment documents with known penalty amounts using a word segmentation tool, and after segmentation the stop words that appear are filtered out using a Chinese stop-word list.
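A minimal sketch of the stop-word removal in step S2, assuming the document has already been segmented into tokens (the patent relies on an external Chinese segmentation tool and stop-word list, so the tiny stop-word set here is purely illustrative):

```python
# Illustrative stop-word set; a real pipeline would load a full Chinese stop-word list.
STOP_WORDS = {"的", "了", "在", "是"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list (step S2)."""
    return [t for t in tokens if t not in STOP_WORDS]
```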
Preferably, the specific steps of calculating TFIDF values of each word in the documents under different labels in step S3 are as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d.
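The TFIDF computation of steps S3.1 to S3.3, together with the variance-based ranking of step S3, can be sketched as follows. The helper names are hypothetical, and the IDF uses log(N/N_w) exactly as written above:

```python
import math
from collections import Counter

def tfidf_by_label(label_docs):
    """label_docs maps each label to the token list of its concatenated document.
    Returns {word: {label: tfidf}} using TF(w,d) = Count(w|d)/Count(d) and
    idf(w,D) = log(N / N_w)."""
    N = len(label_docs)                                  # number of label documents
    counts = {d: Counter(toks) for d, toks in label_docs.items()}
    df = Counter(w for c in counts.values() for w in c)  # N_w: documents containing w
    scores = {}
    for w in df:
        scores[w] = {
            d: (c[w] / sum(c.values())) * math.log(N / df[w])
            for d, c in counts.items()
        }
    return scores

def rank_by_variance(scores):
    """Order words by the variance of their TFIDF values across labels, descending."""
    def variance(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    return sorted(scores, key=lambda w: variance(list(scores[w].values())), reverse=True)
```

A word that is frequent in one label's document but absent from the others gets a high variance and therefore ranks as a discriminative keyword, while a word spread evenly across labels ranks low.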
Preferably, the specific steps of training a doc2vec model in step S5 using the filtered decision document with known penalty amount are as follows:
s5.1: adding a paragraph id to each filtered judgment document with a known penalty amount and mapping it to a vector, namely the paragraph vector; during training the paragraph id stays unchanged and shares the same paragraph vector, which is equivalent to using the semantics of the whole judgment document each time the probability of a word is predicted;
s5.2: mapping each word to a vector, namely the word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as the input of the softmax output layer;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount.
Preferably, the specific step of calculating the corresponding doc2vec center vector according to label in step S5 is as follows:
s5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label.
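Step S5.5 reduces to an element-wise mean; a minimal sketch (with a hypothetical function name) assuming every paragraph vector has the same dimension:

```python
def center_vector(paragraph_vectors):
    """doc2vec center vector of a label: the element-wise average of all
    paragraph vectors under that label (step S5.5)."""
    n = len(paragraph_vectors)
    dim = len(paragraph_vectors[0])
    return [sum(v[i] for v in paragraph_vectors) / n for i in range(dim)]
```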
Preferably, in step S6, the specific steps of calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distances between it and the center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label, are as follows:
s6.1: allocating a new paragraph id to the judgment document to be predicted, keeping the word vectors and the parameters of the softmax output layer fixed at the values obtained in the training stage, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
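The prediction of step S6 is nearest-centroid classification under Euclidean distance; a sketch with hypothetical helper names:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2), the distance of step S6.2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_label(doc_vec, center_vectors):
    """Return the label whose doc2vec center vector is closest to doc_vec
    (steps S6.2 and S6.3). center_vectors maps each label to its center vector."""
    return min(center_vectors, key=lambda lab: euclidean(doc_vec, center_vectors[lab]))
```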
A document embedding-based long text case penalty range classification prediction device comprises:
a marking module;
a preprocessing module;
an importance calculation module, which comprises a term frequency calculation module, an inverse document frequency calculation module and a TFIDF value calculation module;
a keyword filtering module;
the model processing module comprises a document processing module, a word processing module, an input module, an output module and a mean vector calculating module;
a prediction module, which comprises a module for calculating the doc2vec vector of the document to be predicted, a Euclidean distance calculation module and a penalty acquisition module;
the device is used for carrying out the above document embedding-based long text case penalty range classification prediction method.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, through a machine learning method, a large amount of published case data is utilized, a machine is enabled to read a large amount of case facts, the legal professional experience hidden behind the data is learned, and the penalty range of a responsible person in the case is automatically given, so that the case handling efficiency is improved, and a large amount of manpower and material resources are saved. Meanwhile, common citizens can quickly know the range of penalties to face according to the fact of cases involved by the citizens.
Drawings
FIG. 1 is a flowchart of a document embedding-based long text case penalty range classification prediction method provided by the invention.
FIG. 2 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S3.
FIG. 3 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S5.
FIG. 4 is a flowchart of the document embedding-based penalty range classification prediction method for long text cases according to the present invention, step S6.
FIG. 5 is a diagram of a doc2vec c-bow training model provided by the present invention.
Fig. 6 is a diagram illustrating the results of sorting the variance of TFIDF values in example 1.
FIG. 7 is a schematic diagram of a document embedding-based penalty range classification prediction device for long-text cases.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described with reference to the drawings and the embodiments.
Example 1
A document embedding-based long text case penalty range classification prediction method, as shown in FIG. 1, comprises the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts and marking them with different labels according to the penalty amount; the penalty amounts are divided into eight ranges, (0, 2000], (2000, 3000], (3000, 4000], (4000, 5000], (5000, 6000], (6000, 10000], (10000, 500000], and 500000+, whose labels are marked 1 through 8, respectively;
s2: performing Chinese word segmentation on the judgment documents with known penalty amounts using a word segmentation tool, and filtering the stop words that appear using a Chinese stop-word list after segmentation;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, as shown in FIG. 3, and ranking the importance of the words by the variance of their TFIDF values across the different documents, specifically as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d;
S4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
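The keyword filtering of step S4 can be sketched as follows; `filter_keywords` is a hypothetical name, and `ranked_words` is the variance-sorted word list produced in step S3:

```python
def filter_keywords(tokens, ranked_words, k):
    """Keep only tokens that are among the top-k ranked keywords (step S4)."""
    keywords = set(ranked_words[:k])
    return [t for t in tokens if t in keywords]
```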
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label, as shown in FIG. 2, with the following specific steps:
s5.1: adding a paragraph id to each filtered judgment document with a known penalty amount and mapping it to a vector, namely the paragraph vector; during training the paragraph id stays unchanged and shares the same paragraph vector, which is equivalent to using the semantics of the whole judgment document each time the probability of a word is predicted;
s5.2: mapping each word to a vector, namely the word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as the input of the softmax output layer;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount;
S5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label to obtain the penalty range of the judgment document to be predicted, specifically as follows:
s6.1: allocating a new paragraph id to the judgment document to be predicted, keeping the parameters obtained in the training stage unchanged, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
Example 2
A document embedding-based long text case penalty range classification prediction device comprises:
a marking module;
a preprocessing module;
an importance calculation module, which comprises a term frequency calculation module, an inverse document frequency calculation module and a TFIDF value calculation module;
a keyword filtering module;
the model processing module comprises a document processing module, a word processing module, an input module, an output module and a mean vector calculating module;
a prediction module, which comprises a module for calculating the doc2vec vector of the document to be predicted, a Euclidean distance calculation module and a penalty acquisition module;
the device is used for carrying out the above document embedding-based long text case penalty range classification prediction method.
In a specific implementation, the marking module discretizes the penalty amount of the judgment documents with known penalty amounts and marks them with different labels according to the penalty amount; the preprocessing module performs word segmentation and stop-word removal on the judgment documents with known penalty amounts; the term frequency calculation module calculates the term frequency (TF) of word w as follows:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
the inverse document frequency calculation module calculates the inverse document frequency (IDF) of word w as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
the TFIDF value calculation module multiplies the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d;
the importance calculation module also ranks the importance of words by the variance of each word's TFIDF value across the different documents;
the keyword filtering module takes the top k ranked words as keywords, filters the words of the judgment documents with known penalty amounts, and keeps only the words that are keywords;
the document processing module adds a paragraph id1 to each filtered judgment document with a known penalty amount and maps it to a paragraph vector; during training the paragraph id1 stays unchanged and shares the same paragraph vector;
the word processing module maps each word to a word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
the input module accumulates or concatenates the paragraph vector and the word vectors as input;
the output module outputs the paragraph vector of each filtered judgment document with a known penalty amount;
the mean vector calculation module, for the filtered judgment documents with known penalty amounts under each label, takes the average of all paragraph vectors under that label as the doc2vec center vector of the label;
the module for calculating the doc2vec vector of the document to be predicted allocates a new paragraph id2 to the judgment document to be predicted, keeps the word vectors and the output parameters fixed at the values obtained in the training stage, trains on the judgment document to be predicted with gradient descent, and obtains its doc2vec vector after convergence;
the Euclidean distance calculation module calculates the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector, and the calculation mode is as follows:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
and the penalty acquisition module takes the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (6)

1. A document embedding-based long text case penalty range classification prediction method is characterized by comprising the following steps:
s1: discretizing the penalty amount of the judgment documents with known penalty amounts, and marking the judgment documents with different labels according to the penalty amount;
s2: performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
s3: concatenating the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the words by the variance of their TFIDF values across the different documents;
s4: taking the top k ranked words as keywords, filtering the words of the judgment documents with known penalty amounts, and keeping only the words that are keywords;
s5: training a doc2vec model with the filtered judgment documents with known penalty amounts, and calculating the corresponding doc2vec center vector for each label;
s6: calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distance between this vector and the doc2vec center vector of each label, and taking the label with the smallest Euclidean distance as the predicted label of the judgment document to be predicted, thereby obtaining its penalty range;
the step S5 comprises the following specific steps:
s5.1: adding a paragraph id1 to each filtered judgment document with a known penalty amount and mapping it to a paragraph vector; during training the paragraph id1 stays unchanged and shares the same paragraph vector;
s5.2: mapping each word to a word vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space;
s5.3: accumulating or concatenating the paragraph vector and the word vectors as input;
s5.4: outputting the paragraph vector of each filtered judgment document with a known penalty amount;
s5.5: for the filtered judgment documents with known penalty amounts under each label, taking the average of all paragraph vectors under that label as the doc2vec center vector of the label.
2. The document embedding-based long text case penalty range classification prediction method according to claim 1, characterized in that the step S1 specifically comprises:
the penalty amounts are divided into eight ranges (0,2000 ], (2000,3000 ], (3000,4000 ], (4000,5000 ], (5000,6000 ], (6000,10000 ], (10000,500000 ], and 500000+ with label labeled 1 through 8, respectively.
3. The document embedding-based long text case penalty range classification prediction method according to claim 2, characterized in that the step S2 is specifically:
segmenting the judgment documents with known penalty amounts using a word segmentation tool (such as the jieba or Pangu segmenter), and filtering the stop words that appear using the stop-word list after segmentation.
4. The document embedding-based long text case penalty range classification prediction method according to claim 1, wherein the specific steps of calculating the TFIDF value of each word in the documents under different labels in step S3 are as follows:
s3.1: calculating the word frequency (TF) of the word w in the following way:
$$\mathrm{TF}(w,d)=\frac{\mathrm{Count}(w\mid d)}{\mathrm{Count}(d)}$$
where TF(w, d) is the term frequency (TF) of word w in the document with label d, Count(w|d) is the number of times word w appears in the document with label d, and Count(d) is the total number of words in the document with label d;
s3.2: the Inverse Document Frequency (IDF) of the word w is calculated as follows:
$$\mathrm{idf}(w,D)=\log\frac{N}{N_w}$$
where idf(w, D) is the inverse document frequency of word w over the whole document set D, N is the number of documents, and N_w is the number of documents in which word w appears;
s3.3: multiplying the term frequency (TF) of word w by its inverse document frequency (IDF) to obtain the TFIDF value of word w in the document with label d.
5. The document embedding-based long text case penalty range classification prediction method according to claim 4, wherein, in step S6, the Euclidean distance between the doc2vec vector of the judgment document to be predicted and the doc2vec center vector of each label is calculated, and the label with the smallest Euclidean distance is taken as the predicted label, with the following specific steps:
s6.1: allocating a new paragraph id2 to the judgment document to be predicted, keeping the word vectors and the output parameters fixed at the values obtained in the training stage, training on the judgment document to be predicted with gradient descent, and obtaining its doc2vec vector after convergence;
s6.2: calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
where d(x, y) is the Euclidean distance between the doc2vec vector of the judgment document to be predicted and a label center vector, x_i and y_i are the components of the doc2vec vector and of the label center vector respectively, i = 1, 2, …, n, and n is the dimension of the doc2vec vector and of the label center vectors;
s6.3: taking the label of the cluster with the smallest Euclidean distance as the predicted label of the document to be predicted, thereby obtaining the penalty range of the document.
6. A document embedding-based classification and prediction device for penalty range of long text cases comprises:
the marking module is used for discretizing the penalty amounts of judgment documents with known penalty amounts and attaching different labels to the judgment documents according to their penalty amounts;
the preprocessing module is used for performing word segmentation and stop-word removal on the judgment documents with known penalty amounts;
the importance calculation module is used for splicing the judgment documents with known penalty amounts under the same label into one document, calculating the TFIDF value of each word in the documents under different labels, and ranking the importance of words according to the variance of each word's TFIDF values across the different documents; it comprises a word frequency calculation module, used for calculating the word frequency (TF) of the word w in the following way:

TF(w, d) = Count(w|d) / Count(d)

in the formula, TF(w, d) represents the word frequency (TF) of the word w in the document classified by label d, Count(w|d) represents the number of times the word w appears in the document classified by label d, and Count(d) represents the number of words in the document classified by label d; the inverse document frequency calculation module is used for calculating the inverse document frequency (IDF) of the word w in the following way:
idf(w, D) = log( N / N_w )

in the formula, idf(w, D) represents the inverse document frequency of the word w in the whole document set D, N represents the number of documents, and N_w represents the number of documents in which the word w appears; the TFIDF value calculation module is used for multiplying the word frequency (TF) of the word w by the inverse document frequency (IDF) to obtain the TFIDF value of the word w in the document classified by label d;
the keyword filtering module is used for taking the top k ranked words as keywords and filtering the judgment documents with known penalty amounts so that only the keywords are kept;
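The variance-based importance ranking and the keyword filtering described by these modules can be sketched as follows; this assumes the per-label TFIDF values have already been computed, and the function names are our own illustrations:

```python
def top_k_keywords(tfidf_by_label, k):
    """tfidf_by_label: dict label -> {word: TFIDF}.  Ranks words by the
    variance of their TFIDF values across labels; returns the top k."""
    labels = list(tfidf_by_label)
    vocab = set().union(*(d.keys() for d in tfidf_by_label.values()))
    def variance(word):
        # Words absent from a label's document have TFIDF 0 there.
        vals = [tfidf_by_label[lbl].get(word, 0.0) for lbl in labels]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    return sorted(vocab, key=variance, reverse=True)[:k]

def filter_to_keywords(tokens, keywords):
    """Keep only tokens that are keywords, preserving order."""
    kw = set(keywords)
    return [t for t in tokens if t in kw]
```

High variance means a word's weight differs sharply between penalty-range labels, so the top-k words are those most useful for separating the classes.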
the model processing module is used for training a doc2vec model with the filtered judgment documents with known penalty amounts and calculating the corresponding doc2vec center vector for each label; it comprises a document processing module, used for adding a paragraph id1 to each filtered judgment document with a known penalty amount and mapping the paragraph id1 to a paragraph vector, where the paragraph id1 is kept unchanged and shares the same paragraph vector throughout training; the word processing module is used for mapping each word to a word vector, where the paragraph vector has the same dimension as the word vectors but comes from a different vector space; the input module is used for summing or concatenating the paragraph vector and the word vectors as input; the output module is used for outputting the paragraph vector of each filtered judgment document with a known penalty amount; and the average vector calculation module is used for taking, for the judgment documents with known penalty amounts under each label, the average of all paragraph vectors under that label as the doc2vec center vector of the label;
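The average vector calculation module's step, computing each label's doc2vec center vector as the mean of its paragraph vectors, can be sketched as follows; this assumes the paragraph vectors have already been produced by a trained doc2vec model, and the function name is illustrative:

```python
def label_centers(paragraph_vecs, labels):
    """paragraph_vecs: list of paragraph vectors (lists of floats);
    labels: the penalty-range label of each document, same order.
    Returns label -> component-wise mean vector (the doc2vec center)."""
    sums, counts = {}, {}
    for vec, lbl in zip(paragraph_vecs, labels):
        if lbl not in sums:
            sums[lbl] = [0.0] * len(vec)
            counts[lbl] = 0
        sums[lbl] = [s + v for s, v in zip(sums[lbl], vec)]
        counts[lbl] += 1
    return {lbl: [s / counts[lbl] for s in sums[lbl]] for lbl in sums}
```

These per-label centers are exactly the reference points that the prediction module later compares against by Euclidean distance.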
the prediction module is used for calculating the doc2vec vector of the judgment document to be predicted and the Euclidean distances between this vector and the doc2vec center vectors of all labels, taking the label with the nearest Euclidean distance as the predicted label value and thus obtaining the penalty range of the judgment document to be predicted; the Euclidean distance calculation module is used for calculating the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector in the following way:
d(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )

wherein d(x, y) represents the Euclidean distance between the doc2vec vector of the judgment document to be predicted and each label center vector, x_i is the i-th component of the doc2vec vector of the judgment document to be predicted, y_i is the i-th component of each label center vector, i = 1, 2, ..., n, and n is the dimension of the doc2vec vector of the judgment document to be predicted and of each label center vector; and the penalty obtaining module is used for taking the label value of the cluster with the minimum Euclidean distance as the predicted label value of the document to be predicted, obtaining the document's penalty range.
CN201811237399.XA 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device Active CN109614606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811237399.XA CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811237399.XA CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Publications (2)

Publication Number Publication Date
CN109614606A CN109614606A (en) 2019-04-12
CN109614606B true CN109614606B (en) 2023-02-03

Family

ID=66002042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811237399.XA Active CN109614606B (en) 2018-10-23 2018-10-23 Document embedding-based long text case penalty range classification prediction method and device

Country Status (1)

Country Link
CN (1) CN109614606B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472011B (en) * 2019-07-19 2023-07-14 平安科技(深圳)有限公司 Litigation cost prediction method and device and terminal equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001014992A1 (en) * 1999-08-25 2001-03-01 Kent Ridge Digital Labs Document classification apparatus
JP2006293616A (en) * 2005-04-08 2006-10-26 Nippon Telegr & Teleph Corp <Ntt> Document aggregating method, and device and program
CN105447750A (en) * 2015-11-17 2016-03-30 小米科技有限责任公司 Information identification method, apparatus, terminal and server
CN107133283A (en) * 2017-04-17 2017-09-05 北京科技大学 A kind of Legal ontology knowledge base method for auto constructing
CN107578270A (en) * 2017-08-03 2018-01-12 中国银联股份有限公司 A kind of construction method, device and the computing device of financial label


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Judicial Intelligence Based on Deep Learning; Deng Wenchao; China Master's Theses Full-text Database, Social Sciences I; 2018-02-15 (No. 2); pp. 10-22, 38-47 *

Also Published As

Publication number Publication date
CN109614606A (en) 2019-04-12

Similar Documents

Publication Publication Date Title
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN104834686B (en) A kind of video recommendation method based on mixing semantic matrix
CN109783818A (en) A kind of enterprises &#39; industry multi-tag classification method
Bhatia et al. End-to-end resume parsing and finding candidates for a job description using bert
CN106776538A The information extracting method of enterprise's noncanonical format document
CN109241285A (en) A kind of device of the judicial decision in a case of auxiliary based on machine learning
CN109271521A (en) A kind of file classification method and device
CN101877064B (en) Image classification method and image classification device
CN111967302A (en) Video tag generation method and device and electronic equipment
CN109241297B (en) Content classification and aggregation method, electronic equipment, storage medium and engine
CN112632980A (en) Enterprise classification method and system based on big data deep learning and electronic equipment
US11429810B2 (en) Question answering method, terminal, and non-transitory computer readable storage medium
CN107832287A (en) A kind of label identification method and device, storage medium, terminal
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN113420145B (en) Semi-supervised learning-based bid-bidding text classification method and system
CN109446423B (en) System and method for judging sentiment of news and texts
CN108460098A (en) Information recommendation method, device and computer equipment
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN109271516A (en) Entity type classification method and system in a kind of knowledge mapping
CN112836509A (en) Expert system knowledge base construction method and system
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers
CN108764302A (en) A kind of bill images sorting technique based on color characteristic and bag of words feature
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zheng Zibin

Inventor after: Zhuang Yeguang

Inventor after: Zhou Xiaocong

Inventor before: Zhuang Yeguang

GR01 Patent grant