CN112800211A - Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Info

Publication number: CN112800211A
Application number: CN202011625462.4A
Authority: CN
Inventors: 李参宏
Original assignee: Jiangsu Netmarch Technologies Co ltd
Current assignee: Jiangsu Netmarch Technologies Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-14

Abstract

The invention discloses a method for extracting the key information of the criminal process in a legal document based on the TextRank algorithm, which specifically comprises the following steps: preprocessing the text of the legal document and labeling the set words or parts of speech; obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set; converting the subject word set w obtained after text preprocessing into a vector representation to obtain a new process-text subject word set w_c; adding word position information and merging semantically similar words to obtain the final keyword ranking; splitting the legal document to be extracted into sentences; constructing the graph model of the TextRank algorithm from the obtained word vector representations, setting initial values, and iterating until convergence; and ranking the vertex scores of all sentences, taking the K highest-scoring sentences as the extracted key information of the criminal process, ordering the K sentences, and removing redundant information from them. The sentences finally retained are thereby made more coherent.

Description

Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Technical Field

The invention relates to the technical field of information content extraction methods, in particular to a method for extracting critical information in a criminal process in a legal document based on a TextRank algorithm.

Background

In recent years, the criminal processes of suspects described in legal documents have taken increasingly varied forms as criminal methods keep changing. Extracting the key information of the suspect's criminal process from a legal document is a prerequisite for document matching, sentencing prediction, and other downstream applications. Existing text information extraction methods have the following defects:

Extracting text information with a neural network algorithm requires a large document corpus and suffers from long training times and slow extraction of the key information of the criminal process, so it is not suitable for practical application.

Extracting information from a document with a statistics-based method is faster, and the most classical such algorithm is TextRank. However, TextRank considers only the similarity between sentence nodes: when constructing the edge relations between nodes in the graph model, it directly compares the number of common words the sentences contain to judge how closely two sentences are associated, and it ignores the chapter structure of the text as well as the position and semantic information of the sentences within it.

Moreover, legal documents differ from texts in other fields: the suspect's criminal process is concentrated in one part of the document and is expressed in more specialized language, so it cannot be extracted by directly applying existing text information extraction methods.

Therefore, it is necessary to provide a new extraction method to solve the above problems.

Disclosure of Invention

To solve the problems set forth in the background art above, the invention provides a method for extracting the critical information of the criminal process in a legal document based on the TextRank algorithm, which makes the finally retained sentences more coherent.

In order to achieve the purpose, the invention provides the following technical scheme: a method for extracting crime process key information in a legal document based on a TextRank algorithm specifically comprises the following steps:

Step A: preprocessing the text of the legal document and labeling the set words or parts of speech to obtain a preliminarily screened subject word set w = {w₁, w₂, …, w_n};

Step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set;

Step C: converting the subject word set w obtained after text preprocessing into a vector representation to obtain n-dimensional word vectors and a new process-text subject word set w_c;

Step D: adding word position information and merging semantically similar words to obtain the final keyword ranking;

Step E: splitting the legal document to be extracted into sentences, the sentence set of the whole text being expressed as S = {s₁, s₂, …, s_n}; at the same time, each sentence is preprocessed as in step A, all of its words are converted into word vector representations, and these are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being zero-padded;

Step F: constructing the graph model of the TextRank algorithm and, using the word vector representations obtained in step E and the set initial values, iterating until convergence;

Step G: ranking the vertex scores of all sentences from step F, taking the K highest-scoring sentences as the extracted key information of the criminal process, ordering the K sentences, and removing redundant information from them.

Step D comprises the following steps:

Step D1: when extracting word information from the text, a word located near the front of a sentence is given additional weight: a distance value is obtained from the word vectors produced by the CBOW model in the previous step and compared with the position average value to obtain distance information, and the closer the word is to the beginning of the sentence, the larger its weight P_i;

Step D2: calculating the similarity of the remaining words in the set w_c using cosine similarity, removing semantically similar words and, of each such pair, keeping the word with the larger TF-IDF_new value;

Step D3: sorting the remaining words by TF-IDF_new value and obtaining the final subject word set w of the legal document text according to a set threshold, which provides subject-word semantic support for finally extracting the critical information of the criminal process from the text.

In the step D1, the weight assignment method is as follows:

The TF-IDF value of the word obtained in step B is multiplied by the weight P_i to obtain TF-IDF_new (that is, TF-IDF_new = TF-IDF × P_i), which is the result of fusing the word position information.

In the above step D2, the similarity of the remaining words in the set w_c is calculated using cosine similarity, specifically as follows:

wordsim(w_x, w_y) = (w_x · w_y) / (‖w_x‖ · ‖w_y‖)

where the words w_x = (v₁, v₂, …, v_n) and w_y = (v₁, v₂, …, v_n) are both n-dimensional vector representations produced by the CBOW model in step C. The larger the value of wordsim(w_x, w_y), the higher the semantic similarity of the words; finally, semantically similar words are removed and, of the two, the word with the larger TF-IDF_new value is kept.

Step F comprises the following steps:

Step F1: the position of each sentence in the text and the subject word information are fused into the vertices of the graph model to calculate the vertex sentence scores; whether each sentence contains subject words is checked, since sentences containing subject words are clearly more critical, and the subject word information weight is defined;

Step F2: using the sentence matrix representations, the cosine similarity of two matrices is calculated and used as the edge relation weight in the graph model;

Step F3: the edge relation weights between vertices in the graph are initialized to 1, a value is set for the learning rate, and all vertex values and edge relation weights are iteratively recalculated until the model converges; the final score of each vertex serves as an important basis for determining the key sentences in the current legal document.

The step F1 includes: the definition of sentence position information weight is as follows:

the definition of the subject term information weight is as follows:

the vertex sentence score of the graph model is calculated as follows:

Score(i)＝P(s_i)*F(s_i)*TextRank(i)

where TextRank(i) is computed as in the classic TextRank algorithm, with the formula:

TextRank(i) = (1 - d) + d · Σ_{v_j ∈ Input(v_i)} [ w_ji / Σ_{v_k ∈ Output(v_j)} w_jk ] · TextRank(j)

where w_ij is the similarity coefficient between two sentences, calculated in step F2; Input(v_i) is the set S_all of all sentences of the current text obtained in step F; Output(v_j) is the set of other sentences linked from the current sentence; and d is the damping value, representing the probability that a vertex jumps to any other vertex in the graph.

The specific way of obtaining the inter-vertex edge relation weight in the graph model in the step F2 is as follows:

Step G comprises the following steps:

Step G1: ordering the K sentences; because simply arranging them from high to low by score would leave the final key information of the criminal process lacking integrity and continuity, the sentences are instead ordered according to the ordering principles of the invention to form a crime process key information set S_new;

Step G2: combining sentence and subject word information and removing redundant information using the MMR algorithm, as follows:

MR(S_i)＝α·Sim₁(S_i,S_m)-(1-α)·max[Sim₂(S_i,S_j)]

where α is a set value; the similarities are calculated in the same way as the graph model edge relation weights in step F; S_m is the text information sentence composed of the subject words from step D; the Sim₁ function reflects the relevance between the current sentence and the text information sentence composed of the text's subject words; and the Sim₂ function compares the current sentence with the other sentences contained in the crime process key information set S_new and takes the maximum value. The MR value of each sentence is obtained; when MR(S_i) is less than or equal to the damping value, the sentence is retained, and sentences above that value are removed, finally yielding the critical information of the criminal process in the legal document.

Compared with the prior art, the method for extracting the critical information of the criminal process in a legal document based on the TextRank algorithm has the following beneficial effects: subject words are extracted with the particularity of legal documents in mind; the vertex scores and the edge relations between vertices of the graph model in the TextRank algorithm are then calculated by fusing the text's subject words, sentence positions, and semantic relations; redundant information is handled by combining sentences and subject words; and finally the top-K scoring sentences are selected as the key information of the suspect's criminal process in the legal document. Because the method combines sentence and subject word information and uses the MMR algorithm to remove redundant information, the key information extracted from the legal document text better summarizes the full text while the continuity among sentences is kept.

Drawings

FIG. 1 is a schematic flow chart of a criminal process key information extraction method in a legal document based on a TextRank algorithm according to the present invention;

FIG. 2 is a schematic view of the process of adding word position information and merging semantic similar words to obtain the final keyword ranking information according to the present invention;

FIG. 3 is a schematic flow chart of the method for constructing the graph model of the TextRank algorithm, and performing iteration to convergence by using the word vector representation obtained in the step E and setting an initial value.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1 to 3, the present invention provides a method for extracting critical information of a criminal process in a legal document based on a TextRank algorithm, which is characterized by comprising the following steps:

Step A: preprocessing the text of the legal document and labeling the set words or parts of speech, mainly comprising word segmentation, stop word removal, and part-of-speech tagging, to obtain a preliminarily screened subject word set w = {w₁, w₂, …, w_n};

Specifically, after the input text is acquired in step A, preprocessing comprises the following. Step A1: word segmentation uses jieba, a well-performing Chinese word segmentation tool, to segment the text into words.

Step A2: a stop word list is compiled according to the textual characteristics of legal documents, and the constructed list is used to remove useless words from the legal document, mainly prepositions, auxiliary words, conjunctions, and the like.

Step A3: the jieba toolkit is used to tag the parts of speech of the process text, all non-nouns are removed, and the process-text subject word set w = {w₁, w₂, …, w_n} is obtained.
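Steps A2 and A3 can be sketched as a small filter. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes step A1 has already produced (word, tag) pairs, as a segmenter with part-of-speech tagging such as jieba would, and that noun tags begin with "n". The function name `extract_subject_words`, the sample stop word list, and the English sample tokens are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical stop word list; the method builds its own list for legal documents.
STOP_WORDS: Set[str] = {"the", "at", "and", "defendant"}


def extract_subject_words(
    tagged: Iterable[Tuple[str, str]], stop_words: Set[str] = STOP_WORDS
) -> List[str]:
    """Keep nouns that are not stop words, preserving first-occurrence order."""
    seen: Set[str] = set()
    subject_words: List[str] = []
    for word, tag in tagged:
        if word in stop_words:        # step A2: drop stop words
            continue
        if not tag.startswith("n"):   # step A3: drop non-nouns (assumed tag scheme)
            continue
        if word not in seen:          # the result is a set w = {w1, ..., wn}
            seen.add(word)
            subject_words.append(word)
    return subject_words


# (word, tag) pairs as a segmenter/tagger might emit them (step A1 assumed done).
tagged_text = [("defendant", "n"), ("at", "p"), ("mall", "n"), ("stole", "v"),
               ("phone", "n"), ("and", "c"), ("cash", "n"), ("phone", "n")]
subject_words = extract_subject_words(tagged_text)
```

Keeping first-occurrence order is a design choice of the sketch; it preserves the original order of the subject words, which the later sentence-ordering step relies on.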

Step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set. Step B specifically comprises: first, calculating the term frequency (TF) value by counting the number of times each word of the set w occurs in the text; next, calculating the inverse document frequency (IDF) value by determining the proportion of all legal documents (the document corpus) in which the word appears and taking the logarithm of the result, so that the smaller the proportion, the larger the IDF value and the stronger the word's ability to distinguish itself from other words; and finally, calculating the TF-IDF value of the word, which grows with the number of times the word occurs.

TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and data mining, and a common method for analyzing how frequently a word occurs in a document.
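The TF-IDF calculation of step B can be sketched as follows. This is an illustrative sketch that assumes a raw occurrence count for TF and an unsmoothed logarithm of the inverse document proportion for IDF; the text does not fix either choice, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc_words: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """TF-IDF of each distinct word in doc_words against a tokenised corpus."""
    tf = Counter(doc_words)                  # TF: raw occurrence count in the text
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for word, count in tf.items():
        # Number of corpus documents containing the word (at least 1 to avoid /0).
        df = max(1, sum(1 for doc in corpus if word in doc))
        idf = math.log(n_docs / df)          # smaller proportion -> larger IDF
        scores[word] = count * idf
    return scores


corpus = [["theft", "phone", "mall"], ["fraud", "contract"], ["theft", "cash"]]
scores = tf_idf(["theft", "phone", "phone"], corpus)
```

In this toy corpus "phone" scores higher than "theft" because it occurs twice in the text and in only one of the three documents.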

Step C: converting the subject word set w obtained after text preprocessing into a vector representation to obtain n-dimensional word vectors and a new process-text subject word set w_c;

Specifically, the subject word set w after text preprocessing is converted into a vector representation form through a word2vec tool.

Specifically, the CBOW model is selected and trained with the hierarchical softmax method to obtain n-dimensional word vector representations, in preparation for computing word position information and the semantic similarity between words; the vectorized representation yields a new legal document subject word set w_c.

Step D: adding word position information and merging semantic similar words;

referring to fig. 2, in the step D, the steps include:

Step D1: when extracting word information from the text, a word located near the front of a sentence is given additional weight: a distance value is obtained from the word vectors produced by the CBOW model in the previous step and compared with the position average value to obtain distance information, and the closer the word is to the beginning of the sentence, the larger its weight P_i;

Step D2: calculating the similarity of the remaining words in the set w_c using cosine similarity, removing semantically similar words and, of each such pair, keeping the word with the larger TF-IDF_new value;

Step D3: sorting the remaining words by TF-IDF_new value and obtaining the final subject word set w of the legal document text according to a set threshold, which provides subject-word semantic support for finally extracting the critical information of the criminal process from the text.

In the step D1, the weight assignment method is as follows:

The TF-IDF value of the word obtained in step B is multiplied by the weight P_i to obtain TF-IDF_new (that is, TF-IDF_new = TF-IDF × P_i), which is the result of fusing the word position information.

In the above step D2, the similarity of the remaining words in the set w_c is calculated using cosine similarity, specifically as follows:

wordsim(w_x, w_y) = (w_x · w_y) / (‖w_x‖ · ‖w_y‖)
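Merging semantically similar words (steps D2 and D3) might look like the sketch below. It assumes that the TF-IDF_new values have already been computed, that "semantically similar" means a cosine similarity at or above a cut-off, and that a greedy pass from the highest to the lowest TF-IDF_new value is acceptable. The cut-off `sim_threshold`, the function names, and the toy vectors are hypothetical and not taken from the text.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_words(
    vectors: Dict[str, Sequence[float]],
    tfidf_new: Dict[str, float],
    sim_threshold: float = 0.9,   # hypothetical cut-off for "semantically similar"
) -> List[str]:
    """Drop the lower-scoring word of each similar pair; result is sorted by TF-IDF_new."""
    kept: List[str] = []
    # Visit words from highest to lowest TF-IDF_new so the larger value survives.
    for word in sorted(vectors, key=lambda w: tfidf_new[w], reverse=True):
        if all(cosine(vectors[word], vectors[k]) < sim_threshold for k in kept):
            kept.append(word)
    return kept


vectors = {"phone": [1.0, 0.0], "mobile": [0.99, 0.1], "cash": [0.0, 1.0]}
tfidf_new = {"phone": 2.0, "mobile": 1.5, "cash": 1.0}
kept_words = merge_similar_words(vectors, tfidf_new)
```

Here "mobile" is dropped because it is nearly parallel to "phone" and has the smaller TF-IDF_new value, while "cash" is kept.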

Step E: splitting the legal document to be extracted into sentences, the sentence set of the whole text being expressed as S = {s₁, s₂, …, s_n}; at the same time, each sentence is preprocessed as in step A, all of its words are converted into word vector representations, and these are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being zero-padded;
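The sentence matrix of step E can be illustrated with plain lists. The sketch assumes that each word vector becomes one column of the matrix and that words without a vector are treated as zero vectors; the latter is an assumption the text does not address, and the function name `sentence_matrix` is hypothetical.

```python
from typing import Dict, List, Sequence


def sentence_matrix(
    words: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    dim: int,
    max_len: int,
) -> List[List[float]]:
    """Return a dim x max_len matrix whose columns are the sentence's word vectors."""
    # One column per word; unknown words become zero vectors (assumption).
    columns = [list(vectors.get(w, [0.0] * dim)) for w in words[:max_len]]
    # Zero-pad sentences shorter than the longest sentence in the text.
    columns += [[0.0] * dim for _ in range(max_len - len(columns))]
    # Transpose the list of columns into rows, giving shape dim x max_len.
    return [[col[r] for col in columns] for r in range(dim)]


word_vectors = {"theft": [1.0, 2.0], "phone": [3.0, 4.0]}
matrix = sentence_matrix(["theft", "phone"], word_vectors, dim=2, max_len=3)
```

Because every sentence is padded to the same width m, any two sentence matrices have the same shape and can be compared directly in step F2.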

step F: constructing a graph model of the TextRank algorithm, and setting an initial value to iterate until convergence;

Step F comprises the following steps: step F1: the position of each sentence in the text and the subject word information are fused into the vertices of the graph model to calculate the vertex sentence scores; sentences at different positions in the text carry different weights, and the first sentence of a text usually contains the most information; at the same time, whether each sentence contains subject words is checked, since sentences containing subject words are clearly more critical, and the subject word information weight is defined;

step F2: through sentence matrix representation, cosine similarity of the two matrixes is obtained and used as an edge relation weight in the graph model;

referring to fig. 3, the above step F1 includes:

the definition of sentence position information weight is as follows:

the definition of the subject term information weight is as follows:

the vertex sentence score of the graph model is calculated as follows:

Score(i)＝P(s_i)*F(s_i)*TextRank(i)

where TextRank(i) is computed as in the classic TextRank algorithm of the prior art, with the formula:

TextRank(i) = (1 - d) + d · Σ_{v_j ∈ Input(v_i)} [ w_ji / Σ_{v_k ∈ Output(v_j)} w_jk ] · TextRank(j)

where w_ij is the similarity coefficient between two sentences, calculated in step F2; Input(v_i) is the set S_all of all sentences of the current text obtained in step F; Output(v_j) is the set of other sentences linked from the current sentence; and d is the damping value (0.85), representing the probability that a vertex jumps to any other vertex in the graph.
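The vertex scoring of step F1 can be sketched as a weighted TextRank iteration followed by the product Score(i) = P(s_i) * F(s_i) * TextRank(i). The sketch assumes the classic weighted TextRank update, a weight matrix with a zero diagonal, and position and subject word weights supplied by the caller, since their definitions are not reproduced in this text. It does not attempt the learning-rate-based adjustment of edge weights mentioned in step F3. The function names and the toy graph are hypothetical.

```python
from typing import List, Sequence


def textrank(weights: Sequence[Sequence[float]], d: float = 0.85,
             tol: float = 1e-6, max_iter: int = 200) -> List[float]:
    """Iterate the weighted TextRank update until scores change by less than tol.

    weights[j][i] is the edge weight from vertex j to vertex i; the diagonal
    is expected to be zero (no self-loops).
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]   # total outgoing weight per vertex
    ranks = [1.0] * n                         # arbitrary initial value
    for _ in range(max_iter):
        new_ranks = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * ranks[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            new_ranks.append((1 - d) + d * incoming)
        converged = max(abs(a - b) for a, b in zip(new_ranks, ranks)) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


def vertex_scores(position_w: Sequence[float], topic_w: Sequence[float],
                  ranks: Sequence[float]) -> List[float]:
    """Score(i) = P(s_i) * F(s_i) * TextRank(i), with P and F given by the caller."""
    return [p * f * r for p, f, r in zip(position_w, topic_w, ranks)]


# Toy graph: sentence 1 is linked to both others, so it should rank highest.
toy_weights = [[0.0, 2.0, 0.0], [2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
toy_ranks = textrank(toy_weights)
```

Multiplying by the position and subject word weights after convergence keeps the iteration identical to the classic algorithm; folding the weights into each iteration would be a different design and is not shown here.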

In the prior art, the classic TextRank algorithm reflects the relevance of two sentences mainly by counting the words that co-occur in them and ignores the semantic information between the sentences. It obtains the edge weight as follows:

Similarity(s_i, s_j) = |{w_k : w_k ∈ s_i and w_k ∈ s_j}| / (log|s_i| + log|s_j|)

The invention instead uses the sentence matrix representations obtained in step E and calculates the cosine similarity of the two matrices as the edge relation weight in the graph model. The specific way of obtaining the edge relation weight between vertices in the graph model is as follows:
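One plausible reading of the cosine similarity of two sentence matrices is to flatten both matrices and take the cosine of the resulting vectors (equivalently, the Frobenius inner product divided by the product of the Frobenius norms). The sketch below assumes that reading; the text may intend a different matrix-level definition. The names `matrix_cosine` and `edge_weights` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matrix_cosine(a: Matrix, b: Matrix) -> float:
    """Cosine similarity of two same-shape matrices, treated as flattened vectors."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm else 0.0


def edge_weights(matrices: Sequence[Matrix]) -> List[List[float]]:
    """Symmetric edge weight matrix with a zero diagonal (no self-loops)."""
    n = len(matrices)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = matrix_cosine(matrices[i], matrices[j])
    return w


m1 = [[1.0, 0.0], [0.0, 0.0]]
m2 = [[1.0, 0.0], [0.0, 0.0]]
m3 = [[0.0, 0.0], [0.0, 1.0]]
w = edge_weights([m1, m2, m3])
```

Note that a flattened comparison is sensitive to word order, because each column position is compared only with the same position in the other sentence.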

Step F3: training the improved TextRank graph model: the edge relation weights between vertices in the graph are initialized to 1, the learning rate is set to 0.001, and all vertex values and edge relation weights are iteratively recalculated until the model converges; the final score of each vertex serves as an important basis for determining the key sentences of the criminal process in the current legal document.

Step G: ranking the vertex scores of all sentences from step F and taking the top K as the extracted key information of the criminal process, where K is 10% of the number of sentences in the text; the K sentences are then ordered, and redundant information in them is removed.
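Selecting the top K sentences can be sketched as below. The text sets K to 10% of the number of sentences but does not say how to round, so the sketch assumes rounding up with a minimum of one sentence; the function name `top_k_indices` and the integer `percent` parameter are hypothetical.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], percent: int = 10) -> List[int]:
    """Indices of the K best-scoring sentences, K = percent% of the sentence count."""
    n = len(scores)
    # Integer ceiling division avoids floating-point surprises such as 30 * 0.1.
    k = max(1, (n * percent + 99) // 100)     # rounding up is an assumption
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


example_scores = [0.1 * i for i in range(20)]
selected = top_k_indices(example_scores)
```

The function returns indices rather than sentences so that the later ordering step can still consult each sentence's original position.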

The step G comprises the following steps:

Step G1: ordering the K sentences. Ordering them only from high to low by score leaves the final key information of the criminal process lacking integrity and coherence, so the invention orders them according to the following principles:

1) when sentences contain temporal information such as key times or steps, their order is determined by that temporal information;

2) the original order of the subject words corresponds to the order of the sentences, so if two sentences contain two different subject words, they are ordered according to the original order of those subject words;

3) when sentences contain the same subject word information, they are arranged according to the score order of the original text.

After being processed in this way, the sentences form a crime process key information set S_new.

Step G2: combining the sentence and the subject word information and removing redundant information by using an MMR algorithm, wherein the calculation formula is as follows:

MR(S_i)＝α·Sim₁(S_i,S_m)-(1-α)·max[Sim₂(S_i,S_j)]

where α is 0.8; the similarities are calculated in the same way as the graph model edge relation weights in step F; S_m is the text information sentence composed of the subject words from step D; the Sim₁ function reflects the relevance between the current sentence and the text information sentence composed of the text's subject words; and the Sim₂ function compares the current sentence with the other sentences contained in the crime process key information set S_new and takes the maximum value, which reflects the difference between the current sentence and the other selected key information.

The MR value of each sentence is calculated; when MR(S_i) is less than or equal to 0.85, the sentence is retained, and sentences above that value are removed, finally yielding the critical information of the criminal process in the legal document.
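The MR calculation of step G2 can be sketched as follows. The sketch takes the similarity function as a parameter, because the text reuses the edge-weight similarity of step F; the Jaccard similarity used in the example is only a stand-in for illustration, and the names `mr_value`, `filter_by_mr`, and `jaccard` are hypothetical. The retention rule (keep a sentence when its MR value is at most the threshold) follows the text as written.

```python
from typing import Callable, FrozenSet, List, Sequence, TypeVar

S = TypeVar("S")


def mr_value(i: int, sentences: Sequence[S], topic_sentence: S,
             sim: Callable[[S, S], float], alpha: float = 0.8) -> float:
    """MR(S_i) = alpha * Sim1(S_i, S_m) - (1 - alpha) * max_j Sim2(S_i, S_j)."""
    relevance = sim(sentences[i], topic_sentence)
    others = [sim(sentences[i], s) for j, s in enumerate(sentences) if j != i]
    redundancy = max(others) if others else 0.0
    return alpha * relevance - (1 - alpha) * redundancy


def filter_by_mr(sentences: Sequence[S], topic_sentence: S,
                 sim: Callable[[S, S], float], alpha: float = 0.8,
                 threshold: float = 0.85) -> List[S]:
    """Keep sentences whose MR value is <= threshold, as the text states."""
    return [s for i, s in enumerate(sentences)
            if mr_value(i, sentences, topic_sentence, sim, alpha) <= threshold]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Stand-in similarity for the example only (not the method's similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0


s1 = frozenset({"theft", "phone"})
s2 = frozenset({"theft", "cash"})
topic = frozenset({"theft", "phone", "cash"})
mr_first = mr_value(0, [s1, s2], topic, jaccard)
retained = filter_by_mr([s1, s2], topic, jaccard)
```

With α = 0.8 and a similarity bounded by 1, the MR value cannot exceed 0.8, so the threshold is exposed as a parameter here rather than hard-coded.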

Compared with the prior art, the invention has the following beneficial effects:

1) Compared with a neural network algorithm, the method does not need a large corpus, has a short training time, extracts process information from legal documents quickly, and is suitable for practical use.

2) During text preprocessing, a special stop word list is constructed according to the particularity of the legal document text, and compared with the use of a general stop word list, the accuracy of crime process key information finally extracted from the document can be effectively improved.

3) When the classic TextRank algorithm is used for constructing a graph model, the vertex score and the edge relation weight calculation ignore the chapter structure and the text theme of the text and the position and semantic information of sentences in the text. The method improves the classic TextRank algorithm, and firstly, the positions of sentences in the text and subject word information are blended into the vertex calculation of a graph model; and secondly, calculating the edge relation between the vertexes in the graph without using the original co-occurrence word formula, and finally using the matrix containing sentence semantic information and calculating the cosine similarity of the two matrixes as the edge relation weight in the graph model.

4) After the improved TextRank algorithm is used for obtaining the topK key sentence, the method also combines sentences and subject term information and uses the MMR algorithm to remove redundant information, so that the crime process key information extracted from the legal document text can better summarize the full text, and meanwhile, the consistency among the sentences is kept.

Claims

1. A method for extracting crime process key information in a legal document based on a TextRank algorithm is characterized by comprising the following steps:

step A: labeling the set words or parts of speech to obtain a preliminarily screened subject word set w = {w₁, w₂, …, w_n};

step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set;

step C: converting the subject word set w obtained after text preprocessing into a vector representation to obtain n-dimensional word vectors and a new process-text subject word set w_c;

step D: adding word position information and merging semantically similar words to obtain the final keyword ranking;

step E: splitting the legal document to be extracted into sentences, the sentence set of the whole text being expressed as S = {s₁, s₂, …, s_n}; at the same time, each sentence is preprocessed as in step A, all of its words are converted into word vector representations, and these are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being zero-padded;

step F: constructing the graph model of the TextRank algorithm and, using the word vector representations obtained in step E and the set initial values, iterating until convergence;

step G: ranking the vertex scores of all sentences from step F, taking the K highest-scoring sentences as the extracted key information of the criminal process, ordering the K sentences, and removing redundant information from them.

2. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 1, wherein in the step D, the steps comprise:

step D1: obtaining a distance value using the word vectors obtained by the CBOW model and comparing it with the position average value to obtain distance information, wherein the closer a word is to the beginning of the sentence, the larger its weight P_i;

3. The method for extracting critical information of criminal process in legal document based on TextRank algorithm as claimed in claim 2, wherein in the step D1, the weights are assigned as follows:

4. The method for extracting crime process key information in a legal document based on a TextRank algorithm as claimed in claim 3, wherein, in the above step D2, the similarity of the remaining words in the set w_c is calculated using cosine similarity, specifically as follows:

5. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 1, wherein the step F comprises:

step F1: the positions of the sentences in the text and the subject word information are fused into the vertexes of the graph model to calculate the vertex sentence scores of the graph model; confirming whether the sentences contain the subject words or not, wherein the sentences containing the subject words are obviously more critical, and defining the information weight of the subject words;

step F3: training the TextRank graph model by initializing the edge relation weights between vertices in the graph, setting a value for the learning rate, and iteratively recalculating all vertex values and edge relation weights until the model converges, the final score of each vertex serving as an important basis for determining the key sentences in the current legal document.

6. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 5, wherein the step F1 comprises:

the definition of sentence position information weight is as follows:

the definition of the subject term information weight is as follows:

the vertex sentence score of the graph model is calculated as follows:

Score(i)＝P(s_i)*F(s_i)*TextRank(i)

the sentence matrix representations obtained in step E are used, and the cosine similarity of two matrices is calculated as the edge relation weight in the graph model, as follows:

7. The method for extracting crime process key information in a legal document based on a TextRank algorithm according to claim 6, wherein, in step F2, the edge relation weight between vertices in the graph model is obtained as follows:

8. The method for extracting crime process key information in a legal document based on a TextRank algorithm as claimed in claim 7, wherein the step G comprises the steps of:

step G1: ordering the K sentences, wherein, because simply arranging them from high to low by score would leave the final key information of the criminal process lacking integrity and continuity, the sentences are instead ordered according to the ordering principles of the invention to form a crime process key information set S_new;

Step G2: combining sentences and subject word information and removing redundant information by using an MMR algorithm, wherein the method comprises the following steps:

MR(S_i)＝α·Sim₁(S_i,S_m)-(1-α)·max[Sim₂(S_i,S_j)]

wherein α is a set value; the similarities are calculated in the same way as the graph model edge relation weights in step F; S_m is the text information sentence composed of the subject words from step D; the Sim₁ function reflects the relevance between the current sentence and the text information sentence composed of the text's subject words; and the Sim₂ function compares the current sentence with the other sentences contained in the crime process key information set S_new and takes the maximum value; the MR value of each sentence is obtained, and when MR(S_i) is less than or equal to the damping value the sentence is retained, while sentences above that value are removed, finally yielding the critical information of the criminal process in the legal document.