CN112800211A - Method for extracting critical information of criminal process in legal document based on TextRank algorithm - Google Patents

Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Info

Publication number
CN112800211A
Authority
CN
China
Prior art keywords
sentences
information
sentence
words
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011625462.4A
Other languages
Chinese (zh)
Inventor
李参宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Netmarch Technologies Co ltd
Original Assignee
Jiangsu Netmarch Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Netmarch Technologies Co ltd filed Critical Jiangsu Netmarch Technologies Co ltd
Priority to CN202011625462.4A priority Critical patent/CN112800211A/en
Publication of CN112800211A publication Critical patent/CN112800211A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Technology Law (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting crime process key information from a legal document based on the TextRank algorithm. The method comprises the following steps: preprocessing the text of the legal document and labeling set words or parts of speech; obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set; converting the preprocessed subject word set w into vector form to obtain a new process-text subject word set w_c; adding word position information and merging semantically similar words to obtain the final keyword ranking; splitting the legal document to be processed into sentences; constructing the graph model of the TextRank algorithm from the obtained word vector representations, setting initial values, and iterating until convergence; and ranking the vertex scores of all sentences, taking the K highest-scoring sentences as the extracted crime process key information, ordering the K sentences, and removing redundant information from them. The sentences finally retained are thereby made more coherent.

Description

Method for extracting critical information of criminal process in legal document based on TextRank algorithm
Technical Field
The invention relates to the technical field of information content extraction methods, in particular to a method for extracting critical information in a criminal process in a legal document based on a TextRank algorithm.
Background
In recent years, as criminal methods have kept changing, the criminal processes of suspects described in legal documents have taken many different forms. Extracting the key information about a suspect's criminal process from a legal document is a prerequisite for downstream applications such as document matching and sentencing prediction. Existing text information extraction methods have the following defects:
Extracting text information with a neural network algorithm requires a large document corpus, training takes a long time, and extraction of crime process key information is slow, so the approach is not suitable for practical application.
Statistics-based methods extract information from a document faster, the most classical being the TextRank algorithm. However, TextRank considers only the similarity between sentence nodes: when building the edge relations between nodes of the graph model, it directly compares the number of words two sentences have in common to judge how closely they are related, and it ignores the chapter structure of the text as well as the position and semantic information of each sentence.
Moreover, legal documents differ from texts in other fields: the suspect's criminal process is concentrated in part of the document and is described in specialized language, so it cannot be extracted by applying existing text information extraction methods directly.
Therefore, it is necessary to provide a new extraction method to solve the above problems.
Disclosure of Invention
To solve the problems set forth in the background above, the invention provides a method for extracting crime process key information from a legal document based on the TextRank algorithm, which makes the finally retained sentences more coherent.
In order to achieve the purpose, the invention provides the following technical scheme: a method for extracting crime process key information in a legal document based on a TextRank algorithm specifically comprises the following steps:
Step A: preprocessing the text of the legal document and labeling set words or parts of speech to obtain a preliminarily screened subject word set w = {w_1, w_2, …, w_n};
Step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set;
Step C: converting the preprocessed subject word set w into vector form to obtain n-dimensional word vector representations and a new process-text subject word set w_c;
Step D: adding word position information and merging semantically similar words to obtain the final keyword ranking;
Step E: splitting the legal document to be processed into sentences, the sentence set of the whole text being expressed as S = {s_1, s_2, …, s_n}; each sentence is then preprocessed as in step A, all of its words are converted into word vectors, and the vectors are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being padded with zeros;
Step F: constructing the graph model of the TextRank algorithm, and iterating until convergence using the word vector representations obtained in step E and set initial values;
Step G: ranking the vertex scores of all sentences from step F, taking the K highest-scoring sentences as the extracted crime process key information, ordering the K sentences, and removing redundant information from them.
Step D comprises the following steps:
Step D1: when word information is extracted from the text, a word located toward the front of a sentence is given additional weight. A distance value is obtained from the word vectors produced by the CBOW model in the previous step and compared with the position average to obtain distance information; the closer the word is to the beginning of the sentence, the larger its weight P_i;
Step D2: calculating the similarity of the remaining words in the set w_c using cosine similarity, removing semantically similar words, and of any two similar words keeping the one with the larger TF-IDF_new value;
Step D3: sorting the remaining words by TF-IDF_new value and obtaining the final subject word set w of the legal document text according to a set threshold, which provides subject-word semantic support for the final extraction of crime process key information from the text.
In the step D1, the weight assignment method is as follows:
Figure BDA0002879172340000021
The TF-IDF value of the word obtained in step B is multiplied by the weight P_i to obtain TF-IDF_new, the result of fusing word position information.
In step D2 above, the similarity of the remaining words in the set w_c is calculated using cosine similarity, as follows:
Figure BDA0002879172340000031
where the words w_x = (v_1, v_2, …, v_n) and w_y = (v_1, v_2, …, v_n) are both n-dimensional vector representations produced by the CBOW model in step C. The larger the value of wordsim(w_x, w_y), the higher the semantic similarity of the two words. Semantically similar words are finally removed, and of any two similar words the one with the larger TF-IDF_new value is kept.
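The formula image is not reproduced in this text. Assuming the standard definition of cosine similarity, which the description names, the word similarity would read as below. The component subscripts x and y are notation introduced here to distinguish the two vectors; they do not appear in the original.

```latex
\operatorname{wordsim}(w_x, w_y)
  = \frac{\sum_{i=1}^{n} v_{x,i}\, v_{y,i}}
         {\sqrt{\sum_{i=1}^{n} v_{x,i}^{2}}\;\sqrt{\sum_{i=1}^{n} v_{y,i}^{2}}}
```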
Step F comprises the following steps:
Step F1: the position of each sentence in the text and the subject word information are fused into the vertices of the graph model to calculate the vertex sentence scores; whether a sentence contains subject words is checked, since sentences containing subject words are clearly more important, and a subject word information weight is defined;
Step F2: using the sentence matrix representations, the cosine similarity of two matrices is computed and used as the edge relation weight in the graph model;
Step F3: the edge relation weights between vertices in the graph are initialized to 1, a value is set for the learning rate, and all vertex values and edge relation weights are computed iteratively until the model converges; the final score of each vertex serves as the main basis for determining the key sentences in the current legal document.
The step F1 includes: the definition of sentence position information weight is as follows:
Figure BDA0002879172340000032
the definition of the subject term information weight is as follows:
Figure BDA0002879172340000033
The vertex sentence score of the graph model is calculated as follows:
Score(i) = P(s_i) * F(s_i) * TextRank(i)
where TextRank(i) is the score used in the classic TextRank algorithm, given by the following formula:
Figure BDA0002879172340000034
where w_ij is the similarity coefficient between two sentences, calculated in the next step, F2; Input(v_i) is the set S_all of all sentences of the current text obtained in step E; Output(v_j) is the set of other sentences linked from the current sentence; and d is the damping value, representing the probability that a vertex jumps to any other vertex in the graph.
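The formula image is not reproduced in this text. Assuming it matches the weighted vertex score of the classic TextRank algorithm, and using the symbols defined above, it would read as follows. This is a reconstruction from the well-known algorithm, not a transcription of the missing figure.

```latex
\mathrm{TextRank}(v_i) = (1 - d)
  + d \sum_{v_j \in \mathrm{Input}(v_i)}
      \frac{w_{ji}}{\sum_{v_k \in \mathrm{Output}(v_j)} w_{jk}}\,
      \mathrm{TextRank}(v_j)
```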
The specific way of obtaining the inter-vertex edge relation weight in the graph model in the step F2 is as follows:
Figure BDA0002879172340000041
Step G comprises the following steps:
Step G1: ordering the K sentences. Arranging them only by score from high to low would leave the final crime process key information lacking completeness and coherence, so the sentences are arranged according to set ordering principles to form the crime process key information set S_new;
Step G2: combining sentence and subject word information and removing redundant information with the MMR algorithm, as follows:
MR(S_i) = α·Sim_1(S_i, S_m) - (1-α)·max[Sim_2(S_i, S_j)]
where α is a set value and similarity is calculated in the same way as the edge relation weight of the graph model in step F. S_m is the text-information sentence composed of the subject words from step D. The Sim_1 function reflects the relevance between the current sentence and the text-information sentence composed of the text's subject words. The Sim_2 function compares the current sentence with the other sentences in the crime process key information set S_new and takes the maximum value. The MR value of each sentence is calculated: when MR(S_i) is less than or equal to the set value the sentence is retained, and sentences with a higher value are removed. The crime process key information in the legal document is thereby extracted.
Compared with the prior art, the method for extracting the critical information in the criminal process in the legal document based on the TextRank algorithm has the beneficial effects that: the method comprises the steps of extracting subject terms aiming at the particularity of a legal document, then calculating vertex scores and edge relations among vertexes of a graph model in a TextRank algorithm by fusing text subject terms, sentence position relations and semantic relations, processing redundant information by combining sentences and subject terms, and finally selecting sentences with topK scores as crime process key information of suspects in the legal document. And the method also combines sentences and subject word information and utilizes an MMR algorithm to remove redundant information, so that the crime process key information extracted from the legal document text can better summarize the full text, and meanwhile, the continuity among sentences is kept.
Drawings
FIG. 1 is a schematic flow chart of a criminal process key information extraction method in a legal document based on a TextRank algorithm according to the present invention;
FIG. 2 is a schematic view of the process of adding word position information and merging semantic similar words to obtain the final keyword ranking information according to the present invention;
FIG. 3 is a schematic flow chart of the method for constructing the graph model of the TextRank algorithm, and performing iteration to convergence by using the word vector representation obtained in the step E and setting an initial value.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to 3, the present invention provides a method for extracting critical information of a criminal process in a legal document based on a TextRank algorithm, which is characterized by comprising the following steps:
Step A: preprocessing the text of the legal document and labeling set words or parts of speech, mainly by word segmentation, stop word removal, and part-of-speech tagging, to obtain a preliminarily screened subject word set w = {w_1, w_2, …, w_n};
Specifically, after the input text is acquired in step A, preprocessing proceeds as follows. Step A1: the text is segmented into words with jieba, a Chinese word segmentation tool that performs well.
Step A2: a stop word list is compiled according to the textual characteristics of legal documents, and this list is used to remove uninformative words from the legal document, mainly prepositions, auxiliary words, conjunctions, and the like.
Step A3: the jieba toolkit is used to tag the parts of speech of the process text, all non-nouns are removed, and the process-text subject word set w = {w_1, w_2, …, w_n} is obtained.
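Steps A2 and A3 can be sketched as a filter over tagged tokens. This is a minimal illustration, not the patented implementation: the function name `subject_word_candidates`, the sample tokens, and the tiny stop word set are hypothetical, and the input is assumed to be (word, tag) pairs such as those produced by `jieba.posseg.cut`, whose noun tags begin with "n".

```python
from typing import Iterable, List, Set, Tuple


def subject_word_candidates(tagged: Iterable[Tuple[str, str]],
                            stopwords: Set[str]) -> List[str]:
    """Keep nouns that are not stop words, in order of first occurrence."""
    seen: Set[str] = set()
    kept: List[str] = []
    for word, flag in tagged:
        if word in stopwords:
            continue  # Step A2: drop stop words
        if not flag.startswith("n"):
            continue  # Step A3: drop non-nouns (noun tags start with "n")
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Hypothetical pre-tagged tokens. In practice they could be built with
# [(p.word, p.flag) for p in jieba.posseg.cut(text)].
tagged = [("被告人", "n"), ("在", "p"), ("商场", "n"), ("盗窃", "v"),
          ("手机", "n"), ("的", "uj"), ("商场", "n")]
stopwords = {"在", "的"}
candidates = subject_word_candidates(tagged, stopwords)
print(candidates)
```

Keeping the tagger outside the function makes the filter easy to test without installing a segmentation library.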
Step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set. Step B proceeds as follows: first, the term frequency (TF) is calculated by counting how many times each word of the set w occurs in the text; next, the inverse document frequency (IDF) is calculated by counting the proportion of all legal documents (the document corpus) in which the word appears and taking the logarithm of the result, so that the smaller the proportion, the larger the IDF value and the better the word distinguishes itself from other words; finally, the TF-IDF value of the word is calculated, which grows with the number of times the word appears in the text.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining to assess how important a word is to a document from how often it occurs.
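The calculation in step B can be sketched as follows. This is a minimal illustration under stated assumptions: TF is the raw count in the document, IDF is log(N / df) with no smoothing, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Return TF-IDF per word of `doc`, with raw-count TF and log(N/df) IDF."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for word, tf in counts.items():
        # Number of documents containing the word (at least 1 to avoid division by zero).
        df = max(1, sum(1 for other in corpus if word in other))
        scores[word] = tf * math.log(n_docs / df)
    return scores


corpus = [["theft", "phone", "theft"], ["fraud", "phone"], ["fraud", "bank"]]
scores = tf_idf(corpus[0], corpus)
print(scores)
```

In this toy corpus "theft" occurs twice and only in one document, so it scores higher than "phone", which appears in two of the three documents.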
Step C: converting the preprocessed subject word set w into vector form to obtain n-dimensional word vector representations and a new process-text subject word set w_c.
Specifically, the subject word set w after text preprocessing is converted into a vector representation form through a word2vec tool.
Specifically, the CBOW model is selected and trained with the hierarchical softmax method to obtain n-dimensional word vector representations, in preparation for computing word position information and semantic similarity between words; the vectorized representation yields a new legal-document subject word set w_c.
Step D: adding word position information and merging semantic similar words;
referring to fig. 2, in the step D, the steps include:
Step D1: when word information is extracted from the text, a word located toward the front of a sentence is given additional weight. A distance value is obtained from the word vectors produced by the CBOW model in the previous step and compared with the position average to obtain distance information; the closer the word is to the beginning of the sentence, the larger its weight P_i;
Step D2: calculating the similarity of the remaining words in the set w_c using cosine similarity, removing semantically similar words, and of any two similar words keeping the one with the larger TF-IDF_new value;
Step D3: sorting the remaining words by TF-IDF_new value and obtaining the final subject word set w of the legal document text according to a set threshold, which provides subject-word semantic support for the final extraction of crime process key information from the text.
In the step D1, the weight assignment method is as follows:
Figure BDA0002879172340000061
The TF-IDF value of the word obtained in step B is multiplied by the weight P_i to obtain TF-IDF_new, the result of fusing word position information.
In step D2 above, the similarity of the remaining words in the set w_c is calculated using cosine similarity, as follows:
Figure BDA0002879172340000071
where the words w_x = (v_1, v_2, …, v_n) and w_y = (v_1, v_2, …, v_n) are both n-dimensional vector representations produced by the CBOW model in step C. The larger the value of wordsim(w_x, w_y), the higher the semantic similarity of the two words. Semantically similar words are finally removed, and of any two similar words the one with the larger TF-IDF_new value is kept.
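The merging of semantically similar words can be sketched as below. This is an illustrative reading rather than the patented procedure: the greedy visiting order, the similarity threshold of 0.9, the function names, and the toy vectors are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_similar(vectors: Dict[str, Sequence[float]],
                  weights: Dict[str, float],
                  threshold: float = 0.9) -> List[str]:
    """Visit words by descending weight and drop any word that is too
    similar to a word already kept, so the heavier word of a pair survives."""
    kept: List[str] = []
    for word in sorted(vectors, key=lambda w: weights[w], reverse=True):
        if all(cosine(vectors[word], vectors[k]) < threshold for k in kept):
            kept.append(word)
    return kept


# Hypothetical 2-dimensional vectors and TF-IDF_new weights.
vectors = {"knife": (1.0, 0.0), "dagger": (0.9, 0.1), "bank": (0.0, 1.0)}
weights = {"knife": 0.8, "dagger": 0.5, "bank": 0.6}
kept = merge_similar(vectors, weights)
print(kept)
```

Here "dagger" is nearly parallel to "knife" and has the smaller weight, so it is dropped, while "bank" is unrelated and stays.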
Step E: splitting the legal document to be processed into sentences, the sentence set of the whole text being expressed as S = {s_1, s_2, …, s_n}; each sentence is then preprocessed as in step A, all of its words are converted into word vectors, and the vectors are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being padded with zeros;
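The sentence matrix of step E can be sketched as follows, assuming each column of the n×m matrix holds one word vector. The function name `sentence_matrix`, the handling of out-of-vocabulary words as zero columns, and the toy vectors are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def sentence_matrix(words: Sequence[str],
                    vectors: Dict[str, Sequence[float]],
                    n: int, m: int) -> List[List[float]]:
    """Build an n x m matrix whose j-th column is the vector of the j-th word.
    Columns beyond the sentence length stay zero (zero padding)."""
    matrix = [[0.0] * m for _ in range(n)]
    for j, word in enumerate(words[:m]):
        vec = vectors.get(word)
        if vec is None:
            continue  # assumption: unknown words are left as zero columns
        for i in range(n):
            matrix[i][j] = vec[i]
    return matrix


vectors = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
sentences = [["a", "b", "a"], ["b"]]
m = max(len(s) for s in sentences)          # maximum sentence length
matrices = [sentence_matrix(s, vectors, n=2, m=m) for s in sentences]
print(matrices[1])
```

Padding every sentence to the same width m means any two sentence matrices have the same shape and can be compared element by element.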
step F: constructing a graph model of the TextRank algorithm, and setting an initial value to iterate until convergence;
Step F comprises the following steps: Step F1: the position of each sentence in the text and the subject word information are fused into the vertices of the graph model to calculate the vertex sentence scores. Sentences at different positions in the text carry different weight, and the first sentence of a text usually carries the most information. Whether a sentence contains subject words is also checked, since sentences containing subject words are clearly more important, and a subject word information weight is defined;
step F2: through sentence matrix representation, cosine similarity of the two matrixes is obtained and used as an edge relation weight in the graph model;
referring to fig. 3, the above step F1 includes:
the definition of sentence position information weight is as follows:
Figure BDA0002879172340000072
the definition of the subject term information weight is as follows:
Figure BDA0002879172340000073
The vertex sentence score of the graph model is calculated as follows:
Score(i) = P(s_i) * F(s_i) * TextRank(i)
where TextRank(i) is the score used in the classic TextRank algorithm of the prior art, given by the following formula:
Figure BDA0002879172340000081
where w_ij is the similarity coefficient between two sentences, calculated in the next step, F2; Input(v_i) is the set S_all of all sentences of the current text obtained in step E; Output(v_j) is the set of other sentences linked from the current sentence; and d is the damping value (0.85), representing the probability that a vertex jumps to any other vertex in the graph.
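The iteration behind TextRank(i) can be sketched as follows. This is a minimal illustration of the classic weighted TextRank update on an undirected sentence graph, assuming the standard form of the formula (the formula image is not reproduced here). The function name, the tolerance, the iteration cap, and the toy weight matrix are assumptions.

```python
from typing import List


def textrank(weights: List[List[float]], d: float = 0.85,
             tol: float = 1e-6, max_iter: int = 200) -> List[float]:
    """Iterate score_i = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * score_j
    until the largest change between two rounds falls below `tol`."""
    n = len(weights)
    # Total edge weight leaving each vertex, ignoring self-loops.
    out_sum = [sum(w for k, w in enumerate(weights[j]) if k != j)
               for j in range(n)]
    scores = [1.0] * n
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if j != i and out_sum[j] > 0)
            updated.append((1 - d) + d * incoming)
        change = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical symmetric similarity matrix for three sentences.
weights = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.5],
           [0.1, 0.5, 0.0]]
scores = textrank(weights)
print(scores)
```

In this example the second sentence is the most strongly connected and ends with the highest score; the scores always sum to the number of vertices.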
In the prior art, relevance is reflected mainly by counting the words that co-occur in two sentences, and the semantic information between the sentences is ignored. The classic TextRank algorithm obtains the similarity as follows:
Figure BDA0002879172340000082
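For comparison, the co-occurrence similarity of the classic TextRank algorithm can be sketched as below, assuming the widely used definition: the number of shared words divided by the sum of the logarithms of the two sentence lengths. The function name and the sample sentences are hypothetical.

```python
import math
from typing import List


def overlap_similarity(s1: List[str], s2: List[str]) -> float:
    """Shared-word count normalised by log sentence lengths (classic TextRank)."""
    if not s1 or not s2:
        return 0.0
    common = len(set(s1) & set(s2))
    denom = math.log(len(s1)) + math.log(len(s2))
    return common / denom if denom > 0 else 0.0


s1 = ["suspect", "stole", "phone", "mall"]
s2 = ["police", "found", "phone", "suspect"]
similarity = overlap_similarity(s1, s2)
print(round(similarity, 4))
```

Two sentences that describe the same event in different words share no tokens and score 0 here, which is the limitation the matrix-based edge weight is meant to address.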
The invention instead uses the sentence matrix representations obtained in step E and computes the cosine similarity of two matrices as the edge relation weight in the graph model. The edge relation weight between vertices in the graph model is obtained as follows:
Figure BDA0002879172340000083
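The formula image for the edge weight is not reproduced in this text. One plausible reading, shown here only as an assumption, treats each n×m sentence matrix as a single long vector and applies the usual cosine similarity (the Frobenius inner product divided by the product of the Frobenius norms). The function name and the sample matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matrix_cosine(a: Matrix, b: Matrix) -> float:
    """Cosine similarity of two equally shaped matrices, flattened to vectors."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, 0.0]]
edge_weight = matrix_cosine(a, b)
print(round(edge_weight, 4))
```

Because the comparison is column-aligned, this reading is sensitive to word order within a sentence; other readings of the missing formula are possible.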
Step F3: training the improved TextRank graph model. The edge relation weights between vertices in the graph are initialized to 1 and the learning rate is set to 0.001; all vertex values and edge relation weights are then computed iteratively until the model converges, and the final score of each vertex serves as the main basis for determining the key criminal-process sentences in the current legal document.
Step G: ranking the vertex scores of all sentences from step F and taking the top K as the extracted crime process key information, where K is 10% of the number of sentences in the text; the K sentences are then ordered and redundant information is removed from them.
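Combining the three factors of Score(i) and selecting the top 10% of sentences can be sketched as below. The position weights P and subject word weights F are taken as given inputs, since their formulas are not reproduced in this text. The function name, the rounding of K, and the sample numbers are assumptions.

```python
from typing import List, Tuple


def top_k_sentences(position_w: List[float], subject_w: List[float],
                    textrank_scores: List[float],
                    ratio: float = 0.1) -> Tuple[List[int], List[float]]:
    """Return the indices of the K best sentences and all combined scores,
    where Score(i) = P(s_i) * F(s_i) * TextRank(i) and K = ratio * sentence count."""
    scores = [p * f * t for p, f, t in
              zip(position_w, subject_w, textrank_scores)]
    k = max(1, round(ratio * len(scores)))   # assumption: round, at least 1
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Hypothetical inputs for a 20-sentence text.
position_w = [1.2] + [1.0] * 19
subject_w = [1.0] * 20
subject_w[7] = 1.5
textrank_scores = [1.0] * 20
textrank_scores[3] = 2.0
textrank_scores[7] = 1.5

top, scores = top_k_sentences(position_w, subject_w, textrank_scores)
print(top)
```

Sentence 7 has a lower TextRank value than sentence 3 but overtakes it once the subject word weight is applied, which is the effect the combined score is designed to produce.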
The step G comprises the following steps:
Step G1: ordering the K sentences. Arranging them only by score from high to low would leave the final crime process key information lacking completeness and coherence, so the invention orders them according to the following principles:
1) when sentences contain temporal information such as key times or steps, the order of two sentences is determined by that temporal information;
2) the original order of the subject words corresponds to the order of the sentences: if two sentences contain two different subject words, they are ordered according to the original order of those subject words;
3) when sentences contain the same subject word, they are arranged by their scores in the original text.
After this processing, the crime process key information set S_new is formed.
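The three ordering principles can be sketched as a pairwise comparison. This is only one possible reading: the field names, the rule that temporal information is used only when both sentences have it, and the choice of "higher score first" for principle 3 are assumptions.

```python
from functools import cmp_to_key
from typing import List, NamedTuple, Optional


class Picked(NamedTuple):
    text: str
    time_index: Optional[int]   # position on a timeline, if the sentence has one
    topic_rank: int             # original order of the sentence's subject word
    score: float                # vertex score from step F


def compare(a: Picked, b: Picked) -> int:
    # Principle 1: both sentences carry temporal information.
    if (a.time_index is not None and b.time_index is not None
            and a.time_index != b.time_index):
        return a.time_index - b.time_index
    # Principle 2: different subject words, follow their original order.
    if a.topic_rank != b.topic_rank:
        return a.topic_rank - b.topic_rank
    # Principle 3: same subject word, higher score first (assumption).
    return -1 if a.score > b.score else (1 if a.score < b.score else 0)


def order_sentences(picked: List[Picked]) -> List[Picked]:
    return sorted(picked, key=cmp_to_key(compare))


picked = [Picked("A", 2, 0, 0.5), Picked("B", 1, 1, 0.9), Picked("C", None, 1, 0.7)]
ordered = [p.text for p in order_sentences(picked)]
print(ordered)
```

A pairwise comparator mirrors the wording of the principles, which are stated for two sentences at a time; a production version would need to check that the rules never contradict each other across three or more sentences.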
Step G2: combining the sentence and the subject word information and removing redundant information by using an MMR algorithm, wherein the calculation formula is as follows:
MR(S_i) = α·Sim_1(S_i, S_m) - (1-α)·max[Sim_2(S_i, S_j)]
where α is 0.8 and similarity is calculated in the same way as the edge relation weight of the graph model in step F. S_m is the text-information sentence composed of the subject words from step D. The Sim_1 function reflects the relevance between the current sentence and the text-information sentence composed of the text's subject words. The Sim_2 function compares the current sentence with the other sentences in the crime process key information set S_new and takes the maximum value, which reflects the difference between the current sentence and the key information already selected.
The MR value of each sentence is calculated: when MR(S_i) is less than or equal to 0.85 the sentence is retained, and sentences with a higher value are removed. The crime process key information in the legal document is thereby extracted.
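The MR calculation and the retention rule can be sketched as follows. The similarities are passed in as precomputed numbers, and the retention rule follows the text as written (keep a sentence when its MR value is at most 0.85). The function name and the sample values are hypothetical.

```python
from typing import List, Tuple


def mmr_filter(topic_sim: List[float], pair_sim: List[List[float]],
               alpha: float = 0.8,
               threshold: float = 0.85) -> Tuple[List[float], List[int]]:
    """Compute MR(S_i) = alpha * Sim_1(S_i, S_m) - (1 - alpha) * max_j Sim_2(S_i, S_j)
    and return the MR values with the indices of the retained sentences."""
    n = len(topic_sim)
    mr_values: List[float] = []
    for i in range(n):
        # Largest similarity to any other selected sentence (0.0 if there is none).
        redundancy = max((pair_sim[i][j] for j in range(n) if j != i),
                         default=0.0)
        mr_values.append(alpha * topic_sim[i] - (1 - alpha) * redundancy)
    kept = [i for i, mr in enumerate(mr_values) if mr <= threshold]
    return mr_values, kept


topic_sim = [0.9, 0.6]                  # Sim_1 of each sentence to S_m
pair_sim = [[1.0, 0.5], [0.5, 1.0]]     # Sim_2 between selected sentences
mr_values, kept = mmr_filter(topic_sim, pair_sim)
print(mr_values, kept)
```

Keeping `alpha` and `threshold` as parameters makes it straightforward to try other settings than the 0.8 and 0.85 given in the description.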
Compared with the prior art, the invention has the following beneficial effects:
1) Compared with neural network algorithms, the method does not require a large corpus, trains quickly, extracts process information from legal documents quickly, and is suitable for practical use.
2) During text preprocessing, a special stop word list is constructed according to the particularity of the legal document text, and compared with the use of a general stop word list, the accuracy of crime process key information finally extracted from the document can be effectively improved.
3) When the classic TextRank algorithm is used for constructing a graph model, the vertex score and the edge relation weight calculation ignore the chapter structure and the text theme of the text and the position and semantic information of sentences in the text. The method improves the classic TextRank algorithm, and firstly, the positions of sentences in the text and subject word information are blended into the vertex calculation of a graph model; and secondly, calculating the edge relation between the vertexes in the graph without using the original co-occurrence word formula, and finally using the matrix containing sentence semantic information and calculating the cosine similarity of the two matrixes as the edge relation weight in the graph model.
4) After the improved TextRank algorithm is used for obtaining the topK key sentence, the method also combines sentences and subject term information and uses the MMR algorithm to remove redundant information, so that the crime process key information extracted from the legal document text can better summarize the full text, and meanwhile, the consistency among the sentences is kept.

Claims (8)

1. A method for extracting crime process key information in a legal document based on a TextRank algorithm is characterized by comprising the following steps:
Step A: labeling set words or parts of speech to obtain a preliminarily screened subject word set w = {w_1, w_2, …, w_n};
Step B: obtaining the TF-IDF (term frequency-inverse document frequency) value of each word in the subject word set;
Step C: converting the preprocessed subject word set w into vector form to obtain n-dimensional word vector representations and a new process-text subject word set w_c;
Step D: adding word position information and merging semantic similar words to obtain the final ranking information of the keywords;
Step E: splitting the legal document to be processed into sentences, the sentence set of the whole text being expressed as S = {s_1, s_2, …, s_n}; each sentence is then preprocessed as in step A, all of its words are converted into word vectors, and the vectors are concatenated into a matrix representation M_{n×m} of the sentence, where n is the word vector dimension and m is the maximum sentence length in the text, shorter sentences being padded with zeros;
step F: building a graph model of the TextRank algorithm, and utilizing the word vector representation obtained in the step E and the set initial value to iterate until convergence;
Step G: ranking the vertex scores of all sentences from step F, taking the K highest-scoring sentences as the extracted crime process key information, ordering the K sentences, and removing redundant information from them.
2. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 1, wherein in the step D, the steps comprise:
Step D1: obtaining a distance value from the word vectors produced by the CBOW model and comparing it with the position average to obtain distance information, wherein the closer the word is to the beginning of the sentence, the larger its weight P_i;
Step D2: calculating the similarity of the remaining words in the set w_c using cosine similarity, removing semantically similar words, and of any two similar words keeping the one with the larger TF-IDF_new value;
Step D3: sorting the remaining words by TF-IDF_new value and obtaining the final subject word set w of the legal document text according to a set threshold, which provides subject-word semantic support for the final extraction of crime process key information from the text.
3. The method for extracting critical information of criminal process in legal document based on TextRank algorithm as claimed in claim 2, wherein in the step D1, the weights are assigned as follows:
Figure FDA0002879172330000021
The TF-IDF value of the word obtained in the second step is multiplied by the weight P_i to obtain TF-IDF_new, which is the result of fusing the word position information.
4. The method for extracting crime process key information in a legal document based on a TextRank algorithm as claimed in claim 3, wherein in the above step D3, the similarity of the remaining words of the set w_c is calculated using cosine similarity, specifically as follows:
Figure FDA0002879172330000022
wherein the words w_x = (v_1, v_2, …, v_n) and w_y = (v_1, v_2, …, v_n) are both n-dimensional vector representations produced by the CBOW model in step C. The larger the value of wordsim(w_x, w_y), the higher the semantic similarity of the two words; finally, semantically similar words are removed, and of each such pair the word with the larger TF-IDF_new value is retained.
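The formula image referenced in this claim is not reproduced in this text. For orientation only, the standard definition of cosine similarity between two n-dimensional vectors is given here; the component names x_k and y_k are notation assumed for this sketch so that the two vectors can be told apart, and the patent figure may be typeset differently.

```latex
\mathrm{wordsim}(w_x, w_y)
  = \frac{\sum_{k=1}^{n} x_k \, y_k}
         {\sqrt{\sum_{k=1}^{n} x_k^{2}} \; \sqrt{\sum_{k=1}^{n} y_k^{2}}},
\qquad w_x = (x_1, \dots, x_n), \quad w_y = (y_1, \dots, y_n)
```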
5. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 1, wherein the step F comprises:
Step F1: fusing the position of each sentence in the text and the subject-word information into the vertices of the graph model to calculate the vertex sentence scores of the graph model; determining whether each sentence contains subject words, sentences containing subject words being more critical, and defining the subject-word information weight;
Step F2: using the sentence matrix representations, obtaining the cosine similarity of two matrices as the edge relation weight in the graph model;
Step F3: initializing the edge relation weights between the vertices of the TextRank graph model, setting a value for the learning rate, and iteratively calculating all vertex values and edge relation weights until the model converges, the final score of each vertex serving as an important basis for determining the key sentences in the current legal document.
6. The method for extracting crime procedure key information in a legal document based on a TextRank algorithm according to claim 5, wherein the step F1 comprises:
the definition of sentence position information weight is as follows:
Figure FDA0002879172330000031
the definition of the subject term information weight is as follows:
Figure FDA0002879172330000032
the vertex sentence score of the graph model is calculated as follows:
Score(i) = P(s_i) * F(s_i) * TextRank(i)
using the sentence matrix representations obtained in step E, the cosine similarity of two matrices is taken as the edge relation weight in the graph model, as follows:
Figure FDA0002879172330000033
wherein w_ij is the similarity coefficient between two sentences, calculated in the next step G2; Input(v_i) is the set S_all of all sentences of the current text obtained in step F; Output(v_j) represents the set of other sentences linked from the current sentence; and d is the damping value, representing the probability that a vertex jumps to any other vertex in the graph.
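The scoring described in this claim can be sketched as follows. Because the TextRank formula image is not reproduced in this text, the iteration below uses the commonly published weighted TextRank update, (1 - d) + d * sum over j of w_ji / sum_k w_jk * score_j, which is an assumption about what the figure contains. The function names, the default d = 0.85, and the convergence tolerance are likewise illustrative choices.

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate weighted TextRank over a similarity matrix until convergence.

    weights[i][j] is the edge weight between sentences i and j.
    """
    n = len(weights)
    if n == 0:
        return []
    scores = [1.0] * n
    # Total edge weight leaving each vertex, ignoring self-loops.
    out_sum = [sum(w for j, w in enumerate(row) if j != i)
               for i, row in enumerate(weights)]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if j != i and out_sum[j] > 0)
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

def vertex_scores(weights, position_w, subject_w, **kwargs):
    """Score(i) = P(s_i) * F(s_i) * TextRank(i), as written in the claim."""
    ranks = textrank(weights, **kwargs)
    return [p * f * t for p, f, t in zip(position_w, subject_w, ranks)]
```

Here `position_w` and `subject_w` stand for the P(s_i) and F(s_i) weights, whose definitions are given only in the figures and are therefore passed in as precomputed lists.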
7. The method for extracting crime process key information in a legal document based on a TextRank algorithm according to claim 6, wherein the step F2 is implemented by obtaining the relationship weight between vertices in the graph model:
Figure FDA0002879172330000034
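One plausible reading of "cosine similarity of two matrices" is to flatten each zero-padded sentence matrix and take the ordinary cosine of the resulting vectors (equivalently, the Frobenius inner product divided by the product of Frobenius norms). The sketch below follows that reading; it is an assumption, since the formula image is not reproduced here, and the function names are hypothetical.

```python
import math

def matrix_cosine(a, b):
    """Cosine similarity of two equally shaped matrices, treated as flat vectors."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = (math.sqrt(sum(x * x for x in flat_a))
            * math.sqrt(sum(y * y for y in flat_b)))
    return dot / norm if norm else 0.0

def edge_weights(matrices):
    """Pairwise similarity matrix for the graph; the diagonal is left at zero."""
    n = len(matrices)
    return [[0.0 if i == j else matrix_cosine(matrices[i], matrices[j])
             for j in range(n)]
            for i in range(n)]
```

The resulting square matrix can serve directly as the edge relation weights of the graph model.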
8. The method for extracting crime process key information in a legal document based on a TextRank algorithm as claimed in claim 7, wherein the step G comprises the steps of:
Step G1: ordering the K sentences; because arranging the K sentences from high to low by score would leave the final crime process key information lacking integrity and continuity, the sentences are processed in this manner to form the crime process key information set S_new;
Step G2: combining sentences and subject-word information and removing redundant information by using an MMR algorithm, wherein the method comprises the following steps:
MR(S_i) = α·Sim_1(S_i, S_m) - (1-α)·max[Sim_2(S_i, S_j)]
wherein α is a set value; the similarity is calculated in the same way as the graph model edge relation weight in step F; S_m is the text information sentence composed of the subject words from step D; the Sim_1 function reflects the relevance between the current sentence and the text information sentence formed from the text subject words; and the Sim_2 function compares the current sentence with the other sentences contained in the crime process key information set S_new and takes the maximum value. The MR value of each sentence is obtained; when MR(S_i) is less than or equal to the damping value, the sentence is retained, and sentences above the damping value are removed; finally, the crime process key information in the legal document is extracted.
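The MR value in step G2 can be computed as in the sketch below. For brevity it represents each sentence, and the subject-word sentence S_m, as a single vector rather than the sentence matrix used in the claims; that simplification, the function names, and the default α = 0.7 are assumptions of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mr_scores(sentence_vecs, subject_vec, alpha=0.7):
    """MR(S_i) = alpha * Sim1(S_i, S_m) - (1 - alpha) * max_j Sim2(S_i, S_j)."""
    scores = []
    for i, s in enumerate(sentence_vecs):
        # Relevance of the sentence to the subject-word sentence S_m.
        relevance = cosine(s, subject_vec)
        # Largest similarity to any other sentence in the candidate set S_new.
        redundancy = max((cosine(s, t) for j, t in enumerate(sentence_vecs) if j != i),
                         default=0.0)
        scores.append(alpha * relevance - (1 - alpha) * redundancy)
    return scores
```

Each MR value is then compared against the set threshold, as the claim describes, to decide which sentences remain in the final extraction.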
CN202011625462.4A 2020-12-31 2020-12-31 Method for extracting critical information of criminal process in legal document based on TextRank algorithm Pending CN112800211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011625462.4A CN112800211A (en) 2020-12-31 2020-12-31 Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011625462.4A CN112800211A (en) 2020-12-31 2020-12-31 Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Publications (1)

Publication Number Publication Date
CN112800211A true CN112800211A (en) 2021-05-14

Family

ID=75807778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011625462.4A Pending CN112800211A (en) 2020-12-31 2020-12-31 Method for extracting critical information of criminal process in legal document based on TextRank algorithm

Country Status (1)

Country Link
CN (1) CN112800211A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312532A (en) * 2021-06-01 2021-08-27 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张志尧: "基于TF-IDF与TextRank的自动摘要抽取", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *


Similar Documents

Publication Publication Date Title
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN109960724B (en) Text summarization method based on TF-IDF
CN109858028B (en) Short text similarity calculation method based on probability model
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN111966826A (en) Method, system, medium and electronic device for constructing text classification system
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN112464656A (en) Keyword extraction method and device, electronic equipment and storage medium
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN115906805A (en) Long text abstract generating method based on word fine granularity
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116501875A (en) Document processing method and system based on natural language and knowledge graph
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN111639189B (en) Text graph construction method based on text content features
CN110929022A (en) Text abstract generation method and system
CN112800211A (en) Method for extracting critical information of criminal process in legal document based on TextRank algorithm
CN108427769B (en) Character interest tag extraction method based on social network
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN115269846A (en) Text processing method and device, electronic equipment and storage medium
CN114997161A (en) Keyword extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210514
