CN113139061B - Case feature extraction method based on word vector clustering - Google Patents
- Publication number
- CN113139061B (application CN202110525578.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- case
- cluster
- abstract
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a case feature extraction method based on word vector clustering, and relates to the technical field of machine learning. By analysing the case abstracts in historical case data, the method constructs a hash-table-based word segmentation method, builds a stop word list specific to the judicial field to filter stop words, generates word vectors for the case abstracts with word2vec, clusters the word vectors, and finally produces the cluster distribution of each case abstract. By applying this case feature extraction method to a large number of historical case abstracts, the distinct key information of each case can be accurately extracted, cases of the same type can be further differentiated, and a reference is provided for objectively and quantitatively predicting the workload of each case. The cluster distributions of different procuratorates are also produced, so that their case distributions can be compared and analysed, providing a reference for analysing the overall case handling capacity of each procuratorate and improving its self-learning capability.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a case feature extraction method based on word vector clustering.
Background
With the rapid development of modern technology, the role of information technology in procuratorial work is increasingly prominent. In the case assignment process, case feature extraction directly affects the rationality of case distribution. Many scholars have studied how to accurately extract case features from case abstracts, but current research remains at the qualitative analysis stage: apart from listing the key information of a case, no method for extracting case features from case abstracts has been given. Even within the same case type, cases with different key information consume different amounts of work. Extracting case features from case abstracts allows cases of the same type to be distinguished in finer detail, improves the accuracy of workload estimates for different cases of the same type, and provides a reference for workload evaluation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a case feature extraction method based on word vector clustering, which not only considers the occurrence frequency of words, but also considers the context relation in the case abstract, thereby extracting more effective key feature clusters.
The technical scheme of the invention is that the case feature extraction method based on word vector clustering comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
the bidirectional maximum matching method compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word;
in the word segmentation process, the dictionary is defined with a hash table structure; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
Step 2: part-of-speech tagging is carried out on the segmented case abstracts, and stop words in them are filtered, specifically:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set and sort the words by TF_w; stop words already in the general stop word list are removed before the term frequency is calculated; the top K_1 words after sorting are taken as professional-term stop words;
step 2.2: after removing the professional-term stop words from the case abstracts, the last K_2 words in the ranking are also taken as stop words;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set, and take the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
The word2vec algorithm with the CBOW model is used to generate word vectors; the center word is predicted from its context words, specifically:
step 3.1: set the window size to 5, i.e. 4 context words; input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: the resulting vectors v_{t+j} are added and averaged to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden layer vector is transposed and multiplied by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i denotes the i-th element of P_out, yielding a |V|-dimensional probability distribution in which each dimension represents a word; the word indicated by the dimension index with the maximum probability is the predicted center word;
step 3.6: the predicted center word vector is compared with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: the center word vector with the smallest error is output and stored as the word vector of the center word;
step 3.8: steps 3.1-3.7 are repeated so that each word in the filtered case abstracts is converted into a word vector in turn.
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
K-means clustering is adopted, the optimal number of clusters is determined iteratively, and finally the cluster vector of each case abstract in the test data set is output, specifically:
step 4.1: preset the range over which the number of clusters k varies;
step 4.2: cluster all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeat steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors) and compute the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; compare the distances from P to each W and place the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract now belongs to a cluster; count the number of a case abstract's words in each cluster to form a k-dimensional vector, which serves as the feature vector of that case abstract; finally the cluster distribution of the case abstracts is output in cluster order (0, 1, ..., k);
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The beneficial effects of adopting this technical scheme are as follows:
The invention provides a case feature extraction method based on word vector clustering. By analysing the case abstracts in historical case data, a hash-table-based word segmentation method is constructed, a stop word list specific to the judicial field is built to filter stop words, word vectors for the case abstracts are generated with word2vec, the word vectors are clustered, and finally the cluster distribution of each case abstract is produced. By applying this method to a large number of historical case abstracts, the distinct key information of each case can be accurately extracted, cases of the same type can be further differentiated, and a reference is provided for objectively and quantitatively predicting the workload of each case.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a dictionary hash table in accordance with an embodiment of the present invention;
FIG. 3 is a schematic view of a CBOW model according to an embodiment of the present invention;
FIG. 4 is a cluster-like distribution in an embodiment of the invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A case feature extraction method based on word vector clustering, as shown in figure 1, comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
the bidirectional maximum matching method compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word;
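The forward matching, reverse matching, and comparison just described can be sketched as follows. This is an illustrative sketch only: the toy dictionary, the maximum word length of 5, and the tie-breaking heuristic (prefer the segmentation with fewer words, then the one with fewer single-character words) are assumptions, since the source does not spell out the exact comparison rule.

```python
def forward_mm(text, dictionary, max_len=5):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                words.append(text[i:i + l])
                i += l
                break
    return words

def backward_mm(text, dictionary, max_len=5):
    """Reverse maximum matching: scan from the end of the string."""
    words, j = [], len(text)
    while j > 0:
        for l in range(min(max_len, j), 0, -1):
            if l == 1 or text[j - l:j] in dictionary:
                words.append(text[j - l:j])
                j -= l
                break
    return list(reversed(words))

def bi_mm(text, dictionary, max_len=5):
    """Compare both results; assumed heuristic: fewer words, then fewer singles."""
    fwd = forward_mm(text, dictionary, max_len)
    bwd = backward_mm(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)          # fewer words = longer matches
    return min(fwd, bwd, key=lambda ws: sum(len(w) == 1 for w in ws))
```

On the classic example "研究生命起源", forward matching yields ["研究生", "命", "起源"] while reverse matching yields ["研究", "生命", "起源"]; the heuristic above picks the reverse result because it contains no single-character words.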
in the word segmentation process, to improve segmentation speed, the dictionary is defined with a hash table structure; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
A schematic diagram of the hash structure is shown in fig. 2. The Unicode encoding of Chinese characters ranges from 0x4E00 to 0x9FA5, 20902 characters in total, and the hash function of the first-layer hash table is f(x) = x_1 - 0x4E00, where x is a word and x_1 is the Unicode code point of its first character. The hash value of the second layer is the length of the word; statistics show that the maximum word length is below 20, so this hash value ranges from 1 to 20. The hash function of the third layer is computed from the Unicode code points of all n characters of the word, where n is the length of the word.
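The three-layer lookup can be sketched as follows. The nesting of the three layers follows the description above; because the source does not give the third-layer function in full, summing the Unicode code points of all characters is an assumption here, as is the use of Python dictionaries to stand in for the hash tables (a set at the innermost level absorbs any collisions).

```python
class ThreeLayerHashDict:
    """Word dictionary with a three-layer hash lookup (sketch)."""

    def __init__(self, words=()):
        self.table = {}
        for w in words:
            self.add(w)

    @staticmethod
    def _keys(word):
        layer1 = ord(word[0]) - 0x4E00       # offset of the first character from U+4E00
        layer2 = len(word)                   # number of characters (1..20 per the text)
        layer3 = sum(ord(c) for c in word)   # assumed: sum of all code points
        return layer1, layer2, layer3

    def add(self, word):
        k1, k2, k3 = self._keys(word)
        self.table.setdefault(k1, {}).setdefault(k2, {}).setdefault(k3, set()).add(word)

    def __contains__(self, word):
        k1, k2, k3 = self._keys(word)
        return word in self.table.get(k1, {}).get(k2, {}).get(k3, set())
```

Membership tests then touch only three small hash lookups instead of scanning the whole dictionary, which is the speed-up the text refers to.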
Step 2: part-of-speech tagging is carried out on the segmented case abstracts, and stop words in them, such as personal names, place names, English characters and numbers, are filtered out;
in information retrieval, certain words are automatically filtered out before or after processing natural-language text in order to save storage space and improve search efficiency; such words are called stop words. Because current natural language processing has no stop word list for the judicial field, this patent builds one on the basis of an existing NLP stop word list, specifically:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set and sort the words by TF_w; stop words already in the general stop word list are removed before the term frequency is calculated, so as to locate judicial-field stop words; the top K_1 words after sorting are taken as professional-term stop words; in this embodiment K_1 = 1000;
step 2.2: after removing the professional-term stop words from the case abstracts, the last K_2 words in the ranking are also taken as stop words; in this embodiment K_2 = 1000;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set, and take the K_3 words with the lowest IDF scores as stop words; in this embodiment K_3 = 1000.
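The stop-word construction in steps 2.1-2.3 can be sketched as below: rank words by term frequency after dropping general-purpose stop words, take the top K_1 and bottom K_2 of the ranking, then add the K_3 words with the lowest inverse document frequency. The small toy corpus and K values are for illustration only (the embodiment uses K_1 = K_2 = K_3 = 1000).

```python
import math
from collections import Counter

def build_stop_words(docs, base_stop_words, k1, k2, k3):
    """docs: list of token lists (already-segmented case abstracts)."""
    # Steps 2.1-2.2: rank by term frequency after removing general stop words;
    # the k1 most frequent and k2 least frequent words become stop words.
    tf = Counter(w for doc in docs for w in doc if w not in base_stop_words)
    ranked = [w for w, _ in tf.most_common()]
    stop = set(ranked[:k1]) | set(ranked[-k2:] if k2 else [])
    # Step 2.3: words with the lowest IDF, i.e. words appearing in almost
    # every abstract, are also stop words.
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w not in base_stop_words)
    idf = {w: math.log(n / (1 + df[w])) for w in df}
    stop |= set(sorted(idf, key=idf.get)[:k3])
    return stop

docs = [["甲", "乙", "丙"], ["甲", "乙", "丁"], ["甲", "戊", "己"]]
stop_words = build_stop_words(docs, set(), k1=1, k2=1, k3=1)
```

In the toy corpus, "甲" appears in every document, so it is picked up both by the frequency ranking and by the lowest-IDF rule.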
Step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
After the case abstracts with irrelevant words filtered out are obtained, each remaining word in them is converted into a word vector. The word2vec algorithm, one of the methods for converting words into word vectors, is used; it comprises two word vector generation models, CBOW and Skip-gram. The CBOW model, shown in fig. 3, predicts the center word from its context words, specifically:
step 3.1: set the window size to 5, i.e. 4 context words; input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: the resulting vectors v_{t+j} are added and averaged to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden layer vector is transposed and multiplied by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i denotes the i-th element of P_out, yielding a |V|-dimensional probability distribution in which each dimension represents a word; the word indicated by the dimension index with the maximum probability is the predicted center word;
step 3.6: the predicted center word vector is compared with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: the center word vector with the smallest error is output and stored as the word vector of the center word;
step 3.8: steps 3.1-3.7 are repeated so that each word in the filtered case abstracts is converted into a word vector in turn.
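A numeric sketch of one CBOW forward pass (steps 3.1-3.5) in NumPy follows. The tiny vocabulary, embedding dimension, and random weights are illustrative assumptions, and the training loop that updates W_in and W_out from the prediction error (steps 3.6-3.7) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                        # |V| words, d-dimensional word vectors (toy sizes)
W_in = rng.normal(size=(V, d))     # input weight matrix W_in, size |V| x d
W_out = rng.normal(size=(d, V))    # output weight matrix W_out, size d x |V|

context_ids = [0, 1, 3, 4]         # indices of the four context words (window = 5)

# Steps 3.2-3.3: multiplying a one-hot vector by W_in just selects a row,
# so we index the rows directly, then average them into the hidden layer.
v_projection = W_in[context_ids].mean(axis=0)   # v_PROJECTION, shape (d,)

# Step 3.4: project the hidden vector through the output weight matrix.
p_out = v_projection @ W_out                    # P_out, shape (|V|,)

# Step 3.5: softmax turns P_out into a probability distribution over the vocabulary.
probs = np.exp(p_out) / np.exp(p_out).sum()
predicted_center = int(np.argmax(probs))        # index of the predicted center word
```

In training, the cross-entropy between `probs` and the actual center word's one-hot vector would drive the updates of both weight matrices.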
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
After the word vectors of the keywords in the case abstracts are obtained, all word vectors must be clustered; this patent adopts K-means clustering. In the clustering process the number of clusters directly affects the extraction of case features from the case abstracts, so to ensure the accuracy of feature extraction, the optimal number of clusters is determined iteratively, and finally the cluster vector of each case abstract in the test data set is output. Specifically:
step 4.1: preset the range over which the number of clusters k varies; in this embodiment k ranges from 5 to 15;
step 4.2: cluster all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeat steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors) and compute the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; compare the distances from P to each W and place the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract now belongs to a cluster; count the number of a case abstract's words in each cluster to form a k-dimensional vector, which serves as the feature vector of that case abstract; finally the cluster distribution of the case abstracts is output in cluster order (0, 1, ..., k).
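Steps 4.2-4.4 can be sketched as below: run K-means over all word vectors, then turn one abstract into a k-dimensional histogram counting how many of its words fall in each cluster. The plain K-means loop, random toy vectors, and fixed k = 5 are illustrative assumptions (the embodiment iterates k from 5 to 15).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: random initial centers, then assign/update iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every vector to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):            # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def abstract_cluster_vector(word_vectors, centers):
    """Steps 4.3-4.4: count how many of one abstract's words fall in each cluster."""
    counts = np.zeros(len(centers), dtype=int)
    for v in word_vectors:
        dists = np.sqrt(((centers - v) ** 2).sum(axis=1))   # d(P, W) per cluster
        counts[int(np.argmin(dists))] += 1                  # nearest cluster wins
    return counts

rng = np.random.default_rng(1)
word_vecs = rng.normal(size=(30, 4))          # stand-ins for word2vec vectors
centers = kmeans(word_vecs, k=5)
doc_vec = abstract_cluster_vector(word_vecs[:8], centers)   # one abstract's 8 words
```

The resulting `doc_vec` is the k-dimensional feature vector of one case abstract; stacking these vectors over all abstracts gives the cluster distribution of step 5.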
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The case feature extraction method was tested on 1364 cases in total from a municipal-level procuratorate and 2 subordinate district procuratorates, and the cluster distribution was calculated as the feature distribution, as shown in fig. 4.
Claims (4)
1. The case feature extraction method based on word vector clustering is characterized by comprising the following steps of:
step 1: text preprocessing is carried out on the case abstract text;
creating a training data set from the case abstract text data, and performing Chinese word segmentation on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
step 2: marking the parts of speech of the case abstract after word segmentation, and filtering stop words in the case abstract;
step 2.1: calculating the term frequency TF_w of each word in the segmented training data set and sorting the words by TF_w, stop words in the stop word list being removed before the term frequency is calculated, and taking the top K_1 words after sorting as professional-term stop words;
step 2.2: after removing the professional-term stop words from the case abstracts, taking the last K_2 words in the ranking as stop words;
step 2.3: calculating the inverse document frequency IDF of each word in the segmented training data set, and taking the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors; word2vec algorithm is adopted for generating word vectors, and CBOW model is adopted; predicting the center word according to the context word of the center word;
step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation; adopting K-means clustering, iteratively determining the optimal cluster number, and finally outputting the cluster vector of each case abstract of the test data set;
step 4.1: presetting a change range of a cluster k;
step 4.2: clustering all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeating steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors), and computing the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; comparing the distances from P to each W and placing the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract belonging to a cluster, counting the number of each case abstract's words in each cluster to form a k-dimensional vector, taking the k-dimensional vector as the feature vector of each case abstract, and finally outputting the cluster distribution of the case abstracts in cluster order (0, 1, ..., k);
step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
2. The case feature extraction method based on word vector clustering according to claim 1, wherein the bidirectional maximum matching method in step 1 compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word.
3. The case feature extraction method based on word vector clustering according to claim 1, wherein in the word segmentation processing in step 1 a hash table structure is used to define the dictionary; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
4. The case feature extraction method based on word vector clustering of claim 1, wherein the step 3 specifically includes:
step 3.1: setting the window size to 5, i.e. 4 context words, and inputting the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: letting the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors, each context one-hot vector being multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: adding and averaging the resulting vectors v_{t+j} to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: transposing the hidden layer vector and multiplying it by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out represents the output vector;
step 3.5: passing the 1×|V| vector P_out obtained in step 3.4 through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i represents the i-th element of P_out, to obtain a |V|-dimensional probability distribution in which each dimension represents a word, the word indicated by the dimension index with the maximum probability being the predicted center word;
step 3.6: comparing the predicted center word vector with the one-hot vector of the actual center word, the smaller the error, the more accurately the model predicts the center word;
step 3.7: outputting and storing the center word vector with the smallest error as the word vector of the center word;
step 3.8: repeating steps 3.1-3.7 so that each word in the filtered case abstracts is converted into a word vector in turn.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525578.9A CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525578.9A CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113139061A CN113139061A (en) | 2021-07-20 |
CN113139061B (en) | 2023-07-21
Family
ID=76817023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110525578.9A Active CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139061B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779936A (en) * | 2021-09-16 | 2021-12-10 | 西华师范大学 | Network asset data processing method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243129A (en) * | 2015-09-30 | 2016-01-13 | 清华大学深圳研究生院 | Commodity property characteristic word clustering method |
CN106294319A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | One is combined related cases recognition methods |
CN106599029A (en) * | 2016-11-02 | 2017-04-26 | 焦点科技股份有限公司 | Chinese short text clustering method |
WO2017076205A1 (en) * | 2015-11-04 | 2017-05-11 | 陈包容 | Method and apparatus for obtaining reply prompt content for chat start sentence |
CN107516110A (en) * | 2017-08-22 | 2017-12-26 | 华南理工大学 | A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding |
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
CN108733669A (en) * | 2017-04-14 | 2018-11-02 | 优路(北京)信息科技有限公司 | A kind of personalized digital media content recommendation system and method based on term vector |
CN109325231A (en) * | 2018-09-21 | 2019-02-12 | 中山大学 | A kind of method that multi task model generates term vector |
CN109783643A (en) * | 2019-01-09 | 2019-05-21 | 北京一览群智数据科技有限责任公司 | A kind of approximation sentence recommended method and device |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111159406A (en) * | 2019-12-30 | 2020-05-15 | 内蒙古工业大学 | Big data text clustering method and system based on parallel improved K-means algorithm |
Worldwide Applications (1)
- 2021-05-14 CN CN202110525578.9A patent/CN113139061B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243129A (en) * | 2015-09-30 | 2016-01-13 | 清华大学深圳研究生院 | Commodity property characteristic word clustering method |
WO2017076205A1 (en) * | 2015-11-04 | 2017-05-11 | 陈包容 | Method and apparatus for obtaining reply prompt content for chat start sentence |
CN106294319A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | One is combined related cases recognition methods |
CN106599029A (en) * | 2016-11-02 | 2017-04-26 | 焦点科技股份有限公司 | Chinese short text clustering method |
CN108733669A (en) * | 2017-04-14 | 2018-11-02 | 优路(北京)信息科技有限公司 | A kind of personalized digital media content recommendation system and method based on term vector |
CN107516110A (en) * | 2017-08-22 | 2017-12-26 | 华南理工大学 | A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding |
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
CN109325231A (en) * | 2018-09-21 | 2019-02-12 | 中山大学 | A kind of method that multi task model generates term vector |
CN109783643A (en) * | 2019-01-09 | 2019-05-21 | 北京一览群智数据科技有限责任公司 | A kind of approximation sentence recommended method and device |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111159406A (en) * | 2019-12-30 | 2020-05-15 | 内蒙古工业大学 | Big data text clustering method and system based on parallel improved K-means algorithm |
Non-Patent Citations (5)
Title |
---|
Fuzzy bag-of-words model for document representation; Rui Zhao et al.; IEEE Transactions on Fuzzy Systems; Vol. 26, No. 2; 794-804 *
Research on recommendation models based on matrix factorization and review embedding; Zhang Jiahui et al.; Journal of Zhejiang Sci-Tech University (Natural Science Edition); Vol. 41, No. 1; 79-91 *
Word-vector evaluation for air-ground communication based on clustering and similarity computation; Xiang Qian; Computer Technology and Development; Vol. 30, No. 9; 137-142 *
A Web service representation method fusing multi-dimensional information; Zhang Xiangping et al.; Journal of Frontiers of Computer Science and Technology; Vol. 16, No. 7; 1561-1569 *
Microblog topic discovery combining word vectors and keyword extraction; Wang Liping, Zhao Hui; Modern Computer; No. 23; 3-9 *
Also Published As
Publication number | Publication date |
---|---|
CN113139061A (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN107577785B (en) | Hierarchical multi-label classification method suitable for legal identification | |
CN109918666B (en) | Chinese punctuation mark adding method based on neural network | |
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN112732934B (en) | Power grid equipment word segmentation dictionary and fault case library construction method | |
CN108519971B (en) | Cross-language news topic similarity comparison method based on parallel corpus | |
CN109670014B (en) | Paper author name disambiguation method based on rule matching and machine learning | |
CN111027323A (en) | Entity nominal item identification method based on topic model and semantic analysis | |
CN113988053A (en) | Hot word extraction method and device | |
Gunaseelan et al. | Automatic extraction of segments from resumes using machine learning | |
CN113139061B (en) | Case feature extraction method based on word vector clustering | |
CN114265935A (en) | Science and technology project establishment management auxiliary decision-making method and system based on text mining | |
CN112559741B (en) | Nuclear power equipment defect record text classification method, system, medium and electronic equipment | |
CN116629258B (en) | Structured analysis method and system for judicial document based on complex information item data | |
Yang et al. | Court similar case recommendation model based on word embedding and word frequency | |
Desyatirikov et al. | Computer analysis of text tonality based on the JSM method | |
CN110717015B (en) | Neural network-based polysemous word recognition method | |
CN114595324A (en) | Method, device, terminal and non-transitory storage medium for power grid service data domain division | |
CN113239277A (en) | Probability matrix decomposition recommendation method based on user comments | |
CN113535928A (en) | Service discovery method and system of long-term and short-term memory network based on attention mechanism | |
CN112000782A (en) | Intelligent customer service question-answering system based on k-means clustering algorithm | |
Nyandag et al. | Keyword extraction based on statistical information for Cyrillic Mongolian script | |
CN117688354B (en) | Text feature selection method and system based on evolutionary algorithm | |
Yao et al. | Chinese long text summarization using improved sequence-to-sequence lstm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||