CN113139061B - Case feature extraction method based on word vector clustering - Google Patents
- Publication number
- CN113139061B (application CN202110525578.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- case
- cluster
- abstract
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a case feature extraction method based on word vector clustering, and relates to the technical field of machine learning. By analysing the case abstracts in historical case data, the method constructs a hash-table-based word segmentation method, builds a stop word list specific to the judicial field to filter stop words, generates word vectors for the case abstracts with word2vec, clusters the word vectors, and finally produces the cluster distribution of each case abstract. By applying this case feature extraction method to a large number of historical case abstracts, the distinct key information of each case can be accurately extracted, cases of the same type can be further differentiated, and a reference is provided for objectively and quantitatively predicting the workload of each case. The cluster distributions of different procuratorates are also produced, so that their case distributions can be compared and analysed, providing a reference for analysing the overall case handling capacity of each procuratorate and improving its self-learning capability.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a case feature extraction method based on word vector clustering.
Background
With the rapid development of modern technology, the role of information technology in procuratorial work is increasingly prominent. In the case assignment process, case feature extraction directly affects the rationality of case distribution. Many scholars have studied how to accurately extract case features from case abstracts, but current research remains at the qualitative analysis stage: apart from listing the key information of a case, no method for extracting case features from case abstracts has been given. Even within the same case type, cases with different key information consume different amounts of work. Extracting case features from case abstracts allows cases of the same type to be distinguished in finer detail, improves the accuracy of workload estimates for different cases of the same type, and provides a reference for workload evaluation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a case feature extraction method based on word vector clustering, which not only considers the occurrence frequency of words, but also considers the context relation in the case abstract, thereby extracting more effective key feature clusters.
The technical scheme of the invention is that the case feature extraction method based on word vector clustering comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
the bidirectional maximum matching method compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word;
in the word segmentation process, the dictionary is defined with a hash table structure; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
Step 2: part-of-speech tagging is carried out on the segmented case abstracts, and stop words in them are filtered, specifically:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set and sort the words by TF_w; stop words already in the general stop word list are removed before the term frequency is calculated; the top K_1 words after sorting are taken as professional-term stop words;
step 2.2: after removing the professional-term stop words from the case abstracts, the last K_2 words in the ranking are also taken as stop words;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set, and take the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
The word2vec algorithm with the CBOW model is used to generate word vectors; the center word is predicted from its context words, specifically:
step 3.1: set the window size to 5, i.e. 4 context words; input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: the resulting vectors v_{t+j} are added and averaged to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden layer vector is transposed and multiplied by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i denotes the i-th element of P_out, yielding a |V|-dimensional probability distribution in which each dimension represents a word; the word indicated by the dimension index with the maximum probability is the predicted center word;
step 3.6: the predicted center word vector is compared with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: the center word vector with the smallest error is output and stored as the word vector of the center word;
step 3.8: steps 3.1-3.7 are repeated so that each word in the filtered case abstracts is converted into a word vector in turn.
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
K-means clustering is adopted, the optimal number of clusters is determined iteratively, and finally the cluster vector of each case abstract in the test data set is output, specifically:
step 4.1: preset the range over which the number of clusters k varies;
step 4.2: cluster all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeat steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors) and compute the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; compare the distances from P to each W and place the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract now belongs to a cluster; count the number of a case abstract's words in each cluster to form a k-dimensional vector, which serves as the feature vector of that case abstract; finally the cluster distribution of the case abstracts is output in cluster order (0, 1, ..., k);
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The beneficial effects of adopting this technical scheme are as follows:
The invention provides a case feature extraction method based on word vector clustering. By analysing the case abstracts in historical case data, a hash-table-based word segmentation method is constructed, a stop word list specific to the judicial field is built to filter stop words, word vectors for the case abstracts are generated with word2vec, the word vectors are clustered, and finally the cluster distribution of each case abstract is produced. By applying this method to a large number of historical case abstracts, the distinct key information of each case can be accurately extracted, cases of the same type can be further differentiated, and a reference is provided for objectively and quantitatively predicting the workload of each case.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a dictionary hash table in accordance with an embodiment of the present invention;
FIG. 3 is a schematic view of a CBOW model according to an embodiment of the present invention;
FIG. 4 is a cluster-like distribution in an embodiment of the invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A case feature extraction method based on word vector clustering, as shown in figure 1, comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
the bidirectional maximum matching method compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word;
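The forward matching, reverse matching, and comparison just described can be sketched as follows. This is an illustrative sketch only: the toy dictionary, the maximum word length of 5, and the tie-breaking heuristic (prefer the segmentation with fewer words, then the one with fewer single-character words) are assumptions, since the source does not spell out the exact comparison rule.

```python
def forward_mm(text, dictionary, max_len=5):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                words.append(text[i:i + l])
                i += l
                break
    return words

def backward_mm(text, dictionary, max_len=5):
    """Reverse maximum matching: scan from the end of the string."""
    words, j = [], len(text)
    while j > 0:
        for l in range(min(max_len, j), 0, -1):
            if l == 1 or text[j - l:j] in dictionary:
                words.append(text[j - l:j])
                j -= l
                break
    return list(reversed(words))

def bi_mm(text, dictionary, max_len=5):
    """Compare both results; assumed heuristic: fewer words, then fewer singles."""
    fwd = forward_mm(text, dictionary, max_len)
    bwd = backward_mm(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)          # fewer words = longer matches
    return min(fwd, bwd, key=lambda ws: sum(len(w) == 1 for w in ws))
```

On the classic example "研究生命起源", forward matching yields ["研究生", "命", "起源"] while reverse matching yields ["研究", "生命", "起源"]; the heuristic above picks the reverse result because it contains no single-character words.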
in the word segmentation process, to improve segmentation speed, the dictionary is defined with a hash table structure; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
A schematic diagram of the hash structure is shown in fig. 2. The Unicode encoding of Chinese characters ranges from 0x4E00 to 0x9FA5, 20902 characters in total, and the hash function of the first-layer hash table is f(x) = x_1 - 0x4E00, where x is a word and x_1 is the Unicode code point of its first character. The hash value of the second layer is the length of the word; statistics show that the maximum word length is below 20, so this hash value ranges from 1 to 20. The hash function of the third layer is computed from the Unicode code points of all n characters of the word, where n is the length of the word.
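The three-layer lookup can be sketched as follows. The nesting of the three layers follows the description above; because the source does not give the third-layer function in full, summing the Unicode code points of all characters is an assumption here, as is the use of Python dictionaries to stand in for the hash tables (a set at the innermost level absorbs any collisions).

```python
class ThreeLayerHashDict:
    """Word dictionary with a three-layer hash lookup (sketch)."""

    def __init__(self, words=()):
        self.table = {}
        for w in words:
            self.add(w)

    @staticmethod
    def _keys(word):
        layer1 = ord(word[0]) - 0x4E00       # offset of the first character from U+4E00
        layer2 = len(word)                   # number of characters (1..20 per the text)
        layer3 = sum(ord(c) for c in word)   # assumed: sum of all code points
        return layer1, layer2, layer3

    def add(self, word):
        k1, k2, k3 = self._keys(word)
        self.table.setdefault(k1, {}).setdefault(k2, {}).setdefault(k3, set()).add(word)

    def __contains__(self, word):
        k1, k2, k3 = self._keys(word)
        return word in self.table.get(k1, {}).get(k2, {}).get(k3, set())
```

Membership tests then touch only three small hash lookups instead of scanning the whole dictionary, which is the speed-up the text refers to.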
Step 2: part-of-speech tagging is carried out on the segmented case abstracts, and stop words in them, such as personal names, place names, English characters and numbers, are filtered out;
in information retrieval, certain words are automatically filtered out before or after processing natural-language text in order to save storage space and improve search efficiency; such words are called stop words. Because current natural language processing has no stop word list for the judicial field, this patent builds one on the basis of an existing NLP stop word list, specifically:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set and sort the words by TF_w; stop words already in the general stop word list are removed before the term frequency is calculated, so as to locate judicial-field stop words; the top K_1 words after sorting are taken as professional-term stop words; in this embodiment K_1 = 1000;
step 2.2: after removing the professional-term stop words from the case abstracts, the last K_2 words in the ranking are also taken as stop words; in this embodiment K_2 = 1000;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set, and take the K_3 words with the lowest IDF scores as stop words; in this embodiment K_3 = 1000.
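The stop-word construction in steps 2.1-2.3 can be sketched as below: rank words by term frequency after dropping general-purpose stop words, take the top K_1 and bottom K_2 of the ranking, then add the K_3 words with the lowest inverse document frequency. The small toy corpus and K values are for illustration only (the embodiment uses K_1 = K_2 = K_3 = 1000).

```python
import math
from collections import Counter

def build_stop_words(docs, base_stop_words, k1, k2, k3):
    """docs: list of token lists (already-segmented case abstracts)."""
    # Steps 2.1-2.2: rank by term frequency after removing general stop words;
    # the k1 most frequent and k2 least frequent words become stop words.
    tf = Counter(w for doc in docs for w in doc if w not in base_stop_words)
    ranked = [w for w, _ in tf.most_common()]
    stop = set(ranked[:k1]) | set(ranked[-k2:] if k2 else [])
    # Step 2.3: words with the lowest IDF, i.e. words appearing in almost
    # every abstract, are also stop words.
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w not in base_stop_words)
    idf = {w: math.log(n / (1 + df[w])) for w in df}
    stop |= set(sorted(idf, key=idf.get)[:k3])
    return stop

docs = [["甲", "乙", "丙"], ["甲", "乙", "丁"], ["甲", "戊", "己"]]
stop_words = build_stop_words(docs, set(), k1=1, k2=1, k3=1)
```

In the toy corpus, "甲" appears in every document, so it is picked up both by the frequency ranking and by the lowest-IDF rule.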
Step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
After the case abstracts with irrelevant words filtered out are obtained, each remaining word in them is converted into a word vector. The word2vec algorithm, one of the methods for converting words into word vectors, is used; it comprises two word vector generation models, CBOW and Skip-gram. The CBOW model, shown in fig. 3, predicts the center word from its context words, specifically:
step 3.1: set the window size to 5, i.e. 4 context words; input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: the resulting vectors v_{t+j} are added and averaged to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden layer vector is transposed and multiplied by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i denotes the i-th element of P_out, yielding a |V|-dimensional probability distribution in which each dimension represents a word; the word indicated by the dimension index with the maximum probability is the predicted center word;
step 3.6: the predicted center word vector is compared with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: the center word vector with the smallest error is output and stored as the word vector of the center word;
step 3.8: steps 3.1-3.7 are repeated so that each word in the filtered case abstracts is converted into a word vector in turn.
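A numeric sketch of one CBOW forward pass (steps 3.1-3.5) in NumPy follows. The tiny vocabulary, embedding dimension, and random weights are illustrative assumptions, and the training loop that updates W_in and W_out from the prediction error (steps 3.6-3.7) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                        # |V| words, d-dimensional word vectors (toy sizes)
W_in = rng.normal(size=(V, d))     # input weight matrix W_in, size |V| x d
W_out = rng.normal(size=(d, V))    # output weight matrix W_out, size d x |V|

context_ids = [0, 1, 3, 4]         # indices of the four context words (window = 5)

# Steps 3.2-3.3: multiplying a one-hot vector by W_in just selects a row,
# so we index the rows directly, then average them into the hidden layer.
v_projection = W_in[context_ids].mean(axis=0)   # v_PROJECTION, shape (d,)

# Step 3.4: project the hidden vector through the output weight matrix.
p_out = v_projection @ W_out                    # P_out, shape (|V|,)

# Step 3.5: softmax turns P_out into a probability distribution over the vocabulary.
probs = np.exp(p_out) / np.exp(p_out).sum()
predicted_center = int(np.argmax(probs))        # index of the predicted center word
```

In training, the cross-entropy between `probs` and the actual center word's one-hot vector would drive the updates of both weight matrices.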
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
After the word vectors of the keywords in the case abstracts are obtained, all word vectors must be clustered; this patent adopts K-means clustering. In the clustering process the number of clusters directly affects the extraction of case features from the case abstracts, so to ensure the accuracy of feature extraction, the optimal number of clusters is determined iteratively, and finally the cluster vector of each case abstract in the test data set is output. Specifically:
step 4.1: preset the range over which the number of clusters k varies; in this embodiment k ranges from 5 to 15;
step 4.2: cluster all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeat steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors) and compute the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; compare the distances from P to each W and place the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract now belongs to a cluster; count the number of a case abstract's words in each cluster to form a k-dimensional vector, which serves as the feature vector of that case abstract; finally the cluster distribution of the case abstracts is output in cluster order (0, 1, ..., k).
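Steps 4.2-4.4 can be sketched as below: run K-means over all word vectors, then turn one abstract into a k-dimensional histogram counting how many of its words fall in each cluster. The plain K-means loop, random toy vectors, and fixed k = 5 are illustrative assumptions (the embodiment iterates k from 5 to 15).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: random initial centers, then assign/update iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every vector to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):            # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def abstract_cluster_vector(word_vectors, centers):
    """Steps 4.3-4.4: count how many of one abstract's words fall in each cluster."""
    counts = np.zeros(len(centers), dtype=int)
    for v in word_vectors:
        dists = np.sqrt(((centers - v) ** 2).sum(axis=1))   # d(P, W) per cluster
        counts[int(np.argmin(dists))] += 1                  # nearest cluster wins
    return counts

rng = np.random.default_rng(1)
word_vecs = rng.normal(size=(30, 4))          # stand-ins for word2vec vectors
centers = kmeans(word_vecs, k=5)
doc_vec = abstract_cluster_vector(word_vecs[:8], centers)   # one abstract's 8 words
```

The resulting `doc_vec` is the k-dimensional feature vector of one case abstract; stacking these vectors over all abstracts gives the cluster distribution of step 5.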
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The case feature extraction method was tested on 1364 cases in total from a municipal-level procuratorate and 2 subordinate district procuratorates, and the cluster distribution was calculated as the feature distribution, as shown in fig. 4.
Claims (4)
1. The case feature extraction method based on word vector clustering is characterized by comprising the following steps of:
step 1: text preprocessing is carried out on the case abstract text;
creating a training data set from the case abstract text data, and performing Chinese word segmentation on it with the dictionary-based bidirectional maximum matching method (Bi-MM);
step 2: marking the parts of speech of the case abstract after word segmentation, and filtering stop words in the case abstract;
step 2.1: calculating the term frequency TF_w of each word in the segmented training data set and sorting the words by TF_w, stop words in the stop word list being removed before the term frequency is calculated, and taking the top K_1 words after sorting as professional-term stop words;
step 2.2: after removing the professional-term stop words from the case abstracts, taking the last K_2 words in the ranking as stop words;
step 2.3: calculating the inverse document frequency IDF of each word in the segmented training data set, and taking the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors; word2vec algorithm is adopted for generating word vectors, and CBOW model is adopted; predicting the center word according to the context word of the center word;
step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation; adopting K-means clustering, iteratively determining the optimal cluster number, and finally outputting the cluster vector of each case abstract of the test data set;
step 4.1: presetting a change range of a cluster k;
step 4.2: clustering all word vectors with the K-means method to obtain the center vector W_1, W_2, ..., W_k of each cluster;
step 4.3: starting from the first case abstract of the training data set, repeating steps 1-3 (word segmentation, filtering of irrelevant words, conversion into word vectors), and computing the Euclidean distance between the word vector P of each word in the case abstract and each center vector W: d(P, W) = sqrt(Σ_i (p_i - w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; comparing the distances from P to each W and placing the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract belonging to a cluster, counting the number of each case abstract's words in each cluster to form a k-dimensional vector, taking the k-dimensional vector as the feature vector of each case abstract, and finally outputting the cluster distribution of the case abstracts in cluster order (0, 1, ..., k);
step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
2. The case feature extraction method based on word vector clustering according to claim 1, wherein the bidirectional maximum matching method in step 1 compares the word segmentation result obtained by the forward maximum matching method with the result obtained by the reverse maximum matching method and, according to the maximum matching principle, takes the longest character string matched in the dictionary as a word; the word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters and digits are treated as a single word.
3. The case feature extraction method based on word vector clustering according to claim 1, wherein in the word segmentation processing in step 1 a hash table structure is used to define the dictionary; the word segmentation dictionary is encoded in Unicode and uses a three-layer hash structure, with the first layer indexed by the Unicode code point of the first character of a word, the second layer by the number of characters in the word, and the third layer by a value computed from the Unicode code points of all characters in the word.
4. The case feature extraction method based on word vector clustering of claim 1, wherein the step 3 specifically includes:
step 3.1: setting the window size to 5, i.e. 4 context words, and inputting the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: letting the input layer matrix be W_in of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors, each context one-hot vector being multiplied by the input layer matrix to obtain its word vector: v_{t+j} = W_in^T · w_{t+j}, j ∈ {-2, -1, 1, 2};
step 3.3: adding and averaging the resulting vectors v_{t+j} to form the hidden layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: transposing the hidden layer vector and multiplying it by the output weight matrix W_out of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out represents the output vector;
step 3.5: passing the 1×|V| vector P_out obtained in step 3.4 through the softmax activation function σ(P_out)_i = exp(P_out,i) / Σ_j exp(P_out,j), where i represents the i-th element of P_out, to obtain a |V|-dimensional probability distribution in which each dimension represents a word, the word indicated by the dimension index with the maximum probability being the predicted center word;
step 3.6: comparing the predicted center word vector with the one-hot vector of the actual center word, the smaller the error, the more accurately the model predicts the center word;
step 3.7: outputting and storing the center word vector with the smallest error as the word vector of the center word;
step 3.8: repeating steps 3.1-3.7 so that each word in the filtered case abstracts is converted into a word vector in turn.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525578.9A CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525578.9A CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113139061A CN113139061A (en) | 2021-07-20 |
CN113139061B (en) | 2023-07-21
Family
ID=76817023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110525578.9A Active CN113139061B (en) | 2021-05-14 | 2021-05-14 | Case feature extraction method based on word vector clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139061B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779936A (en) * | 2021-09-16 | 2021-12-10 | 西华师范大学 | Network asset data processing method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243129A (en) * | 2015-09-30 | 2016-01-13 | 清华大学深圳研究生院 | Commodity property characteristic word clustering method |
CN106294319A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | One is combined related cases recognition methods |
CN106599029A (en) * | 2016-11-02 | 2017-04-26 | 焦点科技股份有限公司 | Chinese short text clustering method |
WO2017076205A1 (en) * | 2015-11-04 | 2017-05-11 | 陈包容 | Method and apparatus for obtaining reply prompt content for chat start sentence |
CN107516110A (en) * | 2017-08-22 | 2017-12-26 | 华南理工大学 | A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding |
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
CN108733669A (en) * | 2017-04-14 | 2018-11-02 | 优路(北京)信息科技有限公司 | A kind of personalized digital media content recommendation system and method based on term vector |
CN109325231A (en) * | 2018-09-21 | 2019-02-12 | 中山大学 | A kind of method that multi task model generates term vector |
CN109783643A (en) * | 2019-01-09 | 2019-05-21 | 北京一览群智数据科技有限责任公司 | A kind of approximation sentence recommended method and device |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111159406A (en) * | 2019-12-30 | 2020-05-15 | 内蒙古工业大学 | Big data text clustering method and system based on parallel improved K-means algorithm |
Worldwide Applications (1)
- 2021-05-14 CN CN202110525578.9A patent/CN113139061B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243129A (en) * | 2015-09-30 | 2016-01-13 | 清华大学深圳研究生院 | Commodity property characteristic word clustering method |
WO2017076205A1 (en) * | 2015-11-04 | 2017-05-11 | 陈包容 | Method and apparatus for obtaining reply prompt content for chat start sentence |
CN106294319A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | One is combined related cases recognition methods |
CN106599029A (en) * | 2016-11-02 | 2017-04-26 | 焦点科技股份有限公司 | Chinese short text clustering method |
CN108733669A (en) * | 2017-04-14 | 2018-11-02 | 优路(北京)信息科技有限公司 | A kind of personalized digital media content recommendation system and method based on term vector |
CN107516110A (en) * | 2017-08-22 | 2017-12-26 | 华南理工大学 | A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding |
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
CN109325231A (en) * | 2018-09-21 | 2019-02-12 | 中山大学 | A kind of method that multi task model generates term vector |
CN109783643A (en) * | 2019-01-09 | 2019-05-21 | 北京一览群智数据科技有限责任公司 | A kind of approximation sentence recommended method and device |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111159406A (en) * | 2019-12-30 | 2020-05-15 | 内蒙古工业大学 | Big data text clustering method and system based on parallel improved K-means algorithm |
Non-Patent Citations (5)
Title |
---|
Fuzzy bag-of-words model for document representation; Rui Zhao et al.; IEEE Transactions on Fuzzy Systems; Vol. 26, No. 2; 794-804 *
Research on recommendation models based on matrix factorization and review embedding; Zhang Jiahui et al.; Journal of Zhejiang Sci-Tech University (Natural Science Edition); Vol. 41, No. 1; 79-91 *
Word-vector evaluation for air-ground communication based on clustering and similarity computation; Xiang Qian; Computer Technology and Development; Vol. 30, No. 9; 137-142 *
A Web service representation method fusing multi-dimensional information; Zhang Xiangping et al.; Journal of Frontiers of Computer Science and Technology; Vol. 16, No. 7; 1561-1569 *
Microblog topic discovery combining word vectors and keyword extraction; Wang Liping, Zhao Hui; Modern Computer; No. 23; 3-9 *
Also Published As
Publication number | Publication date |
---|---|
CN113139061A (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN107577785B (en) | Hierarchical multi-label classification method suitable for legal identification | |
CN109918666B (en) | Chinese punctuation mark adding method based on neural network | |
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN112732934B (en) | Power grid equipment word segmentation dictionary and fault case library construction method | |
CN108519971B (en) | Cross-language news topic similarity comparison method based on parallel corpus | |
CN109670014B (en) | Paper author name disambiguation method based on rule matching and machine learning | |
CN111027323A (en) | Entity nominal item identification method based on topic model and semantic analysis | |
CN113988053A (en) | Hot word extraction method and device | |
Gunaseelan et al. | Automatic extraction of segments from resumes using machine learning | |
CN113139061B (en) | Case feature extraction method based on word vector clustering | |
CN114265935A (en) | Science and technology project establishment management auxiliary decision-making method and system based on text mining | |
CN112559741B (en) | Nuclear power equipment defect record text classification method, system, medium and electronic equipment | |
CN116629258B (en) | Structured analysis method and system for judicial document based on complex information item data | |
Yang et al. | Court similar case recommendation model based on word embedding and word frequency | |
Desyatirikov et al. | Computer analysis of text tonality based on the JSM method | |
CN110717015B (en) | Neural network-based polysemous word recognition method | |
CN114595324A (en) | Method, device, terminal and non-transitory storage medium for power grid service data domain division | |
CN113239277A (en) | Probability matrix decomposition recommendation method based on user comments | |
CN113535928A (en) | Service discovery method and system of long-term and short-term memory network based on attention mechanism | |
CN112000782A (en) | Intelligent customer service question-answering system based on k-means clustering algorithm | |
Nyandag et al. | Keyword extraction based on statistical information for Cyrillic Mongolian script | |
CN117688354B (en) | Text feature selection method and system based on evolutionary algorithm | |
Yao et al. | Chinese long text summarization using improved sequence-to-sequence lstm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||