CN113139061B - Case feature extraction method based on word vector clustering - Google Patents

Case feature extraction method based on word vector clustering

Info

Publication number
CN113139061B
CN113139061B (application CN202110525578.9A)
Authority
CN
China
Prior art keywords
word
vector
case
cluster
abstract
Prior art date
Legal status
Active
Application number
CN202110525578.9A
Other languages
Chinese (zh)
Other versions
CN113139061A (en)
Inventor
栗伟
闵新
谢维冬
陈强
Original Assignee
东北大学
Priority date
Filing date
Publication date
Application filed by 东北大学
Priority to CN202110525578.9A
Publication of CN113139061A
Application granted
Publication of CN113139061B
Status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a case feature extraction method based on word vector clustering, and relates to the technical field of machine learning. By analyzing the case abstracts in historical case data, the method constructs a hash-table-based word segmentation method, builds a stop word list specific to the judicial field for stop word filtering, generates case abstract word vectors with the word2vec method, clusters the word vectors, and finally produces the cluster distribution of the case abstracts. By using this method to analyze a large number of historical case abstracts, the different key information of cases can be accurately extracted, cases of the same type can be further differentiated, and a reference is provided for objectively and quantitatively predicting the workload of each case. The cluster distributions of cases from different procuratorates can also be compared and analyzed, providing a reference for analyzing the overall case-handling capacity of each procuratorate and improving its self-learning capability.

Description

Case feature extraction method based on word vector clustering
Technical Field
The invention relates to the technical field of machine learning, in particular to a case feature extraction method based on word vector clustering.
Background
With the rapid development of modern technology, the role of information technology in procuratorial work is increasingly prominent. In the case-assignment process, case feature extraction directly determines how reasonably cases are distributed. Many scholars have studied how to accurately extract case features from case abstracts, but current research remains at the qualitative-analysis stage: beyond listing key case information, no method for extracting case features from case abstracts has been given. Even within the same case type, cases with different key information consume different amounts of work. Extracting the case features in the case abstracts allows cases to be distinguished in more detail, improves the accuracy of workload estimates for different cases of the same type, and provides a reference for workload evaluation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a case feature extraction method based on word vector clustering, which not only considers the occurrence frequency of words, but also considers the context relation in the case abstract, thereby extracting more effective key feature clusters.
The technical scheme of the invention is that the case feature extraction method based on word vector clustering comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM).
The bidirectional maximum matching method compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching and, following the maximum-matching principle, takes the longest character string matched in the dictionary as a word. The word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters or digits are treated as a single word.
In the word segmentation process the dictionary is stored as a hash table in unicode encoding with three layers: the first layer is indexed by the unicode value of the word's first character, the second layer by the number of characters in the word, and the third layer by a hash computed over the unicode values of all the characters in the word.
Step 2: part-of-speech tagging is carried out on the case abstract after word segmentation, and stop words in the case abstract are filtered, and the method specifically comprises the following steps:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set, eliminate words already in the base stop word list before the calculation, sort the remaining words by TF_w, and take the top K_1 words as terminology stop words;
step 2.2: remove those terminology stop words from the case abstracts and take the last K_2 words of the ranking as further terminology stop words;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set and take the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
Word vectors are generated with the word2vec algorithm using the CBOW model, which predicts the center word from its context words; the steps are as follows:
step 3.1: set the window size to 5, i.e. 4 context words, and input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input-layer matrix be W_in, of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input-layer matrix, v_t = W_in^T · w_t, which selects the corresponding row of W_in;
step 3.3: the resulting vectors v_t are summed and averaged to form the hidden-layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden-layer vector is transposed and multiplied by the output weight matrix W_out, of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function P(i) = exp(P_out,i) / Σ_j exp(P_out,j), where i indexes the elements of P_out, producing a |V|-dimensional probability distribution; each dimension represents a word, and the word indicated by the dimension index with the largest probability is the predicted center word;
step 3.6: compare the predicted center-word vector with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: output and store the center-word vector with the smallest error as the word vector of the center word;
step 3.8: repeat steps 3.1-3.7 so that every word in the filtered case abstracts is converted into a word vector in turn.
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
K-means clustering is used; the optimal number of clusters is determined iteratively, and the cluster vector of every case abstract in the test data set is finally output. The method specifically comprises the following steps:
step 4.1: presetting a change range of a cluster k;
step 4.2: clustering all word vectors by adopting a K-means clustering method, and obtaining a central vector W of each cluster 1 ,W 2 ,...,W k
Step 4.3: repeating steps 1-3 from the first case abstract of the training data set, separating words, filtering irrelevant words, converting the irrelevant words into word vectors, and calculating the Euclidean distance between the word vector P and the center vector W of each word in the case abstractWherein p is i Is each element, w, in the word vector P i Is each element of the center vector W, compares the Euclidean distance between P and W, and places the word into the cluster corresponding to the minimum distance;
step 4.4: each word in each case abstract belongs to a cluster, the number of elements of each case abstract in each cluster is counted, finally a k-dimensional vector is formed, the k-dimensional vector is used as a characteristic vector of each case abstract, and finally the cluster-like distribution of the case abstract is output according to the sequence (0, 1, the.
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The beneficial effects generated by adopting the technical method are as follows:
the invention provides a case feature extraction method based on word vector clustering, which is characterized in that a word segmentation method based on a hash table is constructed by analyzing case abstracts in historical case data, a stop word table special for the judicial field is constructed to carry out stop word filtering, the case abstracts word vectors are generated by a word2vec method, the word vectors are clustered, and finally, the cluster distribution of the case abstracts is generated. By utilizing the case feature extraction method to analyze a large number of historical case summaries, different key information of the cases can be accurately extracted, further differentiation of the cases of the same type is realized, and a reference is provided for objectively and quantitatively predicting the workload of each case.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a dictionary hash table in accordance with an embodiment of the present invention;
FIG. 3 is a schematic view of a CBOW model according to an embodiment of the present invention;
FIG. 4 shows the cluster distribution in an embodiment of the invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A case feature extraction method based on word vector clustering, as shown in figure 1, comprises the following steps:
step 1: text preprocessing is carried out on the case abstract text;
A training data set is created from the case abstract text data, and Chinese word segmentation is performed on it with the dictionary-based bidirectional maximum matching method (Bi-MM).
The bidirectional maximum matching method compares the segmentation obtained by forward maximum matching with that obtained by reverse maximum matching and, following the maximum-matching principle, takes the longest character string matched in the dictionary as a word. The word segmentation dictionary contains judicial-domain proper nouns, and consecutive English characters or digits are treated as a single word.
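The bidirectional maximum matching step can be sketched as follows. This is a minimal illustration, not the patent's exact implementation: the toy dictionary and the tie-breaking rule (prefer the segmentation with fewer words, a common Bi-MM convention) are assumptions.

```python
# A minimal sketch of dictionary-based bidirectional maximum matching (Bi-MM).

def max_match(text, dictionary, max_len, reverse=False):
    """Greedy maximum matching; scans right-to-left when reverse=True."""
    words = []
    while text:
        if reverse:
            # Try the longest suffix of `text` that is in the dictionary.
            for size in range(min(max_len, len(text)), 0, -1):
                piece = text[-size:]
                if size == 1 or piece in dictionary:
                    words.append(piece)
                    text = text[:-size]
                    break
        else:
            # Try the longest prefix of `text` that is in the dictionary.
            for size in range(min(max_len, len(text)), 0, -1):
                piece = text[:size]
                if size == 1 or piece in dictionary:
                    words.append(piece)
                    text = text[size:]
                    break
    return words[::-1] if reverse else words

def bi_mm(text, dictionary, max_len=20):
    fwd = max_match(text, dictionary, max_len, reverse=False)
    bwd = max_match(text, dictionary, max_len, reverse=True)
    # Assumed tie-breaking heuristic: prefer the result with fewer words.
    return fwd if len(fwd) <= len(bwd) else bwd
```

The `max_len=20` default mirrors the maximum word length of 20 reported for the dictionary below.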
In the word segmentation process, to improve segmentation speed, the dictionary is stored as a hash table in unicode encoding with three layers: the first layer is indexed by the unicode value of the word's first character, the second layer by the number of characters in the word, and the third layer by a hash computed over the unicode values of all the characters in the word.
A schematic diagram of the hash structure is shown in fig. 2. The unicode encoding of Chinese characters runs from 0x4E00 to 0x9FA5, 20902 characters in total. The hash function of the first-layer table is f(x) = x_1 − 0x4E00, where x is a word and x_1 is the unicode encoding of its first character. The hash value of the second layer is the length of the word; statistics show that the maximum word length is below 20, so this value ranges from 1 to 20. The hash function of the third layer is computed over the unicode encodings of all the characters, f(x) = Σ_{i=1}^{n} x_i, where n is the length of the word and x_i is the unicode encoding of its i-th character.
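The three-layer hash dictionary can be sketched as follows. The third-layer bucket count `NUM_BUCKETS` is an assumption (the patent does not state one), and summing unicode values is one reading of "calculates the unicode coding value of all the characters".

```python
# Illustrative sketch of the three-layer hash dictionary: layer 1 keys on the
# first character's unicode value minus 0x4E00, layer 2 on the word length,
# layer 3 on a hash over the unicode values of all characters.

NUM_BUCKETS = 1021  # assumed third-layer table size (a prime), not from the patent

def layer1(word):
    # First-layer hash: unicode value of the first character minus 0x4E00.
    return ord(word[0]) - 0x4E00

def layer3(word):
    # Third-layer hash: sum of the unicode values of every character.
    return sum(ord(ch) for ch in word) % NUM_BUCKETS

class SegmentDict:
    def __init__(self):
        self.table = {}

    def add(self, word):
        bucket = (self.table.setdefault(layer1(word), {})
                            .setdefault(len(word), {})      # second layer: word length
                            .setdefault(layer3(word), set()))
        bucket.add(word)

    def __contains__(self, word):
        try:
            return word in self.table[layer1(word)][len(word)][layer3(word)]
        except KeyError:
            return False
```

A lookup touches at most three small dictionaries plus one bucket, which is what makes the maximum-matching scans above fast.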
Step 2: part-of-speech tagging is carried out on the case abstract after word segmentation, and stop words in the case abstract, such as character names, place names, english characters, numbers and the like, are filtered;
the term "Stop word" refers to that in information retrieval, in order to save storage space and improve searching efficiency, certain Words or Words are automatically filtered before or after processing natural language text or data, and these Words or Words are called Stop Words. Because no stop word list related to judicial related fields exists in the current natural language processing, the patent constructs the stop word list related to judicial fields on the basis of the original NLP stop word list, and specifically comprises the following steps:
step 2.1: calculate the term frequency TF_w of each word in the segmented training data set, eliminate words already in the base stop word list before the calculation so that judicial-field stop words can be located, sort the remaining words by TF_w, and take the top K_1 words as terminology stop words; in this embodiment K_1 = 1000;
step 2.2: remove those terminology stop words from the case abstracts and take the last K_2 words of the ranking as further terminology stop words; in this embodiment K_2 = 1000;
step 2.3: calculate the inverse document frequency (IDF) of each word in the segmented training data set and take the K_3 words with the lowest IDF scores as stop words; in this embodiment K_3 = 1000.
Step 3: generating word vectors, carrying out vector representation on the filtered single words, and converting the single words into word vectors;
After the case abstracts have been filtered of irrelevant words, each remaining word is converted into a word vector. The word2vec algorithm, one of the methods for converting words into word vectors, provides two generation models, CBOW and Skip-gram; the CBOW model, shown in fig. 3, predicts the center word from its context words, specifically as follows:
step 3.1: set the window size to 5, i.e. 4 context words, and input the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
step 3.2: let the input-layer matrix be W_in, of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input-layer matrix, v_t = W_in^T · w_t, which selects the corresponding row of W_in;
step 3.3: the resulting vectors v_t are summed and averaged to form the hidden-layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
step 3.4: the hidden-layer vector is transposed and multiplied by the output weight matrix W_out, of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: the 1×|V| vector P_out obtained in step 3.4 is passed through the softmax activation function P(i) = exp(P_out,i) / Σ_j exp(P_out,j), where i indexes the elements of P_out, producing a |V|-dimensional probability distribution; each dimension represents a word, and the word indicated by the dimension index with the largest probability is the predicted center word;
step 3.6: compare the predicted center-word vector with the one-hot vector of the actual center word; the smaller the error, the more accurately the model predicts the center word;
step 3.7: output and store the center-word vector with the smallest error as the word vector of the center word;
step 3.8: repeat steps 3.1-3.7 so that every word in the filtered case abstracts is converted into a word vector in turn.
Step 4: clustering all word vectors, calculating the word class cluster distribution condition in each case abstract, and converting each case abstract into cluster vector representation;
After the word vectors of the keywords in the case abstracts are obtained, all the word vectors must be clustered; this patent uses K-means clustering. In word vector clustering the number of clusters directly affects how well case features are extracted from the case abstracts, so to ensure accuracy the optimal number of clusters is determined iteratively, and the cluster vector of every case abstract in the test data set is finally output. The method specifically comprises the following steps:
step 4.1: preset the range of the cluster number k; in this embodiment the range is set to 5-15;
step 4.2: clustering all word vectors by adopting a K-means clustering method, and obtaining a central vector W of each cluster 1 ,W 2 ,...,W k
Step 4.3: repeating steps 1-3 from the first case abstract of the training data set, separating words, filtering irrelevant words, converting the irrelevant words into word vectors, and calculating the Euclidean distance between the word vector P and the center vector W of each word in the case abstract Wherein p is i Is each element, w, in the word vector P i Is each element of the center vector W, compares the Euclidean distance between P and W, and places the word into the cluster corresponding to the minimum distance;
step 4.4: each word in each case abstract belongs to a cluster, the number of elements of each case abstract in each cluster is counted, finally a k-dimensional vector is formed, the k-dimensional vector is used as a characteristic vector of each case abstract, and finally the cluster-like distribution of the case summaries is output according to the sequence (0, 1, …, k) of the clusters.
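Steps 4.3-4.4 can be sketched as follows: assign each word vector of a case abstract to its nearest cluster center by Euclidean distance, then count cluster membership to form the k-dimensional feature vector. The 2-dimensional toy vectors and hand-picked centers stand in for real word2vec output and trained K-means centers.

```python
# Sketch of converting one case abstract into its cluster (feature) vector.
import math

def euclidean(p, w):
    # D(P, W) = sqrt(sum_i (p_i - w_i)^2)
    return math.sqrt(sum((pi - wi) ** 2 for pi, wi in zip(p, w)))

def cluster_vector(word_vectors, centers):
    """Count how many word vectors fall nearest each cluster center."""
    counts = [0] * len(centers)
    for p in word_vectors:
        nearest = min(range(len(centers)),
                      key=lambda c: euclidean(p, centers[c]))
        counts[nearest] += 1
    return counts

centers = [(0.0, 0.0), (1.0, 1.0)]            # toy cluster centers (k = 2)
summary = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8)]  # toy word vectors of one abstract
# One word falls nearest cluster 0, two fall nearest cluster 1.
```

The resulting count vector, output in cluster order, is exactly the k-dimensional feature vector of step 4.4.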
Step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
The method was tested on 1364 cases in total from a prefecture-level procuratorate and 2 subordinate district procuratorates, and the cluster distribution was calculated as the feature distribution, as shown in fig. 4.

Claims (4)

1. The case feature extraction method based on word vector clustering is characterized by comprising the following steps of:
step 1: text preprocessing is carried out on the case abstract text;
creating a training data set from the case abstract text data and performing Chinese word segmentation on it with the dictionary-based bidirectional maximum matching (Bi-MM) method;
step 2: marking the parts of speech of the case abstract after word segmentation, and filtering stop words in the case abstract;
step 2.1: calculating the term frequency TF_w of each word in the segmented training data set, eliminating words already in the base stop word list before the calculation, sorting the remaining words by TF_w, and taking the top K_1 words as terminology stop words;
step 2.2: removing those terminology stop words from the case abstracts and taking the last K_2 words of the ranking as further terminology stop words;
step 2.3: calculating the inverse document frequency (IDF) of each word in the segmented training data set and taking the K_3 words with the lowest IDF scores as stop words;
step 3: generating word vectors: the filtered individual words are represented as vectors and converted into word vectors; the word2vec algorithm with the CBOW model is used, predicting the center word from its context words;
step 4: clustering all the word vectors, calculating the word cluster distribution in each case abstract, and converting each case abstract into a cluster vector representation; K-means clustering is used, the optimal number of clusters is determined iteratively, and the cluster vector of each case abstract of the test data set is finally output;
step 4.1: presetting a change range of a cluster k;
step 4.2: clustering all word vectors by adopting a K-means clustering method, and obtaining a central vector W of each cluster 1 ,W 2 ,...,W k
Step 4.3: starting from the first case abstract of the training data set, repeating steps 1-3 (segmenting the words, filtering the irrelevant words, converting the remainder into word vectors), and for each word vector P in the case abstract calculating the Euclidean distance to every cluster center W, D(P, W) = sqrt(Σ_i (p_i − w_i)²), where p_i is each element of the word vector P and w_i is each element of the center vector W; comparing the distances and placing the word into the cluster with the minimum distance;
step 4.4: every word in each case abstract belongs to a cluster; counting the number of words of each case abstract that fall in every cluster to form a k-dimensional vector, using this vector as the feature vector of the case abstract, and finally outputting the cluster distribution of the case abstracts in cluster order (0, 1, …, k);
step 5: outputting a case abstract data set of key information to be extracted according to cluster vectors, wherein the dimension of the cluster vectors is the number of clustered clusters, and the cluster vector value represents the distribution condition of words in the current case abstract in each cluster, so that the extraction of case features is realized.
2. The case feature extraction method based on word vector clustering according to claim 1, wherein the bidirectional maximum matching method in step 1 is to compare a word segmentation result obtained by a forward maximum matching method with a result obtained by a reverse maximum matching method, and find the longest matching character string in a dictionary as a word according to a maximum matching principle; a word dictionary is a dictionary containing judicial domain proper nouns in which consecutive english characters and digits are considered as a single word.
3. The case feature extraction method based on word vector clustering according to claim 1, wherein in the word segmentation processing in step 1, a hash table structure is used to define a dictionary, the word segmentation dictionary adopts a unicode coding mode, a three-layer hash structure is selected, the unicode coding value of the first word in the first layer dictionary is used, the second layer is the number of characters in the word, and the unicode coding value of all the characters in the word is calculated in the third layer.
4. The case feature extraction method based on word vector clustering of claim 1, wherein the step 3 specifically includes:
step 3.1: setting the window size to 5, i.e. 4 context words, and inputting the one-hot vectors w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} of the context words of the center word w_t;
Step 3.2: letting the input-layer matrix be W_in, of size |V|×d, where |V| is the size of the dictionary and d is the dimension of the word vectors; each context one-hot vector is multiplied by the input-layer matrix, v_t = W_in^T · w_t, selecting the corresponding row of W_in;
Step 3.3: summing and averaging the resulting vectors v_t to form the hidden-layer vector v_PROJECTION = (v_{t-2} + v_{t-1} + v_{t+1} + v_{t+2}) / 4, of size d×1;
Step 3.4: transposing the hidden-layer vector and multiplying it by the output weight matrix W_out, of size d×|V|: P_out = v_PROJECTION^T · W_out, where P_out denotes the output vector;
step 3.5: passing the 1×|V| vector P_out obtained in step 3.4 through the softmax activation function P(i) = exp(P_out,i) / Σ_j exp(P_out,j), where i indexes the elements of P_out, to obtain a |V|-dimensional probability distribution; each dimension represents a word, and the word indicated by the dimension index with the largest probability is the predicted center word (target word);
step 3.6: comparing the central word vector with the one-hot vector of the actual central word, wherein the smaller the error is, the more accurate the word vector of the model prediction central word is;
step 3.7, outputting and storing the central word vector with the smallest error as the word vector of the central word;
step 3.8: repeating the steps 3.1-3.7, and sequentially outputting each word in the filtered case abstract to be converted into word vectors.
CN202110525578.9A 2021-05-14 2021-05-14 Case feature extraction method based on word vector clustering Active CN113139061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110525578.9A CN113139061B (en) 2021-05-14 2021-05-14 Case feature extraction method based on word vector clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110525578.9A CN113139061B (en) 2021-05-14 2021-05-14 Case feature extraction method based on word vector clustering

Publications (2)

Publication Number Publication Date
CN113139061A CN113139061A (en) 2021-07-20
CN113139061B true CN113139061B (en) 2023-07-21

Family

ID=76817023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110525578.9A Active CN113139061B (en) 2021-05-14 2021-05-14 Case feature extraction method based on word vector clustering

Country Status (1)

Country Link
CN (1) CN113139061B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779936A (en) * 2021-09-16 2021-12-10 西华师范大学 Network asset data processing method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity attribute feature word clustering method
CN106294319A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 A method for identifying combined related cases
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method
WO2017076205A1 (en) * 2015-11-04 2017-05-11 陈包容 Method and apparatus for obtaining reply prompt content for chat start sentence
CN107516110A (en) * 2017-08-22 2017-12-26 华南理工大学 A semantic clustering method for medical question-answer text based on integrated convolutional encoding
CN107526834A (en) * 2017-09-05 2017-12-29 北京工商大学 An improved word2vec method jointly training part of speech and word-order correlation factors
CN108733669A (en) * 2017-04-14 2018-11-02 优路(北京)信息科技有限公司 A personalized digital media content recommendation system and method based on word vectors
CN109325231A (en) * 2018-09-21 2019-02-12 中山大学 A method for generating word vectors with a multi-task model
CN109783643A (en) * 2019-01-09 2019-05-21 北京一览群智数据科技有限责任公司 A similar-sentence recommendation method and device
CN110597949A (en) * 2019-08-01 2019-12-20 湖北工业大学 Court similar case recommendation model based on word vectors and word frequency
CN111159406A (en) * 2019-12-30 2020-05-15 内蒙古工业大学 Big data text clustering method and system based on parallel improved K-means algorithm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Fuzzy bag-of-words model for document representation; Rui Zhao et al.; IEEE Transactions on Fuzzy Systems; Vol. 26, No. 2; 794-804 *
Research on a recommendation model based on matrix factorization and review embedding representation; Zhang Jiahui et al.; Journal of Zhejiang Sci-Tech University (Natural Sciences Edition); Vol. 41, No. 1; 79-91 *
Word-vector evaluation for air-ground communication based on clustering and similarity computation; Xiang Qian; Computer Technology and Development; Vol. 30, No. 9; 137-142 *
A Web service representation method fusing multi-dimensional information; Zhang Xiangping et al.; Journal of Frontiers of Computer Science and Technology; Vol. 16, No. 7; 1561-1569 *
Microblog topic discovery combining word vectors and keyword extraction; Wang Liping; Zhao Hui; Modern Computer, No. 23; 3-9 *

Also Published As

Publication number Publication date
CN113139061A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN112732934B (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN111027323A (en) Entity nominal item identification method based on topic model and semantic analysis
CN113988053A (en) Hot word extraction method and device
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
CN113139061B (en) Case feature extraction method based on word vector clustering
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN112559741B (en) Nuclear power equipment defect record text classification method, system, medium and electronic equipment
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
Yang et al. Court similar case recommendation model based on word embedding and word frequency
Desyatirikov et al. Computer analysis of text tonality based on the JSM method
CN110717015B (en) Neural network-based polysemous word recognition method
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
CN113239277A (en) Probability matrix decomposition recommendation method based on user comments
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN112000782A (en) Intelligent customer service question-answering system based on k-means clustering algorithm
Nyandag et al. Keyword extraction based on statistical information for Cyrillic Mongolian script
CN117688354B (en) Text feature selection method and system based on evolutionary algorithm
Yao et al. Chinese long text summarization using improved sequence-to-sequence lstm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant