CN109543036A - Text Clustering Method based on semantic similarity - Google Patents
Text Clustering Method based on semantic similarity
- Publication number
- CN109543036A (application number CN201811385276.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- semantic
- data
- training characteristics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention belongs to the field of big data analysis and discloses a text clustering method based on semantic similarity, which performs cluster analysis on non-standard sentences whose semantic parsing has failed in natural language understanding, thereby improving the recognition rate of semantic understanding. The method comprises: a. collecting text data and classifying it according to the results of successful parsing; b. performing n-gram (multi-gram) splitting on the classified text data to obtain a training feature set under each class; c. converting the training feature sets into space vectors based on a bag-of-words model; d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class; e. in use, calculating the similarity score between a non-standard text to be classified and the trained low-dimensional semantic model of each class; f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
Description
Technical field
The invention belongs to the field of big data analysis, and in particular relates to a text clustering method based on semantic similarity.
Background
When NLP (natural language processing) is applied in real projects today, the post-processing stage of speech recognition is affected by problems such as the speaker's possible psychological or emotional fluctuations and dialect accents. These cause formant and pitch variations, such as overly fast speech, high or low pitch, and distortion, which produce speech recognition errors, so that the true content of the user (speaker) cannot be correctly conveyed to the computer for subsequent processing. Unlike the relatively standard, templated test cases of an experimental environment, NLP in practical applications therefore encounters large batches of text that are invalid cases in which the user's true intent was not recognized, and these invalid cases lower the recognition rate of semantic understanding across the entire NLP application.
Therefore, it is necessary to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing has failed in natural language understanding, so as to improve the recognition rate of semantic understanding.
Summary of the invention
The technical problem to be solved by the present invention is to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing has failed in natural language understanding, improving the recognition rate of semantic understanding.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
A text clustering method based on semantic similarity, comprising the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. performing n-gram (multi-gram) splitting on the classified text data to obtain a training feature set under each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between a non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
As a further optimization, in step a, collecting text data and classifying it according to the results of successful parsing specifically includes:
collecting, from the log system of an actual project application, text data whose entities were parsed successfully, or collecting text data that has already been labeled, and, based on the known labeling results, dividing the text data into sets under different classes according to the number of label classes.
As a further optimization, in step b, performing n-gram splitting on the classified text data to obtain a training feature set under each class specifically includes:
performing n-gram splitting on each item in the text set under each class, so that the training feature set of the class consists of bigram phrases, trigram phrases, and the original text, and deduplicating the phrases in the training feature set.
As a further optimization, step d specifically includes:
d1. selecting tanh as the activation function of the hidden layers and the output layer of the neural network;
d2. using the space vectors of the different classes as the input of the neural network;
d3. generating the low-dimensional semantic vector model under each class by the neural network algorithm;
d4. calculating the semantic vector between the input unknown text and each low-dimensional semantic vector model;
d5. converting the similarity between the unknown text and the vector model into a posterior probability by the softmax function, and minimizing the loss function by maximum likelihood estimation;
d6. finally, making the vector model converge by backpropagation and stochastic gradient descent.
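Sub-step d5 can be illustrated numerically. The following is a minimal sketch, not the patent's implementation: the function names, the smoothing factor `gamma`, and the example scores are all assumptions introduced for illustration. It turns one query's per-class similarity scores into posterior probabilities with softmax and computes the negative log-likelihood of the correct (positive) class, which is the quantity that maximum likelihood estimation minimizes.

```python
import math

def softmax_posterior(similarities, gamma=1.0):
    """Turn raw similarity scores into posterior probabilities with softmax."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(similarities)
    exps = [math.exp(gamma * (s - peak)) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(similarities, positive_index, gamma=1.0):
    """Loss for one query: -log P(positive class | query)."""
    probs = softmax_posterior(similarities, gamma)
    return -math.log(probs[positive_index])

# Hypothetical similarity scores of one query against three class models.
scores = [0.9, 0.2, -0.4]
probs = softmax_posterior(scores)
loss = negative_log_likelihood(scores, positive_index=0)
```

The loss is smallest when the positive class already has the highest similarity, so gradient descent on it (sub-step d6) pushes the query toward its own class model and away from the others.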
The beneficial effects of the present invention are as follows:
For non-standard sentences whose semantic parsing has failed in natural language understanding, or texts that cannot be correctly understood for objective reasons, text features are learned from existing standard data, and the semantic similarity between the failed non-standard data and the standard data samples is compared in order to cluster the non-standard data, so that corresponding features can be extracted from these irregular texts for training and parsing. The scheme of the invention helps improve the understanding of natural language text features and thereby improves the recognition rate of semantic understanding.
In addition, when constructing the training feature set, the invention uses n-gram splitting, which reduces dependence on the tokenizer while improving the generalization ability of the model, so that the semantic features of each text are learned effectively. This reduces the effect that errors in unsupervised learning have on the final data classification. The invention requires no mapping in intermediate steps, which improves the precision of the final result.
Brief description of the drawings
Fig. 1 is a flow chart of the text clustering method based on semantic similarity in the present invention.
Specific embodiments
The present invention aims to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing has failed in natural language understanding, improving the recognition rate of semantic understanding.
As shown in Fig. 1, the text clustering method based on semantic similarity in the present invention comprises the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. performing n-gram (multi-gram) splitting on the classified text data to obtain a training feature set under each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between a non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
In a specific implementation, the above scheme is carried out as follows:
1) First, collect from the log system of an actual project application the text data whose entities were parsed successfully, or collect text data that has already been labeled, and, based on the known labeling results, divide the text data into sets under several classes according to the number of label classes.
$$Y = \{y_1, y_2, \ldots, y_i, \ldots, y_n\}, \quad i \in (1, n)$$
where $Y$ is the collected text data, $n$ is the number of classes, and $y_i$ denotes the set of texts under class $i$.
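The partition in step 1) can be sketched in a few lines. This is only an illustration under stated assumptions: the `(text, label)` pair format, the function name `group_by_label`, and the sample labels are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def group_by_label(samples):
    """Partition (text, label) pairs into one text collection per class."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append(text)
    return dict(groups)

# Hypothetical labeled log entries; the labels are illustrative only.
samples = [
    ("play some music", "music"),
    ("what is the weather today", "weather"),
    ("play the next song", "music"),
]
Y = group_by_label(samples)  # one entry y_i per class i
```

Each value of `Y` plays the role of one set $y_i$, and `len(Y)` corresponds to the number of classes $n$.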
2) Perform n-gram splitting on the text data in each class set, so that the training feature set of the class consists of bigram phrases, trigram phrases, and the original text, and deduplicate the phrases in the training feature set. Processing the text into n-gram phrases in this way lets the subsequent model learn as many features of the text as possible on its own, reduces the effect of inaccurate word segmentation on text classification accuracy, and lowers the dependence on Chinese word segmentation.
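The n-gram splitting of step 2) might look like the sketch below. It assumes the text is supplied as a sequence of tokens (for Chinese, single characters would avoid a word segmenter); the patent does not state the unit of splitting, and the function name `ngram_features` is hypothetical.

```python
def ngram_features(tokens, orders=(2, 3)):
    """Return the original text plus its deduplicated bigrams and trigrams."""
    candidates = ["".join(tokens)]  # the original text itself
    for n in orders:
        for i in range(len(tokens) - n + 1):
            candidates.append("".join(tokens[i:i + n]))
    features = []
    seen = set()
    for c in candidates:
        if c not in seen:  # deduplicate while keeping first-seen order
            seen.add(c)
            features.append(c)
    return features

# Character-level example: the bigram "ab" occurs twice but is kept once.
features = ngram_features(list("abcab"))
```

Here `features` is `["abcab", "ab", "bc", "ca", "abc", "bca", "cab"]`. The training feature set of a class would be the deduplicated union of these lists over all texts in that class.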
3) Convert the training feature set under each class into space vectors using a bag-of-words model, so as to perform dimensionality reduction;
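A bag-of-words conversion as in step 3) can be sketched as follows. The vocabulary construction, the use of raw counts, and the sample features are assumptions made for illustration; the patent does not specify the weighting scheme.

```python
def build_vocabulary(feature_sets):
    """Assign an index to every distinct feature across all classes."""
    vocab = {}
    for features in feature_sets:
        for f in features:
            vocab.setdefault(f, len(vocab))
    return vocab

def bag_of_words(features, vocab):
    """Count vector over the vocabulary; unknown features are ignored."""
    vector = [0] * len(vocab)
    for f in features:
        if f in vocab:
            vector[vocab[f]] += 1
    return vector

# Hypothetical feature sets for two classes.
music = ["play", "music", "play music"]
weather = ["weather", "today", "weather today"]
vocab = build_vocabulary([music, weather])
vec = bag_of_words(["play", "today", "play", "unknown"], vocab)
```

With this vocabulary, `vec` is `[2, 0, 0, 0, 1, 0]`: two occurrences of "play", one of "today", and the unseen feature dropped.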
4) Select tanh as the activation function of the hidden layers and the output layer of the neural network; take the space vectors as the input of the neural network, and generate the low-dimensional semantic vector model doc under each class by the neural network algorithm;
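The forward pass described in step 4) can be sketched with the standard library alone. The layer sizes (6 to 4 to 2), the weight initialization range, and the function names are assumptions; the patent fixes only the tanh activation on the hidden and output layers.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Apply tanh after every layer, hidden and output alike."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)  # fixed seed so the sketch is reproducible
layers = [init_layer(6, 4, rng), init_layer(4, 2, rng)]  # assumed sizes
embedding = forward([2, 0, 0, 0, 1, 0], layers)  # bag-of-words vector in
```

The output is a low-dimensional vector whose components all lie strictly between -1 and 1 because of the final tanh.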
5) For each vector model, calculate the semantic vector between the newly input query text of unknown class and the vector model doc of each class:
where $y_q$ denotes the newly input query of unknown class, and $y_d$ is the low-dimensional semantic vector doc generated in the above steps.
The similarity between the unknown text and the vector model is converted into a posterior probability by the softmax function, and the loss function is minimized by maximum likelihood estimation:
where $doc^{+}$ is the positive sample under the query. Finally, backpropagation and stochastic gradient descent make the vector model converge, yielding a stable vector model.
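The two formulas that step 5) refers to are not reproduced in this text. One formulation consistent with the surrounding definitions of $y_q$, $y_d$, and $doc^{+}$ is given below as a sketch only: the choice of cosine similarity, the smoothing factor $\gamma$, and the candidate set $D$ of class models are assumptions and are not taken from the source.

```latex
% Similarity between the query vector and a class vector (assumed: cosine)
R(q, d) = \cos(y_q, y_d) = \frac{y_q^{\top} y_d}{\lVert y_q \rVert \, \lVert y_d \rVert}

% Softmax posterior over the candidate class models D
P(d \mid q) = \frac{\exp\bigl(\gamma R(q, d)\bigr)}{\sum_{d' \in D} \exp\bigl(\gamma R(q, d')\bigr)}

% Maximum likelihood: minimize the negative log-likelihood of the positive samples
L = -\log \prod_{(q,\, doc^{+})} P(doc^{+} \mid q)
```

Under these assumptions, minimizing $L$ raises the posterior of each query's positive sample relative to the other class models.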
6) In practical application, calculate the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class;
7) Select the class with the highest similarity score as the class of the non-standard text, and output the classification.
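Steps 6) and 7) amount to scoring the query against every class model and taking the best match. The sketch below assumes cosine similarity as the score and uses made-up two-dimensional class vectors; neither the metric nor the names `cosine` and `classify` are specified by the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(query_vector, class_vectors):
    """Score the query against each class model and pick the best class."""
    scores = {label: cosine(query_vector, vec)
              for label, vec in class_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical low-dimensional class models and query embedding.
class_vectors = {"music": [0.8, 0.1], "weather": [-0.2, 0.9]}
label, scores = classify([0.7, 0.2], class_vectors)
```

In this example the query points in nearly the same direction as the "music" vector, so `label` is `"music"`.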
Claims (4)
1. A text clustering method based on semantic similarity, characterized by comprising the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. performing n-gram (multi-gram) splitting on the classified text data to obtain a training feature set under each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between a non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
2. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step a, collecting text data and classifying it according to the results of successful parsing specifically includes:
collecting, from the log system of an actual project application, text data whose entities were parsed successfully, or collecting text data that has already been labeled, and, based on the known labeling results, dividing the text data into sets under different classes according to the number of label classes.
3. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step b, performing n-gram splitting on the classified text data to obtain a training feature set under each class specifically includes:
performing n-gram splitting on each item in the text set under each class, so that the training feature set of the class consists of bigram phrases, trigram phrases, and the original text, and deduplicating the phrases in the training feature set.
4. The text clustering method based on semantic similarity according to claim 1, characterized in that step d specifically includes:
d1. selecting tanh as the activation function of the hidden layers and the output layer of the neural network;
d2. using the space vectors of the different classes as the input of the neural network;
d3. generating the low-dimensional semantic vector model under each class by the neural network algorithm;
d4. calculating the semantic vector between the input unknown text and each low-dimensional semantic vector model;
d5. converting the similarity between the unknown text and the vector model into a posterior probability by the softmax function, and minimizing the loss function by maximum likelihood estimation;
d6. finally, making the vector model converge by backpropagation and stochastic gradient descent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811385276.0A CN109543036A (en) | 2018-11-20 | 2018-11-20 | Text Clustering Method based on semantic similarity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811385276.0A CN109543036A (en) | 2018-11-20 | 2018-11-20 | Text Clustering Method based on semantic similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109543036A true CN109543036A (en) | 2019-03-29 |
Family
ID=65848937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811385276.0A Pending CN109543036A (en) | 2018-11-20 | 2018-11-20 | Text Clustering Method based on semantic similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109543036A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
CN106326346A (en) * | 2016-08-06 | 2017-01-11 | 上海高欣计算机系统有限公司 | Text classification method and terminal device |
CN107357895A (en) * | 2017-01-05 | 2017-11-17 | 大连理工大学 | A kind of processing method of the text representation based on bag of words |
CN107153642A (en) * | 2017-05-16 | 2017-09-12 | 华北电力大学 | A kind of analysis method based on neural network recognization text comments Sentiment orientation |
CN107590177A (en) * | 2017-07-31 | 2018-01-16 | 南京邮电大学 | A kind of Chinese Text Categorization of combination supervised learning |
CN108595706A (en) * | 2018-05-10 | 2018-09-28 | 中国科学院信息工程研究所 | A kind of document semantic representation method, file classification method and device based on theme part of speech similitude |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111078887A (en) * | 2019-12-20 | 2020-04-28 | 厦门市美亚柏科信息股份有限公司 | Text classification method and device |
CN111078887B (en) * | 2019-12-20 | 2022-04-29 | 厦门市美亚柏科信息股份有限公司 | Text classification method and device |
CN111539220A (en) * | 2020-05-12 | 2020-08-14 | 北京百度网讯科技有限公司 | Training method and device of semantic similarity model, electronic equipment and storage medium |
CN112835798A (en) * | 2021-02-03 | 2021-05-25 | 广州虎牙科技有限公司 | Cluster learning method, test step clustering method and related device |
CN112835798B (en) * | 2021-02-03 | 2024-02-20 | 广州虎牙科技有限公司 | Clustering learning method, testing step clustering method and related devices |
CN113590820A (en) * | 2021-07-16 | 2021-11-02 | 杭州网易智企科技有限公司 | Text processing method, device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190329 |