CN109543036A - Text Clustering Method based on semantic similarity - Google Patents

Publication number
CN109543036A
Authority
CN
China
Prior art keywords
text
classification
semantic
data
training characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811385276.0A
Other languages
Chinese (zh)
Inventor
杨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201811385276.0A
Publication of CN109543036A
Legal status: Pending

Abstract

The invention belongs to the field of big data analysis and discloses a text clustering method based on semantic similarity. The method performs cluster analysis on non-standard sentences whose semantic parsing failed during natural language understanding, thereby improving the recognition rate of semantic understanding. The method comprises: a. collecting text data and classifying it according to the results of successful parsing; b. splitting the classified text data into n-grams to obtain a training feature set for each class; c. converting the training feature sets into space vectors based on a bag-of-words model; d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class; e. in use, calculating the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class; f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.

Description

Text Clustering Method based on semantic similarity
Technical field
The invention belongs to the field of big data analysis, and in particular relates to a text clustering method based on semantic similarity.
Background
At present, NLP (natural language processing) in real projects is applied in the post-processing stage of speech recognition. Factors such as fluctuations in the speaker's psychology or mood and dialect accents cause changes in formants and pitch (speech that is too fast, pitch that is too high or too low, distorted tone). These changes produce speech recognition errors, so that the true content the user intends cannot be correctly conveyed to the computer for subsequent processing. Unlike the relatively standard, templated test cases used in an experimental environment, NLP in practical applications encounters a large volume of text for which the user's true intent is not recognized. These invalid cases often reduce the recognition rate of semantic understanding across the whole NLP application.
It is therefore necessary to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.
Summary of the invention
The technical problem to be solved by the invention is to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.
The technical solution adopted by the invention to solve the above technical problem is:
A text clustering method based on semantic similarity, comprising the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. splitting the classified text data into n-grams to obtain a training feature set for each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
As a further optimization, in step a, collecting text data and classifying it according to the results of successful parsing specifically comprises:
collecting, from the log system of a real project application, text data whose entities were parsed successfully, or collecting text data that has already been labelled; then, based on the known labels and according to the number of label classes, dividing the text data into one set per class.
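The grouping in step a can be illustrated with a minimal Python sketch. The function name `group_by_label` and the sample sentences and labels are hypothetical and chosen only for illustration; the patent does not specify a log format.

```python
from collections import defaultdict


def group_by_label(samples):
    """Group (text, label) pairs into one list of texts per class label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append(text)
    return dict(groups)


# Hypothetical labelled entries standing in for successfully parsed log data.
samples = [
    ("play some music", "music"),
    ("what is the weather today", "weather"),
    ("play the next song", "music"),
]
Y = group_by_label(samples)  # one text set per class
```

Each value of `Y` then corresponds to the text set of one class, which is the input to the n-gram splitting of step b.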
As a further optimization, in step b, splitting the classified text data into n-grams to obtain the training feature set for each class specifically comprises:
performing n-gram splitting on every item in the text data set of each class, so that the training feature set of the class consists of the bigram phrases, the trigram phrases, and the original text, and then de-duplicating the phrases in the training feature set.
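A minimal sketch of this splitting follows. It assumes character-level n-grams, which is one plausible reading given that the method aims to reduce dependence on word segmentation; the patent does not state the unit explicitly. The function names are hypothetical.

```python
def ngram_features(text, sizes=(2, 3)):
    """Return de-duplicated character bigrams, trigrams and the original text."""
    features = []
    for n in sizes:
        # Slide a window of length n over the text (assumption: character units).
        features.extend(text[i:i + n] for i in range(len(text) - n + 1))
    features.append(text)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(features))


def class_feature_set(texts):
    """Merge the n-gram features of all texts in one class, without duplicates."""
    merged = {}
    for text in texts:
        for feature in ngram_features(text):
            merged.setdefault(feature, None)
    return list(merged)
```

For example, `ngram_features("abab")` yields the bigrams `ab` and `ba`, the trigrams `aba` and `bab`, and the original string `abab`, with the repeated bigram `ab` kept only once.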
As a further optimization, step d specifically comprises:
d1. selecting tanh as the activation function of the hidden layers and the output layer of the neural network;
d2. using the space vectors of the different classes as the input of the neural network;
d3. generating the low-dimensional semantic vector model of each class by the neural network algorithm;
d4. calculating the semantic vector between the input unknown text and each low-dimensional semantic vector model;
d5. converting the similarity between the unknown text and the vector models into a posterior probability by the softmax function, and minimizing the loss function by maximum likelihood estimation;
d6. finally, making the vector models converge by backpropagation and stochastic gradient descent.
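The formulas belonging to steps d4 and d5 are not reproduced in this text. A hedged reconstruction, assuming cosine similarity between semantic vectors and a softmax over the class vector models (a common formulation in deep semantic matching models, not confirmed by the source), might read as follows. The symbols $R$, $P$, $\gamma$ (a smoothing factor), $D$ (the set of class vector models) and $L$ are assumptions introduced here; $y_q$, $y_d$ and $doc^+$ follow the notation of the embodiment.

```latex
% d4: similarity between the query vector y_q and a class vector y_d (assumed cosine)
R(q, d) = \cos(y_q, y_d) = \frac{y_q^{\top} y_d}{\lVert y_q \rVert \, \lVert y_d \rVert}

% d5: posterior probability of class model d given the query q (softmax)
P(d \mid q) = \frac{\exp\bigl(\gamma R(q, d)\bigr)}{\sum_{d' \in D} \exp\bigl(\gamma R(q, d')\bigr)}

% d5: maximum likelihood estimation expressed as a loss to be minimized
L = -\log \prod_{(q,\, doc^{+})} P(doc^{+} \mid q)
```

Under this reading, minimizing $L$ by stochastic gradient descent (step d6) raises the posterior probability of the positive sample $doc^+$ for each training query.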
The beneficial effects of the invention are as follows:
For non-standard sentences whose semantic parsing failed in natural language understanding, or text that cannot be correctly understood for objective reasons, the method learns text features from existing standard data, compares the semantic similarity between the failed non-standard data and the standard data samples, and clusters the non-standard data accordingly, so that the corresponding features can be extracted from these irregular texts and used for training and parsing. The scheme helps improve the understanding of natural language text features and thereby the recognition rate of semantic understanding.
In addition, when constructing the training feature set, the invention uses n-gram splitting, which reduces the dependence on the tokenizer and improves the generalization ability of the model, so that the semantic features of each text are learned effectively. This reduces the effect that errors in unsupervised learning have on the final classification. The invention requires no mapping in intermediate steps, which improves the precision of the final result.
Brief description of the drawings
Fig. 1 is a flow chart of the text clustering method based on semantic similarity according to the invention.
Detailed description of the embodiments
The invention aims to propose a text clustering method based on semantic similarity that performs cluster analysis on non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.
As shown in Fig. 1, the text clustering method based on semantic similarity according to the invention comprises the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. splitting the classified text data into n-grams to obtain a training feature set for each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
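Step c can be sketched in a few lines of Python. This is a minimal count-based bag-of-words under stated assumptions: the vocabulary is built from the training feature sets of all classes, and features outside the vocabulary are ignored. The function names are hypothetical, and the patent does not specify the weighting scheme.

```python
def build_vocabulary(feature_sets):
    """Assign a column index to every distinct feature across all classes."""
    vocab = {}
    for features in feature_sets:
        for feature in features:
            if feature not in vocab:
                vocab[feature] = len(vocab)
    return vocab


def bag_of_words(features, vocab):
    """Count how often each vocabulary feature occurs; unknown features are skipped."""
    vector = [0] * len(vocab)
    for feature in features:
        index = vocab.get(feature)
        if index is not None:
            vector[index] += 1
    return vector
```

The resulting vectors have one dimension per distinct n-gram, so they are high-dimensional and sparse; the neural network in step d is what maps them to low-dimensional semantic vectors.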
In a specific implementation, the above scheme is carried out as follows:
1) First, collect from the log system of a real project application the text data whose entities were parsed successfully, or collect text data that has already been labelled; then, based on the known labels and according to the number of label classes, divide the text data into sets under several classes:
$Y = \{y_1, y_2, \ldots, y_i, \ldots, y_n\}, \quad i \in \{1, \ldots, n\}$
where $Y$ is the collected text data, $n$ is the number of classes, and $y_i$ denotes the text set under class $i$.
2) Perform n-gram splitting on the text data in each class set: the training feature set of the class consists of the bigram phrases, the trigram phrases, and the original text, and the phrases in the training feature set are de-duplicated. Splitting the text into n-gram phrases in this way lets the subsequent model learn as many text features as possible on its own, reduces the effect of inaccurate word segmentation on classification accuracy, and reduces the dependence on Chinese word segmentation.
3) Convert the training feature set of each class into space vectors using the bag-of-words model, in preparation for dimensionality reduction.
4) Select tanh as the activation function of the hidden layers and the output layer of the neural network; use the space vectors as the input of the neural network, and generate the low-dimensional semantic vector model doc of each class by the neural network algorithm.
5) For each vector model, calculate the semantic vector between the newly input query text of unknown class and the vector model doc of each class (the formula is not reproduced in this text), where $y_q$ denotes the newly input query of unknown class and $y_d$ denotes the low-dimensional semantic vector doc generated in the preceding steps.
The similarity between the unknown text and the vector model is converted into a posterior probability by the softmax function, and the loss function is minimized by maximum likelihood estimation (this formula is likewise not reproduced), where $doc^+$ denotes a positive sample for the query. Finally, backpropagation and stochastic gradient descent make the vector model converge, yielding a stable vector model.
6) In practical application, calculate the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class.
7) Select the class with the highest similarity score as the class of the non-standard text and output the classification.
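The scoring and selection in steps 6) and 7) can be sketched as below. This assumes that the query and each class have already been mapped to low-dimensional vectors by the trained network, and that the similarity score is cosine similarity followed by softmax; the source does not reproduce its formula, so both are assumptions. The function names and example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(query_vec, class_vecs):
    """Return the best-scoring class label and the probability of every class."""
    labels = list(class_vecs)
    sims = [cosine(query_vec, class_vecs[label]) for label in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Hypothetical low-dimensional vectors, standing in for trained model output.
class_vecs = {"music": [0.9, 0.1, -0.2], "weather": [-0.3, 0.8, 0.5]}
label, probs = classify([0.8, 0.0, -0.1], class_vecs)
```

Because softmax is monotonic, the class with the highest probability is also the class with the highest cosine similarity; the probabilities are useful mainly when a confidence threshold is wanted before accepting a classification.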

Claims (4)

1. A text clustering method based on semantic similarity, characterized by comprising the following steps:
a. collecting text data and classifying it according to the results of successful parsing;
b. splitting the classified text data into n-grams to obtain a training feature set for each class;
c. converting the training feature sets into space vectors based on a bag-of-words model;
d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each class;
e. in use, calculating the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each class;
f. selecting the class with the highest similarity score as the class of the non-standard text and outputting the classification.
2. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step a, collecting text data and classifying it according to the results of successful parsing specifically comprises:
collecting, from the log system of a real project application, text data whose entities were parsed successfully, or collecting text data that has already been labelled; then, based on the known labels and according to the number of label classes, dividing the text data into one set per class.
3. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step b, splitting the classified text data into n-grams to obtain the training feature set for each class specifically comprises:
performing n-gram splitting on every item in the text data set of each class, so that the training feature set of the class consists of the bigram phrases, the trigram phrases, and the original text, and de-duplicating the phrases in the training feature set.
4. The text clustering method based on semantic similarity according to claim 1, characterized in that step d specifically comprises:
d1. selecting tanh as the activation function of the hidden layers and the output layer of the neural network;
d2. using the space vectors of the different classes as the input of the neural network;
d3. generating the low-dimensional semantic vector model of each class by the neural network algorithm;
d4. calculating the semantic vector between the input unknown text and each low-dimensional semantic vector model;
d5. converting the similarity between the unknown text and the vector models into a posterior probability by the softmax function, and minimizing the loss function by maximum likelihood estimation;
d6. finally, making the vector models converge by backpropagation and stochastic gradient descent.
CN201811385276.0A 2018-11-20 2018-11-20 Text Clustering Method based on semantic similarity Pending CN109543036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811385276.0A CN109543036A (en) 2018-11-20 2018-11-20 Text Clustering Method based on semantic similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811385276.0A CN109543036A (en) 2018-11-20 2018-11-20 Text Clustering Method based on semantic similarity

Publications (1)

Publication Number Publication Date
CN109543036A (en) 2019-03-29

Family

ID=65848937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811385276.0A Pending CN109543036A (en) 2018-11-20 2018-11-20 Text Clustering Method based on semantic similarity

Country Status (1)

Country Link
CN (1) CN109543036A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method
CN106326346A (en) * 2016-08-06 2017-01-11 上海高欣计算机系统有限公司 Text classification method and terminal device
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words
CN107153642A (en) * 2017-05-16 2017-09-12 华北电力大学 A kind of analysis method based on neural network recognization text comments Sentiment orientation
CN107590177A (en) * 2017-07-31 2018-01-16 南京邮电大学 A kind of Chinese Text Categorization of combination supervised learning
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111078887B (en) * 2019-12-20 2022-04-29 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111539220A (en) * 2020-05-12 2020-08-14 北京百度网讯科技有限公司 Training method and device of semantic similarity model, electronic equipment and storage medium
CN112835798A (en) * 2021-02-03 2021-05-25 广州虎牙科技有限公司 Cluster learning method, test step clustering method and related device
CN112835798B (en) * 2021-02-03 2024-02-20 广州虎牙科技有限公司 Clustering learning method, testing step clustering method and related devices
CN113590820A (en) * 2021-07-16 2021-11-02 杭州网易智企科技有限公司 Text processing method, device, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN110188781B (en) Ancient poetry automatic identification method based on deep learning
CN109543036A (en) Text Clustering Method based on semantic similarity
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN105260361A (en) Trigger word tagging system and method for biomedical events
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN109815336A (en) A kind of text polymerization and system
CN111597328B (en) New event theme extraction method
CN110895559A (en) Model training method, text processing method, device and equipment
CN110085215A (en) A kind of language model data Enhancement Method based on generation confrontation network
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN107797987A (en) A kind of mixing language material name entity recognition method based on Bi LSTM CNN
CN110188359B (en) Text entity extraction method
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN111460147B (en) Title short text classification method based on semantic enhancement
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system
CN110162629B (en) Text classification method based on multi-base model framework
CN113127607A (en) Text data labeling method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190329