CN109543036A

CN109543036A - Text clustering method based on semantic similarity

Info

Publication number: CN109543036A
Application number: CN201811385276.0A
Authority: CN
Inventors: 杨鑫
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2019-03-29

Abstract

The invention belongs to the field of big data analysis and discloses a text clustering method based on semantic similarity, which clusters the non-standard sentences whose semantic parsing failed in natural language understanding and thereby improves the recognition rate of semantic understanding. The method comprises: a. collecting text data and classifying it according to the results of successful parsing; b. splitting the classified text data into n-grams to obtain a training feature set for each category; c. converting the training feature sets into space vectors based on a bag-of-words model; d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each category; e. in use, computing the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each category; f. selecting the category with the highest similarity score as the category of the non-standard text and outputting the classification.

Description

Text clustering method based on semantic similarity

Technical field

The invention belongs to the field of big data analysis, and in particular relates to a text clustering method based on semantic similarity.

Background art

At present, NLP (Natural Language Processing) is used in practical projects in the post-processing stage of speech recognition. Fluctuations in the speaker's psychological state or mood, dialect accents and similar problems cause the speech rate to become too fast, the pitch to become too high, too low or distorted, and the formants and tone to change. These effects produce errors in the speech recognition signal, so that the true content intended by the user (speaker) cannot be correctly passed to the computer for subsequent processing. Unlike the relatively standard, templated test cases of an experimental environment, NLP in practical applications encounters large batches of texts that are invalid cases in which the user's true intention is not recognized, and these invalid cases often lower the recognition rate of semantic understanding across the whole NLP application.

Therefore, it is necessary to propose a text clustering method based on semantic similarity that clusters the non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.

Summary of the invention

The technical problem to be solved by the invention is to propose a text clustering method based on semantic similarity that clusters the non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.

The technical solution adopted by the invention to solve the above technical problem is as follows:

A text clustering method based on semantic similarity, comprising the following steps:

a. collecting text data and classifying it according to the results of successful parsing;

b. splitting the classified text data into n-grams to obtain a training feature set for each category;

c. converting the training feature sets into space vectors based on a bag-of-words model;

d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each category;

e. in use, computing the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each category;

f. selecting the category with the highest similarity score as the category of the non-standard text and outputting the classification.

As a further optimization, in step a, collecting text data and classifying it according to the results of successful parsing specifically comprises:

collecting, from the log system of a practical project application, text data whose entities were parsed successfully, or collecting text data that has already been labeled; and, based on the known labels and the number of label categories, dividing the text data into sets under the different categories.

As a further optimization, in step b, splitting the classified text data into n-grams to obtain a training feature set for each category specifically comprises:

performing n-gram splitting on each item in the text set of each category, so that it is split into a training feature set for that category consisting of bigram phrases, trigram phrases and the original text, and deduplicating the phrases in the training feature set.

As a further optimization, step d specifically comprises:

d1. selecting tanh as the activation function of the hidden layers and the output layer of the neural network;

d2. taking the space vectors of the different categories as the input of the neural network;

d3. generating, by the neural network algorithm, a low-dimensional semantic vector model for each category;

d4. computing the semantic vector between the input unknown text and each low-dimensional semantic vector model;

d5. converting the similarity between the unknown text and the vector models into a posterior probability with the softmax function, and minimizing the loss function by maximum likelihood estimation:

d6. finally, making the vector models converge by backpropagation and the stochastic gradient descent algorithm.

The beneficial effects of the invention are as follows:

For non-standard sentences whose semantic parsing failed in natural language understanding, or texts that cannot be understood correctly for objective reasons, text features are learned from existing standard data, and the semantic similarity between the failed non-standard data and the standard data samples is compared in order to cluster the non-standard data, so that the corresponding features can be extracted from these non-standard texts for training and parsing. The scheme of the invention helps improve the understanding of natural language text features and thereby the recognition rate of semantic understanding.

In addition, when constructing the feature training set, the invention uses n-gram splitting, which reduces the dependence on the word segmenter and at the same time improves the generalization ability of the model, so that the semantic features of each text are learned effectively. This reduces the influence of errors in unsupervised learning on the final data classification. The invention requires no mapping in intermediate steps, which improves the precision of the final result.

Brief description of the drawings

Fig. 1 is a flow chart of the text clustering method based on semantic similarity in the invention.

Detailed description of the embodiments

The invention aims to propose a text clustering method based on semantic similarity that clusters the non-standard sentences whose semantic parsing failed in natural language understanding, so as to improve the recognition rate of semantic understanding.

As shown in Fig. 1, the text clustering method based on semantic similarity in the invention comprises the following steps:

In a specific implementation, the steps of the above scheme are as follows:

1) First, collect from the log system of a practical project application the text data whose entities were parsed successfully, or collect text data that has already been labeled; based on the known labels and the number of label categories, divide the text data into sets under several categories.

$Y = \{y_1, y_2, \ldots, y_i, \ldots, y_n\}, \quad i \in \{1, \ldots, n\}$

where $Y$ is the collected text data, $n$ is the number of categories, and $y_i$ denotes the set of texts under category $i$.
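As an illustration of step 1), the grouping of labeled texts into per-category sets can be sketched as follows. The function name `group_by_label` and the sample texts and labels are hypothetical and do not come from the patent; the sketch only assumes that each collected text arrives with a known label.

```python
from collections import defaultdict


def group_by_label(samples):
    """Group (text, label) pairs into a dict mapping label -> list of texts.

    Each value plays the role of one set y_i; the whole dict plays the role of Y.
    """
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append(text)
    return dict(groups)


# Hypothetical labeled data, e.g. taken from successfully parsed log entries.
samples = [
    ("play a song", "music"),
    ("next track", "music"),
    ("weather today", "weather"),
]
Y = group_by_label(samples)
n = len(Y)  # number of categories
```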

2) Perform n-gram splitting on the text data in each category set: each text is split into a training feature set for that category consisting of bigram phrases, trigram phrases and the original text, and the phrases in the training feature set are deduplicated. Processing the text into multi-gram phrases in this way lets the subsequent model learn as many features of the text as possible on its own, reduces the impact of inaccurate word segmentation on text classification accuracy, and reduces the dependence on Chinese word segmentation.
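A minimal sketch of the splitting in step 2) is given below. It assumes character-level n-grams, which is one way to avoid relying on a Chinese word segmenter; the patent does not state whether the grams are taken over characters or tokens, and the function names are hypothetical.

```python
def ngram_features(text, sizes=(2, 3)):
    """Return the original text plus its character bigrams and trigrams, deduplicated."""
    feats = [text]
    for n in sizes:
        feats.extend(text[i:i + n] for i in range(len(text) - n + 1))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(feats))


def category_features(texts):
    """Build the deduplicated training feature set for one category."""
    feats = []
    for text in texts:
        feats.extend(ngram_features(text))
    return list(dict.fromkeys(feats))
```

For example, `ngram_features("abcd")` yields the original string followed by the bigrams `ab`, `bc`, `cd` and the trigrams `abc`, `bcd`.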

3) Convert the training feature set of each category into space vectors using a bag-of-words model, so as to reduce dimensionality;
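The bag-of-words conversion in step 3) might look like the sketch below. It assumes a single vocabulary shared by all categories and plain occurrence counts; the patent does not specify the weighting scheme, and `build_vocab` and `bow_vector` are hypothetical names.

```python
def build_vocab(feature_sets):
    """Assign an index to every distinct feature, in first-seen order."""
    vocab = {}
    for feats in feature_sets:
        for feat in feats:
            vocab.setdefault(feat, len(vocab))
    return vocab


def bow_vector(feats, vocab):
    """Count how often each vocabulary feature occurs; unknown features are ignored."""
    vec = [0] * len(vocab)
    for feat in feats:
        idx = vocab.get(feat)
        if idx is not None:
            vec[idx] += 1
    return vec
```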

4) Select tanh as the activation function of the hidden layers and the output layer of the neural network; take the space vectors as the input of the neural network, and generate, by the neural network algorithm, the low-dimensional semantic vector model doc for each category;
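The forward pass of such a network can be sketched as below: fully connected layers with tanh on both the hidden and output layers, mapping a bag-of-words vector to a low-dimensional semantic vector. The layer sizes, the uniform weight initialization and all function names are assumptions made for illustration; the patent does not give them, and training is omitted here.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create one dense layer: a weight matrix (n_out rows of n_in values) and a bias vector."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_tanh(x, weights, bias):
    """Apply one dense layer followed by the tanh activation."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def embed(x, layers):
    """Map an input vector to its low-dimensional semantic vector."""
    for weights, bias in layers:
        x = dense_tanh(x, weights, bias)
    return x


rng = random.Random(0)
# Hypothetical sizes: 3-dimensional input, 4 hidden units, 2-dimensional output.
layers = [init_layer(3, 4, rng), init_layer(4, 2, rng)]
y = embed([0, 1, 1], layers)
```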

5) For each vector model, compute the semantic vector between the newly input query text of unknown category and the vector model doc of each category:

where $y_q$ denotes the newly input query of unknown category, and $y_d$ is the low-dimensional semantic vector doc generated in the steps above.

The similarity between the unknown text and the vector models is converted into a posterior probability with the softmax function, and the loss function is minimized by maximum likelihood estimation:

where $doc^{+}$ is the positive sample for the query. Finally, the vector models are made to converge by backpropagation and the stochastic gradient descent algorithm, yielding stable vector models.
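The formulas belonging to step 5) do not appear in this text. One form that is consistent with the symbols $y_q$, $y_d$ and $doc^{+}$ defined above is sketched below. It assumes cosine similarity as the similarity measure and introduces a smoothing factor $\gamma$ and a candidate set $D$ (the positive sample plus the other categories); all three are assumptions and are not stated in the patent.

```latex
% Assumed similarity between query and category vector (cosine):
R(q, d) = \cos(y_q, y_d) = \frac{y_q^{\top} y_d}{\lVert y_q \rVert \, \lVert y_d \rVert}

% Softmax posterior over the candidate set D, with assumed smoothing factor \gamma:
P(d \mid q) = \frac{\exp\bigl(\gamma \, R(q, d)\bigr)}{\sum_{d' \in D} \exp\bigl(\gamma \, R(q, d')\bigr)}

% Maximum likelihood over (query, positive sample) pairs, written as a loss to minimize:
L = -\log \prod_{(q,\, doc^{+})} P(doc^{+} \mid q)
```

Under these assumptions, minimizing $L$ by backpropagation and stochastic gradient descent raises the posterior of the positive sample for each query.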

6) In practical application, compute the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each category;

7) Select the category with the highest similarity score as the category of the non-standard text, and output the classification.
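Steps 6) and 7) can be sketched as follows, assuming that the query and each category have already been mapped to low-dimensional vectors and that cosine similarity is the score (the patent does not name the measure). The function names, the `gamma` parameter and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, gamma=1.0):
    """Convert similarity scores into posterior probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(gamma * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(query_vec, category_vecs):
    """Return the best-scoring category and the posterior for every category."""
    labels = list(category_vecs)
    scores = [cosine(query_vec, category_vecs[label]) for label in labels]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], dict(zip(labels, probs))


# Hypothetical low-dimensional vectors for two categories and one query.
label, probs = classify([1.0, 0.0], {"music": [0.9, 0.1], "weather": [0.0, 1.0]})
```

Because softmax is monotonic in the score, picking the highest similarity score and picking the highest posterior select the same category.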

Claims

1. A text clustering method based on semantic similarity, characterized by comprising the following steps:

d. training a neural network with the space vectors as input to obtain a low-dimensional semantic model for each category;

e. in use, computing the similarity score between the non-standard text to be classified and the trained low-dimensional semantic model of each category;

2. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step a, collecting text data and classifying it according to the results of successful parsing specifically comprises:

collecting, from the log system of a practical project application, text data whose entities were parsed successfully, or collecting text data that has already been labeled; and, based on the known labels and the number of label categories, dividing the text data into sets under the different categories.

3. The text clustering method based on semantic similarity according to claim 1, characterized in that, in step b, splitting the classified text data into n-grams to obtain a training feature set for each category specifically comprises:

performing n-gram splitting on each item in the text set of each category, so that it is split into a training feature set for that category consisting of bigram phrases, trigram phrases and the original text, and deduplicating the phrases in the training feature set.

4. The text clustering method based on semantic similarity according to claim 1, characterized in that step d specifically comprises:

d5. converting the similarity between the unknown text and the vector models into a posterior probability with the softmax function, and minimizing the loss function by maximum likelihood estimation: