CN106372061A - Short text similarity calculation method based on semantics - Google Patents

Short text similarity calculation method based on semantics

Info

Publication number
CN106372061A
CN106372061A
Authority
CN
China
Prior art keywords
semantic
short text
word
similarity
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610817910.8A
Other languages
Chinese (zh)
Other versions
CN106372061B (en)
Inventor
费高雷
胡馨月
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201610817910.8A priority Critical patent/CN106372061B/en
Publication of CN106372061A publication Critical patent/CN106372061A/en
Application granted granted Critical
Publication of CN106372061B publication Critical patent/CN106372061B/en
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files


Abstract

The invention discloses a semantics-based short text similarity calculation method comprising the following steps: preprocessing corpus data, building word embeddings, constructing a word semantic tree, calculating the semantic similarity between words in short texts, and calculating the semantic similarity between short texts. On the basis of deep-learning word embeddings, a hierarchical clustering method is used to create the word semantic tree and calculate the similarity between words in short texts; on this basis, several features of the short texts are combined to calculate the semantic similarity between the short texts. This effectively remedies the defect of prior-art word semantic trees, which cannot describe the semantic relationship between a new word and a known word.

Description

Semantics-based short text similarity calculation method
Technical field
The invention belongs to the technical field of short text similarity measurement, and more particularly relates to a semantics-based short text similarity calculation method.
Background technology
Semantic similarity measurement between short texts has theoretical research value and practical application background in fields such as artificial intelligence, natural language processing, cognitive science, semantics, psychology, and bioinformatics. Short text similarity can effectively reduce information redundancy in a corpus. At present, many studies show that short text similarity calculation can advance many natural language processing tasks, such as event detection, information retrieval, text normalization, automatic text summarization, and text classification and clustering. Short text similarity calculation is widely applied, and a good semantic similarity calculation method can considerably improve the performance of many existing systems.
At present, there are many methods for calculating short text similarity, which can be roughly divided into the following classes: methods based on semantic dictionaries, methods based on corpora, feature-based methods, and methods based on Internet resources. Methods based on semantic dictionaries use a semantic dictionary, such as WordNet, PPDB, or FrameNet, to calculate the semantic similarity between pairs of words, and finally integrate these word similarities to obtain the text semantic similarity. Methods based on corpora perform statistical analysis on large-scale text collections; typical methods are LSA (latent semantic analysis) and HAL (hyperspace analogue to language). Feature-based methods attempt to represent short texts with predefined features and then obtain the semantic similarity of short texts through a classifier. Methods based on Internet resources mostly use the results returned by search engines to enrich the contextual information of short texts, or to calculate the similarity of words or entities and thereby the semantic similarity of short texts.
Methods based on semantic dictionaries depend heavily on the completeness of the dictionary being queried, because a short text may contain words that do not exist in the dictionary, making it impossible to calculate the semantic similarity between that short text and others. In addition, the polysemy of words in the dictionary affects the accuracy of semantic similarity measurement. The difficulty of feature-based methods lies in defining effective features and automatically obtaining their values; moreover, features are easy to define for concrete concepts but relatively difficult for abstract ones. Methods based on Internet resources are very sensitive to the results returned by search engines and cannot obtain stable semantic similarities. Furthermore, the co-occurrence information in search engine results can only reflect the relationship between two words to a limited extent, and the precision of grammar templates automatically extracted from snippets is difficult to guarantee. The shortcoming of HAL is that the word-word matrix it constructs cannot capture the meaning of a whole text well. LSA may not handle new words appearing in short texts; moreover, the LSA vector representation of a short text is very sparse, which affects the precision of similarity calculation, and it cannot represent certain syntactic information in the short text.
With the rise of neural networks and deep learning, the traditional word vector space can be converted into a word embedding vector space, which compensates for the sparsity and high noise of short texts in the term vector space and allows unsupervised and supervised learning processes to be combined seamlessly. This opens a new direction for the calculation of short text semantic similarity and represents a future development trend.
Short texts differ from long texts such as ordinary news articles and magazines: because they are short, individual noise words interfere severely with parsing the semantics of the whole short text. Therefore, conventional models and methods for processing regular text may not be effective for semantic similarity measurement of short texts.
Content of the invention
The objective of the invention is to solve a problem unsolved in the prior art: the short length of a short text causes individual noise words to interfere severely with parsing the semantics of the whole text. To this end, the invention proposes a semantics-based short text similarity calculation method.
The technical scheme of the invention is a semantics-based short text similarity calculation method comprising the following steps:
a. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
b. Build the word semantic tree of the corpus using a hierarchical clustering method;
c. Calculate the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. Calculate the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
Further, in step a, preprocessing the corpus data specifically comprises: converting all words in the corpus to lowercase and performing word segmentation; and selecting the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
Further, in step a, building word embeddings according to word2vec hyperparameters specifically comprises: training the CBOW and SG models of word2vec with different hyperparameters; using cosine distance as the semantic similarity of the word embeddings and selecting the three words with the highest similarity as synonyms; using WordNet as the synonym knowledge base; and determining, via precision, recall, and F1 score, the word2vec hyperparameters that best fit the word semantics of the corpus, thereby building the word embeddings. Here the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R).
Further, step b builds the word semantic tree of the corpus with a hierarchical clustering method, specifically: the SimLex-999 data set is used to determine the distance metric and the linkage metric, and the word semantic tree of the corpus is built by hierarchical clustering according to the determined distance metric and linkage metric.
Further, the formula for calculating the semantic similarity between words in short texts in step c is:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree.
Further, step d, calculating the semantic similarity between short texts according to the semantic similarity between their words, comprises the following sub-steps:
d1. Preprocess short texts t1 and t2: remove punctuation marks and special symbols, and convert each word in the short texts to lowercase;
d2. Calculate the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. Calculate the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
The invention has the following beneficial effects:
1. The word semantic tree of the invention is obtained by reasonable hierarchical clustering of word embeddings produced by a deep neural network. Compared with existing word semantic trees it is easier to extend, a corresponding word semantic tree can be built rapidly for different corpora, it contains more words, and it overcomes the shortcoming of word semantic trees such as WordNet and Chinese synonym thesauri, which cannot describe the semantic relationship between a new word and a known word;
2. The semantic similarity calculation method proposed by the invention uses a manually annotated synonym data set to determine the threshold on the link inconsistency coefficient of the hierarchically clustered word semantic tree, thereby reducing the disproportionate discrimination of semantic similarity caused by extreme values of the link inconsistency coefficient and improving the precision of the semantic similarity calculation;
3. The method proposed by the invention for calculating the semantic similarity of short texts from the semantic similarity of words is simple and effective. It can process any short text data set by adjusting the training corpus, and it can recognize different parts of speech of similar words, so the part-of-speech matching problem of words does not need to be considered, and it identifies similar short texts with varying sentence structures more concisely.
Brief description
Fig. 1 is a flow diagram of the semantics-based short text similarity calculation method of the invention.
Fig. 2 is a schematic diagram of the hierarchically clustered word semantic tree in an embodiment of the invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
As shown in Fig. 1, the flow diagram of the semantics-based short text similarity calculation method of the invention, the method comprises the following steps:
a. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
b. Build the word semantic tree of the corpus using a hierarchical clustering method;
c. Calculate the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. Calculate the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
In step a, the corpus data must first be preprocessed, specifically: all words in the corpus are converted to lowercase and word segmentation is performed. To guarantee the quality of the word embeddings, the invention selects the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold. Preferably, the invention sets n to 10, i.e. the words occurring more than 10 times in the corpus are selected to build the vocabulary.
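As a concrete illustration, the preprocessing of step a can be sketched in Python; the whitespace tokenizer, the function name, and the toy corpus are our assumptions for illustration only:

```python
from collections import Counter

def preprocess(corpus_lines, n=10):
    """Step-a preprocessing sketch: lowercase and tokenize each line,
    then keep only the words occurring more than n times in the corpus
    (the patent's preferred threshold is n = 10)."""
    tokenized = [line.lower().split() for line in corpus_lines]
    counts = Counter(w for line in tokenized for w in line)
    vocab = {w for w, c in counts.items() if c > n}
    return tokenized, vocab

corpus = ["The cat sat", "the cat ran", "a dog ran"]
tokens, vocab = preprocess(corpus, n=1)
# "the" (3x), "cat" (2x), and "ran" (2x) survive the threshold n = 1
```

For real corpora the whitespace split would be replaced by a proper word segmenter, as the patent requires for Chinese text.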
The invention uses the word2vec tool to train word embeddings and builds them according to the word2vec hyperparameters, specifically: the CBOW and SG models of word2vec are trained with different hyperparameters (here, context window size, dimensionality, and so on); cosine distance is then used as the semantic similarity of the word embeddings, and the three words with the highest similarity are selected as synonyms; WordNet is used as the synonym knowledge base; and the word2vec hyperparameters that best fit the word semantics of the corpus are determined via precision, recall, and F1 score, thereby building the word embeddings. Here the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R). Preferably, the word2vec hyperparameters determined by the invention are a dimensionality of 300, a context window size of 32, 5 iterations, and a negative sampling rate of 5.
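The precision/recall/F1 selection criterion just described can be sketched in Python; the pair representation, the function name `synonym_prf`, and the toy word pairs are our assumptions, not data from the patent:

```python
def synonym_prf(predicted, gold):
    """Score predicted synonym pairs against a gold set (e.g. WordNet
    pairs), as used to select word2vec hyperparameters in step a:
    P = correct/predicted, R = correct/gold, F1 = 2PR/(P+R)."""
    predicted = {frozenset(p) for p in predicted}  # unordered pairs
    gold = {frozenset(p) for p in gold}
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One of two predicted pairs is in the gold set, and the gold set
# also has two pairs, so P = R = F1 = 0.5.
p, r, f1 = synonym_prf([("car", "auto"), ("car", "cat")],
                       [("car", "auto"), ("big", "large")])
```

In the patent's procedure, `predicted` would be the top-3 cosine neighbours of each word under a candidate hyperparameter setting, and the setting with the best F1 would be kept.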
In step b, as shown in Fig. 2, the schematic diagram of the hierarchically clustered word semantic tree in the embodiment of the invention, the invention dynamically builds the word semantic tree of the corpus with a hierarchical clustering method, so that semantically similar words are adjacent in the tree. Agglomerative hierarchical clustering uses a bottom-up strategy: typically, it starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all objects are in a single cluster or a termination condition is met. This single cluster becomes the root of the hierarchy. In a merge step, the two closest clusters according to some similarity measure are found and merged into one cluster. Since each iteration merges two clusters, each of which contains at least one object, the agglomerative method needs at most n iterations.
When building the word semantic tree of the corpus, the distance metric between word embeddings must be determined; here the invention considers Euclidean distance, cosine distance, and Manhattan distance. Likewise, the inter-cluster distance measure of the agglomerative method is the linkage metric; for each word vector distance metric, the invention analyzes the effect of linkage metrics such as single, centroid, complete, and average linkage on the quality of the agglomerative hierarchical clustering tree. The invention uses the SimLex-999 data set to assess the quality of the agglomerative hierarchical clustering trees generated with the different distance and linkage metrics, and thereby builds a high-quality word semantic tree. The SimLex-999 data set contains 999 pairs of English words whose synonym similarity and semantic relatedness are manually annotated. Based on this data set, the invention determines suitable distance and linkage metrics according to a Spearman rank correlation analysis of the results, in order to build a high-quality hierarchical clustering tree.
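The tree construction can be sketched with SciPy's agglomerative clustering, an assumption on our part since the patent names no toolkit; real word embeddings are replaced here by toy 2-D vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

# Toy 2-D stand-ins for word embeddings: two tight word groups.
embeddings = np.array([[0.0, 0.0], [0.0, 0.1],   # e.g. "car", "auto"
                       [5.0, 5.0], [5.0, 5.1]])  # e.g. "cat", "kitten"

# Average linkage over Euclidean distance, two of the metric choices
# the patent considers; other combinations would be compared on
# SimLex-999 via Spearman rank correlation.
Z = linkage(embeddings, method="average", metric="euclidean")

# Per-link statistics [mean, std, count, inconsistency coefficient];
# the coefficient is the "inconsistency" used in step c.
R = inconsistent(Z)

# The last merge joins the two distant groups, so it has the largest
# merge height and the largest inconsistency coefficient.
```

Each row of `Z` is one link of the semantic tree; leaves are words and internal nodes are merges, matching the tree structure of Fig. 2.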
The word semantic tree of the invention is obtained by reasonable hierarchical clustering of the word embeddings produced by a deep neural network. Compared with existing word semantic trees it is easier to extend, a corresponding word semantic tree can be built rapidly for different corpora, it contains more words, and it overcomes the shortcoming of word semantic trees such as WordNet and Chinese synonym thesauri, which cannot describe the semantic relationship between a new word and a known word.
In step c, the invention designs a new word semantic similarity calculation method on the semantic tree built by hierarchical clustering. In the hierarchically clustered word semantic tree, leaf nodes represent words, and parent nodes and the root node represent links; each link has an inconsistency coefficient indicating the degree of consistency among the members joined by that link. The invention calculates the semantic similarity between words according to the inconsistency coefficient of each link in the hierarchical clustering tree, with the formula:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree; if the inconsistency coefficient exceeds the given threshold, it is set equal to the threshold. To improve the precision of the semantic similarity, the invention considers two words completely unrelated when the inconsistency coefficient of their link is above a certain value, and therefore sets this inconsistency threshold. Because the maximum of the inconsistency coefficient may be very large, this truncation effectively improves the precision of the semantic similarity. The WS-353 data set contains 353 pairs of English words whose semantic similarity is manually annotated; based on this data set, the invention determines the inconsistency threshold of the hierarchical clustering tree according to a Spearman rank correlation analysis of the results.
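A sketch of the syn(w1, w2) formula on such a linkage tree, assuming SciPy's `linkage`/`inconsistent` and an illustrative threshold of 1.0; the helper `word_syn` and its leaf-set traversal are ours, not code from the patent:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

def word_syn(Z, R, i, j, threshold):
    """syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / threshold, where
    link(w1, w2) is the lowest merge joining leaves i and j and the
    inconsistency coefficient is capped at the threshold (step c)."""
    n = Z.shape[0] + 1
    members = {k: {k} for k in range(n)}  # leaf sets of cluster ids
    for t in range(Z.shape[0]):
        a, b = int(Z[t, 0]), int(Z[t, 1])
        members[n + t] = members[a] | members[b]
        if i in members[n + t] and j in members[n + t]:
            return 1.0 - min(R[t, 3], threshold) / threshold
    return 0.0

emb = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
Z = linkage(emb, method="average", metric="euclidean")
R = inconsistent(Z)
close = word_syn(Z, R, 0, 1, threshold=1.0)  # within a tight group
far = word_syn(Z, R, 0, 2, threshold=1.0)    # across the two groups
```

Words merged early by a perfectly consistent link score 1.0, while the high-inconsistency root link is capped at the threshold, driving the cross-group similarity toward 0, which is exactly the truncation behaviour the paragraph above describes.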
The semantic similarity calculation method of the invention uses a manually annotated synonym data set to determine the threshold on the link inconsistency coefficient of the hierarchically clustered word semantic tree, thereby reducing the disproportionate discrimination of semantic similarity caused by extreme values of the link inconsistency coefficient and improving the precision of the semantic similarity calculation.
In step d, the two short texts are denoted t1 and t2, and the semantic similarity between their words is calculated according to step c; the calculation of the semantic similarity between the short texts then comprises the following sub-steps:
d1. Preprocess short texts t1 and t2: remove punctuation marks and special symbols, and convert each word in the short texts to lowercase;
d2. Calculate the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. Calculate the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
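The d3 formula can be sketched directly in NumPy (a minimal reading of the formula; the function name `text_sim` is ours):

```python
import numpy as np

def text_sim(S):
    """sim(t1, t2) from step d3: the average of (sum of row maxima /
    number of nonzero rows) and (sum of column maxima / number of
    nonzero columns) of the word similarity matrix S."""
    S = np.asarray(S, dtype=float)
    rows = np.any(S != 0, axis=1)  # rows that are not all zero
    cols = np.any(S != 0, axis=0)  # columns that are not all zero
    row_term = S[rows].max(axis=1).sum() / rows.sum() if rows.any() else 0.0
    col_term = S[:, cols].max(axis=0).sum() / cols.sum() if cols.any() else 0.0
    return 0.5 * (row_term + col_term)

# Three t1 words against two t2 words; the all-zero third row
# (a word matching nothing) is excluded from the row average.
sim = text_sim([[1.0, 0.0],
                [0.0, 0.5],
                [0.0, 0.0]])
```

Excluding all-zero rows and columns keeps noise words that match nothing in the other text from dragging the average down, which is the point of the normalization terms.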
In step d1, the invention illustrates with short texts t1 and t2, namely:
text 1: my phone is annoying me with these amber alerts.
text 2: that amber alert was getting annoying.
Short texts t1 and t2 are preprocessed, i.e. the punctuation marks and special symbols are removed and each word in the short texts is converted to its lowercase form.
In step d2, for each word w_i in short text t1, the invention selects the word w_j in short text t2 with the maximum semantic similarity to it; then, for each word w_j in short text t2, it selects the word w_i in short text t1 with the maximum semantic similarity to it. This yields the semantic similarity matrix s of short texts t1 and t2, shown in Table 1.
Table 1. Semantic similarity matrix of short texts t1 and t2
In step d3, the invention sums and averages the semantic similarities calculated in step d2 to obtain the semantic similarity between the short texts; according to the formula, the semantic similarity of short texts t1 and t2 is calculated to be 0.855.
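The mechanics of step d2 on the example texts can be sketched as follows, with a toy exact-match word similarity standing in for the tree-based syn, so the resulting matrix only illustrates the construction, not the patent's score of 0.855:

```python
import numpy as np

def sim_matrix(t1_words, t2_words, syn):
    """Step d2: S[i][j] = syn(w_i, w_j) for w_i in t1, w_j in t2."""
    return np.array([[syn(a, b) for b in t2_words] for a in t1_words])

t1 = "my phone is annoying me with these amber alerts".split()
t2 = "that amber alert was getting annoying".split()

# Toy word similarity: 1 for identical words, else 0. The patent's
# tree-based syn would also relate near-synonyms and inflected forms
# such as "alerts"/"alert".
S = sim_matrix(t1, t2, lambda a, b: 1.0 if a == b else 0.0)
# Only "annoying" and "amber" match exactly under this toy measure.
```

Feeding such a matrix into the d3 formula then yields the short text similarity; with the real tree-based word similarity the patent reports 0.855 for this pair.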
The method of the invention for calculating the semantic similarity of short texts from the semantic similarity of words is simple and effective. It can process any short text data set by adjusting the training corpus, and it can recognize different parts of speech of similar words, so the part-of-speech matching problem of words does not need to be considered, and it identifies similar short texts with varying sentence structures more concisely.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and it should be understood that the protection scope of the invention is not limited to such special statements and embodiments. Those of ordinary skill in the art can make various other specific variations and combinations according to the technical teachings disclosed by the invention without departing from its essence, and these variations and combinations remain within the protection scope of the invention.

Claims (6)

1. A semantics-based short text similarity calculation method, characterized by comprising the following steps:
a. preprocessing the corpus data and building word embeddings according to word2vec hyperparameters;
b. building the word semantic tree of the corpus using a hierarchical clustering method;
c. calculating the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. calculating the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
2. The semantics-based short text similarity calculation method of claim 1, characterized in that in step a the corpus data is preprocessed by: converting all words in the corpus to lowercase and performing word segmentation; and selecting the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
3. The semantics-based short text similarity calculation method of claim 2, characterized in that in step a the word embeddings are built according to word2vec hyperparameters by: training the CBOW and SG models of word2vec with different hyperparameters; using cosine distance as the semantic similarity of the word embeddings and selecting the three words with the highest similarity as synonyms; using WordNet as the synonym knowledge base; and determining, via precision, recall, and F1 score, the word2vec hyperparameters that best fit the word semantics of the corpus, thereby building the word embeddings; where the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R).
4. The semantics-based short text similarity calculation method of claim 3, characterized in that step b builds the word semantic tree of the corpus with a hierarchical clustering method by: determining the distance metric and the linkage metric using the SimLex-999 data set, and building the word semantic tree of the corpus by hierarchical clustering according to the determined distance metric and linkage metric.
5. The semantics-based short text similarity calculation method of claim 4, characterized in that the formula for calculating the semantic similarity between words in short texts in step c is:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree.
6. The semantics-based short text similarity calculation method of claim 5, characterized in that step d, calculating the semantic similarity between short texts according to the semantic similarity between their words, comprises the following sub-steps:
d1. preprocessing short texts t1 and t2: removing punctuation marks and special symbols, and converting each word in the short texts to lowercase;
d2. calculating the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. calculating the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
CN201610817910.8A 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics Active CN106372061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610817910.8A CN106372061B (en) 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics


Publications (2)

Publication Number Publication Date
CN106372061A true CN106372061A (en) 2017-02-01
CN106372061B CN106372061B (en) 2020-11-24

Family

ID=57896767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610817910.8A Active CN106372061B (en) 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics

Country Status (1)

Country Link
CN (1) CN106372061B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011078186A1 (en) * 2009-12-22 2011-06-30 日本電気株式会社 Document clustering system, document clustering method, and recording medium
CN103177125A (en) * 2013-04-17 2013-06-26 镇江诺尼基智能技术有限公司 Method for realizing fast-speed short text bi-cluster
CN103377239A (en) * 2012-04-26 2013-10-30 腾讯科技(深圳)有限公司 Method and device for calculating inter-textual similarity
CN104182388A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic analysis based text clustering system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG Jing, YANG Dongqing, TANG Shiwei, FU Yan, JIANG Hankui: "A Text Clustering Algorithm Based on Semantic Inner Product Space Model", Chinese Journal of Computers *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509410B (en) * 2017-02-27 2022-08-05 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407B (en) * 2017-02-27 2022-03-18 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN107463705A (en) * 2017-08-17 2017-12-12 陕西优百信息技术有限公司 A kind of data cleaning method
CN110019832A (en) * 2017-09-29 2019-07-16 阿里巴巴集团控股有限公司 The acquisition methods and device of language model
CN110019832B (en) * 2017-09-29 2023-02-24 阿里巴巴集团控股有限公司 Method and device for acquiring language model
US11334608B2 (en) 2017-11-23 2022-05-17 Infosys Limited Method and system for key phrase extraction and generation from text
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network
CN109086756B (en) * 2018-06-15 2021-08-03 众安信息技术服务有限公司 Text detection analysis method, device and equipment based on deep neural network
CN109472019A (en) * 2018-10-11 2019-03-15 厦门快商通信息技术有限公司 A kind of short text Similarity Match Method and system based on thesaurus
CN109472019B (en) * 2018-10-11 2023-02-10 厦门快商通信息技术有限公司 Short text similarity matching method and system based on synonymy dictionary
CN109657210A (en) * 2018-11-13 2019-04-19 平安科技(深圳)有限公司 Text accuracy rate calculation method, device, computer equipment based on semanteme parsing
CN109657210B (en) * 2018-11-13 2023-10-10 平安科技(深圳)有限公司 Text accuracy rate calculation method and device based on semantic analysis and computer equipment
CN110413986B (en) * 2019-04-12 2023-08-29 上海晏鼠计算机技术股份有限公司 Text clustering multi-document automatic summarization method and system for improving word vector model
CN110413986A (en) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN110348007A (en) * 2019-06-14 2019-10-18 北京奇艺世纪科技有限公司 A kind of text similarity determines method and device
CN110348007B (en) * 2019-06-14 2023-04-07 北京奇艺世纪科技有限公司 Text similarity determination method and device
CN110263347A (en) * 2019-06-26 2019-09-20 腾讯科技(深圳)有限公司 A kind of construction method and relevant apparatus of synonym
CN110442863B (en) * 2019-07-16 2023-05-05 深圳供电局有限公司 Short text semantic similarity calculation method, system and medium thereof
CN110442863A (en) * 2019-07-16 2019-11-12 深圳供电局有限公司 A kind of short text semantic similarity calculation method and its system, medium
CN111199154B (en) * 2019-12-20 2022-12-27 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN111199154A (en) * 2019-12-20 2020-05-26 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN112131341A (en) * 2020-08-24 2020-12-25 博锐尚格科技股份有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN113590763A (en) * 2021-09-27 2021-11-02 湖南大学 Similar text retrieval method and device based on deep learning and storage medium
CN114169651A (en) * 2022-02-14 2022-03-11 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity
CN114169651B (en) * 2022-02-14 2022-04-19 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity

Also Published As

Publication number Publication date
CN106372061B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN106372061A (en) Short text similarity calculation method based on semantics
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
Stein et al. Intrinsic plagiarism analysis
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN107992542A (en) A kind of similar article based on topic model recommends method
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
Liu et al. Measuring similarity of academic articles with semantic profile and joint word embedding
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN101295294A (en) Improved Bayes acceptation disambiguation method based on information gain
CN103646112A (en) Dependency parsing field self-adaption method based on web search
Reiplinger et al. Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis
CN111680131B (en) Document clustering method and system based on semantics and computer equipment
Li et al. A large probabilistic semantic network based approach to compute term similarity
Alsallal et al. Intrinsic plagiarism detection using latent semantic indexing and stylometry
CN109472022A (en) New word identification method and terminal device based on machine learning
Dascalu et al. Age of exposure: A model of word learning
CN108073571A (en) A kind of multi-language text method for evaluating quality and system, intelligent text processing system
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN114997288A (en) Design resource association method
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN104881400A (en) Semantic dependency calculating method based on associative network
Thielmann et al. Coherence based document clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant