CN106445920A - Sentence similarity calculation method based on sentence meaning structure characteristics - Google Patents

Sentence similarity calculation method based on sentence meaning structure characteristics

Info

Publication number
CN106445920A
CN106445920A
Authority
CN
China
Prior art keywords
sentence
word
topic
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610867254.2A
Other languages
Chinese (zh)
Inventor
罗森林
陈倩柔
潘丽敏
原玉娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201610867254.2A priority Critical patent/CN106445920A/en
Publication of CN106445920A publication Critical patent/CN106445920A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a sentence similarity calculation method based on sentence meaning structure features, aiming to solve the feature-sparsity problem in similarity calculation for short social texts. The method parses the meaning of each sentence according to a sentence meaning structure model, mines latent topical knowledge with a topic model, expands the sentence's features according to the theme-word distribution to obtain a sentence vector based on the sentence's own features, introduces the Paragraph Vector deep learning model to learn the sentence's context features and obtain a sentence vector based on context information, and finally weights the sentence similarities computed from the two sentence vectors. By deeply mining both the semantic information and the context information of sentences, the method describes the internal relations among sentences more comprehensively and accurately and improves the accuracy of similarity calculation.

Description

Sentence similarity calculation method using sentence meaning structure features
Technical field
The present invention relates to a sentence similarity calculation method that uses sentence meaning structure features, and belongs to the field of computer science and natural language processing.
Background technology
Sentence similarity calculation measures the degree of semantic similarity between two pieces of text and is a basic building block of natural language processing tasks such as information retrieval and automatic summarization. With the rapid development of social networking sites, short social texts, typified by microblog posts, have proliferated. Such texts are short and diverse in expression; because they lack the structural cues of longer documents, traditional sentence similarity calculation methods cannot be applied to them directly.
At present, according to the depth of semantic analysis applied to the sentence, similarity calculation methods for sentences in short social texts fall into three classes: methods based on word features, methods based on word-sense features, and methods based on syntactic-analysis features.
Methods based on word features are the earliest sentence similarity methods. They treat a sentence as a linear combination of words and use statistical means to compute surface information such as word frequency, part of speech, sentence length, and word order. Typical methods include Jaccard similarity coefficient string matching, which counts the number of words shared by two sentences as their similarity, and TF-IDF word frequency statistics, which represents each sentence as a vector and takes the cosine distance as the similarity result.
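The two word-feature baselines just described can be sketched as follows (a minimal illustration, not part of the claimed method; the toy corpus is an assumption):

```python
import math
from collections import Counter

def jaccard(s1, s2):
    """Jaccard coefficient over the word sets of two tokenized sentences."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def tfidf_cosine(s1, s2, corpus):
    """Cosine similarity of TF-IDF vectors built from a small corpus
    of tokenized documents."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(s):
        tf = Counter(s)
        return {w: tf[w] * idf.get(w, 0.0) for w in tf}

    v1, v2 = vec(s1), vec(s2)
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Both measures operate only on surface word overlap, which is exactly the limitation the background section goes on to discuss.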
Methods based on word-sense features approach the problem from the angle of semantic analysis, capturing the semantics of words through semantic knowledge resources. Depending on the resource used, they divide into dictionary-based methods and corpus-based methods. Dictionary-based methods rely mainly on lexical databases organized around word senses, such as WordNet and HowNet, combined with word sense disambiguation techniques to mine the meaning each word expresses in its given context, thereby improving the semantic resolution of the whole sentence. Corpus-based methods introduce a language-model framework and infer the similarity of two words from the probability of their co-occurrence; a common technique is Latent Semantic Analysis (LSA), which applies singular value decomposition to the word-document matrix to map the high-dimensional feature representation into a low-dimensional latent semantic space.
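The LSA mapping described above can be sketched as follows (an illustration with a toy term-document matrix; the truncation rank k and the matrix values are assumptions):

```python
import numpy as np

def lsa_embed(term_doc, k=2):
    """Project a term-document count matrix into a k-dimensional latent
    semantic space via truncated SVD. Columns of the result are the
    low-dimensional document vectors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # keep only the k largest singular directions
    return np.diag(s[:k]) @ Vt[:k, :]

# toy term-document matrix: 4 terms x 3 documents; documents 0 and 2
# share vocabulary, document 1 uses disjoint terms
X = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 0.]])
docs = lsa_embed(X, k=2)
```

In the latent space, documents 0 and 2 end up close while document 1 stays apart, mirroring the co-occurrence reasoning in the paragraph above.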
Methods based on syntactic-analysis features judge sentence similarity by analyzing the overall structure of the sentence. They hold that the predicate verb is the center of the sentence: the core verb governs the other sentence constituents and is itself governed by none, and the semantic information of the sentence is mined by analyzing the dependency relations among words. In practice, similarity is usually estimated only from content words such as verbs, nouns, and adjectives and from the collocations directly attached to them, so as to avoid the bias that noise data would otherwise introduce into the result.
Although the above methods compute sentence similarity at different levels of analysis, short social texts contain few content words. Word-feature methods perform no structural analysis and mine no semantic information; relying only on statistics of surface information such as word frequency and morphology, they cannot distinguish the deeper meanings of words. Word-sense methods consider the semantics of words but are constrained by external semantic resources; short social texts contain many out-of-vocabulary words and highly time-sensitive content, so incomplete dictionaries and sparse features often leave the semantic information ill-defined. Syntax-based methods are limited by the immaturity of current parsing technology and take neither the sentence's context nor its deep semantics into account; this missing information brings unpredictable error into the similarity result.
Content of the invention
To solve the problems that similarity calculation for short social texts suffers from sparse features and ignores deep semantic information, the present invention proposes a sentence similarity calculation method using sentence meaning structure features. By weighting and fusing multiple kinds of information while considering both the semantic information and the context information of the sentence, the method makes the sentence representation more complete; by deeply mining sentence semantics, it makes the similarity result independent of the form of expression and measures the degree of association between sentences more comprehensively and accurately.
The design principles of the present invention are: 1) parse the semantics of each sentence based on the Chinese Semantic Structure Model (CSM), extract the sentence meaning components, mine latent topical knowledge with the Latent Dirichlet Allocation (LDA) topic model, expand the features in the dimensions corresponding to the sentence meaning components, and obtain a sentence vector based on the sentence's own semantic information; 2) introduce the Paragraph Vector (PV) deep learning model to learn text features adaptively and obtain a sentence vector based on the sentence's context information; 3) compute the similarity between sentences with each of the two sentence vectors, combine them by linear weighting, and tune the coefficients by grid search so that the similarity result is more accurate.
The specific steps are as follows:
Step 1: preprocess the set of short social texts: first split the texts into sentences, then perform word segmentation and part-of-speech tagging, and remove stop words.
Step 2: based on the CSM sentence meaning structure analysis of each sentence and on the themes and word distributions obtained by applying the LDA topic model to the short-text set, expand the features of each sentence and compute sentence similarity.
Step 2.1: on the basis of step 1, perform sentence meaning structure analysis on each sentence and extract its topic, comment, basic items, and general items. CSM represents the semantics of a whole sentence as a structure tree with four layers: a sentence-type layer, a description layer, an object layer, and a detail layer. The sentence-type layer indicates the sentence meaning type, one of simple, complex, compound, and multiple sentence meaning. The description layer contains the topic and the comment, which are a first division of the sentence meaning and the essential components of the sentence meaning structure: the topic is defined as the object being described in the sentence meaning, and the comment is defined as the content that describes it. The object layer contains the predicate, basic items, general items, and semantic cases; semantic cases are semantic labels on words, comprising 7 basic cases and 12 general cases. Basic items are defined as the components directly related to the predicate and constitute the semantic backbone of the sentence; their corresponding semantic cases are basic cases. General items are defined as the modifying components of the sentence meaning; their corresponding semantic cases are general cases. The detail layer contains the extended meaning of the sentence.
Step 2.2: analyze the short-text set with the LDA topic model, mine the latent topical knowledge in the text, and extract the themes in the text and the word distribution under each theme, obtaining a text-theme matrix and a theme-word matrix. The themes produced by the LDA topic model can be used to partition the words in the text: words under the same theme have identical or similar semantics.
Step 2.3: expand the features of each sentence according to its topic to obtain a topic-based sentence vector. If the same word serves as part of the topic in one sentence and part of the comment in another, the two occurrences are considered to have different semantics and are defined as different words; accordingly, when expanding a sentence's features, the topic part and the comment part are expanded separately. The concrete method for expanding the topic part of a sentence is: first extract the words corresponding to the basic items and general items under the topic; then, using the theme-word matrix obtained in step 2.2, compare each word's probability under the different themes and choose the theme with the highest probability; add the other words under that theme to the sentence as part of it; finally, use all the words of the sentence as features and construct a feature vector to represent the sentence, where the value in a dimension corresponding to an original word is that word's number of occurrences in the sentence, and the value in a dimension corresponding to an expansion word is computed by formula (1),
V=n*w (1)
where V is the value in the dimension corresponding to the expansion word, n is the number of times the expansion word occurs in the sentence, and w is the expansion word's probability under the corresponding theme.
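The expansion of step 2.3 can be sketched as follows. The theme-word matrix is a stand-in for step 2.2's output, and the choice of n for a word not yet in the sentence is read here as the count of the sentence word that is most probable under the chosen theme (an assumption; the patent text is ambiguous on this point):

```python
from collections import Counter

def expand_sentence(words, theme_word):
    """`words`: tokenized topic-part words of one sentence;
    `theme_word`: {theme id: {word: probability}}, a stand-in for the
    theme-word matrix of step 2.2. Returns a feature dict in which
    original words keep their occurrence counts and each expansion
    word gets V = n * w (formula (1))."""
    counts = Counter(words)
    # choose the theme under which the sentence's words are most probable
    best = max(theme_word,
               key=lambda t: sum(theme_word[t].get(w, 0.0) for w in words))
    # n: count of the sentence word most probable under that theme (assumption)
    anchor = max(words, key=lambda w: theme_word[best].get(w, 0.0))
    n = counts[anchor]
    feats = dict(counts)                  # original-word dimensions: raw counts
    for w, prob in theme_word[best].items():
        if w not in feats:                # expansion-word dimensions
            feats[w] = n * prob           # V = n * w
    return feats
```

The comment part of the sentence (step 2.4) would be expanded the same way with its own words.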
Step 2.4, by the method for step 2.3, carries out feature expansion according to topic is stated to sentence, obtains based on the sentence for stating topic Vector.
Step 2.5: compute sentence similarity with each of the two sentence vectors obtained in steps 2.3 and 2.4, and weight the two similarity values to obtain the final similarity between the sentences. The specific calculation formula is as follows,
sim1(S_A, S_B) = ω · (S_At · S_Bt) / (|S_At| |S_Bt|) + (1 − ω) · (S_Ac · S_Bc) / (|S_Ac| |S_Bc|)    (2)
where S_A and S_B are any two sentences, sim1(S_A, S_B) is their similarity value, S_At and S_Bt are the topic-based sentence vectors of S_A and S_B, S_Ac and S_Bc are their comment-based sentence vectors, and ω is an adjustable parameter with range [0, 1] that adjusts the weight of the two similarities.
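Formula (2) can be sketched directly (a minimal illustration; the vectors would come from the feature expansion of steps 2.3 and 2.4):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sim1(sa_topic, sb_topic, sa_comment, sb_comment, omega):
    """Formula (2): weighted combination of the topic-based and
    comment-based cosine similarities; omega lies in [0, 1]."""
    return (omega * cosine(sa_topic, sb_topic)
            + (1 - omega) * cosine(sa_comment, sb_comment))
```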
Step 3: input all the sentences preprocessed in step 1 to the PV deep learning model, learn text features with the PV model to obtain a sentence vector for each sentence, and compute the cosine distance between sentence vectors as the similarity between sentences:
sim2(S_A, S_B) = (S_Ap · S_Bp) / (|S_Ap| |S_Bp|)    (3)
where S_A and S_B are any two sentences, sim2(S_A, S_B) is their similarity value, and S_Ap and S_Bp are the sentence vectors learned by the PV model. The PV model is an unsupervised learning method: its input is text of arbitrary length (an article, a paragraph, or a sentence, all referred to as text here), and its output is a continuous distributed vector representation of that text. Similar in principle to word2vec word vectors, it learns effective sentence or passage representations through feature learning while retaining semantic and word-order information. The PV model effectively solves the bag-of-words model's neglect of word sense and word order, and the dense vectors it produces also effectively overcome the feature sparsity of short-text sentence representations.
Step 4: linearly weight the similarity values obtained in steps 2 and 3, tune the parameters by grid search to find the optimal parameter values, and output the final similarity value for each sentence pair. The formula is as follows,
sim(S_A, S_B) = θ · sim1(S_A, S_B) + (1 − θ) · sim2(S_A, S_B)    (4)
where S_A and S_B are any two sentences, sim(S_A, S_B) is their similarity value, θ is an adjustable parameter with range [0, 1], and sim1(S_A, S_B) and sim2(S_A, S_B) are computed by formulas (2) and (3) respectively. Combining formulas (2), (3), and (4), the complete sentence similarity formula is:
sim(S_A, S_B) = θ · [ω · (S_At · S_Bt) / (|S_At| |S_Bt|) + (1 − ω) · (S_Ac · S_Bc) / (|S_Ac| |S_Bc|)] + (1 − θ) · (S_Ap · S_Bp) / (|S_Ap| |S_Bp|)    (5)
where ω and θ are adjustable parameters, both with range [0, 1], tuned by grid search according to the similarity calculation or application results so as to take their optimal values.
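The grid search of step 4 can be sketched as follows; `sim1_fn`, `sim2_fn`, and the clustering-quality `objective` are placeholders for the components defined above (a sketch of the tuning idea, not the patent's exact procedure):

```python
import itertools

def grid_search(pairs, sim1_fn, sim2_fn, objective, step=0.05):
    """Scan omega and theta over a grid on [0, 1], score each setting by
    applying `objective` to the fused similarities of formula (4)/(5),
    and return the best (omega, theta, score) triple.
    sim1_fn(a, b, omega) and sim2_fn(a, b) stand in for the similarity
    functions of steps 2 and 3."""
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    best = (None, None, float("-inf"))
    for omega, theta in itertools.product(grid, grid):
        sims = [theta * sim1_fn(a, b, omega) + (1 - theta) * sim2_fn(a, b)
                for a, b in pairs]
        score = objective(sims)
        if score > best[2]:
            best = (omega, theta, score)
    return best
```

In the embodiment the objective would be the silhouette coefficient of the clustering produced from the fused similarities.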
Beneficial effect
The sentence similarity calculation method of the present invention effectively reduces the loss of semantic information and characterizes the internal relations between sentences more comprehensively and accurately. By deeply mining the context and the inherent semantic structure features of sentences, it makes the similarity calculation independent of the sentences' form of expression and improves the accuracy of the result.
Specific embodiment
To better illustrate the objects and advantages of the present invention, the embodiments of the method are described in further detail below with a concrete example.
The experiment uses the corpus published for the Chinese microblog opinion element extraction evaluation task of the NLP&CC 2013 conference. Five topics, 10896 sentences in total, were randomly selected from it as the short-text set. The effect of sentence similarity calculation was evaluated by applying the computed similarities to short-text clustering and assessing the clustering quality. Clustering quality is measured with the silhouette coefficient, a concept first proposed by Peter J. Rousseeuw in 1987 that judges clustering quality from the two factors of cohesion and separation.
The silhouette coefficient is computed as follows:
(1) For the i-th object, compute its average distance to the other objects in its own cluster, denoted a_i.
(2) For the i-th object, compute its average distance to all objects in each cluster that does not contain it, and take the minimum over those clusters, denoted b_i.
(3) For the i-th object, the silhouette coefficient, denoted s_i, is computed by formula (6):
s_i = (b_i − a_i) / max(a_i, b_i)    (6)
The silhouette coefficient ranges over [−1, 1]. From formula (6), if s_i < 0, the average distance from the i-th object to the elements of its own cluster is greater than its distance to some other cluster, and the clustering is inaccurate. If a_i tends to 0, or b_i is sufficiently large, then s_i approaches 1: the data within each cluster are tighter, the separation between clusters is more pronounced, and the clustering is better.
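The computation above can be sketched directly (a minimal pure-Python version with `dist` an arbitrary distance function; not tied to any particular clustering tool):

```python
def silhouette(points, labels, dist):
    """Mean silhouette coefficient over all objects, following formula
    (6): s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the average
    distance from object i to the other members of its own cluster and
    b_i is the smallest average distance from i to any other cluster."""
    idx_by_cluster = {}
    for i, l in enumerate(labels):
        idx_by_cluster.setdefault(l, []).append(i)
    scores = []
    for i, l in enumerate(labels):
        own = [j for j in idx_by_cluster[l] if j != i]
        a = (sum(dist(points[i], points[j]) for j in own) / len(own)
             if own else 0.0)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for c, members in idx_by_cluster.items() if c != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

A tight, well-separated clustering scores near 1; a clustering that mixes the groups scores below 0, matching the interpretation given above.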
The specific implementation steps are:
Step 1: split the short social texts into sentences, then perform word segmentation and part-of-speech tagging on each sentence with ICTCLAS2015, and remove the stop words from the text according to a stop-word list downloaded from the Internet.
Step 2: perform sentence meaning structure analysis on each sentence of the short-text set with CSM, analyze the short-text set with the LDA topic model to obtain the themes and word distributions of the short texts, expand the features of each sentence, and compute sentence similarity.
Step 2.1: on the basis of step 1, perform sentence meaning structure analysis on each sentence and extract its topic, comment, basic items, and general items.
Step 2.2: analyze the short-text set with the LDA topic model, extract the themes in the text and the word distribution under each theme, and obtain the theme-word matrix.
Step 2.3: expand the features of each sentence according to its topic to obtain a topic-based sentence vector. The concrete method is: first extract the words corresponding to the basic items and general items under the topic; then, using the theme-word matrix obtained in step 2.2, compare each word's probability under the different themes and choose the theme with the highest probability; add the other words under that theme to the sentence as part of it; finally, use all the words of the sentence as features and construct a feature vector to represent the sentence, where the value in a dimension corresponding to an original word is that word's number of occurrences in the sentence, and the value in a dimension corresponding to an expansion word is computed by formula (1).
Step 2.4, by the method for step 2.3, carries out feature expansion according to topic is stated to sentence, obtains based on the sentence for stating topic Vector.
Step 2.5: compute sentence similarity with each of the two sentence vectors obtained in steps 2.3 and 2.4, weight the two similarity values, and obtain the final similarity between sentences by formula (2).
Step 3: input all the sentences preprocessed in step 1 to the PV deep learning model, learn text features with the PV model to obtain sentence vectors, and compute the cosine distance between sentence vectors as the similarity between sentences, with all PV model parameters left at the tool's default values.
Step 4: linearly weight the similarity values obtained in steps 2 and 3, tune the parameters ω and θ by grid search, and select the optimal parameter pair.
For the clustering of the 5 topics, with PV vector length size = 100 and window length window = 5, the silhouette coefficient reaches its best value of 0.45 when ω = 0.33 and θ = 0.25. When θ = 1, i.e., considering only the similarity obtained from the CSM sentence meaning structure analysis, the silhouette coefficient reaches 0.42; when θ = 0, i.e., considering only the sentence similarity obtained from the PV analysis, it reaches 0.31 (θ weights the CSM-based similarity in formula (4)). The experimental results show that the sentence vectors obtained with CSM capture the deeper internal semantic information of sentences, while the PV model gives the sentence vectors rich context information; the sentence similarity calculation method that considers both the sentence's own semantic information and its context information measures the degree of similarity between sentences more accurately.

Claims (3)

1. A sentence similarity calculation method using sentence meaning structure features, the method comprising the following steps:
Step 1: preprocess the short-text set: first split it into sentences, then perform word segmentation and part-of-speech tagging, and remove stop words;
Step 2: combining the sentence meaning structure features with the theme-word distribution features, expand the features of each sentence and compute sentence similarity;
Step 2.1: on the basis of step 1, perform sentence meaning structure analysis on each sentence and extract its topic, comment, basic items, and general items;
Step 2.2: analyze the short-text set with the LDA (Latent Dirichlet Allocation) topic model, extract the themes in the text and the word distribution under each theme, and obtain the theme-word matrix;
Step 2.3: expand the features of each sentence according to its topic to obtain a topic-based sentence vector;
Step 2.4: expand the features of each sentence according to its comment to obtain a comment-based sentence vector;
Step 2.5: compute sentence similarity with each of the two sentence vectors obtained in steps 2.3 and 2.4, and weight the two similarity values to obtain the final similarity between sentences, with the specific calculation formula as follows,
sim1(S_A, S_B) = ω · (S_At · S_Bt) / (|S_At| |S_Bt|) + (1 − ω) · (S_Ac · S_Bc) / (|S_Ac| |S_Bc|)
where S_A and S_B are any two sentences, sim1(S_A, S_B) is their similarity value, S_At and S_Bt are the topic-based sentence vectors of S_A and S_B, S_Ac and S_Bc are the comment-based sentence vectors of S_A and S_B, and ω is an adjustable parameter with range [0, 1];
Step 3: input all the sentences preprocessed in step 1 to the PV (Paragraph Vector) deep learning model, learn text features with the PV model to obtain sentence vectors, and compute the cosine distance between the sentence vectors as the similarity between sentences, with the formula
sim2(S_A, S_B) = (S_Ap · S_Bp) / (|S_Ap| |S_Bp|)
where S_A and S_B are any two sentences, sim2(S_A, S_B) is their similarity value, and S_Ap and S_Bp are the sentence vectors learned by the PV model;
Step 4: linearly weight the similarity values obtained in steps 2 and 3, tune the parameters by grid search to find the optimal parameter values, and output the final similarity value for each sentence pair.
2. The sentence similarity calculation method using sentence meaning structure features according to claim 1, characterized in that the concrete method of topic-based feature expansion in step 2.3 is: first extract the words corresponding to the basic items and general items under the topic; then, using the theme-word matrix obtained from the LDA analysis of the short-text set, compare each word's probability under the different themes and choose the theme with the highest probability; add the other words under that theme to the sentence as part of it; finally, use all the words of the sentence as features and construct a feature vector to represent the sentence, where the value in a dimension corresponding to an original word is that word's number of occurrences in the sentence, and the value in a dimension corresponding to an expansion word is computed as follows,
V=n*w
where V is the value in the dimension corresponding to the expansion word, n is the number of times the expansion word occurs in the sentence, and w is the expansion word's probability under the corresponding theme;
in step 2.4, the method of expanding a sentence's features based on its comment is the same as the topic-based expansion method.
3. The sentence similarity calculation method using sentence meaning structure features according to claim 1, characterized in that in step 4 the CSM-based similarity and the PV-based similarity are fused by weighting, with the specific calculation formula:
sim(S_A, S_B) = θ · sim1(S_A, S_B) + (1 − θ) · sim2(S_A, S_B)
where S_A and S_B are any two sentences, sim(S_A, S_B) is their similarity value, and θ is an adjustable parameter with range [0, 1]; combining the formulas in step 2.5 and step 3 of claim 1, the complete sentence similarity formula is:
sim(S_A, S_B) = θ · [ω · (S_At · S_Bt) / (|S_At| |S_Bt|) + (1 − ω) · (S_Ac · S_Bc) / (|S_Ac| |S_Bc|)] + (1 − θ) · (S_Ap · S_Bp) / (|S_Ap| |S_Bp|)
where ω and θ are adjustable parameters, both with range [0, 1].
CN201610867254.2A 2016-09-29 2016-09-29 Sentence similarity calculation method based on sentence meaning structure characteristics Pending CN106445920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610867254.2A CN106445920A (en) 2016-09-29 2016-09-29 Sentence similarity calculation method based on sentence meaning structure characteristics


Publications (1)

Publication Number Publication Date
CN106445920A true CN106445920A (en) 2017-02-22

Family

ID=58172480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610867254.2A Pending CN106445920A (en) 2016-09-29 2016-09-29 Sentence similarity calculation method based on sentence meaning structure characteristics

Country Status (1)

Country Link
CN (1) CN106445920A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device of the text similarity analysis based on Spark-Streaming
CN108287824A (en) * 2018-03-07 2018-07-17 北京云知声信息技术有限公司 Semantic similarity calculation method and device
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on sentence justice structural model
CN108509408A (en) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 A kind of sentence similarity judgment method
CN109101489A (en) * 2018-07-18 2018-12-28 武汉数博科技有限责任公司 A kind of text automatic abstracting method, device and a kind of electronic equipment
CN109145299A (en) * 2018-08-16 2019-01-04 北京金山安全软件有限公司 Text similarity determination method, device, equipment and storage medium
CN109857990A (en) * 2018-12-18 2019-06-07 重庆邮电大学 A kind of financial class notice information abstracting method based on file structure and deep learning
CN110008465A (en) * 2019-01-25 2019-07-12 网经科技(苏州)有限公司 The measure of sentence semantics distance
CN110020421A (en) * 2018-01-10 2019-07-16 北京京东尚科信息技术有限公司 The session information method of abstracting and system of communication software, equipment and storage medium
CN110287291A (en) * 2019-07-03 2019-09-27 桂林电子科技大学 A kind of unsupervised English short essay sentence is digressed from the subject analysis method
CN110348133A (en) * 2019-07-15 2019-10-18 西南交通大学 A kind of bullet train three-dimensional objects structure technology effect figure building system and method
CN110413761A (en) * 2019-08-06 2019-11-05 浩鲸云计算科技股份有限公司 A kind of method that the territoriality in knowledge based library is individually talked with
CN110765360A (en) * 2019-11-01 2020-02-07 新华网股份有限公司 Text topic processing method and device, electronic equipment and computer storage medium
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110990537A (en) * 2019-12-11 2020-04-10 中山大学 Sentence similarity calculation method based on edge information and semantic information
CN110990451A (en) * 2019-11-15 2020-04-10 浙江大华技术股份有限公司 Data mining method, device and equipment based on sentence embedding and storage device
CN111008783A (en) * 2019-12-05 2020-04-14 浙江工业大学 Factory processing flow recommendation method based on singular value decomposition
CN111078849A (en) * 2019-12-02 2020-04-28 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN111125301A (en) * 2019-11-22 2020-05-08 泰康保险集团股份有限公司 Text method and device, electronic equipment and computer readable storage medium
CN111209375A (en) * 2020-01-13 2020-05-29 中国科学院信息工程研究所 Universal clause and document matching method
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 Topic word embedding disambiguation method and system based on LDA
CN112650836A (en) * 2020-12-28 2021-04-13 成都网安科技发展有限公司 Text analysis method and device based on syntax structure element semantics and computing terminal
CN112686025A (en) * 2021-01-27 2021-04-20 浙江工商大学 Method for generating Chinese multiple-choice distractors based on free text
CN113536907A (en) * 2021-06-06 2021-10-22 南京理工大学 Social relationship identification method and system based on deep supervised feature selection
CN116756347A (en) * 2023-08-21 2023-09-15 中国标准化研究院 Semantic information retrieval method based on big data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778256A (en) * 2015-04-20 2015-07-15 江苏科技大学 Rapid incremental clustering method for domain question-answering system consultations
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolutional neural network
CN105573985A (en) * 2016-03-04 2016-05-11 北京理工大学 Sentence expression method based on Chinese sentence meaning structural model and topic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TOMAS et al.: "UWB at SemEval-2016 Task 1: Semantic Textual Similarity using Lexical, Syntactic, and Semantic Information", Proceedings of SemEval-2016 *
YUHUA LI et al.: "Sentence Similarity Based on Semantic Nets and Corpus Statistics", IEEE Transactions on Knowledge and Data Engineering *
LIN Meng et al.: "A microblog topic summarization algorithm incorporating the sentence meaning structure model", Journal of Zhejiang University (Engineering Science) *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509408B (en) * 2017-02-27 2019-11-22 芋头科技(杭州)有限公司 Sentence similarity judgment method
CN108509408A (en) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 Sentence similarity judgment method
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Automatic abstract extraction method and system based on latent semantic analysis
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 Data processing method and device for text similarity analysis based on Spark-Streaming
CN110020421A (en) * 2018-01-10 2019-07-16 北京京东尚科信息技术有限公司 Session information extraction method and system for communication software, device and storage medium
CN108287824A (en) * 2018-03-07 2018-07-17 北京云知声信息技术有限公司 Semantic similarity calculation method and device
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on the sentence meaning structure model
CN109101489A (en) * 2018-07-18 2018-12-28 武汉数博科技有限责任公司 Automatic text summarization method, device and electronic equipment
CN109101489B (en) * 2018-07-18 2022-05-20 武汉数博科技有限责任公司 Automatic text summarization method, device and electronic equipment
CN109145299A (en) * 2018-08-16 2019-01-04 北京金山安全软件有限公司 Text similarity determination method, device, equipment and storage medium
CN109145299B (en) * 2018-08-16 2022-06-21 北京金山安全软件有限公司 Text similarity determination method, device, equipment and storage medium
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110895656B (en) * 2018-09-13 2023-12-29 北京橙果转话科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN109857990B (en) * 2018-12-18 2022-11-25 重庆邮电大学 Financial bulletin information extraction method based on document structure and deep learning
CN109857990A (en) * 2018-12-18 2019-06-07 重庆邮电大学 Financial bulletin information extraction method based on document structure and deep learning
CN110008465A (en) * 2019-01-25 2019-07-12 网经科技(苏州)有限公司 Measurement method of sentence semantic distance
CN110287291B (en) * 2019-07-03 2021-11-02 桂林电子科技大学 Unsupervised method for analyzing off-topic sentences in short English essays
CN110287291A (en) * 2019-07-03 2019-09-27 桂林电子科技大学 Unsupervised off-topic analysis method for short English essay sentences
CN110348133A (en) * 2019-07-15 2019-10-18 西南交通大学 System and method for constructing high-speed train three-dimensional product structure technical effect diagram
CN110348133B (en) * 2019-07-15 2022-08-19 西南交通大学 System and method for constructing high-speed train three-dimensional product structure technical effect diagram
CN110413761A (en) * 2019-08-06 2019-11-05 浩鲸云计算科技股份有限公司 Method for domain-specific individual dialogue based on a knowledge base
CN110765360B (en) * 2019-11-01 2022-08-02 新华网股份有限公司 Text topic processing method and device, electronic equipment and computer storage medium
CN110765360A (en) * 2019-11-01 2020-02-07 新华网股份有限公司 Text topic processing method and device, electronic equipment and computer storage medium
CN110990451A (en) * 2019-11-15 2020-04-10 浙江大华技术股份有限公司 Data mining method, device and equipment based on sentence embedding and storage device
CN110990451B (en) * 2019-11-15 2023-05-12 浙江大华技术股份有限公司 Sentence embedding-based data mining method, device, equipment and storage device
CN111125301A (en) * 2019-11-22 2020-05-08 泰康保险集团股份有限公司 Text method and device, electronic equipment and computer readable storage medium
CN111078849A (en) * 2019-12-02 2020-04-28 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN111008783B (en) * 2019-12-05 2022-03-18 浙江工业大学 Factory processing flow recommendation method based on singular value decomposition
CN111008783A (en) * 2019-12-05 2020-04-14 浙江工业大学 Factory processing flow recommendation method based on singular value decomposition
CN110990537B (en) * 2019-12-11 2023-06-27 中山大学 Sentence similarity calculation method based on edge information and semantic information
CN110990537A (en) * 2019-12-11 2020-04-10 中山大学 Sentence similarity calculation method based on edge information and semantic information
CN111209375A (en) * 2020-01-13 2020-05-29 中国科学院信息工程研究所 Universal clause and document matching method
CN111209375B (en) * 2020-01-13 2023-01-17 中国科学院信息工程研究所 Universal clause and document matching method
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 Topic word embedding disambiguation method and system based on LDA
CN111368532B (en) * 2020-03-18 2022-12-09 昆明理工大学 Topic word embedding disambiguation method and system based on LDA
CN112650836A (en) * 2020-12-28 2021-04-13 成都网安科技发展有限公司 Text analysis method and device based on syntax structure element semantics and computing terminal
CN112686025A (en) * 2021-01-27 2021-04-20 浙江工商大学 Method for generating Chinese multiple-choice distractors based on free text
CN112686025B (en) * 2021-01-27 2023-09-19 浙江工商大学 Method for generating Chinese multiple-choice distractors based on free text
CN113536907A (en) * 2021-06-06 2021-10-22 南京理工大学 Social relationship identification method and system based on deep supervised feature selection
CN116756347A (en) * 2023-08-21 2023-09-15 中国标准化研究院 Semantic information retrieval method based on big data
CN116756347B (en) * 2023-08-21 2023-10-27 中国标准化研究院 Semantic information retrieval method based on big data

Similar Documents

Publication Publication Date Title
CN106445920A (en) Sentence similarity calculation method based on sentence meaning structure characteristics
CN103544255B (en) Network public opinion information analysis method based on text semantic relatedness
CN103617280B (en) Method and system for mining Chinese event information
CN104391942B (en) Short text feature extension method based on semantic graphs
CN103136359B (en) Single-document abstract generation method
CN110020189A (en) Article recommendation method based on Chinese similarity measures
CN107247780A (en) Ontology-based patent document similarity measurement method
CN103455562A (en) Text orientation analysis method and product review orientation discriminator based thereon
CN106156272A (en) Information retrieval method based on multi-source semantic analysis
CN108549634A (en) Chinese patent text similarity calculation method
CN109858028A (en) Short text similarity calculation method based on a probabilistic model
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN106484664A (en) Similarity calculation method between short texts
CN105243152A (en) Automatic summarization method based on graph model
CN103324700B (en) Ontology concept attribute learning method based on Web information
CN103646112B (en) Dependency parsing domain adaptation method based on web search
CN105653518A (en) Specific group discovery and expansion method based on microblog data
CN103399901A (en) Keyword extraction method
CN102880723A (en) Search method and system for identifying user retrieval intention
CN103970730A (en) Method for extracting multiple subject terms from a single Chinese text
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN104778204A (en) Multi-document subject discovery method based on two-layer clustering
CN108038205A (en) Opinion analysis prototype system for Chinese microblogs
CN105975475A (en) Fine-grained topic information extraction method based on Chinese phrase strings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170222