CN108399163A

CN108399163A - Bluebeard compound polymerize the text similarity measure with word combination semantic feature

Info

Publication number: CN108399163A
Application number: CN201810234539.1A
Authority: CN
Inventors: 罗森林; 周晓瑞; 潘丽敏; 魏超; 吴舟婷
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-08-14
Anticipated expiration: 2038-03-21
Also published as: CN108399163B

Abstract

The present invention relates to the text similarity measures for combining text set word polymerization and word combination distributed semantic feature, belong to natural language processing and machine learning field.Prediction that this method combines word polymerization in text set first, word combination distributed semantic feature progress own coding is composed a poem to a given tune of ci, word embedded coding network is established by the training process of own coding；Then it is embedded in and is indicated by word embedded coding network struction word, then calculated the maximum weighted matching that word is embedded in and measured as text similarity.The present invention has the characteristics that accuracy height, distributed semantic feature rich.The word embedded coding network of structure can efficiently use the semantic relation of word, establish the abundanter word insertion of distributed semantic information and indicate that the Semantic Similarity between better words of description further promotes the accuracy of text similarity measurement.

Description

Bluebeard compound polymerize the text similarity measure with word combination semantic feature

Technical field

The present invention relates to combination text set words to polymerize the text similarity measure with word combination distributed semantic feature, Belong to natural language processing and machine learning field.

Background technology

Currently, in multiple applications for needing to carry out text-processing such as more text matches, text cluster/classification, information retrieval In scene, text similarity measurement plays more and more important basic effect.In addition, in the text based on manifold learning Dimensionality reduction indicates in research that most of algorithm is substantially built upon the choosing that the neighbour on text set schemes structure or neighbour's text collection It selects, and the basis of these algorithms all relies on good text similarity measurement.

Text similarity measurement is broadly divided into the method based on character string and the method based on corpus.And it is based on corpus Text similarity measurement again can be divided into two steps：Pass through the training of the context (context) of word in corpus first The word insertion of text indicates that the distributed semantic information for the text that word insertion represents, recycling word insertion calculates between text Similitude.

1. the method based on character string

Text similarity measurement based on character string usually converts the text to vector or the discretization similar with vector Serial No., certain distance measure by calculating and comparing these vector sum sequences are used as the measurement of text similarity.For example, Damerau-Levenshtein distances are by insertion, deletion, replacement and exchange adjacent character four kinds of position mode of operation by one Text is converted into another text, and the similitude of two texts is measured by comparing required operation step number.It can also will be literary Originally being converted into a point of vector space, either vector measures text by comparing the distance of point or vector in vector space This similitude, such as Euclidean distance (Euclidean Distance) calculate the air line distance of two coordinate points in vector space, It is higher apart from smaller expression text similarity.Manhatton distance (Manhattan Distance) exists the air line distance of two points The sum of the projector distance of each reference axis, as the similarity measurement of two texts, apart from smaller, similitude is higher.COS distance (Cosine Similarity) calculates two vectorial cosine angles in vector space, and angle is smaller, indicates the phase of two texts It is higher like property.Finally, longest common subsequence (Longest Common Substring) is also that a kind of common text is similar Property measure, it is similar as the two by comparing longest identical continuation character subsequence present in two character strings The measurement of property.

2. the method based on corpus

Text similarity based on character string is measured using word, word or phrase as independent semantic primitive, is not examined fully The semantic relation that Matching Relation implies between worry word, causes them to be difficult to accurately portray the Semantic Similarity of text.

The missing of meaning of a word contact details reduces the accuracy of final text similarity measurement.It is this in order to effectively utilize The meaning of a word contacts, the distributed semantic feature construction word that the text similarity measurement based on corpus passes through word in analysis text set The vector of language indicates that the word insertion theories of learning occurred in recent years provide effective solution thinking for the problem.

(1) word incorporation model：Earliest work about word insertion proposes that he is in a series of papers by Bengio 2003 In used neural probabilistic language model (Neural probabilistic language models) to make machine " acquistion language Distributed indicate ", to achieve the purpose that language space dimensionality reduction.Since entire modeling process is to be based on N-gram models, So obtained word insertion can reflect the continuity between word context, the i.e. semantic relation of text.Word embedding grammar Inspiration comes from distributed hypothesis, i.e., similar linguistic unit distribution has similar meaning.The word embedding grammar of mainstream is studied It lays particular emphasis on and extracts feature from target word and its context, modeled by distributed semantic contact.

(2) word2vec models：Word2vec was proposed by Tomas Mikolov in 2013, included Continuous altogether Two kinds of models of Bag-of-Words (CBOW) and Skip-Gram (SG), both models use simple neural network framework The expression for practising word, is randomly assigned a coding vector for all words first, by the sliding window of a regular length from language Expect to collect target word w in library_tContext distributed information C_t, recycle neural network according to given C_tPredict w_tProcess (CBOW) or given w_tPredict C_t(SG).The word insertion finally built can preferably reflect distributed semantic information.Because it Simplicity, Skip-Gram and CBOG models can be trained on large data sets, be handled by parallelization, they can be with Learn the model of a more than one hundred billion word in 24 hours.But the shortcomings that both models, which is them, does not account for the overall situation Statistical information.

(3) Glove models：Pennington of Stanford University in 2014 et al. proposes that Glove models, the model are adopted again Distributed semantic information is characterized with the matrix of a global word-word, its main target is to observe the co-occurrence rate of word and word, is come The meaning of a word of comparing word in the text.Glove models initially set up a word co-occurrence matrix, then pass through singular value decomposition (SVD) vector for obtaining word indicates.Glove models improve the effect of word insertion.

(4) 2013 years Stephane Clinchant build word insertion using a gauss hybrid models (GMM) and indicate, In per one-dimensional word insertion can be regarded as a topic, thus entire text is converted into a set of word insertion, One text can be indicated by the vector of an indefinite length, finally use a Fisher kernel function that will represent original The DUAL PROBLEMS OF VECTOR MAPPING of this indefinite length of text is finally counted using COS distance at the fixed document representation vector of a length Calculate the similitude between text.

(5) 2014 years, Baotian Hu devised a kind of sentence level using depth convolutional neural networks (CNN) training Matching Model, this model keep sentence in word sequencing, sentence be converted into one by word insertion vector constitute Matrix, then successively carry out convolution operation, obtain adjacent word in matrix combination indicate, in depth convolutional neural networks Pond layer, the model select potential word combination using the operation of local maxima pondization, finally by a full connection The text vector that layer neural net layer obtains a fixed sentence level of length indicates.The shortcomings that this model, is, due to general Logical text usually contains a plurality of sentence, and the vector of single sentence indicates that the feature of semanteme of text can not be reacted how to utilize sentence Vector construct entire text vector indicate still to need deeper into research.Therefore, the effect of this mode still has promotion empty Between.

(6) 2014 years Quoc Le and Tomas Mikolov propose Paragraph again based on CBOW and SG frameworks Vector, it is a unsupervised-learning algorithm, by the way that text to be regarded as to " the virtual word as one and true word equivalence Language ", after one layer of additional mapping layer, into training in word2vec network structures, and with the w that occurs jointly_tAnd C_tAltogether With training, finally, original text is represented as the vector of a regular length in word embedded space.Similar achievement also has The Doc2vec that Chen Minmin in 2017 are proposed, the model randomly select one group of word insertion from text, utilize word insertion Averaging operation constructs text representation vector, to improve the execution efficiency of training process.

Summarizing the above text similarity measure can obtain：(1) the text similarity measure based on character string The semantic information between word in text is had ignored, causes them that can not accurately reflect the text similarity of deep layer；(2) word insertion side Method does not account for word combination, word polymerize the otherness of two kinds of distributed semantic features, this will necessarily cause word insertion to lose one Divide distributed semantic information.

Invention content

The purpose of the present invention is to solve the high-accuracy metric question of text similarity, propose a kind of imparametrization text Method for measuring similarity.The word polymerization of this method combination text contacts feature with word combination, and text is regarded as about there is word The word embedding set of language builds an own coding word insertion learning framework (SPC), passes through the mistake of one similar " prediction of composing a poem to a given tune of ci " Journey obtains the similarity measurement of text.

The present invention design principle be：According to the distributed definition assumed to word distribution form, exist between word A kind of " orthogonal " Two dimensional Distribution formula semantic relation feature：Word polymerization, word combination contact.Utilize the word polymerization of word, word combination Target vocabulary is first removed from its contextual window, then passes through one by the difference for the semantic level that the difference of distribution form is brought The course prediction target vocabulary of a encoding and decoding and other vocabulary that can substitute target vocabulary " filling a vacancy ", finally by own coding Network is trained, and can be built word embedded coding network and be indicated for extracting word insertion.

Technical scheme of the present invention includes that structure full text statistical information adjacency matrix and own coding compose a poem to a given tune of ci and predict two processes, Steps are as follows for specific implementation：

Step 1, construct a directed connected graph about dictionary V has wherein each node indicates a word in dictionary Front and back combination contact, is expressed as an adjacency matrix by this digraph, passes through adjacency matrix between indicating word to even side Off-diagonal element indicates the sequence combination contact (local message) of word in context, while being collected in full text by diagonal element The statistical information (global information) of word；

Step 2, the adjacency matrix established based on step 1, by the course prediction target vocabulary of encoding and decoding and with poly- The word of contact is closed, detailed process is as follows：

Step 2.1, a context is givenAnd p_i={ p_i(w_t)}_t∈V∈ C, it is assumed that polymerization word Remittance is w_t, target vocabulary is t_k(l+L)=w_s∈ V (word in window centre position), enable z_i(w_s) indicate about in target vocabulary It hereafter reconstructs, z_i(w_t) indicate context reconstruct about polymerization vocabulary, according to encoding-decoding process, according toPredict c_iAnd p_i (w_t), whole process can be expressed as formula

The target of model is to minimize the loss of reconstruct, is lost come Metric reconstruction using mean square error, the target letter of model Number can be expressed as following reconstruct loss metric formula

By minimizing reconstruct loss, as correct as possible " answer " filling space is found out, is as follows：

Step 2.1.1, the information decomposition that the context word combination adjacency matrix of target vocabulary is provided are full text statistics letter The diagonal regions element vector and context composite structure submatrix of breath；

Step 2.1.2 is iterated the update of formula parameter by gradient descent algorithm to the weight that these are remained；

Step 3, the similarity measurement δ between word pair is calculated_st, the word that consideration step 2 obtains is to (w_s,w_t), wherein w_s∈ Ω_i, w_t∈Ω_j, w is embedded in by word_sWith text Ω_jThe insertion of all words executes inner products and calculates come the similitude between indicating, It is denoted as δ^(s)=w_s·Ω_j；Likewise, being embedded in w by word_tWith Ω_iIn the insertion of all words ask inner product both to calculate to characterize Between similitude, be denoted as δ^(t)=w_t·Ω_i, then utilize softmax function formulas

Above-mentioned two situations are normalized, are denoted as σ (δ respectively^(s))_tWith σ (δ^(t))_s, finally select the two in compared with Similarity measurement of the big value as word pair, calculation formula are

δ_st=max (σ (δ^(s))_t,σ(δ^(t))_s)；

Step 4, the similarity measurement δ of the word and text that are obtained by step 3_st, by equalizing maximum weighted matching distance (Normalized maximum Matching Distance, NMD) obtains the similarity measurement of text, the calculation formula of NMD For

It, will be in coding network and decoding network in step 2.1.1 meanwhile according to loss metric process is reconstructed described in step 2 The decoding network for being responsible for reconstruct target vocabulary is set as shared relationship, and then reduces the parameter ranges for needing to estimate, prevents model Over-fitting.

Advantageous effect

Compared to the text similarity measure based on character string, the text based on corpus that the present invention uses is similar Property measure have the characteristics that accuracy rate is high, reflection text semantic feature.Through structure full text statistical information adjacency matrix and Own coding is composed a poem to a given tune of ci prediction, and the present invention can reach higher accuracy rate, while can efficiently use semantic relation between word, and carry For abundant distributed semantic information, effect is measured to further promote text similarity.

The maximum weighted matching distance of calculating word insertion, Ke Yichong are indicated by word insertion constructed by word embedded coding network Divide and provided distributed semantic information is provided using word, more accurately portray the Semantic Similarity of text, has good Application value and promotional value.

Description of the drawings

Fig. 1 is that the bluebeard compound polymerization phrase of the present invention closes distributed semantic word insertion learning algorithm schematic diagram；

The digraph of context word combination feature indicates example in Fig. 2 specific implementation modes；

Encoding and decoding frame based on word combination, aggregation association in Fig. 3 specific implementation modes；

Fig. 4 is the result figure that (2) are tested in specific implementation mode；

Fig. 5 is text similarity measurement algorithm principle figure in specific implementation mode；

Fig. 6 is the result figure that (3) are tested in specific implementation mode；

Fig. 7 is the result figure that (4) are tested in specific implementation mode.

Specific implementation mode

In order to better illustrate objects and advantages of the present invention, with reference to the accompanying drawings and examples to the reality of the method for the present invention The mode of applying is described in further details.

1. carrying out (1) part of speech than experiment and (2) meaning of a word discrimination contrast experiment by being embedded in construction method with existing word, test Card SPC has humidification to the distributed semantic information that word is embedded in.The specific method of comparison include CBOW and SG, Glove, HDC, CWin and SSG.Prediction of composing a poem to a given tune of ci only is carried out according to the context and full text information of target word simultaneously, SPC is simplified, It is denoted as SPC-1；On the basis of SPC-1, target word is predicted by only contextual information, obtains SPC-0.

2. the cluster and classifying quality of quantitative analysis NMD is tested by the experiment of (3) text cluster and (4) text classification, it is right The method of ratio includes Euclidean distance, COS distance, WMD distances.

Aforementioned four experiment flow is illustrated one by one below, all tests are completed on same computer, tool Body is configured to：Tetra- core processors of Intel (R) Core (TM) i7-477lK, 3.50GHz, physical memory (RAM) 32GB.Experiment opening Hair ring border：Windows 7spl, Microsoft VS2013, Python2.7, Matlab 2015a.

(1) part of speech is carried out than experiment using wikipedia corpus (Wikipedia 2010) disclosed in April, 2010, be somebody's turn to do Corpus includes about 2,000,000 texts and 9,900,000 vocabulary.After pretreatment, all words are changed to small letter, and screen Go out word of the occurrence number more than 20 to be trained.Part of speech is shown in Table 1 than testing using the details of corpus：

1. part of speech of table uses corpus details than experiment

Part of speech is than testing the COS distance by calculating word insertion, from all word insertions, search and vectorMost similar vectorTo answer, such as " word is embedded in w_aIt is embedded in w with word_bRelationship be similar to word it is embedding Enter w_cWhich it is embedded in word ".Using accuracy (accuracy) as evaluation index, calculation formula is for this experiment：

Accuracy=N_true/N

Wherein, N is the sum of problem, N_trueIt indicates and the accurate matched answer number of correct option.

(2) utilize two common data sets of context similarity (SCWS) of WordSim-353 (WS-353) and Stamford into Row meaning of a word discrimination is tested.WS-353 is made of 353 pairs of nouns, and each noun is mutual indepedent, and relevant contextual information is not present, One is provided by 0 to 10 marking results to the subjective judgement of similitude and correlation by the mankind.SCWS data sets are by each word With provide together below thereon, can based on context reflect the semantic variation of target vocabulary.

Vector is embedded in by the word of different terms to indicate to calculate the cosine similarity between arbitrary word pair, then count first The Spearman rank correlation ρ between the cosine similarity score between these words pair and human subject's judgement marking is calculated, into The word insertion of this assessment structure indicates vector for the semantic discrimination effect between word.For n word pair, y is enabled_iIndicate some The grade sequence that human subject gives a mark between word pair, x_iIndicate that being embedded in vector according to the word of the word pair indicates the cosine being calculated The grade sequence of similarity score, then, Spearman rank correlation calculation formula is：

(3) text cluster experiment, wherein 20newsgroups are carried out using 20newsgroups and RCV1 public data collection It is made of 20 different themes of news.And RCV1 is the multiclass for comprising more than 800,000 news releases provided by Reuter Other text data set, this experiment therefrom select 4 classifications to carry out dependence test, M11 (EQUITY MARKETS), M12 (BOND MARKETS), M131 (INTERBANK MARKETS) and M132 (FOREX MARKETS).Specific statistical information is shown in Table 2.

Partial statistical information on 2. two kinds of text sets of table

Evaluation index, value are used as using Average Mutual (Normalized mutual information, NMI) Range is [0,1], and value is higher to show that Clustering is more similar to true grouping.Its calculation formula is：

Wherein, C={ C_iIndicate by clustering algorithm and T={ T_iIndicate the true classification situation of text.H (C) and H (T) The entropy of two kinds of situations is corresponded to respectively, and MI (C, T) is the mutual information between two kinds of situations, its calculation formula is：

p(C_i,T_i) it is text x_i∈C_i,T_iJoint probability, expression randomly select a text from text set while belonging to In C_i,T_iProbability, and p (C_i) indicate to randomly select a text x_i∈C_iProbability, p (T_i) indicate to randomly select a text x_i∈T_iProbability.

(4) data set that text classification experiment uses is as text cluster.Classification experiments are using weighting F valuesTo assess The accuracy of final classification result,Value is higher to show that classification results are more accurate, and calculation formula is：

Wherein, c_iIndicate that the text of classification i accounts for ratio on entire test set, C indicates the whole size of test set.F_iIt is to close In the F values of classification i, it and accurate rate (precision) P_iWith recall rate (recall) R_iIt is closely related, these three evaluation indexes Calculation formula be：

Experimental result

For experiment (1), 3 are the results are shown in Table.On general effect, SPC has surmounted other word insertion construction methods.This shows Prediction that joint text set word polymerize and word combination distributed semantic feature progress own coding is composed a poem to a given tune of ci, establishes word embedded coding network, The semantic relation between word can be made full use of, the abundanter word insertion of distributed semantic information is established and indicates, under The accuracy that one step improves text similarity measurement provides guarantee.

Accuracy (%) of 3. part of speech of table than answering result in evaluation task

For experiment (2), under two datasets, SPC performances in the case where different words are embedded in dimension are better than control methods, this is tested It has demonstrate,proved and has established word embedded coding network by combining word polymerization, word combination distributed semantic feature in text set and can effectively carry Accuracy of the word insertion for meaning of a word discrimination is risen, the similitude between the meaning of a word is preferably described.The result of experiment (2) is shown in Fig. 4.

For experiment (3), from overall effect as can be seen that NMD the and WMD effects of word-based insertion are better than traditional text This similarity measurement can reach 63.1% on 20 newsgroups language materials.This shows can have using word embedded coding network Effect is using semantic relation between word, and the abundant distributed semantic information provided, to further promote text similarity measurement Effect.The result of experiment (3) is shown in Fig. 6.

For experiment (4), the experimental results showed that, it is embedded in and is indicated using the word constructed by word embedded coding network, Ke Yichong Divide using semantic relation between word, enhance the accuracy with similarity measurement between class text, thus more accurately portrays text language Adopted similitude.When k increases to 20, degenerating occurs in the classifying quality of WMD, and NMD remains to keep preferably to show, can To reach 71.59%.This illustrates that NMD can efficiently use the distributed semantic information of SPC offers, and it is similar further to promote text Property measurement accuracy.The result of experiment (4) is shown in Fig. 7.

Above-mentioned 4 experiments the experimental results showed that, the present invention has that accuracy rate is high, spy of distributed semantic feature rich Point.In the part of speech ratio experiment of word insertion, accuracy reaches 73.95%, in the meaning of a word discrimination experiment of word insertion, Spearman Rank correlation reaches 74.12.By combining word polymerization, word combination distributed semantic feature in text set, structure word insertion to compile Code network can efficiently use the semantic relation of word, establish the abundanter word insertion of distributed semantic information and indicate, more preferably Words of description between Semantic Similarity.

In text cluster experiment, the cluster NMI effects of NMD reach 63.1%, in text classification experiment, point of NMD ClassEffect promoting is to 71.59%.The maximum weighted of calculating word insertion is indicated by word insertion constructed by word embedded coding network Matching distance can make full use of word to be embedded in provided distributed semantic information, more accurately portray the semantic phase of text Like property, the accuracy of text similarity measurement is further promoted.

Claims

1. bluebeard compound polymerize the text similarity measure with word combination semantic feature, it is characterised in that：

Since the word polymerization of word, the difference of word combination distribution form can bring the difference of semantic level in text, for fully profit With the semantic information corresponding to this species diversity, this patent builds context word combination representation first, recycles in text set Prediction that the word of word polymerize, word combination distributed semantic feature progress own coding is composed a poem to a given tune of ci, word is obtained by the process that own coding is trained Embedded coding network；Then, it is embedded in and is indicated using obtained word embedded coding network struction word, finally by calculating word insertion Measurement of the maximum weighted matching distance as text similarity, specifically comprises the following steps：

Step 1, a directed connected graph about dictionary V is constructed, wherein each node indicates a word in dictionary, oriented company Side indicates front and back combination contact between word, this digraph is expressed as an adjacency matrix, passes through the non-right of adjacency matrix The sequence combination contact (local message) of word in the element representation context of angle, while word in full text is collected by diagonal element Statistical information (global information)；

Step 2, the adjacency matrix established based on step 1, by the course prediction target vocabulary of encoding and decoding and with polymerization connection The word of system, detailed process are as follows：

Step 2.1, a context is givenAnd p_i={ p_i(w_t)}_t∈V∈ C, it is assumed that polymerizeing vocabulary is w_t, target vocabulary is t_k(l+L)=w_s∈ V (word in window centre position), enable z_i(w_s) indicate about target vocabulary context Reconstruct, z_i(w_t) indicate context reconstruct about polymerization vocabulary, according to encoding-decoding process, according toPredict c_iAnd p_i(w_t), Whole process can be expressed as formula

The target of model is to minimize the loss of reconstruct, is lost come Metric reconstruction using mean square error, the object function of model can To be expressed as following reconstruct loss metric formula

Step 2.1.1, the information decomposition that the context word combination adjacency matrix of target vocabulary is provided are full text statistical information Diagonal regions element vector and context composite structure submatrix；

Step 3, the similarity measurement δ between word pair is calculated_st, the word that consideration step 2 obtains is to (w_s, w_t), wherein w_s∈Ω_i, w_t ∈Ω_j, w is embedded in by word_sWith text Ω_jThe insertion of all words executes inner products and calculates come the similitude between indicating, is denoted as δ^(s)=w_s·Ω_j；Likewise, being embedded in w by word_tWith Ω_iIn the insertion of all words carry out that inner product is asked to calculate to characterize between the two Similitude, be denoted as δ^(t)=w_t·Ω_i, then utilize softmax function formulas

Above-mentioned two situations are normalized, are denoted as σ (δ respectively^(s))_tWith σ (δ^(t))_s, finally select larger in the two It is worth the similarity measurement as word pair, calculation formula is

δ_st=max (σ (δ^(s))_t, σ (δ^(t))_s)；

2. minimizing reconstruct loss process according to claim 1, it is characterised in that：In step 2.1.1 by coding network with The decoding network for being responsible for reconstruct target vocabulary in decoding network is set as shared relationship, and then reduces the parameter amount for needing to estimate Grade, prevents model over-fitting.