CN104657350B - Short text hash learning method fusing latent semantic features - Google Patents

Short text hash learning method fusing latent semantic features

Info

Publication number
CN104657350B
CN104657350B (application CN201510096518.4A)
Authority
CN
China
Prior art keywords
text
feature
hash
vector
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510096518.4A
Other languages
Chinese (zh)
Other versions
CN104657350A (en)
Inventor
徐博
许家铭
郝红卫
田冠华
王方圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201510096518.4A priority Critical patent/CN104657350B/en
Publication of CN104657350A publication Critical patent/CN104657350A/en
Application granted granted Critical
Publication of CN104657350B publication Critical patent/CN104657350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a short text hash learning method fusing latent semantic features, comprising: reducing the dimensionality of a training text and binarizing it through a hash loss function to generate low-dimensional binary codes; obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors by table lookup according to the word features and position features; coupling the word vectors and position vectors through a convolutional neural network model to obtain the latent semantic features of the training text; training the convolutional neural network model on the low-dimensional binary codes to obtain an updated model; encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text; and matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text. The present invention can obtain texts that are semantically similar to a query text.

Description

Short text hash learning method fusing latent semantic features
Technical field
The present invention relates to the field of information retrieval, and more particularly to a short text hash learning method that fuses latent semantic features.
Background technology
Hash learning methods are widely used in approximate nearest neighbour search, a technique applied in information retrieval, duplicate content detection, tag prediction and recommender systems. At present, hash learning methods map the explicit semantic features of text into a low-dimensional binary space, which does not preserve semantic similarity well. For example, given the two texts "President write his first computer program" and "Obama kick off hour of code", such hash learning methods cannot semantically associate the explicit features "President" with "Obama" or "program" with "code". To capture the semantic associations between explicit features in text, latent semantic models have been used to build text similarity; however, these methods are still trained on bag-of-words representations, ignore the hyponymy and word-order relations within the text, and therefore still fail to preserve semantic similarity well.
Summary of the invention
The present invention provides a short text hash learning method fusing latent semantic features, so as to obtain texts that are semantically similar to a query text.
According to an aspect of the present invention, there is provided a short text hash learning method fusing latent semantic features, the method comprising: reducing the dimensionality of a training text and binarizing it through a hash loss function to generate low-dimensional binary codes; obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors from the word features and position features, respectively, by table lookup; coupling the word vectors and position vectors through a convolutional neural network model to obtain the latent semantic features of the training text; training the convolutional neural network model on the low-dimensional binary codes to obtain an updated model; encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text; and matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text.
The short text hash learning method fusing latent semantic features provided by the present invention generates low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, encodes the training text with the updated convolutional neural network model to generate semantic hash codes, maps the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and matches the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
Brief description of the drawings
Fig. 1 is a flowchart of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention;
Fig. 2 is a framework diagram of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention;
Fig. 3 is a retrieval performance diagram provided by an embodiment of the present invention;
Fig. 4 is a retrieval performance diagram provided by another embodiment of the present invention.
Detailed description of the embodiments
The general concept of the present invention is to generate low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, to encode the training text with the updated convolutional neural network model to generate semantic hash codes, to map the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and to match the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
The short text hash learning method fusing latent semantic features provided by embodiments of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention.
Referring to Fig. 1, in step S101, the training text is reduced in dimensionality and binarized through a hash loss function to generate low-dimensional binary codes.
According to an exemplary embodiment of the present invention, generating the low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through the hash loss function includes:
In step S1011, a similarity matrix is constructed from the training text.
In step S1012, Laplacian eigenvectors are obtained from the similarity matrix.
In step S1013, a mean vector is obtained from the Laplacian eigenvectors.
In step S1014, the Laplacian eigenvectors are binarized using the mean vector, thereby generating the low-dimensional binary codes.
According to an exemplary embodiment of the present invention, constructing the similarity matrix from the training text includes:
calculating the similarity matrix according to formula (1):
S_ij = c_ij · exp(−‖x_i − x_j‖² / (2σ²))  if x_i ∈ NN_k(x_j) or x_j ∈ NN_k(x_i), and S_ij = 0 otherwise   (1)
where S_ij is the similarity matrix, NN_k(x) is the set of k nearest neighbours of the training text x, c_ij is a confidence coefficient, and σ is a tuning parameter.
Here, a training text is denoted by x and the similarity matrix by S_ij. Candidate similarity measures include the cosine of the included angle, Euclidean distance, the Gaussian kernel and the linear kernel; by way of non-limiting example, the local similarity matrix is constructed with a Gaussian kernel.
In THC-I of the present invention, c_ij is always 1, whereas the THC-II and THC-IV models adjust this coefficient using label information. When two samples x_i and x_j share any label (T_ij = 1), c_ij is set to a higher value a; conversely, when the two samples x_i and x_j share no label (T_ij = 0), c_ij is set to a lower value b, as shown in formula (2):
c_ij = a if T_ij = 1, and c_ij = b if T_ij = 0   (2)
where the parameters a and b satisfy 1 ≥ a ≥ b > 0. For a specific data set, the higher the confidence of the labels, the larger the gap between a and b should be set. In the embodiment of the present invention, a = 1 and b = 0.1.
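As a concrete illustration, the following Python sketch builds the local similarity matrix of formulas (1)-(2) under the assumptions stated above (a Gaussian kernel over the k nearest neighbours, and confidence coefficients a/b determined by shared labels); the function and parameter names are illustrative and not part of the patent.

import numpy as np

def build_similarity(X, labels=None, k=10, sigma=1.0, a=1.0, b=0.1):
    """X: (n, d) feature matrix; labels: (n, c) binary tag matrix or None."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # squared Euclidean distances
    knn = np.argsort(dist2, axis=1)[:, 1:k + 1]                  # k nearest neighbours, excluding the sample itself
    S = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            s = np.exp(-dist2[i, j] / (2.0 * sigma ** 2))        # Gaussian-kernel similarity, formula (1)
            if labels is not None:                               # THC-II / THC-IV: label-based confidence, formula (2)
                c = a if (labels[i] * labels[j]).any() else b    # shared label -> a, otherwise -> b
            else:                                                # THC-I: c_ij is always 1
                c = 1.0
            S[i, j] = S[j, i] = c * s                            # keep the matrix symmetric
    return S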
In step S1012, the Laplacian eigenvectors are obtained from the similarity matrix.
To obtain the low-dimensional binary codes Y^(0) of the pre-training text set, the optimization objective is designed as shown in formula (3):
min_Y Σ_{i,j} S_ij · ‖y_i − y_j‖²_F
s.t. Y ∈ {−1, 1}^{n×r}, Yᵀ1 = 0, YᵀY = I   (3)
where S_ij is the local similarity matrix constructed by formula (1), y_i is the low-dimensional binary code of text x_i, and ‖·‖_F is the F-norm. By relaxing the discretization constraint Y ∈ {−1, 1}^{n×r} on the binary codes, the optimal low-dimensional real-valued vectors can be obtained by solving the Laplacian eigenmap problem, which is not detailed here.
In step S1013, the mean vector is obtained from the Laplacian eigenvectors; it serves as the threshold for the binarization in step S1014.
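The following Python sketch illustrates steps S1012-S1014 under the standard Laplacian eigenmap relaxation of formula (3); thresholding each eigenvector at the mean vector is taken from the description above (a median threshold is a common alternative for balanced bits), and the function name is illustrative.

import numpy as np

def pretrain_binary_codes(S, r=64):
    """S: (n, n) similarity matrix; returns the (n, r) pre-training codes Y^(0) in {-1, 1}."""
    D = np.diag(S.sum(axis=1))
    L = D - S                               # graph Laplacian of the local similarity graph
    _, vecs = np.linalg.eigh(L)             # eigenvectors, eigenvalues in ascending order
    U = vecs[:, 1:r + 1]                    # keep r eigenvectors, dropping the trivial constant one
    thresh = U.mean(axis=0)                 # the mean vector used as the binarization threshold
    return np.where(U > thresh, 1, -1)      # binarize the Laplacian eigenvectors into low-dimensional codes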
In step S102, word features and position features are obtained from the training text, and the corresponding word vectors and position vectors are obtained from the word features and position features, respectively, by table lookup.
Here, the word vectors are obtained by looking up the distributed vector representations of the word features in a lookup table; the word-vector table is also updated as a parameter of the model.
Similarly, the position vectors are obtained by looking up the distributed vector representations of the position features in a lookup table. The position-vector representations are all randomly initialized and updated as parameters during model training.
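A minimal Python sketch of this lookup step, assuming word and position indices have already been extracted from the text; the table sizes follow the experimental settings reported later (d_w = 50, d_p = 8), and all names are illustrative.

import numpy as np

vocab_size, max_len, d_w, d_p = 26265, 38, 50, 8
E_W = np.random.randn(vocab_size, d_w) * 0.01   # word-vector lookup table (GloVe-initialized in the experiments)
E_P = np.random.randn(max_len, d_p) * 0.01      # position-vector lookup table (randomly initialized)

def lookup(word_ids, pos_ids):
    """Return the matrixized representation of one text: one (d_w + d_p)-dimensional row per word."""
    return np.hstack([E_W[word_ids], E_P[pos_ids]])

x_f = lookup(np.array([5, 17, 42]), np.array([0, 1, 2]))   # a three-word text -> shape (3, 58)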
In step S103, the word vectors and position vectors are coupled through the convolutional neural network model to obtain the latent semantic features of the training text.
According to an exemplary embodiment of the present invention, coupling the word vectors and position vectors through the convolutional neural network model to obtain the latent semantic features of the training text includes:
In step S1031, one-dimensional convolutions are applied to the word vectors and position vectors to obtain feature matrices.
In step S1032, the feature matrices are turned into one-dimensional feature vectors by a collapsing operation.
In step S1033, the maximum neural units are selected from the one-dimensional feature vectors.
In step S1034, the maximum neural units are passed through a tangent activation function to obtain the latent semantic features of the training text.
In step S1031, one-dimensional convolutions are applied to the word vectors and position vectors to obtain the feature matrices.
The word vectors and position vectors are first concatenated, so that each word in the training text is represented by a (d_w + d_p)-dimensional vector, and each text is represented by the matrixized feature representation X_F.
In this example, a one-dimensional convolution is applied to the matrixized feature representation of the text, where w is the window size of the convolution kernel and n₁ is the number of convolution kernels. For convenience of presentation, a diagonal-matrix form is introduced, and the j-th convolution kernel is expressed in formula (4).
Applying the j-th one-dimensional convolution kernel to X_F, the matrixized feature representation of the text given in formula (6), yields the post-convolution feature matrix C_j.
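A minimal Python sketch of the per-kernel one-dimensional convolution, assuming each embedding dimension of X_F is convolved independently along the word axis with a wide convolution (which matches the t × (d_w + d_p) shape of C_j described below); the kernel shape and function name are illustrative.

import numpy as np

def conv1d_per_dim(x_f, kernel):
    """x_f: (len, d) text matrix; kernel: (w, d); returns the (t, d) feature matrix C_j with t = len + w - 1."""
    length, d = x_f.shape
    w = kernel.shape[0]
    c = np.zeros((length + w - 1, d))
    for q in range(d):                                               # one independent 1-D convolution per dimension
        c[:, q] = np.convolve(x_f[:, q], kernel[:, q], mode="full")  # wide ("full") convolution along the words
    return c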
According to an exemplary embodiment of the present invention, obtaining the one-dimensional feature vector from the feature matrix through the collapsing operation includes:
calculating the one-dimensional feature vector according to formula (7):
Ĉ^(0)_{j,p} = Σ_{q=1}^{d_w+d_p} C_{j,p,q}   (7)
where Ĉ^(0)_j is the one-dimensional feature vector and (d_w + d_p) is the dimension summed over.
Here, given the feature matrix C_j produced by the j-th convolution operation, the collapsing operation directly compresses the t × (d_w + d_p)-dimensional feature matrix into a t × 1-dimensional vector. Through the collapsing layer, the feature matrix C_j is thus compressed into the one-dimensional vector Ĉ^(0)_j.
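In Python, the collapsing operation of formula (7) is a single sum over the dimension axis; this tiny sketch assumes C_j is a NumPy array.

import numpy as np

def collapse(c_j):
    """c_j: (t, d_w + d_p) feature matrix of one kernel; returns the (t,) vector C-hat^(0)_j."""
    return c_j.sum(axis=1)   # formula (7): sum the t x (d_w + d_p) matrix over its dimension axis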
According to an exemplary embodiment of the present invention, obtaining the latent semantic features of the training text from the maximum neural units through the tangent activation function includes:
calculating the latent semantic feature of the training text according to the following formula:
m = tanh(Ĉ^(0))   (8)
where Ĉ^(0) denotes the vector of maximum neural units obtained by k-max sampling over the n₁ convolution kernel channels, and m is the latent semantic feature of the training text.
Here, k-max sampling is performed on the feature vector obtained after the j-th collapsing layer, keeping the k largest neural units. The latent semantic feature is then obtained through the tangent activation function.
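A minimal Python sketch of the k-max sampling and tangent activation, assuming the k largest values of each collapsed vector are kept in their original order and the outputs of all n₁ kernels are concatenated into m; the function names are illustrative.

import numpy as np

def k_max_tanh(c_hat, k=2):
    """c_hat: (t,) collapsed vector of one kernel; returns its k activated maximum neural units."""
    idx = np.sort(np.argsort(c_hat)[-k:])    # positions of the k largest values, original order preserved
    return np.tanh(c_hat[idx])               # tangent activation

def latent_feature(collapsed_vectors, k=2):
    """Concatenate the k-max units of all n1 kernels into the latent semantic feature m."""
    return np.concatenate([k_max_tanh(c, k) for c in collapsed_vectors])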
In step S104, the convolutional neural network model is trained on the low-dimensional binary codes to obtain the updated convolutional neural network model.
According to an exemplary embodiment of the present invention, training the convolutional neural network model on the low-dimensional binary codes to obtain the updated model includes:
In step S1041, the latent semantic features and the explicit semantic features of the training text are input to the output layer of the convolutional neural network model;
In step S1042, the error with respect to the low-dimensional binary codes is back-propagated to update the parameters of the convolutional neural network model.
Here, in step S1041, the latent semantic feature of the training text is denoted by m and the explicit semantic feature of the training text is represented by its TF-IDF vector. The latent semantic feature m and the explicit TF-IDF feature of the training text are combined by a linear transformation, as given in formula (9):
O^(H) = W_Z·m + α·W_O·x   (9)
where O^(H) is the output vector, W_Z and W_O are linear transformation matrices, x is the TF-IDF feature vector, and α is the feature fusion coefficient.
To carry out binarization and obtain the hash codes, r logistic regressions are applied to the output-layer feature O^(H), as given in formula (10).
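The following Python sketch shows formula (9) together with one plausible reading of formula (10), assuming each of the r output units is a logistic (sigmoid) unit whose activation is thresholded at 0.5 to give one hash bit; the thresholding detail is an assumption, and the names are illustrative.

import numpy as np

def semantic_hash_code(m, x, W_Z, W_O, alpha=16.0):
    o_h = W_Z @ m + alpha * (W_O @ x)      # formula (9): fuse latent feature m and explicit TF-IDF feature x
    p = 1.0 / (1.0 + np.exp(-o_h))         # formula (10): r logistic units on the output layer
    return (p > 0.5).astype(int)           # threshold each unit to obtain the r-bit semantic hash code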
If the training text carries label information t ∈ {0,1}^{c×n}, then in the extended models THC-III and THC-IV, c extra output units are added to fit the label information, as given in formula (11):
O^(C) = W′_Z·m + α·W′_O·x   (11)
Here, W′_Z and W′_O are linear transformation matrices, and c logistic regressions are used on the extra output units, as given in formula (12).
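A minimal Python sketch of the extra label outputs of formulas (11)-(12) for THC-III / THC-IV, assuming c additional logistic units fit the multi-label vector; the names are illustrative.

import numpy as np

def label_output(m, x, W_Zc, W_Oc, alpha=16.0):
    o_c = W_Zc @ m + alpha * (W_Oc @ x)    # formula (11): c extra output units on latent and explicit features
    return 1.0 / (1.0 + np.exp(-o_c))      # formula (12): per-label logistic probability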
Here, in step S1042, the parameters to be updated in the model are collectively defined as θ, as given in formula (13):
θ = {E^(W), E^(P), W, W_Z, W_O, W′_Z, W′_O}   (13)
Given the training set, the pre-trained low-dimensional binary codes Y^(0) and the label information t = {t₁, t₂, ..., t_n} ∈ {0,1}^{c×n}, the objective function based on cross entropy is given in formula (14).
The parameters θ are updated with the stochastic gradient descent method to optimize the objective function, as given in formula (15).
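As an illustration of the training step behind formulas (14)-(15), the Python sketch below uses a generic cross entropy between the logistic outputs and the pre-trained codes (mapped to {0,1}) and a plain stochastic gradient descent update; the exact form of formula (14) and the gradient computation are not reproduced in this text, so this is a sketch under those assumptions.

import numpy as np

def cross_entropy(p, y01, eps=1e-9):
    """Cross entropy between predicted probabilities p and 0/1 targets y01 (in the spirit of formula (14))."""
    return -(y01 * np.log(p + eps) + (1 - y01) * np.log(1 - p + eps)).sum()

def sgd_step(theta, grad, lr=0.01):
    """Formula (15): theta <- theta - lambda * dJ/dtheta, applied to every parameter listed in formula (13)."""
    return {name: value - lr * grad[name] for name, value in theta.items()}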
In step S105, the training text is encoded with the updated convolutional neural network model to generate the semantic hash codes, and the query text is mapped to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text.
In step S106, the hash code of the query text is matched against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text.
According to an exemplary embodiment of the present invention, matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text includes:
matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the matched semantic hash codes;
ranking the matched semantic hash codes by Hamming distance to obtain the texts semantically similar to the query text.
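A minimal Python sketch of this Hamming-space matching, assuming the hash codes are stored as 0/1 integer arrays and texts are ranked by increasing Hamming distance to the query code; the names and the top-N cut-off are illustrative.

import numpy as np

def hamming_search(query_code, corpus_codes, top_n=10):
    """query_code: (r,) array; corpus_codes: (n, r) array; returns indices of the top_n most similar texts."""
    dists = (corpus_codes != query_code).sum(axis=1)   # Hamming distance from the query to every corpus text
    return np.argsort(dists)[:top_n]                   # rank by increasing Hamming distance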
The short text hash learning method fusing latent semantic features provided by the embodiment of the present invention generates low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, encodes the training text with the updated convolutional neural network model to generate semantic hash codes, maps the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and matches the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
Fig. 2 is a framework diagram of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention.
Referring to Fig. 2, the short text hash learning method fusing latent semantic features includes two stages: the first stage is the hash code pre-training stage, and the second stage is the hash function training and prediction stage.
First stage: the training text is reduced in dimensionality and binarized through the hash loss function to generate the low-dimensional binary codes.
Second stage: word features and position features are obtained from the training text, and the corresponding word vectors and position vectors are obtained from the word features and position features, respectively, by table lookup;
one-dimensional convolutions are applied to the word vectors and position vectors to obtain feature matrices;
the feature matrices are turned into one-dimensional feature vectors by the collapsing operation;
the maximum neural units are selected from the one-dimensional feature vectors;
the maximum neural units are passed through the tangent activation function to obtain the latent semantic features of the training text;
the latent semantic features and the explicit semantic features of the training text are input to the output layer of the convolutional neural network model;
the error with respect to the low-dimensional binary codes is back-propagated to update the parameters of the convolutional neural network model;
the training text is encoded with the updated convolutional neural network model to generate the semantic hash codes, and the query text is mapped to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text;
the hash code of the query text is matched against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text.
To assess the retrieval performance of the present invention accurately, the retrieval system evaluates the overall effect of the invention by the average precision of the top N returned results.
Because the raw features of short text data sets do not reflect the semantic similarity between documents well, the experiments of the present invention treat two samples as semantically related texts if and only if they share at least one label; all evaluation metric results in the experiments are averaged over the retrieval results of all test samples against the training samples.
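The Python sketch below makes this evaluation protocol concrete under the assumptions just stated (relevance means sharing at least one label; the score is the precision of the top-N training texts retrieved by Hamming distance, averaged over all test queries); the function name and arguments are illustrative.

import numpy as np

def mean_precision_at_n(test_codes, test_labels, train_codes, train_labels, n=100):
    """Average precision of the top-n retrieved training texts over all test queries."""
    scores = []
    for code, lab in zip(test_codes, test_labels):
        ranked = np.argsort((train_codes != code).sum(axis=1))[:n]   # top-n training texts by Hamming distance
        relevant = [(train_labels[i] * lab).any() for i in ranked]   # relevant if any label is shared
        scores.append(np.mean(relevant))
    return float(np.mean(scores))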
A publicly available short text data set containing 8 text categories is used in the experiments of the present invention. No preprocessing (such as stop-word removal or stemming) is applied to the data set, and the class labels are used as label information. The statistics of the text data set used in the experiments are shown in Table 1:
Table 1
Data set  Number of classes  Training/test size  Length (average/maximum)  Vocabulary size
SearchSnippets 8 10060/2280 17.2/38 26265
In the experiments, the width w of the convolution kernel is fixed at 3, the number of convolution kernels n₁ is 80, k in the k-max sampling layer is 2, the dimension d_w of the word-vector layer is 50, the dimension d_p of the position-vector layer is 8, and the learning rate λ is 0.01. In addition, the feature weight α of the output layer is gradually adjusted from 0.001 to 1024 in the experiments, and the finally chosen optimal value of α is 16.
By default, the experiments use the publicly released 50-dimensional word vectors trained with the GloVe tool and compare them with other word vectors, such as the Senna word vectors and random initialization. The coverage statistics of the GloVe and Senna word vectors are shown in Table 2:
Table 2
Four variant models are proposed in the present invention:
THC-I: the basic model of the invention, which trains the whole hash function without any labels;
THC-II: label information is incorporated in stage 1, the hash code pre-training stage;
THC-III: label information is incorporated in stage 2, the hash function training and prediction stage;
THC-IV: label information is incorporated in both stage 1 and stage 2.
The following hash methods are used for comparison in the experiments of the present invention:
Comparison method 1: self-taught hashing, a typical two-step hash method. In the first step, Laplacian feature dimensionality reduction is performed on the raw text features, and the hash codes are obtained after binarization by the mean vector. In the second step, using the raw features of the training samples and the corresponding r-dimensional hash codes generated in the previous stage, r binary support vector machine classifiers are trained as the hash functions.
Comparison method 2: self-taught hashing based on a Gaussian kernel, a modified version of comparison method 1 in which r binary support vector machines with Gaussian kernels are used in the hash function training stage.
Comparison method 3: supervised self-taught hashing, a further modified version of comparison method 1 that adds full supervision as a constraint: when building the local similarity matrix S, only similarities between samples with the same class label are considered.
Comparison method 4: fast hashing (FastHash), a two-step hash method that uses decision tree models as hash functions, can handle the hash mapping of high-dimensional data, and takes full supervision into account.
Table 3 reports the mean retrieval precision, with 64-bit hash codes, of the method of the present invention, self-taught hashing, self-taught hashing based on a Gaussian kernel, supervised self-taught hashing and fast hashing. THC-I, self-taught hashing and self-taught hashing based on a Gaussian kernel use no labelled data during training, while THC-II, THC-III, THC-IV, fast hashing and supervised self-taught hashing use label information during hash learning.
Table 3
It can be seen that the basic method THC-I of the present invention clearly outperforms the other hash methods that use no labels (self-taught hashing and self-taught hashing based on a Gaussian kernel). Compared with the supervised hash learning methods, THC-II, THC-III and THC-IV likewise deliver better retrieval performance, and THC-IV is the best variant model. We find that, owing to the introduction of word vectors and the convolutional neural network structure, the basic method THC-I, which uses no labelled data, even exceeds the supervised hash methods among the comparison methods. Table 4 compares how the word vectors, position vectors, TF-IDF and other word-vector features (for example, Senna word vectors and random initialization) in the features learned by the method of the present invention affect retrieval performance.
Table 4
It can be seen that after position features are added on top of the word features, the mean retrieval precision improves by 1%-2%, and is about 8% better than the retrieval performance based on the explicit TF-IDF features alone. When the explicit features are further fused with the implicit features in the method of the present invention, the mean retrieval precision of the system improves by roughly another 1%. The experiments also compare the influence of other word-vector features on retrieval performance. The retrieval results based on the Senna word vectors are only about 2% lower than those based on the GloVe word vectors used by default in the present invention. However, even though the word-vector features are updated as parameters in the method of the present invention, the retrieval performance based on randomly initialized word vectors drops by about 10%. The experiments show the necessity of initializing the model parameters with word vectors learned without supervision from a large corpus.
Next, we study the influence of the feature fusion parameter α on retrieval performance in the method of the present invention. We adjust α gradually from 0.001 to 1024; the corresponding retrieval results are shown in the retrieval performance diagram of Fig. 3 provided by an embodiment of the present invention. As the parameter α becomes larger, the retrieval performance tends towards the results based on the explicit features alone; likewise, as α becomes smaller, the retrieval performance tends towards the results based on the implicit features alone. Only when α is tuned to an optimal value does the system reach its best retrieval performance.
For the convolutional neural network structure, the word-vector dimension, the learning rate and the window size of the convolution kernel are fixed in the experiments of the present invention. We heuristically limit the number of implicit-feature neurons to 160; the different convolutional neural network structures are listed in Table 5.
Table 5
                               Framework 1  Framework 2  Framework 3  Framework 4  Framework 5
Number of convolution kernels      160          80           40           20           10
k in k-max sampling                  1           2            4            8           16
The corresponding retrieval results are shown in the retrieval performance diagram of Fig. 4 provided by another embodiment of the present invention. We can see that when the number of k-max samples is less than 4, the retrieval performance declines only slowly. However, because increasing the number of convolution kernels increases the number of output neurons in the convolutional and collapsing layers and therefore the computational complexity, the experiments of the present invention uniformly adopt framework 2 as a compromise, i.e. 80 convolution kernels with 2-max sampling.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that would readily occur to those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the appended claims.

Claims (6)

1. A short text hash learning method fusing latent semantic features, characterised in that the method comprises:
generating low-dimensional binary codes by reducing the dimensionality of a training text and binarizing it through a hash loss function;
obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors from the word features and position features, respectively, by table lookup;
coupling the word vectors and position vectors through a convolutional neural network model to obtain latent semantic features of the training text;
training the convolutional neural network model on the low-dimensional binary codes to obtain an updated convolutional neural network model;
encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate a hash code of the query text;
matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text;
wherein generating the low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through the hash loss function comprises:
constructing a similarity matrix from the training text;
obtaining Laplacian eigenvectors from the similarity matrix;
obtaining a mean vector from the Laplacian eigenvectors;
binarizing the Laplacian eigenvectors using the mean vector, thereby generating the low-dimensional binary codes;
wherein coupling the word vectors and position vectors through the convolutional neural network model to obtain the latent semantic features of the training text comprises:
applying one-dimensional convolutions to the word vectors and position vectors, respectively, to obtain feature matrices;
obtaining one-dimensional feature vectors from the feature matrices through a collapsing operation;
selecting maximum neural units from the one-dimensional feature vectors;
obtaining the latent semantic features of the training text from the maximum neural units through a tangent activation function.
2. The method according to claim 1, characterised in that constructing the similarity matrix from the training text comprises:
calculating the similarity matrix according to the following formula:
S_ij = c_ij · exp(−‖x_i − x_j‖² / (2σ²))  if x_i ∈ NN_k(x_j) or x_j ∈ NN_k(x_i), and S_ij = 0 otherwise
wherein S_ij is the similarity matrix, NN_k(x) is the set of k nearest neighbours of the training text x, c_ij is a confidence coefficient, and σ denotes a tuning parameter.
3. The method according to claim 1, characterised in that obtaining the one-dimensional feature vectors from the feature matrices through the collapsing operation comprises:
calculating the one-dimensional feature vector according to the following formula:
Ĉ^(0)_{j,p} = Σ_{q=1}^{d_w+d_p} C_{j,p,q}
wherein (d_w + d_p) is the dimension; Ĉ^(0)_j denotes the one-dimensional feature vector; Ĉ^(0)_{j,p} denotes the p-th value of Ĉ^(0)_j; d_w denotes the word-vector dimension; d_p denotes the position-vector dimension; the superscript (0) identifies a feature after the collapsing operation; C_j denotes the feature matrix before the collapsing operation; p indexes the elements of the one-dimensional feature vector; C_{j,p} denotes the p-th row feature vector of C_j; and q indexes the elements of the p-th row feature vector of the feature matrix before the collapsing operation.
4. The method according to claim 3, characterised in that obtaining the latent semantic features of the training text from the maximum neural units through the tangent activation function comprises:
calculating the latent semantic feature of the training text according to the following formula:
m = tanh(Ĉ^(0))
wherein m is the latent semantic feature of the training text; Ĉ^(0) denotes the vector of maximum neural units obtained by k-max sampling over the n₁ convolution kernel channels; and the superscript (0) identifies a feature after the collapsing operation.
5. The method according to claim 4, characterised in that training the convolutional neural network model on the low-dimensional binary codes to obtain the updated convolutional neural network model comprises:
inputting the latent semantic features and the explicit semantic features of the training text to an output layer of the convolutional neural network model;
back-propagating the error with respect to the low-dimensional binary codes to update the parameters of the convolutional neural network model.
6. The method according to claim 1 or 5, characterised in that matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text comprises:
matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain matched semantic hash codes;
ranking the matched semantic hash codes by Hamming distance to obtain the texts semantically similar to the query text.
CN201510096518.4A 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features Active CN104657350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510096518.4A CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510096518.4A CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Publications (2)

Publication Number Publication Date
CN104657350A CN104657350A (en) 2015-05-27
CN104657350B true CN104657350B (en) 2017-06-09

Family

ID=53248499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510096518.4A Active CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Country Status (1)

Country Link
CN (1) CN104657350B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220220A (en) * 2016-03-22 2017-09-29 索尼公司 Electronic equipment and method for text-processing
CN106095735A (en) * 2016-06-06 2016-11-09 北京中加国道科技有限责任公司 A kind of method plagiarized based on deep neural network detection academic documents
CN106354701B (en) * 2016-08-30 2019-06-21 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN108073576A (en) * 2016-11-09 2018-05-25 上海诺悦智能科技有限公司 Intelligent search method, searcher and search engine system
CN106776545B (en) * 2016-11-29 2019-12-24 西安交通大学 Method for calculating similarity between short texts through deep convolutional neural network
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
CN107016708B (en) * 2017-03-24 2020-06-05 杭州电子科技大学 Image hash coding method based on deep learning
CN107092918B (en) * 2017-03-29 2020-10-30 太原理工大学 Image retrieval method based on semantic features and supervised hashing
CN107145910A (en) * 2017-05-08 2017-09-08 京东方科技集团股份有限公司 Performance generation system, its training method and the performance generation method of medical image
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107391575B (en) * 2017-06-20 2020-08-04 浙江理工大学 Implicit feature recognition method based on word vector model
CN107563408A (en) * 2017-08-01 2018-01-09 天津大学 Cell sorting method based on Laplce's figure relation and various visual angles Fusion Features
CN107967253A (en) * 2017-10-27 2018-04-27 北京大学 A kind of low-resource field segmenter training method and segmenting method based on transfer learning
CN107894979B (en) * 2017-11-21 2021-09-17 北京百度网讯科技有限公司 Compound word processing method, device and equipment for semantic mining
CN109840321B (en) * 2017-11-29 2022-02-01 腾讯科技(深圳)有限公司 Text recommendation method and device and electronic equipment
CN108536669B (en) * 2018-02-27 2019-10-22 北京达佳互联信息技术有限公司 Literal information processing method, device and terminal
CN108874941B (en) * 2018-06-04 2021-09-21 成都知道创宇信息技术有限公司 Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping
CN108959551B (en) * 2018-06-29 2021-07-13 北京百度网讯科技有限公司 Neighbor semantic mining method and device, storage medium and terminal equipment
CN109241317B (en) * 2018-09-13 2022-01-11 北京工商大学 Pedestrian Hash retrieval method based on measurement loss in deep learning network
CN109615006B (en) * 2018-12-10 2021-08-17 北京市商汤科技开发有限公司 Character recognition method and device, electronic equipment and storage medium
CN110119784B (en) * 2019-05-16 2020-08-04 重庆天蓬网络有限公司 Order recommendation method and device
CN111581332A (en) * 2020-04-29 2020-08-25 山东大学 Similar judicial case matching method and system based on triple deep hash learning
CN111737406B (en) * 2020-07-28 2022-11-29 腾讯科技(深圳)有限公司 Text retrieval method, device and equipment and training method of text retrieval model
CN112364198B (en) * 2020-11-17 2023-06-30 深圳大学 Cross-modal hash retrieval method, terminal equipment and storage medium
CN112488231A (en) * 2020-12-11 2021-03-12 北京工业大学 Cosine measurement supervision deep hash algorithm with balanced similarity
CN113935329B (en) * 2021-10-13 2022-12-13 昆明理工大学 Asymmetric text matching method based on adaptive feature recognition and denoising
CN115495546B (en) * 2022-11-21 2023-04-07 中国科学技术大学 Similar text retrieval method, system, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874434B2 (en) * 2010-06-02 2014-10-28 Nec Laboratories America, Inc. Method and apparatus for full natural language parsing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Convolutional Neural Network for Modelling Sentences;Nal Kalchbrenner et al.;《Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics》;20140630;第655-665页 *
Learning Semantic Representations Using Convolutional Neural Networks for Web Search;Yelong Shen et al.;《Proceedings of the 23rd International Conference on World Wide Web》;20140407;第373-374页 *
Relation Classification via Convolutional Deep Neural Network;Daojian Zeng et al.;《Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics》;20140823;第1-10页 *
Neural-network-based text clustering to facilitate ontology construction; Fu Yuan et al.; Computer Development & Applications; 31 May 2006; Vol. 19, No. 5; pp. 13-15 *

Also Published As

Publication number Publication date
CN104657350A (en) 2015-05-27

Similar Documents

Publication Publication Date Title
CN104657350B (en) Short text hash learning method fusing latent semantic features
CN107679234B (en) Customer service information providing method, customer service information providing device, electronic equipment and storage medium
CN106980683B (en) Blog text abstract generating method based on deep learning
Ristoski et al. Rdf2vec: Rdf graph embeddings for data mining
Sun et al. Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features
Pachter et al. Tropical geometry of statistical models
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
KR101717230B1 (en) Document summarization method using recursive autoencoder based sentence vector modeling and document summarization system
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113312480B (en) Scientific and technological thesis level multi-label classification method and device based on graph volume network
CN111177383A (en) Text entity relation automatic classification method fusing text syntactic structure and semantic information
KR20220076419A (en) Method for utilizing deep learning based semantic role analysis
Sarkhel et al. Improving information extraction from visually rich documents using visual span representations
Mankolli et al. Machine learning and natural language processing: Review of models and optimization problems
CN111680264A (en) Multi-document reading understanding method
Wang et al. Classification with unstructured predictors and an application to sentiment analysis
Kumar et al. An abstractive text summarization technique using transformer model with self-attention mechanism
Azzam et al. A question routing technique using deep neural network for communities of question answering
Lezama-Sánchez et al. An approach based on semantic relationship embeddings for text classification
CN108932222A (en) A kind of method and device obtaining the word degree of correlation
Lin et al. Chinese story generation of sentence format control based on multi-channel word embedding and novel data format
Garg et al. Personalization of news for a logistics organisation by finding relevancy using NLP
Ramasubramanian et al. ES2Vec: Earth science metadata keyword assignment using domain-specific word embeddings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant