CN104657350B - Short text hash learning method fusing latent semantic features - Google Patents

Short text hash learning method fusing latent semantic features

Info

Publication number
CN104657350B
CN104657350B (application CN201510096518.4A)
Authority
CN
China
Prior art keywords
text
feature
hash
vector
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510096518.4A
Other languages
Chinese (zh)
Other versions
CN104657350A (en)
Inventor
徐博
许家铭
郝红卫
田冠华
王方圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201510096518.4A priority Critical patent/CN104657350B/en
Publication of CN104657350A publication Critical patent/CN104657350A/en
Application granted granted Critical
Publication of CN104657350B publication Critical patent/CN104657350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a short text hash learning method fusing latent semantic features, comprising: reducing the dimensionality of a training text and binarizing it through a hash loss function to generate low-dimensional binary codes; obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors by table lookup according to the word features and position features; coupling the word vectors and position vectors through a convolutional neural network model to obtain the latent semantic features of the training text; training the convolutional neural network model on the low-dimensional binary codes to obtain an updated model; encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text; and matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text. The present invention can obtain texts that are semantically similar to a query text.

Description

Short text hash learning method fusing latent semantic features
Technical field
The present invention relates to the field of information retrieval, and more particularly to a short text hash learning method that fuses latent semantic features.
Background technology
Hash learning methods are widely used in approximate nearest neighbour search, a technique applied in information retrieval, duplicate content detection, tag prediction and recommender systems. At present, hash learning methods map the explicit semantic features of text into a low-dimensional binary space, which does not preserve semantic similarity well. For example, given the two texts "President write his first computer program" and "Obama kick off hour of code", such hash learning methods cannot semantically associate the explicit features "President" with "Obama" or "program" with "code". To capture the semantic associations between explicit features in text, latent semantic models have been used to build text similarity; however, these methods are still trained on bag-of-words representations, ignore the hyponymy and word-order relations within the text, and therefore still fail to preserve semantic similarity well.
Summary of the invention
The present invention provides a short text hash learning method fusing latent semantic features, so as to obtain texts that are semantically similar to a query text.
According to an aspect of the present invention, there is provided a short text hash learning method fusing latent semantic features, the method comprising: reducing the dimensionality of a training text and binarizing it through a hash loss function to generate low-dimensional binary codes; obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors from the word features and position features, respectively, by table lookup; coupling the word vectors and position vectors through a convolutional neural network model to obtain the latent semantic features of the training text; training the convolutional neural network model on the low-dimensional binary codes to obtain an updated model; encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text; and matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text.
The short text hash learning method fusing latent semantic features provided by the present invention generates low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, encodes the training text with the updated convolutional neural network model to generate semantic hash codes, maps the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and matches the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
Brief description of the drawings
Fig. 1 is a flowchart of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention;
Fig. 2 is a framework diagram of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention;
Fig. 3 is a retrieval performance diagram provided by an embodiment of the present invention;
Fig. 4 is a retrieval performance diagram provided by another embodiment of the present invention.
Detailed description of the embodiments
The general concept of the present invention is to generate low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, to encode the training text with the updated convolutional neural network model to generate semantic hash codes, to map the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and to match the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
The short text hash learning method fusing latent semantic features provided by embodiments of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention.
Referring to Fig. 1, in step S101, the training text is reduced in dimensionality and binarized through a hash loss function to generate low-dimensional binary codes.
According to an exemplary embodiment of the present invention, generating the low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through the hash loss function includes:
In step S1011, a similarity matrix is constructed from the training text.
In step S1012, Laplacian eigenvectors are obtained from the similarity matrix.
In step S1013, a mean vector is obtained from the Laplacian eigenvectors.
In step S1014, the Laplacian eigenvectors are binarized using the mean vector, thereby generating the low-dimensional binary codes.
According to an exemplary embodiment of the present invention, constructing the similarity matrix from the training text includes:
calculating the similarity matrix according to formula (1):
S_ij = c_ij · exp(−‖x_i − x_j‖² / (2σ²))  if x_i ∈ NN_k(x_j) or x_j ∈ NN_k(x_i), and S_ij = 0 otherwise   (1)
where S_ij is the similarity matrix, NN_k(x) is the set of k nearest neighbours of the training text x, c_ij is a confidence coefficient, and σ is a tuning parameter.
Here, a training text is denoted by x and the similarity matrix by S_ij. Candidate similarity measures include the cosine of the included angle, Euclidean distance, the Gaussian kernel and the linear kernel; by way of non-limiting example, the local similarity matrix is constructed with a Gaussian kernel.
In THC-I of the present invention, c_ij is always 1, whereas the THC-II and THC-IV models adjust this coefficient using label information. When two samples x_i and x_j share any label (T_ij = 1), c_ij is set to a higher value a; conversely, when the two samples x_i and x_j share no label (T_ij = 0), c_ij is set to a lower value b, as shown in formula (2):
c_ij = a if T_ij = 1, and c_ij = b if T_ij = 0   (2)
where the parameters a and b satisfy 1 ≥ a ≥ b > 0. For a specific data set, the higher the confidence of the labels, the larger the gap between a and b should be set. In the embodiment of the present invention, a = 1 and b = 0.1.
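As a concrete illustration, the following Python sketch builds the local similarity matrix of formulas (1)-(2) under the assumptions stated above (a Gaussian kernel over the k nearest neighbours, and confidence coefficients a/b determined by shared labels); the function and parameter names are illustrative and not part of the patent.

import numpy as np

def build_similarity(X, labels=None, k=10, sigma=1.0, a=1.0, b=0.1):
    """X: (n, d) feature matrix; labels: (n, c) binary tag matrix or None."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # squared Euclidean distances
    knn = np.argsort(dist2, axis=1)[:, 1:k + 1]                  # k nearest neighbours, excluding the sample itself
    S = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            s = np.exp(-dist2[i, j] / (2.0 * sigma ** 2))        # Gaussian-kernel similarity, formula (1)
            if labels is not None:                               # THC-II / THC-IV: label-based confidence, formula (2)
                c = a if (labels[i] * labels[j]).any() else b    # shared label -> a, otherwise -> b
            else:                                                # THC-I: c_ij is always 1
                c = 1.0
            S[i, j] = S[j, i] = c * s                            # keep the matrix symmetric
    return S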
In step S1012, the Laplacian eigenvectors are obtained from the similarity matrix.
To obtain the low-dimensional binary codes Y^(0) of the pre-training text set, the optimization objective is designed as shown in formula (3):
min_Y Σ_{i,j} S_ij · ‖y_i − y_j‖²_F
s.t. Y ∈ {−1, 1}^{n×r}, Yᵀ1 = 0, YᵀY = I   (3)
where S_ij is the local similarity matrix constructed by formula (1), y_i is the low-dimensional binary code of text x_i, and ‖·‖_F is the F-norm. By relaxing the discretization constraint Y ∈ {−1, 1}^{n×r} on the binary codes, the optimal low-dimensional real-valued vectors can be obtained by solving the Laplacian eigenmap problem, which is not detailed here.
In step S1013, the mean vector is obtained from the Laplacian eigenvectors; it serves as the threshold for the binarization in step S1014.
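The following Python sketch illustrates steps S1012-S1014 under the standard Laplacian eigenmap relaxation of formula (3); thresholding each eigenvector at the mean vector is taken from the description above (a median threshold is a common alternative for balanced bits), and the function name is illustrative.

import numpy as np

def pretrain_binary_codes(S, r=64):
    """S: (n, n) similarity matrix; returns the (n, r) pre-training codes Y^(0) in {-1, 1}."""
    D = np.diag(S.sum(axis=1))
    L = D - S                               # graph Laplacian of the local similarity graph
    _, vecs = np.linalg.eigh(L)             # eigenvectors, eigenvalues in ascending order
    U = vecs[:, 1:r + 1]                    # keep r eigenvectors, dropping the trivial constant one
    thresh = U.mean(axis=0)                 # the mean vector used as the binarization threshold
    return np.where(U > thresh, 1, -1)      # binarize the Laplacian eigenvectors into low-dimensional codes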
In step S102, word features and position features are obtained from the training text, and the corresponding word vectors and position vectors are obtained from the word features and position features, respectively, by table lookup.
Here, the word vectors are obtained by looking up the distributed vector representations of the word features in a lookup table; the word-vector table is also updated as a parameter of the model.
Similarly, the position vectors are obtained by looking up the distributed vector representations of the position features in a lookup table. The position-vector representations are all randomly initialized and updated as parameters during model training.
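A minimal Python sketch of this lookup step, assuming word and position indices have already been extracted from the text; the table sizes follow the experimental settings reported later (d_w = 50, d_p = 8), and all names are illustrative.

import numpy as np

vocab_size, max_len, d_w, d_p = 26265, 38, 50, 8
E_W = np.random.randn(vocab_size, d_w) * 0.01   # word-vector lookup table (GloVe-initialized in the experiments)
E_P = np.random.randn(max_len, d_p) * 0.01      # position-vector lookup table (randomly initialized)

def lookup(word_ids, pos_ids):
    """Return the matrixized representation of one text: one (d_w + d_p)-dimensional row per word."""
    return np.hstack([E_W[word_ids], E_P[pos_ids]])

x_f = lookup(np.array([5, 17, 42]), np.array([0, 1, 2]))   # a three-word text -> shape (3, 58)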
In step S103, the word vectors and position vectors are coupled through the convolutional neural network model to obtain the latent semantic features of the training text.
According to an exemplary embodiment of the present invention, coupling the word vectors and position vectors through the convolutional neural network model to obtain the latent semantic features of the training text includes:
In step S1031, one-dimensional convolutions are applied to the word vectors and position vectors to obtain feature matrices.
In step S1032, the feature matrices are turned into one-dimensional feature vectors by a collapsing operation.
In step S1033, the maximum neural units are selected from the one-dimensional feature vectors.
In step S1034, the maximum neural units are passed through a tangent activation function to obtain the latent semantic features of the training text.
In step S1031, one-dimensional convolutions are applied to the word vectors and position vectors to obtain the feature matrices.
The word vectors and position vectors are first concatenated, so that each word in the training text is represented by a (d_w + d_p)-dimensional vector, and each text is represented by the matrixized feature representation X_F.
In this example, a one-dimensional convolution is applied to the matrixized feature representation of the text, where w is the window size of the convolution kernel and n₁ is the number of convolution kernels. For convenience of presentation, a diagonal-matrix form is introduced, and the j-th convolution kernel is expressed in formula (4).
Applying the j-th one-dimensional convolution kernel to X_F, the matrixized feature representation of the text given in formula (6), yields the post-convolution feature matrix C_j.
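A minimal Python sketch of the per-kernel one-dimensional convolution, assuming each embedding dimension of X_F is convolved independently along the word axis with a wide convolution (which matches the t × (d_w + d_p) shape of C_j described below); the kernel shape and function name are illustrative.

import numpy as np

def conv1d_per_dim(x_f, kernel):
    """x_f: (len, d) text matrix; kernel: (w, d); returns the (t, d) feature matrix C_j with t = len + w - 1."""
    length, d = x_f.shape
    w = kernel.shape[0]
    c = np.zeros((length + w - 1, d))
    for q in range(d):                                               # one independent 1-D convolution per dimension
        c[:, q] = np.convolve(x_f[:, q], kernel[:, q], mode="full")  # wide ("full") convolution along the words
    return c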
According to an exemplary embodiment of the present invention, obtaining the one-dimensional feature vector from the feature matrix through the collapsing operation includes:
calculating the one-dimensional feature vector according to formula (7):
Ĉ^(0)_{j,p} = Σ_{q=1}^{d_w+d_p} C_{j,p,q}   (7)
where Ĉ^(0)_j is the one-dimensional feature vector and (d_w + d_p) is the dimension summed over.
Here, given the feature matrix C_j produced by the j-th convolution operation, the collapsing operation directly compresses the t × (d_w + d_p)-dimensional feature matrix into a t × 1-dimensional vector. Through the collapsing layer, the feature matrix C_j is thus compressed into the one-dimensional vector Ĉ^(0)_j.
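In Python, the collapsing operation of formula (7) is a single sum over the dimension axis; this tiny sketch assumes C_j is a NumPy array.

import numpy as np

def collapse(c_j):
    """c_j: (t, d_w + d_p) feature matrix of one kernel; returns the (t,) vector C-hat^(0)_j."""
    return c_j.sum(axis=1)   # formula (7): sum the t x (d_w + d_p) matrix over its dimension axis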
According to an exemplary embodiment of the present invention, obtaining the latent semantic features of the training text from the maximum neural units through the tangent activation function includes:
calculating the latent semantic feature of the training text according to the following formula:
m = tanh(Ĉ^(0))   (8)
where Ĉ^(0) denotes the vector of maximum neural units obtained by k-max sampling over the n₁ convolution kernel channels, and m is the latent semantic feature of the training text.
Here, k-max sampling is performed on the feature vector obtained after the j-th collapsing layer, keeping the k largest neural units. The latent semantic feature is then obtained through the tangent activation function.
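A minimal Python sketch of the k-max sampling and tangent activation, assuming the k largest values of each collapsed vector are kept in their original order and the outputs of all n₁ kernels are concatenated into m; the function names are illustrative.

import numpy as np

def k_max_tanh(c_hat, k=2):
    """c_hat: (t,) collapsed vector of one kernel; returns its k activated maximum neural units."""
    idx = np.sort(np.argsort(c_hat)[-k:])    # positions of the k largest values, original order preserved
    return np.tanh(c_hat[idx])               # tangent activation

def latent_feature(collapsed_vectors, k=2):
    """Concatenate the k-max units of all n1 kernels into the latent semantic feature m."""
    return np.concatenate([k_max_tanh(c, k) for c in collapsed_vectors])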
In step S104, the convolutional neural network model is trained on the low-dimensional binary codes to obtain the updated convolutional neural network model.
According to an exemplary embodiment of the present invention, training the convolutional neural network model on the low-dimensional binary codes to obtain the updated model includes:
In step S1041, the latent semantic features and the explicit semantic features of the training text are input to the output layer of the convolutional neural network model;
In step S1042, the error with respect to the low-dimensional binary codes is back-propagated to update the parameters of the convolutional neural network model.
Here, in step S1041, the latent semantic feature of the training text is denoted by m and the explicit semantic feature of the training text is represented by its TF-IDF vector. The latent semantic feature m and the explicit TF-IDF feature of the training text are combined by a linear transformation, as given in formula (9):
O^(H) = W_Z·m + α·W_O·x   (9)
where O^(H) is the output vector, W_Z and W_O are linear transformation matrices, x is the TF-IDF feature vector, and α is the feature fusion coefficient.
To carry out binarization and obtain the hash codes, r logistic regressions are applied to the output-layer feature O^(H), as given in formula (10).
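The following Python sketch shows formula (9) together with one plausible reading of formula (10), assuming each of the r output units is a logistic (sigmoid) unit whose activation is thresholded at 0.5 to give one hash bit; the thresholding detail is an assumption, and the names are illustrative.

import numpy as np

def semantic_hash_code(m, x, W_Z, W_O, alpha=16.0):
    o_h = W_Z @ m + alpha * (W_O @ x)      # formula (9): fuse latent feature m and explicit TF-IDF feature x
    p = 1.0 / (1.0 + np.exp(-o_h))         # formula (10): r logistic units on the output layer
    return (p > 0.5).astype(int)           # threshold each unit to obtain the r-bit semantic hash code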
If the training text carries label information t ∈ {0,1}^{c×n}, then in the extended models THC-III and THC-IV, c extra output units are added to fit the label information, as given in formula (11):
O^(C) = W′_Z·m + α·W′_O·x   (11)
Here, W′_Z and W′_O are linear transformation matrices, and c logistic regressions are used on the extra output units, as given in formula (12).
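A minimal Python sketch of the extra label outputs of formulas (11)-(12) for THC-III / THC-IV, assuming c additional logistic units fit the multi-label vector; the names are illustrative.

import numpy as np

def label_output(m, x, W_Zc, W_Oc, alpha=16.0):
    o_c = W_Zc @ m + alpha * (W_Oc @ x)    # formula (11): c extra output units on latent and explicit features
    return 1.0 / (1.0 + np.exp(-o_c))      # formula (12): per-label logistic probability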
Here, in step S1042, the parameters to be updated in the model are collectively defined as θ, as given in formula (13):
θ = {E^(W), E^(P), W, W_Z, W_O, W′_Z, W′_O}   (13)
Given the training set, the pre-trained low-dimensional binary codes Y^(0) and the label information t = {t₁, t₂, ..., t_n} ∈ {0,1}^{c×n}, the objective function based on cross entropy is given in formula (14).
The parameters θ are updated with the stochastic gradient descent method to optimize the objective function, as given in formula (15).
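As an illustration of the training step behind formulas (14)-(15), the Python sketch below uses a generic cross entropy between the logistic outputs and the pre-trained codes (mapped to {0,1}) and a plain stochastic gradient descent update; the exact form of formula (14) and the gradient computation are not reproduced in this text, so this is a sketch under those assumptions.

import numpy as np

def cross_entropy(p, y01, eps=1e-9):
    """Cross entropy between predicted probabilities p and 0/1 targets y01 (in the spirit of formula (14))."""
    return -(y01 * np.log(p + eps) + (1 - y01) * np.log(1 - p + eps)).sum()

def sgd_step(theta, grad, lr=0.01):
    """Formula (15): theta <- theta - lambda * dJ/dtheta, applied to every parameter listed in formula (13)."""
    return {name: value - lr * grad[name] for name, value in theta.items()}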
In step S105, the training text is encoded with the updated convolutional neural network model to generate the semantic hash codes, and the query text is mapped to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text.
In step S106, the hash code of the query text is matched against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text.
According to an exemplary embodiment of the present invention, matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text includes:
matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the matched semantic hash codes;
ranking the matched semantic hash codes by Hamming distance to obtain the texts semantically similar to the query text.
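A minimal Python sketch of this Hamming-space matching, assuming the hash codes are stored as 0/1 integer arrays and texts are ranked by increasing Hamming distance to the query code; the names and the top-N cut-off are illustrative.

import numpy as np

def hamming_search(query_code, corpus_codes, top_n=10):
    """query_code: (r,) array; corpus_codes: (n, r) array; returns indices of the top_n most similar texts."""
    dists = (corpus_codes != query_code).sum(axis=1)   # Hamming distance from the query to every corpus text
    return np.argsort(dists)[:top_n]                   # rank by increasing Hamming distance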
The short text hash learning method fusing latent semantic features provided by the embodiment of the present invention generates low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through a hash loss function, encodes the training text with the updated convolutional neural network model to generate semantic hash codes, maps the query text to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text, and matches the hash code of the query text against the semantic hash codes in the binary Hamming space, thereby obtaining texts semantically similar to the query text.
Fig. 2 is a framework diagram of the short text hash learning method fusing latent semantic features provided by an embodiment of the present invention.
Referring to Fig. 2, the short text hash learning method fusing latent semantic features includes two stages: the first stage is the hash code pre-training stage, and the second stage is the hash function training and prediction stage.
First stage: the training text is reduced in dimensionality and binarized through the hash loss function to generate the low-dimensional binary codes.
Second stage: word features and position features are obtained from the training text, and the corresponding word vectors and position vectors are obtained from the word features and position features, respectively, by table lookup;
one-dimensional convolutions are applied to the word vectors and position vectors to obtain feature matrices;
the feature matrices are turned into one-dimensional feature vectors by the collapsing operation;
the maximum neural units are selected from the one-dimensional feature vectors;
the maximum neural units are passed through the tangent activation function to obtain the latent semantic features of the training text;
the latent semantic features and the explicit semantic features of the training text are input to the output layer of the convolutional neural network model;
the error with respect to the low-dimensional binary codes is back-propagated to update the parameters of the convolutional neural network model;
the training text is encoded with the updated convolutional neural network model to generate the semantic hash codes, and the query text is mapped to the semantic hash codes through the convolutional neural network model to generate the hash code of the query text;
the hash code of the query text is matched against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text.
To assess the retrieval performance of the present invention accurately, the retrieval system evaluates the overall effect of the invention by the average precision of the top N returned results.
Because the raw features of short text data sets do not reflect the semantic similarity between documents well, the experiments of the present invention treat two samples as semantically related texts if and only if they share at least one label; all evaluation metric results in the experiments are averaged over the retrieval results of all test samples against the training samples.
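The Python sketch below makes this evaluation protocol concrete under the assumptions just stated (relevance means sharing at least one label; the score is the precision of the top-N training texts retrieved by Hamming distance, averaged over all test queries); the function name and arguments are illustrative.

import numpy as np

def mean_precision_at_n(test_codes, test_labels, train_codes, train_labels, n=100):
    """Average precision of the top-n retrieved training texts over all test queries."""
    scores = []
    for code, lab in zip(test_codes, test_labels):
        ranked = np.argsort((train_codes != code).sum(axis=1))[:n]   # top-n training texts by Hamming distance
        relevant = [(train_labels[i] * lab).any() for i in ranked]   # relevant if any label is shared
        scores.append(np.mean(relevant))
    return float(np.mean(scores))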
A publicly available short text data set containing 8 text categories is used in the experiments of the present invention. No preprocessing (such as stop-word removal or stemming) is applied to the data set, and the class labels are used as label information. The statistics of the text data set used in the experiments are shown in Table 1:
Table 1
Data set  Number of classes  Training/test size  Length (average/maximum)  Vocabulary size
SearchSnippets 8 10060/2280 17.2/38 26265
In the experiments, the width w of the convolution kernel is fixed at 3, the number of convolution kernels n₁ is 80, k in the k-max sampling layer is 2, the dimension d_w of the word-vector layer is 50, the dimension d_p of the position-vector layer is 8, and the learning rate λ is 0.01. In addition, the feature weight α of the output layer is gradually adjusted from 0.001 to 1024 in the experiments, and the finally chosen optimal value of α is 16.
By default, the experiments use the publicly released 50-dimensional word vectors trained with the GloVe tool and compare them with other word vectors, such as the Senna word vectors and random initialization. The coverage statistics of the GloVe and Senna word vectors are shown in Table 2:
Table 2
Four variant models are proposed in the present invention:
THC-I: the basic model of the invention, which trains the whole hash function without any labels;
THC-II: label information is incorporated in stage 1, the hash code pre-training stage;
THC-III: label information is incorporated in stage 2, the hash function training and prediction stage;
THC-IV: label information is incorporated in both stage 1 and stage 2.
The following hash methods are used for comparison in the experiments of the present invention:
Comparison method 1: self-taught hashing, a typical two-step hash method. In the first step, Laplacian feature dimensionality reduction is performed on the raw text features, and the hash codes are obtained after binarization by the mean vector. In the second step, using the raw features of the training samples and the corresponding r-dimensional hash codes generated in the previous stage, r binary support vector machine classifiers are trained as the hash functions.
Comparison method 2: self-taught hashing based on a Gaussian kernel, a modified version of comparison method 1 in which r binary support vector machines with Gaussian kernels are used in the hash function training stage.
Comparison method 3: supervised self-taught hashing, a further modified version of comparison method 1 that adds full supervision as a constraint: when building the local similarity matrix S, only similarities between samples with the same class label are considered.
Comparison method 4: fast hashing (FastHash), a two-step hash method that uses decision tree models as hash functions, can handle the hash mapping of high-dimensional data, and takes full supervision into account.
Table 3 reports the mean retrieval precision, with 64-bit hash codes, of the method of the present invention, self-taught hashing, self-taught hashing based on a Gaussian kernel, supervised self-taught hashing and fast hashing. THC-I, self-taught hashing and self-taught hashing based on a Gaussian kernel use no labelled data during training, while THC-II, THC-III, THC-IV, fast hashing and supervised self-taught hashing use label information during hash learning.
Table 3
It can be seen that the basic method THC-I of the present invention clearly outperforms the other hash methods that use no labels (self-taught hashing and self-taught hashing based on a Gaussian kernel). Compared with the supervised hash learning methods, THC-II, THC-III and THC-IV likewise deliver better retrieval performance, and THC-IV is the best variant model. We find that, owing to the introduction of word vectors and the convolutional neural network structure, the basic method THC-I, which uses no labelled data, even exceeds the supervised hash methods among the comparison methods. Table 4 compares how the word vectors, position vectors, TF-IDF and other word-vector features (for example, Senna word vectors and random initialization) in the features learned by the method of the present invention affect retrieval performance.
Table 4
It can be seen that after position features are added on top of the word features, the mean retrieval precision improves by 1%-2%, and is about 8% better than the retrieval performance based on the explicit TF-IDF features alone. When the explicit features are further fused with the implicit features in the method of the present invention, the mean retrieval precision of the system improves by roughly another 1%. The experiments also compare the influence of other word-vector features on retrieval performance. The retrieval results based on the Senna word vectors are only about 2% lower than those based on the GloVe word vectors used by default in the present invention. However, even though the word-vector features are updated as parameters in the method of the present invention, the retrieval performance based on randomly initialized word vectors drops by about 10%. The experiments show the necessity of initializing the model parameters with word vectors learned without supervision from a large corpus.
Next, we study the influence of the feature fusion parameter α on retrieval performance in the method of the present invention. We adjust α gradually from 0.001 to 1024; the corresponding retrieval results are shown in the retrieval performance diagram of Fig. 3 provided by an embodiment of the present invention. As the parameter α becomes larger, the retrieval performance tends towards the results based on the explicit features alone; likewise, as α becomes smaller, the retrieval performance tends towards the results based on the implicit features alone. Only when α is tuned to an optimal value does the system reach its best retrieval performance.
For the convolutional neural network structure, the word-vector dimension, the learning rate and the window size of the convolution kernel are fixed in the experiments of the present invention. We heuristically limit the number of implicit-feature neurons to 160; the different convolutional neural network structures are listed in Table 5.
Table 5
                               Framework 1  Framework 2  Framework 3  Framework 4  Framework 5
Number of convolution kernels      160          80           40           20           10
k in k-max sampling                  1           2            4            8           16
The corresponding retrieval results are shown in the retrieval performance diagram of Fig. 4 provided by another embodiment of the present invention. We can see that when the number of k-max samples is less than 4, the retrieval performance declines only slowly. However, because increasing the number of convolution kernels increases the number of output neurons in the convolutional and collapsing layers and therefore the computational complexity, the experiments of the present invention uniformly adopt framework 2 as a compromise, i.e. 80 convolution kernels with 2-max sampling.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that would readily occur to those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the appended claims.

Claims (6)

1. A short text hash learning method fusing latent semantic features, characterised in that the method comprises:
generating low-dimensional binary codes by reducing the dimensionality of a training text and binarizing it through a hash loss function;
obtaining word features and position features from the training text, and obtaining the corresponding word vectors and position vectors from the word features and position features, respectively, by table lookup;
coupling the word vectors and position vectors through a convolutional neural network model to obtain latent semantic features of the training text;
training the convolutional neural network model on the low-dimensional binary codes to obtain an updated convolutional neural network model;
encoding the training text with the updated convolutional neural network model to generate semantic hash codes, and mapping a query text to the semantic hash codes through the convolutional neural network model to generate a hash code of the query text;
matching the hash code of the query text against the semantic hash codes in a binary Hamming space to obtain texts semantically similar to the query text;
wherein generating the low-dimensional binary codes by reducing the dimensionality of the training text and binarizing it through the hash loss function comprises:
constructing a similarity matrix from the training text;
obtaining Laplacian eigenvectors from the similarity matrix;
obtaining a mean vector from the Laplacian eigenvectors;
binarizing the Laplacian eigenvectors using the mean vector, thereby generating the low-dimensional binary codes;
wherein coupling the word vectors and position vectors through the convolutional neural network model to obtain the latent semantic features of the training text comprises:
applying one-dimensional convolutions to the word vectors and position vectors, respectively, to obtain feature matrices;
obtaining one-dimensional feature vectors from the feature matrices through a collapsing operation;
selecting maximum neural units from the one-dimensional feature vectors;
obtaining the latent semantic features of the training text from the maximum neural units through a tangent activation function.
2. The method according to claim 1, characterised in that constructing the similarity matrix from the training text comprises:
calculating the similarity matrix according to the following formula:
S_ij = c_ij · exp(−‖x_i − x_j‖² / (2σ²))  if x_i ∈ NN_k(x_j) or x_j ∈ NN_k(x_i), and S_ij = 0 otherwise
wherein S_ij is the similarity matrix, NN_k(x) is the set of k nearest neighbours of the training text x, c_ij is a confidence coefficient, and σ denotes a tuning parameter.
3. The method according to claim 1, characterised in that obtaining the one-dimensional feature vectors from the feature matrices through the collapsing operation comprises:
calculating the one-dimensional feature vector according to the following formula:
Ĉ^(0)_{j,p} = Σ_{q=1}^{d_w+d_p} C_{j,p,q}
wherein (d_w + d_p) is the dimension; Ĉ^(0)_j denotes the one-dimensional feature vector; Ĉ^(0)_{j,p} denotes the p-th value of Ĉ^(0)_j; d_w denotes the word-vector dimension; d_p denotes the position-vector dimension; the superscript (0) identifies a feature after the collapsing operation; C_j denotes the feature matrix before the collapsing operation; p indexes the elements of the one-dimensional feature vector; C_{j,p} denotes the p-th row feature vector of C_j; and q indexes the elements of the p-th row feature vector of the feature matrix before the collapsing operation.
4. The method according to claim 3, characterised in that obtaining the latent semantic features of the training text from the maximum neural units through the tangent activation function comprises:
calculating the latent semantic feature of the training text according to the following formula:
m = tanh(Ĉ^(0))
wherein m is the latent semantic feature of the training text; Ĉ^(0) denotes the vector of maximum neural units obtained by k-max sampling over the n₁ convolution kernel channels; and the superscript (0) identifies a feature after the collapsing operation.
5. The method according to claim 4, characterised in that training the convolutional neural network model on the low-dimensional binary codes to obtain the updated convolutional neural network model comprises:
inputting the latent semantic features and the explicit semantic features of the training text to an output layer of the convolutional neural network model;
back-propagating the error with respect to the low-dimensional binary codes to update the parameters of the convolutional neural network model.
6. The method according to claim 1 or 5, characterised in that matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain the texts semantically similar to the query text comprises:
matching the hash code of the query text against the semantic hash codes in the binary Hamming space to obtain matched semantic hash codes;
ranking the matched semantic hash codes by Hamming distance to obtain the texts semantically similar to the query text.
CN201510096518.4A 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features Active CN104657350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510096518.4A CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510096518.4A CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Publications (2)

Publication Number Publication Date
CN104657350A CN104657350A (en) 2015-05-27
CN104657350B true CN104657350B (en) 2017-06-09

Family

ID=53248499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510096518.4A Active CN104657350B (en) 2015-03-04 2015-03-04 Short text hash learning method fusing latent semantic features

Country Status (1)

Country Link
CN (1) CN104657350B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220220A (en) * 2016-03-22 2017-09-29 索尼公司 Electronic equipment and method for text-processing
CN106095735A (en) * 2016-06-06 2016-11-09 北京中加国道科技有限责任公司 A kind of method plagiarized based on deep neural network detection academic documents
CN106354701B (en) * 2016-08-30 2019-06-21 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN108073576A (en) * 2016-11-09 2018-05-25 上海诺悦智能科技有限公司 Intelligent search method, searcher and search engine system
CN106776545B (en) * 2016-11-29 2019-12-24 西安交通大学 Method for calculating similarity between short texts through deep convolutional neural network
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
CN107016708B (en) * 2017-03-24 2020-06-05 杭州电子科技大学 Image hash coding method based on deep learning
CN107092918B (en) * 2017-03-29 2020-10-30 太原理工大学 Image retrieval method based on semantic features and supervised hashing
CN107145910A (en) * 2017-05-08 2017-09-08 京东方科技集团股份有限公司 Performance generation system, its training method and the performance generation method of medical image
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107391575B (en) * 2017-06-20 2020-08-04 浙江理工大学 Implicit feature recognition method based on word vector model
CN107563408A (en) * 2017-08-01 2018-01-09 天津大学 Cell sorting method based on Laplce's figure relation and various visual angles Fusion Features
CN107967253A (en) * 2017-10-27 2018-04-27 北京大学 A kind of low-resource field segmenter training method and segmenting method based on transfer learning
CN107894979B (en) * 2017-11-21 2021-09-17 北京百度网讯科技有限公司 Compound word processing method, device and equipment for semantic mining
CN109840321B (en) * 2017-11-29 2022-02-01 腾讯科技(深圳)有限公司 Text recommendation method and device and electronic equipment
CN108536669B (en) * 2018-02-27 2019-10-22 北京达佳互联信息技术有限公司 Literal information processing method, device and terminal
CN108874941B (en) * 2018-06-04 2021-09-21 成都知道创宇信息技术有限公司 Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping
CN108959551B (en) * 2018-06-29 2021-07-13 北京百度网讯科技有限公司 Neighbor semantic mining method and device, storage medium and terminal equipment
CN109241317B (en) * 2018-09-13 2022-01-11 北京工商大学 Pedestrian Hash retrieval method based on measurement loss in deep learning network
CN109615006B (en) * 2018-12-10 2021-08-17 北京市商汤科技开发有限公司 Character recognition method and device, electronic equipment and storage medium
CN110119784B (en) * 2019-05-16 2020-08-04 重庆天蓬网络有限公司 Order recommendation method and device
CN111581332A (en) * 2020-04-29 2020-08-25 山东大学 Similar judicial case matching method and system based on triple deep hash learning
CN111737406B (en) * 2020-07-28 2022-11-29 腾讯科技(深圳)有限公司 Text retrieval method, device and equipment and training method of text retrieval model
CN112364198B (en) * 2020-11-17 2023-06-30 深圳大学 Cross-modal hash retrieval method, terminal equipment and storage medium
CN112488231A (en) * 2020-12-11 2021-03-12 北京工业大学 Cosine measurement supervision deep hash algorithm with balanced similarity
CN113935329B (en) * 2021-10-13 2022-12-13 昆明理工大学 Asymmetric text matching method based on adaptive feature recognition and denoising
CN115495546B (en) * 2022-11-21 2023-04-07 中国科学技术大学 Similar text retrieval method, system, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874434B2 (en) * 2010-06-02 2014-10-28 Nec Laboratories America, Inc. Method and apparatus for full natural language parsing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Convolutional Neural Network for Modelling Sentences;Nal Kalchbrenner et al.;《Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics》;20140630;第655-665页 *
Learning Semantic Representations Using Convolutional Neural Networks for Web Search;Yelong Shen et al.;《Proceedings of the 23rd International Conference on World Wide Web》;20140407;第373-374页 *
Relation Classification via Convolutional Deep Neural Network;Daojian Zeng et al.;《Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics》;20140823;第1-10页 *
Neural-network-based text clustering to facilitate ontology construction; Fu Yuan et al.; Computer Development & Applications; 31 May 2006; Vol. 19, No. 5; pp. 13-15 *

Also Published As

Publication number Publication date
CN104657350A (en) 2015-05-27

Similar Documents

Publication Publication Date Title
CN104657350B (en) Short text hash learning method fusing latent semantic features
CN107679234B (en) Customer service information providing method, customer service information providing device, electronic equipment and storage medium
CN106980683B (en) Blog text abstract generating method based on deep learning
Ristoski et al. Rdf2vec: Rdf graph embeddings for data mining
Sun et al. Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features
Pachter et al. Tropical geometry of statistical models
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
KR101717230B1 (en) Document summarization method using recursive autoencoder based sentence vector modeling and document summarization system
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113312480B (en) Scientific and technological thesis level multi-label classification method and device based on graph volume network
CN111177383A (en) Text entity relation automatic classification method fusing text syntactic structure and semantic information
KR20220076419A (en) Method for utilizing deep learning based semantic role analysis
Sarkhel et al. Improving information extraction from visually rich documents using visual span representations
Mankolli et al. Machine learning and natural language processing: Review of models and optimization problems
CN111680264A (en) Multi-document reading understanding method
Wang et al. Classification with unstructured predictors and an application to sentiment analysis
Kumar et al. An abstractive text summarization technique using transformer model with self-attention mechanism
Azzam et al. A question routing technique using deep neural network for communities of question answering
Lezama-Sánchez et al. An approach based on semantic relationship embeddings for text classification
CN108932222A (en) A kind of method and device obtaining the word degree of correlation
Lin et al. Chinese story generation of sentence format control based on multi-channel word embedding and novel data format
Garg et al. Personalization of news for a logistics organisation by finding relevancy using NLP
Ramasubramanian et al. ES2Vec: Earth science metadata keyword assignment using domain-specific word embeddings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant