CN115495546A

CN115495546A - Similar text retrieval method, system, device and storage medium

Info

Publication number: CN115495546A
Application number: CN202211452104.7A
Authority: CN
Inventors: 陈恩红; 何理扬; 黄振亚; 刘淇; 童世炜
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2022-12-20
Anticipated expiration: 2042-11-21
Also published as: CN115495546B

Abstract

The invention discloses a similar text retrieval method, a system, equipment and a storage medium, wherein a self-encoder (encoder-decoder) framework is used for constructing a Hash mapping model, the problem of information loss under the condition of low-dimensional Hash codes is solved through an encoder module of relationship propagation, global equilibrium optimization is carried out through a global equilibrium optimization module, the retrieval efficiency is effectively enhanced, the robustness of the Hash codes is enhanced through a decoder module of noise perception, and therefore the problem of noise in texts is solved.

Description

Similar text retrieval method, system, device and storage medium

Technical Field

The present invention relates to the field of similar text retrieval technologies, and in particular, to a similar text retrieval method, system, device, and storage medium.

Background

In recent years, with the explosion of online technologies, text data has multiplied, and users can come into contact with a large amount of text data. The similar text retrieval system can help a user to find text information related to the query text in a large number of text resources, greatly relieves the degree of information overload, and is widely applied to the life of people. With the rapid development of the technology, the similar text retrieval model is improved from a machine learning method to a deep learning method, so that the retrieval precision is continuously improved. However, the efficiency problem is still an inevitable key problem in the similar text retrieval system, including retrieval efficiency and storage efficiency, which is related to the user experience and the burden of the system. Therefore, how to improve efficiency without losing too much accuracy is an urgent research problem to be solved in similar text retrieval systems.

Around this research problem, researchers have proposed a variety of approaches, with deep semantic hashing being one solution that has received considerable attention in recent years. Its main idea is to use a deep semantic model to map the text into a binary representation, also called hash code, which the text will be stored in a hash bucket corresponding to the hash code. In one aspect, a hash table may be utilized to quickly return relevant text from a set of candidate text during retrieval. On the other hand, the hash code requires very little storage overhead, and thus saves a lot of storage space.

However, in practical applications, the current deep semantic hash scheme still has some technical problems to be solved: 1) A supervised training scheme is used, so that a large amount of time is consumed for labeling massive texts; 2) The lower the dimension of the hash code is, the faster the retrieval speed is, but more information is lost by the hash code compared with the original representation of the text, and how to ensure the precision of the low-dimensional hash code is a very challenging problem; 3) When the hash code is unevenly distributed in the space, some extreme situations easily occur, and the retrieval efficiency is affected. 4) In practical applications, the input behavior of the user cannot be controlled, so that the problem of text noise introduced due to misspelling of the user may be faced, and the accuracy of the retrieval result is further affected.

Disclosure of Invention

The invention aims to provide a method, a system, equipment and a storage medium for searching similar texts, which can improve the searching efficiency and the accuracy of the searching result.

The purpose of the invention is realized by the following technical scheme:

a method of similar text retrieval, comprising:

constructing a Hash mapping model, and training in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relational propagation inputs training texts and noise texts corresponding to the training texts, sequentially generates a first type of representation and a second type of representation for the training texts, generates corresponding hash codes by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimensionality of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;

respectively generating a hash code of each candidate text by using an encoder module of relation propagation in the trained hash mapping model, and constructing a hash table;

and for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.

A similar text retrieval system comprising:

the model building and training unit is used for building a Hash mapping model and training the Hash mapping model in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a noise text which comprises a training text and is generated by the training text, sequentially generates a first type of representation and a second type of representation for the training text, generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the generated second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;

the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;

and the retrieval unit is used for generating a hash code for the input query text by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.

A processing device, comprising: one or more processors; a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as previously described.

A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.

According to the technical scheme provided by the invention, the Hash mapping model is constructed by using a self-encoder (encoder-decoder) framework, the problem of information loss under the condition of low-dimensional Hash codes is solved by using an encoder module with relation propagation, the global equilibrium optimization module is used for global equilibrium optimization, the retrieval efficiency is effectively enhanced, the robustness of the Hash codes is enhanced by using a noise-sensing decoder module, and therefore, the problem of noise in texts is solved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

FIG. 1 is a schematic diagram of a similar text retrieval method according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a hash mapping model according to an embodiment of the present invention;

FIG. 3 is a diagram of a similar text retrieval system according to an embodiment of the present invention;

fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The terms that may be used herein are first described as follows:

the term "and/or" means that either or both can be achieved, for example, X and/or Y means that both cases include "X" or "Y" as well as three cases including "X and Y".

The terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.

A method, a system, a device and a storage medium for searching for similar texts according to the present invention are described in detail below. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. The examples of the present invention, in which specific conditions are not specified, were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer.

Example one

The embodiment of the invention provides a similar text retrieval method, which is used for solving the problems that the prior art has poor retrieval effect on low-dimensional hash codes, lacks the capability of resisting noise texts and needs to optimize the retrieval efficiency and accuracy under the condition of no relevance marking. The invention aims to provide an unsupervised method for generating low-dimensional hash codes with robustness and uniform distribution, which is mainly used for generating binary representation (namely the hash codes) with semantic information from text data, and is realized by a constructed hash mapping model, wherein in the hash mapping model: a relation propagation encoder structure is provided aiming at the poor effect of the existing text semantic hash method in the low-dimensional hash code; the global balance optimization module is provided for solving the problem that the efficiency is influenced by extreme distribution conditions easily caused by only locally optimizing the balance of the hash code in the existing scheme; and proposes a noise-aware decoder structure for text noise problems. And after a Hash mapping model is trained in an unsupervised mode, mapping the candidate texts into Hash codes and constructing a Hash table, taking the input query texts as the Hash codes and querying in the Hash table, and combining the query results to generate a final similar text retrieval result. As shown in fig. 1, the main principle of the similar text retrieval method provided by the embodiment of the present invention is shown, which mainly includes:

1. a hash mapping model is constructed and trained using an unsupervised approach.

The hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a training text and a noise text corresponding to each training text, sequentially generates a first type of representation and a second type of representation for the training text, then generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the decoder module for sensing the noise utilizes the training texts and second-class characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts which are respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; and (4) combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function to construct an overall loss function during training.

2. And respectively generating the hash code of each candidate text by using an encoder module of the relation propagation in the trained hash mapping model, and constructing a hash table and an index.

3. And for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.

In each specific application in the field of similar text retrieval, the scheme provided by the invention can not only retain more semantic information in the hash code, but also improve the retrieval efficiency and the anti-noise capability of the hash code, and provide help for implementing efficient and accurate similar text retrieval.

In order to more clearly show the technical solutions and the technical effects provided by the present invention, the following detailed description is provided for the above methods provided by the embodiments of the present invention with specific embodiments.

1. Problem definition and formalization.

In the embodiment of the invention, two problems of definition and formalization are a semantic hash task and a similar text retrieval task respectively.

Defining a set of candidate texts as

Wherein each one of

Representing a candidate text, N representing the number of candidate texts,

the subscript of (2) is the sequence number of the candidate text. The goal of the semantic hash task is to learn a hash function

It maps the original text x to a binary representation

Where the symbol b represents the dimension length of the binary representation, this binary representation s is also referred to as a hash code.

In the similar text retrieval task, the goal is to find a set of texts S (q) similar to a given query text q. Generally speaking, the process of similar text retrieval based on semantic hash has two stages, namely an index construction stage and an online retrieval stage. In the index building stage (offline stage), candidate texts are aggregated using a hash function f (x)

The text of (1) is mapped into a hash code, and is constructed into a hash index by utilizing a hash table. In the online query stage, a hash code of a query text q is obtained through a hash function f (q), and a preliminary relevant text is quickly found through a hash table. Unlike traditional hash table lookup, since semantic hashing maps similar text to a similar hash space. Therefore, the searching process of the invention is as follows: assuming that the symbol r represents the distance between the query text and the candidate text in the hash space, the value of r will be gradually increased from 0 until K related texts greater than or equal to the predefined number are found. And subsequently, sequencing the preliminary related texts by using a word shift distance or other correlation evaluation scheme calculation method to obtain a final similar text retrieval result.

2. And (4) collecting and preprocessing data.

1. And (6) collecting data.

The present patent uses plain text in a broad sense as an input data set. Examples of such data are the public news data set (20 Ngnews) and the data set published by Yahoo | Answer (Yahooanswer). In addition, various text data sets can be collected as input data through network crawling or offline.

2. And (4) preprocessing data.

The collected data is preprocessed to ensure the effect of the model. The embodiment of the invention mainly utilizes the pure text type data to train, and the collected text data used for training may contain messy codes, illegal characters and the like, so that meaningless texts such as the messy codes, the illegal characters and the like are removed through preprocessing. And dividing a plurality of groups of training sets and verification sets according to a cross-validation mode.

3. And (5) constructing and training a Hash mapping model.

As shown in fig. 1, the present invention is a similar text retrieval system based on deep semantic hash, which includes a hash index construction process and an online retrieval process. Both processes map the original text into hash codes through the same hash mapping model. The hash mapping model constructed in the embodiment of the present invention is shown in fig. 2, and is an auto-encoder as a whole, so that training can be performed in an unsupervised manner, and the hash mapping model mainly includes: a relation-propagating encoder module, a global equalization optimization module, and a noise-aware decoder module.

1. An encoder module for relationship propagation.

To obtain the mapping from text to the low-dimensional hash tokens, a relational propagation encoder module is constructed to retain semantic information in the low-dimensional hash code. In the embodiment of the present invention, the objective of the encoder E for relationship propagation to map the text data into a low-dimensional hash code needs to go through the following steps:

1) Assume a set of training texts

，

Wherein

Representing training text

The word in the u-th position,

representing text

I =1,2, \ 8230and Q, Q represents the total number of training texts. Before inputting to the encoder module of the relationship propagation, the original training text is converted into bag-of-word model representation, and the noise text mentioned later also needs to be converted into bag-of-word model representation.

2) Using a multi-layer feed-forward network with a ReLU activation layer and obtaining a deep characterization of the text:

wherein the content of the first and second substances,t ₁ representing intermediate features, bow (x) representing the feature representation of the training text x under the bag-of-words model, reLU representing a modified linear unit, W ₁ And W ₂ Weight parameter representing a multi-layer feedforward network with ReLU activation layer, b ₁ And b ₂ Representing the bias parameters of a multi-layer feed-forward network with a ReLU active layer,t ₂ representing a depth feature;

the MLP (multi-layer perceptron) in fig. 2 is a multi-layer feed-forward network with a ReLU activation layer.

As will be appreciated by those skilled in the art, a deep characterization is a generic term for all characterizations obtained after passing through a neural network.

3) Respectively obtaining first type characteristics through two feedforward networks with tanh active layers in sequencelAnd a second kind of characterization

：

Wherein tanh represents a hyperbolic tangent function, e represents a natural constant,

is hyperparametric, W ₃ And b ₃ Weight parameter and bias parameter, W, respectively, in the first feedforward network with tan's active layer ₄ And b ₄ Respectively, a weight parameter and a bias parameter in a second feedforward network with a tanh active layer;

in the embodiment of the invention, all the weight parameters and the bias parameters are parameters to be learned. Hyper-parameter

Mainly for controlling the degree of smoothing. The dimensions of the first type of representation are higher than the dimensions of the second type of representation, for example, the dimensions of the first type of representation may be 8-16 times the dimensions of the second type of representation. Although the first class of features and the second class of features are not binary, the values of each dimension in the two classes of features are very close to 1 or-1 due to the smoothing operation. Generally speaking, the first type of representation can ensure better accuracy because the first type of representation can store more semantic information within a certain range, but the second type of representation can reduce the retrieval accuracy rate because more information is lost, so that the relationship between the two types of representations is utilized to promote the low-dimensional representationAnd (4) accuracy. For the first and second type characterizations a1 and a2, their distance d _h (a 1, a 2) may be represented as d _h (a1,a2)=-0.5(a1 ^T a2- | a1 |), and T is a transposed symbol. Based on the characteristic, calculating a correlation propagation loss function by using the generated first class characterization and the second class characterization, wherein the correlation propagation loss function is expressed as:

wherein, the first and the second end of the pipe are connected with each other,L _rp a function representing the propagation loss of the correlation is shown,

represents the length of a dimension of the first type of token,N _B representing the number of training texts in the current training batch,l _k andl _j respectively represent the first training batchkA training text and the secondjA first type of characterization of the individual training texts,

and

respectively represent the first training batchkA training text and the secondjA second type of representation of the training text,band the dimension length of the second type of representation is equal to the dimension length of the hash code.

The intuitive meaning of the correlation propagation loss function is if the first class of two tokens of the training text are representedl _k Andl _j close in space, then their second class of characterization

And

are also close to each other; on the contrary, if the first class of characteristicsl _k And withl _j Are far apart from each other, then the corresponding second type of characterization

And

should also be relatively far apart. That is, the relationship information between the first class of tokens is passed to the second class of tokens by a correlation propagation loss function.

In addition, in order to obtain the hash code, a median method is used for processing the numerical value of each dimension in the second type of representation; specifically, the method comprises the following steps: integrating second type representations of all training samples, determining the average value of each dimension in the second type representations, and setting the numerical value of each dimension in the second type representations to be 1 if the numerical value is larger than the average value of the corresponding dimension, or setting the numerical value of each dimension to be-1 if the numerical value is not larger than the average value of the corresponding dimension; thereby obtaining a hash code having a value of 1 or-1.

2. And a global balance optimization module.

In order to ensure the high efficiency of the generated hash codes in the retrieval process, the hash codes generated by all texts need to be uniformly distributed in the hash space. In the embodiment of the invention, the hash codes corresponding to all the training texts are stored in a global storage module M, and the stored hash codes are used for carrying out optimization guidance on global balance information on the hash codes of a new batch in the training process. Before the training begins, the global memory module M is initialized with a bernoulli distribution with a parameter of 0.5.

As shown in fig. 2, each training text is provided with a corresponding storage location in the global storage module M, Q is the total number of training texts, and the ith training text x in all training texts _i Is recorded as M _i (ii) a In each training batch, selecting partial hash codes from the global storage module M for calculating the weight of global equilibrium information, wherein the selection mode is as follows: setting a corresponding timer for each memory location (store location M) _i Corresponding timer is notedv _i ) And initialized to 0, each trainingAdding 1 to the timer value before the start of the training batch (some examples of counter values are provided in FIG. 2); in the current training batch, if the training text corresponding to a certain storage position belongs to the training text of the current training batch, resetting the timer value of the corresponding storage position to be 0; if the training text does not belong to the current training batch, judging whether the timer value of the corresponding storage position meets the set extraction condition value (for example, is less than or equal to the set extraction condition value)

) If yes, selecting the hash code stored in the corresponding storage position; all selected hash codes form a set

Use sets

And calculating the weight of the global balance information to carry out optimization guidance of the global balance information on the newly generated hash code. Wherein Q represents the number of training texts,N _B representing the number of training texts in the current training batch,

the system is a hyper-parameter and is used for controlling the number of the hash codes taken out from the global storage module;

representation collection

The number of the elements in the set, each element in the set represents a selected hash code, and b is the dimension length of the hash code.

In order to achieve hash codes that are evenly distributed in the hash space, the present invention has two optimized objectives, bit balancing and bit independence, respectively. Thus, the global equalization optimization penalty function includes: a bit equalization loss function and a bit independent loss function; meanwhile, the global equalization information includes: bit equalization and bit independence.

The bit equalization refers to the value of each dimension in the hash code, the same probability is 1 or-1, and in order to realize the bit equalization from the global perspective, a set is used

A bit-equalization weight for the global case is calculated, expressed as:

wherein, aggregate

For the selected set of partial hash codes,

representation collection

The value of the c-th dimension in the t-th hash code, b represents the length of the dimension of the hash code,

representation collection

The number of hash codes in the hash table,

representing the bit equality weight of the c-th dimension in the hash code.

Utilizing the bit balance weight to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit balance loss function L _bb Expressed as:

wherein the content of the first and second substances,N _B representing the number of training texts in the current training batch,

indicates the second in the current training batchkThe training texts correspond to the value of the c-th dimension in the second type of representation, and the dimension length of the second type of representation is equal to the dimension length of the hash code.

The above constraint can be considered as making the expectation of the hash code generated by the current training batch to be 0 in each dimension, and applying a bit equality weight to each dimension

。

Bit independence means that any two dimensions in a hash code are independent, using sets

A bit-independence weight a for measuring global conditions is calculated, expressed as:

wherein the content of the first and second substances,

the matrix of the unit is expressed by,Rrepresenting a set of real numbers; t is the transposed symbol.

For convenience of representation, a set of training texts in the current batch corresponding to the second class of features is denoted as S:

and utilizing the coefficient A of the bit-independent condition to constrain the second type of characteristics corresponding to the training texts in the current batch to obtain a bit-independent loss function L _bd Expressed as:

coefficient A pair using bit-independent caseThe training texts in the current batch are restricted corresponding to the second class of characteristics to obtain a bit-independent loss function L _bd Expressed as:

wherein each of the set S

And representing a second type representation corresponding to one training text in the current batch, wherein the subscript is the serial number of the training text.

3. A noise-aware decoder module.

The noise-aware decoder module is used to reconstruct the training text, in the present example, using

And

representing the text reconstructed by the noise-aware decoder module (text representation), discrete words are predicted using the softmax layer.

On one hand, calculating the probability that the word at each position in the training text belongs to each word in the word set by using the second type of representation corresponding to the training text, selecting the word in the word set with the maximum probability as the word at the corresponding position, and reconstructing the corresponding training text; for the training text x, its corresponding second class of characteristics is denoted as

Then the word x at the u-th position ^(u) Is the first in the word setvA word w _v Probability of (2)

Expressed as:

wherein, T represents a transposed symbol,

representing a set of words in the text that,

representing the total number of words in the set of words,

representing the first in a set of words

The number of the individual words is,

，

，

and

are respectively words w _v And words

Is used to indicate the one-hot code of (c),

and with

Respectively represent words w _v And words

Corresponding bias parameters, W represents a trainable word embedding matrix.

According to the above formula, the word x at the u-th position can be calculated ^(u) Selecting the word in the word set corresponding to the maximum probability as the reconstructed word at the u-th position for the probabilities of all the words in the word set

。

And calculating therefrom a first reconstruction loss function, expressed as:

wherein L is _rec A first reconstruction loss function is represented that,

representing the total number of words of the training text x,

representing reconstructed training text

The word in the u-th position in (c),

to represent

E represents the mathematical expectation, x-D _x Distribution D representing that training text x satisfies text data _x 。

On the other hand, in order to solve the problem of noise in the text, a noise perception module and a reconstruction target of the noise text are introduced. For a training text x, a corresponding noise text set can be obtained by randomly deleting words from the training text x

，

Represents the h-th noisy text, n is the number of noisy texts, h =1, 2. The specific process is as follows: from a Gaussian distribution

In the method, multiple groups of noise rates are randomly sampled

，

Representing the h-th noise rate, the number of which is equal to the number of noise texts corresponding to the training text x, so as to

Randomly replacing words in the training text x by the probability to obtain the h-th noise text

. The encoder module for transmitting the P (x) input relation can obtain the corresponding first class characterization set

And a second type of feature set

，

And with

And respectively representing the first class representation and the second class representation of the h-th noise text.

Using a set of second kind of features

Reconstructing the training text to obtain a reconstructed training text set

，

Representing the reconstructed training text using the h-th noisy text in a manner similar to that described above, using a second type of representation corresponding to the noisy text

Reconstructing a corresponding training text x; for h noise text

The u-th position of the word

Is the first in the word setvA word w _v Probability of (2)

Expressed as:

according to the formula, the word of the u-th position can be calculated

Selecting the word in the word set corresponding to the maximum probability as the reconstructed word at the u-th position for the probabilities of all the words in the word set

Synthesizing all positions to form a training text reconstructed by using the h-th noise text

(ii) a Here reconstructed using the h-th noisy textTraining text

The same number of words as the training text x.

When a second reconstruction loss function (noise perception reconstruction loss function) is calculated, firstly, a correlation coefficient between each noise text and a corresponding training text is calculated by using a first type of representation corresponding to each noise text, and the calculation mode of the semantic correlation coefficient between the h-th noise text and the training text is as follows:

wherein the content of the first and second substances,

representing the semantic relevance coefficient to the training text x calculated using the first class of tokens corresponding to the h-th noisy text,nwhich represents the number of noisy texts,

and

respectively representing the first type representation corresponding to the h-th noise text and the d-th noise text,lrepresenting a first class of tokens of the training text x.

And combining the semantic correlation coefficient with the training text reconstructed by the corresponding noise text to obtain a second reconstruction loss function as follows:

wherein L is _{rec_noise} A second reconstruction loss function is represented that is,

representing a training text reconstructed with an h-th noisy text

The word in the u-th position in the list,

to represent

The probability of (c).

And (3) integrating all the loss functions to construct an integral loss function during training:

wherein the content of the first and second substances,

、

and

for controlling the balance of the loss functions. Training is carried out by minimizing the overall Loss function Loss through an Adam algorithm, weight parameters and bias parameters of an encoder module with relation propagation are updated, trainable word embedding matrixes W and bias parameters in a noise-aware decoder module are updated, and the Zhonghaxi codes in the global equalization optimization module are also updated until convergence.

4. And (5) index construction.

After the training of the Hash mapping model is completed, the candidate text is collected

Inputting the data into a trained relation propagation encoder module to obtain a second type of characterization set, and setting the value of each dimension in the hash code, which is larger than the average value of the corresponding dimension, to be 1 by adopting a median method, otherwise, setting the value of each dimension to be 0. Thereby obtaining the whole waiting timeAnd selecting a hash code of the text set. And constructing a hash table T(s), taking the hash code corresponding to the text as an index, wherein the value is the id (identification) of the candidate text, and then putting the corresponding id into the corresponding hash bucket according to the hash code of the candidate text.

5. And (5) online query.

In the case of the invention, when the method is used on line, the input query text q is converted into the bag-of-words representation, and then the corresponding hash code is obtained through an encoder module of the relation propagation in the hash mapping model

. Presetting a query threshold K, and querying according to the following process:

(1) Initializing query radius R =0, query structure set R = { }.

(2) Fast lookup and lookup through hash table T(s)

And obtaining the id of the corresponding candidate text from the found hash bucket, selecting the candidate text by using the id, and putting the candidate text into a query structure set R, wherein each hash bucket may store the ids of a plurality of candidate texts.

(3) Judging whether the number of the candidate texts in the query structure set R is smaller than K, if so, skipping to the step (2) with the value of the query radius being + 1; and (4) if the K is greater than or equal to K, turning to the step (4).

(4) And performing similarity calculation on the candidate texts in the query structure set R and the query text q by using a relevance evaluation scheme such as word movement distance and the like, and returning the previous K candidate texts in a descending manner according to the similarity to form a final similar text retrieval result.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

Example two

The invention also provides a similar text retrieval system, which is implemented mainly based on the method provided by the foregoing embodiment, as shown in fig. 3, the system mainly includes:

the model building and training unit is used for building a Hash mapping model and training the Hash mapping model in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a noise text which comprises a training text and is generated by the training text, sequentially generates a first type of representation and a second type of representation for the training text, generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the generated second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first class of characteristics and a second class of characteristics; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the decoder module for sensing the noise utilizes the training texts and second-class characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts which are respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.

EXAMPLE III

The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.

Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.

In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:

the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;

the output device may be a display terminal;

the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.

Example four

The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.

The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for retrieving similar text, comprising:

2. A method for similar text retrieval according to claim 1 wherein the processing flow of the relation-propagating encoder module comprises:

for the input training text, a multi-layer feed-forward network with a ReLU activation layer is used and a deep characterization of the text is obtained:

wherein the content of the first and second substances,t ₁ representing intermediate features, bow (x) representing the feature representation of training text x under the bag-of-words model, reLU representing modified linear units, W ₁ And W ₂ Weight parameter representing a multi-layer feedforward network with ReLU activation layer, b ₁ And b ₂ Representing the bias parameters of a multi-layer feed-forward network with a ReLU active layer,t ₂ representing a depth feature;

then, respectively obtaining first type characteristics through two feedforward networks with tanh active layers in sequencelAnd a second kind of characterization

：

is hyperparametric, W ₃ And b ₃ Weight parameter and bias parameter, W, in the first feedforward network with tanh active layer ₄ And b ₄ Respectively, a weight parameter and a bias parameter in a second feedforward network with a tanh active layer;

and processing the numerical value of each dimension in the second type of representation by using a median method to obtain the hash code.

3. A method for similar text retrieval according to claim 1 or 2, wherein the correlation propagation loss function is calculated using the generated first class representation and the second class representation, and is represented as:

wherein the content of the first and second substances,L _rp a function representing the propagation loss of the correlation is shown,

represents the length of a dimension of the first type of token,N _B representing the number of training texts in the current training batch,l _k andl _j respectively represent the first training batchkA training text and the secondjFirst type representation of training text，

And

respectively represent the first training batchkA training text andja second class of characterization of the individual training texts,band the dimension length of the second type of representation is equal to the dimension length of the hash code.

4. The method for retrieving similar texts according to claim 1, wherein the global equalization optimization module stores hash codes corresponding to all training texts, and performs optimization guidance on global equalization information on the hash codes newly generated in the training process includes:

storing the hash codes corresponding to all the training texts in a global storage module M, wherein each training text is provided with a corresponding storage position in the global storage module M, and storing the ith training text x in all the training texts _i Is recorded as M _i ；

In each training batch, selecting partial hash codes from the global storage module M for calculating the weight of global balance information, in the following manner: setting a timer for each storage position, initializing the timer to be 0, and adding 1 to the value of the timer before each training batch starts; in the current training batch, if the training text corresponding to a certain storage position belongs to the training text of the current training batch, resetting the timer value of the corresponding storage position to be 0; if the training texts do not belong to the current training batch, judging whether the timer value of the corresponding storage position meets a set extraction condition value, and if so, selecting the hash code stored in the corresponding storage position;

all selected hash codes form a set

Use sets

And calculating the weight of the global balance information to perform optimization guidance of the global balance information on the newly generated hash code.

5. The method for retrieving similar texts according to claim 1 or 4, wherein said global equalization optimization loss function comprises: a bit equalization loss function and a bit independent loss function; the global equalization information includes: bit equality and bit independence;

bit equality refers to the value of each dimension in the hash code, with the same probability of 1 or-1, using sets

A bit equalization weight for the global case is calculated, expressed as:

wherein the content of the first and second substances,

for the selected set of partial hash codes,

representation collection

representation collection

The number of hash codes in the hash table,

representing the bit equality weight of the c dimension in the hash code;

wherein, the first and the second end of the pipe are connected with each other,N _B representing the number of training texts in the current training batch,

indicates the first in the current training batchkThe training texts correspond to the value of the c-th dimension in the second type of representation, and the dimension length of the second type of representation is equal to the dimension length of the hash code;

wherein, I represents an identity matrix; t is a transposed symbol;

utilizing the coefficient A of the bit-independent condition to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit-independent loss function L _bd Expressed as:

and S represents a set formed by the training texts in the current batch corresponding to the second class of characteristics.

6. A method for similar text retrieval as in claim 1 wherein said noise-aware decoder module reconstructs the corresponding training text using the second class of features corresponding to the training text and the noise text respectively comprises:

calculating the probability that the word at each position in the training text belongs to each word in the word set by using the second type of representation corresponding to the training text, selecting the word in the word set with the maximum probability as the word at the corresponding position, and reconstructing the corresponding training text; for the training text x, its corresponding second class is labeled as

Expressed as:

wherein, T represents a transposed symbol,

representing a set of words in the text that,

representing the total number of words in the set of words,

representing the first in a set of words

The number of the individual words is,

，

，

and

are respectively words w _v And words

Is represented by the one-hot code of (a),

and

respectively represent words w _v And words

Corresponding bias parameters, W represents a word embedding matrix;

the training text x corresponds to n noise texts and is represented as a noise text set

Second type of feature set corresponding to noisy text set P (x)

Wherein, in the process,

which represents the h-th noisy text, is,

a second class of tokens representing an h-th noise text, h =1,2, ·, n; using a collection of features of the second type

Reconstructing the training text to obtain a reconstructed training text set

，

The representation represents the training text reconstructed with the h-th noisy text.

7. A method of similar text retrieval as in claim 6 wherein said reconstruction loss function comprises: a first reconstruction loss function and a second reconstruction loss function; wherein:

calculating a first reconstruction loss function by using a second class of training texts corresponding to the training texts, wherein the first reconstruction loss function is represented as:

wherein L is _rec A first reconstruction loss function is represented that,

representing the total number of words of the training text x,

representing reconstructed training text

The word in the u-th position in the list,

represent

E represents the mathematical expectation, x-D _x Distribution D representing that training text x satisfies text data _x ；

Calculating a semantic correlation coefficient between the first type of representation corresponding to each noise text and the corresponding training text, and calculating a second loss function by combining the training text reconstructed by each noise text, wherein the semantic correlation coefficient is expressed as:

representing a training text reconstructed with an h-th noisy text

The word in the u-th position in the list,

to represent

The probability of (a) of (b) being,

the semantic correlation coefficient of the training text x calculated by using the first type of representation corresponding to the h-th noise text is represented by the following calculation mode:

wherein the content of the first and second substances,

and

respectively representing the first type representation corresponding to the h-th noise text and the d-th noise text,lrepresenting a first class of tokens of the training text x, T being a transposed symbol.

8. A similar text retrieval system realized based on the method of any one of claims 1 to 7, the system comprising:

and the retrieval unit is used for generating a hash code for the input query text by utilizing an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.

9. A processing device, comprising: one or more processors; a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.

10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.