CN115495546A - Similar text retrieval method, system, device and storage medium - Google Patents

Similar text retrieval method, system, device and storage medium Download PDF

Info

Publication number
CN115495546A
CN115495546A CN202211452104.7A CN202211452104A CN115495546A CN 115495546 A CN115495546 A CN 115495546A CN 202211452104 A CN202211452104 A CN 202211452104A CN 115495546 A CN115495546 A CN 115495546A
Authority
CN
China
Prior art keywords
training
text
representation
hash
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211452104.7A
Other languages
Chinese (zh)
Other versions
CN115495546B (en
Inventor
陈恩红
何理扬
黄振亚
刘淇
童世炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211452104.7A priority Critical patent/CN115495546B/en
Publication of CN115495546A publication Critical patent/CN115495546A/en
Application granted granted Critical
Publication of CN115495546B publication Critical patent/CN115495546B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a similar text retrieval method, a system, equipment and a storage medium, wherein a self-encoder (encoder-decoder) framework is used for constructing a Hash mapping model, the problem of information loss under the condition of low-dimensional Hash codes is solved through an encoder module of relationship propagation, global equilibrium optimization is carried out through a global equilibrium optimization module, the retrieval efficiency is effectively enhanced, the robustness of the Hash codes is enhanced through a decoder module of noise perception, and therefore the problem of noise in texts is solved.

Description

Similar text retrieval method, system, device and storage medium
Technical Field
The present invention relates to the field of similar text retrieval technologies, and in particular, to a similar text retrieval method, system, device, and storage medium.
Background
In recent years, with the explosion of online technologies, text data has multiplied, and users can come into contact with a large amount of text data. The similar text retrieval system can help a user to find text information related to the query text in a large number of text resources, greatly relieves the degree of information overload, and is widely applied to the life of people. With the rapid development of the technology, the similar text retrieval model is improved from a machine learning method to a deep learning method, so that the retrieval precision is continuously improved. However, the efficiency problem is still an inevitable key problem in the similar text retrieval system, including retrieval efficiency and storage efficiency, which is related to the user experience and the burden of the system. Therefore, how to improve efficiency without losing too much accuracy is an urgent research problem to be solved in similar text retrieval systems.
Around this research problem, researchers have proposed a variety of approaches, with deep semantic hashing being one solution that has received considerable attention in recent years. Its main idea is to use a deep semantic model to map the text into a binary representation, also called hash code, which the text will be stored in a hash bucket corresponding to the hash code. In one aspect, a hash table may be utilized to quickly return relevant text from a set of candidate text during retrieval. On the other hand, the hash code requires very little storage overhead, and thus saves a lot of storage space.
However, in practical applications, the current deep semantic hash scheme still has some technical problems to be solved: 1) A supervised training scheme is used, so that a large amount of time is consumed for labeling massive texts; 2) The lower the dimension of the hash code is, the faster the retrieval speed is, but more information is lost by the hash code compared with the original representation of the text, and how to ensure the precision of the low-dimensional hash code is a very challenging problem; 3) When the hash code is unevenly distributed in the space, some extreme situations easily occur, and the retrieval efficiency is affected. 4) In practical applications, the input behavior of the user cannot be controlled, so that the problem of text noise introduced due to misspelling of the user may be faced, and the accuracy of the retrieval result is further affected.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a storage medium for searching similar texts, which can improve the searching efficiency and the accuracy of the searching result.
The purpose of the invention is realized by the following technical scheme:
a method of similar text retrieval, comprising:
constructing a Hash mapping model, and training in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relational propagation inputs training texts and noise texts corresponding to the training texts, sequentially generates a first type of representation and a second type of representation for the training texts, generates corresponding hash codes by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimensionality of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
respectively generating a hash code of each candidate text by using an encoder module of relation propagation in the trained hash mapping model, and constructing a hash table;
and for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.
A similar text retrieval system comprising:
the model building and training unit is used for building a Hash mapping model and training the Hash mapping model in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a noise text which comprises a training text and is generated by the training text, sequentially generates a first type of representation and a second type of representation for the training text, generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the generated second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for the input query text by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as previously described.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, the Hash mapping model is constructed by using a self-encoder (encoder-decoder) framework, the problem of information loss under the condition of low-dimensional Hash codes is solved by using an encoder module with relation propagation, the global equilibrium optimization module is used for global equilibrium optimization, the retrieval efficiency is effectively enhanced, the robustness of the Hash codes is enhanced by using a noise-sensing decoder module, and therefore, the problem of noise in texts is solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a schematic diagram of a similar text retrieval method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a hash mapping model according to an embodiment of the present invention;
FIG. 3 is a diagram of a similar text retrieval system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the term "and/or" means that either or both can be achieved, for example, X and/or Y means that both cases include "X" or "Y" as well as three cases including "X and Y".
The terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
A method, a system, a device and a storage medium for searching for similar texts according to the present invention are described in detail below. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. The examples of the present invention, in which specific conditions are not specified, were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer.
Example one
The embodiment of the invention provides a similar text retrieval method, which is used for solving the problems that the prior art has poor retrieval effect on low-dimensional hash codes, lacks the capability of resisting noise texts and needs to optimize the retrieval efficiency and accuracy under the condition of no relevance marking. The invention aims to provide an unsupervised method for generating low-dimensional hash codes with robustness and uniform distribution, which is mainly used for generating binary representation (namely the hash codes) with semantic information from text data, and is realized by a constructed hash mapping model, wherein in the hash mapping model: a relation propagation encoder structure is provided aiming at the poor effect of the existing text semantic hash method in the low-dimensional hash code; the global balance optimization module is provided for solving the problem that the efficiency is influenced by extreme distribution conditions easily caused by only locally optimizing the balance of the hash code in the existing scheme; and proposes a noise-aware decoder structure for text noise problems. And after a Hash mapping model is trained in an unsupervised mode, mapping the candidate texts into Hash codes and constructing a Hash table, taking the input query texts as the Hash codes and querying in the Hash table, and combining the query results to generate a final similar text retrieval result. As shown in fig. 1, the main principle of the similar text retrieval method provided by the embodiment of the present invention is shown, which mainly includes:
1. a hash mapping model is constructed and trained using an unsupervised approach.
The hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a training text and a noise text corresponding to each training text, sequentially generates a first type of representation and a second type of representation for the training text, then generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the decoder module for sensing the noise utilizes the training texts and second-class characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts which are respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; and (4) combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function to construct an overall loss function during training.
2. And respectively generating the hash code of each candidate text by using an encoder module of the relation propagation in the trained hash mapping model, and constructing a hash table and an index.
3. And for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.
In each specific application in the field of similar text retrieval, the scheme provided by the invention can not only retain more semantic information in the hash code, but also improve the retrieval efficiency and the anti-noise capability of the hash code, and provide help for implementing efficient and accurate similar text retrieval.
In order to more clearly show the technical solutions and the technical effects provided by the present invention, the following detailed description is provided for the above methods provided by the embodiments of the present invention with specific embodiments.
1. Problem definition and formalization.
In the embodiment of the invention, two problems of definition and formalization are a semantic hash task and a similar text retrieval task respectively.
Defining a set of candidate texts as
Figure 465443DEST_PATH_IMAGE001
Wherein each one of
Figure 354902DEST_PATH_IMAGE002
Representing a candidate text, N representing the number of candidate texts,
Figure 922412DEST_PATH_IMAGE002
the subscript of (2) is the sequence number of the candidate text. The goal of the semantic hash task is to learn a hash function
Figure 504703DEST_PATH_IMAGE003
It maps the original text x to a binary representation
Figure 898775DEST_PATH_IMAGE004
Where the symbol b represents the dimension length of the binary representation, this binary representation s is also referred to as a hash code.
In the similar text retrieval task, the goal is to find a set of texts S (q) similar to a given query text q. Generally speaking, the process of similar text retrieval based on semantic hash has two stages, namely an index construction stage and an online retrieval stage. In the index building stage (offline stage), candidate texts are aggregated using a hash function f (x)
Figure 716558DEST_PATH_IMAGE005
The text of (1) is mapped into a hash code, and is constructed into a hash index by utilizing a hash table. In the online query stage, a hash code of a query text q is obtained through a hash function f (q), and a preliminary relevant text is quickly found through a hash table. Unlike traditional hash table lookup, since semantic hashing maps similar text to a similar hash space. Therefore, the searching process of the invention is as follows: assuming that the symbol r represents the distance between the query text and the candidate text in the hash space, the value of r will be gradually increased from 0 until K related texts greater than or equal to the predefined number are found. And subsequently, sequencing the preliminary related texts by using a word shift distance or other correlation evaluation scheme calculation method to obtain a final similar text retrieval result.
2. And (4) collecting and preprocessing data.
1. And (6) collecting data.
The present patent uses plain text in a broad sense as an input data set. Examples of such data are the public news data set (20 Ngnews) and the data set published by Yahoo | Answer (Yahooanswer). In addition, various text data sets can be collected as input data through network crawling or offline.
2. And (4) preprocessing data.
The collected data is preprocessed to ensure the effect of the model. The embodiment of the invention mainly utilizes the pure text type data to train, and the collected text data used for training may contain messy codes, illegal characters and the like, so that meaningless texts such as the messy codes, the illegal characters and the like are removed through preprocessing. And dividing a plurality of groups of training sets and verification sets according to a cross-validation mode.
3. And (5) constructing and training a Hash mapping model.
As shown in fig. 1, the present invention is a similar text retrieval system based on deep semantic hash, which includes a hash index construction process and an online retrieval process. Both processes map the original text into hash codes through the same hash mapping model. The hash mapping model constructed in the embodiment of the present invention is shown in fig. 2, and is an auto-encoder as a whole, so that training can be performed in an unsupervised manner, and the hash mapping model mainly includes: a relation-propagating encoder module, a global equalization optimization module, and a noise-aware decoder module.
1. An encoder module for relationship propagation.
To obtain the mapping from text to the low-dimensional hash tokens, a relational propagation encoder module is constructed to retain semantic information in the low-dimensional hash code. In the embodiment of the present invention, the objective of the encoder E for relationship propagation to map the text data into a low-dimensional hash code needs to go through the following steps:
1) Assume a set of training texts
Figure 512476DEST_PATH_IMAGE006
Figure 390302DEST_PATH_IMAGE007
Wherein
Figure 271671DEST_PATH_IMAGE008
Representing training text
Figure 627566DEST_PATH_IMAGE009
The word in the u-th position,
Figure 543569DEST_PATH_IMAGE010
representing text
Figure 90832DEST_PATH_IMAGE009
I =1,2, \ 8230and Q, Q represents the total number of training texts. Before inputting to the encoder module of the relationship propagation, the original training text is converted into bag-of-word model representation, and the noise text mentioned later also needs to be converted into bag-of-word model representation.
2) Using a multi-layer feed-forward network with a ReLU activation layer and obtaining a deep characterization of the text:
Figure 459496DEST_PATH_IMAGE011
Figure 619082DEST_PATH_IMAGE012
wherein the content of the first and second substances,t 1 representing intermediate features, bow (x) representing the feature representation of the training text x under the bag-of-words model, reLU representing a modified linear unit, W 1 And W 2 Weight parameter representing a multi-layer feedforward network with ReLU activation layer, b 1 And b 2 Representing the bias parameters of a multi-layer feed-forward network with a ReLU active layer,t 2 representing a depth feature;
the MLP (multi-layer perceptron) in fig. 2 is a multi-layer feed-forward network with a ReLU activation layer.
As will be appreciated by those skilled in the art, a deep characterization is a generic term for all characterizations obtained after passing through a neural network.
3) Respectively obtaining first type characteristics through two feedforward networks with tanh active layers in sequencelAnd a second kind of characterization
Figure 389592DEST_PATH_IMAGE013
Figure 484587DEST_PATH_IMAGE014
Figure 465181DEST_PATH_IMAGE015
Wherein tanh represents a hyperbolic tangent function, e represents a natural constant,
Figure 303824DEST_PATH_IMAGE016
is hyperparametric, W 3 And b 3 Weight parameter and bias parameter, W, respectively, in the first feedforward network with tan's active layer 4 And b 4 Respectively, a weight parameter and a bias parameter in a second feedforward network with a tanh active layer;
in the embodiment of the invention, all the weight parameters and the bias parameters are parameters to be learned. Hyper-parameter
Figure 53474DEST_PATH_IMAGE016
Mainly for controlling the degree of smoothing. The dimensions of the first type of representation are higher than the dimensions of the second type of representation, for example, the dimensions of the first type of representation may be 8-16 times the dimensions of the second type of representation. Although the first class of features and the second class of features are not binary, the values of each dimension in the two classes of features are very close to 1 or-1 due to the smoothing operation. Generally speaking, the first type of representation can ensure better accuracy because the first type of representation can store more semantic information within a certain range, but the second type of representation can reduce the retrieval accuracy rate because more information is lost, so that the relationship between the two types of representations is utilized to promote the low-dimensional representationAnd (4) accuracy. For the first and second type characterizations a1 and a2, their distance d h (a 1, a 2) may be represented as d h (a1,a2)=-0.5(a1 T a2- | a1 |), and T is a transposed symbol. Based on the characteristic, calculating a correlation propagation loss function by using the generated first class characterization and the second class characterization, wherein the correlation propagation loss function is expressed as:
Figure 319371DEST_PATH_IMAGE017
wherein, the first and the second end of the pipe are connected with each other,L rp a function representing the propagation loss of the correlation is shown,
Figure 23147DEST_PATH_IMAGE018
represents the length of a dimension of the first type of token,N B representing the number of training texts in the current training batch,l k andl j respectively represent the first training batchkA training text and the secondjA first type of characterization of the individual training texts,
Figure 665481DEST_PATH_IMAGE019
and
Figure 269637DEST_PATH_IMAGE020
respectively represent the first training batchkA training text and the secondjA second type of representation of the training text,band the dimension length of the second type of representation is equal to the dimension length of the hash code.
The intuitive meaning of the correlation propagation loss function is if the first class of two tokens of the training text are representedl k Andl j close in space, then their second class of characterization
Figure 706435DEST_PATH_IMAGE019
And
Figure 536988DEST_PATH_IMAGE020
are also close to each other; on the contrary, if the first class of characteristicsl k And withl j Are far apart from each other, then the corresponding second type of characterization
Figure 576488DEST_PATH_IMAGE019
And
Figure 176096DEST_PATH_IMAGE020
should also be relatively far apart. That is, the relationship information between the first class of tokens is passed to the second class of tokens by a correlation propagation loss function.
In addition, in order to obtain the hash code, a median method is used for processing the numerical value of each dimension in the second type of representation; specifically, the method comprises the following steps: integrating second type representations of all training samples, determining the average value of each dimension in the second type representations, and setting the numerical value of each dimension in the second type representations to be 1 if the numerical value is larger than the average value of the corresponding dimension, or setting the numerical value of each dimension to be-1 if the numerical value is not larger than the average value of the corresponding dimension; thereby obtaining a hash code having a value of 1 or-1.
2. And a global balance optimization module.
In order to ensure the high efficiency of the generated hash codes in the retrieval process, the hash codes generated by all texts need to be uniformly distributed in the hash space. In the embodiment of the invention, the hash codes corresponding to all the training texts are stored in a global storage module M, and the stored hash codes are used for carrying out optimization guidance on global balance information on the hash codes of a new batch in the training process. Before the training begins, the global memory module M is initialized with a bernoulli distribution with a parameter of 0.5.
As shown in fig. 2, each training text is provided with a corresponding storage location in the global storage module M, Q is the total number of training texts, and the ith training text x in all training texts i Is recorded as M i (ii) a In each training batch, selecting partial hash codes from the global storage module M for calculating the weight of global equilibrium information, wherein the selection mode is as follows: setting a corresponding timer for each memory location (store location M) i Corresponding timer is notedv i ) And initialized to 0, each trainingAdding 1 to the timer value before the start of the training batch (some examples of counter values are provided in FIG. 2); in the current training batch, if the training text corresponding to a certain storage position belongs to the training text of the current training batch, resetting the timer value of the corresponding storage position to be 0; if the training text does not belong to the current training batch, judging whether the timer value of the corresponding storage position meets the set extraction condition value (for example, is less than or equal to the set extraction condition value)
Figure 174008DEST_PATH_IMAGE021
) If yes, selecting the hash code stored in the corresponding storage position; all selected hash codes form a set
Figure 226278DEST_PATH_IMAGE022
Use sets
Figure 568004DEST_PATH_IMAGE023
And calculating the weight of the global balance information to carry out optimization guidance of the global balance information on the newly generated hash code. Wherein Q represents the number of training texts,N B representing the number of training texts in the current training batch,
Figure 22119DEST_PATH_IMAGE024
the system is a hyper-parameter and is used for controlling the number of the hash codes taken out from the global storage module;
Figure 66299DEST_PATH_IMAGE025
representation collection
Figure 730498DEST_PATH_IMAGE023
The number of the elements in the set, each element in the set represents a selected hash code, and b is the dimension length of the hash code.
In order to achieve hash codes that are evenly distributed in the hash space, the present invention has two optimized objectives, bit balancing and bit independence, respectively. Thus, the global equalization optimization penalty function includes: a bit equalization loss function and a bit independent loss function; meanwhile, the global equalization information includes: bit equalization and bit independence.
The bit equalization refers to the value of each dimension in the hash code, the same probability is 1 or-1, and in order to realize the bit equalization from the global perspective, a set is used
Figure 252746DEST_PATH_IMAGE023
A bit-equalization weight for the global case is calculated, expressed as:
Figure 951581DEST_PATH_IMAGE026
wherein, aggregate
Figure 635503DEST_PATH_IMAGE027
For the selected set of partial hash codes,
Figure 521420DEST_PATH_IMAGE028
representation collection
Figure 847359DEST_PATH_IMAGE027
The value of the c-th dimension in the t-th hash code, b represents the length of the dimension of the hash code,
Figure 902165DEST_PATH_IMAGE029
representation collection
Figure 22567DEST_PATH_IMAGE027
The number of hash codes in the hash table,
Figure 661359DEST_PATH_IMAGE030
representing the bit equality weight of the c-th dimension in the hash code.
Utilizing the bit balance weight to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit balance loss function L bb Expressed as:
Figure 790989DEST_PATH_IMAGE031
wherein the content of the first and second substances,N B representing the number of training texts in the current training batch,
Figure 808624DEST_PATH_IMAGE032
indicates the second in the current training batchkThe training texts correspond to the value of the c-th dimension in the second type of representation, and the dimension length of the second type of representation is equal to the dimension length of the hash code.
The above constraint can be considered as making the expectation of the hash code generated by the current training batch to be 0 in each dimension, and applying a bit equality weight to each dimension
Figure 490141DEST_PATH_IMAGE030
Bit independence means that any two dimensions in a hash code are independent, using sets
Figure 491595DEST_PATH_IMAGE027
A bit-independence weight a for measuring global conditions is calculated, expressed as:
Figure 18391DEST_PATH_IMAGE033
wherein the content of the first and second substances,
Figure 156111DEST_PATH_IMAGE034
the matrix of the unit is expressed by,Rrepresenting a set of real numbers; t is the transposed symbol.
For convenience of representation, a set of training texts in the current batch corresponding to the second class of features is denoted as S:
Figure 149475DEST_PATH_IMAGE035
and utilizing the coefficient A of the bit-independent condition to constrain the second type of characteristics corresponding to the training texts in the current batch to obtain a bit-independent loss function L bd Expressed as:
coefficient A pair using bit-independent caseThe training texts in the current batch are restricted corresponding to the second class of characteristics to obtain a bit-independent loss function L bd Expressed as:
Figure 566173DEST_PATH_IMAGE036
wherein each of the set S
Figure 37606DEST_PATH_IMAGE037
And representing a second type representation corresponding to one training text in the current batch, wherein the subscript is the serial number of the training text.
3. A noise-aware decoder module.
The noise-aware decoder module is used to reconstruct the training text, in the present example, using
Figure 154467DEST_PATH_IMAGE038
And
Figure 53152DEST_PATH_IMAGE039
representing the text reconstructed by the noise-aware decoder module (text representation), discrete words are predicted using the softmax layer.
On one hand, calculating the probability that the word at each position in the training text belongs to each word in the word set by using the second type of representation corresponding to the training text, selecting the word in the word set with the maximum probability as the word at the corresponding position, and reconstructing the corresponding training text; for the training text x, its corresponding second class of characteristics is denoted as
Figure 888253DEST_PATH_IMAGE013
Then the word x at the u-th position (u) Is the first in the word setvA word w v Probability of (2)
Figure 897798DEST_PATH_IMAGE040
Expressed as:
Figure 10110DEST_PATH_IMAGE041
wherein, T represents a transposed symbol,
Figure 204331DEST_PATH_IMAGE042
representing a set of words in the text that,
Figure 136515DEST_PATH_IMAGE043
representing the total number of words in the set of words,
Figure 841428DEST_PATH_IMAGE044
representing the first in a set of words
Figure 808247DEST_PATH_IMAGE045
The number of the individual words is,
Figure 907790DEST_PATH_IMAGE046
Figure 592849DEST_PATH_IMAGE047
Figure 209775DEST_PATH_IMAGE048
and
Figure 155735DEST_PATH_IMAGE049
are respectively words w v And words
Figure 567124DEST_PATH_IMAGE044
Is used to indicate the one-hot code of (c),
Figure 598534DEST_PATH_IMAGE050
and with
Figure 753572DEST_PATH_IMAGE051
Respectively represent words w v And words
Figure 52573DEST_PATH_IMAGE044
Corresponding bias parameters, W represents a trainable word embedding matrix.
According to the above formula, the word x at the u-th position can be calculated (u) Selecting the word in the word set corresponding to the maximum probability as the reconstructed word at the u-th position for the probabilities of all the words in the word set
Figure 634864DEST_PATH_IMAGE052
And calculating therefrom a first reconstruction loss function, expressed as:
Figure 294516DEST_PATH_IMAGE053
wherein L is rec A first reconstruction loss function is represented that,
Figure 112299DEST_PATH_IMAGE054
representing the total number of words of the training text x,
Figure 908217DEST_PATH_IMAGE052
representing reconstructed training text
Figure 51622DEST_PATH_IMAGE055
The word in the u-th position in (c),
Figure 667411DEST_PATH_IMAGE056
to represent
Figure 288885DEST_PATH_IMAGE057
E represents the mathematical expectation, x-D x Distribution D representing that training text x satisfies text data x
On the other hand, in order to solve the problem of noise in the text, a noise perception module and a reconstruction target of the noise text are introduced. For a training text x, a corresponding noise text set can be obtained by randomly deleting words from the training text x
Figure 939310DEST_PATH_IMAGE058
Figure 489502DEST_PATH_IMAGE059
Represents the h-th noisy text, n is the number of noisy texts, h =1, 2. The specific process is as follows: from a Gaussian distribution
Figure 123746DEST_PATH_IMAGE060
In the method, multiple groups of noise rates are randomly sampled
Figure 158698DEST_PATH_IMAGE061
Figure 788262DEST_PATH_IMAGE062
Representing the h-th noise rate, the number of which is equal to the number of noise texts corresponding to the training text x, so as to
Figure 883257DEST_PATH_IMAGE062
Randomly replacing words in the training text x by the probability to obtain the h-th noise text
Figure 863851DEST_PATH_IMAGE059
. The encoder module for transmitting the P (x) input relation can obtain the corresponding first class characterization set
Figure 702494DEST_PATH_IMAGE063
And a second type of feature set
Figure 593090DEST_PATH_IMAGE064
Figure 983620DEST_PATH_IMAGE065
And with
Figure 61297DEST_PATH_IMAGE066
And respectively representing the first class representation and the second class representation of the h-th noise text.
Using a set of second kind of features
Figure 61221DEST_PATH_IMAGE064
Reconstructing the training text to obtain a reconstructed training text set
Figure 806323DEST_PATH_IMAGE067
Figure 367755DEST_PATH_IMAGE068
Representing the reconstructed training text using the h-th noisy text in a manner similar to that described above, using a second type of representation corresponding to the noisy text
Figure 198307DEST_PATH_IMAGE066
Reconstructing a corresponding training text x; for h noise text
Figure 378753DEST_PATH_IMAGE059
The u-th position of the word
Figure 102995DEST_PATH_IMAGE069
Is the first in the word setvA word w v Probability of (2)
Figure 976273DEST_PATH_IMAGE070
Expressed as:
Figure 887598DEST_PATH_IMAGE071
according to the formula, the word of the u-th position can be calculated
Figure 871734DEST_PATH_IMAGE069
Selecting the word in the word set corresponding to the maximum probability as the reconstructed word at the u-th position for the probabilities of all the words in the word set
Figure 951948DEST_PATH_IMAGE072
Synthesizing all positions to form a training text reconstructed by using the h-th noise text
Figure 730548DEST_PATH_IMAGE073
(ii) a Here reconstructed using the h-th noisy textTraining text
Figure 270114DEST_PATH_IMAGE073
The same number of words as the training text x.
When a second reconstruction loss function (noise perception reconstruction loss function) is calculated, firstly, a correlation coefficient between each noise text and a corresponding training text is calculated by using a first type of representation corresponding to each noise text, and the calculation mode of the semantic correlation coefficient between the h-th noise text and the training text is as follows:
Figure 916996DEST_PATH_IMAGE074
wherein the content of the first and second substances,
Figure 225617DEST_PATH_IMAGE075
representing the semantic relevance coefficient to the training text x calculated using the first class of tokens corresponding to the h-th noisy text,nwhich represents the number of noisy texts,
Figure 299753DEST_PATH_IMAGE076
and
Figure 592194DEST_PATH_IMAGE077
respectively representing the first type representation corresponding to the h-th noise text and the d-th noise text,lrepresenting a first class of tokens of the training text x.
And combining the semantic correlation coefficient with the training text reconstructed by the corresponding noise text to obtain a second reconstruction loss function as follows:
Figure 777187DEST_PATH_IMAGE078
wherein L is rec_noise A second reconstruction loss function is represented that is,
Figure 471474DEST_PATH_IMAGE079
representing a training text reconstructed with an h-th noisy text
Figure 591877DEST_PATH_IMAGE080
The word in the u-th position in the list,
Figure 463624DEST_PATH_IMAGE081
to represent
Figure 327675DEST_PATH_IMAGE082
The probability of (c).
And (3) integrating all the loss functions to construct an integral loss function during training:
Figure 1102DEST_PATH_IMAGE083
wherein the content of the first and second substances,
Figure 292406DEST_PATH_IMAGE084
Figure 152915DEST_PATH_IMAGE085
and
Figure 86236DEST_PATH_IMAGE086
for controlling the balance of the loss functions. Training is carried out by minimizing the overall Loss function Loss through an Adam algorithm, weight parameters and bias parameters of an encoder module with relation propagation are updated, trainable word embedding matrixes W and bias parameters in a noise-aware decoder module are updated, and the Zhonghaxi codes in the global equalization optimization module are also updated until convergence.
4. And (5) index construction.
After the training of the Hash mapping model is completed, the candidate text is collected
Figure 223956DEST_PATH_IMAGE001
Inputting the data into a trained relation propagation encoder module to obtain a second type of characterization set, and setting the value of each dimension in the hash code, which is larger than the average value of the corresponding dimension, to be 1 by adopting a median method, otherwise, setting the value of each dimension to be 0. Thereby obtaining the whole waiting timeAnd selecting a hash code of the text set. And constructing a hash table T(s), taking the hash code corresponding to the text as an index, wherein the value is the id (identification) of the candidate text, and then putting the corresponding id into the corresponding hash bucket according to the hash code of the candidate text.
5. And (5) online query.
In the case of the invention, when the method is used on line, the input query text q is converted into the bag-of-words representation, and then the corresponding hash code is obtained through an encoder module of the relation propagation in the hash mapping model
Figure 810795DEST_PATH_IMAGE087
. Presetting a query threshold K, and querying according to the following process:
(1) Initializing query radius R =0, query structure set R = { }.
(2) Fast lookup and lookup through hash table T(s)
Figure 33966DEST_PATH_IMAGE087
And obtaining the id of the corresponding candidate text from the found hash bucket, selecting the candidate text by using the id, and putting the candidate text into a query structure set R, wherein each hash bucket may store the ids of a plurality of candidate texts.
(3) Judging whether the number of the candidate texts in the query structure set R is smaller than K, if so, skipping to the step (2) with the value of the query radius being + 1; and (4) if the K is greater than or equal to K, turning to the step (4).
(4) And performing similarity calculation on the candidate texts in the query structure set R and the query text q by using a relevance evaluation scheme such as word movement distance and the like, and returning the previous K candidate texts in a descending manner according to the similarity to form a final similar text retrieval result.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
Example two
The invention also provides a similar text retrieval system, which is implemented mainly based on the method provided by the foregoing embodiment, as shown in fig. 3, the system mainly includes:
the model building and training unit is used for building a Hash mapping model and training the Hash mapping model in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a noise text which comprises a training text and is generated by the training text, sequentially generates a first type of representation and a second type of representation for the training text, generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the generated second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first class of characteristics and a second class of characteristics; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the decoder module for sensing the noise utilizes the training texts and second-class characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts which are respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for the input query text by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
EXAMPLE III
The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for retrieving similar text, comprising:
constructing a Hash mapping model, and training in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relational propagation inputs training texts and noise texts corresponding to the training texts, sequentially generates a first type of representation and a second type of representation for the training texts, generates corresponding hash codes by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimensionality of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
respectively generating a hash code of each candidate text by using an encoder module of relation propagation in the trained hash mapping model, and constructing a hash table;
and for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.
2. A method for similar text retrieval according to claim 1 wherein the processing flow of the relation-propagating encoder module comprises:
for the input training text, a multi-layer feed-forward network with a ReLU activation layer is used and a deep characterization of the text is obtained:
Figure 318768DEST_PATH_IMAGE001
Figure 310995DEST_PATH_IMAGE002
wherein the content of the first and second substances,t 1 representing intermediate features, bow (x) representing the feature representation of training text x under the bag-of-words model, reLU representing modified linear units, W 1 And W 2 Weight parameter representing a multi-layer feedforward network with ReLU activation layer, b 1 And b 2 Representing the bias parameters of a multi-layer feed-forward network with a ReLU active layer,t 2 representing a depth feature;
then, respectively obtaining first type characteristics through two feedforward networks with tanh active layers in sequencelAnd a second kind of characterization
Figure 68735DEST_PATH_IMAGE003
Figure 779203DEST_PATH_IMAGE004
Figure 913381DEST_PATH_IMAGE005
Wherein tanh represents a hyperbolic tangent function, e represents a natural constant,
Figure 760114DEST_PATH_IMAGE006
is hyperparametric, W 3 And b 3 Weight parameter and bias parameter, W, in the first feedforward network with tanh active layer 4 And b 4 Respectively, a weight parameter and a bias parameter in a second feedforward network with a tanh active layer;
and processing the numerical value of each dimension in the second type of representation by using a median method to obtain the hash code.
3. A method for similar text retrieval according to claim 1 or 2, wherein the correlation propagation loss function is calculated using the generated first class representation and the second class representation, and is represented as:
Figure 452870DEST_PATH_IMAGE007
wherein the content of the first and second substances,L rp a function representing the propagation loss of the correlation is shown,
Figure 385054DEST_PATH_IMAGE008
represents the length of a dimension of the first type of token,N B representing the number of training texts in the current training batch,l k andl j respectively represent the first training batchkA training text and the secondjFirst type representation of training text,
Figure 322923DEST_PATH_IMAGE009
And
Figure 758583DEST_PATH_IMAGE010
respectively represent the first training batchkA training text andja second class of characterization of the individual training texts,band the dimension length of the second type of representation is equal to the dimension length of the hash code.
4. The method for retrieving similar texts according to claim 1, wherein the global equalization optimization module stores hash codes corresponding to all training texts, and performs optimization guidance on global equalization information on the hash codes newly generated in the training process includes:
storing the hash codes corresponding to all the training texts in a global storage module M, wherein each training text is provided with a corresponding storage position in the global storage module M, and storing the ith training text x in all the training texts i Is recorded as M i
In each training batch, selecting partial hash codes from the global storage module M for calculating the weight of global balance information, in the following manner: setting a timer for each storage position, initializing the timer to be 0, and adding 1 to the value of the timer before each training batch starts; in the current training batch, if the training text corresponding to a certain storage position belongs to the training text of the current training batch, resetting the timer value of the corresponding storage position to be 0; if the training texts do not belong to the current training batch, judging whether the timer value of the corresponding storage position meets a set extraction condition value, and if so, selecting the hash code stored in the corresponding storage position;
all selected hash codes form a set
Figure 389285DEST_PATH_IMAGE011
Use sets
Figure 808765DEST_PATH_IMAGE011
And calculating the weight of the global balance information to perform optimization guidance of the global balance information on the newly generated hash code.
5. The method for retrieving similar texts according to claim 1 or 4, wherein said global equalization optimization loss function comprises: a bit equalization loss function and a bit independent loss function; the global equalization information includes: bit equality and bit independence;
bit equality refers to the value of each dimension in the hash code, with the same probability of 1 or-1, using sets
Figure 284746DEST_PATH_IMAGE011
A bit equalization weight for the global case is calculated, expressed as:
Figure 106071DEST_PATH_IMAGE012
wherein the content of the first and second substances,
Figure 143560DEST_PATH_IMAGE011
for the selected set of partial hash codes,
Figure 315915DEST_PATH_IMAGE013
representation collection
Figure 470953DEST_PATH_IMAGE011
The value of the c-th dimension in the t-th hash code, b represents the length of the dimension of the hash code,
Figure 271419DEST_PATH_IMAGE014
representation collection
Figure 119289DEST_PATH_IMAGE011
The number of hash codes in the hash table,
Figure 637995DEST_PATH_IMAGE015
representing the bit equality weight of the c dimension in the hash code;
utilizing the bit balance weight to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit balance loss function L bb Expressed as:
Figure 331144DEST_PATH_IMAGE016
wherein, the first and the second end of the pipe are connected with each other,N B representing the number of training texts in the current training batch,
Figure 251696DEST_PATH_IMAGE017
indicates the first in the current training batchkThe training texts correspond to the value of the c-th dimension in the second type of representation, and the dimension length of the second type of representation is equal to the dimension length of the hash code;
bit independence means that any two dimensions in a hash code are independent, using sets
Figure 739309DEST_PATH_IMAGE011
A bit-independence weight a for measuring global conditions is calculated, expressed as:
Figure 243846DEST_PATH_IMAGE018
wherein, I represents an identity matrix; t is a transposed symbol;
utilizing the coefficient A of the bit-independent condition to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit-independent loss function L bd Expressed as:
Figure 740687DEST_PATH_IMAGE019
and S represents a set formed by the training texts in the current batch corresponding to the second class of characteristics.
6. A method for similar text retrieval as in claim 1 wherein said noise-aware decoder module reconstructs the corresponding training text using the second class of features corresponding to the training text and the noise text respectively comprises:
calculating the probability that the word at each position in the training text belongs to each word in the word set by using the second type of representation corresponding to the training text, selecting the word in the word set with the maximum probability as the word at the corresponding position, and reconstructing the corresponding training text; for the training text x, its corresponding second class is labeled as
Figure 515745DEST_PATH_IMAGE003
Then the word x at the u-th position (u) Is the first in the word setvA word w v Probability of (2)
Figure 705418DEST_PATH_IMAGE020
Expressed as:
Figure 808503DEST_PATH_IMAGE021
wherein, T represents a transposed symbol,
Figure 968089DEST_PATH_IMAGE022
representing a set of words in the text that,
Figure 738599DEST_PATH_IMAGE023
representing the total number of words in the set of words,
Figure 958227DEST_PATH_IMAGE024
representing the first in a set of words
Figure 548609DEST_PATH_IMAGE025
The number of the individual words is,
Figure 278929DEST_PATH_IMAGE026
Figure 903946DEST_PATH_IMAGE027
Figure 294476DEST_PATH_IMAGE028
and
Figure 372153DEST_PATH_IMAGE029
are respectively words w v And words
Figure 873542DEST_PATH_IMAGE024
Is represented by the one-hot code of (a),
Figure 618644DEST_PATH_IMAGE030
and
Figure 55441DEST_PATH_IMAGE031
respectively represent words w v And words
Figure 10628DEST_PATH_IMAGE024
Corresponding bias parameters, W represents a word embedding matrix;
the training text x corresponds to n noise texts and is represented as a noise text set
Figure 191074DEST_PATH_IMAGE032
Second type of feature set corresponding to noisy text set P (x)
Figure 148272DEST_PATH_IMAGE033
Wherein, in the process,
Figure 21550DEST_PATH_IMAGE034
which represents the h-th noisy text, is,
Figure 73820DEST_PATH_IMAGE035
a second class of tokens representing an h-th noise text, h =1,2, ·, n; using a collection of features of the second type
Figure 182590DEST_PATH_IMAGE033
Reconstructing the training text to obtain a reconstructed training text set
Figure 371126DEST_PATH_IMAGE036
Figure 274360DEST_PATH_IMAGE037
The representation represents the training text reconstructed with the h-th noisy text.
7. A method of similar text retrieval as in claim 6 wherein said reconstruction loss function comprises: a first reconstruction loss function and a second reconstruction loss function; wherein:
calculating a first reconstruction loss function by using a second class of training texts corresponding to the training texts, wherein the first reconstruction loss function is represented as:
Figure 79505DEST_PATH_IMAGE038
wherein L is rec A first reconstruction loss function is represented that,
Figure 726387DEST_PATH_IMAGE039
representing the total number of words of the training text x,
Figure 35008DEST_PATH_IMAGE040
representing reconstructed training text
Figure 610608DEST_PATH_IMAGE041
The word in the u-th position in the list,
Figure 903050DEST_PATH_IMAGE042
represent
Figure 963409DEST_PATH_IMAGE043
E represents the mathematical expectation, x-D x Distribution D representing that training text x satisfies text data x
Calculating a semantic correlation coefficient between the first type of representation corresponding to each noise text and the corresponding training text, and calculating a second loss function by combining the training text reconstructed by each noise text, wherein the semantic correlation coefficient is expressed as:
Figure 251171DEST_PATH_IMAGE044
wherein L is rec_noise A second reconstruction loss function is represented that is,
Figure 637153DEST_PATH_IMAGE045
representing a training text reconstructed with an h-th noisy text
Figure 10366DEST_PATH_IMAGE046
The word in the u-th position in the list,
Figure 139996DEST_PATH_IMAGE047
to represent
Figure 547843DEST_PATH_IMAGE048
The probability of (a) of (b) being,
Figure 839147DEST_PATH_IMAGE049
the semantic correlation coefficient of the training text x calculated by using the first type of representation corresponding to the h-th noise text is represented by the following calculation mode:
Figure 198191DEST_PATH_IMAGE050
wherein the content of the first and second substances,
Figure 865933DEST_PATH_IMAGE051
and
Figure 128287DEST_PATH_IMAGE052
respectively representing the first type representation corresponding to the h-th noise text and the d-th noise text,lrepresenting a first class of tokens of the training text x, T being a transposed symbol.
8. A similar text retrieval system realized based on the method of any one of claims 1 to 7, the system comprising:
the model building and training unit is used for building a Hash mapping model and training the Hash mapping model in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relation propagation inputs a noise text which comprises a training text and is generated by the training text, sequentially generates a first type of representation and a second type of representation for the training text, generates a corresponding hash code by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the generated second type of representation, wherein the dimension of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for the input query text by utilizing an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202211452104.7A 2022-11-21 2022-11-21 Similar text retrieval method, system, device and storage medium Active CN115495546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211452104.7A CN115495546B (en) 2022-11-21 2022-11-21 Similar text retrieval method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211452104.7A CN115495546B (en) 2022-11-21 2022-11-21 Similar text retrieval method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN115495546A true CN115495546A (en) 2022-12-20
CN115495546B CN115495546B (en) 2023-04-07

Family

ID=85116261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211452104.7A Active CN115495546B (en) 2022-11-21 2022-11-21 Similar text retrieval method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN115495546B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051686A (en) * 2023-01-13 2023-05-02 中国科学技术大学 Method, system, equipment and storage medium for erasing characters on graph

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
US20180341720A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Neural Bit Embeddings for Graphs
CN110457503A (en) * 2019-07-31 2019-11-15 北京大学 A kind of rapid Optimum depth hashing image coding method and target image search method
CN110659375A (en) * 2019-09-20 2020-01-07 中国科学技术大学 Hash model training method, similar object retrieval method and device
CN112256727A (en) * 2020-10-19 2021-01-22 东北大学 Database query processing and optimizing method based on artificial intelligence technology
CN113342922A (en) * 2021-06-17 2021-09-03 北京邮电大学 Cross-modal retrieval method based on fine-grained self-supervision of labels
CN113392180A (en) * 2021-01-07 2021-09-14 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113449849A (en) * 2021-06-29 2021-09-28 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113821527A (en) * 2021-06-30 2021-12-21 腾讯科技(深圳)有限公司 Hash code generation method and device, computer equipment and storage medium
CN114328818A (en) * 2021-11-25 2022-04-12 腾讯科技(深圳)有限公司 Text corpus processing method and device, storage medium and electronic equipment
US20220179891A1 (en) * 2019-04-09 2022-06-09 University Of Washington Systems and methods for providing similarity-based retrieval of information stored in dna

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
US20180341720A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Neural Bit Embeddings for Graphs
US20220179891A1 (en) * 2019-04-09 2022-06-09 University Of Washington Systems and methods for providing similarity-based retrieval of information stored in dna
CN110457503A (en) * 2019-07-31 2019-11-15 北京大学 A kind of rapid Optimum depth hashing image coding method and target image search method
CN110659375A (en) * 2019-09-20 2020-01-07 中国科学技术大学 Hash model training method, similar object retrieval method and device
CN112256727A (en) * 2020-10-19 2021-01-22 东北大学 Database query processing and optimizing method based on artificial intelligence technology
CN113392180A (en) * 2021-01-07 2021-09-14 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113342922A (en) * 2021-06-17 2021-09-03 北京邮电大学 Cross-modal retrieval method based on fine-grained self-supervision of labels
CN113449849A (en) * 2021-06-29 2021-09-28 桂林电子科技大学 Learning type text hash method based on self-encoder
CN113821527A (en) * 2021-06-30 2021-12-21 腾讯科技(深圳)有限公司 Hash code generation method and device, computer equipment and storage medium
CN114328818A (en) * 2021-11-25 2022-04-12 腾讯科技(深圳)有限公司 Text corpus processing method and device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NGHI D. Q. BUI 等: ""Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations"", 《PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *
ZHIWEN LI 等: ""Deep Hash Model for Similarity Text Retrieval"", 《2022 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA》 *
邹傲 等: ""基于深度哈希的文本表示学习"", 《计算机系统应用》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051686A (en) * 2023-01-13 2023-05-02 中国科学技术大学 Method, system, equipment and storage medium for erasing characters on graph
CN116051686B (en) * 2023-01-13 2023-08-01 中国科学技术大学 Method, system, equipment and storage medium for erasing characters on graph

Also Published As

Publication number Publication date
CN115495546B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US20210004682A1 (en) Adapting a sequence model for use in predicting future device interactions with a computing system
Wang Bankruptcy prediction using machine learning
CN109960738B (en) Large-scale remote sensing image content retrieval method based on depth countermeasure hash learning
CN110781409B (en) Article recommendation method based on collaborative filtering
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN111310439A (en) Intelligent semantic matching method and device based on depth feature dimension-changing mechanism
CN112732864B (en) Document retrieval method based on dense pseudo query vector representation
Cummings et al. Structured citation trend prediction using graph neural networks
CN115495546B (en) Similar text retrieval method, system, device and storage medium
CN113222139A (en) Neural network training method, device and equipment and computer storage medium
CN111930931A (en) Abstract evaluation method and device
Seo et al. Reliable knowledge graph path representation learning
Goswami et al. Filter-based feature selection methods using hill climbing approach
CN111966811A (en) Intention recognition and slot filling method and device, readable storage medium and terminal equipment
Yao et al. Hash bit selection with reinforcement learning for image retrieval
Ko et al. MASCOT: A Quantization Framework for Efficient Matrix Factorization in Recommender Systems
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN116720519B (en) Seedling medicine named entity identification method
CN112711648A (en) Database character string ciphertext storage method, electronic device and medium
CN111274359B (en) Query recommendation method and system based on improved VHRED and reinforcement learning
Sahu et al. Forecasting currency exchange rate time series with fireworks-algorithm-based higher order neural network with special attention to training data enrichment
Qiang et al. Large-scale multi-label image retrieval using residual network with hash layer
Cui et al. Deep hashing with multi-central ranking loss for multi-label image retrieval
CN113326393B (en) Image retrieval method based on deep hash feature and heterogeneous parallel processing
CN117609632B (en) Tobacco legal service method and system based on Internet technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant