CN115495546A - Similar text retrieval method, system, device and storage medium - Google Patents
- Publication number
- CN115495546A (application CN202211452104.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- text
- representation
- hash
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a similar text retrieval method, system, device and storage medium. A hash mapping model is constructed using an autoencoder (encoder-decoder) framework: a relation-propagation encoder module addresses the problem of information loss with low-dimensional hash codes; a global equalization optimization module performs globally balanced optimization, effectively improving retrieval efficiency; and a noise-aware decoder module strengthens the robustness of the hash codes, thereby addressing the problem of noise in texts.
Description
Technical Field
The present invention relates to the field of similar text retrieval technologies, and in particular, to a similar text retrieval method, system, device, and storage medium.
Background
In recent years, with the rapid growth of online services, text data has multiplied, and users come into contact with large amounts of it. A similar text retrieval system helps a user find text information related to a query text among massive text resources, greatly relieving information overload, and is widely applied in daily life. With rapid technological development, similar text retrieval models have advanced from machine learning methods to deep learning methods, continuously improving retrieval precision. However, efficiency, including retrieval efficiency and storage efficiency, remains an unavoidable key problem in similar text retrieval systems, affecting both user experience and system load. Therefore, how to improve efficiency without losing too much accuracy is an urgent research problem for similar text retrieval systems.
Around this research problem, researchers have proposed a variety of approaches, among which deep semantic hashing has received considerable attention in recent years. Its main idea is to use a deep semantic model to map each text into a binary representation, also called a hash code; the text is then stored in the hash bucket corresponding to its hash code. On the one hand, the hash table can be used to quickly return relevant texts from the candidate set during retrieval. On the other hand, hash codes require very little storage overhead and thus save a lot of storage space.
However, in practical applications, current deep semantic hashing schemes still have several technical problems to be solved: 1) supervised training schemes are used, so labeling massive texts consumes a large amount of time; 2) the lower the dimension of the hash code, the faster the retrieval, but the more information the hash code loses compared with the original representation of the text, and guaranteeing the precision of low-dimensional hash codes is a very challenging problem; 3) when hash codes are unevenly distributed in the space, extreme situations easily occur that hurt retrieval efficiency; 4) in practical applications, the input behavior of users cannot be controlled, so text noise introduced by misspellings may be faced, which affects the accuracy of the retrieval results.
Disclosure of Invention
The invention aims to provide a similar text retrieval method, system, device and storage medium that improve both retrieval efficiency and the accuracy of the retrieval results.
The purpose of the invention is realized by the following technical scheme:
a method of similar text retrieval, comprising:
constructing a hash mapping model and training it in an unsupervised manner; the hash mapping model comprises: a relation-propagation encoder module, a global equalization optimization module and a noise-aware decoder module; during training, the relation-propagation encoder module takes as input the training texts and the noise texts corresponding to the training texts; for each training text it sequentially generates a first-class representation and a second-class representation, generates the corresponding hash code from the second-class representation, and calculates a correlation propagation loss function from the generated first-class and second-class representations, wherein the dimension of the first-class representation is higher than that of the second-class representation; for each noise text, it sequentially generates a first-class representation and a second-class representation; the global equalization optimization module stores the hash codes corresponding to all training texts, performs optimization guidance with global equalization information on the hash codes newly generated during training, and calculates a global equalization optimization loss function; the noise-aware decoder module reconstructs the corresponding training texts from the second-class representations of the training texts and of the noise texts respectively, and calculates a reconstruction loss function from the respectively reconstructed training texts and the correlation between the noise texts and the corresponding training texts; the overall loss function during training is constructed by combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function;
respectively generating a hash code for each candidate text using the relation-propagation encoder module of the trained hash mapping model, and constructing a hash table;
and for an input query text, generating a hash code using the relation-propagation encoder module of the trained hash mapping model, querying the hash table, and performing relevance evaluation on the preliminary query results to obtain the final similar text retrieval result.
A similar text retrieval system comprising:
the model building and training unit is used for building a hash mapping model and training it in an unsupervised manner; the hash mapping model comprises: a relation-propagation encoder module, a global equalization optimization module and a noise-aware decoder module; during training, the relation-propagation encoder module takes as input the training texts and the noise texts generated from them; for each training text it sequentially generates a first-class representation and a second-class representation, generates the corresponding hash code from the second-class representation, and calculates a correlation propagation loss function from the generated first-class and second-class representations, wherein the dimension of the first-class representation is higher than that of the second-class representation; for each noise text, it sequentially generates a first-class representation and a second-class representation; the global equalization optimization module stores the hash codes corresponding to all training texts, performs optimization guidance with global equalization information on the hash codes newly generated during training, and calculates a global equalization optimization loss function; the noise-aware decoder module reconstructs the corresponding training texts from the second-class representations of the training texts and of the noise texts respectively, and calculates a reconstruction loss function from the respectively reconstructed training texts and the correlation between the noise texts and the corresponding training texts; the overall loss function during training is constructed by combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code for each candidate text using the relation-propagation encoder module of the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for an input query text using the relation-propagation encoder module of the trained hash mapping model, querying the hash table, and performing relevance evaluation on the preliminary query results to obtain the final similar text retrieval result.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as previously described.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, the hash mapping model is constructed using an autoencoder (encoder-decoder) framework; the relation-propagation encoder module addresses the problem of information loss with low-dimensional hash codes, the global equalization optimization module performs globally balanced optimization and effectively improves retrieval efficiency, and the noise-aware decoder module strengthens the robustness of the hash codes, thereby addressing the problem of noise in texts.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic diagram of a similar text retrieval method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a hash mapping model according to an embodiment of the present invention;
FIG. 3 is a diagram of a similar text retrieval system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the term "and/or" means that either or both can be achieved, for example, X and/or Y means that both cases include "X" or "Y" as well as three cases including "X and Y".
The terms "comprising", "including", "containing", "having", and other terms of similar meaning should be construed as non-exclusive inclusions. For example, including a certain feature (e.g., a material, component, ingredient, carrier, formulation, dimension, part, mechanism, device, step, process, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture) should be construed as including not only the explicitly listed feature but also other features, known in the art, that are not explicitly listed.
The similar text retrieval method, system, device and storage medium provided by the invention are described in detail below. Details not described in the embodiments of the invention belong to the prior art known to those skilled in the art. Where specific conditions are not specified in the examples of the present invention, they are carried out according to conventional conditions in the art or conditions suggested by the manufacturer.
Example one
The embodiment of the invention provides a similar text retrieval method, which addresses the problems that the prior art retrieves poorly with low-dimensional hash codes, lacks the ability to resist noisy texts, and needs to optimize retrieval efficiency and accuracy without relevance labels. The invention provides an unsupervised method for generating robust, uniformly distributed low-dimensional hash codes, which mainly generates binary representations carrying semantic information (i.e., hash codes) from text data and is realized by a constructed hash mapping model. In the hash mapping model: a relation-propagation encoder structure is proposed to address the poor performance of existing text semantic hashing methods on low-dimensional hash codes; a global equalization optimization module is proposed to address the problem that existing schemes optimize the balance of hash codes only locally and thus easily fall into extreme distributions that hurt efficiency; and a noise-aware decoder structure is proposed for the text noise problem. After the hash mapping model is trained in an unsupervised manner, the candidate texts are mapped into hash codes and a hash table is constructed; the input query text is likewise mapped into a hash code and queried in the hash table, and the query results are combined to generate the final similar text retrieval result. Fig. 1 shows the main principle of the similar text retrieval method provided by the embodiment of the present invention, which mainly includes:
1. a hash mapping model is constructed and trained using an unsupervised approach.
The hash mapping model comprises: a relation-propagation encoder module, a global equalization optimization module and a noise-aware decoder module. During training, the relation-propagation encoder module takes as input the training texts and the noise text corresponding to each training text; for each training text it sequentially generates a first-class representation and a second-class representation, then generates the corresponding hash code from the second-class representation, and calculates a correlation propagation loss function from the generated first-class and second-class representations, wherein the dimension of the first-class representation is higher than that of the second-class representation; for each noise text, it generates a first-class representation and a second-class representation. The global equalization optimization module stores the hash codes corresponding to all training texts, performs optimization guidance with global equalization information on the hash codes newly generated during training, and calculates a global equalization optimization loss function. The noise-aware decoder module reconstructs the corresponding training texts from the second-class representations of the training texts and of the noise texts respectively, and calculates a reconstruction loss function from the respectively reconstructed training texts and the correlation between the noise texts and the corresponding training texts. The overall loss function during training is constructed by combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function.
2. And respectively generating the hash code of each candidate text by using an encoder module of the relation propagation in the trained hash mapping model, and constructing a hash table and an index.
3. For an input query text, a hash code is generated using the relation-propagation encoder module of the trained hash mapping model, the hash table is queried, and relevance evaluation is performed on the preliminary query results to obtain the final similar text retrieval result.
In each specific application in the field of similar text retrieval, the scheme provided by the invention can not only retain more semantic information in the hash code, but also improve the retrieval efficiency and the anti-noise capability of the hash code, and provide help for implementing efficient and accurate similar text retrieval.
In order to more clearly show the technical solutions and the technical effects provided by the present invention, the following detailed description is provided for the above methods provided by the embodiments of the present invention with specific embodiments.
1. Problem definition and formalization.
In the embodiment of the invention, two problems are defined and formalized: the semantic hash task and the similar text retrieval task.
Define the set of candidate texts as D = {x_1, x_2, …, x_N}, where each x_i represents a candidate text, N represents the number of candidate texts, and the subscript of x_i is the sequence number of the candidate text. The goal of the semantic hash task is to learn a hash function f: x → s ∈ {−1, 1}^b, which maps an original text x to a binary representation s, where the symbol b represents the dimension length of the binary representation; this binary representation s is also referred to as a hash code.
In the similar text retrieval task, the goal is to find a set of texts S(q) similar to a given query text q. Generally speaking, similar text retrieval based on semantic hashing has two stages: an index construction stage and an online retrieval stage. In the index construction stage (offline stage), every text in the candidate set D is mapped into a hash code using the hash function f(x), and a hash index is constructed using a hash table. In the online query stage, the hash code of the query text q is obtained through the hash function f(q), and preliminary relevant texts are quickly found through the hash table. Unlike a traditional hash table lookup, semantic hashing maps similar texts to nearby points in the hash space. The search process of the invention is therefore as follows: let the symbol r represent the distance between the query text and the candidate texts in the hash space; the value of r is gradually increased from 0 until a number of related texts greater than or equal to the predefined number K is found. Subsequently, the preliminary related texts are ranked using the word mover's distance or another relevance evaluation scheme to obtain the final similar text retrieval result.
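The two-stage lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: hash codes are modeled as tuples of ±1 bits, and the query radius r grows from 0 until at least K candidates are collected; all function names are illustrative.

```python
# Minimal sketch of hash-table retrieval with an expanding Hamming radius r.
from itertools import combinations

def build_hash_table(codes):
    """codes: dict mapping text_id -> tuple of +/-1 bits (the hash code)."""
    table = {}
    for text_id, code in codes.items():
        table.setdefault(code, []).append(text_id)
    return table

def neighbors_at_radius(code, r):
    """Yield all codes at Hamming distance exactly r from `code`."""
    for idxs in combinations(range(len(code)), r):
        flipped = list(code)
        for i in idxs:
            flipped[i] = -flipped[i]   # flip a +/-1 bit
        yield tuple(flipped)

def lookup(table, query_code, k):
    """Grow r = 0, 1, 2, ... until at least k candidate texts are found."""
    found, r = [], 0
    while len(found) < k and r <= len(query_code):
        for code in neighbors_at_radius(query_code, r):
            found.extend(table.get(code, []))
        r += 1
    return found
```

The preliminary candidates returned by `lookup` would then be re-ranked by a relevance measure such as the word mover's distance.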
2. And (4) collecting and preprocessing data.
1. And (6) collecting data.
The present invention uses plain text in a broad sense as the input data set. Examples of such data are the public news data set 20 Newsgroups (20News) and the data set published by Yahoo! Answers (YahooAnswer). In addition, various text data sets can be collected as input data through web crawling or offline.
2. And (4) preprocessing data.
The collected data is preprocessed to ensure the effectiveness of the model. The embodiment of the invention mainly trains on plain-text data, and the collected training texts may contain garbled characters, illegal characters and the like, so such meaningless content is removed through preprocessing. Several groups of training sets and validation sets are then divided in a cross-validation manner.
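The preprocessing step can be sketched as follows. The cleaning regex and the fold count are illustrative choices, not specified by the patent:

```python
# Minimal sketch: strip control characters / debris, then split into
# cross-validation folds as (train, validation) pairs.
import re

def clean_text(text):
    """Drop control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)   # control characters
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

def cv_folds(texts, n_folds=5):
    """Round-robin split into (train, validation) pairs, one per fold."""
    folds = [texts[i::n_folds] for i in range(n_folds)]
    for v in range(n_folds):
        train = [t for i, f in enumerate(folds) if i != v for t in f]
        yield train, folds[v]
```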
3. And (5) constructing and training a Hash mapping model.
As shown in fig. 1, the present invention is a similar text retrieval system based on deep semantic hashing, which includes a hash index construction process and an online retrieval process. Both processes map the original text into hash codes through the same hash mapping model. The hash mapping model constructed in the embodiment of the present invention is shown in fig. 2; it is an autoencoder as a whole, so that training can be performed in an unsupervised manner, and it mainly includes: a relation-propagation encoder module, a global equalization optimization module, and a noise-aware decoder module.
1. An encoder module for relationship propagation.
To obtain the mapping from text to low-dimensional hash representations, a relation-propagation encoder module is constructed to retain semantic information in the low-dimensional hash code. In the embodiment of the present invention, the relation-propagation encoder E maps the text data into a low-dimensional hash code through the following steps:
1) Assume a set of training texts X = {x_1, x_2, …, x_Q}, where x_i = (w_{i,1}, w_{i,2}, …), w_{i,u} represents the word at the u-th position of training text x_i, i = 1, 2, …, Q, and Q represents the total number of training texts. Before being input to the relation-propagation encoder module, each original training text is converted into a bag-of-words representation; the noise texts mentioned later also need to be converted into bag-of-words representations.
2) Use a multi-layer feed-forward network with ReLU activation layers to obtain a deep characterization of the text:

t1 = ReLU(W1 · bow(x) + b1)
t2 = ReLU(W2 · t1 + b2)

where t1 represents the intermediate feature, bow(x) represents the representation of the training text x under the bag-of-words model, ReLU represents the rectified linear unit, W1 and W2 represent the weight parameters of the multi-layer feed-forward network with ReLU activation layers, b1 and b2 represent its bias parameters, and t2 represents the deep characterization;
the MLP (multi-layer perceptron) in fig. 2 is a multi-layer feed-forward network with a ReLU activation layer.
As will be appreciated by those skilled in the art, a deep characterization is a generic term for all characterizations obtained after passing through a neural network.
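Step 2) above can be sketched numerically as follows. This is a toy numpy illustration of the forward pass bow(x) → t1 → t2; the layer sizes and random initialization are illustrative assumptions, since the patent does not fix them:

```python
# Minimal sketch of the two-layer ReLU feed-forward encoder: bow(x) -> t1 -> t2.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def deep_characterization(bow_x, W1, b1, W2, b2):
    """t1 = ReLU(W1 @ bow(x) + b1); t2 = ReLU(W2 @ t1 + b2)."""
    t1 = relu(W1 @ bow_x + b1)   # intermediate feature
    t2 = relu(W2 @ t1 + b2)      # deep characterization
    return t2

rng = np.random.default_rng(0)
vocab, hidden, depth = 100, 32, 16           # illustrative layer sizes
W1, b1 = rng.normal(size=(hidden, vocab)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(depth, hidden)) * 0.1, np.zeros(depth)
bow_x = np.zeros(vocab)
bow_x[[3, 17, 42]] = 1.0                     # toy bag-of-words vector
t2 = deep_characterization(bow_x, W1, b1, W2, b2)
```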
3) Pass the deep characterization through two feed-forward networks with tanh activation layers to obtain, respectively, the first-class representation l and the second-class representation z:

l = tanh(α · (W3 · t2 + b3))
z = tanh(α · (W4 · t2 + b4)),  with tanh(u) = (e^u − e^{−u}) / (e^u + e^{−u})

where tanh represents the hyperbolic tangent function, e represents the natural constant, α is a hyper-parameter, W3 and b3 are respectively the weight and bias parameters of the first feed-forward network with a tanh activation layer, and W4 and b4 are respectively those of the second;

In the embodiment of the invention, all weight and bias parameters are parameters to be learned. The hyper-parameter α mainly controls the degree of smoothing. The dimension of the first-class representation is higher than that of the second-class representation; for example, it may be 8-16 times as high. Although neither representation is binary, the smoothing operation pushes the value of every dimension very close to 1 or −1. Generally speaking, the first-class representation can store more semantic information within a certain range and therefore ensures better accuracy, while the second-class representation loses more information and would reduce retrieval accuracy; the relationship between the two is therefore exploited to improve the accuracy of the low-dimensional representation. For two representations a1 and a2 of the same dimension, their distance d_h(a1, a2) can be expressed as d_h(a1, a2) = −0.5 · (a1^T · a2 − |a1|), where |a1| denotes the dimension length of a1 and T is the transpose symbol. Based on this property, a correlation propagation loss function is calculated from the generated first-class and second-class representations, expressed as:
L_rp = (1/N_B²) · Σ_{k=1}^{N_B} Σ_{j=1}^{N_B} ( d_h(l_k, l_j)/b_l − d_h(z_k, z_j)/b )²

where L_rp represents the correlation propagation loss function, b_l represents the dimension length of the first-class representation, N_B represents the number of training texts in the current training batch, l_k and l_j respectively represent the first-class representations of the k-th and the j-th training texts of the current training batch, z_k and z_j respectively represent their second-class representations, and b, the dimension length of the second-class representation, equals the dimension length of the hash code.
The intuitive meaning of the correlation propagation loss function is: if the first-class representations l_k and l_j of two training texts are close in space, then their second-class representations z_k and z_j should also be close to each other; conversely, if l_k and l_j are far apart, the corresponding z_k and z_j should also be relatively far apart. That is, the relationship information between the first-class representations is passed to the second-class representations through the correlation propagation loss function.
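The distance d_h and the correlation propagation idea can be sketched as follows. The per-dimension normalization (dividing each distance by its representation's length) and the averaging over pairs are hedged assumptions about the exact form; only the distance formula d_h(a1, a2) = −0.5(a1ᵀa2 − |a1|) is taken directly from the text:

```python
# Hedged sketch of the relaxed Hamming distance and a pairwise
# correlation-propagation loss matching distances between the two
# representation spaces.
import numpy as np

def d_h(a1, a2):
    """Relaxed Hamming distance: d_h(a1, a2) = -0.5 * (a1 . a2 - dim)."""
    return -0.5 * (a1 @ a2 - a1.shape[0])

def correlation_propagation_loss(L, Z):
    """L: (N_B, b_l) first-class reps; Z: (N_B, b) second-class reps."""
    n, b_l = L.shape
    b = Z.shape[1]
    loss = 0.0
    for k in range(n):
        for j in range(n):
            # match the normalized pairwise distances of the two spaces
            loss += (d_h(L[k], L[j]) / b_l - d_h(Z[k], Z[j]) / b) ** 2
    return loss / (n * n)
```

When the low-dimensional representations preserve the relative distances of the high-dimensional ones, the loss is zero.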
In addition, in order to obtain the hash code, a median method is used to process the value of each dimension of the second-class representation. Specifically: the second-class representations of all training samples are aggregated and the average value of each dimension is determined; the value of each dimension of a second-class representation is set to 1 if it is larger than the average value of the corresponding dimension, and to −1 otherwise. A hash code whose values are 1 or −1 is thereby obtained.
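The thresholding step above can be sketched in a few lines of numpy; the function name is illustrative:

```python
# Minimal sketch: binarize second-class representations against each
# dimension's mean over all training samples, yielding +/-1 hash bits.
import numpy as np

def binarize(Z):
    """Z: (num_samples, b) second-class reps -> (num_samples, b) in {-1, 1}."""
    means = Z.mean(axis=0)          # per-dimension average over all samples
    return np.where(Z > means, 1, -1)

Z = np.array([[0.9, -0.2],
              [0.1, -0.8],
              [0.5, -0.5]])
codes = binarize(Z)
```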
2. And a global balance optimization module.
In order to ensure that the generated hash codes are efficient during retrieval, the hash codes of all texts need to be uniformly distributed in the hash space. In the embodiment of the invention, the hash codes corresponding to all training texts are stored in a global storage module M, and the stored hash codes are used to guide the optimization of global equalization information for each new batch of hash codes during training. Before training begins, the global storage module M is initialized with a Bernoulli distribution with parameter 0.5.
As shown in fig. 2, each training text has a corresponding storage location in the global storage module M, Q is the total number of training texts, and the hash code of the i-th training text x_i among all training texts is recorded as M_i. In each training batch, some hash codes are selected from the global storage module M for calculating the weights of the global equalization information, in the following way: a corresponding timer is set for each storage location (the timer of storage location M_i is denoted v_i) and initialized to 0, and every timer is incremented by 1 before each training batch starts (fig. 2 shows some example counter values). In the current training batch, if the training text corresponding to a storage location belongs to the current batch, the timer of that location is reset to 0; if it does not belong to the current batch, it is judged whether the timer value of that location satisfies the set extraction condition (for example, is less than or equal to a set threshold), and if so, the hash code stored at that location is selected. All selected hash codes form a set M̂; the set M̂ is used to calculate the weights of the global equalization information, which guide the optimization of the newly generated hash codes. Here Q represents the number of training texts, N_B represents the number of training texts in the current training batch, a hyper-parameter controls the number of hash codes taken out of the global storage module, |M̂| represents the number of elements in the set M̂ (each element of the set represents a selected hash code), and b is the dimension length of the hash code.
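The timer-based selection can be sketched as follows. The class name and a fixed numeric threshold are illustrative; the patent leaves the exact extraction threshold to a hyper-parameter:

```python
# Hedged sketch of the global memory M with per-slot timers: timers advance
# each batch, reset for slots whose texts appear in the batch, and codes in
# slots with timer <= threshold (and outside the batch) are selected.
import random

class GlobalMemory:
    def __init__(self, num_texts, code_len, threshold):
        random.seed(0)
        # initialize codes with a Bernoulli(0.5) distribution over {-1, 1}
        self.codes = [[random.choice((-1, 1)) for _ in range(code_len)]
                      for _ in range(num_texts)]
        self.timers = [0] * num_texts
        self.threshold = threshold

    def select(self, batch_ids):
        """Advance all timers, reset slots seen in this batch, and return
        the codes of out-of-batch slots whose timer is within the threshold."""
        self.timers = [t + 1 for t in self.timers]
        for i in batch_ids:
            self.timers[i] = 0
        return [self.codes[i] for i in range(len(self.codes))
                if i not in batch_ids and self.timers[i] <= self.threshold]
```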
To achieve hash codes that are evenly distributed in the hash space, the invention has two optimization objectives: bit balance and bit independence. Accordingly, the global equalization optimization loss function includes a bit-balance loss function and a bit-independence loss function, and the global equalization information includes bit balance and bit independence.
Bit balance means that each dimension of the hash code takes the value 1 or −1 with equal probability. To achieve bit balance from a global perspective, the set M̂ is used to calculate a bit-balance weight for the global case, expressed as:

w_c = | (1/|M̂|) · Σ_{t=1}^{|M̂|} M̂_{t,c} |

where M̂ is the selected set of partial hash codes, M̂_{t,c} represents the value of the c-th dimension of the t-th hash code in M̂, b represents the dimension length of the hash code, |M̂| represents the number of hash codes in M̂, and w_c represents the bit-balance weight of the c-th dimension of the hash code.
The bit-equalization weight is used to constrain the second-class representations corresponding to the training texts in the current batch, obtaining a bit-equalization loss function L bb , expressed as:
wherein N B denotes the number of training texts in the current training batch, the summand is the value of the c-th dimension in the second-class representation corresponding to the k-th training text in the current batch, and the dimension length of the second-class representation equals the dimension length of the hash code.
The above constraint can be viewed as driving the expectation of the hash codes generated by the current training batch toward 0 in each dimension, with a bit-equalization weight applied to each dimension.
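Because the formula images are not reproduced in this text, the following is only a plausible sketch of the bit-balance term: the per-dimension weight is taken as the absolute mean of the stored bits, and the loss penalizes the weighted squared batch mean of each dimension, matching the verbal description that the batch expectation of each dimension is driven to 0 under a per-dimension weight:

```python
import numpy as np

def bit_balance_loss(B_hat, Z):
    """B_hat -- (m, b) selected hash codes from the global module, values in {-1, 1}
       Z     -- (N_B, b) second-class representations of the current batch
       Returns a weighted bit-balance loss (assumed form, not the patent's exact formula)."""
    # weight per dimension: how far the stored bits already deviate from balance
    w = np.abs(B_hat.mean(axis=0))        # shape (b,)
    # penalize each dimension's batch mean, weighted by the global imbalance
    batch_mean = Z.mean(axis=0)           # shape (b,)
    return float(np.sum(w * batch_mean ** 2))
```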
Bit independence means that any two dimensions of a hash code are independent of each other; the set of selected hash codes is used to calculate a bit-independence weight A measuring the global condition, expressed as:
wherein I denotes the identity matrix, R denotes the set of real numbers, and T is the transpose symbol.
For convenience of representation, the set of second-class representations corresponding to the training texts in the current batch is denoted S. The coefficient A of the bit-independence condition is used to constrain the second-class representations corresponding to the training texts in the current batch, obtaining a bit-independence loss function L bd , expressed as:
wherein each element of the set S represents the second-class representation corresponding to one training text in the current batch, the subscript being the serial number of that training text.
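The exact formula is likewise not reproduced; a sketch consistent with the description (the weight A compares the stored codes' bit correlations against the identity matrix I, and the loss constrains the batch's second-class representations accordingly) might read as follows. Both the form of A and the form of the loss are assumptions:

```python
import numpy as np

def bit_independence_loss(B_hat, Z):
    """B_hat -- (m, b) selected hash codes from the global module
       Z     -- (N_B, b) second-class representations of the current batch
       Assumed form: A measures how correlated each pair of bits already is in
       the stored codes; the loss pushes the batch's bit correlations toward
       the identity matrix, weighted by A."""
    m, b = B_hat.shape
    A = np.abs(B_hat.T @ B_hat / m - np.eye(b))   # (b, b) global weight
    N_B = Z.shape[0]
    C = Z.T @ Z / N_B - np.eye(b)                 # batch correlation deviation
    return float(np.sum(A * C ** 2))
```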
3. Noise-aware decoder module.
The noise-aware decoder module is used to reconstruct the training texts; the reconstructed texts are the outputs of the module, and discrete words are predicted using a softmax layer.
On one hand, the second-class representation corresponding to a training text is used to calculate the probability that the word at each position of the training text is each word in the word set; the word of the word set with the maximum probability is selected as the word at that position, reconstructing the training text. For the training text x with its corresponding second-class representation, the probability that the word x (u) at the u-th position is the v-th word w v of the word set is expressed as:
wherein T denotes the transpose symbol; the word set of the text and its total number of words appear in the normalization over all candidate words; the one-hot codes of the word w v and of each other word of the word set, together with their corresponding bias parameters, enter the scores; and W denotes a trainable word-embedding matrix.
According to the above formula, the probabilities of the word x (u) at the u-th position over all words in the word set can be calculated, and the word of the word set with the maximum probability is selected as the reconstructed word at the u-th position.
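The formula image is not reproduced in this text; a standard softmax form consistent with the symbols defined above (one-hot code e_v, bias b_v, embedding matrix W, and a second-class representation denoted z here, all symbol names being assumptions) would be:

```latex
p\left(x^{(u)} = w_v \,\middle|\, z\right)
  = \frac{\exp\left(e_v^{\mathsf{T}} W^{\mathsf{T}} z + b_v\right)}
         {\sum_{v'=1}^{|V|} \exp\left(e_{v'}^{\mathsf{T}} W^{\mathsf{T}} z + b_{v'}\right)}
```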
And calculating therefrom a first reconstruction loss function, expressed as:
wherein L rec denotes the first reconstruction loss function; the sum over positions u is normalized by the total number of words of the training text x; each term involves the word at the u-th position of the reconstructed training text and its probability; E denotes the mathematical expectation; and x~D x denotes that the training text x follows the text data distribution D x .
On the other hand, to address noise in the text, a noise-aware module and a reconstruction target for noisy texts are introduced. For a training text x, a corresponding set of noisy texts P(x) is obtained by randomly perturbing words of x, where the h-th noisy text is one element of this set, n is the number of noisy texts, and h = 1, 2, ..., n. The specific process is as follows: multiple noise rates are randomly sampled from a Gaussian distribution, their number equal to the number of noisy texts for the training text x, the h-th noise rate corresponding to the h-th noisy text; words in the training text x are then randomly replaced with probability equal to the h-th noise rate to obtain the h-th noisy text. Feeding P(x) into the relation-propagation encoder module yields the corresponding set of first-class representations and set of second-class representations, i.e., the first-class and second-class representation of each noisy text.
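The sampling procedure can be sketched as follows; the Gaussian mean and standard deviation, the clipping to [0, 1], and the replacement vocabulary are all illustrative assumptions not fixed by the text:

```python
import random

def make_noisy_texts(words, n, mean=0.3, std=0.1, vocab=None, seed=0):
    """Generate n noisy versions of a tokenized text: each word is replaced
    with probability r_h, where r_h is drawn from a Gaussian (assumed
    parameters; clipped to [0, 1]). vocab supplies the replacement words."""
    rng = random.Random(seed)
    vocab = vocab or ["<unk>"]
    noisy = []
    for _ in range(n):
        r = min(max(rng.gauss(mean, std), 0.0), 1.0)   # noise rate r_h
        noisy.append([rng.choice(vocab) if rng.random() < r else w
                      for w in words])
    return noisy
```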
The set of second-class representations is used to reconstruct the training text, giving a set of reconstructed training texts whose h-th element is the training text reconstructed using the h-th noisy text. In a manner similar to the above, the second-class representation corresponding to a noisy text is used to reconstruct the corresponding training text x; for the h-th noisy text, the probability that the word at the u-th position is the v-th word w v of the word set is expressed as:
According to this formula, the probabilities of the word at the u-th position over all words in the word set can be calculated, and the word of the word set with the maximum probability is selected as the reconstructed word at the u-th position; combining all positions forms the training text reconstructed using the h-th noisy text. The training text reconstructed here using the h-th noisy text has the same number of words as the training text x.
When the second reconstruction loss function (the noise-aware reconstruction loss function) is calculated, a correlation coefficient between each noisy text and the corresponding training text is first calculated using the first-class representation of each noisy text; the semantic correlation coefficient between the h-th noisy text and the training text is calculated as:
wherein the coefficient denotes the semantic correlation with the training text x calculated using the first-class representation corresponding to the h-th noisy text, n denotes the number of noisy texts, the first-class representations corresponding to the h-th and d-th noisy texts enter the numerator and the normalization respectively, and l denotes the first-class representation of the training text x.
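The formula image is missing here; given that the coefficients are computed from the first-class representations of the n noisy texts and of the training text, a softmax over inner products is a plausible form (an assumption, not the patent's exact definition):

```python
import numpy as np

def semantic_correlation(L_noise, l):
    """L_noise -- (n, d1) first-class representations of the n noisy texts
       l       -- (d1,)  first-class representation of the training text x
       Returns n coefficients that sum to 1; a softmax over inner products
       is assumed, since the formula image is not reproduced in the text."""
    scores = L_noise @ l
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```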
And combining the semantic correlation coefficient with the training text reconstructed by the corresponding noise text to obtain a second reconstruction loss function as follows:
wherein L rec_noise denotes the second reconstruction loss function, and each inner term involves the word at the u-th position in the training text reconstructed with the h-th noisy text and the probability of that word.
All the loss functions are integrated to construct the overall loss function for training:
wherein the coefficients control the balance among the loss functions. Training minimizes the overall Loss function Loss with the Adam algorithm, updating the weight and bias parameters of the relation-propagation encoder module, the trainable word-embedding matrix W and bias parameters in the noise-aware decoder module, and the hash codes in the global equalization optimization module, until convergence.
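The coefficient symbols did not survive extraction; a combination consistent with the components listed above (the coefficient names λ1, λ2, λ3 are assumed) would be:

```latex
\mathrm{Loss} = L_{rp} + \lambda_1 L_{bb} + \lambda_2 L_{bd}
              + \lambda_3 \left( L_{rec} + L_{rec\_noise} \right)
```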
4. Index construction.
After the training of the hash mapping model is completed, the candidate text set is input into the trained relation-propagation encoder module to obtain the set of second-class representations; using the median method, each dimension of a hash code whose value is larger than the median of the corresponding dimension is set to 1, and otherwise to 0, thereby obtaining the hash codes of the entire candidate text set. A hash table T(s) is then constructed: the hash code corresponding to a text serves as the index, the value is the id (identification) of the candidate text, and the id of each candidate text is put into the corresponding hash bucket according to its hash code.
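A minimal sketch of the binarization and bucketing; binarizing against the per-dimension median follows the "median method" named in the text, while the container layout (a dict keyed by code tuples) is an assumption:

```python
import numpy as np
from collections import defaultdict

def build_hash_table(Z, ids):
    """Z   -- (Q, b) second-class representations of the candidate texts
       ids -- the corresponding candidate-text identifiers
       Binarize each dimension against its median and bucket ids by code."""
    med = np.median(Z, axis=0)
    codes = (Z > med).astype(int)          # 1 above the median, else 0
    table = defaultdict(list)
    for code, cid in zip(codes, ids):
        table[tuple(code)].append(cid)     # hash bucket keyed by the code
    return table, codes
```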
5. Online query.
When the invention is used online, the input query text q is converted into a bag-of-words representation, and the corresponding hash code is then obtained through the relation-propagation encoder module of the hash mapping model. A query threshold K is preset, and the query proceeds as follows:
(1) Initialize the query radius r = 0 and the query result set R = { }.
(2) Through the hash table T(s), quickly look up the hash buckets whose hash codes are within Hamming distance r of the query hash code, obtain the ids of the corresponding candidate texts from the found hash buckets, select the candidate texts by their ids, and put them into the query result set R; each hash bucket may store the ids of multiple candidate texts.
(3) Judge whether the number of candidate texts in the query result set R is smaller than K; if so, increase the query radius by 1 and jump to step (2); if it is greater than or equal to K, go to step (4).
(4) Perform similarity calculation between the candidate texts in the query result set R and the query text q using a relevance evaluation scheme such as word mover's distance, and return the top K candidate texts in descending order of similarity as the final similar-text retrieval result.
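Steps (1) to (4) can be sketched as follows; the final re-ranking by a relevance measure such as word mover's distance is omitted, and the table is assumed to map hash-code tuples to lists of candidate ids:

```python
from itertools import combinations

def query(table, q_code, K):
    """Expand the Hamming query radius until at least K candidate ids are
    collected from the hash table, then return the first K (re-ranking by a
    relevance measure is left out of this sketch)."""
    b = len(q_code)
    results, r = [], 0
    while len(results) < K and r <= b:
        # enumerate all codes at Hamming distance exactly r from the query
        for flips in combinations(range(b), r):
            code = tuple(bit ^ 1 if i in flips else bit
                         for i, bit in enumerate(q_code))
            results.extend(table.get(code, []))
        r += 1
    return results[:K]
```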
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, a USB disk, a removable hard disk, etc.) and includes several instructions for enabling a computer device (a personal computer, a server, a network device, etc.) to execute the methods according to the embodiments of the present invention.
Example two
The invention also provides a similar text retrieval system, which is implemented mainly based on the method provided by the foregoing embodiment, as shown in fig. 3, the system mainly includes:
the model building and training unit is used for building a hash mapping model and training it in an unsupervised mode; the hash mapping model comprises: a relation-propagation encoder module, a global equalization optimization module and a noise-aware decoder module; during training, the relation-propagation encoder module takes as input the training texts and the noisy texts generated from them, sequentially generates a first-class representation and a second-class representation for each training text, generates the corresponding hash code using the second-class representation, and calculates a correlation propagation loss function using the generated first-class and second-class representations, wherein the dimension of the first-class representation is higher than that of the second-class representation; for each noisy text, a first-class representation and a second-class representation are likewise generated in turn; the global equalization optimization module stores the hash codes corresponding to all training texts, applies optimization guidance of global equalization information to the hash codes newly generated during training, and calculates a global equalization optimization loss function; the noise-aware decoder module reconstructs the training texts from the second-class representations of the training texts and of the noisy texts respectively, and calculates a reconstruction loss function using the reconstructed training texts and the correlation between each noisy text and its corresponding training text; the overall loss function for training is constructed by combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for the input query text by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
EXAMPLE III
The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer-readable storage medium, for example, as the memory in the processing device. The readable storage medium may be any medium that can store program codes, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for retrieving similar text, comprising:
constructing a Hash mapping model, and training in an unsupervised mode; the hash mapping model comprises: a relation-propagating encoder module, a global equalization optimization module and a noise-aware decoder module; during training, an encoder module for relational propagation inputs training texts and noise texts corresponding to the training texts, sequentially generates a first type of representation and a second type of representation for the training texts, generates corresponding hash codes by using the second type of representation, and calculates a correlation propagation loss function by using the generated first type of representation and the second type of representation, wherein the dimensionality of the first type of representation is higher than that of the second type of representation; for the noise text, sequentially generating a first type of representation and a second type of representation; the global equalization optimization module stores hash codes corresponding to all training texts, performs optimization guidance of global equalization information on the hash codes newly generated in the training process, and calculates a global equalization optimization loss function; the noise-aware decoder module utilizes the training texts and the second class of characteristics corresponding to the noise texts to respectively reconstruct the corresponding training texts, and utilizes the training texts respectively reconstructed and the correlation between the noise texts and the corresponding training texts to calculate a reconstruction loss function; constructing an integral loss function during training by combining a correlation propagation loss function, a global equilibrium optimization loss function and a reconstruction loss function;
respectively generating a hash code of each candidate text by using an encoder module of relation propagation in the trained hash mapping model, and constructing a hash table;
and for the input query text, generating a hash code by using an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on the initial result obtained by inquiry to obtain a final similar text retrieval result.
2. A method for similar text retrieval according to claim 1 wherein the processing flow of the relation-propagating encoder module comprises:
for the input training text, a multi-layer feed-forward network with a ReLU activation layer is used and a deep characterization of the text is obtained:
wherein t 1 denotes an intermediate representation, BoW(x) denotes the feature representation of the training text x under the bag-of-words model, ReLU denotes the rectified linear unit, W 1 and W 2 denote the weight parameters of the multi-layer feedforward network with the ReLU activation layer, b 1 and b 2 denote its bias parameters, and t 2 denotes the depth representation;
then, a first-class representation l and a second-class representation are obtained in turn through two feedforward networks with tanh activation layers:
wherein tanh denotes the hyperbolic tangent function, e denotes the natural constant, a hyper-parameter is involved, W 3 and b 3 denote the weight and bias parameters of the first feedforward network with a tanh activation layer, and W 4 and b 4 denote the weight and bias parameters of the second feedforward network with a tanh activation layer;
and processing the numerical value of each dimension in the second type of representation by using a median method to obtain the hash code.
3. A method for similar text retrieval according to claim 1 or 2, wherein the correlation propagation loss function is calculated using the generated first class representation and the second class representation, and is represented as:
wherein L rp denotes the correlation propagation loss function, the dimension length of the first-class representation appears in the normalization, N B denotes the number of training texts in the current training batch, l k and l j denote the first-class representations of the k-th and j-th training texts of the current training batch, the corresponding second-class representations of the k-th and j-th training texts also enter the formula, and b denotes the dimension length of the second-class representation, which equals the dimension length of the hash code.
4. The method for retrieving similar texts according to claim 1, wherein the global equalization optimization module stores hash codes corresponding to all training texts, and performs optimization guidance on global equalization information on the hash codes newly generated in the training process includes:
storing the hash codes corresponding to all the training texts in a global storage module M, wherein each training text has a corresponding storage location in the global storage module M, and the storage location of the i-th training text x i among all training texts is recorded as M i ;
In each training batch, selecting partial hash codes from the global storage module M for calculating the weight of global balance information, in the following manner: setting a timer for each storage position, initializing the timer to be 0, and adding 1 to the value of the timer before each training batch starts; in the current training batch, if the training text corresponding to a certain storage position belongs to the training text of the current training batch, resetting the timer value of the corresponding storage position to be 0; if the training texts do not belong to the current training batch, judging whether the timer value of the corresponding storage position meets a set extraction condition value, and if so, selecting the hash code stored in the corresponding storage position;
5. The method for retrieving similar texts according to claim 1 or 4, wherein said global equalization optimization loss function comprises: a bit equalization loss function and a bit independent loss function; the global equalization information includes: bit equality and bit independence;
bit equalization means that each dimension of the hash code takes the value 1 or -1 with equal probability; the set of selected hash codes is used to calculate a bit-equalization weight for the global case, expressed as:
wherein the set is the selected set of partial hash codes, each summand is the value of the c-th dimension in the t-th hash code of the set, b denotes the dimension length of the hash code, the set cardinality is the number of hash codes in the set, and the result denotes the bit-equalization weight of the c-th dimension in the hash code;
using the bit-equalization weight to constrain the second-class representations corresponding to the training texts in the current batch, obtaining a bit-equalization loss function L bb , expressed as:
wherein N B denotes the number of training texts in the current training batch, the summand is the value of the c-th dimension in the second-class representation corresponding to the k-th training text in the current batch, and the dimension length of the second-class representation equals the dimension length of the hash code;
bit independence means that any two dimensions of a hash code are independent of each other; the set of selected hash codes is used to calculate a bit-independence weight A measuring the global condition, expressed as:
wherein, I represents an identity matrix; t is a transposed symbol;
utilizing the coefficient A of the bit-independent condition to restrain the second class of characteristics corresponding to the training texts in the current batch to obtain a bit-independent loss function L bd Expressed as:
wherein S denotes the set of second-class representations corresponding to the training texts in the current batch.
6. A method for similar text retrieval as in claim 1 wherein said noise-aware decoder module reconstructs the corresponding training text using the second class of features corresponding to the training text and the noise text respectively comprises:
calculating, using the second-class representation corresponding to the training text, the probability that the word at each position in the training text is each word in the word set, selecting the word of the word set with the maximum probability as the word at that position, and reconstructing the corresponding training text; for the training text x with its corresponding second-class representation, the probability that the word x (u) at the u-th position is the v-th word w v of the word set is expressed as:
wherein T denotes the transpose symbol; the word set of the text and its total number of words appear in the normalization over all candidate words; the one-hot codes of the word w v and of each other word of the word set, together with their corresponding bias parameters, enter the scores; and W denotes a word-embedding matrix;
the training text x corresponds to n noisy texts, represented as a noisy text set P(x) with a corresponding set of second-class representations, wherein the h-th noisy text and its second-class representation are the h-th elements of these sets, h = 1, 2, ..., n; the set of second-class representations is used to reconstruct the training text, obtaining a set of reconstructed training texts whose h-th element is the training text reconstructed with the h-th noisy text.
7. A method of similar text retrieval as in claim 6 wherein said reconstruction loss function comprises: a first reconstruction loss function and a second reconstruction loss function; wherein:
calculating a first reconstruction loss function using the second-class representations corresponding to the training texts, expressed as:
wherein L rec denotes the first reconstruction loss function; the sum over positions u is normalized by the total number of words of the training text x; each term involves the word at the u-th position of the reconstructed training text and its probability; E denotes the mathematical expectation; and x~D x denotes that the training text x follows the text data distribution D x ;
calculating a semantic correlation coefficient between the first-class representation of each noisy text and the corresponding training text, and calculating a second reconstruction loss function by combining the training texts reconstructed from the noisy texts, expressed as:
wherein L rec_noise denotes the second reconstruction loss function; each inner term involves the word at the u-th position in the training text reconstructed with the h-th noisy text and the probability of that word; and the semantic correlation coefficient of the training text x, calculated using the first-class representation corresponding to the h-th noisy text, is computed as follows:
8. A similar text retrieval system realized based on the method of any one of claims 1 to 7, the system comprising:
the model building and training unit is used for building a hash mapping model and training it in an unsupervised mode; the hash mapping model comprises: a relation-propagation encoder module, a global equalization optimization module and a noise-aware decoder module; during training, the relation-propagation encoder module takes as input the training texts and the noisy texts generated from them, sequentially generates a first-class representation and a second-class representation for each training text, generates the corresponding hash code using the second-class representation, and calculates a correlation propagation loss function using the generated first-class and second-class representations, wherein the dimension of the first-class representation is higher than that of the second-class representation; for each noisy text, a first-class representation and a second-class representation are likewise generated in turn; the global equalization optimization module stores the hash codes corresponding to all training texts, applies optimization guidance of global equalization information to the hash codes newly generated during training, and calculates a global equalization optimization loss function; the noise-aware decoder module reconstructs the training texts from the second-class representations of the training texts and of the noisy texts respectively, and calculates a reconstruction loss function using the reconstructed training texts and the correlation between each noisy text and its corresponding training text; the overall loss function for training is constructed by combining the correlation propagation loss function, the global equalization optimization loss function and the reconstruction loss function;
the hash table construction unit is used for respectively generating a hash code of each candidate text by utilizing an encoder module of the relation propagation in the trained hash mapping model and constructing a hash table;
and the retrieval unit is used for generating a hash code for the input query text by utilizing an encoder module propagated by the relation in the trained hash mapping model, inquiring in the hash table, and performing correlation evaluation on an initial result obtained by inquiry to obtain a final similar text retrieval result.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211452104.7A CN115495546B (en) | 2022-11-21 | 2022-11-21 | Similar text retrieval method, system, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115495546A true CN115495546A (en) | 2022-12-20 |
CN115495546B CN115495546B (en) | 2023-04-07 |
Family
ID=85116261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211452104.7A Active CN115495546B (en) | 2022-11-21 | 2022-11-21 | Similar text retrieval method, system, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115495546B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116051686A (en) * | 2023-01-13 | 2023-05-02 | 中国科学技术大学 | Method, system, equipment and storage medium for erasing characters on graph |
- 2022-11-21: CN application CN202211452104.7A granted as patent CN115495546B (status: Active)
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104657350A (en) * | 2015-03-04 | 2015-05-27 | Institute of Automation, Chinese Academy of Sciences | Hash learning method for short text incorporating implicit semantic features |
US20180341720A1 (en) * | 2017-05-24 | 2018-11-29 | International Business Machines Corporation | Neural Bit Embeddings for Graphs |
US20220179891A1 (en) * | 2019-04-09 | 2022-06-09 | University of Washington | Systems and methods for providing similarity-based retrieval of information stored in DNA |
CN110457503A (en) * | 2019-07-31 | 2019-11-15 | Peking University | Rapid-optimization deep hashing image coding method and target image retrieval method |
CN110659375A (en) * | 2019-09-20 | 2020-01-07 | University of Science and Technology of China | Hash model training method, similar object retrieval method and device |
CN112256727A (en) * | 2020-10-19 | 2021-01-22 | Northeastern University | Database query processing and optimization method based on artificial intelligence technology |
CN113392180A (en) * | 2021-01-07 | 2021-09-14 | Tencent Technology (Shenzhen) Co., Ltd. | Text processing method, device, equipment and storage medium |
CN113342922A (en) * | 2021-06-17 | 2021-09-03 | Beijing University of Posts and Telecommunications | Cross-modal retrieval method based on fine-grained label self-supervision |
CN113449849A (en) * | 2021-06-29 | 2021-09-28 | Guilin University of Electronic Technology | Learned text hashing method based on an autoencoder |
CN113821527A (en) * | 2021-06-30 | 2021-12-21 | Tencent Technology (Shenzhen) Co., Ltd. | Hash code generation method and device, computer equipment and storage medium |
CN114328818A (en) * | 2021-11-25 | 2022-04-12 | Tencent Technology (Shenzhen) Co., Ltd. | Text corpus processing method and device, storage medium and electronic equipment |
Non-Patent Citations (3)
Title |
---|
NGHI D. Q. BUI et al.: "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations", Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval * |
ZHIWEN LI et al.: "Deep Hash Model for Similarity Text Retrieval", 2022 5th International Conference on Artificial Intelligence and Big Data * |
ZOU AO et al.: "Text Representation Learning Based on Deep Hashing", Computer Systems & Applications * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116051686A (en) * | 2023-01-13 | 2023-05-02 | 中国科学技术大学 | Method, system, equipment and storage medium for erasing characters on graph |
CN116051686B (en) * | 2023-01-13 | 2023-08-01 | 中国科学技术大学 | Method, system, equipment and storage medium for erasing characters on graph |
Also Published As
Publication number | Publication date |
---|---|
CN115495546B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210004682A1 (en) | Adapting a sequence model for use in predicting future device interactions with a computing system | |
Wang | Bankruptcy prediction using machine learning | |
CN109960738B (en) | Large-scale remote sensing image content retrieval method based on depth countermeasure hash learning | |
CN110781409B (en) | Article recommendation method based on collaborative filtering | |
CN111414461A (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN111310439A (en) | Intelligent semantic matching method and device based on depth feature dimension-changing mechanism | |
CN112732864B (en) | Document retrieval method based on dense pseudo query vector representation | |
Cummings et al. | Structured citation trend prediction using graph neural networks | |
CN115495546B (en) | Similar text retrieval method, system, device and storage medium | |
CN113222139A (en) | Neural network training method, device and equipment and computer storage medium | |
CN111930931A (en) | Abstract evaluation method and device | |
Seo et al. | Reliable knowledge graph path representation learning | |
Goswami et al. | Filter-based feature selection methods using hill climbing approach | |
CN111966811A (en) | Intention recognition and slot filling method and device, readable storage medium and terminal equipment | |
Yao et al. | Hash bit selection with reinforcement learning for image retrieval | |
Ko et al. | MASCOT: A Quantization Framework for Efficient Matrix Factorization in Recommender Systems | |
Zeng et al. | Pyramid hybrid pooling quantization for efficient fine-grained image retrieval | |
CN116720519B | Miao medicine named entity recognition method |
CN112711648A (en) | Database character string ciphertext storage method, electronic device and medium | |
CN111274359B (en) | Query recommendation method and system based on improved VHRED and reinforcement learning | |
Sahu et al. | Forecasting currency exchange rate time series with fireworks-algorithm-based higher order neural network with special attention to training data enrichment | |
Qiang et al. | Large-scale multi-label image retrieval using residual network with hash layer | |
Cui et al. | Deep hashing with multi-central ranking loss for multi-label image retrieval | |
CN113326393B (en) | Image retrieval method based on deep hash feature and heterogeneous parallel processing | |
CN117609632B (en) | Tobacco legal service method and system based on Internet technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||