CN116662490A - Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information - Google Patents
Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information
- Publication number: CN116662490A
- Application number: CN202310956922.9A
- Authority: CN (China)
- Prior art keywords: hash, class, sample, hierarchical, vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3344—Query execution using natural language analysis
- G06F16/325—Hash tables (indexing structures for unstructured textual data)
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F40/30—Semantic analysis
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Neural network learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A confusion-removing text hashing algorithm and device fusing hierarchical label information, belonging to the technical fields of natural language processing and information retrieval. Targeting the characteristics of texts with hierarchical labels, the invention constructs hierarchical similarity relations in the hash space through multiple losses; in addition, to prevent the hashing algorithm from being influenced during encoding by samples whose category and semantic similarity are inconsistent, the technical idea of confusion removal is introduced. The invention establishes effective hierarchical similarity relations in the hash space, adapts better to real nearest-neighbor retrieval scenarios, and effectively uses label information to construct a hierarchical hash space, so that the hash codes, while remaining consistent with the original semantic similarity, exhibit at each level a hierarchical spatial distribution in which samples with the same label aggregate and samples with different labels disperse. The invention also effectively increases the robustness of the model in practical use.
Description
Technical Field
The invention relates to a confusion-removing text hashing algorithm and device fusing hierarchical label information, and belongs to the technical fields of natural language processing and information retrieval.
Background
Text hashing is a method that maps documents from an original high-dimensional symbolic feature space to a low-dimensional binary address space, so that semantically similar documents map to nearby addresses. Distances between hash codes are measured by the Hamming distance; compared with the traditionally expensive and time-consuming computation of similarity in Euclidean space, text hashing only needs to compare binary hash codes by Hamming distance, which markedly improves retrieval efficiency. Generally, the category labels of texts can be used to assist the construction of hash codes, so that hash codes of texts in the same category remain close. Hierarchical text hashing is a special form of text hashing that hashes texts carrying hierarchical category labels. Texts with hierarchical categories match real-world semantic distributions: in news, for example, articles on basketball, football, and table tennis under the "sports" parent category remain semantically aggregated within that parent category while the different ball subcategories stay separated from one another, and all of them stay separated from texts under other parent categories such as "economy". Since hash retrieval is a nearest-neighbor retrieval technique, preserving the hierarchical neighbor relations of real scenarios is important. Hierarchical text hashing therefore has significant application value in large-scale information retrieval.
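The efficiency argument above can be illustrated with a minimal sketch (not part of the patent): comparing two binary hash codes by Hamming distance is a simple bit-difference count, far cheaper than Euclidean similarity over dense vectors.

```python
def hamming_distance(a, b):
    """Hamming distance between two equal-length binary hash codes (lists of 0/1)."""
    assert len(a) == len(b), "hash codes must have the same length"
    return sum(x != y for x, y in zip(a, b))

# Two 8-bit hash codes differing in two positions (positions 2 and 7).
code_query = [1, 0, 1, 1, 0, 0, 1, 0]
code_doc   = [1, 0, 0, 1, 0, 0, 1, 1]
dist = hamming_distance(code_query, code_doc)
```

In practice the codes would be packed into machine words and compared with XOR plus popcount, but the semantics are the same.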
In the hierarchical text hashing problem, building effective hierarchical similarity constraints in the hash space is quite challenging. Unlike non-hierarchical text hashing scenarios, hierarchical text hashing requires the intelligent model to build a label-based hierarchical neighbor relation in the hash space, where the neighbor relation is expressed as a similarity relation between hash codes. Introducing hierarchical label information merely through multi-label prediction on the encoded hash codes is far from sufficient, as this approach establishes no explicit hierarchical similarity constraint in the hash space. In addition, fuzzy samples whose semantic similarity is inconsistent with their category information sometimes appear during hierarchical text hashing and reduce the generalization performance of the model. The hash model therefore aims to solve two challenges: establishing effective hierarchical similarity in the hash codes, and weakening the influence of confusing signals with inconsistent semantic and category similarity on the performance of the hash model, thereby improving its generalization ability and robustness.
In the prior art, there have been many heuristic explorations in text hashing:
Chinese patent document CN113449849A discloses a self-encoder-based text hash learning method, which uses a hash model with an autoencoder structure to construct the text hash model. The method only reconstructs the semantic information of the text and cannot guarantee that documents of different categories are separated in the hash space.
Chinese patent document CN110955745A discloses a deep-learning-based text hash retrieval method, which attaches a classification layer to the encoded hash codes to integrate category information. In this way, no explicit similarity constraint is established in the hash space, and only the integration of flat categories, not hierarchical category information, is considered. This patent also does not handle fuzzy samples whose semantic and category similarity are inconsistent.
In summary, it remains difficult for the prior art to satisfy the hierarchical classification requirements of hierarchical text hashing.
Disclosure of Invention
The invention discloses a confusion-removing text hash algorithm for fusing hierarchical label information.
The invention also discloses a device for realizing the confusion-removing text hash algorithm.
Aiming at the text characteristics with layering labels, the invention constructs layering similarity relation in the hash space through multiple losses; in addition, in order to prevent the hash algorithm from being influenced by samples with inconsistent category and semantic similarity in the encoding process, a technical thought of confusion removal is introduced.
The detailed technical scheme of the invention is as follows:
The confusion-removing text hashing algorithm fusing hierarchical label information is characterized by comprising the following steps:
S1: acquire the surface features x of the text; preferably, the surface features are represented using term frequency and inverse document frequency (TF-IDF);
the surface features x are computed as follows:
S11: segment the texts in the corpus with the coarse-grained tokenizer of the open-source segmentation tool HanLP;
S12: remove common Chinese stop words from the segmentation results, build a vocabulary from the top-V words ranked by descending frequency, and take each text's TF-IDF feature vector over this vocabulary as its surface feature x;
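As a rough sketch of steps S11 and S12 (assuming tokenization and stop-word removal have already been applied upstream; HanLP itself is not used here, and the exact TF/IDF weighting variant is an assumption), the vocabulary is built from the top-V most frequent words and each text is represented by its TF-IDF vector over that vocabulary:

```python
import math
from collections import Counter

def tfidf_features(token_docs, vocab_size):
    """Build the top-`vocab_size` vocabulary by corpus frequency, then one
    TF-IDF vector per document over that vocabulary (simplified sketch)."""
    counts = Counter(w for doc in token_docs for w in doc)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    n_docs = len(token_docs)
    # document frequency of each vocabulary word
    df = {w: sum(1 for doc in token_docs if w in doc) for w in vocab}
    feats = []
    for doc in token_docs:
        tf = Counter(doc)
        feats.append([
            (tf[w] / len(doc)) * math.log(n_docs / df[w]) if doc else 0.0
            for w in vocab
        ])
    return vocab, feats

# Toy corpus: two "sports" texts and one "economy" text, already tokenized.
docs = [["ball", "game", "score"], ["ball", "match"], ["stock", "market", "stock"]]
vocab, X = tfidf_features(docs, vocab_size=4)
```

In the patent the resulting vectors are additionally L2-normalized before being fed to the hash model.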
S2: given a set of text surface features with hierarchical category labels {(x_i, y_i, Y_i)}, i = 1, ..., N, where x_i is the L2-regularized TF-IDF feature vector; C and P are respectively the numbers of distinct subclass labels and distinct parent-class labels in the dataset; the c-th component of the subclass label vector y_i corresponds to subclass c, and a value of 1 indicates that sample i belongs to subclass c; similarly, the p-th component of the parent label vector Y_i corresponds to parent class p, and a value of 1 indicates that sample i belongs to parent class p; category label information is available during the training phase and unavailable during the testing phase; the hierarchical text hashing task is to learn a hash model that maps the surface features x_i to a K-dimensional binary vector b_i ∈ {0,1}^K, i.e. b_i = H(x_i), such that hash codes of samples in the same subclass are close in distance, and hash codes of samples in the same parent class are likewise close;
S21: the constructed target hash codes satisfy the following label-based hierarchical distances.
Given hash codes b_i, b_j, b_k, b_l whose pairwise combinations satisfy the following parent-class and subclass relations:
- b_i and b_j belong to the same parent class and the same subclass;
- b_i and b_k belong to the same parent class but different subclasses;
- b_i and b_l belong to different parent classes and different subclasses;
i.e. Y_i = Y_j = Y_k ≠ Y_l and y_i = y_j ≠ y_k ≠ y_l, where Y and y denote the parent and child labels of the samples;
then the hierarchical distance constraints constructed for the hash codes are:

d(b_i, b_j) + m < d(b_i, b_k);
d(b_i, b_k) + m < d(b_i, b_l);

where m is a hyperparameter controlling the distance gap and d(·,·) is the Euclidean distance between the two hash codes being compared. Because hash codes are binary, the values of corresponding dimensions of different hash codes jump across the full span from 0 to 1, which makes optimization under Euclidean distance constraints difficult; the distance constraints are therefore imposed in a continuous space. Since for any pair the distances d(b_i, b_j) and d(z_i, z_j) are positively correlated, the hierarchical distance conditions above are equivalent to satisfying the following two levels of constraints:

d(z_i, z_j) + m_c < d(z_i, z_k);
d(z_i, z_k) + m_p < d(z_i, z_l);

where z_i is the pre-binarization hidden vector corresponding to the hash code b_i of the anchor sample, z_i ∈ Z, and Z is the set of hidden vectors constructed for the whole training set; z_j is drawn from a randomly sampled set of hidden vectors of samples sharing the anchor's parent class and subclass; z_k is drawn from a randomly sampled set of hidden vectors of samples sharing the anchor's parent class but not its subclass; z_l is drawn from a randomly sampled set of hidden vectors of samples from different parent classes; m_c and m_p are hyperparameters.
A distance constraint is imposed at each level; the hierarchical distance constraints translate into optimizing the following objective function:

L_c = E[ max(0, d(z, z_c+) - d(z, z_c-) + m_c) ] (1)
L_hier = L_c + α · E[ max(0, d(z, z_p+) - d(z, z_p-) + m_p) ] (2)

In formulas (1) and (2), L_hier is the total hierarchical distance-constraint loss over the training set; z is the pre-binarization hidden vector of the anchor sample; z_+ and z_- are pre-binarization hidden vectors that the anchor hidden vector should be close to and far from, respectively; m_c and m_p are hyperparameters controlling the size of the distance gap; α is a hyperparameter weighting the parent-class distance constraint; E[·] is the expectation; the two expectation terms are the contrastive losses at the subclass and parent-class levels computed over anchor samples. During training, the parameters of the hashing algorithm are updated in a Siamese-network fashion, so that under the same model parameters, while preserving semantic similarity, samples of the same parent class and subclass are closer in the hash space than samples of the same parent class but different subclasses, which are in turn closer than samples of different parent classes, completing the construction of the hierarchical similarity constraint in the hash space.
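The two-level distance constraint described above can be sketched as hinge-style triplet losses in the continuous hidden space. The names and exact loss form below are assumptions for illustration; only the constraint structure (same-subclass pairs closer than same-parent/other-subclass pairs, which are closer than different-parent pairs) is taken from the text:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two hidden vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_hinge(anchor, pos, neg, margin):
    """Hinge loss enforcing d(anchor, pos) + margin <= d(anchor, neg)."""
    return max(0.0, sq_dist(anchor, pos) - sq_dist(anchor, neg) + margin)

def hierarchical_loss(anchor, same_sub, same_parent_diff_sub, diff_parent,
                      m_child=0.5, m_parent=0.5, alpha=1.0):
    """Subclass level: same-subclass closer than same-parent/other-subclass;
    parent level: same-parent closer than different-parent."""
    l_child = triplet_hinge(anchor, same_sub, same_parent_diff_sub, m_child)
    l_parent = triplet_hinge(anchor, same_parent_diff_sub, diff_parent, m_parent)
    return l_child + alpha * l_parent

a  = [0.9, 0.1]   # anchor hidden vector
p1 = [0.8, 0.2]   # same parent, same subclass
p2 = [0.6, 0.4]   # same parent, different subclass
n  = [0.1, 0.9]   # different parent
loss = hierarchical_loss(a, p1, p2, n)
```

In the patent these hinges are averaged (the expectations in formulas (1) and (2)) over randomly sampled sets of positives and negatives rather than single vectors.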
The hashing algorithm further comprises step S3, the construction of the confusion-removing constraint, as follows:
The core idea of the confusion-removing supervision method has two parts: first, samples that are semantically similar but belong to different categories need to be pushed apart in the corresponding hash space; second, samples that are less semantically similar but belong to the same category need to be pulled together in the corresponding hash space. Consistent with the foregoing, the similarity relations between hash codes b are constructed through the continuous hidden vectors z.
Assume that the samples of each parent class and each subclass have a semantic center in the hidden space.
S31: compute the semantic center of each category from the hidden vectors of its samples during training:

μ_P = E_{i: sample i in parent class P}[ z_i ] ; μ_c = E_{i: sample i in subclass c}[ z_i ] (3)

In formula (3), μ_P denotes the latent semantic center of parent class P; μ_c denotes the latent semantic center of subclass c; E[·] is the expectation.
S32: with the latent semantic centers of the categories defined, compute for each parent class and each subclass a fuzzy radius based on its semantic center:

r_P = β · E_{i: sample i in parent class P}[ d(z_i, μ_P) ] ; r_c = β · E_{i: sample i in subclass c}[ d(z_i, μ_c) ] (4)

In formula (4), r_P is the semantic fuzzy radius of parent class P; r_c is the semantic fuzzy radius of subclass c; β is a hyperparameter that can be tuned automatically according to the data; E[·] is the expectation.
S33: same-category samples whose distance to the semantic center exceeds the fuzzy radius, and different-category samples whose distance to the semantic center is below the fuzzy radius, carry confusing supervision information. Based on this idea, a distance constraint analogous to the hierarchical similarity constraint is introduced into the training step for each sample:

L_deconf = E[ max(0, d(μ_c, z_c+) - d(μ_c, z_c-) + m_c) ] + E[ max(0, d(μ_P, z_P+) - d(μ_P, z_P-) + m_p) ] (5)

In formula (5), L_deconf is the total confusion-removing supervision constraint loss over the training set; E[·] is the expectation; the two terms are the expected confusion-removing contrastive losses at the subclass and parent-class levels computed from the semantic centers; μ_P and μ_c are the associated parent-class and subclass semantic centers; z_c+ and z_P+ are hidden vectors randomly sampled from same-category samples, and z_c- and z_P- are hidden vectors randomly sampled from different-category samples, satisfying:

d(μ_P, z_P+) > r_P, d(μ_P, z_P-) < r_P; d(μ_c, z_c+) > r_c, d(μ_c, z_c-) < r_c;

where r_P and r_c are the fuzzy radii of the corresponding parent-class and subclass hidden vectors; likewise, the hashing algorithm is optimized in a Siamese-network fashion during training.
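A minimal sketch of steps S31 through S33, assuming the semantic center is the mean hidden vector of a class and the fuzzy radius scales the mean distance to that center (the exact formulas are lost as image placeholders in the source, so these forms are assumptions):

```python
import math

def center(vectors):
    """Mean hidden vector of one class (assumed form of the semantic center)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(u, v):
    """Euclidean distance between two hidden vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fuzzy_radius(vectors, c, beta=1.0):
    """beta * mean distance of class members to their center (assumed form)."""
    return beta * sum(dist(v, c) for v in vectors) / len(vectors)

def confusing_samples(same_class, other_class, c, radius):
    """Same-class samples outside the radius and other-class samples inside it
    carry confusing supervision and receive the extra de-confusion constraint."""
    far_same = [v for v in same_class if dist(v, c) > radius]
    near_other = [v for v in other_class if dist(v, c) < radius]
    return far_same, near_other

same = [[0.0, 0.0], [0.2, 0.0], [2.0, 0.0]]   # one class, one outlier
other = [[0.5, 0.0], [3.0, 0.0]]              # another class, one intruder
c = center(same)
r = fuzzy_radius(same, c)
far_same, near_other = confusing_samples(same, other, c, r)
```

The selected `far_same` samples are then pulled toward their center and the `near_other` samples pushed away, mirroring the two-part core idea stated in the text.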
These two parts are the key components of the invention; for the hyperparameters involved, users may keep the default values or customize them to meet actual business needs. The margin hyperparameters m_c and m_p of the sample similarity constraints, the confusion-removing constraint hyperparameters, the weight hyperparameters, and the confusion-boundary hyperparameter β can all be tuned automatically according to the dataset and used in the subsequent training of the confusion-removing text hashing algorithm fusing hierarchical label information;
the distribution of fuzzy samples is shown in FIG. 1;
the pseudocode of the sample selection algorithm for the above procedure is shown in FIG. 2;
The hashing algorithm integrates the semantic information of a document through the Bernoulli variational autoencoder framework. According to Bayes' theorem, a variational autoencoder can be optimized by maximizing the variational lower bound; in the hash model this converts to maximizing the following variational lower bound, i.e. maximizing the reconstruction of the original document while minimizing the KL divergence:

log p(x) ≥ E_{q_φ(b|x)}[ log p_θ(x|b) ] - KL( q_φ(b|x) ‖ p(b) ) (6)

In formula (6), E[·] is the expectation; the inference process is the encoding (approximate posterior) process of the hashing algorithm, i.e. the encoder q_φ(b|x), whose network parameters are denoted φ; the generation process is the decoding process p_θ(x|b), whose network parameters are denoted θ; p(b) is the prior distribution, without loss of generality. Since the Bernoulli variational autoencoder treats the prior as a multivariate Bernoulli distribution, formula (6) is transformed in the concrete procedure into optimizing the following objectives:

L_rec = -E_{q_φ(b|x)}[ log p_θ(x|b) ] (7)
L_KL = KL( q_φ(b|x) ‖ p(b) ) (8)

In formulas (7) and (8), L_rec is the document reconstruction loss, computed with a cross-entropy loss function between the original feature vector and the feature vector reconstructed by the decoding network; L_KL is the relative entropy between the approximate posterior and the prior distribution with fixed parameter p.
The hashing algorithm adopts a label prediction network to integrate the underlying subclass information into the hash code, and optimizes it with the following objective function, where ŷ is the category the network predicts for the sample:

L_label = -E[ y log ŷ ] (9)

Following the above discussion, the hashing algorithm proposed by the invention is optimized with the following objective function:

L = L_rec + L_KL + λ_label · L_label + λ_hier · L_hier + λ_deconf · L_deconf (10)

In formulas (9) and (10), L_label is the subclass label prediction loss of the label prediction network; L_rec and L_KL are the Bernoulli variational autoencoder losses, comprising the document reconstruction loss and the relative entropy (KL) loss; λ_label, λ_hier, and λ_deconf are the weight hyperparameters of the corresponding parts of the objective function; L_hier is the computed hierarchical similarity constraint loss; L_deconf is the computed confusion-removing constraint loss; L is the complete objective function of the hashing algorithm.
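The combined objective of formulas (9) and (10) is a weighted sum of the component losses; the weight names below are assumptions for illustration:

```python
def total_objective(l_recon, l_kl, l_label, l_hier, l_deconf,
                    w_vae=1.0, w_label=1.0, w_hier=1.0, w_deconf=1.0):
    """Weighted sum of the variational (reconstruction + KL), label-prediction,
    hierarchical-similarity, and confusion-removing losses; the w_* weights
    stand in for the per-term hyperparameters mentioned in the text."""
    return (w_vae * (l_recon + l_kl) + w_label * l_label
            + w_hier * l_hier + w_deconf * l_deconf)

# Example: de-emphasize the de-confusion term by halving its weight.
loss = total_objective(0.4, 0.1, 0.2, 0.3, 0.1, w_deconf=0.5)
```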
The device for implementing the confusion-removing text hashing algorithm fusing hierarchical label information is characterized in that:
the whole hash model framework is implemented based on the Bernoulli variational autoencoder framework; the encoder uses a multilayer perceptron (MLP) to map the surface features x of the document to a continuous hidden distribution space, and the hash code is obtained by sampling from the encoded posterior multivariate Bernoulli distribution;
the encoding process comprises the following steps: first, the surface features x are input into the multilayer perceptron and encoded to obtain h:

h = MLP(x) (11)

then the Sigmoid function is applied to obtain the pre-binarization hidden vector z:

z = 1 / (1 + exp(-h)) (12)

in formula (12), exp denotes the exponential function;
the hash code b is obtained by a sampling operation:

b ~ Bernoulli(z) (13)

in formula (13), the approximate posterior associated with the hidden vector is expressed as

q_φ(b|x) = ∏_k z_k^{b_k} (1 - z_k)^{1 - b_k} (14)
The decoder p_θ(x|b) obtains hidden vectors after a linear transformation, then reconstructs the original feature representation through an activation function and predicts the labels:

x̂_w = softmax(e_w^T · W_d · b + c_d) (15)

in formula (15), x̂_w denotes the output of the decoding network's output vector at the position corresponding to word w; W_d is the decoding network parameter matrix; e_w is the one-hot vector of the corresponding dimension; c_d is the network bias term;

ŷ_c = softmax(e_c^T · W_y · b + c_y) (16)

in formula (16), ŷ_c denotes the output at position c of the prediction network's output vector; W_y is the prediction network parameter matrix; e_c is the one-hot vector of the corresponding dimension; c_y is the network bias term;

L_rec = -Σ_w x_w log x̂_w (17)
L_label = -Σ_c y_c log ŷ_c (18)

In formulas (11) to (18), the softmax activation is replaced by the Sigmoid function when computing in the multi-label scenario, and the sampling process is replaced by the following form:

b = 1[ z ≥ τ ] (19)

in formula (19), τ is the fixed threshold sampling value and the indicator function 1[·] completes the sampling from the hidden distribution; a straight-through estimator is used during training to handle the non-differentiability of the binarization operation.
In the specific implementation details, two hidden layers, each containing 1000 neurons activated by a leaky ReLU, form the basic network structure of the MLP, and dropout is applied after the second layer to prevent overfitting of the hash model, with the dropout probability set to 0.1. A sigmoid activation function then realizes the approximation of the posterior Bernoulli distribution, and a fixed threshold τ is used with the indicator function 1[·] to sample the binary hash code, completing the construction of the inference network. For the generation network, one linear transformation layer reconstructs the original feature vector and predicts the subclass label vector; a Softmax function activates the network output when reconstructing the original feature vector; when predicting subclass labels, Softmax activation is used for the single-label scenario, while for the multi-label scenario the Sigmoid function is used for activation and the loss is computed as a Euclidean distance loss instead. The inference network (encoder) and the generation network (decoder) are used together only in the training phase, to complete the optimization under the variational autoencoder framework; in the practical application phase, only the encoder is used: the surface features x are input and encoded to obtain the hash code.
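The encoding path of formulas (11) to (13) and (19) can be sketched as sigmoid activation followed by fixed-threshold binarization. This forward-only sketch omits the MLP weights and the straight-through estimator, which only matters for gradients during training:

```python
import math

def sigmoid(h):
    """Logistic function of formula (12)."""
    return 1.0 / (1.0 + math.exp(-h))

def encode_to_hash(pre_activations, threshold=0.5):
    """Sigmoid-activate the MLP output to get the pre-binarization hidden
    vector z, then binarize with a fixed threshold as in formula (19)."""
    z = [sigmoid(h) for h in pre_activations]
    b = [1 if zk >= threshold else 0 for zk in z]
    return z, b

# Strong positive/negative logits map to confident bits; values near 0 sit
# close to the 0.5 threshold.
z, b = encode_to_hash([2.0, -2.0, 0.1, -0.1])
```

During training, a straight-through estimator would copy the gradient of the loss with respect to b directly onto z, bypassing the non-differentiable threshold.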
The overall framework of the hashing device is shown in FIG. 3, and the training procedure is shown in FIG. 4.
The technical advantages of the invention include:
Compared with methods in the prior art, the invention adapts better to real nearest-neighbor retrieval scenarios and effectively uses label information to construct a hierarchical hash space, so that the hash codes, while remaining consistent with the original semantic similarity, exhibit at each level a hierarchical spatial distribution in which samples with the same label aggregate and samples with different labels disperse.
The confusion-removing constraint, proposed by the invention for this practical problem, helps the hash model weaken during training the influence of confusing information arising from inconsistent category and semantics, effectively increasing the robustness of the model in actual use.
The invention can also be used for other similar tasks in the field of natural language processing, such as hierarchical text representation learning, hierarchical text classification, and other related tasks based on hierarchical label scenarios.
Drawings
FIG. 1 is a schematic diagram of the distribution of fuzzy samples in a hierarchical hash scenario of the present invention;
FIG. 2 is an algorithmic pseudocode of a sample selection method for modeling samples and hierarchical similarity hash space construction in accordance with the present invention;
FIG. 3 is a schematic diagram of the overall design framework of the confusion-removed text hashing algorithm incorporating hierarchical label information in the present invention;
FIG. 4 is the training-algorithm pseudocode of the confusion-removing text hashing algorithm fusing hierarchical label information according to the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
Example 1
A defrobulated text hashing algorithm that fuses hierarchical tag information, comprising:
s1: acquiring surface features of textEnglish feature, preferably, the surface layer features are expressed by using word frequency and inverse document frequency;
the surface featuresThe calculation method of (1) comprises the following steps:
s11: utilizing a coarse granularity word segmentation device of an open source word segmentation tool HanLP to segment texts in a language database;
s12: removing common Chinese stop words from word segmentation results, and obtaining word dictionary with front-ordered words from large to small according to word frequencyTF-IDF feature vector of each text as surface feature +.>;
S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>The method comprises the steps of carrying out a first treatment on the surface of the Category label information is available during the training phase and not available during the testing phase; the hierarchical text hash task is to learn a hash model to add the surface features +.>Mapping to a +.>Binary vector of dimension->The formula is +.>And satisfies the requirement for ++>Its corresponding hash code +>Distance is similar, for->Its corresponding hash code +>The distances are also similar;
S21: the constructed target hash codes satisfy the following label-based hierarchical distances:
given hash codes combined pairwise, the parent class and subclass to which each pair belongs satisfy one of the following conditions:
the two codes belong to the same parent class and the same subclass;
the two codes belong to different subclasses;
the two codes belong to the same parent class but to different subclasses;
the two codes belong to different parent classes and different subclasses;
that is, which condition holds is determined by the parent-class tag y^p and the subclass tag y^c corresponding to each sample;
then, the hierarchical distance constraint constructed on the hash codes is:
d(b_a, b+_c) + m ≤ d(b_a, b−_c);
d(b_a, b+_p) + m ≤ d(b_a, b−_p);
where m is the hyperparameter controlling the distance margin; d(·,·) is the Euclidean distance between two vectors, its two arguments being the two hash codes whose distance is computed; b+_c and b−_c denote codes of samples in the same subclass as, and in a different subclass (same parent class) from, the anchor code b_a, while b+_p and b−_p denote codes of samples in the same and in a different parent class, respectively; because hash codes are binary, the values of corresponding dimensions of different hash codes jump between 0 and 1, so optimization under a Euclidean distance constraint remains difficult, and the distance constraint is therefore imposed in a continuous space; since the distance between any two hidden vectors and the distance between their binarizations are positively correlated, the above hierarchical distance constraint is equivalent to satisfying the following two hierarchical constraint relations:
E[d(z_a, Z+_c)] + m' ≤ E[d(z_a, Z−_c)];
E[d(z_a, Z+_p)] + m' ≤ E[d(z_a, Z−_p)];
where z_a is the pre-binarization hidden vector of the hash code b_a of the anchor sample, z_a ∈ Z, and Z is the hidden-vector set constructed for the whole training set; Z+_c is the randomly sampled hidden-vector set of samples sharing the anchor's parent class and subclass; Z−_c is the randomly sampled hidden-vector set of samples sharing the anchor's parent class but not its subclass; Z+_p is the randomly sampled hidden-vector set of samples sharing the anchor's parent class; Z−_p is the randomly sampled hidden-vector set of samples from a different parent class; m' is a hyperparameter;
a distance constraint is imposed at each level: the hierarchical distance constraint is thus converted into optimizing the following objective function:
L_sub = E[max(0, d(z_a, z+_c) − d(z_a, z−_c) + m')], L_par = E[max(0, d(z_a, z+_p) − d(z_a, z−_p) + m')]    (1)
L_tri = L_sub + β·L_par    (2)
in formulas (1) and (2), L_tri is the total hierarchical distance constraint loss on the training set; z_a is the pre-binarization hidden vector corresponding to the anchor sample; z+ and z− are the pre-binarization hidden vectors hoped to be close to and far from the anchor hidden vector, respectively; m' is the hyperparameter controlling the size of the distance margin; β is the hyperparameter weighting the parent-class distance constraint; E is the expectation function; L_par and L_sub are the expected contrastive losses at the parent-class and subclass levels computed over anchor samples, with corresponding hyperparameters m' and β; during training, the parameters of the hash algorithm are updated in a twin (Siamese) network fashion, so that, under the same model parameters, semantic similarity is preserved while samples of the same parent class and subclass lie closer in the hash space than samples of the same parent class but different subclasses, which in turn lie closer than samples of different parent classes, thereby completing the construction of the hierarchical similarity constraint in the hash space.
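The two-level margin loss described above can be sketched as follows. The hinge/triplet form, the default `margin`, and the weight `beta` are our assumptions for illustration, since the patent's formulas are reproduced only as images.

```python
# Sketch of the hierarchical margin loss: the subclass level pulls same-subclass
# samples closer than same-parent/other-subclass samples; the parent level pulls
# same-parent samples closer than other-parent samples.
import math

def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def triplet(anchor, pos, neg, margin):
    """Hinge loss: require `pos` to be `margin` closer to the anchor than `neg`."""
    return max(0.0, euclid(anchor, pos) - euclid(anchor, neg) + margin)

def hierarchical_loss(anchor, same_sub, same_parent_diff_sub, diff_parent,
                      margin=0.5, beta=0.3):
    # subclass-level constraint
    l_sub = triplet(anchor, same_sub, same_parent_diff_sub, margin)
    # parent-level constraint, weighted by beta
    l_par = triplet(anchor, same_parent_diff_sub, diff_parent, margin)
    return l_sub + beta * l_par
```

In the full model these distances are computed between pre-binarization hidden vectors z, averaged over randomly sampled sets, and backpropagated through a twin network.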
Example 2,
The confusion-removing text hashing algorithm fusing hierarchical label information according to embodiment 1, further comprising S3, a construction method of the de-confusion constraint, comprising:
the core idea of the de-confusion supervision method is divided into two parts: first, samples that are semantically similar but belong to different categories need to be pushed far apart in the corresponding hash space; second, samples that are semantically less similar but belong to the same category need to be pulled close in the corresponding hash space; consistently with the foregoing, the similarity relation between hash codes b is constructed through the continuous hidden vectors z:
let the samples of each parent class and each subclass have a semantic center in the hidden space;
S31: during training, the semantic centers are calculated from the hidden vectors of the samples of each category:
μ_p = E[z_i | sample i belongs to parent class p]; μ_c = E[z_i | sample i belongs to subclass c]    (3)
in formula (3), μ_p denotes the latent semantic center of parent class p; μ_c denotes the latent semantic center of subclass c; E is the expectation function.
S32: with the semantic centers of all classes defined, a semantic-center-based fuzzy radius is calculated for each parent class and each subclass:
r_p = γ·E[d(z_i, μ_p)]; r_c = γ·E[d(z_i, μ_c)]    (4)
in formula (4), r_p is the semantic fuzzy radius of parent class p; r_c is the semantic fuzzy radius of subclass c; γ is a hyperparameter that can be adjusted automatically according to the data; E is the expectation function.
S33: the method is characterized in that the same class samples with the distance from the semantic center being larger than the fuzzy radius and the heterogeneous samples with the distance from the semantic center being smaller than the fuzzy radius are provided with confusing supervision information, and based on the concept, the distance constraint similar to the layering similarity constraint is introduced into the training step of each sample:
in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />The function is calculated for the desired purpose. and />The loss expectations of confusion removal and comparison under the parent class and the child class obtained through semantic center calculation are respectively shown; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:
wherein and />Corresponding parent hidden vector fuzzy radius and sub-class hidden vector fuzzy radius respectively; likewise, the hash algorithm is optimized by adopting a twin network mode in the training process.
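Steps S31-S33 can be sketched as follows. How the radius is derived from the center is our assumption (mean distance scaled by a factor `gamma`), since formula (4) survives only as an image; the function names are ours.

```python
# Sketch of S31-S33: per-class semantic center, fuzzy radius, and selection of
# "confusing" samples (same-class vectors outside the radius, other-class
# vectors inside it).
import math

def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def center(vectors):
    # component-wise mean of the class's hidden vectors (semantic center)
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def fuzzy_radius(vectors, mu, gamma=1.0):
    # assumed form: gamma * mean distance of the class's vectors to its center
    return gamma * sum(euclid(v, mu) for v in vectors) / len(vectors)

def confusing_samples(class_vecs, other_vecs, gamma=1.0):
    """Return (hard positives, hard negatives) relative to one class."""
    mu = center(class_vecs)
    r = fuzzy_radius(class_vecs, mu, gamma)
    hard_pos = [v for v in class_vecs if euclid(v, mu) > r]
    hard_neg = [v for v in other_vecs if euclid(v, mu) < r]
    return hard_pos, hard_neg
```

The selected hard positives are pulled toward the center and the hard negatives pushed out by the contrastive terms of formula (5).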
The above two parts are the key components of the invention; the hyperparameters involved may take default values or be customized to meet actual service requirements. The similarity-constraint hyperparameters m' and β, the de-confusion constraint hyperparameters, the weight hyperparameters, and the confusion-boundary hyperparameter γ can all be adjusted automatically according to the data set and are used in the subsequent training of the confusion-removing text hashing algorithm fusing hierarchical label information;
the distribution of fuzzy samples is shown in fig. 1;
the pseudo code of the sample-selection algorithm for the above procedure is shown in fig. 2;
The hash algorithm integrates the semantic information of a document through a Bernoulli variational autoencoder framework; by Bayes' theorem, the variational autoencoder is optimized by maximizing a variational lower bound, which in the hash model becomes maximizing
L_ELBO = E[log p_θ(x|b)] − KL(q_φ(b|x) ‖ p(b))    (6)
i.e. maximizing the reconstruction of the original document while minimizing the KL divergence; in formula (6), E is the expectation function; the inference process is the hash-algorithm encoding (approximate posterior) process, i.e. the encoder q_φ(b|x), whose network parameters are denoted φ; the generation process is the hash-algorithm decoding process p_θ(x|b), whose network parameters are denoted θ; p(b) is a prior distribution, without loss of generality; since the Bernoulli variational autoencoder treats the prior distribution as a multivariate Bernoulli distribution, formula (6) is concretely transformed into the following optimization objectives:
L_rec = CE(x, x̂)    (7)
KL = Σ_j [ z_j·log(z_j/p) + (1 − z_j)·log((1 − z_j)/(1 − p)) ]    (8)
in formulas (7) and (8), L_rec is the document reconstruction loss; KL is the relative entropy; p is the parameter of the prior distribution; x̂ is the feature vector reconstructed by the decoding network; CE is the cross-entropy loss function;
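The two loss terms of formulas (7)-(8) can be sketched numerically as follows, assuming a fixed per-bit Bernoulli prior parameter `p` (e.g. 0.5); the clamping constant `eps` is ours, added for numerical safety.

```python
# Sketch of the Bernoulli VAE objective terms: cross-entropy reconstruction
# of the document features, and KL divergence between the encoded Bernoulli
# posterior (per-bit probabilities q) and a fixed Bernoulli prior p.
import math

def bernoulli_kl(q, p=0.5, eps=1e-9):
    """KL(Bern(q_j) || Bern(p)) summed over independent code bits."""
    kl = 0.0
    for qi in q:
        qi = min(max(qi, eps), 1 - eps)  # clamp away from 0/1
        kl += qi * math.log(qi / p) + (1 - qi) * math.log((1 - qi) / (1 - p))
    return kl

def reconstruction_ce(x, x_hat, eps=1e-9):
    """Cross-entropy between original features x and reconstruction x_hat."""
    return -sum(xi * math.log(min(max(xh, eps), 1 - eps)) +
                (1 - xi) * math.log(min(max(1 - xh, eps), 1 - eps))
                for xi, xh in zip(x, x_hat))
```

A uniform prior (p = 0.5) makes the KL term push each code bit toward being informative rather than collapsing to a constant.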
the hash algorithm uses a label-prediction network to fuse the basic subclass information into the hash code, optimizing it with the following objective function, where ŷ^c is the subclass to which the network predicts the sample belongs:
L_label = CE(y^c, ŷ^c)    (9)
through the above discussion, the hash algorithm proposed by the present invention is optimized with the following objective function:
L_total = L_VAE + λ_1·L_label + λ_2·L_tri + λ_3·L_dc    (10)
in formulas (9) and (10), L_label is the corresponding subclass-label prediction loss; ψ denotes the label-prediction network parameters; L_VAE is the corresponding Bernoulli variational autoencoder loss, comprising the document reconstruction loss L_rec and the relative-entropy (KL) loss; λ_1, λ_2, λ_3 are respectively the weight hyperparameters of each part of the objective function; L_tri is the computed hierarchical similarity constraint loss; L_dc is the computed de-confusion constraint loss; L_total is the complete objective function of the hash algorithm.
Example 3,
An implementation device of the confusion-removing text hashing algorithm fusing hierarchical label information, comprising:
the whole hash model framework is implemented on the Bernoulli variational autoencoder architecture; the encoder q_φ uses a multi-layer perceptron (MLP) to map the surface features x of the document to a continuous hidden distribution space, and the hash code is obtained by sampling from the encoded posterior multivariate Bernoulli distribution;
the coding process comprises the following steps: first, the surface features x are input into the multi-layer perceptron for encoding to obtain h:
h = MLP(x)    (11)
then the Sigmoid function is applied to obtain the pre-binarization hidden vector z:
z = σ(h) = 1/(1 + exp(−h))    (12)
in formula (12), exp denotes the exponential function;
the hash code b is obtained by the sampling operation:
b_j ~ Bernoulli(z_j)    (13)
in formula (13), the approximate posterior associated with the hidden vector is expressed as
q_φ(b|x) = Π_j z_j^{b_j}·(1 − z_j)^{1 − b_j}    (14)
the decoder p_θ obtains the hidden vector after a linear transformation, then reconstructs the original feature representation through an activation function and predicts the label:
x̂_w = softmax(W_d·b + c_d)_w    (15)
in formula (15), x̂_w denotes the output at position w of the decoding-network output vector; W_d is the decoding-network parameter matrix; the reconstruction target is the one-hot vector of the corresponding dimension; c_d is the network bias term;
ŷ_k = softmax(W_l·b + c_l)_k    (16)
in formula (16), ŷ_k denotes the output at position k of the prediction-network output vector; W_l is the prediction-network parameter matrix; the prediction target is the one-hot subclass label vector of the corresponding dimension; c_l is the network bias term;
L_rec = CE(x, x̂)    (17)
L_label = CE(y^c, ŷ)    (18)
in formulas (11)-(18), the Softmax activation is replaced by a Sigmoid function when computing in the multi-label scenario, and the sampling process of formula (13) is replaced by the following form:
b_j = 1[z_j > τ]    (19)
in formula (19), τ is a fixed threshold sampling value, and the indicator function 1[·] completes the sampling from the implicit distribution. A straight-through estimator is used to handle the non-differentiability of the binarization operation during training.
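The fixed-threshold binarization of formula (19) and the straight-through trick can be sketched as follows; the gradient step is shown only schematically (the identity surrogate), with names of our choosing.

```python
# Sketch of formula (19) plus the straight-through estimator: the forward
# pass thresholds the sigmoid output; the backward pass treats binarization
# as the identity, so gradients flow to the pre-binarization vector z.
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def binarize(z, tau=0.5):
    # formula (19): b_j = 1[z_j > tau] with a fixed threshold tau
    return [1.0 if zi > tau else 0.0 for zi in z]

def straight_through_grad(grad_wrt_b):
    # d(loss)/dz is approximated by d(loss)/db (identity surrogate)
    return list(grad_wrt_b)

h = [-2.0, 0.3, 1.5]
z = [sigmoid(hi) for hi in h]
b = binarize(z)
g = straight_through_grad([0.1, -0.2, 0.3])
```

In a deep-learning framework the same effect is usually obtained by detaching the hard code from the graph and adding back the soft probabilities, so only the soft path carries gradient.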
In specific implementation details, two hidden layers of 1000 neurons each, activated with Leaky ReLU, form the basic MLP network structure, and dropout is applied after the second layer to prevent overfitting of the hash model, with the dropout probability set to 0.1. Thereafter, a Sigmoid activation function is used to approximate the Bernoulli posterior distribution, and the binary hash code is sampled with a fixed-threshold function, completing the construction of the inference network. For the generation network, a single linear layer reconstructs the original feature vector and predicts the subclass label vector; a Softmax function activates the network output when reconstructing the original feature vector; when predicting the subclass label, a Softmax function is used for the single-label scenario and a Sigmoid function for the multi-label scenario, with the loss calculation replaced by a Euclidean-distance loss. The inference network (encoder) and the generation network (decoder) are combined only in the training phase to complete the optimization under the variational autoencoder framework; in the practical application phase, only the encoder is used: the surface features x are input and encoded to obtain the hash code.
The whole framework of the implementation of the hash algorithm device is shown in fig. 3, and the training process is shown in fig. 4.
In the above embodiment, the performance of the hash algorithm proposed by the present invention is quantitatively analyzed based on the 32-bit hash code:
The hash algorithm of the invention is trained and tested on the WOS data set (from https://data.mendeley.com/data/9rw3vkcfy4/6); the test results are shown in Table 1, where the Distance column is the Hamming distance between a retrieved sample's hash code and the query sample's hash code, the Domain column is the broad field (understood as the parent class of the sample), the Area column is the subdivided field (understood as the subclass of the sample), the Keywords column gives the corresponding sample keywords, and the Code column gives the corresponding sample hash code:
Table 1: test results (texts are encoded into 32-bit hash codes by the technique of the invention, and the hash code of the query text is used to retrieve the hash codes of other texts, giving the results below).
As can be seen from Table 1, as the Hamming distance increases, the retrieved documents become less relevant. In addition, documents under the same parent class are closer in the hash space than documents under different parent classes, even when they come from different subclasses, exhibiting the hierarchical similarity relation. The results in Table 1 therefore show that the hash algorithm of the invention encodes hash codes that effectively measure the hierarchical similarity between documents, making full use of the hierarchical category label information.
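The retrieval step behind Table 1 can be sketched as follows: rank the corpus by Hamming distance to the query code. Modeling 32-bit codes as integers and the `retrieve` helper are our illustration choices.

```python
# Sketch of Hamming-distance retrieval over 32-bit hash codes stored as ints.
def hamming(a: int, b: int) -> int:
    # count differing bits via XOR
    return bin(a ^ b).count("1")

def retrieve(query_code: int, corpus: dict, k: int = 3):
    """corpus: {doc_id: code}; returns the k nearest (distance, doc_id) pairs."""
    ranked = sorted((hamming(query_code, c), d) for d, c in corpus.items())
    return ranked[:k]

corpus = {"doc_a": 0b1011, "doc_b": 0b1010, "doc_c": 0b0100}
hits = retrieve(0b1011, corpus, k=2)
```

With well-trained codes, increasing Hamming distance in this ranking corresponds to the decreasing relevance observed in Table 1.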
In this embodiment, the retrieval performance of the proposed hash model is evaluated with the Precision@100 index on 16-bit, 32-bit, 64-bit and 128-bit hash codes on the 20 Newsgroups data set (from http://qwone.com/~jason/20Newsgroups); the Method column compares prior hash algorithms with the algorithm of this text, and the 16 bits, 32 bits, 64 bits and 128 bits columns give the model performance at each hash-code length.
Table 2: precision@100 retrieval performance evaluation table
In Table 2, Ours is the performance of the hash algorithm proposed by the present invention; the other prior-art results are cited from the IHDH literature (Guo J N, Mao X L, Wei W, et al. Intra-category aware hierarchical supervised document hashing [J]. IEEE Transactions on Knowledge and Data Engineering, 2022.).
As can be seen from Table 2, the hash algorithm of this text outperforms the other prior art at all four common hash-code lengths in the experiment, which fully demonstrates the advantages of the proposed hash model.
Claims (5)
1. A confusion-removing text hashing algorithm fusing hierarchical label information, characterized by comprising the following steps:
S1: acquiring the surface features x of the text;
S2: given a text surface-feature set with hierarchical category labels D = {(x_i, y_i^c, y_i^p)}, where x_i is the L2-regularized TF-IDF feature vector, and C and P are respectively the numbers of distinct subclass labels and distinct parent-class labels in the data set; the j-th component of the subclass label vector y_i^c corresponds to subclass tag c_j, and this component being 1 indicates that sample i belongs to subclass c_j; similarly, the k-th component of the parent-class label vector y_i^p corresponds to parent-class tag p_k, and this component being 1 indicates that sample i belongs to parent class p_k;
S3: a construction method of the de-confusion constraint, comprising: constructing the similarity relation between hash codes b through the continuous hidden vectors z, the samples of each parent class and each subclass being assumed to have a semantic center in the hidden space.
2. The confusion-removing text hashing algorithm fusing hierarchical label information according to claim 1, characterized in that in S1, the calculation method of the surface features x comprises the following steps:
S11: segmenting the texts in the corpus with the coarse-granularity word segmenter of the open-source word segmentation tool HanLP;
S12: removing Chinese stop words from the segmentation results, building a dictionary from the highest-frequency remaining words ranked in descending order of word frequency, and taking the TF-IDF feature vector of each text over this dictionary as the surface feature x;
S2 further comprises:
the constructed target hash codes satisfy the following label-based hierarchical distances:
given hash codes combined pairwise, the parent class and subclass to which each pair belongs satisfy one of the following conditions:
the two codes belong to the same parent class and the same subclass;
the two codes belong to different subclasses;
the two codes belong to the same parent class but to different subclasses;
the two codes belong to different parent classes and different subclasses;
that is, which condition holds is determined by the parent-class tag y^p and the subclass tag y^c corresponding to each sample;
then, the hierarchical distance constraint constructed on the hash codes is:
d(b_a, b+_c) + m ≤ d(b_a, b−_c);
d(b_a, b+_p) + m ≤ d(b_a, b−_p);
where m is the hyperparameter controlling the distance margin; d(·,·) is the Euclidean distance between two vectors, its two arguments being the two hash codes whose distance is computed; the above hierarchical distance constraint is equivalent to satisfying the following two hierarchical constraint relations:
E[d(z_a, Z+_c)] + m' ≤ E[d(z_a, Z−_c)];
E[d(z_a, Z+_p)] + m' ≤ E[d(z_a, Z−_p)];
where z_a is the pre-binarization hidden vector of the hash code b_a of the anchor sample, z_a ∈ Z, and Z is the hidden-vector set constructed for the whole training set; Z+_c is the randomly sampled hidden-vector set of samples sharing the anchor's parent class and subclass; Z−_c is the randomly sampled hidden-vector set of samples sharing the anchor's parent class but not its subclass; Z+_p is the randomly sampled hidden-vector set of samples sharing the anchor's parent class; Z−_p is the randomly sampled hidden-vector set of samples from a different parent class; m' is a hyperparameter;
a distance constraint is imposed at each level: the hierarchical distance constraint is thus converted into optimizing the following objective function:
L_sub = E[max(0, d(z_a, z+_c) − d(z_a, z−_c) + m')], L_par = E[max(0, d(z_a, z+_p) − d(z_a, z−_p) + m')]    (1)
L_tri = L_sub + β·L_par    (2)
in formulas (1) and (2), L_tri is the total hierarchical distance constraint loss on the training set; z_a is the pre-binarization hidden vector corresponding to the anchor sample; z+ and z− are the pre-binarization hidden vectors hoped to be close to and far from the anchor hidden vector, respectively; m' is the hyperparameter controlling the size of the distance margin; β is the hyperparameter weighting the parent-class distance constraint; E is the expectation function; L_par and L_sub are the expected contrastive losses at the parent-class and subclass levels computed over anchor samples, with corresponding hyperparameters m' and β.
3. The confusion-removing text hashing algorithm fusing hierarchical label information according to claim 2, characterized in that S3 specifically comprises:
S31: during training, the semantic centers are calculated from the hidden vectors of the samples of each category:
μ_p = E[z_i | sample i belongs to parent class p]; μ_c = E[z_i | sample i belongs to subclass c]    (3)
in formula (3), μ_p denotes the latent semantic center of parent class p; μ_c denotes the latent semantic center of subclass c; E is the expectation function;
S32: a semantic-center-based fuzzy radius is calculated for each parent class and each subclass:
r_p = γ·E[d(z_i, μ_p)]; r_c = γ·E[d(z_i, μ_c)]    (4)
in formula (4), r_p is the semantic fuzzy radius of parent class p; r_c is the semantic fuzzy radius of subclass c; γ is a hyperparameter; E is the expectation function;
S33: a distance constraint similar to the hierarchical similarity constraint is introduced into the training step of each sample:
L_dc = L_dc^c + L_dc^p    (5)
in formula (5), L_dc is the total de-confusion supervision constraint loss on the training set; E is the expectation function; L_dc^c and L_dc^p are the expected de-confusion contrastive losses at the subclass and parent-class levels obtained from the semantic centers; μ_p and μ_c are the associated parent-class and subclass semantic centers; the confusing positive set Z+_f contains randomly sampled hidden vectors of samples in the anchor's parent class and subclass, and the confusing negative set Z−_f contains randomly sampled hidden vectors of samples in a different parent class and subclass, the two sets satisfying:
d(z, μ) > r for z ∈ Z+_f, and d(z, μ) < r for z ∈ Z−_f,
where r_p and r_c are the corresponding parent-class and subclass hidden-vector fuzzy radii; likewise, the hash algorithm is optimized in a twin-network fashion during training.
4. The confusion-removing text hashing algorithm fusing hierarchical label information according to claim 2, characterized in that the optimization method of the hash algorithm comprises:
maximizing the variational lower bound
L_ELBO = E[log p_θ(x|b)] − KL(q_φ(b|x) ‖ p(b))    (6)
i.e. maximizing the reconstruction of the original document while minimizing the KL divergence; in formula (6), E is the expectation function; the inference process is the hash-algorithm encoding process, i.e. the encoder q_φ(b|x), whose network parameters are denoted φ; the generation process is the hash-algorithm decoding process p_θ(x|b), whose network parameters are denoted θ; p(b) is a prior distribution, without loss of generality; since the Bernoulli variational autoencoder treats the prior distribution as a multivariate Bernoulli distribution, formula (6) is concretely transformed into the following optimization objectives:
L_rec = CE(x, x̂)    (7)
KL = Σ_j [ z_j·log(z_j/p) + (1 − z_j)·log((1 − z_j)/(1 − p)) ]    (8)
in formulas (7) and (8), L_rec is the document reconstruction loss; p is the parameter of the prior distribution; x̂ is the feature vector reconstructed by the decoding network; CE is the cross-entropy loss function;
the hash algorithm uses a label-prediction network to fuse the basic subclass information into the hash code, optimizing it with the following objective function, where ŷ^c is the subclass to which the network predicts the sample belongs:
L_label = CE(y^c, ŷ^c)    (9)
the hash algorithm is optimized with the following objective function:
L_total = L_VAE + λ_1·L_label + λ_2·L_tri + λ_3·L_dc    (10)
in formulas (9) and (10), L_label is the corresponding subclass-label prediction loss; ψ denotes the label-prediction network parameters; L_VAE is the corresponding Bernoulli variational autoencoder loss, comprising the document reconstruction loss L_rec and the KL loss; λ_1, λ_2, λ_3 are respectively the weight hyperparameters of each part of the objective function; L_tri is the computed hierarchical similarity constraint loss; L_dc is the computed de-confusion constraint loss; L_total is the complete objective function of the hash algorithm.
5. An implementation device of the confusion-removing text hashing algorithm fusing hierarchical label information, characterized in that:
the encoder q_φ uses a multi-layer perceptron to map the surface features x of the document to a continuous hidden distribution space, and the hash code is obtained by sampling from the encoded posterior multivariate Bernoulli distribution;
the coding process comprises the following steps: first, the surface features x are input into the multi-layer perceptron for encoding to obtain h:
h = MLP(x)    (11)
then the Sigmoid function is applied to obtain the pre-binarization hidden vector z:
z = σ(h) = 1/(1 + exp(−h))    (12)
in formula (12), exp denotes the exponential function;
the hash code b is obtained by the sampling operation:
b_j ~ Bernoulli(z_j)    (13)
in formula (13), the approximate posterior associated with the hidden vector is expressed as
q_φ(b|x) = Π_j z_j^{b_j}·(1 − z_j)^{1 − b_j}    (14)
the decoder p_θ obtains the hidden vector after a linear transformation, then reconstructs the original feature representation through an activation function and predicts the label:
x̂_w = softmax(W_d·b + c_d)_w    (15)
in formula (15), x̂_w denotes the output at position w of the decoding-network output vector; W_d is the decoding-network parameter matrix; the reconstruction target is the one-hot vector of the corresponding dimension; c_d is the network bias term;
ŷ_k = softmax(W_l·b + c_l)_k    (16)
in formula (16), ŷ_k denotes the output at position k of the prediction-network output vector; W_l is the prediction-network parameter matrix; the prediction target is the one-hot subclass label vector of the corresponding dimension; c_l is the network bias term;
L_rec = CE(x, x̂)    (17)
L_label = CE(y^c, ŷ)    (18)
in formulas (11)-(18), the Softmax activation is replaced by a Sigmoid function when computing in the multi-label scenario, and the sampling process of formula (13) is replaced by the following form:
b_j = 1[z_j > τ]    (19)
in formula (19), τ is a fixed threshold sampling value, and the indicator function 1[·] completes the sampling from the implicit distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310956922.9A CN116662490B (en) | 2023-08-01 | 2023-08-01 | Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116662490A true CN116662490A (en) | 2023-08-29 |
CN116662490B CN116662490B (en) | 2023-10-13 |
Family
ID=87715785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310956922.9A Active CN116662490B (en) | 2023-08-01 | 2023-08-01 | Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662490B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001033337A2 (en) * | 1999-11-01 | 2001-05-10 | Curl Corporation | System and method supporting property values as options |
CN110188209A (en) * | 2019-05-13 | 2019-08-30 | 山东大学 | Cross-module state Hash model building method, searching method and device based on level label |
CN111460077A (en) * | 2019-01-22 | 2020-07-28 | 大连理工大学 | Cross-modal Hash retrieval method based on class semantic guidance |
CN111753189A (en) * | 2020-05-29 | 2020-10-09 | 中山大学 | Common characterization learning method for few-sample cross-modal Hash retrieval |
CN112199520A (en) * | 2020-09-19 | 2021-01-08 | 复旦大学 | Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix |
CN113806580A (en) * | 2021-09-28 | 2021-12-17 | 西安电子科技大学 | Cross-modal Hash retrieval method based on hierarchical semantic structure |
CN113889228A (en) * | 2021-09-22 | 2022-01-04 | 武汉理工大学 | Semantic enhanced Hash medical image retrieval method based on mixed attention |
WO2022037295A1 (en) * | 2020-08-20 | 2022-02-24 | 鹏城实验室 | Targeted attack method for deep hash retrieval and terminal device |
WO2022104540A1 (en) * | 2020-11-17 | 2022-05-27 | 深圳大学 | Cross-modal hash retrieval method, terminal device, and storage medium |
CN116383415A (en) * | 2023-03-21 | 2023-07-04 | 华中科技大学 | Construction method and application of hash generation model of multimedia data |
Non-Patent Citations (3)
Title |
---|
XIAOQIANG LU; YAXIONG CHEN; XUELONG LI: "Hierarchical Recurrent Neural Hashing for Image Retrieval With Hierarchical Convolutional Features", IEEE TRANSACTIONS ON IMAGE PROCESSING, pages 106 *
CAO Lu; YANG Wenqiang: "Similarity retrieval algorithm based on discrete supervised hashing", Science Technology and Engineering, no. 26, pages 250-255 *
LI Peng: "Research and implementation of the RANGI mapping system based on identity/locator separation", China Masters' Theses Full-text Database, pages 139-16 *
Also Published As
Publication number | Publication date |
---|---|
CN116662490B (en) | 2023-10-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Geng et al. | Ontozsl: Ontology-enhanced zero-shot learning | |
Cao et al. | Deep visual-semantic quantization for efficient image retrieval | |
Tu et al. | TransNet: Translation-Based Network Representation Learning for Social Relation Extraction. | |
Liu et al. | Collaborative hashing | |
Yu et al. | AS-GCN: Adaptive semantic architecture of graph convolutional networks for text-rich networks | |
CN113177132B (en) | Image retrieval method based on depth cross-modal hash of joint semantic matrix | |
CN114218389A (en) | Long text classification method in chemical preparation field based on graph neural network | |
Zhang et al. | Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval | |
Qin et al. | Reformulating graph kernels for self-supervised space-time correspondence learning | |
Liang et al. | Cross-media semantic correlation learning based on deep hash network and semantic expansion for social network cross-media search | |
Wang et al. | A text classification method based on LSTM and graph attention network | |
Zhu et al. | Cross-modal retrieval: a systematic review of methods and future directions | |
Wang et al. | A lightweight knowledge graph embedding framework for efficient inference and storage | |
Qin et al. | Deep adaptive quadruplet hashing with probability sampling for large-scale image retrieval | |
Fan et al. | Three-stage semisupervised cross-modal hashing with pairwise relations exploitation | |
Qiu et al. | Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization | |
Wang et al. | SBHA: sensitive binary hashing autoencoder for image retrieval | |
Liu et al. | Bit reduction for locality-sensitive hashing | |
Yin et al. | A Cross-Modal Image and Text Retrieval Method Based on Efficient Feature Extraction and Interactive Learning CAE | |
Lv et al. | ACE: Ant colony based multi-level network embedding for hierarchical graph representation learning | |
CN116662490B (en) | Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information | |
Qiu et al. | HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval | |
CN112529187A (en) | Knowledge acquisition method fusing multi-source data semantics and features | |
CN113157892A (en) | User intention processing method and device, computer equipment and storage medium | |
CN114118273B (en) | Limit multi-label classified data enhancement method based on label and text block attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||