CN116662490A - Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information - Google Patents

Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information Download PDF

Info

Publication number
CN116662490A
CN116662490A CN202310956922.9A CN202310956922A CN116662490A CN 116662490 A CN116662490 A CN 116662490A CN 202310956922 A CN202310956922 A CN 202310956922A CN 116662490 A CN116662490 A CN 116662490A
Authority
CN
China
Prior art keywords
hash
class
sample
hierarchical
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310956922.9A
Other languages
Chinese (zh)
Other versions
CN116662490B (en
Inventor
孙宇清
黄钿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310956922.9A priority Critical patent/CN116662490B/en
Publication of CN116662490A publication Critical patent/CN116662490A/en
Application granted granted Critical
Publication of CN116662490B publication Critical patent/CN116662490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A confusion-removing text hash algorithm and a confusion-removing text hash device fusing layering label information belong to the technical field of natural language processing and information retrieval. Aiming at the text characteristics with layering labels, the invention constructs layering similarity relation in the hash space through multiple losses; in addition, in order to prevent the hash algorithm from being influenced by samples with inconsistent category and semantic similarity in the encoding process, a technical thought of confusion removal is introduced. The invention can establish effective layering similarity relation in the hash space, can better adapt to a real neighbor retrieval scene, and effectively uses the label information to construct layering hash space, so that the hash code can be gathered by the same label sample in different layers while meeting the requirement of consistency with the original semantic similarity, and the layering space distribution of different label sample dispersions is presented. The invention also effectively increases the robustness of the model in practical use.

Description

Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information
Technical Field
The invention relates to a confusion-removing text hash algorithm and a confusion-removing text hash device for fusing hierarchical label information, and belongs to the technical field of natural language processing and information retrieval.
Background
The text hash is a method for mapping the document from an original high-dimensional symbol feature space to a low-dimensional binary address space through hash, can enable the document with similar semantics to be mapped to a similar address space, can measure the distance between hash codes through Hamming distance, and compared with the traditional expensive calculation power and time consumption for calculating the similarity in European space, the text hash only needs to evaluate the similarity between binary hash codes through calculating Hamming distance, and the retrieval efficiency is remarkably improved. Generally, the category labels of the text may be used to assist in the construction of the hash codes, so that the text hash code distances of the same category remain similar. Hierarchical text hashing is a special form of text hashing that hashes text with hierarchical category labels. The text with the layering type accords with the real text semantic distribution scene, for example, in news texts, news texts such as basketball, football, table tennis and the like under the 'sports' major categories keep similar semantic aggregation among news of different ball categories, and the semantic aggregation is presented in the sports major categories while the different categories are scattered, and the news texts are scattered with texts under other major categories such as economy and the like. The hash retrieval is used as a neighbor retrieval technology, and it is important to keep the hierarchical neighbor relation of the real scene. Therefore, the hierarchical text hash has important application value in the field of large-scale information retrieval.
In the hierarchical text hashing problem, how to build efficient hierarchical similarity constraints in the hash space is quite challenging. Unlike non-hierarchical text hashing scenarios, hierarchical text hashing requires that the intelligent model be able to build a label-based hierarchical neighbor relation in the hash space, where the neighbor relation can be described as a similarity relation of hash codes, and introducing hierarchical label information by means of multi-label prediction of the encoded hash codes alone is far from sufficient, as this approach does not build explicit hierarchical similarity constraints in the hash space. In addition, fuzzy samples with inconsistent semantic and category information similarity sometimes appear in the hierarchical text hash process, so that the generalization performance of the model is reduced. The hash model aims to solve the challenge of establishing effective layering similarity in the hash code, and the challenge of weakening the influence of mixed signals with inconsistent semantic and category similarity on the performance of the hash model, so that the generalization capability and the robustness of the hash model are improved.
In the prior art, there have been many heuristics in text Ha Xifang down:
chinese patent literature: CN113449849a discloses a self-encoder based learning text hash method, which uses a hash model of a self-encoder structure to complete text hash model construction, and the method only reconstructs semantic information of a text, but cannot guarantee that documents of different categories are differentiated in a hash space.
Chinese patent literature: CN110955745a discloses a text hash search method based on deep learning, which uses a classification layer to classify the hash codes obtained by encoding to integrate category information, in this way, no explicit similarity constraint relation is established in the hash space, and only integration of flattened categories, not hierarchical category information, is considered. In this patent too, fuzzy samples that are inconsistent in semantic and class information similarity are not processed.
In summary, it is still difficult to satisfy the hierarchical classification requirement of the hierarchical text hash in the prior art.
Disclosure of Invention
The invention discloses a confusion-removing text hash algorithm for fusing hierarchical label information.
The invention also discloses a device for realizing the confusion-removing text hash algorithm.
Aiming at the text characteristics with layering labels, the invention constructs layering similarity relation in the hash space through multiple losses; in addition, in order to prevent the hash algorithm from being influenced by samples with inconsistent category and semantic similarity in the encoding process, a technical thought of confusion removal is introduced.
The detailed technical scheme of the invention is as follows:
the confusion-removing text hash algorithm for fusing hierarchical label information is characterized by comprising the following steps of:
s1: acquiring surface features of textEnglish feature, preferably, the surface layer features are expressed by using word frequency and inverse document frequency;
the surface featuresThe calculation method of (1) comprises the following steps:
s11: utilizing a coarse granularity word segmentation device of an open source word segmentation tool HanLP to segment texts in a language database;
s12: removing common Chinese stop words from word segmentation results, and obtaining word dictionary with front-ordered words from large to small according to word frequencyTF-IDF feature vector of each text as surface feature +.>
S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>The method comprises the steps of carrying out a first treatment on the surface of the Category label information is available during the training phase and not available during the testing phase; the hierarchical text hash task is to learn a hash model to add the surface features +.>Mapping to a +.>Binary vector of dimension->The formula is +.>And satisfies the requirement for ++>Its corresponding hash code +>Distance is similar, for->Corresponding to itHash code->The distances are also similar;
s21: the constructed target hash code satisfies the following hierarchical distance based on the label:
given hash codeThe parent category and the sub-category to which the pairwise combination belongs satisfy the following conditions:
and />Respectively belonging to the same parent class and the same subclass;
and />Respectively belonging to different subclasses;
and />Respectively belonging to the same parent class and different subclasses;
and />Respectively belonging to different parent classes and different subclasses;
i.e., wherein />Andparent tags and child tags corresponding to the samples respectively;
then, hash codeThe constructed layering distance constraint is as follows:
wherein Super parameters for controlling the distance difference; />Is the Euclidean distance between two vectors, wherein +.>Two hash codes respectively referring to the calculated Euclidean distance; because of the binarycity of the hash codes, the excessive spans, namely 0 and 1, exist among the values of the corresponding dimensions of different hash codes, and optimization is still difficult to be performed when the Euclidean distance constraint is performed, so that the distance constraint is performed in a continuous space; due to arbitrary-> and />In positive correlation, the above hierarchical distance constraint condition is equivalent to satisfy the constraint relationship of the following two hierarchies:
wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>,/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed; />Is a super parameter;
distance constraint is carried out on each layer:
the hierarchical distance constraint translates into optimizing the following objective function:
in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively;is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />In training, the parameters of the Hash algorithm are updated in a twin network mode, and the method is hoped to ensure that under the same model parameters, the semantic similarity is kept and the same father and father are simultaneously realizedThe distance between the samples of the child is closer in the hash space than that between the samples of the father and the alien, and the distance between the samples of the father and the alien is closer in the hash space than that between the samples of the alien, so that the construction of the hierarchical similarity constraint in the hash space is completed.
The hash algorithm further comprises an S3 construction method for removing confusion constraint, which comprises the following steps:
for the core thought of the confusion removing supervision method, the method is divided into two parts, wherein firstly, samples with similar semantics but belonging to different categories need to be far away in the corresponding hash space, and secondly, samples with less similar semantics but belonging to the same category need to be similar in the corresponding hash space; consistent with the foregoing, by successive hidden vectorsTo construct hash +.>Similarity relation between:
let the samples of each parent and child class have a semantic center in hidden space.
S31: and calculating the semantic center based on the hidden vector of the sample of the category in the training process.
In the formula (3) of the present invention,representing parent class +.>Is a latent semantic center of (1); />Representation subclass->Is a latent semantic center of (1); />The function is calculated for the desired purpose.
S32: on the premise of defining various kinds of lingering semantic centers, calculating fuzzy radius based on the semantic centers for each parent class and each child class:
in the formula (4) of the present invention,is->Semantic blur radius of parent class; />Is->Semantic blur radius of subclass; />Is a super-parameter, and can be automatically adjusted according to the data; />The function is calculated for the desired purpose.
S33: the method is characterized in that the same class samples with the distance from the semantic center being larger than the fuzzy radius and the heterogeneous samples with the distance from the semantic center being smaller than the fuzzy radius are provided with confusing supervision information, and based on the concept, the distance constraint similar to the layering similarity constraint is introduced into the training step of each sample:
in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />The function is calculated for the desired purpose. and />The loss expectations of confusion removal and comparison under parent classes and subclasses obtained through semantic center calculation are respectively; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:
wherein and />Corresponding parent hidden vector fuzzy radius and sub-class hidden vector fuzzy radius respectively; likewise, the hash algorithm is optimized by adopting a twin network mode in the training process.
The two parts are key components of the invention, and the related super-parameter user can use default values and can also be customized to meet actual service requirements. Guan Chaocan of the sample similarity constraints involved and />Confusion-removing constraint super-ginseng and />And weight exceeding ginseng->And confusion boundary super-ginseng->The method can be automatically adjusted according to the data set and used for subsequent training of the confusion-removing text hash algorithm for fusing the hierarchical label information;
the distribution diagram of the blurred sample is shown in fig. 1;
the sample selection algorithm pseudo code of the above procedure is shown in fig. 2;
hash algorithmIntegration of semantic information of a document by Bernoulli variational self-encoder framework, according to Bayesian theorem, variational self-encoder can be optimized by maximizing variational lower bound, and can be converted into maximizing the following variational lower bound in a hash modelNamely maximizing the reconstructed original document and minimizing the KL divergence:
in the formula (6) of the present invention,the function is calculated for the desired purpose. The inference process is denoted as the hash algorithm encoding (approximate posterior) process, i.e. encoder +.>The encoded network parameters are denoted +.>The method comprises the steps of carrying out a first treatment on the surface of the The generation process is denoted as hash algorithm decoding process +.>The decoded network parameter is denoted +.>;/>A priori distribution without loss of generality; since the bernoulli variation self-encoder treats the a priori distribution as a multivariate bernoulli distribution, equation (6) is transformed in a specific process to optimize the objective function as follows:
in the formula (7) and the formula (8),reconstructing the loss for the document; KL is the relative entropy; />For a priori distributionIs a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;
the hash algorithm adopts a label prediction network to integrate the hash code into the basic sub-category information of the hash code, and optimizes the hash code by using the following objective function, whereinThe category to which the sample predicted for the network belongs:
through the above discussion, the hash algorithm proposed by the present invention is optimized by the following objective function:
in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from the encoder for the corresponding Bernoulli variation, including document reconstruction +.>Loss and relative entropy KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function;constraint loss for the computed hierarchical similarity; />A defrobulated constraint loss calculated for the middle; />Is a complete objective function of the hash algorithm.
The device for realizing the confusion-removing text hash algorithm for fusing the hierarchical label information is characterized in that:
the whole hash model framework is realized from the encoder framework based on the Knuds variation; encoder with a plurality of sensorsUsing a multi-layer perceptron, english multilayer perceptron, english shorthand MLP; the surface layer characteristics of the document are->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;
the coding process comprises the following steps: first, the surface layer is characterizedInputting the multi-layer perceptron to encode to obtain +.>
(11)
Then activating by using Sigmoid function to obtain hidden vector before binarization
(12)
In the formula (12)Representing an exponential function;
obtaining hash codes by sampling operations
(13)
In equation (13), the approximate posterior of the hidden vector correlation is expressed as
(14)
DecoderObtaining hidden vectors after linear transformation, then reconstructing the original characteristic representation through an activation function, and predicting the labels:
(15)
in the case of the formula (15),representing decoded network output vectorsCorresponding position +.>An output of (2); />For decoding the network parameter matrix +.>To correspond to->A single heat vector of dimensions; />Is a network bias term;
(16)
in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />A network parameter matrix is predicted; />To correspond to->A single heat vector of dimensions; />Is a network bias term;
(17)
(18)
in the formulas (11) - (18),in the calculation of the multi-label scene, the multi-label scene is replaced by a Sigmoid function, and the sampling process is carried outThe following form was used instead:
(19)
in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete hidden distribution sampling, and to use a straight-through estimator to process the non-conductive phenomenon of the binarization operation in the training process.
In specific implementation details, two layers of hidden layers each containing 1000 neurons activated by adopting a leak ReLU are used as a network basic structure of the MLP, and a dropout technology is adopted after the second layer to prevent the phenomenon that a hash model is over-fitted, wherein the dropout probability is set to be 0.1. Then, a sigmoid activation function is adopted to realize the approximation process of the A-test Bernoulli distribution, and a fixed threshold value is adoptedBy->And the function samples the binary hash code to complete the construction of the inferred network. For generating a network, a layer of linear transformation is adopted to reconstruct an original feature vector and a predictive subcategory label vector, a Softmax function is adopted to activate network output when the original feature vector is reconstructed, and a Softmax is used to activate network output for a single label scene when the subcategory label is predictedAnd activating, namely for the multi-label scene, activating by using a Sigmoid function, and replacing a loss calculation mode with Euclidean distance loss. The inference network (encoder) and the generation network (decoder) are only used in combination in the training phase to complete the optimization process under the framework of the variational self-encoder, and in the practical application phase, only the encoder is used for inputting the surface layer characteristics->And encoding to obtain the hash code.
The whole framework of the implementation of the hash algorithm device is shown in fig. 3, and the training process is shown in fig. 4.
The technical advantages of the invention include:
compared with the prior art, the method only used, the method can better adapt to a real neighbor search scene, and effectively uses the label information to construct a hierarchical hash space, so that the hash code can be consistent with the original semantic similarity, and meanwhile, the same label sample aggregation and the hierarchical spatial distribution of different label sample dispersion are presented in different layers.
The confusion removing constraint proposed by the invention according to the actual problem can help the hash model to weaken the influence caused by confusion information from inconsistent category and semantics in the training process, and effectively increase the robustness of the model in actual use.
The invention can also be used on other similar tasks in the field of natural language processing, such as hierarchical text representation learning, hierarchical text classification and other related tasks based on hierarchical label scenes.
Drawings
FIG. 1 is a schematic diagram of the distribution of fuzzy samples in a hierarchical hash scenario of the present invention;
FIG. 2 is an algorithmic pseudocode of a sample selection method for modeling samples and hierarchical similarity hash space construction in accordance with the present invention;
FIG. 3 is a schematic diagram of the overall design framework of the confusion-removed text hashing algorithm incorporating hierarchical label information in the present invention;
fig. 4 is a training algorithm pseudo code of the defrobulated text hashing algorithm incorporating hierarchical label information in accordance with the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
Example 1,
A defrobulated text hashing algorithm that fuses hierarchical tag information, comprising:
s1: acquiring surface features of textEnglish feature, preferably, the surface layer features are expressed by using word frequency and inverse document frequency;
the surface featuresThe calculation method of (1) comprises the following steps:
s11: utilizing a coarse granularity word segmentation device of an open source word segmentation tool HanLP to segment texts in a language database;
s12: removing common Chinese stop words from word segmentation results, and obtaining word dictionary with front-ordered words from large to small according to word frequencyTF-IDF feature vector of each text as surface feature +.>
S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>The method comprises the steps of carrying out a first treatment on the surface of the Category label information is available during the training phase and not available during the testing phase; the hierarchical text hash task is to learn a hash model to add the surface features +.>Mapping to a +.>Binary vector of dimension->The formula is +.>And satisfies the requirement for ++>Its corresponding hash code +>Distance is similar, for->Its corresponding hash code +>The distances are also similar;
s21: the constructed target hash code satisfies the following hierarchical distance based on the label:
given hash codeThe parent category and the sub-category to which the pairwise combination belongs satisfy the following conditions:
and />Respectively belonging to the same parent class and the same subclass;
and />Respectively belonging to different subclasses;
and />Respectively belonging to the same parent class and different subclasses;
and />Respectively belonging to different parent classes and different subclasses;
i.e., wherein /> and />Parent tags and child tags corresponding to the samples respectively;
then, hash codeThe constructed layering distance constraint is as follows:
wherein Super parameters for controlling the distance difference; />Is the Euclidean distance between two vectors, wherein +.>Two hash codes respectively referring to the calculated Euclidean distance; because of the binarycity of the hash codes, the excessive spans, namely 0 and 1, exist among the values of the corresponding dimensions of different hash codes, and optimization is still difficult to be performed when the Euclidean distance constraint is performed, so that the distance constraint is performed in a continuous space; due to arbitrary-> and />In positive correlation, the above hierarchical distance constraint condition is equivalent to satisfy the constraint relationship of the following two hierarchies:
wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>,/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed;/>is a super parameter;
distance constraint is carried out on each layer:
the hierarchical distance constraint translates into optimizing the following objective function:
in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively;is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />In training, the parameters of the hash algorithm are updated in a twin network mode, so that semantic similarity is expected to be kept under the same model parameters, meanwhile, the distance between samples of the same father and son in the hash space is closer than that between samples of the same father and son, the distance between samples of the same father and son in the hash space is closer than that between samples of different father and son, and therefore the construction of the hierarchical similarity constraint in the hash space is completed.
EXAMPLE 2,
The method for constructing the confusion-removing text hash algorithm for fusing hierarchical label information according to embodiment 1, further comprising S3 a construction method for the confusion-removing constraint, comprising:
for the core thought of the confusion removing supervision method, the method is divided into two parts, wherein firstly, samples with similar semantics but belonging to different categories need to be far away in the corresponding hash space, and secondly, samples with less similar semantics but belonging to the same category need to be similar in the corresponding hash space; consistent with the foregoing, by successive hidden vectorsTo construct hash +.>Similarity relation between:
let the samples of each parent and child class have a semantic center in hidden space,
s31: semantic center calculation based on hidden vectors of samples of this type in training process
In the formula (3) of the present invention,representing parent class +.>Is a latent semantic center of (1); />Representation subclass->Is a latent semantic center of (1); />The function is calculated for the desired purpose.
S32: on the premise of defining various kinds of lingering semantic centers, calculating fuzzy radius based on the semantic centers for each parent class and each child class:
in the formula (4) of the present invention,is->Semantic blur radius of parent class; />Is->Semantic blur radius of subclass; />Is a super-parameter, and can be automatically adjusted according to the data; />The function is calculated for the desired purpose.
S33: the method is characterized in that the same class samples with the distance from the semantic center being larger than the fuzzy radius and the heterogeneous samples with the distance from the semantic center being smaller than the fuzzy radius are provided with confusing supervision information, and based on the concept, the distance constraint similar to the layering similarity constraint is introduced into the training step of each sample:
in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />The function is calculated for the desired purpose. and />The loss expectations of confusion removal and comparison under the parent class and the child class obtained through semantic center calculation are respectively shown; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:
wherein and />Corresponding parent hidden vector fuzzy radius and sub-class hidden vector fuzzy radius respectively; likewise, the hash algorithm is optimized by adopting a twin network mode in the training process.
The two parts are key components of the invention, and the related super-parameter user can use default values and can also be customized to meet actual service requirements. Guan Chaocan of the sample similarity constraints involved and />Confusion-removing constraint super-ginseng and />And weight exceeding ginseng->And confusion boundary super-ginseng->The method can be automatically adjusted according to the data set and used for subsequent training of the confusion-removing text hash algorithm for fusing the hierarchical label information;
the distribution diagram of the blurred sample is shown in fig. 1;
the sample selection algorithm pseudo code of the above procedure is shown in fig. 2;
the Hash algorithm integrates semantic information of a document through a Bernoulli variation self-encoder framework, and the variation self-encoder can be optimized by maximizing a variation lower bound according to the Bayesian theorem and can be converted into the maximized variation lower bound in a Hash modelNamely maximizing the reconstructed original document and minimizing the KL divergence:
in the formula (6) of the present invention,the function is calculated for the desired purpose. The inference process is denoted as the hash algorithm encoding (approximate posterior) process, i.e. encoder +.>The encoded network parameters are denoted +.>The method comprises the steps of carrying out a first treatment on the surface of the The generation process is denoted as hash algorithm decoding process +.>The decoded network parameter is denoted +.>;/>A priori distribution without loss of generality; since the bernoulli variation self-encoder treats the a priori distribution as a multivariate bernoulli distribution, equation (6) is transformed in a specific process to optimize the objective function as follows:
in the formula (7) and the formula (8),reconstructing the loss for the document; KL is the relative entropy; />For a priori distribution->Is a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;
the hash algorithm adopts a label prediction network to integrate the hash code into the basic sub-category information of the hash code, and optimizes the hash code by using the following objective function, whereinThe category to which the sample predicted for the network belongs:
through the above discussion, the hash algorithm proposed by the present invention is optimized by the following objective function:
in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from the encoder for the corresponding Bernoulli variation, including document reconstruction +.>Loss and relative entropy KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function;constraint loss for the computed hierarchical similarity; />The resulting de-confusing constraint loss is calculated. />Is a complete objective function of the hash algorithm.
EXAMPLE 3,
An implementation device of a confusion-removing text hash algorithm for fusing hierarchical label information comprises:
the whole hash model framework is realized from the encoder framework based on the Knuds variation; encoder with a plurality of sensorsEnglish multilayer perceptron, english letter using multi-layer perceptronWriting MLP; the surface layer characteristics of the document are->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;
the coding process comprises the following steps: first, the surface layer is characterizedInputting the multi-layer perceptron to encode to obtain +.>
(11)
Then activating by using Sigmoid function to obtain hidden vector before binarization
(12)
In the formula (12)Representing an exponential function;
obtaining hash codes by sampling operations
(13)
In equation (13), the approximate posterior of the hidden vector correlation is expressed as
(14)
DecoderObtaining hidden vectors after linear transformation, then reconstructing the original characteristic representation through an activation function, and predicting the labels:
(15)
in the case of the formula (15),representing the corresponding position in the decoded network output vector +.>An output of (2); />For decoding the network parameter matrix +.>To correspond to->A single heat vector of dimensions; />Is a network bias term;
(16)
in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />To predictA network parameter matrix; />To correspond to->A single heat vector of dimensions; />Is a network bias term;
(17)
(18)
in the formulas (11) - (18),in the calculation of the multi-label scene, the multi-label scene is replaced by a Sigmoid function, and the sampling process is carried outThe following form was used instead:
(19)
in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete the implicit distributed sampling. A pass-through estimator is used to handle the non-conductive phenomenon of the binarization operation in the training process.
In specific implementation details, two layers of hidden layers each containing 1000 neurons activated by adopting a leak ReLU are used as a network basic structure of the MLP, and a dropout technology is adopted after the second layer to prevent the phenomenon that a hash model is over-fitted, wherein the dropout probability is set to be 0.1. Thereafter, use is made ofsigmoid activation function to achieve approximation of Bernoulli posterior distribution and employ fixed thresholdBy->And the function samples the binary hash code to complete the construction of the inferred network. For generating a network, a layer of linear transformation is adopted to reconstruct an original feature vector and a predicted subcategory label vector, a Softmax function is adopted to activate network output when the original feature vector is reconstructed, a Softmax function is adopted to activate network output for a single-label scene when the subcategory label is predicted, a Sigmoid function is adopted to activate for a multi-label scene, and a loss calculation mode is replaced by Euclidean distance loss. The inference network (encoder) and the generation network (decoder) are only used in combination in the training phase to complete the optimization process under the framework of the variational self-encoder, and in the practical application phase, only the encoder is used for inputting the surface layer characteristics->And encoding to obtain the hash code.
The whole framework of the implementation of the hash algorithm device is shown in fig. 3, and the training process is shown in fig. 4.
In the above embodiment, the performance of the hash algorithm proposed by the present invention is quantitatively analyzed based on the 32-bit hash code:
training and testing the hash algorithm of the invention on a WOS data set, wherein the WOS data set adopts a data set from https:// data.mendeley.com/data/9 rw3vkcfy4/6, the test result is shown in table 1, wherein a Distance column is the Hamming Distance between other samples searched by a query sample and hash codes of the query sample, a Domain column is a large field, the data column can be understood as a parent class to which the sample belongs, the Area column is a subdivision field, the data column can be understood as a subclass to which the sample belongs, a Keywords column is a corresponding sample keyword, and a Code column is a corresponding sample hash Code:
table 1: test results (text is encoded into 32-bit hash codes based on the related technology of the invention, and the hash codes of the text to be queried are searched by using the hash codes of the query text to obtain test results in the following table).
As can be seen from Table 1, as the Hamming distance increases, the retrieved documents become less relevant. In addition, documents under the same parent class are closer in hash space than documents within different parent classes, even though they come from different subclasses, exhibiting hierarchical similarity relationships. Therefore, the result in the table 1 shows that the hash algorithm provided by the invention can encode the hash codes for effectively measuring the hierarchical similarity between the documents, and the hierarchical category label information is fully utilized.
In the embodiment, the retrieval performance evaluation of the hash model proposed herein is performed on 16-bit, 32-bit, 64-bit and 128-bit hash codes by using a precision@100 index based on a 20New Groups data set, wherein the 20New Groups data set is derived from http:// qword.com/-jason/20 New Groups; the Method column is the result of comparing the hash algorithm with the hash algorithm in the text, and the columns of 16bits, 32bits, 64bits and 128bits respectively represent the model performance on the hash codes of 16bits, 32bits, 64bits and 128 bits.
Table 2: precision@100 retrieval performance evaluation table
In Table 2, ours is the performance of the hash algorithm proposed by the present invention, and other prior art manifestations are cited in the IHDH Hash technical literature (Guo J N, mao X L, wei W, et al, intra-category aware hierarchical supervised document hashing [ J ]. IEEE Transactions on Knowledge and Data Engineering, 2022.).
As can be seen from table 2, the performance of the hash algorithm in this document exceeds that of the other prior art in comparison with the hash codes of four common lengths in the experiment, which fully demonstrates the advantages of the hash model proposed by the present invention.

Claims (5)

1. The confusion-removing text hash algorithm for fusing hierarchical label information is characterized by comprising the following steps of:
s1: acquiring surface features of text
S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>
S3: a method of construction of a deconrobustable constraint, comprising: by successive hidden vectorsTo construct hash +.>The similarity relation between the parent category and the child category is set to have a semantic center in the hidden space.
2. The defrobulated text hashing algorithm of claim 1, wherein in S1, said surface features areThe calculation method of (1) comprises the following steps:
s11: utilizing a coarse granularity word segmentation device of an open source word segmentation tool HanLP to segment texts in a language database;
s12: removing Chinese stop words from word segmentation results, and obtaining word dictionary with front words according to word frequency from large to smallTF-IDF feature vector of each text as surface feature +.>
The S2 further includes:
the constructed target hash code satisfies the following hierarchical distance based on the label:
given hash codeThe parent category and the sub-category to which the pairwise combination belongs satisfy the following conditions:
and />Respectively belonging to the same parent class and the same subclass;
and />Respectively belonging to different subclasses;
and />Respectively belonging to the same parent class and different subclasses;
and />Respectively belonging to different parent classes and different subclasses;
i.e., wherein /> and />Parent tags and child tags corresponding to the samples respectively;
then, hash codeThe constructed layering distance constraint is as follows:
wherein Super parameters for controlling the distance difference; />Is the Euclidean distance between two vectors, wherein +.>Two hash codes respectively referring to the calculated Euclidean distance; the above hierarchical distance constraint is equivalent to satisfying the following two hierarchical constraint relationships:
wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>,/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass;a hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed; />Is a super parameter;
distance constraint is carried out on each layer:
the hierarchical distance constraint translates into optimizing the following objective function:
in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively; />Is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />
3. The confusion-removing text hashing algorithm for fusing hierarchical label information according to claim 2, wherein S3 specifically comprises:
s31: in the training process, calculating semantic centers based on hidden vectors of the samples of the category:
in the formula (3) of the present invention,representing parent class +.>Is a latent semantic center of (1); />Representation subclass->Is a latent semantic center of (1); />Calculating a function for the desire;
s32: performing semantic center-based fuzzy radius calculation on each parent class and subclass:
in the formula (4) of the present invention,is->Semantic blur radius of parent class; />Is->Semantic fuzzy half of subclassesDiameter is as follows;is a super ginseng; />Calculating a function for the desire;
s33: a distance constraint similar to the hierarchical similarity constraint is introduced in the training step of each sample:
in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />Calculating a function for the desire; and />The loss expectations of confusion removal and comparison under parent classes and subclasses obtained through semantic center calculation are respectively; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:
wherein and />Corresponding parent hidden vector fuzzy radius and sub-class hidden vector fuzzy radius respectively; likewise, the hash algorithm is optimized by adopting a twin network mode in the training process.
4. The defrobulated text hashing algorithm of claim 2 wherein the method of optimizing the hashing algorithm includes:
maximizing the lower bound of the variationNamely maximizing the reconstructed original document and minimizing the KL divergence:
in the formula (6) of the present invention,calculating a function for the desire; the inference procedure is denoted as hash algorithm coding procedure, i.e. encoder +.>The encoded network parameters are denoted +.>The method comprises the steps of carrying out a first treatment on the surface of the The generation process is denoted as hash algorithm decoding process +.>The decoded network parameter is denoted +.>;/>A priori distribution without loss of generality; since the bernoulli variation self-encoder treats the a priori distribution as a multivariate bernoulli distribution, equation (6) is transformed in a specific process to optimize the objective function as follows:
in the formula (7) and the formula (8),reconstructing the loss for the document; />For a priori distribution->Is a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;
the hash algorithm adopts a label prediction network to integrate the hash code into the basic sub-category information of the hash code, and optimizes the hash code by using the following objective function, whereinThe category to which the sample predicted for the network belongs:
the hash algorithm is optimized by the following objective function:
in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from encoder for corresponding Bernoulli variation, including document reconstructionLoss and KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function; />Constraint loss for the computed hierarchical similarity; />A defrobulated constraint loss calculated for the middle; />Is a complete objective function of the hash algorithm.
5. The device for realizing the confusion-removing text hash algorithm for fusing the hierarchical label information is characterized in that:
encoder with a plurality of sensorsUsing a multi-layer perceptron, characterizing the surface layer of the document->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;
the coding process comprises the following steps: first, the surface layer is characterizedInputting into a multi-layer sensing machineCoding get->
(11)
Then activating by using Sigmoid function to obtain hidden vector before binarization
(12)
In the formula (12)Representing an exponential function;
obtaining hash codes by sampling operations
(13)
In equation (13), the approximate posterior of the hidden vector correlation is expressed as
(14)
DecoderUsing linear transformation to obtain hidden vector, then reconstructing original characteristic representation by activating function and making labelAnd (3) predicting:
(15)
in the case of the formula (15),representing the corresponding position in the decoded network output vector +.>An output of (2); />In order to decode the network parameter matrix,to correspond to->A single heat vector of dimensions; />Is a network bias term;
(16)
in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />A network parameter matrix is predicted;to correspond to->A single heat vector of dimensions; />Is a network bias term;
(17)
(18)
in the formulas (11) - (18),in the calculation of the multi-label scene, the multi-label scene is replaced by a Sigmoid function, and the sampling process is carried outThe following form was used instead:
(19)
in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete the implicit distributed sampling.
CN202310956922.9A 2023-08-01 2023-08-01 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information Active CN116662490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310956922.9A CN116662490B (en) 2023-08-01 2023-08-01 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310956922.9A CN116662490B (en) 2023-08-01 2023-08-01 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information

Publications (2)

Publication Number Publication Date
CN116662490A true CN116662490A (en) 2023-08-29
CN116662490B CN116662490B (en) 2023-10-13

Family

ID=87715785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310956922.9A Active CN116662490B (en) 2023-08-01 2023-08-01 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information

Country Status (1)

Country Link
CN (1) CN116662490B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001033337A2 (en) * 1999-11-01 2001-05-10 Curl Corporation System and method supporting property values as options
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
CN111460077A (en) * 2019-01-22 2020-07-28 大连理工大学 Cross-modal Hash retrieval method based on class semantic guidance
CN111753189A (en) * 2020-05-29 2020-10-09 中山大学 Common characterization learning method for few-sample cross-modal Hash retrieval
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN113806580A (en) * 2021-09-28 2021-12-17 西安电子科技大学 Cross-modal Hash retrieval method based on hierarchical semantic structure
CN113889228A (en) * 2021-09-22 2022-01-04 武汉理工大学 Semantic enhanced Hash medical image retrieval method based on mixed attention
WO2022037295A1 (en) * 2020-08-20 2022-02-24 鹏城实验室 Targeted attack method for deep hash retrieval and terminal device
WO2022104540A1 (en) * 2020-11-17 2022-05-27 深圳大学 Cross-modal hash retrieval method, terminal device, and storage medium
CN116383415A (en) * 2023-03-21 2023-07-04 华中科技大学 Construction method and application of hash generation model of multimedia data

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001033337A2 (en) * 1999-11-01 2001-05-10 Curl Corporation System and method supporting property values as options
CN111460077A (en) * 2019-01-22 2020-07-28 大连理工大学 Cross-modal Hash retrieval method based on class semantic guidance
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
CN111753189A (en) * 2020-05-29 2020-10-09 中山大学 Common characterization learning method for few-sample cross-modal Hash retrieval
WO2022037295A1 (en) * 2020-08-20 2022-02-24 鹏城实验室 Targeted attack method for deep hash retrieval and terminal device
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
WO2022104540A1 (en) * 2020-11-17 2022-05-27 深圳大学 Cross-modal hash retrieval method, terminal device, and storage medium
CN113889228A (en) * 2021-09-22 2022-01-04 武汉理工大学 Semantic enhanced Hash medical image retrieval method based on mixed attention
CN113806580A (en) * 2021-09-28 2021-12-17 西安电子科技大学 Cross-modal Hash retrieval method based on hierarchical semantic structure
CN116383415A (en) * 2023-03-21 2023-07-04 华中科技大学 Construction method and application of hash generation model of multimedia data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIAOQIANG LU; YAXIONG CHEN; XUELONG LI: "Hierarchical Recurrent Neural Hashing for Image Retrieval With Hierarchical Convolutional Features", IEEE TRANSACTIONS ON IMAGE PROCESSING, pages 106 *
曹路;杨文强;: "基于离散监督哈希的相似性检索算法", 科学技术与工程, no. 26, pages 250 - 255 *
李鹏: "基于身份与位置分离的RANGI映射系统的研究与实现", 中国优秀硕士学位论文全文数据库, pages 139 - 16 *

Also Published As

Publication number Publication date
CN116662490B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
Geng et al. Ontozsl: Ontology-enhanced zero-shot learning
Cao et al. Deep visual-semantic quantization for efficient image retrieval
Tu et al. TransNet: Translation-Based Network Representation Learning for Social Relation Extraction.
Liu et al. Collaborative hashing
Yu et al. AS-GCN: Adaptive semantic architecture of graph convolutional networks for text-rich networks
CN113177132B (en) Image retrieval method based on depth cross-modal hash of joint semantic matrix
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
Zhang et al. Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval
Qin et al. Reformulating graph kernels for self-supervised space-time correspondence learning
Liang et al. Cross-media semantic correlation learning based on deep hash network and semantic expansion for social network cross-media search
Wang et al. A text classification method based on LSTM and graph attention network
Zhu et al. Cross-modal retrieval: a systematic review of methods and future directions
Wang et al. A lightweight knowledge graph embedding framework for efficient inference and storage
Qin et al. Deep adaptive quadruplet hashing with probability sampling for large-scale image retrieval
Fan et al. Three-stage semisupervised cross-modal hashing with pairwise relations exploitation
Qiu et al. Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization
Wang et al. SBHA: sensitive binary hashing autoencoder for image retrieval
Liu et al. Bit reduction for locality-sensitive hashing
Yin et al. A Cross‐Modal Image and Text Retrieval Method Based on Efficient Feature Extraction and Interactive Learning CAE
Lv et al. ACE: Ant colony based multi-level network embedding for hierarchical graph representation learning
CN116662490B (en) Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information
Qiu et al. HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval
CN112529187A (en) Knowledge acquisition method fusing multi-source data semantics and features
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
CN114118273B (en) Limit multi-label classified data enhancement method based on label and text block attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant