CN116662490A

CN116662490A - Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information

Info

Publication number: CN116662490A
Application number: CN202310956922.9A
Authority: CN
Inventors: 孙宇清; 黄钿
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-08-01
Filing date: 2023-08-01
Publication date: 2023-08-29
Anticipated expiration: 2043-08-01
Also published as: CN116662490B

Abstract

A confusion-removing text hash algorithm and a confusion-removing text hash device fusing layering label information belong to the technical field of natural language processing and information retrieval. Aiming at the text characteristics with layering labels, the invention constructs layering similarity relation in the hash space through multiple losses; in addition, in order to prevent the hash algorithm from being influenced by samples with inconsistent category and semantic similarity in the encoding process, a technical thought of confusion removal is introduced. The invention can establish effective layering similarity relation in the hash space, can better adapt to a real neighbor retrieval scene, and effectively uses the label information to construct layering hash space, so that the hash code can be gathered by the same label sample in different layers while meeting the requirement of consistency with the original semantic similarity, and the layering space distribution of different label sample dispersions is presented. The invention also effectively increases the robustness of the model in practical use.

Description

Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information

Technical Field

The invention relates to a confusion-removing text hash algorithm and a confusion-removing text hash device for fusing hierarchical label information, and belongs to the technical field of natural language processing and information retrieval.

Background

The text hash is a method for mapping the document from an original high-dimensional symbol feature space to a low-dimensional binary address space through hash, can enable the document with similar semantics to be mapped to a similar address space, can measure the distance between hash codes through Hamming distance, and compared with the traditional expensive calculation power and time consumption for calculating the similarity in European space, the text hash only needs to evaluate the similarity between binary hash codes through calculating Hamming distance, and the retrieval efficiency is remarkably improved. Generally, the category labels of the text may be used to assist in the construction of the hash codes, so that the text hash code distances of the same category remain similar. Hierarchical text hashing is a special form of text hashing that hashes text with hierarchical category labels. The text with the layering type accords with the real text semantic distribution scene, for example, in news texts, news texts such as basketball, football, table tennis and the like under the 'sports' major categories keep similar semantic aggregation among news of different ball categories, and the semantic aggregation is presented in the sports major categories while the different categories are scattered, and the news texts are scattered with texts under other major categories such as economy and the like. The hash retrieval is used as a neighbor retrieval technology, and it is important to keep the hierarchical neighbor relation of the real scene. Therefore, the hierarchical text hash has important application value in the field of large-scale information retrieval.

In the hierarchical text hashing problem, how to build efficient hierarchical similarity constraints in the hash space is quite challenging. Unlike non-hierarchical text hashing scenarios, hierarchical text hashing requires that the intelligent model be able to build a label-based hierarchical neighbor relation in the hash space, where the neighbor relation can be described as a similarity relation of hash codes, and introducing hierarchical label information by means of multi-label prediction of the encoded hash codes alone is far from sufficient, as this approach does not build explicit hierarchical similarity constraints in the hash space. In addition, fuzzy samples with inconsistent semantic and category information similarity sometimes appear in the hierarchical text hash process, so that the generalization performance of the model is reduced. The hash model aims to solve the challenge of establishing effective layering similarity in the hash code, and the challenge of weakening the influence of mixed signals with inconsistent semantic and category similarity on the performance of the hash model, so that the generalization capability and the robustness of the hash model are improved.

In the prior art, there have been many heuristics in text Ha Xifang down:

chinese patent literature: CN113449849a discloses a self-encoder based learning text hash method, which uses a hash model of a self-encoder structure to complete text hash model construction, and the method only reconstructs semantic information of a text, but cannot guarantee that documents of different categories are differentiated in a hash space.

Chinese patent literature: CN110955745a discloses a text hash search method based on deep learning, which uses a classification layer to classify the hash codes obtained by encoding to integrate category information, in this way, no explicit similarity constraint relation is established in the hash space, and only integration of flattened categories, not hierarchical category information, is considered. In this patent too, fuzzy samples that are inconsistent in semantic and class information similarity are not processed.

In summary, it is still difficult to satisfy the hierarchical classification requirement of the hierarchical text hash in the prior art.

Disclosure of Invention

The invention discloses a confusion-removing text hash algorithm for fusing hierarchical label information.

The invention also discloses a device for realizing the confusion-removing text hash algorithm.

Aiming at the text characteristics with layering labels, the invention constructs layering similarity relation in the hash space through multiple losses; in addition, in order to prevent the hash algorithm from being influenced by samples with inconsistent category and semantic similarity in the encoding process, a technical thought of confusion removal is introduced.

The detailed technical scheme of the invention is as follows:

the confusion-removing text hash algorithm for fusing hierarchical label information is characterized by comprising the following steps of:

s1: acquiring surface features of textEnglish feature, preferably, the surface layer features are expressed by using word frequency and inverse document frequency;

the surface featuresThe calculation method of (1) comprises the following steps:

s11: utilizing a coarse granularity word segmentation device of an open source word segmentation tool HanLP to segment texts in a language database;

s12: removing common Chinese stop words from word segmentation results, and obtaining word dictionary with front-ordered words from large to small according to word frequencyTF-IDF feature vector of each text as surface feature +.>；

S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>The method comprises the steps of carrying out a first treatment on the surface of the Category label information is available during the training phase and not available during the testing phase; the hierarchical text hash task is to learn a hash model to add the surface features +.>Mapping to a +.>Binary vector of dimension->The formula is +.>And satisfies the requirement for ++>Its corresponding hash code +>Distance is similar, for->Corresponding to itHash code->The distances are also similar;

s21: the constructed target hash code satisfies the following hierarchical distance based on the label:

given hash codeThe parent category and the sub-category to which the pairwise combination belongs satisfy the following conditions:

and />Respectively belonging to the same parent class and the same subclass;

and />Respectively belonging to different subclasses;

and />Respectively belonging to the same parent class and different subclasses;

and />Respectively belonging to different parent classes and different subclasses;

i.e., wherein />Andparent tags and child tags corresponding to the samples respectively;

then, hash codeThe constructed layering distance constraint is as follows:

；

wherein Super parameters for controlling the distance difference; />Is the Euclidean distance between two vectors, wherein +.>Two hash codes respectively referring to the calculated Euclidean distance; because of the binarycity of the hash codes, the excessive spans, namely 0 and 1, exist among the values of the corresponding dimensions of different hash codes, and optimization is still difficult to be performed when the Euclidean distance constraint is performed, so that the distance constraint is performed in a continuous space; due to arbitrary-> and />In positive correlation, the above hierarchical distance constraint condition is equivalent to satisfy the constraint relationship of the following two hierarchies:

；

wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>，/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed; />Is a super parameter;

distance constraint is carried out on each layer:

the hierarchical distance constraint translates into optimizing the following objective function:

in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively;is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />In training, the parameters of the Hash algorithm are updated in a twin network mode, and the method is hoped to ensure that under the same model parameters, the semantic similarity is kept and the same father and father are simultaneously realizedThe distance between the samples of the child is closer in the hash space than that between the samples of the father and the alien, and the distance between the samples of the father and the alien is closer in the hash space than that between the samples of the alien, so that the construction of the hierarchical similarity constraint in the hash space is completed.

The hash algorithm further comprises an S3 construction method for removing confusion constraint, which comprises the following steps:

for the core thought of the confusion removing supervision method, the method is divided into two parts, wherein firstly, samples with similar semantics but belonging to different categories need to be far away in the corresponding hash space, and secondly, samples with less similar semantics but belonging to the same category need to be similar in the corresponding hash space; consistent with the foregoing, by successive hidden vectorsTo construct hash +.>Similarity relation between:

let the samples of each parent and child class have a semantic center in hidden space.

S31: and calculating the semantic center based on the hidden vector of the sample of the category in the training process.

In the formula (3) of the present invention,representing parent class +.>Is a latent semantic center of (1); />Representation subclass->Is a latent semantic center of (1); />The function is calculated for the desired purpose.

S32: on the premise of defining various kinds of lingering semantic centers, calculating fuzzy radius based on the semantic centers for each parent class and each child class:

in the formula (4) of the present invention,is->Semantic blur radius of parent class; />Is->Semantic blur radius of subclass; />Is a super-parameter, and can be automatically adjusted according to the data; />The function is calculated for the desired purpose.

S33: the method is characterized in that the same class samples with the distance from the semantic center being larger than the fuzzy radius and the heterogeneous samples with the distance from the semantic center being smaller than the fuzzy radius are provided with confusing supervision information, and based on the concept, the distance constraint similar to the layering similarity constraint is introduced into the training step of each sample:

in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />The function is calculated for the desired purpose. and />The loss expectations of confusion removal and comparison under parent classes and subclasses obtained through semantic center calculation are respectively; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:

wherein and />Corresponding parent hidden vector fuzzy radius and sub-class hidden vector fuzzy radius respectively; likewise, the hash algorithm is optimized by adopting a twin network mode in the training process.

The two parts are key components of the invention, and the related super-parameter user can use default values and can also be customized to meet actual service requirements. Guan Chaocan of the sample similarity constraints involved and />Confusion-removing constraint super-ginseng and />And weight exceeding ginseng->And confusion boundary super-ginseng->The method can be automatically adjusted according to the data set and used for subsequent training of the confusion-removing text hash algorithm for fusing the hierarchical label information;

the distribution diagram of the blurred sample is shown in fig. 1;

the sample selection algorithm pseudo code of the above procedure is shown in fig. 2;

hash algorithmIntegration of semantic information of a document by Bernoulli variational self-encoder framework, according to Bayesian theorem, variational self-encoder can be optimized by maximizing variational lower bound, and can be converted into maximizing the following variational lower bound in a hash modelNamely maximizing the reconstructed original document and minimizing the KL divergence:

in the formula (6) of the present invention,the function is calculated for the desired purpose. The inference process is denoted as the hash algorithm encoding (approximate posterior) process, i.e. encoder +.>The encoded network parameters are denoted +.>The method comprises the steps of carrying out a first treatment on the surface of the The generation process is denoted as hash algorithm decoding process +.>The decoded network parameter is denoted +.>；/>A priori distribution without loss of generality; since the bernoulli variation self-encoder treats the a priori distribution as a multivariate bernoulli distribution, equation (6) is transformed in a specific process to optimize the objective function as follows:

in the formula (7) and the formula (8),reconstructing the loss for the document; KL is the relative entropy; />For a priori distributionIs a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;

the hash algorithm adopts a label prediction network to integrate the hash code into the basic sub-category information of the hash code, and optimizes the hash code by using the following objective function, whereinThe category to which the sample predicted for the network belongs:

through the above discussion, the hash algorithm proposed by the present invention is optimized by the following objective function:

in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from the encoder for the corresponding Bernoulli variation, including document reconstruction +.>Loss and relative entropy KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function;constraint loss for the computed hierarchical similarity; />A defrobulated constraint loss calculated for the middle; />Is a complete objective function of the hash algorithm.

The device for realizing the confusion-removing text hash algorithm for fusing the hierarchical label information is characterized in that:

the whole hash model framework is realized from the encoder framework based on the Knuds variation; encoder with a plurality of sensorsUsing a multi-layer perceptron, english multilayer perceptron, english shorthand MLP; the surface layer characteristics of the document are->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;

the coding process comprises the following steps: first, the surface layer is characterizedInputting the multi-layer perceptron to encode to obtain +.>

（11）

Then activating by using Sigmoid function to obtain hidden vector before binarization

（12）

In the formula (12)Representing an exponential function;

obtaining hash codes by sampling operations

（13）

In equation (13), the approximate posterior of the hidden vector correlation is expressed as

（14）

DecoderObtaining hidden vectors after linear transformation, then reconstructing the original characteristic representation through an activation function, and predicting the labels:

（15）

in the case of the formula (15),representing decoded network output vectorsCorresponding position +.>An output of (2); />For decoding the network parameter matrix +.>To correspond to->A single heat vector of dimensions; />Is a network bias term;

（16）

in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />A network parameter matrix is predicted; />To correspond to->A single heat vector of dimensions; />Is a network bias term;

（17）

（18）

in the formulas (11) - (18),in the calculation of the multi-label scene, the multi-label scene is replaced by a Sigmoid function, and the sampling process is carried outThe following form was used instead:

（19）

in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete hidden distribution sampling, and to use a straight-through estimator to process the non-conductive phenomenon of the binarization operation in the training process.

In specific implementation details, two layers of hidden layers each containing 1000 neurons activated by adopting a leak ReLU are used as a network basic structure of the MLP, and a dropout technology is adopted after the second layer to prevent the phenomenon that a hash model is over-fitted, wherein the dropout probability is set to be 0.1. Then, a sigmoid activation function is adopted to realize the approximation process of the A-test Bernoulli distribution, and a fixed threshold value is adoptedBy->And the function samples the binary hash code to complete the construction of the inferred network. For generating a network, a layer of linear transformation is adopted to reconstruct an original feature vector and a predictive subcategory label vector, a Softmax function is adopted to activate network output when the original feature vector is reconstructed, and a Softmax is used to activate network output for a single label scene when the subcategory label is predictedAnd activating, namely for the multi-label scene, activating by using a Sigmoid function, and replacing a loss calculation mode with Euclidean distance loss. The inference network (encoder) and the generation network (decoder) are only used in combination in the training phase to complete the optimization process under the framework of the variational self-encoder, and in the practical application phase, only the encoder is used for inputting the surface layer characteristics->And encoding to obtain the hash code.

The whole framework of the implementation of the hash algorithm device is shown in fig. 3, and the training process is shown in fig. 4.

The technical advantages of the invention include:

compared with the prior art, the method only used, the method can better adapt to a real neighbor search scene, and effectively uses the label information to construct a hierarchical hash space, so that the hash code can be consistent with the original semantic similarity, and meanwhile, the same label sample aggregation and the hierarchical spatial distribution of different label sample dispersion are presented in different layers.

The confusion removing constraint proposed by the invention according to the actual problem can help the hash model to weaken the influence caused by confusion information from inconsistent category and semantics in the training process, and effectively increase the robustness of the model in actual use.

The invention can also be used on other similar tasks in the field of natural language processing, such as hierarchical text representation learning, hierarchical text classification and other related tasks based on hierarchical label scenes.

Drawings

FIG. 1 is a schematic diagram of the distribution of fuzzy samples in a hierarchical hash scenario of the present invention;

FIG. 2 is an algorithmic pseudocode of a sample selection method for modeling samples and hierarchical similarity hash space construction in accordance with the present invention;

FIG. 3 is a schematic diagram of the overall design framework of the confusion-removed text hashing algorithm incorporating hierarchical label information in the present invention;

fig. 4 is a training algorithm pseudo code of the defrobulated text hashing algorithm incorporating hierarchical label information in accordance with the present invention.

Detailed Description

The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.

Example 1,

A defrobulated text hashing algorithm that fuses hierarchical tag information, comprising:

S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>The method comprises the steps of carrying out a first treatment on the surface of the Category label information is available during the training phase and not available during the testing phase; the hierarchical text hash task is to learn a hash model to add the surface features +.>Mapping to a +.>Binary vector of dimension->The formula is +.>And satisfies the requirement for ++>Its corresponding hash code +>Distance is similar, for->Its corresponding hash code +>The distances are also similar;

and />Respectively belonging to the same parent class and the same subclass;

and />Respectively belonging to different subclasses;

and />Respectively belonging to the same parent class and different subclasses;

i.e., wherein /> and />Parent tags and child tags corresponding to the samples respectively;

then, hash codeThe constructed layering distance constraint is as follows:

；

wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>，/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed;/>is a super parameter;

distance constraint is carried out on each layer:

in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively;is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />In training, the parameters of the hash algorithm are updated in a twin network mode, so that semantic similarity is expected to be kept under the same model parameters, meanwhile, the distance between samples of the same father and son in the hash space is closer than that between samples of the same father and son, the distance between samples of the same father and son in the hash space is closer than that between samples of different father and son, and therefore the construction of the hierarchical similarity constraint in the hash space is completed.

EXAMPLE 2,

The method for constructing the confusion-removing text hash algorithm for fusing hierarchical label information according to embodiment 1, further comprising S3 a construction method for the confusion-removing constraint, comprising:

let the samples of each parent and child class have a semantic center in hidden space,

s31: semantic center calculation based on hidden vectors of samples of this type in training process

in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />The function is calculated for the desired purpose. and />The loss expectations of confusion removal and comparison under the parent class and the child class obtained through semantic center calculation are respectively shown; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:

the distribution diagram of the blurred sample is shown in fig. 1;

the Hash algorithm integrates semantic information of a document through a Bernoulli variation self-encoder framework, and the variation self-encoder can be optimized by maximizing a variation lower bound according to the Bayesian theorem and can be converted into the maximized variation lower bound in a Hash modelNamely maximizing the reconstructed original document and minimizing the KL divergence:

in the formula (7) and the formula (8),reconstructing the loss for the document; KL is the relative entropy; />For a priori distribution->Is a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;

in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from the encoder for the corresponding Bernoulli variation, including document reconstruction +.>Loss and relative entropy KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function;constraint loss for the computed hierarchical similarity; />The resulting de-confusing constraint loss is calculated. />Is a complete objective function of the hash algorithm.

EXAMPLE 3,

An implementation device of a confusion-removing text hash algorithm for fusing hierarchical label information comprises:

the whole hash model framework is realized from the encoder framework based on the Knuds variation; encoder with a plurality of sensorsEnglish multilayer perceptron, english letter using multi-layer perceptronWriting MLP; the surface layer characteristics of the document are->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;

（11）

（12）

In the formula (12)Representing an exponential function;

obtaining hash codes by sampling operations

（13）

（14）

（15）

in the case of the formula (15),representing the corresponding position in the decoded network output vector +.>An output of (2); />For decoding the network parameter matrix +.>To correspond to->A single heat vector of dimensions; />Is a network bias term;

（16）

in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />To predictA network parameter matrix; />To correspond to->A single heat vector of dimensions; />Is a network bias term;

（17）

（18）

（19）

in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete the implicit distributed sampling. A pass-through estimator is used to handle the non-conductive phenomenon of the binarization operation in the training process.

In specific implementation details, two layers of hidden layers each containing 1000 neurons activated by adopting a leak ReLU are used as a network basic structure of the MLP, and a dropout technology is adopted after the second layer to prevent the phenomenon that a hash model is over-fitted, wherein the dropout probability is set to be 0.1. Thereafter, use is made ofsigmoid activation function to achieve approximation of Bernoulli posterior distribution and employ fixed thresholdBy->And the function samples the binary hash code to complete the construction of the inferred network. For generating a network, a layer of linear transformation is adopted to reconstruct an original feature vector and a predicted subcategory label vector, a Softmax function is adopted to activate network output when the original feature vector is reconstructed, a Softmax function is adopted to activate network output for a single-label scene when the subcategory label is predicted, a Sigmoid function is adopted to activate for a multi-label scene, and a loss calculation mode is replaced by Euclidean distance loss. The inference network (encoder) and the generation network (decoder) are only used in combination in the training phase to complete the optimization process under the framework of the variational self-encoder, and in the practical application phase, only the encoder is used for inputting the surface layer characteristics->And encoding to obtain the hash code.

In the above embodiment, the performance of the hash algorithm proposed by the present invention is quantitatively analyzed based on the 32-bit hash code:

training and testing the hash algorithm of the invention on a WOS data set, wherein the WOS data set adopts a data set from https:// data.mendeley.com/data/9 rw3vkcfy4/6, the test result is shown in table 1, wherein a Distance column is the Hamming Distance between other samples searched by a query sample and hash codes of the query sample, a Domain column is a large field, the data column can be understood as a parent class to which the sample belongs, the Area column is a subdivision field, the data column can be understood as a subclass to which the sample belongs, a Keywords column is a corresponding sample keyword, and a Code column is a corresponding sample hash Code:

table 1: test results (text is encoded into 32-bit hash codes based on the related technology of the invention, and the hash codes of the text to be queried are searched by using the hash codes of the query text to obtain test results in the following table).

As can be seen from Table 1, as the Hamming distance increases, the retrieved documents become less relevant. In addition, documents under the same parent class are closer in hash space than documents within different parent classes, even though they come from different subclasses, exhibiting hierarchical similarity relationships. Therefore, the result in the table 1 shows that the hash algorithm provided by the invention can encode the hash codes for effectively measuring the hierarchical similarity between the documents, and the hierarchical category label information is fully utilized.

In the embodiment, the retrieval performance evaluation of the hash model proposed herein is performed on 16-bit, 32-bit, 64-bit and 128-bit hash codes by using a precision@100 index based on a 20New Groups data set, wherein the 20New Groups data set is derived from http:// qword.com/-jason/20 New Groups; the Method column is the result of comparing the hash algorithm with the hash algorithm in the text, and the columns of 16bits, 32bits, 64bits and 128bits respectively represent the model performance on the hash codes of 16bits, 32bits, 64bits and 128 bits.

Table 2: precision@100 retrieval performance evaluation table

In Table 2, ours is the performance of the hash algorithm proposed by the present invention, and other prior art manifestations are cited in the IHDH Hash technical literature (Guo J N, mao X L, wei W, et al, intra-category aware hierarchical supervised document hashing [ J ]. IEEE Transactions on Knowledge and Data Engineering, 2022.).

As can be seen from table 2, the performance of the hash algorithm in this document exceeds that of the other prior art in comparison with the hash codes of four common lengths in the experiment, which fully demonstrates the advantages of the hash model proposed by the present invention.

Claims

1. The confusion-removing text hash algorithm for fusing hierarchical label information is characterized by comprising the following steps of:

s1: acquiring surface features of text；

S2: given a text surface feature set with hierarchical category labels, wherein />The TF-IDF characteristic vector is regularized by L2; /> and />The number of all different subclass labels and the number of different parent class labels in the data set are respectively; vector->Is>Personal component corresponds to subclass tag->Description when the component is 1The method comprises the steps of carrying out a first treatment on the surface of the Similarly, vector->Is>The individual components correspond to the parent class label +.>When the component is 1, it is stated +.>；

S3: a method of construction of a deconrobustable constraint, comprising: by successive hidden vectorsTo construct hash +.>The similarity relation between the parent category and the child category is set to have a semantic center in the hidden space.

2. The defrobulated text hashing algorithm of claim 1, wherein in S1, said surface features areThe calculation method of (1) comprises the following steps:

s12: removing Chinese stop words from word segmentation results, and obtaining word dictionary with front words according to word frequency from large to smallTF-IDF feature vector of each text as surface feature +.>；

The S2 further includes:

the constructed target hash code satisfies the following hierarchical distance based on the label:

and />Respectively belonging to the same parent class and the same subclass;

and />Respectively belonging to different subclasses;

and />Respectively belonging to the same parent class and different subclasses;

then, hash codeThe constructed layering distance constraint is as follows:

；

wherein Super parameters for controlling the distance difference; />Is the Euclidean distance between two vectors, wherein +.>Two hash codes respectively referring to the calculated Euclidean distance; the above hierarchical distance constraint is equivalent to satisfying the following two hierarchical constraint relationships:

；

wherein ,hash code for corresponding anchor sample>Hidden vector before binarization, +.>，/>Correspondingly constructing a hidden vector set for the whole training set sample; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample heteroparent class and the anchor point sample heterosubclass;a hidden vector set which is obtained by random sampling and is constructed corresponding to the anchor point sample and the parent class and the subclass; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample heterology is constructed; />Is a super parameter;

distance constraint is carried out on each layer:

in the formulas (1) and (2),constraint loss for total layering distance on training set; />A binarization pre-hidden vector corresponding to the anchor point sample; /> and />The binary pre-hidden vectors which are close to and far from the anchor hidden vector are hoped to be respectively; />Is a super parameter for controlling the size of the distance difference; />Constraint duty ratio super parameters for parent class distance; />Calculating a function for the desire; and />Respectively calculating the comparison loss expectations under the parent class and the sub class layers through anchor point samples; the corresponding super parameter is-> and />。

3. The confusion-removing text hashing algorithm for fusing hierarchical label information according to claim 2, wherein S3 specifically comprises:

s31: in the training process, calculating semantic centers based on hidden vectors of the samples of the category:

in the formula (3) of the present invention,representing parent class +.>Is a latent semantic center of (1); />Representation subclass->Is a latent semantic center of (1); />Calculating a function for the desire;

s32: performing semantic center-based fuzzy radius calculation on each parent class and subclass:

in the formula (4) of the present invention,is->Semantic blur radius of parent class; />Is->Semantic fuzzy half of subclassesDiameter is as follows;is a super ginseng; />Calculating a function for the desire;

s33: a distance constraint similar to the hierarchical similarity constraint is introduced in the training step of each sample:

in the formula (5) of the present invention,total defrobulated supervision constraint loss on the training set; />Calculating a function for the desire; and />The loss expectations of confusion removal and comparison under parent classes and subclasses obtained through semantic center calculation are respectively; /> and />Respectively associating a specific parent class semantic center and a specific sub-class semantic center; />A hidden vector set which is obtained by random sampling and corresponds to the anchor point sample and the parent class and the subclass sample is constructed; />A hidden vector set which is obtained by random sampling and corresponds to the heterofather class and heteroson class sample is constructed; />Hidden vector set constructed corresponding to anchor point sample and father and subclass sample obtained by random sampling +.>The hidden vector set which is obtained for random sampling and is constructed corresponding to the anchor point sample and the parent class and the class of the same class of the different class sample meets the following conditions:

4. The defrobulated text hashing algorithm of claim 2 wherein the method of optimizing the hashing algorithm includes:

maximizing the lower bound of the variationNamely maximizing the reconstructed original document and minimizing the KL divergence:

in the formula (6) of the present invention,calculating a function for the desire; the inference procedure is denoted as hash algorithm coding procedure, i.e. encoder +.>The encoded network parameters are denoted +.>The method comprises the steps of carrying out a first treatment on the surface of the The generation process is denoted as hash algorithm decoding process +.>The decoded network parameter is denoted +.>；/>A priori distribution without loss of generality; since the bernoulli variation self-encoder treats the a priori distribution as a multivariate bernoulli distribution, equation (6) is transformed in a specific process to optimize the objective function as follows:

in the formula (7) and the formula (8),reconstructing the loss for the document; />For a priori distribution->Is a characteristic value of (2); />The method comprises the steps of reconstructing a characteristic vector for a decoding network; />Is a cross entropy loss function;

the hash algorithm is optimized by the following objective function:

in the formulas (9) and (10), and (2)>Predicting losses for corresponding sub-class labels; />Predicting network parameters for the tag; />Loss from encoder for corresponding Bernoulli variation, including document reconstructionLoss and KL loss; />Respectively obtaining weight super-parameters corresponding to each part of objective function; />Constraint loss for the computed hierarchical similarity; />A defrobulated constraint loss calculated for the middle; />Is a complete objective function of the hash algorithm.

5. The device for realizing the confusion-removing text hash algorithm for fusing the hierarchical label information is characterized in that:

encoder with a plurality of sensorsUsing a multi-layer perceptron, characterizing the surface layer of the document->Mapping to a continuous hidden distribution space, and obtaining a hash code by sampling in the coded posterior multiple Bernoulli distribution;

the coding process comprises the following steps: first, the surface layer is characterizedInputting into a multi-layer sensing machineCoding get->

（11）

（12）

In the formula (12)Representing an exponential function;

obtaining hash codes by sampling operations

（13）

（14）

DecoderUsing linear transformation to obtain hidden vector, then reconstructing original characteristic representation by activating function and making labelAnd (3) predicting:

（15）

in the case of the formula (15),representing the corresponding position in the decoded network output vector +.>An output of (2); />In order to decode the network parameter matrix,to correspond to->A single heat vector of dimensions; />Is a network bias term;

（16）

in the formula (16) of the present invention,representing the corresponding position in the predicted network output vector>An output of (2); />A network parameter matrix is predicted;to correspond to->A single heat vector of dimensions; />Is a network bias term;

（17）

（18）

（19）

in the formula (19) of the present invention,is a fixed threshold sample value, ">To complete the implicit distributed sampling.