CN116522271A - Feature fusion model processing and sample retrieval methods and devices and computer equipment

Info

Publication number
CN116522271A
Authority
CN
China
Prior art keywords
feature
semantic
features
fusion
target
Prior art date
Legal status
Pending
Application number
CN202210036856.9A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202210036856.9A
Publication of CN116522271A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a feature fusion model processing method, apparatus, computer device, storage medium and computer program product, and to a sample retrieval method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: acquiring semantic features and non-semantic features corresponding to training samples; taking at least one of the semantic features and the non-semantic features as candidate features, and determining activated feature components from the feature components contained in each candidate feature; counting the activated feature components belonging to the same feature bit, and determining the activation degree of each feature bit according to the counting result; setting the feature components corresponding to feature bits whose activation degree meets a preset condition in each candidate feature to a non-activation value to obtain each target feature; performing fusion processing on each target feature using a feature fusion model to obtain each fused feature; and training the feature fusion model based on each fused feature. By adopting the method, the accuracy of sample features can be improved.

Description

Feature fusion model processing and sample retrieval methods and devices and computer equipment
Technical Field
The present invention relates to the field of computer technology, and in particular to a feature fusion model processing method, apparatus, computer device, storage medium and computer program product, and to a sample retrieval method, apparatus, computer device, storage medium and computer program product.
Background
With the development of computer technology, machine learning techniques have emerged, by which various machine learning models can be trained. For example, a model for obtaining sample features can be trained; such a model obtains the features corresponding to an input sample, and similar samples can then be retrieved from a database based on those features.
In conventional technology, a model can extract bottom-layer features from an input sample. However, because bottom-layer features lack semantic measurement capability, they cannot accurately characterize the sample, and the resulting accuracy is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a feature fusion model processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the accuracy of the obtained features.
In one aspect, the present application provides a feature fusion model processing method. The method comprises the following steps: acquiring semantic features and non-semantic features corresponding to training samples in a training sample set; determining an activated feature component from feature components contained in each candidate feature by taking at least one of the semantic feature and the non-semantic feature as the candidate feature; counting the activated feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the counting result; respectively setting feature components corresponding to feature bits with activation degrees meeting preset conditions in the candidate features as non-activation values to obtain target features; based on the target features, carrying out fusion processing by utilizing a feature fusion model to be trained to obtain fusion features; training the feature fusion model based on each fusion feature, and obtaining a target fusion model when the training stop condition is met; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of an input sample to obtain target fusion features.
On the other hand, the application also provides a feature fusion model processing device. The device comprises: a feature acquisition module, configured to acquire semantic features and non-semantic features corresponding to training samples in a training sample set; an activation component determining module, configured to determine activated feature components from the feature components included in each candidate feature by taking at least one of the semantic features and the non-semantic features as candidate features; an activation degree determining module, configured to count the activated feature components belonging to the same feature bit and determine the activation degree of each feature bit corresponding to the candidate features according to the counting result; a target feature obtaining module, configured to set the feature components corresponding to feature bits whose activation degree meets a preset condition in each candidate feature to a non-activation value to obtain each target feature; a fusion processing module, configured to perform fusion processing based on each target feature using the feature fusion model to be trained to obtain each fusion feature; and a parameter adjustment module, configured to train the feature fusion model based on each fusion feature and obtain a target fusion model when a training stop condition is met; the target fusion model is used for performing fusion processing on semantic features and non-semantic features of an input sample to obtain a target fusion feature.
In another aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the feature fusion model processing method when executing the computer program.
In another aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the feature fusion model processing method described above.
In another aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the feature fusion model processing method described above.
According to the above feature fusion model processing method, apparatus, computer device, storage medium and computer program product, on one hand, the semantic features and the non-semantic features of an input sample can be fused by the target fusion model to obtain a target fusion feature; the target fusion feature has both semantic measurement capability and non-semantic measurement capability, can characterize the sample more accurately, and is therefore highly accurate. On the other hand, in the process of training the target fusion model, at least one of the semantic features and the non-semantic features is taken as candidate features, activated feature components are determined from the feature components contained in each candidate feature, the activated feature components belonging to the same feature bit are counted, the activation degree of each feature bit corresponding to each candidate feature is determined according to the counting result, the feature components corresponding to feature bits whose activation degree meets a preset condition are set to a non-activation value in each candidate feature to obtain each target feature, and fusion processing is performed on the target features using the feature fusion model to be trained. This reduces the influence of low-activation feature bits on the training process, allows the model to better learn key feature bits, and enables the trained target fusion model to better fuse features, further improving the accuracy of the target fusion feature.
In another aspect, the present application provides a sample retrieval method. The method comprises the following steps: acquiring semantic features and non-semantic features corresponding to the query sample; inputting the semantic features and the non-semantic features into a target fusion model for fusion processing to obtain target fusion features corresponding to the query sample; the target fusion model is obtained when a feature fusion model to be trained is trained based on each fusion feature until a training stop condition is met; each fusion feature is obtained by fusion processing based on each target feature and by utilizing the feature fusion model; each target feature is obtained by respectively setting a feature component corresponding to a feature bit with the activation degree meeting a preset condition in each candidate feature as a non-activation value; the respective activation degree of each feature bit is determined according to the statistical result by counting the activation feature components belonging to the same feature bit; the activation feature component is determined from feature components contained in each of the candidate features by taking at least one of the semantic feature and the non-semantic feature as a candidate feature; the semantic features and the non-semantic features correspond to training samples in the training sample set; and retrieving a target retrieval sample corresponding to the query sample from a sample database based on the target fusion characteristic.
In another aspect, the present application provides a sample retrieval apparatus. The device comprises: a feature acquisition module, configured to acquire semantic features and non-semantic features corresponding to a query sample; a fusion processing module, configured to input the semantic features and the non-semantic features into a target fusion model for fusion processing to obtain a target fusion feature corresponding to the query sample, where the target fusion model is obtained by training a feature fusion model to be trained based on each fusion feature until a training stop condition is met, each fusion feature is obtained by performing fusion processing on each target feature using the feature fusion model, each target feature is obtained by setting the feature components corresponding to feature bits whose activation degree meets a preset condition in each candidate feature to a non-activation value, the activation degree of each feature bit is determined by counting the activated feature components belonging to the same feature bit, the activated feature components are determined from the feature components contained in each candidate feature by taking at least one of the semantic features and the non-semantic features as candidate features, and the semantic features and the non-semantic features correspond to training samples in a training sample set; and a retrieval module, configured to retrieve a target retrieval sample corresponding to the query sample from a sample database based on the target fusion feature.
In another aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the sample retrieval method described above when the processor executes the computer program.
In another aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the sample retrieval method described above.
In another aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the sample retrieval method described above.
According to the above sample retrieval method, apparatus, computer device, storage medium and computer program product, on one hand, the semantic features and the non-semantic features of the query sample can be fused by the target fusion model to obtain a target fusion feature; the target fusion feature has both semantic measurement capability and non-semantic measurement capability and can characterize the sample more accurately, so retrieval based on the target fusion feature yields highly accurate results. On the other hand, the target fusion model is obtained by training the feature fusion model to be trained based on each fusion feature until the training stop condition is met; each fusion feature is obtained by performing fusion processing on each target feature using the feature fusion model; each target feature is obtained by setting the feature components corresponding to feature bits whose activation degree meets the preset condition in each candidate feature to a non-activation value; the activation degree of each feature bit is determined by counting the activated feature components belonging to the same feature bit; and the activated feature components are determined from the feature components contained in each candidate feature by taking at least one of the semantic features and the non-semantic features as candidate features. Therefore, the influence of low-activation feature bits on the training process can be reduced, the model can better learn key feature bits, the trained target fusion model can better fuse features, and the sample retrieval accuracy is further improved.
Drawings
FIG. 1 is an application environment diagram of a feature fusion model processing method in one embodiment;
FIG. 2 is a flow chart of a feature fusion model processing method in one embodiment;
FIG. 3 is a schematic diagram of a feature bit distribution of a training sample in one embodiment;
FIG. 4 is a flow chart of a feature fusion model processing method in another embodiment;
FIG. 5 is a schematic diagram of a fusion process in one embodiment;
FIG. 6 is a flow chart of a training step of the feature extraction model in one embodiment;
FIG. 7 is a schematic diagram of a similar sample in one embodiment;
FIG. 8 is a schematic diagram of a similar sample in another embodiment;
FIG. 9 is a sample schematic generated by an image attack in one embodiment;
FIG. 10 is a flow chart of a sample retrieval method in one embodiment;
FIG. 11 is a schematic diagram of a model structure in one embodiment;
FIG. 12 is a block diagram of a feature fusion model processing device in one embodiment;
FIG. 13 is a block diagram of a sample retrieval apparatus in one embodiment;
FIG. 14 is an internal block diagram of a computer device in one embodiment;
FIG. 15 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize and measure targets, and performs further graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, and intelligent transportation, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computer devices simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computer devices with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The solutions provided by the embodiments of the present application relate to artificial intelligence technologies such as computer vision and machine learning, and are described in detail through the following embodiments:
The feature fusion model processing method and the sample retrieval method provided by the present application can be applied to the application environment shown in FIG. 1, where the terminal 102 communicates with the server 104 via a communication network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or placed on a cloud or other network server. The terminal 102 may be, but is not limited to, a notebook computer, a smart phone, a tablet computer, a desktop computer, a smart television, a vehicle-mounted terminal, or a portable wearable device. An application may be installed on the terminal, through which an input sample can be retrieved to obtain a target retrieval sample. The application may be a client installed in the terminal (also called an application client or APP client), i.e., a program installed and running in the terminal; it may also be an installation-free application, i.e., an application that can be used without downloading and installing, commonly referred to as an applet, which typically runs as a subroutine in a client; or it may be a web application opened through a browser; and so on. The server 104 may be implemented as a stand-alone server or as a server cluster or cloud server composed of multiple servers. It may be understood that the embodiments of the present application do not limit the number of terminals and servers; there may be one or more of each, set as needed, where a plurality refers to at least two.
Both the terminal 102 and the server 104 may be used separately to perform the feature fusion model processing and sample retrieval methods provided in embodiments of the present application.
For example, the server 104 may obtain semantic features and non-semantic features corresponding to training samples in the training sample set, take at least one of the semantic features and the non-semantic features as candidate features, determine an activated feature component from feature components included in each candidate feature, count activated feature components belonging to a same feature bit, determine respective activation degrees of feature bits corresponding to each candidate feature according to a statistical result, respectively set feature components corresponding to feature bits with activation degrees meeting a preset condition in each candidate feature as non-activation values, obtain each target feature, perform fusion processing by using a feature fusion model to be trained based on each target feature, obtain each fusion feature, train a feature fusion model based on each fusion feature, and obtain a target fusion model when a training stop condition is met; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of the input sample to obtain target fusion features.
The server 104 may obtain semantic features and non-semantic features corresponding to the query sample, input the semantic features and the non-semantic features into a target fusion model for fusion processing, obtain target fusion features corresponding to the query sample, and retrieve a target retrieval sample corresponding to the query sample from a sample database based on the target fusion features.
The terminal 102 and the server 104 may also cooperate to perform the feature fusion model processing and sample retrieval methods provided in embodiments of the present application.
For example, the server 104 may obtain a training sample from the terminal, after training to obtain a target fusion model, the server 104 may send the target fusion model obtained by training to the terminal, and the terminal performs fusion processing on semantic features and non-semantic features of the input sample based on the target fusion model to obtain target fusion features. The terminal can further retrieve a target retrieval sample corresponding to the input sample from a database of the server based on the extracted target fusion feature.
In one embodiment, as shown in FIG. 2, a feature fusion model processing method is provided. The method is described using a computer device as the execution body; it is understood that the computer device may be the terminal 102 or the server 104 shown in FIG. 1. In this embodiment, the feature fusion model processing method includes the following steps:
Step 202, obtaining semantic features and non-semantic features corresponding to training samples in a training sample set.
The training sample set comprises a plurality of training samples. A training sample is a content sample used for training the feature extraction model, where the content can be any of text, audio or images. Non-semantic features are features without semantic measurement capability; they can come from the bottom layer of a deep learning model and, taking an image training sample as an example, can include characterization information describing image texture and feature layout. Semantic features are features that have semantic measurement capability; owing to this capability, they can be used to classify input samples. Taking an image training sample as an example, semantic features can include representations describing specified semantic content in the image: for features describing a dog, for instance, the features at the positions where the dog appears in the image can be extracted as semantic features.
Specifically, the computer device may obtain a trained target semantic feature extraction model and a trained target non-semantic feature extraction model, and input training samples in the training sample set into the target semantic feature extraction model and the target non-semantic feature extraction model respectively, so as to obtain semantic features and non-semantic features corresponding to the training samples.
In one embodiment, both the semantic features and the non-semantic features are embedding features, which may be floating-point features or binary quantized features, i.e., hash features. Binary quantization refers to the process of binary-encoding a feature; for example, a feature may be encoded as a binary code with values of 0 and 1.
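To make the binary quantization concrete, the following minimal Python sketch binarizes a floating-point embedding into a hash feature. It is illustrative only: the threshold of 0 and the ±1 coding (matching the activation and non-activation values used later in this description) are assumptions.

```python
import numpy as np

def binary_quantize(feature: np.ndarray) -> np.ndarray:
    # Map each floating-point component to an activation value (1) or a
    # non-activation value (-1); the threshold of 0 is an assumption.
    return np.where(feature > 0, 1, -1)

float_feature = np.array([0.8, -0.3, 0.1, -0.9])
print(binary_quantize(float_feature))  # [ 1 -1  1 -1]
```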
In step 204, at least one of the semantic features and the non-semantic features is used as a candidate feature, and an activated feature component is determined from feature components included in each candidate feature.
A feature component is a feature value that makes up a feature. For example, for a 1×N feature vector, each dimension is one feature component, so the feature contains N feature components in total. An activated feature component is a feature component that is activated, i.e., whose value corresponds to the activation value.
In one embodiment, the computer device may determine the activation feature component from feature components contained in the semantic features corresponding to each of the training samples in the set of training samples using the semantic features as candidate features. In one embodiment, the computer device may determine the activation feature component from feature components contained in the non-semantic features corresponding to each of the training samples in the set of training samples using the non-semantic features as candidate features.
In other embodiments, the computer device may determine the active feature component from feature components included in the semantic features corresponding to each of the training samples in the set of training samples, and determine the active feature component from feature components included in the non-semantic features corresponding to each of the training samples in the set of training samples, using both the semantic features and the non-semantic features as candidate features. In specific implementation, the computer device may map the feature components included in each semantic feature and the feature components included in each non-semantic feature according to a preset mapping manner to obtain respective mapping feature values of each feature component, compare the respective mapping feature values of each feature component with the activation values set by the preset mapping manner, and determine the feature component corresponding to the mapping feature value with consistent comparison as the activation feature component.
And 206, counting the activated feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the counting result.
A feature bit refers to the position of a feature component within a feature. Activated feature components belonging to the same feature bit are activated feature components at the same position in different features. For example, assume feature 1 is (a1, a2, a3, a4) and feature 2 is (b1, b2, b3, b4); then feature components a1 and b1 belong to the same feature bit, a2 and b2 belong to the same feature bit, a3 and b3 belong to the same feature bit, and a4 and b4 belong to the same feature bit. If both a1 and b1 are activated feature components, then a1 and b1 are activated feature components belonging to the same feature bit. The feature bits corresponding to the candidate features are the feature bits included in the candidate features; for example, the feature bits corresponding to feature 1 and feature 2 are the positions of the first, second, third and fourth dimensions of the feature vector. The activation degree of a feature bit characterizes the influence of the feature components belonging to that feature bit on the model learning process: the higher the activation degree, the larger the influence; conversely, the lower the activation degree, the smaller the influence.
Specifically, after determining the activated feature components from feature components contained in each candidate feature, the computer device counts activated feature components belonging to the same feature bit in the activated feature components, and according to the statistics result, the respective activation degree of each feature bit corresponding to each candidate feature can be determined.
In one embodiment, counting the activated feature components belonging to the same feature bit may specifically be calculating the number of activated feature components belonging to the same feature bit, and the resulting statistic may be the total number of activated feature components for each feature bit. For example, as shown in FIG. 3, assume that the training sample set includes 4 training samples and the candidate features are semantic features: the semantic features of training sample 1 are (a1, a2, a3, a4), those of training sample 2 are (b1, b2, b3, b4), those of training sample 3 are (c1, c2, c3, c4), and those of training sample 4 are (d1, d2, d3, d4), where b1, c1, a2, b2, d2 and b4 are activated feature components. Then the total number of activated feature components belonging to feature bit 1 is 2, that belonging to feature bit 2 is 3, that belonging to feature bit 3 is 0, and that belonging to feature bit 4 is 1.
In one embodiment, the computer device may count the number of activated feature components belonging to the same feature bit, obtain respective target numbers of each feature bit corresponding to each candidate feature, respectively count the total number of feature components belonging to each feature bit, and determine respective activation degrees of each feature bit based on the respective target numbers and the total number of each feature bit.
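As an illustrative sketch only (not the claimed implementation), the per-feature-bit statistics over a batch of candidate features could be computed as follows, using the FIG. 3 example and assuming a ±1 coding:

```python
import numpy as np

# Each row is one candidate feature (e.g., the semantic feature of one
# training sample); components are 1 (activated) or -1 (non-activated),
# matching FIG. 3, where b1, c1, a2, b2, d2 and b4 are activated.
candidate_features = np.array([
    [-1,  1, -1, -1],  # training sample 1: a2 activated
    [ 1,  1, -1,  1],  # training sample 2: b1, b2, b4 activated
    [ 1, -1, -1, -1],  # training sample 3: c1 activated
    [-1,  1, -1, -1],  # training sample 4: d2 activated
])

# Activation number per feature bit (column-wise count of activated values).
activation_number = (candidate_features == 1).sum(axis=0)
# Feature total number per feature bit (one component per training sample).
feature_total = np.full(candidate_features.shape[1], candidate_features.shape[0])
print(activation_number)  # [2 3 0 1]
print(feature_total)      # [4 4 4 4]
```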
And step 208, respectively setting the feature components corresponding to the feature bits with the activation degree meeting the preset conditions in each candidate feature as non-activation values to obtain each target feature.
The non-activation value is a predefined value used to mark non-activated feature components, and the activation value is a predefined value used to mark activated feature components. For example, if both the semantic features and the non-semantic features are hash features, the activation value may be 1 and the non-activation value may be -1. The feature bits whose activation degree satisfies the preset condition, i.e., the low-activation feature bits, may be feature bits whose activation degree is smaller than a preset activation degree threshold, or the feature bits ranked before a preset sorting threshold when the activation degrees of the feature bits are sorted from small to large. The activation degree threshold may be set as needed.
Considering that meaningless feature information may exist in the semantic features and the non-semantic features during feature fusion, and in order to reduce the harm that ineffective feature bits do to the learning of key feature bits, low-activation feature bits can be screened out according to whether the features are effective at fusion time and then discarded in the post-fusion learning process, so that a large number of low-activation feature bits do not affect the high-activation parts of the spliced features.
In one embodiment, the computer device may compare the respective activation levels of the respective feature bits with a preset activation level threshold to determine feature bits having an activation level less than the preset activation level threshold, randomly select a preset number of feature bits from the feature bits to obtain target feature bits, and set feature components of the target feature bits to inactive values in the respective candidate features.
In another embodiment, the computer device may sort the respective activation degrees of the feature bits from small to large, randomly select a preset number of feature bits from feature bits corresponding to the activation degrees of the N preceding bits, to obtain target feature bits, and set feature components of the target feature bits to inactive values in each candidate feature.
Step 210, fusion processing is performed based on each target feature and by using a feature fusion model to be trained, so as to obtain each fusion feature.
The feature fusion model to be trained is a feature fusion model whose parameters have not yet been trained. A feature fusion model is a machine learning model that performs fusion processing on features.
In one embodiment, the computer device only takes the semantic features as candidate features, so that target features corresponding to the semantic features can be obtained, the computer device further splices the target features and the non-semantic features belonging to the same training sample to obtain spliced features, and the spliced features are input into a feature fusion model to be subjected to fusion processing to obtain fusion features corresponding to each training sample.
In one embodiment, the computer device only takes the non-semantic features as candidate features, so that target features corresponding to the non-semantic features can be obtained, the computer device further splices the target features and the semantic features belonging to the same training sample to obtain spliced features, and the spliced features are input into a feature fusion model to be subjected to fusion processing to obtain fusion features corresponding to each training sample.
In one embodiment, the computer device may use both the semantic feature and the non-semantic feature as candidate features, so that the target feature corresponding to the non-semantic feature and the target feature corresponding to the semantic feature may be obtained, the computer device further performs stitching on two target features belonging to the same training sample to obtain a stitched feature, and inputs the stitched feature into a feature fusion model to perform fusion processing, so as to obtain a fusion feature corresponding to each training sample.
Step 212, training a feature fusion model based on each fusion feature, and obtaining a target fusion model when the training stop condition is met; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of the input sample to obtain target fusion features.
The training stop condition may be that the model parameters no longer change, that the loss reaches a minimum value, that the number of training iterations reaches the maximum number of iterations, that the training time reaches a preset duration, and so on.
Specifically, the computer device may determine a target loss based on each fusion feature, adjust model parameters of the feature fusion model based on the target loss and continue training, and obtain the target fusion model when a training stop condition is satisfied. The target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of an input sample to obtain target fusion features, and the target fusion features are used for carrying out sample retrieval. Wherein, the sample retrieval refers to retrieving similar samples corresponding to the input samples in the database. The database may be deduplicated by sample retrieval.
In one embodiment, during training, the computer device may adjust the model parameters of the feature fusion model using any of a stochastic gradient descent algorithm, the AdaGrad (Adaptive Gradient) algorithm, AdaDelta (an improvement of the AdaGrad algorithm), RMSProp (another improvement of the AdaGrad algorithm), the Adam (Adaptive Moment Estimation) algorithm, and the like.
In one embodiment, the computer device may obtain a contrast training sample corresponding to the training sample, obtain the contrast semantic features and contrast non-semantic features corresponding to the contrast training sample, take at least one of the contrast semantic features and contrast non-semantic features as contrast candidate features, determine activated feature components from the feature components included in each contrast candidate feature, count the activated feature components belonging to the same feature bit, determine the activation degree of each feature bit corresponding to each contrast candidate feature according to the counting result, set the feature components corresponding to feature bits whose activation degree satisfies the preset condition in each contrast candidate feature to the non-activation value to obtain each contrast target feature, perform fusion processing on the contrast target features using the feature fusion model to be trained to obtain each contrast fusion feature, and calculate the target loss based on the difference between the fusion feature and the contrast fusion feature corresponding to the same training sample. The contrast training sample may be at least one of a positive contrast training sample and a negative contrast training sample. The fusion feature and contrast fusion feature corresponding to the same training sample are the fusion feature of that training sample and the contrast fusion feature of its contrast training sample; for example, if the fusion feature of a training sample A is feature 1, the contrast training sample of A is training sample B, and the fusion feature of B is feature 2, then both feature 1 and feature 2 correspond to training sample A.
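The description does not specify the exact form of the target loss. As a hedged illustration only, a margin-based loss over the fusion features of a training sample and its positive and negative contrast training samples might be sketched as follows; the function name, the margin value, and the use of a triplet-style hinge are all assumptions, not the claimed loss:

```python
import torch
import torch.nn.functional as F

def target_loss(anchor_fusion, positive_fusion, negative_fusion, margin=0.2):
    # Distance between the fusion feature of a training sample and the
    # contrast fusion features of its positive / negative contrast samples.
    d_pos = F.pairwise_distance(anchor_fusion, positive_fusion)
    d_neg = F.pairwise_distance(anchor_fusion, negative_fusion)
    # Margin-based hinge: pull positives close, push negatives apart.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```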
In the above feature fusion model processing method, on one hand, the semantic features and the non-semantic features of an input sample can be fused by the target fusion model to obtain a target fusion feature; the target fusion feature has both semantic measurement capability and non-semantic measurement capability, can characterize the sample more accurately, and is therefore highly accurate. On the other hand, in the process of training the target fusion model, at least one of the semantic features and the non-semantic features is taken as candidate features, activated feature components are determined from the feature components contained in each candidate feature, the activated feature components belonging to the same feature bit are counted, the activation degree of each feature bit corresponding to each candidate feature is determined according to the counting result, the feature components corresponding to feature bits whose activation degree meets the preset condition are set to the non-activation value in each candidate feature to obtain each target feature, and fusion processing is performed on the target features using the feature fusion model to be trained. This reduces the influence of low-activation feature bits on the training process, allows the model to better learn key feature bits, and enables the trained target fusion model to better fuse features, further improving the accuracy of the target fusion feature.
In one embodiment, counting the activated feature components belonging to the same feature bit and determining the activation degree of each feature bit corresponding to each candidate feature according to the counting result includes: counting the number of activated feature components belonging to the same feature bit to obtain the activation number of each feature bit corresponding to each candidate feature; counting the total number of feature components belonging to each feature bit; and determining the activation degree of each feature bit based on the activation number and the feature total number of each feature bit.
The activation number of a feature bit characterizes how many activated feature components belong to that feature bit, and the feature total number of a feature bit characterizes how many feature components belong to it in total. For example, with continued reference to FIG. 3, feature bit 1 has an activation number of 2 and a feature total number of 4.
Specifically, the computer device may count the number of activated feature components belonging to the same feature bit to obtain the activation number of each feature bit included in the candidate features, further count the total number of feature components belonging to each feature bit, and determine the influence of each feature bit on the model learning process based on its activation number and feature total number, thereby obtaining the activation degree of each feature bit.
In one embodiment, the computer device may calculate the ratio between the activation number and the feature total number to determine the activation proportion of each feature bit, and may then determine the activation degree of each feature bit based on its activation proportion. The activation proportion of a feature bit indicates what proportion of the training samples the feature components belonging to that feature bit can distinguish; the closer this proportion is to 0.5, the larger the influence of the feature bit and the higher its activation degree.
In the above embodiment, by counting the activation number of each feature bit and the total number of feature components belonging to it, the activation degree of each feature bit can be quickly determined.
In one embodiment, determining the activation degree of each feature bit based on its activation number and feature total number comprises: determining the activation proportion of each feature bit based on its activation number and feature total number; calculating the activation entropy of each feature bit based on its activation proportion; and determining the activation entropy of each feature bit as its activation degree.
The activation proportion of a feature bit characterizes the proportion of activated feature components among all feature components belonging to that feature bit. For example, in FIG. 3, the activation proportion corresponding to feature bit 1 is 2/4 = 0.5.
Specifically, after determining the activation proportion of each feature bit based on its activation number and feature total number, the computer device may calculate the activation entropy of each feature bit with reference to the following formula (1), where H denotes the activation entropy and p_a denotes the activation proportion:

H = -p_a * log(1 - p_a)    (1)
The activation entropy may be used to represent the respective information amount of each feature bit, and the larger the information amount is, the larger the influence in the model learning process is, so in this embodiment, the computer device may determine the respective activation entropy of each feature bit as the respective activation degree of each feature bit.
In the above embodiment, the activation proportion is determined by calculation, the activation entropy is determined based on the activation proportion, and since the activation entropy can characterize the information quantity, the activation entropy is used as the activation degree, and the influence of the feature bit in the model learning process can be accurately expressed.
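A minimal sketch of this computation, assuming formula (1) is applied per feature bit exactly as written:

```python
import numpy as np

def activation_entropy(activation_number, feature_total, eps=1e-12):
    # Activation proportion p_a of each feature bit.
    p_a = activation_number / feature_total
    # Activation entropy per formula (1): H = -p_a * log(1 - p_a).
    # eps guards against log(0) when p_a is exactly 1.
    return -p_a * np.log(1.0 - p_a + eps)

# Using the activation numbers from the FIG. 3 example.
H = activation_entropy(np.array([2, 3, 0, 1]), np.array([4, 4, 4, 4]))
# Each H value serves as the activation degree of the corresponding bit.
```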
In one embodiment, determining activated feature components from the feature components contained in each candidate feature comprises: mapping the feature components contained in each candidate feature according to a preset mapping manner to obtain the mapped feature value of each feature component; comparing the mapped feature value of each feature component with the activation value set by the preset mapping manner; and determining the feature components whose mapped feature values match the activation value as activated feature components.
In this embodiment, the semantic features and the non-semantic features are 1×N vectors whose values tend to -1 or 1. The preset mapping manner is a preset way of mapping the values in the semantic features and the non-semantic features to the activation value or the non-activation value, and can be set as needed. In one embodiment, the computer device may map the feature components contained in each candidate feature with reference to the following formula (2), where Q_i is the i-th feature component of the candidate feature and B_i is the mapped feature value of the i-th feature component:

B_i = 1, if Q_i > 0; otherwise B_i = -1    (2)
After the mapped feature values are obtained, the computer device compares the mapped feature value of each feature component with the activation value set by the preset mapping manner, and the feature components whose mapped feature values match the activation value are determined to be activated feature components. For example, when the activation value set by the preset mapping manner is 1, a feature component whose mapped feature value is 1 is an activated feature component.
In the above embodiment, by obtaining the mapped feature value of each feature component and comparing it with the activation value set by the preset mapping manner, the activated feature components can be determined from the feature components contained in each candidate feature.
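As a short illustrative sketch of this step (the sign threshold in formula (2) is an assumption, as noted above):

```python
def activated_component_indices(candidate_feature, activation_value=1):
    # Map each feature component per formula (2), then keep the feature
    # bits whose mapped feature value matches the activation value set
    # by the preset mapping manner.
    mapped = [1 if q > 0 else -1 for q in candidate_feature]
    return [i for i, b in enumerate(mapped) if b == activation_value]

print(activated_component_indices([0.9, -0.8, 0.7, -0.6]))  # [0, 2]
```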
In one embodiment, setting the feature components corresponding to feature bits whose activation degree satisfies the preset condition in each candidate feature to the non-activation value to obtain each target feature includes: sorting the feature bits based on their activation degrees, and determining the feature bits whose activation degree satisfies the preset condition based on the sorting result to obtain target feature bits; and setting the feature components corresponding to the target feature bits in each candidate feature to the non-activation value to obtain each target feature.
Specifically, the computer device may sort the feature bits according to the activation degree from small to large based on the respective activation degree of each feature bit, determine a feature bit with a preset proportion having the minimum activation degree based on the sorting result, obtain target feature bits, and set feature components belonging to the target feature bits as inactive values in each candidate feature respectively, so as to obtain each target feature.
For example, assume the candidate feature is a 1×64 vector with 64 corresponding feature bits. The 64 feature bits may be sorted by activation degree from small to large, and the 10% of feature bits with the smallest activation degree determined from the sorting result; the target feature bits may then be, for example, the feature bits with sequence numbers 0, 1, 2, 61, 62 and 63. Note that the sequence number here refers to the position order of the feature bit in the vector; for example, the feature bit with sequence number 0 is the position of the leftmost feature component in the candidate feature.
In the above embodiment, by sorting the feature bits, determining the feature bits with the activation degree satisfying the preset condition based on the sorting result, and obtaining the target feature bit, the feature bits with low activation degree can be quickly obtained, and further the feature components of the feature bits are set as the non-activation values, so as to reduce the influence of the low activation feature bits on the key feature bits in the training process.
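The masking step can be sketched as follows. This is a simplified, deterministic variant that takes the lowest fraction of feature bits directly, whereas the embodiments above also describe randomly selecting a preset number from the low-activation candidates; the 10% fraction and the -1 inactive value follow the examples above and are assumptions:

```python
import numpy as np

def mask_low_activation_bits(features, activation_degree, fraction=0.10,
                             inactive_value=-1):
    # Sort feature bits by activation degree (ascending) and treat the
    # lowest `fraction` of them as target feature bits.
    n_target = int(features.shape[1] * fraction)
    target_bits = np.argsort(activation_degree)[:n_target]
    # Set the components of the target feature bits to the inactive value.
    masked = features.copy()
    masked[:, target_bits] = inactive_value
    return masked
```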
In one embodiment, as shown in FIG. 4, a feature fusion model processing method is provided. The method is described using a computer device as the execution body; it is understood that the computer device may be the terminal 102 or the server 104 shown in FIG. 1. In this embodiment, the feature fusion model processing method includes the following steps:
step 402, obtaining semantic features and non-semantic features corresponding to training samples in the training sample set.
Step 404, determining an activated feature component corresponding to the semantic feature from feature components contained in each semantic feature, and determining an activated feature component corresponding to the non-semantic feature from feature components contained in each non-semantic feature.
Specifically, for each training sample, after obtaining the semantic features and the non-semantic features of the training sample, the computer device may determine an active feature component corresponding to the semantic features from feature components included in the semantic features, and determine an active feature component corresponding to the non-semantic features from feature components included in the non-semantic features.
Step 406, statistics is performed on the activated feature components belonging to the same feature bit in each semantic feature, and respective activation degrees of each feature bit corresponding to each semantic feature are determined according to the statistics result.
Step 408, statistics is performed on the activated feature components belonging to the same feature bit in each non-semantic feature, and respective activation degrees of each feature bit corresponding to each non-semantic feature are determined according to the statistics result.
And 410, respectively setting feature components corresponding to feature bits with the activation degree meeting the preset condition in each semantic feature as non-activation values to obtain each first target feature.
And step 412, respectively setting the feature components corresponding to the feature bits with the activation degree meeting the preset conditions in each non-semantic feature as non-activation values, and obtaining each second target feature.
In step 414, fusion processing is performed by using the feature fusion model to be trained based on each first target feature and each second target feature, so as to obtain each fusion feature.
Specifically, the computer device may splice the first target feature and the second target feature corresponding to the same training sample to obtain a spliced feature, input the spliced feature into the feature fusion model, and perform fusion processing through the feature fusion model to obtain the fusion feature corresponding to each training sample.
Step 416, training a feature fusion model based on each fusion feature, and obtaining a target fusion model when the training stop condition is satisfied; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of the input sample to obtain target fusion features.
In the above embodiment, feature selection is performed on both the semantic features and the non-semantic features: the feature components of low-activation feature bits are set to inactive values, so that during fusion the influence of low-activation feature bits on the key features is avoided as far as possible, and the fused features characterize the training samples better.
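For illustration, the following is a minimal PyTorch sketch of this feature selection step. It assumes that a component counts as "activated" when it is positive and that the preset condition is falling outside the top fraction of bits ranked by activation count; both criteria are assumptions, since the embodiments above leave the exact definitions open.

```python
import torch

def mask_low_activation_bits(features: torch.Tensor,
                             keep_ratio: float = 0.75,
                             inactive_value: float = -1.0) -> torch.Tensor:
    # features: (N, D) candidate features (semantic or non-semantic) of N samples.
    # Assumption: a component counts as "activated" when it is positive.
    activation_degree = (features > 0).sum(dim=0)    # per-bit activation count
    k = max(1, int(features.size(1) * keep_ratio))   # bits to keep (assumption)
    keep = torch.topk(activation_degree, k).indices  # high-activation bits
    low_activation = torch.ones(features.size(1), dtype=torch.bool)
    low_activation[keep] = False                     # True = preset condition met
    target = features.clone()
    target[:, low_activation] = inactive_value       # set those bits inactive
    return target

feats = torch.randn(8, 64)                 # e.g. 8 samples, 1x64 features each
target_feats = mask_low_activation_bits(feats)
```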
In one embodiment, performing fusion processing based on the target features using the feature fusion model to be trained to obtain the fusion features includes: splicing the first target feature and the second target feature corresponding to the same training sample to obtain a spliced feature; inputting the spliced feature into the feature fusion model, where the feature fusion model includes a first fully connected layer and a second fully connected layer; mapping the spliced feature through the first fully connected layer to obtain an intermediate feature; and mapping the intermediate feature through the second fully connected layer to obtain a fusion feature. The intermediate feature has the same dimension as the spliced feature, while the fusion feature has a smaller dimension than the spliced feature.
Specifically, the computer device may splice the first target feature and the second target feature corresponding to the same training sample to obtain a spliced feature, and input the spliced feature into the first fully connected layer and the second fully connected layer of the feature fusion model in sequence. The spliced feature is mapped through the first fully connected layer to obtain an intermediate feature, and the intermediate feature is mapped through the second fully connected layer to obtain a fusion feature. Because the intermediate feature output by the first fully connected layer has the same dimension as the spliced feature, information can be fused across all features in the spliced feature; because the fusion feature output by the second fully connected layer has a smaller dimension than the spliced feature, the feature is compressed.
For example, referring to fig. 5, after the feature components corresponding to the feature bits whose activation degree satisfies the preset condition are set to inactive values in the 1x64 semantic feature 502 and the 1x64 non-semantic feature 504 of a training sample, a 1x64 first target feature 506 and a 1x64 second target feature 508 are obtained; the shaded portions in the first target feature 506 and the second target feature 508 are inactive values. The first target feature 506 and the second target feature 508 are spliced to obtain a 1x128 spliced feature 510, which is input into the feature fusion model. The feature fusion model includes a first fully connected layer whose parameters form a 128x128 parameter matrix and a second fully connected layer whose parameters form a 128x96 parameter matrix; after mapping through the two layers, a 1x96 fusion feature vector is finally obtained.
In the above embodiment, the spliced feature is mapped through two fully connected layers, so that feature information across all feature bits can be fused more fully, and a fusion feature with compressed dimensions is finally obtained; the feature fusion model obtained by training can thus reduce the consumption of computing resources when fusing input samples.
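A minimal sketch of such a two-layer fusion model in PyTorch, using the dimensions from the fig. 5 example (1x64 inputs, 128x128 and 128x96 parameter matrices). The tanh activations, which push values toward -1/1 as the later hash quantization expects, are an assumption:

```python
import torch
from torch import nn

class FeatureFusionModel(nn.Module):
    def __init__(self, spliced_dim: int = 128, fused_dim: int = 96):
        super().__init__()
        self.fc1 = nn.Linear(spliced_dim, spliced_dim)  # 128x128: intermediate, same dim
        self.fc2 = nn.Linear(spliced_dim, fused_dim)    # 128x96: fused, compressed dim

    def forward(self, first_target: torch.Tensor, second_target: torch.Tensor):
        spliced = torch.cat([first_target, second_target], dim=1)  # 1x64 + 1x64 -> 1x128
        intermediate = torch.tanh(self.fc1(spliced))    # tanh pushes values toward -1/1
        return torch.tanh(self.fc2(intermediate))       # 1x96 fusion feature

fusion = FeatureFusionModel()
fused = fusion(torch.randn(4, 64), torch.randn(4, 64))  # -> torch.Size([4, 96])
```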
In one embodiment, the semantic features of the training sample are obtained by extracting features of the training sample through a trained target semantic feature extraction model, and the corresponding non-semantic features of the training sample are obtained by extracting features of the training sample through a trained target non-semantic feature extraction model; as shown in fig. 6, the training steps of the target semantic feature extraction model and the target non-semantic feature extraction model include:
step 602, obtaining a comparison training sample corresponding to the training sample in the training sample set.
The contrast training samples comprise at least one of positive contrast training samples and negative contrast training samples. The similarity between the training sample and its positive contrast training sample is greater than the similarity between the training sample and its negative contrast training sample. For example, the positive contrast training sample corresponding to a training sample may be a sample similar to that training sample, while the negative contrast training sample may be a sample dissimilar to it.
In one embodiment, the target fusion model obtained by training is used to fuse the semantic features and non-semantic features of an input sample into target fusion features, which are used for image deduplication retrieval. In a deduplication system, two samples must be extremely similar to be counted as similar samples, as shown in fig. 7 and fig. 8, or be images generated from one another by image attacks, as in fig. 9. An image attack modifies an image by means of image enhancement, such as tone change, histogram equalization, watermarking, cropping, rotation, and adding black borders. Specifically, referring to fig. 7, diagrams (a), (b), (c) and (d) all contain the same object; diagrams (a) and (b) contain text while diagrams (c) and (d) do not, and the facial poses, expressions and so on differ across the four diagrams; these images are all extremely similar samples. Referring to fig. 8, diagrams (a), (b), (c) and (d) all contain the same object under different poses and expressions; they likewise all belong to extremely similar samples. Referring to fig. 9, diagrams (a), (b), (c) and (d) all contain the same object; diagrams (b) and (c) are obtained by watermarking diagram (a), diagram (d) is obtained by watermarking diagram (a) and performing tone conversion, and diagrams (b), (c) and (d) are images generated by image attacks.
Step 604, the training sample and the comparison training sample are respectively used as target samples to be input into an initial semantic feature extraction model to be trained and an initial non-semantic feature extraction model to be trained.
Specifically, the computer device takes the training sample and the contrast training sample as target samples respectively; that is, the training sample is input into both the initial semantic feature extraction model to be trained and the initial non-semantic feature extraction model to be trained, and the contrast training sample is likewise input into both models.
Step 606, outputting semantic training features corresponding to the target samples through the initial semantic feature extraction model, and outputting non-semantic training features corresponding to the target samples through the initial non-semantic feature extraction model.
The feature extraction model is a machine learning model for extracting features and outputting feature vectors. The semantic feature extraction model refers to a feature extraction model for extracting semantic features, and the non-semantic feature extraction model refers to a feature extraction model for extracting non-semantic features. The initial semantic feature extraction model is a model obtained by initializing parameters. The initial non-semantic feature extraction model may be a model obtained by parameter initialization or a model obtained by pre-training.
In one embodiment, the feature extraction model includes at least an embedding model, which is used to output feature vectors; the feature vectors output by the embedding model may be referred to as embeddings. The embedding model may consist of one or more fully connected layers, and its output features are the features extracted by the feature extraction model.
In one embodiment, the feature vectors output by the embedding model may be normalized so that each component value ranges from -1 to 1. In other embodiments, the feature vector may further undergo binary quantization to obtain a binary quantized feature, i.e., a hash feature; in this case the embedding model may be referred to as a hash quantization model.
In one embodiment, the feature extraction model may include only an embedding model, whose input may be connected to the output of a trained basic neural network model, receiving that output as its input. The basic neural network model may be a model for extracting feature information contained in the content, and may be an artificial-intelligence-based neural network such as a convolutional neural network (Convolutional Neural Networks, CNN), for example ResNet101 (a deep residual network with 101 layers) or ResNet18 (a deep residual network with 18 layers). A convolutional neural network is a type of feedforward neural network (Feedforward Neural Networks) that contains convolution calculations and has a deep structure. In other embodiments, the feature extraction model may include both the basic neural network model and the embedding model, trained together as the feature extraction model. It will be appreciated that different feature extraction models may be trained for different types of samples: for an image sample, an image feature extraction model may be trained; for a speech sample, a speech feature extraction model may be trained.
Specifically, the computer device extracts features from the training sample through the initial semantic feature extraction model to obtain the semantic training features corresponding to the training sample, and through the initial non-semantic feature extraction model to obtain the corresponding non-semantic training features. Likewise, it extracts features from the contrast training sample through the initial semantic feature extraction model to obtain the semantic training features corresponding to the contrast training sample, and through the initial non-semantic feature extraction model to obtain the corresponding non-semantic training features.
Step 608, obtaining the semantic feature loss based on the semantic training features corresponding to the training samples and the semantic training features corresponding to the contrast training samples.

Specifically, the computer device may obtain the semantic feature loss based on the difference between the semantic training features corresponding to the training samples and the semantic training features corresponding to the contrast training samples.
In one embodiment, the computer device may compute the cosine similarity between the semantic training feature corresponding to the training sample and the semantic training feature corresponding to the contrast training sample, use the obtained cosine similarity to characterize the difference between the two features, and then calculate the difference between the cosine similarity and the training label to obtain the semantic feature loss. When the contrast training sample is a positive contrast training sample, the training label is 1; when it is a negative contrast training sample, the training label is 0.
In one embodiment, the computer device may perform binary classification on the semantic training features corresponding to the training sample and to the contrast training sample, where the two classes are "similar" and "dissimilar". The similar probability and the dissimilar probability are obtained, the class with the larger probability is taken as the classification result, and the probability of the classification result characterizes the difference between the training sample features and the contrast sample features; the difference between this probability and the training label then gives the semantic feature loss. As above, the training label is 1 for a positive contrast training sample and 0 for a negative one.
In one embodiment, the computer device may calculate the feature distance between the semantic training feature corresponding to the training sample and that corresponding to the contrast training sample, use this feature distance to characterize the difference between the two features, and take the feature distance as the semantic feature loss. The feature distance may be, for example, the Euclidean (L2) distance.
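As an illustration of the cosine-similarity variant above, a hedged PyTorch sketch follows; using the squared error as the "difference between the cosine similarity and the training label" is an assumption, since the embodiment does not fix the exact form:

```python
import torch
import torch.nn.functional as F

def semantic_feature_loss(sample_feat: torch.Tensor,
                          contrast_feat: torch.Tensor,
                          is_positive: torch.Tensor) -> torch.Tensor:
    # Cosine similarity characterizes the feature difference; the training
    # label is 1 for positive contrast samples and 0 for negative ones.
    sim = F.cosine_similarity(sample_feat, contrast_feat, dim=1)  # (N,)
    label = is_positive.float()
    return ((sim - label) ** 2).mean()   # squared error is an assumed "difference"

loss = semantic_feature_loss(torch.randn(4, 64), torch.randn(4, 64),
                             torch.tensor([1, 1, 0, 0]))
```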
Step 610, obtaining a first non-semantic feature loss based on the non-semantic training features corresponding to the training samples and the non-semantic training features corresponding to the comparison training samples.
Specifically, the computer device may obtain the first non-semantic feature loss based on a difference between the non-semantic training features corresponding to the training samples and the non-semantic training features corresponding to the comparison training samples.
Step 612, classifying the training sample based on the semantic training features of the training sample to obtain a classification result, and determining the classification loss based on the classification result.
Specifically, the computer device classifies the training samples based on semantic training features of the training samples to obtain classification results, and determines a classification loss based on a difference between the classification results and class labels of the training samples. The category labels refer to pre-labeled labels used for representing categories to which training samples belong.
In one embodiment, the computer device may employ a classification model to classify the training sample, i.e., input the semantic training features of the training sample into the classification model and take the output of the classification model as the classification result. A classification model is a machine learning model capable of category recognition. In one embodiment, the classification result may be a class identifier characterizing the class to which the training sample belongs; for example, if the classification model distinguishes N classes, an N-dimensional vector (1, 0, ..., 0) may represent the first class, (0, 1, 0, ..., 0) the second class, and so on. In other embodiments, the classification result may be probabilities characterizing the class to which the training sample belongs; for example, for N classes the classification model may output an N-dimensional probability vector, where each dimension gives the probability that the training sample belongs to that class.
In other embodiments, the computer device may calculate the feature distance between the semantic training feature of the training sample and the class feature corresponding to each candidate class, and when the feature distance between the semantic training feature and a class feature is less than a distance threshold, determine the candidate class corresponding to that class feature as the class to which the training sample belongs; the classification result may then be the class identifier of that class. The class feature corresponding to a candidate class may be determined as follows: collect a plurality of content samples corresponding to the candidate class, extract their feature vectors through the embedding model, and compute the average vector as the class feature of that candidate class.
Step 614, training the initial semantic feature extraction model and the initial non-semantic feature extraction model based on the semantic feature loss, the first non-semantic feature loss and the classification loss, respectively, and obtaining a trained target semantic feature extraction model and a trained target non-semantic feature extraction model when the training stop condition is satisfied.
Specifically, the computer device may combine the semantic feature loss, the first non-semantic feature loss and the classification loss corresponding to each training sample to obtain a target loss, adjust the model parameters of the initial semantic feature extraction model and the initial non-semantic feature extraction model based on the target loss, and continue training; when the training stop condition is satisfied, the trained target semantic feature extraction model and target non-semantic feature extraction model are obtained. For example, the computer device may take a weighted sum of the semantic feature loss, the first non-semantic feature loss and the classification loss corresponding to the same training sample as the target loss.
In the above embodiment, the training effect can be improved by jointly training the initial semantic feature extraction model and the initial non-semantic feature extraction model, and the obtained target semantic feature extraction model and the obtained target non-semantic feature extraction model can better perform feature extraction.
In one embodiment, the initial semantic feature extraction model and the initial non-semantic feature extraction model share a network layer. Outputting the semantic training features corresponding to each target sample through the initial semantic feature extraction model and the non-semantic training features through the initial non-semantic feature extraction model then includes: performing convolution processing on each target sample through the shared network layer to obtain the shared features corresponding to each target sample; performing feature processing on each shared feature through a first feature processing layer of the initial semantic feature extraction model to obtain the semantic training features corresponding to each target sample; and performing feature processing on each shared feature through a second feature processing layer of the initial non-semantic feature extraction model to obtain the non-semantic training features corresponding to each target sample. The first feature processing layer has a greater network depth than the second feature processing layer.
The shared network layer exists between the initial semantic feature extraction model and the initial non-semantic feature extraction model, which means that a common underlying basic feature extraction network exists between the initial semantic feature extraction model and the initial non-semantic feature extraction model, the output of the basic feature extraction network is connected with the input of the first feature processing layer to jointly form the initial semantic feature extraction model, and the output of the basic feature extraction network is connected with the input of the second feature processing layer to jointly form the initial non-semantic feature extraction model. The initial semantic feature extraction model and the initial non-semantic feature extraction model may constitute a dual-branched feature extraction model.
Because the output of the basic feature extraction network supports global low-level feature learning, the high-level abstract semantic features must be extracted from deeper texture representations to be effective, so the semantic features are placed at a deeper network position than the non-semantic features; hence the network depth of the first feature processing layer is greater than that of the second feature processing layer. In one embodiment, the second feature processing layer may be a single layer, while the first feature processing layer may include multiple convolution layers. Placing multiple convolution layers in the first (semantic) feature processing layer, firstly, prevents the classification gradient from propagating back to the basic feature extraction network too directly and degrading the non-semantic features; secondly, it allows the classification module to further extract classification-relevant key local features from the global features of the basic feature extraction network, making the semantic characterization more sufficient. For example, fig. 11 shows a schematic diagram of a model structure in an embodiment: the semantic feature branch and the non-semantic feature branch share a CNN network at the input end, the output of the CNN network is connected to the inputs of both branches, and the branch where the semantic features are located includes multiple convolution layers, so that high-level abstract semantic features can be extracted.
In the above embodiment, setting a shared network layer reduces the complexity of the model, the amount of data the model must process, and the number of parameters the model needs to learn.
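A schematic PyTorch sketch of such a dual-branch model follows. The layer sizes and the exact composition of the heads are illustrative assumptions; what the sketch preserves from the embodiment is the shared CNN backbone and a semantic head that sits deeper (extra convolution) than the non-semantic head:

```python
import torch
from torch import nn

class DualBranchExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(                    # shared network layer (CNN)
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.non_semantic = nn.Sequential(              # second (shallower) head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.Tanh(),
        )
        self.semantic = nn.Sequential(                  # first (deeper) head
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 64), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor):
        shared = self.shared(x)                         # shared features
        return self.semantic(shared), self.non_semantic(shared)

sem, non_sem = DualBranchExtractor()(torch.randn(2, 3, 224, 224))  # each (2, 64)
```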
In one embodiment, the initial non-semantic feature extraction model is obtained by: respectively inputting the training samples and the comparison training samples into candidate non-semantic feature extraction models, and outputting non-semantic training features corresponding to the training samples and non-semantic training features corresponding to the comparison training samples; obtaining second non-semantic feature loss based on the non-semantic training features corresponding to the training samples of the candidate non-semantic feature extraction model and the non-semantic training features corresponding to the comparison training samples; based on the second non-semantic feature loss, model parameters of the candidate non-semantic feature extraction model are adjusted and training is continued, and when training stop conditions are met, an initial non-semantic feature extraction model is obtained.
Specifically, after obtaining the non-semantic training features corresponding to the training samples and to the contrast training samples, the computer device can calculate the second non-semantic feature loss based on the difference between them, adjust the model parameters of the candidate non-semantic feature extraction model based on the second non-semantic feature loss, and continue training; when the training stop condition is satisfied, the initial non-semantic feature extraction model is obtained. The initial non-semantic feature extraction model is then trained jointly with the initial semantic feature extraction model to obtain the target non-semantic feature extraction model and the target semantic feature extraction model. The initial semantic feature extraction model itself is obtained through parameter initialization.
In the above embodiment, considering that the non-semantic feature extraction model converges slowly, it may be trained in advance, and the resulting model is then trained jointly with the semantic feature extraction model, improving both training efficiency and training effect.
In one embodiment, the fusion features are binary quantized features, and training the feature fusion model based on the fusion features to obtain the target fusion model when the training stop condition is satisfied includes: acquiring the quantization target corresponding to each feature component in the fusion feature, and determining the quantization loss based on the difference between each feature component and its quantization target; acquiring the contrast fusion feature corresponding to the contrast training sample of the training sample, and determining the fusion loss based on the difference between the fusion feature and the contrast fusion feature, where the contrast fusion feature is obtained through the feature fusion model; combining the quantization loss and the fusion loss to obtain a target loss; and adjusting the model parameters of the feature fusion model based on the target loss and continuing training until the training stop condition is satisfied, obtaining the target fusion model.
Here, the quantization loss measures the quantization effect (whether each component is close enough to 1 or -1); during training, each quantized component in the fusion feature should be close enough to 1 or -1. The quantization loss is positively correlated with the difference between each quantized component and its quantization target.
Specifically, the computer device may obtain the quantization target corresponding to each feature component in the fusion feature and determine the quantization loss based on the differences between the feature components and their quantization targets. It may further obtain the contrast fusion feature corresponding to the contrast training sample of each training sample and determine the fusion loss based on the difference between the fusion feature and the contrast fusion feature of the same training sample. The quantization loss and the fusion loss are then weighted and summed to obtain the target loss. Finally, the model parameters of the feature fusion model are adjusted based on the target loss and training continues until the training stop condition is satisfied, yielding the target fusion model.
In one embodiment, the computer device may calculate the quantization loss with reference to equation (3), where $Q_i$ is the value of the fusion feature $Q$ at the $i$-th bit and $B_i$ is the quantization target of the $i$-th bit; $B_i$ is generated from $Q_i$ through a preset sign function, which can be applied to each bit $Q_i$ of the fusion feature $Q$ to compute its target code $B_i$. See equation (2) above for the sign function.
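For illustration, a PyTorch sketch of this quantization loss follows, assuming a squared penalty for the gap between each component and its sign target (the exact form of equation (3) may differ):

```python
import torch

def quantization_loss(fused: torch.Tensor) -> torch.Tensor:
    # B = sign(Q) is the per-bit quantization target; the loss grows with the
    # gap between each component Q_i and its target B_i, pushing every bit
    # of the fusion feature toward -1 or 1.
    target = torch.sign(fused).detach()   # B_i generated from Q_i by the sign function
    target[target == 0] = 1.0             # treat exact zeros as +1 (assumption)
    return ((fused - target) ** 2).mean()

loss = quantization_loss(torch.randn(4, 96))
```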
In one embodiment, for each training sample in the training sample set, the computer device may obtain its contrast training sample along with the semantic and non-semantic features corresponding to that contrast training sample. Taking at least one of the semantic features and the non-semantic features as candidate features, it determines the activated feature components from the feature components contained in each candidate feature, counts the activated feature components belonging to the same feature bit, and determines the activation degree of each feature bit corresponding to each candidate feature from the statistical result. The feature components corresponding to the feature bits whose activation degree satisfies the preset condition are then set to inactive values in each candidate feature, yielding the target features, and fusion processing is performed on the target features using the feature fusion model to be trained, yielding the contrast fusion features.
In the above embodiment, the target loss is obtained by calculating the quantization loss and the fusion loss and combining the two losses, and then the feature fusion model is trained based on the target loss, and the obtained target fusion model can obtain accurate binary quantization features through fusion processing.
In one embodiment, the contrast fusion features include positive contrast fusion features corresponding to positive contrast training samples and negative contrast fusion features corresponding to negative contrast training samples. Determining the fusion loss based on the difference between the fusion feature and the contrast fusion feature includes: acquiring a positive feature difference, which is the feature difference between the fusion feature and the corresponding positive contrast fusion feature; acquiring a negative feature difference, which is the feature difference between the fusion feature and the corresponding negative contrast fusion feature; and determining the fusion loss based on the positive feature difference and the negative feature difference.
Specifically, the computer device may obtain the fusion feature corresponding to the positive contrast training sample as the positive contrast fusion feature, and the fusion feature corresponding to the negative contrast training sample as the negative contrast fusion feature. It then obtains the feature difference between the fusion feature of a training sample and its positive contrast fusion feature as the positive feature difference, and the feature difference between the fusion feature and its negative contrast fusion feature as the negative feature difference, and finally determines the fusion loss value based on the positive and negative feature differences. It can be understood that the positive contrast fusion feature corresponding to a training sample is the fusion feature of its positive contrast training sample, and the negative contrast fusion feature is the fusion feature of its negative contrast training sample.
In one embodiment, the computer device may determine the fusion loss value with reference to equation (4) below, where $x_a$ is the fusion feature, $x_p$ is the positive contrast fusion feature, and $x_n$ is the negative contrast fusion feature; $\|x_a - x_p\|$ denotes the L2 distance between $x_a$ and $x_p$, i.e., the positive feature difference, and $\|x_a - x_n\|$ denotes the L2 distance between $x_a$ and $x_n$, i.e., the negative feature difference. The purpose of equation (4) is to make the distance between the training sample and the negative contrast training sample exceed the distance between the training sample and the positive contrast training sample by at least $\alpha$, where $\alpha$ is the margin (interval term) and can be set as required, for example to 56.

$$L_{tri} = \max(\|x_a - x_p\| - \|x_a - x_n\| + \alpha,\ 0) \qquad (4)$$
As can be seen from equation (4), the fusion loss value is 0 only when the distance between the training sample and the negative contrast training sample exceeds the distance between the training sample and the positive contrast training sample by at least $\alpha$; otherwise the fusion loss value is greater than 0. Minimizing the loss therefore pushes the distance to the negative contrast training sample to grow beyond the distance to the positive contrast training sample, so that the fusion features obtained by the feature fusion model better preserve the semantics.
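Equation (4) can be implemented directly; a PyTorch sketch follows. This matches the behavior of torch.nn.TripletMarginLoss with margin set to $\alpha$:

```python
import torch
import torch.nn.functional as F

def fusion_triplet_loss(x_a: torch.Tensor, x_p: torch.Tensor,
                        x_n: torch.Tensor, alpha: float = 56.0) -> torch.Tensor:
    # Equation (4): L_tri = max(||x_a - x_p|| - ||x_a - x_n|| + alpha, 0),
    # with L2 distances and margin alpha (56 as suggested above).
    d_ap = F.pairwise_distance(x_a, x_p)   # positive feature difference
    d_an = F.pairwise_distance(x_a, x_n)   # negative feature difference
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()

loss = fusion_triplet_loss(torch.randn(4, 96), torch.randn(4, 96), torch.randn(4, 96))
```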
In the above embodiment, the fusion loss is determined based on the positive and negative feature differences, so that similarity metric learning takes into account the adverse effect of too-small inter-class feature distances on classification, improving the semantic accuracy of the features extracted by the model.
In one embodiment, the method further comprises: determining an activated feature component from feature components contained in each fusion feature, counting the activated feature components belonging to the same feature bit, and determining respective activation degrees of feature bits corresponding to each fusion feature according to a statistical result; respectively taking the characteristic components corresponding to the characteristic bits with the activation degree meeting the preset condition in each fusion characteristic as target characteristic components; acquiring a quantization target corresponding to a feature component in the fusion feature, and determining a quantization loss based on a difference between the feature component and the corresponding quantization target, including: acquiring a quantization target corresponding to a non-target feature component in the fusion feature, and determining quantization loss based on the difference between the non-target feature component and the corresponding quantization target; the non-target feature component is a feature component other than the target feature component in the fusion feature.
Wherein the characteristic bit of which the activation degree satisfies the preset condition, namely the low activation characteristic bit, reference is specifically made to the description in the above embodiment.
Specifically, the computer device may determine the feature components corresponding to the low-activation feature bits from the fusion feature, obtain target feature components, discard the target feature components when calculating the loss, and calculate the quantization loss only for non-target feature components other than the target feature components, so as to further reduce the influence of the features of the low-activation feature bits on the key features in the model training process.
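A sketch of this masked variant, reusing the sign-target assumption from the quantization loss sketch above; the boolean mask marking the target (low-activation) feature components is assumed to come from the activation statistics described earlier:

```python
import torch

def masked_quantization_loss(fused: torch.Tensor,
                             target_bit_mask: torch.Tensor) -> torch.Tensor:
    # target_bit_mask: (D,) boolean, True where the activation degree met the
    # preset condition (target feature components, excluded from the loss).
    quant_target = torch.sign(fused).detach()
    quant_target[quant_target == 0] = 1.0
    per_component = (fused - quant_target) ** 2
    return per_component[:, ~target_bit_mask].mean()  # non-target components only

loss = masked_quantization_loss(torch.randn(4, 96), torch.rand(96) < 0.25)
```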
In one embodiment, as shown in fig. 10, a sample retrieval method is provided, which is illustrated by way of example as being performed by a computer device, it being understood that the computer device may be the terminal 102 shown in fig. 1 or the server 104. In this embodiment, the sample retrieval method includes the following steps:
step 1002, obtaining semantic features and non-semantic features corresponding to a query sample.
A query sample refers to a content sample for which similar samples need to be recalled from a sample database. For example, given an image A, when recalling images similar to A from the database, image A is the query sample.
And step 1004, inputting the semantic features and the non-semantic features into a target fusion model for fusion processing to obtain target fusion features corresponding to the query sample.
The target fusion model is obtained when a feature fusion model to be trained is trained based on each fusion feature until a training stop condition is met; each fusion feature is obtained by fusion processing based on each target feature by utilizing a feature fusion model; each target feature is obtained by respectively setting a feature component corresponding to a feature bit with the activation degree meeting a preset condition in each candidate feature as a non-activation value; the respective activation degree of each feature bit is determined according to the statistical result by counting the activation feature components belonging to the same feature bit; activating feature components is determined from feature components contained in each candidate feature by taking at least one of semantic features and non-semantic features as the candidate feature; the semantic features and the non-semantic features correspond to training samples in the training sample set.
In step 1006, a target retrieval sample corresponding to the query sample is retrieved from the sample database based on the target fusion feature.
In one embodiment, the computer device may obtain the target fusion feature of each content sample in the sample database by the same method as for the query sample, calculate the feature distance between the target fusion feature of the query sample and that of each content sample, and determine the content samples whose feature distances satisfy the distance condition as the target retrieval samples. The distance condition may be that the feature distance is smaller than a preset distance threshold, or that the feature distance ranks before a sorting threshold when sorted from small to large; the distance threshold and the sorting threshold can be set as required.
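When the target fusion features are binary quantized, as in the deduplication application described later, retrieval reduces to computing Hamming distances and selecting the top-K closest codes; a minimal sketch:

```python
import torch

def retrieve_top_k(query_code: torch.Tensor, stock_codes: torch.Tensor, k: int = 10):
    # query_code: (D,) binary 0/1 code; stock_codes: (M, D) binary codes.
    # Hamming distance = number of differing bits.
    dists = (stock_codes != query_code).sum(dim=1)         # (M,)
    values, indices = torch.topk(dists, k, largest=False)  # smallest distances first
    return indices, values

stock = torch.randint(0, 2, (1000, 96))
query = torch.randint(0, 2, (96,))
idx, d = retrieve_top_k(query, stock, k=5)
```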
According to the above sample retrieval method, on the one hand, the semantic features and non-semantic features of the query sample can be fused through the target fusion model to obtain the target fusion feature. Because the target fusion feature has both semantic and non-semantic measurement capability, it characterizes the sample more accurately, and retrieval based on it yields highly accurate results. On the other hand, the target fusion model is obtained by training the feature fusion model to be trained until the training stop condition is satisfied, where each fusion feature is produced by the feature fusion model from the target features, each target feature is obtained by setting the feature components of the feature bits whose activation degree satisfies the preset condition to inactive values, the activation degree of each feature bit is obtained by counting the activated feature components belonging to the same feature bit, and the activated feature components are determined from the feature components of the candidate features, taking at least one of the semantic and non-semantic features as candidate features. The influence of low-activation feature bits on the training process is therefore reduced, the model learns the key feature bits better, and the trained target fusion model performs feature fusion better, further improving the accuracy of sample retrieval.
In one embodiment, the training step of the target fusion model comprises: acquiring semantic features and non-semantic features corresponding to training samples in a training sample set; determining an activated feature component from feature components contained in each candidate feature by taking at least one of semantic features and non-semantic features as the candidate feature; counting the activated feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the counting result; respectively setting feature components corresponding to feature bits with activation degrees meeting preset conditions in each candidate feature as non-activation values to obtain each target feature; based on each target feature, carrying out fusion processing by utilizing a feature fusion model to be trained to obtain each fusion feature; and training the feature fusion model based on each fusion feature, and obtaining the target fusion model when the training stop condition is met.
The following application scenario applies the above feature fusion model processing method and sample retrieval method. In this scenario, the content is images and the training samples are image samples; the methods can be applied to image retrieval, and the sample database is deduplicated through image retrieval to remove repeated images.
The method provided by the embodiments of the application achieves effective learning of the fusion embedding by designing a dual-branch feature extraction network, a feature fusion scheme and a multi-stage learning scheme, mainly comprising:
1) Network structure with shared bottom-layer parameters for the dual-branch feature extraction network: the features that can be learned differ with network depth; a shallow network learns global fine-grained textures, while a deep network learns more abstract information combining multiple local features (such as semantic class information, e.g., learning the dog class by jointly judging head, body hair, limbs and tail). For the global similarity features and semantic features to be learned, the network structure of the application therefore uses shallow output and deep output to represent the global features and semantic features, respectively.
2) Feature selection: directly fusing the features would easily bring hash bits with a low activation degree into the target, so the embodiments of the application first screen the features before fusion, selecting the highly activated hash bits and splicing the hash bits that are more significant for image characterization as the basis of subsequent learning.
3) Learning scheme: features at different depths converge at different speeds; the deep semantic features converge quickly, the shallow metric features converge slowly, and the fusion features are learned more effectively on top of relatively stable basic features. The embodiments of the application therefore design a three-stage learning scheme: first learn the slowly converging shallow features, then learn the deep semantic features, and learn the fusion features only after the features of the two tasks have stabilized.
Specifically, the following description takes the method being executed by a computer device as an example; it is understood that the computer device may be the terminal 102 shown in fig. 1 or the server 104. The application of the feature fusion model processing method and the sample retrieval method in this scenario is as follows:
1. Data preparation
The training process takes triplet samples as input, but finding suitable triplets in massive data is difficult; therefore positive sample pairs are annotated first, and negative sample mining is then performed on the positive sample pairs to obtain the triplet samples.
1.1, preparing annotation data. Acquiring positive sample pairs: annotate whether pairs of image samples are similar; for example, extract two images from the massive data as a pair and annotate whether each pair is sufficiently similar. Since the model is used in a deduplication system, two samples must be extremely similar to be counted as similar, as in fig. 7 and fig. 8 below, or be images generated by image attacks, as in fig. 9. Pairs annotated as similar are positive sample pairs, and dissimilar ones are negative sample pairs. In this embodiment, mainly positive sample pairs are collected; negative sample pairs need not be collected, as the negative samples of the triplets can be obtained by the mining method below.
1.2, triplet data mining: training the metric learning feature requires loss function learning with triplet samples (a, p, n) consisting of an anchor (i.e., the training sample above, hereinafter a), a positive sample (i.e., the positive contrast training sample above, hereinafter p) and a negative sample (i.e., the negative contrast training sample above, hereinafter n), where a and p form a positive sample pair and a and n form a negative sample pair. In the learning task, the positive sample pair needs to be close enough in feature space (small enough L2 distance) for the two samples to retrieve each other, while the negative sample pair needs to be far apart. Each annotated sample pair can serve as the anchor and positive of a triplet (randomly selecting one image as the anchor); the negative samples (including hard negatives and global negatives) are further mined as follows:
because the GPU (graphics processing unit, graphics processor) of the computer device has limited memory and requires that the full positive samples be fed into the GPU for training in batches (batch) in whole training, the negative samples are mined in one batch (batch) inside.
For each batch of positive sample pairs (say bs pairs), negative samples are mined to form triplet samples as follows. For the x-anchor of a sample pair x (one image randomly chosen as the anchor): compute the distances between the x-anchor and the remaining bs-1 sample pairs (one image randomly chosen per pair), sort the distances from small to large, remove the top-5 images, and take the next 20 samples as hard negatives. (Extremely similar samples are what must be learned, and a smaller distance is considered more similar; since the probability of extreme similarity in massive data is low, the top-5 similar samples are removed directly, and the remaining samples can serve as the hard negatives in the triplets.) These form triplets with x, so each sample pair generates 20 triplets and the whole batch yields 20*bs triplets. To ensure effective negative mining, bs needs to be set relatively large, e.g., 1024.
Metric learning based on triplets places high demands on hard samples; with only easy samples the model cannot learn a discriminative representation. In practice it cannot be guaranteed that the first 20 negatives are all hard negatives, but they are guaranteed to be the harder samples, which benefits learning.
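A sketch of this in-batch mining procedure, assuming each pair contributes one anchor image and one candidate image and that features are compared by L2 distance:

```python
import torch

def mine_hard_negatives(anchors: torch.Tensor, candidates: torch.Tensor,
                        drop: int = 5, take: int = 20) -> torch.Tensor:
    # anchors: (bs, D) one anchor per pair; candidates: (bs, D) one image per pair.
    dists = torch.cdist(anchors, candidates)   # (bs, bs) pairwise L2 distances
    dists.fill_diagonal_(float('inf'))         # a pair is not its own negative
    order = dists.argsort(dim=1)               # ascending distance per anchor
    return order[:, drop:drop + take]          # drop top-5, take next 20 as negatives

negatives = mine_hard_negatives(torch.randn(1024, 64), torch.randn(1024, 64))
```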
2. Training process
The model to be trained comprises three parts: a semantic feature extraction model, a non-semantic feature extraction model, and a feature fusion model. The semantic and non-semantic feature extraction models share a basic feature extraction network, which adopts ResNet101 and comprises five parts, convolution layer 1 through convolution layer 5. Convolution layer 1 is a 7×7×64 convolution with stride 2; convolution layer 2 comprises a 3×3 max pooling layer (max pool) and 3 ResNet modules (blocks); and convolution layers 3 to 5 comprise 4, 23 and 3 ResNet modules, respectively. The parameters of the basic feature extraction network can be seen in Table 1.
TABLE 1
The non-semantic feature extraction model also includes a first feature extraction layer, whose input is the output of convolution layer 5 and whose output is a 1×64 vector whose values tend to -1 or 1. The first feature extraction layer includes a pooling layer and a hash quantization layer; the parameters are shown in Table 2.
TABLE 2
The semantic feature extraction model further includes a second feature extraction layer, which comprises convolution layer 6 and a hash quantization layer. The input of convolution layer 6 is the output of convolution layer 5, extracting deeper semantic information from the shallow features output by convolution layer 5. The input of the hash quantization layer is the output of convolution layer 6 in Table 3, and the hash quantization layer outputs a 1×64 vector whose values tend to -1 or 1. In addition, a classification layer is included; the number of classes is 100, i.e., the 100 categories to be distinguished in the present deduplication task (adjustable in practice). The second feature extraction layer parameters are shown in Tables 3 and 4.
TABLE 3
TABLE 4

Name of layer              Output size    Layer
Hash quantization layer    1x64           Fully connected layer
Classification layer       1x100          Fully connected layer
The input of the feature fusion model (i.e., the input of fusion layer 1) is the concatenation of the outputs of Tables 2 and 4. The output of fusion layer 1 is fed into fusion layer 2, which outputs the final fused hash feature, a 1x96 vector whose values tend to -1 or 1. Considering that the non-semantic hash features output by the first feature extraction layer and the semantic hash features output by the second feature extraction layer may contain useless or redundant hash bits, the effective bits of the final fusion feature are generally fewer than those of a direct splice; the output dimension is therefore set to 96 (<128) and can be adjusted in practice. The parameters of the feature fusion model can be seen in Table 5.
TABLE 5
It should be noted that the basic feature extraction network, the hash quantization layer and the classification layer may also adopt other model structures; for example, the basic feature extraction network may use ResNet18 CNN modules, and the hash quantization layer may be formed by connecting multiple fully connected layers. The specific training process is as follows:
2.1, parameter initialization: convolution layers 1-5 use the parameters of a ResNet101 pre-trained on a preset data set; the other network layers, such as the fully connected layers and convolution layer 6, are initialized with a Gaussian distribution with variance 0.01 and mean 0.
2.2, overall procedure: for M sample pairs in total, every bs sample pairs form a batch, giving M/bs batches. For each batch of samples, triplet mining is performed first; the resulting triplet samples are input into the model, the model forward pass is computed, the corresponding loss values are calculated, and the gradient of each parameter is obtained by gradient backpropagation. The model parameters are updated according to the gradients, and one epoch is completed after all M/bs updates. Training ends after K epochs in total (or when the average loss does not decrease for 10 consecutive epochs).
2.3, parameter learning settings: in the first stage, Tables 1 and 2 are set as the parameters to be learned; in the second stage, Tables 1, 2, 3 and 4; in the third stage, Table 5. In each stage, only that stage's parameters to be learned are updated.
2.4, model forward pass: during training, the neural network performs a forward computation on each image of the input triplet samples to obtain the outputs of Tables 2, 4 and 5, and different outputs are used to calculate different losses depending on the stage, as follows:
1) In the first stage, the output of Table 2 is used to calculate Loss1. Referring to fig. 11, each sample of the triplet is fed as an input sample into the CNN network (i.e., convolution layers 1-5), which is the network layer shared by the semantic and non-semantic feature extraction models. The output features of the CNN network are input into the non-semantic feature extraction branch to extract the non-semantic hash features; non-semantic embedding metric learning is performed on these features, and the triplet loss corresponding to the non-semantic features is obtained with reference to equation (4) above. The total loss of the first stage is then calculated with reference to equation (5) below:
$$Loss_1 = w_1 L_{hash1\text{-}triplet} + w_2 L_{hash1\text{-}coding} \qquad (5)$$
where $Loss_1$ is the total loss of the first stage and $L_{hash1\text{-}triplet}$ is the triplet loss corresponding to the non-semantic features, calculated with reference to equation (4) above. $L_{hash1\text{-}coding}$ is the quantization loss, which measures the quantization effect on the vector output by the hash quantization layer (whether each bit is close enough to -1 or 1). Since the quantized output must finally be mapped to -1/1 binary codes, each bit of the output $Q$ should be close enough to 1 or -1; otherwise the quantized representation does not measure similarity well enough in the application. For the quantization result of each image, the coding loss can be calculated with reference to equations (2) and (3) above, where $Q_i$ is the value of the image's quantized feature $Q$ at the $i$-th bit (256 bits in this embodiment) and $B_i$ is the quantization target of the $i$-th bit, i.e., the target code of the quantization learning task generated from $Q_i$ through the sign function.
It should be noted that, in application, the sign function is applied directly to generate the quantized binary vector (binary values 0/1 are chosen so that the bits can be conveniently represented in computer equipment), which can then be used for image retrieval.
2) The second stage uses the outputs of Tables 2 and 4 to calculate Loss2. Continuing with fig. 11, each sample of the triplet is fed as an input sample into the CNN network obtained by the first-stage training; the CNN output features are input into the non-semantic feature extraction branch obtained by the first-stage training to extract the non-semantic hash features, and into the semantic feature extraction branch to extract the semantic hash features. Non-semantic embedding metric learning is performed on the non-semantic hash features, giving the triplet loss of the non-semantic features per equation (4); semantic embedding metric learning is performed on the semantic hash features, giving the triplet loss of the semantic features per equation (4); and semantic information learning is performed on the semantic hash features, with the classification loss computed per equation (6) below, where $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$, $y_{ic}$ indicates whether the label of sample $i$ is $c$ ($y_{ic}=1$ if so, otherwise 0), and $n$ is the number of images in the batch. The classification loss of each batch is obtained by computing the classification loss for each image and averaging:

$$L_{class} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c} y_{ic}\log(p_{ic}) \qquad (6)$$
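For reference, equation (6) is the standard cross-entropy averaged over the batch, which PyTorch provides directly; a minimal sketch with an assumed batch of 8 images and 100 classes:

```python
import torch
import torch.nn.functional as F

# F.cross_entropy applies softmax to the logits to obtain p_ic and averages
# -log p_ic of the true class over the n images of the batch, matching eq. (6).
logits = torch.randn(8, 100)            # classification layer output, 100 classes
labels = torch.randint(0, 100, (8,))    # class labels of the batch
l_class = F.cross_entropy(logits, labels)
```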
The computer device then calculates the total loss of the second stage with reference to equation (7) below:
$$Loss_2 = w_1 L_{hash1\text{-}triplet} + w_2 L_{hash1\text{-}coding} + w_3 L_{hash2\text{-}triplet} + w_4 L_{hash2\text{-}coding} + w_5 L_{class} \qquad (7)$$
where $Loss_2$ is the total loss of the second stage; $L_{hash1\text{-}triplet}$ is the triplet loss of the non-semantic features per equation (4); $L_{hash1\text{-}coding}$ is the quantization loss of the non-semantic features per equations (2) and (3); $L_{hash2\text{-}triplet}$ is the triplet loss of the semantic features per equation (4); $L_{hash2\text{-}coding}$ is the quantization loss of the semantic features per equations (2) and (3); and $L_{class}$ is the classification loss.
3) The third stage calculates Loss_3 using the output of table 5. Continuing to refer to fig. 11, each sample in the triplet is input into the CNN network obtained by the second-stage training; the output feature of the CNN network is input into the non-semantic feature extraction branch obtained by the second-stage training to extract the non-semantic hash feature, and into the semantic feature extraction branch obtained by the second-stage training to extract the semantic hash feature. Feature selection is then performed on the semantic hash feature and the non-semantic hash feature, namely, the feature components on the low-activation hash feature bits are selected from the semantic and non-semantic hash features and set to -1. The selected features are spliced and input into the feature fusion model for fusion learning to obtain the fusion feature; fusion embedding metric learning is performed based on the fusion feature, and the triplet loss corresponding to the fusion feature is obtained with reference to the above formula (4). The total loss of the third stage is then calculated with reference to the following formula (8):
Loss_3 = w_6·L_hashall-triplet + w_7·L_hashall-coding (8)
wherein Loss_3 is the total loss of the third stage, L_hashall-triplet is the triplet loss corresponding to the fusion features calculated with reference to the above formula (4), and L_hashall-coding is the quantization loss corresponding to the fusion features calculated with reference to the above formulas (2) and (3).
In the above loss calculations, the weights of the metric learning losses may be set as w_1 = w_3 = w_6 = 1, the weights of the hash coding (quantization) losses as w_2 = w_4 = w_7 = 0.01, and the weight of the classification loss as w_5 = 1.
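Combining the pieces, a sketch of the second-stage total loss of formula (7) with the weights suggested above (the individual loss terms are assumed to be computed elsewhere, e.g. by the sketches earlier in this section):

```python
# Suggested weights: metric-learning losses 1, hash-coding losses 0.01,
# classification loss 1.
W_TRIPLET, W_CODING, W_CLASS = 1.0, 0.01, 1.0

def loss_stage2(l_h1_triplet, l_h1_coding, l_h2_triplet, l_h2_coding, l_class):
    """Formula (7): weighted sum of the five second-stage loss terms."""
    return (W_TRIPLET * l_h1_triplet + W_CODING * l_h1_coding +
            W_TRIPLET * l_h2_triplet + W_CODING * l_h2_coding +
            W_CLASS * l_class)
```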
1.5, model parameter updating process: gradient back-propagation is performed using the loss calculated in each stage to obtain the gradients of the network parameters to be updated in that stage, and the parameter values are updated accordingly. The network parameters participating in learning adopt a learning rate of 0.0005, and the learning rate is adjusted to 0.1 times its previous value every 10 rounds; in each learning round, after the gradients are obtained from the loss back-propagation, the network weights are updated according to their respective learning rates.
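The schedule described here maps naturally onto a step learning-rate scheduler; a hedged sketch follows (model, loader, and the per-stage loss function are placeholders, and the choice of SGD is an assumption):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(num_epochs):
    for batch in loader:
        loss = compute_stage_loss(model, batch)  # Loss1/Loss2/Loss3 by stage
        optimizer.zero_grad()
        loss.backward()    # gradient backward calculation
        optimizer.step()   # update parameter values
    scheduler.step()       # learning rate *= 0.1 every 10 rounds
```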
3. Quantized feature retrieval application
Semantic features are extracted from all inventory images through the trained semantic feature extraction model, non-semantic features are extracted through the trained non-semantic feature extraction model, and the semantic and non-semantic features are fused through the trained fusion model to obtain the fusion feature; the fusion feature is activated through the sign function to obtain the binary quantized feature, which is then put into storage. For a query image, semantic features are likewise extracted through the trained semantic feature extraction model, non-semantic features through the trained non-semantic feature extraction model, and the two are fused through the trained fusion model to obtain the fusion feature, which is activated through the sign function to obtain the binary quantized feature. This binary quantized feature is compared one by one with the binary quantized vectors in stock; calculating the Hamming distance on binary quantized features speeds up the computation (compared with floating-point embedding features), and after the distances are calculated, the K most similar results ranked at the front after sorting are returned, thereby realizing image duplicate checking.
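A minimal sketch of the Hamming-distance comparison described above, over bit-packed codes such as those produced by the binarize() sketch earlier (names are illustrative):

```python
import numpy as np

def hamming_top_k(query_code: np.ndarray, stock_codes: np.ndarray, k: int):
    """Return indices of the k stock codes closest to the query in
    Hamming distance. Codes are bit-packed uint8 arrays, one row per
    inventory image."""
    diff = np.bitwise_xor(stock_codes, query_code[None, :])  # differing bits
    dists = np.unpackbits(diff, axis=1).sum(axis=1)          # popcount per row
    return np.argsort(dists)[:k]  # smallest distance = most similar
```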
This embodiment can produce the following beneficial effects:
1. Reduced maintenance cost: a unified hash feature is generated by combining the two feature extraction models, so only one set of retrieval features and one retrieval system need to be maintained in application, reducing maintenance cost.
2. Reduced computing resource consumption: by fusing the features into one model, the computing resources required for model inference when features need to be extracted in application are reduced by nearly half compared with generating two features with two models.
3. Improved application efficiency and effect: fusing the dual features into a single feature improves the recall of retrieving similar samples compared with a single feature, while compressing the original two features (1x64 bits each) into 1x96 bits reduces resource consumption.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in those flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, the embodiments of the present application also provide a feature fusion model processing device for implementing the above feature fusion model processing method, and a sample retrieval device for implementing the above sample retrieval method. The implementation of the solutions provided by these devices is similar to the implementation described in the methods above, so for specific limitations in the embodiments of the feature fusion model processing device provided below, reference may be made to the limitations of the feature fusion model processing method hereinabove, which will not be repeated here.
In one embodiment, as shown in fig. 12, there is provided a feature fusion model processing apparatus 1200, including:
a feature obtaining module 1202, configured to obtain semantic features and non-semantic features corresponding to training samples in the training sample set;
an activation component determining module 1204, configured to determine an activation feature component from feature components included in each candidate feature, using at least one of a semantic feature and a non-semantic feature as a candidate feature;
the activation degree determining module 1206 is configured to perform statistics on activation feature components belonging to the same feature bit, and determine respective activation degrees of feature bits corresponding to each candidate feature according to a statistical result;
The target feature obtaining module 1208 is configured to set, as an inactive value, feature components corresponding to feature bits whose activation degree satisfies a preset condition in each candidate feature, so as to obtain each target feature;
the fusion processing module 1210 is configured to perform fusion processing based on each target feature and using a feature fusion model to be trained to obtain each fusion feature;
the parameter adjustment module 1212 is configured to train the feature fusion model based on each fusion feature, and obtain a target fusion model when the training stop condition is satisfied; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of the input sample to obtain target fusion features.
According to the above feature fusion model processing device, on one hand, the semantic features and the non-semantic features of the input sample can be fused through the target fusion model to obtain the target fusion feature. The target fusion feature has both semantic measurement capability and non-semantic measurement capability, so it can characterize the sample more accurately, with high accuracy. On the other hand, in the process of training the target fusion model, at least one of the semantic features and the non-semantic features is taken as candidate features; an activated feature component is determined from the feature components contained in each candidate feature; the activated feature components belonging to the same feature bit are counted, and the respective activation degree of each feature bit corresponding to the candidate features is determined according to the statistical result; the feature components corresponding to the feature bits whose activation degree satisfies the preset condition in each candidate feature are set as an inactive value to obtain the target features; and fusion processing is performed based on the target features using the feature fusion model to be trained. This reduces the influence of the features on low-activation feature bits on the training process, so the model can better learn the key feature bits, the trained target fusion model can better perform feature fusion, and the accuracy of the target fusion feature is further improved.
In one embodiment, the activation degree determining module is configured to count the number of activated feature components belonging to the same feature bit to obtain the respective activation number of each feature bit corresponding to the candidate features; count the total number of feature components belonging to each feature bit; and determine the respective activation degree of each feature bit based on the respective activation number and the total number of features of each feature bit.
In one embodiment, the activation degree determining module is further configured to determine an activation proportion corresponding to each feature bit based on the respective activation number and the feature total number of each feature bit; and calculating the respective activation entropy of each feature bit based on the respective activation proportion of each feature bit, and determining the respective activation entropy of each feature bit as the respective activation degree of each feature bit.
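As an illustrative sketch of the entropy computation this module describes (the binary-entropy form is an assumption; the text only states that an activation entropy is derived from the activation proportion):

```python
import numpy as np

def bit_activation_entropy(features: np.ndarray, active_value=1) -> np.ndarray:
    """features: [num_samples, num_bits] candidate features. A component
    equal to active_value counts as activated. Returns the per-bit
    activation entropy used as the activation degree."""
    p = (features == active_value).mean(axis=0)           # activation proportion
    p = np.clip(p, 1e-12, 1 - 1e-12)                      # guard log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy
```

Under this reading, bits that are almost always (or almost never) activated get entropy near 0, which is one natural interpretation of a low-activation feature bit.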
In one embodiment, the activation component determining module is configured to map feature components included in each candidate feature according to a preset mapping manner to obtain respective mapping feature values of each feature component; comparing the respective mapping characteristic values of the characteristic components with an activation value set by a preset mapping mode; and determining the feature components corresponding to the mapping feature values which are consistent in comparison as activated feature components.
In one embodiment, the target feature obtaining module is configured to sort each feature bit based on respective activation degrees of each feature bit, and determine feature bits with activation degrees meeting a preset condition based on a sorting result to obtain target feature bits; and respectively setting the feature components corresponding to the target feature bits in the candidate features as inactive values to obtain the target features.
In one embodiment, the activation component determining module is configured to determine an activation feature component corresponding to a semantic feature from feature components included in each semantic feature, and determine an activation feature component corresponding to a non-semantic feature from feature components included in each non-semantic feature; the activation degree determining module is used for counting the activation feature components belonging to the same feature bit in each semantic feature, and determining the respective activation degree of each feature bit corresponding to each semantic feature according to the counting result; counting the activated feature components belonging to the same feature bit in each non-semantic feature, and determining the respective activation degree of each feature bit corresponding to each non-semantic feature according to the counting result; the target feature obtaining module is used for respectively setting feature components corresponding to feature bits with the activation degree meeting the preset conditions in each semantic feature as non-activation values to obtain each first target feature; and respectively setting the feature components corresponding to the feature bits with the activation degree meeting the preset conditions in each non-semantic feature as non-activation values to obtain each second target feature.
In one embodiment, the fusion processing module is configured to splice the first target feature and the second target feature corresponding to the same training sample to obtain a spliced feature; input the spliced feature into the feature fusion model, where the feature fusion model comprises a first full-connection layer and a second full-connection layer; map the spliced feature through the first full-connection layer to obtain an intermediate feature; and map the intermediate feature through the second full-connection layer to obtain the fusion feature; the intermediate feature has the same dimension as the spliced feature, and the fusion feature has a smaller dimension than the spliced feature.
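A hedged sketch of the two-layer fusion model this embodiment describes (the concrete dimensions follow the 2x64 -> 96 compression mentioned earlier in this document; the activation functions are assumptions):

```python
import torch
from torch import nn

class FeatureFusionModel(nn.Module):
    """First full-connection layer keeps the spliced dimension; second
    full-connection layer maps to a smaller fusion dimension."""
    def __init__(self, spliced_dim: int = 128, fused_dim: int = 96):
        super().__init__()
        self.fc1 = nn.Linear(spliced_dim, spliced_dim)  # intermediate feature
        self.fc2 = nn.Linear(spliced_dim, fused_dim)    # reduced fusion feature

    def forward(self, first_target: torch.Tensor, second_target: torch.Tensor):
        x = torch.cat([first_target, second_target], dim=-1)  # splice
        x = torch.relu(self.fc1(x))
        return torch.tanh(self.fc2(x))  # keep components near -1/1 for hashing
```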
In one embodiment, the semantic features of the training sample are obtained by extracting features of the training sample through a trained target semantic feature extraction model, and the corresponding non-semantic features of the training sample are obtained by extracting features of the training sample through a trained target non-semantic feature extraction model; the device further comprises: the feature extraction model training module is used for acquiring a comparison training sample corresponding to the training sample in the training sample set; respectively inputting a training sample and a comparison training sample as target samples into an initial semantic feature extraction model to be trained and an initial non-semantic feature extraction model to be trained; outputting semantic training features corresponding to each target sample through an initial semantic feature extraction model, and outputting non-semantic training features corresponding to each target sample through an initial non-semantic feature extraction model; based on the semantic training features corresponding to the training samples and the semantic training features corresponding to the comparison training samples, semantic feature loss is obtained; based on the non-semantic training features corresponding to the training samples and the non-semantic training features corresponding to the comparison training samples, obtaining first non-semantic feature loss; classifying the training samples based on semantic training features of the training samples to obtain classification results, and determining classification loss based on the classification results; training an initial semantic feature extraction model and an initial non-semantic feature extraction model based on the semantic feature loss, the first non-semantic feature loss and the classification loss, and obtaining a trained target semantic feature extraction model and a trained target non-semantic feature extraction model when the training stop condition is satisfied.
In one embodiment, the initial semantic feature extraction model and the initial non-semantic feature extraction model have a shared network layer; the feature extraction model training module is further configured to perform convolution processing on each target sample through the shared network layer to obtain the shared feature corresponding to each target sample; perform feature processing on each shared feature through a first feature processing layer of the initial semantic feature extraction model to obtain the semantic training feature corresponding to each target sample; and perform feature processing on each shared feature through a second feature processing layer of the initial non-semantic feature extraction model to obtain the non-semantic training feature corresponding to each target sample; the network depth of the first feature processing layer is greater than that of the second feature processing layer.
In one embodiment, the feature extraction model training module is further configured to input a training sample and a comparison training sample into the candidate non-semantic feature extraction model respectively, and output non-semantic training features corresponding to the training sample and non-semantic training features corresponding to the comparison training sample; obtaining second non-semantic feature loss based on the non-semantic training features corresponding to the training samples of the candidate non-semantic feature extraction model and the non-semantic training features corresponding to the comparison training samples; based on the second non-semantic feature loss, model parameters of the candidate non-semantic feature extraction model are adjusted and training is continued, and when training stop conditions are met, an initial non-semantic feature extraction model is obtained.
In one embodiment, the fusion feature is a binary quantization feature, and the parameter adjustment module is configured to obtain a quantization target corresponding to a feature component in the fusion feature, and determine a quantization loss based on a difference between the feature component in the fusion feature and the corresponding quantization target; acquiring a contrast fusion characteristic corresponding to a contrast training sample of the training sample, and determining fusion loss based on the difference between the fusion characteristic and the contrast fusion characteristic; the contrast fusion characteristics are obtained based on a characteristic fusion model; counting the quantization loss and the fusion loss to obtain a target loss; and adjusting model parameters of the feature fusion model based on the target loss, and continuing training until the training stopping condition is met, so as to obtain the target fusion model.
In one embodiment, the contrast fusion features include positive contrast fusion features corresponding to positive contrast training samples and negative contrast fusion features corresponding to negative contrast training samples; the parameter adjustment module is also used for acquiring a forward characteristic difference value, wherein the forward characteristic difference value is a characteristic difference value between a fusion characteristic and a corresponding forward comparison fusion characteristic; acquiring a negative characteristic difference value which is a characteristic difference value between the fusion characteristic and the corresponding negative comparison fusion characteristic; and determining fusion loss based on the positive characteristic difference value and the negative characteristic difference value.
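A hedged sketch of the fusion loss built from these positive and negative feature differences (the margin value is an assumption; the patent computes the loss with reference to its formula (4)):

```python
import torch
import torch.nn.functional as F

def fusion_triplet_loss(fused, pos_fused, neg_fused, margin: float = 1.0):
    """Triplet-style fusion loss: pull the fusion feature toward its
    positive contrast fusion feature and push it away from the negative
    one by at least `margin`."""
    pos_diff = F.pairwise_distance(fused, pos_fused)  # forward difference
    neg_diff = F.pairwise_distance(fused, neg_fused)  # negative difference
    return F.relu(pos_diff - neg_diff + margin).mean()
```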
In one embodiment, the parameter adjustment module is further configured to determine an activation feature component from feature components included in each fusion feature, count activation feature components belonging to the same feature bit, and determine respective activation degrees of feature bits corresponding to each fusion feature according to a statistical result; respectively taking the characteristic components corresponding to the characteristic bits with the activation degree meeting the preset condition in each fusion characteristic as target characteristic components; acquiring a quantization target corresponding to a non-target feature component in the fusion feature, and determining quantization loss based on the difference between the non-target feature component and the corresponding quantization target; the non-target feature component is a feature component other than the target feature component in the fusion feature.
The above-described feature fusion model processing apparatus and each module in the sample retrieval apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, as shown in fig. 13, there is provided a sample retrieval apparatus 1300 comprising:
A feature obtaining module 1302, configured to obtain semantic features and non-semantic features corresponding to the query sample;
the feature fusion module 1304 is configured to input semantic features and non-semantic features into the target fusion model for fusion processing, so as to obtain target fusion features corresponding to the query sample; the target fusion model is obtained when the feature fusion model to be trained is trained based on each fusion feature until the training stopping condition is met; each fusion feature is obtained by fusion processing based on each target feature by utilizing a feature fusion model; each target feature is obtained by respectively setting a feature component corresponding to a feature bit with the activation degree meeting a preset condition in each candidate feature as a non-activation value; the respective activation degree of each feature bit is determined according to the statistical result by counting the activation feature components belonging to the same feature bit; activating feature components is determined from feature components contained in each candidate feature by taking at least one of semantic features and non-semantic features as the candidate feature; the semantic features and the non-semantic features correspond to training samples in the training sample set;
the sample retrieval module 1306 is configured to retrieve a target retrieval sample corresponding to the query sample from the sample database based on the target fusion feature.
According to the above sample retrieval device, on one hand, the semantic features and the non-semantic features of the query sample can be fused through the target fusion model to obtain the target fusion feature. The target fusion feature has both semantic measurement capability and non-semantic measurement capability, so it can characterize the sample more accurately, and retrieval based on the target fusion feature yields highly accurate retrieval results. On the other hand, the target fusion model is obtained by training the feature fusion model to be trained based on each fusion feature until the training stop condition is met; each fusion feature is obtained by fusion processing based on each target feature using the feature fusion model; each target feature is obtained by setting the feature components corresponding to the feature bits whose activation degree satisfies the preset condition in each candidate feature as an inactive value; the respective activation degree of each feature bit is determined from the statistics of the activated feature components belonging to the same feature bit; and the activated feature components are determined from the feature components contained in each candidate feature, with at least one of the semantic features and the non-semantic features taken as candidate features. In this way, the influence of the features on low-activation feature bits on the training process is reduced, the model can better learn the key feature bits, the trained target fusion model can better perform feature fusion, and the sample retrieval accuracy is further improved.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 14. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device may be used to store training sample data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a feature fusion model processing method or a sample retrieval method.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 15. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a feature fusion model processing method or a sample retrieval method. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 14 and 15 are merely block diagrams of portions of structures related to the aspects of the present application and are not intended to limit the computer device to which the aspects of the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or may have different arrangements of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above examples represent only a few embodiments of the present application; although they are described in relative detail, they are not to be construed as limiting the scope of the application. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of this application shall be subject to the appended claims.

Claims (19)

1. A method for processing a feature fusion model, the method comprising:
acquiring semantic features and non-semantic features corresponding to training samples in a training sample set;
determining an activated feature component from feature components contained in each candidate feature by taking at least one of the semantic feature and the non-semantic feature as the candidate feature;
Counting the activated feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the counting result;
respectively setting feature components corresponding to feature bits with activation degrees meeting preset conditions in the candidate features as non-activation values to obtain target features;
based on the target features, carrying out fusion processing by utilizing a feature fusion model to be trained to obtain fusion features;
training the feature fusion model based on each fusion feature, and obtaining a target fusion model when the training stop condition is met; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of an input sample to obtain target fusion features.
2. The method according to claim 1, wherein the counting the activated feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the counted result, includes:
counting the number of the activated feature components belonging to the same feature bit to obtain the respective activated number of each feature bit corresponding to each candidate feature;
Respectively counting the total number of the characteristic components belonging to each characteristic bit;
and determining the respective activation degree of each feature bit based on the respective activation quantity and the total number of features of each feature bit.
3. The method of claim 2, wherein said determining the respective activation level of each of said feature bits based on the respective number of activations and the total number of features of each of said feature bits comprises:
determining the respective activation proportion of each feature bit based on the respective activation quantity and the total number of the features of each feature bit;
and calculating the respective activation entropy of each feature bit based on the respective activation proportion of each feature bit, and determining the respective activation entropy of each feature bit as the respective activation degree of each feature bit.
4. The method of claim 1, wherein said determining an activation feature component from feature components contained in each of said candidate features comprises:
mapping the feature components contained in each candidate feature according to a preset mapping mode to obtain respective mapping feature values of each feature component;
comparing the respective mapping characteristic values of the characteristic components with the activation values set by the preset mapping mode;
And determining the feature components corresponding to the mapping feature values which are consistent in comparison as activated feature components.
5. The method according to claim 1, wherein the step of setting, as the inactive value, feature components corresponding to feature bits whose activation degree satisfies a preset condition in each of the candidate features, respectively, to obtain each target feature includes:
based on the respective activation degrees of the feature bits, sequencing the feature bits, and determining feature bits with activation degrees meeting preset conditions based on sequencing results to obtain target feature bits;
and respectively setting the feature components corresponding to the target feature bits in the candidate features as non-activated values to obtain the target features.
6. The method of claim 1, wherein determining an activation feature component from feature components contained in each of the candidate features using at least one of the semantic features and the non-semantic features as candidate features comprises:
determining an activated feature component corresponding to the semantic feature from feature components contained in each semantic feature, and determining an activated feature component corresponding to the non-semantic feature from feature components contained in each non-semantic feature;
The step of counting the activation feature components belonging to the same feature bit, and determining the respective activation degree of each feature bit corresponding to each candidate feature according to the statistical result, comprises the following steps:
counting the activated feature components belonging to the same feature bit in each semantic feature, and determining the respective activation degree of each feature bit corresponding to each semantic feature according to the counting result;
counting the activated feature components belonging to the same feature bit in each non-semantic feature, and determining the respective activation degree of each feature bit corresponding to each non-semantic feature according to the counting result;
setting feature components corresponding to feature bits with activation degrees meeting preset conditions in the candidate features as non-activation values, and obtaining target features comprises the following steps:
respectively setting feature components corresponding to feature bits with activation degrees meeting preset conditions in the semantic features as non-activation values to obtain first target features;
and respectively setting the feature components corresponding to the feature bits with the activation degree meeting the preset conditions in the non-semantic features as non-activation values to obtain second target features.
7. The method according to claim 6, wherein the fusing process based on the target features and using the feature fusion model to be trained to obtain the fused features includes:
splicing the first target feature and the second target feature corresponding to the same training sample to obtain spliced features;
inputting the spliced features into the feature fusion model; the feature fusion model comprises a first full-connection layer and a second full-connection layer;
mapping the splicing characteristics through the first full-connection layer to obtain intermediate characteristics;
mapping the intermediate features through the second full connection layer to obtain the fusion features;
the intermediate features are the same as the stitching features in dimension, and the fusion features are smaller in dimension than the stitching features.
8. The method according to claim 1, wherein the semantic features of the training sample are obtained by feature extraction of the training sample by a trained target semantic feature extraction model, and the corresponding non-semantic features of the training sample are obtained by feature extraction of the training sample by a trained target non-semantic feature extraction model; the training step of the target semantic feature extraction model and the target non-semantic feature extraction model comprises the following steps:
Obtaining a comparison training sample corresponding to a training sample in the training sample set;
respectively inputting the training sample and the comparison training sample as target samples into an initial semantic feature extraction model to be trained and an initial non-semantic feature extraction model to be trained;
outputting semantic training features corresponding to the target samples respectively through the initial semantic feature extraction model, and outputting non-semantic training features corresponding to the target samples respectively through the initial non-semantic feature extraction model;
obtaining semantic feature loss based on the semantic training features corresponding to the training samples and the semantic training features corresponding to the comparison training samples;
obtaining a first non-semantic feature loss based on the non-semantic training features corresponding to the training samples and the non-semantic training features corresponding to the comparison training samples;
classifying the training samples based on semantic training features of the training samples to obtain classification results, and determining classification loss based on the classification results;
training the initial semantic feature extraction model and the initial non-semantic feature extraction model based on the semantic feature loss, the first non-semantic feature loss and the classification loss, and obtaining a trained target semantic feature extraction model and a trained target non-semantic feature extraction model when a training stop condition is satisfied.
9. The method of claim 8, wherein the initial semantic feature extraction model and the initial non-semantic feature extraction model have a shared network layer; outputting the semantic training features corresponding to the target samples through the initial semantic feature extraction model, and outputting the non-semantic training features corresponding to the target samples through the initial non-semantic feature extraction model, wherein the semantic training features comprise:
convolving each target sample through the shared network layer to obtain each corresponding shared characteristic of each target sample;
performing feature processing on each shared feature through a first feature processing layer of the initial semantic feature extraction model to obtain semantic training features corresponding to each target sample;
performing feature processing on each shared feature through a second feature processing layer of the initial non-semantic feature extraction model to obtain a non-semantic training feature corresponding to each target sample;
the network depth of the first feature processing layer is greater than the network depth of the second feature processing layer.
10. The method of claim 8, wherein the initial non-semantic feature extraction model is obtained by:
Respectively inputting the training sample and the comparison training sample into a candidate non-semantic feature extraction model, and outputting non-semantic training features corresponding to the training sample and non-semantic training features corresponding to the comparison training sample;
obtaining a second non-semantic feature loss based on the non-semantic training features corresponding to the training samples of the candidate non-semantic feature extraction model and the non-semantic training features corresponding to the comparison training samples;
and based on the second non-semantic feature loss, adjusting model parameters of the candidate non-semantic feature extraction model and continuing training, and obtaining the initial non-semantic feature extraction model when a training stopping condition is met.
11. The method of claim 1, wherein the fusion features are binary quantized features, wherein the training the feature fusion model based on each of the fusion features, when a training stop condition is satisfied, results in a target fusion model, comprising:
acquiring a quantization target corresponding to the feature component in the fusion feature, and determining quantization loss based on the difference between the feature component in the fusion feature and the corresponding quantization target;
acquiring a contrast fusion characteristic corresponding to a contrast training sample of the training sample, and determining fusion loss based on the difference between the fusion characteristic and the contrast fusion characteristic; the contrast fusion characteristics are obtained based on the characteristic fusion model;
Counting the quantization loss and the fusion loss to obtain a target loss;
and adjusting model parameters of the feature fusion model based on the target loss, and continuing training until the training stopping condition is met, so as to obtain the target fusion model.
12. The method of claim 11, wherein the contrast fusion features comprise positive contrast fusion features corresponding to positive contrast training samples and negative contrast fusion features corresponding to negative contrast training samples; the determining a fusion loss based on a difference between the fusion feature and the comparative fusion feature comprises:
acquiring a forward characteristic difference value, wherein the forward characteristic difference value is a characteristic difference value between the fusion characteristic and a corresponding forward comparison fusion characteristic;
acquiring a negative characteristic difference value, wherein the negative characteristic difference value is a characteristic difference value between the fusion characteristic and a corresponding negative comparison fusion characteristic;
and determining fusion loss based on the positive characteristic difference value and the negative characteristic difference value.
13. The method of claim 11, wherein the method further comprises:
determining an activated feature component from feature components contained in each fusion feature, counting the activated feature components belonging to the same feature bit, and determining respective activation degrees of feature bits corresponding to each fusion feature according to a statistical result;
Respectively taking the characteristic components corresponding to the characteristic bits with the activation degree meeting the preset conditions in the fusion characteristics as target characteristic components;
the obtaining the quantization target corresponding to the feature component in the fusion feature, and determining the quantization loss based on the difference between the feature component in the fusion feature and the corresponding quantization target, includes:
acquiring a quantization target corresponding to a non-target feature component in the fusion feature, and determining quantization loss based on a difference between the non-target feature component and the corresponding quantization target; the non-target feature component is a feature component of the fused feature other than the target feature component.
14. A sample retrieval method, the method comprising:
acquiring semantic features and non-semantic features corresponding to the query sample;
inputting the semantic features and the non-semantic features into a target fusion model for fusion processing to obtain target fusion features corresponding to the query sample; the target fusion model is obtained when a feature fusion model to be trained is trained based on each fusion feature until a training stop condition is met; each fusion feature is obtained by fusion processing based on each target feature and by utilizing the feature fusion model; each target feature is obtained by respectively setting a feature component corresponding to a feature bit with the activation degree meeting a preset condition in each candidate feature as a non-activation value; the respective activation degree of each feature bit is determined according to the statistical result by counting the activation feature components belonging to the same feature bit; the activation feature component is determined from feature components contained in each of the candidate features by taking at least one of the semantic feature and the non-semantic feature as a candidate feature; the semantic features and the non-semantic features correspond to training samples in the training sample set;
And retrieving a target retrieval sample corresponding to the query sample from a sample database based on the target fusion characteristic.
15. A feature fusion model processing apparatus, the apparatus comprising:
the feature acquisition module is used for acquiring semantic features and non-semantic features corresponding to training samples in the training sample set;
an activation component determining module, configured to determine an activation feature component from feature components included in each candidate feature by using at least one of the semantic feature and the non-semantic feature as a candidate feature;
the activation degree determining module is used for counting activation feature components belonging to the same feature bit and determining respective activation degrees of feature bits corresponding to the candidate features according to the counting result;
the target feature obtaining module is used for respectively setting feature components corresponding to feature bits with the activation degree meeting preset conditions in the candidate features as non-activation values to obtain target features;
the fusion processing module is used for carrying out fusion processing by utilizing a feature fusion model to be trained based on each target feature to obtain each fusion feature;
The parameter adjustment module is used for training the feature fusion model based on each fusion feature, and obtaining a target fusion model when the training stop condition is met; the target fusion model is used for carrying out fusion processing on semantic features and non-semantic features of an input sample to obtain target fusion features.
16. A sample retrieval apparatus, the apparatus comprising:
the feature acquisition module is used for acquiring semantic features and non-semantic features corresponding to the query sample;
the feature fusion module is used for inputting the semantic features and the non-semantic features into a target fusion model for fusion processing to obtain target fusion features corresponding to the query sample; the target fusion model is obtained when a feature fusion model to be trained is trained based on each fusion feature until a training stop condition is met; each fusion feature is obtained by fusion processing based on each target feature and by utilizing the feature fusion model; each target feature is obtained by respectively setting a feature component corresponding to a feature bit with the activation degree meeting a preset condition in each candidate feature as a non-activation value; the respective activation degree of each feature bit is determined according to the statistical result by counting the activation feature components belonging to the same feature bit; the activation feature component is determined from feature components contained in each of the candidate features by taking at least one of the semantic feature and the non-semantic feature as a candidate feature; the semantic features and the non-semantic features correspond to training samples in the training sample set;
And the sample retrieval module is used for retrieving a target retrieval sample corresponding to the query sample from a sample database based on the target fusion characteristics.
17. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 14 when the computer program is executed.
18. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 14.
19. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 14.