CN113705589A - Data processing method, device and equipment

Data processing method, device and equipment

Info

Publication number
CN113705589A
Authority
CN
China
Prior art keywords
loss
sample
training
feature
positive
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202111279272.6A
Other languages
Chinese (zh)
Inventor
郭卉 (Guo Hui)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111279272.6A
Publication of CN113705589A
Legal status: Pending

Classifications

    • G06F 18/214 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques): Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 (Pattern recognition; Analysing): Matching criteria, e.g. proximity measures
    • G06N 3/04 (Computing arrangements based on biological models; Neural networks): Architecture, e.g. interconnection topology
    • G06N 3/08 (Computing arrangements based on biological models; Neural networks): Learning methods


Abstract

The application provides a data processing method, apparatus and device, applicable to fields and scenarios such as cloud technology, artificial intelligence, blockchain, Internet of Vehicles, intelligent transportation and smart home. The method includes: acquiring training samples, where the training samples include a reference sample, a positive sample and a negative sample; calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample; determining a similarity loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature; and superimposing the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, where the target feature extraction model is used to extract data features of multimedia data.

Description

Data processing method, device and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, and a computer device.
Background
With the rapid development of computer technology, feature extraction models are applied ever more widely. Before a feature extraction model is applied, it usually needs to be trained. The training quality determines the accuracy of the feature extraction model (reflected in the quality of the extracted features), and the loss calculation method in turn largely determines the training quality.
Disclosure of Invention
The embodiments of the application provide a data processing method, a data processing apparatus and a data processing device that combine a contrast loss with a similarity loss and can effectively improve the accuracy of a feature extraction model.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring a training sample, wherein the training sample comprises a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relation, and the reference sample and the negative sample satisfy a dissimilar relation;
calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample;
determining a similarity loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature;
and superimposing the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, where the target feature extraction model is used for extracting data features of multimedia data.
In another aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus includes:
an acquisition unit, configured to acquire training samples, where the training samples include a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relation, and the reference sample and the negative sample satisfy a dissimilar relation;
the processing unit is used for calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample;
the processing unit is further configured to determine a similarity loss according to the reference feature and the positive feature, and determine a contrast loss according to the reference feature, the positive feature and the negative feature;
the processing unit is further configured to superimpose the similarity loss and the contrast loss into a target loss and train the feature extraction model according to the target loss to obtain a target feature extraction model, where the target feature extraction model is used to extract data features of multimedia data.
Accordingly, an embodiment of the present application provides a computer device, where the computer device includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores a computer program, and the processor is configured to call the computer program to execute the data processing method according to any possible implementation manner.
Accordingly, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the steps of the data processing method provided by the present application.
Accordingly, the embodiments of the present application also provide a computer program product, where the computer program product includes a computer program or computer instructions, and the computer program or the computer instructions are executed by a processor to implement the steps of the data processing method provided by the embodiments of the present application.
Accordingly, the embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided by the embodiment of the present application.
In the embodiment of the application, a feature extraction model is called to perform feature extraction processing on the reference sample, the positive sample and the negative sample included in a training sample, obtaining the reference feature of the reference sample, the positive feature of the positive sample and the negative feature of the negative sample, where the reference sample and the positive sample satisfy a similar relation and the reference sample and the negative sample satisfy a dissimilar relation. A similarity loss is determined according to the reference feature and the positive feature, and a contrast loss is determined according to the reference feature, the positive feature and the negative feature. The similarity loss and the contrast loss are superimposed into a target loss, and the feature extraction model is trained according to the target loss to obtain a target feature extraction model used for extracting data features of multimedia data. By training the feature extraction model with the contrast loss and the similarity loss combined, the model can ensure that the difference between dissimilar samples is far larger than the difference between similar samples while also ensuring the similarity between similar samples, so the accuracy of the feature extraction model can be effectively improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of a network architecture suitable for a data processing method according to an embodiment of the present application;
fig. 2 is a first flowchart illustrating a data processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a residual block according to an embodiment of the present disclosure;
fig. 4 is a second schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 5 is a third schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 6 is a fourth schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions of "first", "second", etc. in the embodiments of the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a technical feature qualified as "first" or "second" may explicitly or implicitly include at least one such feature.
The data processing method provided by the embodiments of the present application trains the feature extraction model by combining a contrast loss and a similarity loss, so that the model can ensure that the difference between dissimilar samples is far larger than the difference between similar samples while also ensuring the similarity between similar samples, effectively improving the accuracy of the feature extraction model. The data processing method relates in particular to the machine learning branch of artificial intelligence. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
In a possible embodiment, the data processing method provided in the embodiments of the present application may also be implemented based on cloud technology and/or blockchain technology. A blockchain is a series of records (blocks) that are cryptographically linked and protected: each block contains the cryptographic hash of the previous block, a timestamp and transaction data (typically represented by hash values computed with a Merkle tree algorithm), a design that makes the content of a block tamper-resistant. The distributed ledger chained together by blockchain technology lets two parties effectively record a transaction and check it permanently. The data processing method may specifically involve one or more of cloud storage, cloud database and big data within cloud technology. For example, data needed to execute the data processing method (e.g., training samples and the feature extraction model) may be obtained from a cloud database. As another example, the data required to execute the method may be stored in blocks on a blockchain; data generated by executing the method (e.g., reference features, positive features and negative features) may likewise be stored in blocks on a blockchain; and the computer device that executes the method may be a node device in a blockchain network.
The data processing method provided by the embodiment of the application can be applied to the network architecture shown in fig. 1. The computer device 10 shown in fig. 1 may be a server or a terminal having a data processing function, where the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, a cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like. The terminal 11 shown in fig. 1 is connected to the computer device 10 via a network. The data processing method provided by the embodiment of the present application may be executed by the computer device 10, specifically:
the method comprises the steps of calling a feature extraction model to carry out feature extraction processing on a reference sample, a positive sample and a negative sample included by a training sample to obtain reference features of the reference sample, positive features of the positive sample and negative features of the negative sample, wherein the reference sample and the positive sample meet a similar relation, the reference sample and the negative sample meet a dissimilar relation, determining similar loss according to the reference features and the positive features, determining contrast loss according to the reference features, the positive features and the negative features, superposing the similar loss and the contrast loss into target loss, training the feature extraction model according to the target loss to obtain a target feature extraction model, combining the contrast loss and the similar loss, and effectively improving accuracy of the feature extraction model.
The target feature extraction model 12 shown in fig. 1 is obtained with the above data processing method and may be deployed in the computer device 10. In an embodiment, the computer device 10 may use the deployed target feature extraction model 12 to extract data features of multimedia data so as to perform a retrieval task, specifically: the computer device 10 receives a query request (carrying target multimedia data) sent by the terminal 11, invokes the target feature extraction model 12 to perform feature extraction processing on the target multimedia data to obtain its data feature, and likewise invokes the target feature extraction model 12 to perform feature extraction processing on a plurality of multimedia data to be recalled to obtain their data features. It then determines the similarity between the data feature of the target multimedia data and the data features of the multimedia data to be recalled, selects the recalled multimedia data from the candidates according to the similarity, and outputs the recalled multimedia data to the terminal 11, which improves the accuracy of the retrieval task.
The data attribute type of the target multimedia data or of the multimedia data to be recalled may be any one of image, text, audio and video; for example, when determining whether two pieces of audio are similar, both the target multimedia data and the multimedia data to be recalled are audio data. When selecting the recalled multimedia data, the candidate with the greatest similarity to the target multimedia data may be selected, or all candidates whose similarity to the target multimedia data is greater than a similarity threshold (which may be set manually) may be selected, as illustrated in the sketch below.
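As an illustration only, the following Python sketch mirrors this recall flow; the patent provides no code, so the model interface, the tensor shapes and the use of cosine similarity as the similarity measure are all assumptions.

```python
import torch

def recall(model, query, candidates, threshold=None):
    """Rank candidate items against a query by feature similarity.

    model      -- the trained target feature extraction model (hypothetical)
    query      -- tensor for the target multimedia data, shape (1, ...)
    candidates -- tensor batch of to-be-recalled items, shape (B, ...)
    threshold  -- if given, return all indices above it; else the single best
    """
    model.eval()
    with torch.no_grad():
        q = model(query)       # (1, D) query feature
        c = model(candidates)  # (B, D) candidate features
    # Cosine similarity between the query feature and every candidate feature
    sims = torch.nn.functional.cosine_similarity(q, c)  # (B,)
    if threshold is None:
        return [int(sims.argmax())]  # most similar candidate only
    return [i for i, s in enumerate(sims.tolist()) if s > threshold]
```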
It can be understood that the schematic diagram of the network architecture described in the embodiment of the present application is intended to illustrate the technical solution more clearly and does not limit it. As those of ordinary skill in the art know, with the evolution of network architectures and the appearance of new service scenarios, the technical solution provided in the embodiment of the present application is equally applicable to similar technical problems.
The data processing method provided by the embodiment of the present application is briefly introduced above, and a specific implementation of the data processing method is described in detail below.
Referring to fig. 2, fig. 2 is a first flowchart illustrating a data processing method according to an embodiment of the present disclosure. The data processing method includes but is not limited to the following steps:
s201, obtaining a training sample, wherein the training sample comprises a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relation, and the reference sample and the negative sample satisfy a dissimilar relation.
In the embodiment of the present application, the training sample is sample multimedia data: the reference sample is reference multimedia data, the positive sample is positive multimedia data, and the negative sample is negative multimedia data. The data attribute type of the training sample may be any one of text, image, audio and video. For example, when the data attribute type of the training sample is text, the training sample is sample text data, the reference sample is reference text data, the positive sample is positive text data, and the negative sample is negative text data; when the data attribute type of the training sample is audio, the training sample is sample audio data, the reference sample is reference audio data, the positive sample is positive audio data, and the negative sample is negative audio data. The reference sample and the positive sample satisfy a similar relation (they constitute a similar pair), and the reference sample and the negative sample satisfy a dissimilar relation (they constitute a dissimilar pair). For example, a reference sample and a positive sample whose images both show cats satisfy the similar relation and are similar samples, while a reference sample whose image shows a cat and a negative sample whose image shows a dog satisfy the dissimilar relation and are dissimilar samples.
In one embodiment, when the computer device obtains training samples, it may randomly extract from a sample library three samples that satisfy the similar relation and the dissimilar relation to form a training sample. However, training samples generated by random extraction are often easy samples, i.e., easily distinguishable training samples, such as a reference sample and a positive sample with extremely high similarity and/or a reference sample and a negative sample with extremely low similarity. Easy samples help the feature extraction model learn in the initial stage of training, but after the model has established a good ability to distinguish easy samples, a lack of difficult samples (i.e., training samples that are not easily distinguished) means the model never acquires a good ability to distinguish them.
In another embodiment, the computer device obtains training samples as follows: obtain K groups of sample pairs, where each sample pair includes a reference sample and a positive sample; determine the distances between the reference sample included in a target sample pair and K-1 candidate samples; determine M negative samples from the K-1 candidate samples according to the distances; and combine the target sample pair with each of the M negative samples to form training samples, where M is a positive integer and M ≤ K-1.
In the embodiment of the application, the target sample pair is any sample pair among the K groups of sample pairs, the K-1 candidate samples are the K-1 reference samples or the K-1 positive samples included in the K-1 reference sample pairs, and a reference sample pair is any sample pair among the K groups other than the target sample pair. Correspondingly, when the training sample is sample multimedia data, with the reference sample being reference multimedia data, the positive sample positive multimedia data and the negative sample negative multimedia data, the candidate samples are candidate multimedia data, the reference sample included in a sample pair is reference multimedia data, and the positive sample included in a sample pair is positive multimedia data.
Specifically, after calculating the distances between the reference sample of the target sample pair and the K-1 candidate samples (each of which may be the reference sample or the positive sample of a reference sample pair), the computer device may rank the K-1 candidate samples by distance from small to large and select the top M (a positive integer, for example M = 20) candidate samples as negative samples, so that the K groups of sample pairs and their M negatives each form K × M training samples in total, as sketched below. The distance may be computed as a Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, or the like.
It should be noted that the distance measures the similarity between samples: the smaller the distance, the higher the similarity. A candidate sample at a small distance from a reference sample forms, together with the positive sample pair, a difficult sample, so the feature extraction model can learn from difficult samples and acquire a good ability to distinguish samples that are hard to tell apart. In addition, under the distance constraint on similar samples, the similarity of similar samples across the global training data can be improved, avoiding the excessive fluctuation in the distance between similar samples that arises when training samples are generated by random sampling.
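A minimal sketch of this negative mining, under the assumption that the ranking is performed on extracted feature vectors with the Euclidean distance; the function and variable names are illustrative, not from the patent.

```python
import torch

def mine_triplets(anchors, positives, M=20):
    """Build K*M training triplets from K (reference, positive) pairs.

    anchors, positives -- feature tensors of shape (K, D): the K sample pairs
    M                  -- negatives kept per pair (M <= K-1)
    Returns (anchor_idx, positive_idx, negative_idx) triples, with negatives
    drawn here from the other pairs' reference samples.
    """
    K = anchors.size(0)
    triplets = []
    # Pairwise Euclidean distances between all reference samples, shape (K, K)
    dists = torch.cdist(anchors, anchors, p=2)
    for i in range(K):
        d = dists[i].clone()
        d[i] = float("inf")  # exclude the pair's own reference sample
        # The M smallest distances give the M hardest candidate negatives
        hard = torch.topk(d, k=min(M, K - 1), largest=False).indices
        for j in hard.tolist():
            triplets.append((i, i, j))  # (reference, positive, negative)
    return triplets
```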
S202, calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain the reference feature of the reference sample, the positive feature of the positive sample and the negative feature of the negative sample.
The following steps are illustrated by taking a training sample as an example:
in the embodiment of the present application, the feature extraction model is used for feature extraction, and the better the feature quality (or feature expression capability) extracted by the feature extraction model is, the higher the accuracy of the feature extraction model is, and correspondingly, the worse the feature quality (or feature expression capability) extracted by the feature extraction model is, the lower the accuracy of the feature extraction model is. The feature extraction model can be any neural network, the neural network is a technology from an imitated biological neural network, and a target is finally achieved by connecting a plurality of feature values and combining linearity and nonlinearity. The feature extraction model can be Mobilene-v 2, ResNet-18 or ResNet-101, etc., which is not limited in this application. Correspondingly, when the training samples are sample multimedia data, and the reference samples are reference multimedia data, the positive samples are positive multimedia data, the negative samples are negative multimedia data, the reference feature is a reference multimedia data feature, the positive feature is a positive multimedia data feature, the negative feature is a negative multimedia data feature, the reference feature is a feature after feature extraction processing is performed on the reference multimedia data, the positive feature is a feature after feature extraction processing is performed on the positive multimedia data, the negative feature is a feature after feature extraction processing is performed on the negative multimedia data, the reference feature, the positive feature and the negative feature correspond to the reference sample, the positive sample and the negative sample one to one, for example, when the data attribute type of the training sample is a text, the reference text data feature corresponds to the reference text data, the body text data feature corresponds to the body text data, and the negative text data feature corresponds to the negative text data. It can be known that when the data attribute type of the training sample is text, the reference multimedia data feature, the positive multimedia data feature and the negative multimedia data feature (or the reference feature, the positive feature and the negative feature) are both a text feature, and so on.
In one embodiment, the feature extraction model includes a feature extraction submodule and a feature representation submodule. The computer equipment can call a feature extraction submodule and a feature representation submodule included in the feature extraction model to respectively perform feature extraction on the reference sample, the positive sample and the negative sample to obtain the reference feature of the reference sample, the positive feature of the positive sample and the negative feature of the negative sample.
In the embodiment of the present application, the feature extraction submodule is exemplified by ResNet-101 pre-trained on the ImageNet data set; please refer to Table 1 below, which records the network structure of ResNet-101 (the standard ResNet-101 configuration):
TABLE 1
Layer     Output size (224×224 input)   Configuration
conv1     112×112                       7×7, 64, stride 2
conv2_x   56×56                         3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] × 3
conv3_x   28×28                         [1×1, 128; 3×3, 128; 1×1, 512] × 4
conv4_x   14×14                         [1×1, 256; 3×3, 256; 1×1, 1024] × 23
conv5_x   7×7                           [1×1, 512; 3×3, 512; 1×1, 2048] × 3
As shown in Table 1 above, ResNet-101 comprises 5 convolutional layers. The first convolutional layer uses 64 convolution kernels of size 7×7 with a stride of 2. The second convolutional layer applies 3×3 max pooling with a stride of 2 and then uses 3 residual blocks; the structure of each residual block is shown in fig. 3 and mainly uses 1×1 and 3×3 convolution kernels, where the first 1×1 convolution compresses the number of feature channels, the third 1×1 convolution restores the number of feature channels, and the numbers of kernels are 64 and 256 respectively. The third convolutional layer uses 4 residual blocks, the fourth convolutional layer uses 23 residual blocks, and the fifth convolutional layer uses 3 residual blocks, so ResNet-101 uses 33 residual blocks in total.
In one embodiment, please refer to Table 2 below, which records the network structure of the feature representation submodule; the submodule comprises a pooling layer using max pooling and a fully connected layer, and the extracted feature is output through the fully connected layer:
TABLE 2
Layer   Configuration
pool    max pooling
fc      fully connected layer (outputs the extracted feature)
Optionally, when the feature extraction model is trained, its parameters may be initialized from a Gaussian distribution with mean 0 and variance 0.01, and the learning rate may be set to 0.0005.
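The following sketch assembles a feature extraction model in this spirit: a pre-trained ResNet-101 backbone (Table 1) followed by max pooling and a fully connected layer (Table 2), Gaussian initialization of the new head (mean 0, variance 0.01, hence standard deviation 0.1), and a learning rate of 0.0005. The embedding size of 128 and the use of SGD are assumptions not stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self, embed_dim=128):  # embed_dim is an assumption
        super().__init__()
        backbone = models.resnet101(weights="IMAGENET1K_V1")  # ImageNet pre-trained
        # Keep conv1..conv5 (drop ResNet's own average pool and classifier)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveMaxPool2d(1)   # max pooling, as in Table 2
        self.fc = nn.Linear(2048, embed_dim)  # fully connected output layer
        # Gaussian init: mean 0, variance 0.01 -> standard deviation 0.1
        nn.init.normal_(self.fc.weight, mean=0.0, std=0.1)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        return self.fc(h)

model = FeatureExtractor()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)  # stated learning rate
```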
S203, determining a similarity loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature.
When the feature extraction model performs contrast learning on the training sample with a contrast loss, training makes the distance between the reference feature of the reference sample and the negative feature of the negative sample (abbreviated D_an) greater than the distance between the reference feature of the reference sample and the positive feature of the positive sample (abbreviated D_ap), and requires the difference between the two distances to exceed a specified boundary α (a positive integer, which can be set manually), i.e., D_an - D_ap ≥ α. For different training samples, D_an and D_ap differ. For example, suppose training sample 1 has the same reference sample and positive sample as training sample 2, with D_an = 23 and D_ap = 20 for training sample 1 and D_an = 32 and D_ap = 30 for training sample 2; if α is 2, both training samples satisfy D_an - D_ap ≥ α. It can be seen that the contrast loss guarantees only the relative distance of the training samples (i.e., guarantees D_an - D_ap ≥ α) and cannot guarantee the absolute distance between the reference sample and the positive sample (i.e., cannot guarantee the similarity between the reference sample and the positive sample). If model learning is insufficient and the learned features are of poor quality, D_an may well end up smaller than D_ap.
In addition, training the feature extraction model with the contrast loss alone has another problem: labeling errors may introduce noise into the training samples. For example, two samples with no similarity may be labeled as a reference sample and a positive sample (in which case the data type of the positive sample is the noise type), or two samples with similarity may be labeled as a reference sample and a negative sample (in which case the data type of the negative sample is the noise type). For noisy training samples there are many ways for the contrast loss to be optimized; for example, even if the reference feature and the negative feature are similar, the distance D_an may be large enough to satisfy D_an - D_ap ≥ α. The relative-distance constraint required for contrast learning is then satisfied, yet the feature extraction model is in fact not effectively trained.
In one embodiment, the computer device determines the contrast loss according to the reference feature, the positive feature and the negative feature, specifically: a first distance between the reference feature and the positive feature and a second distance between the reference feature and the negative feature are determined, and the first distance and the second distance are input into a contrast loss function to determine the contrast loss. The contrast loss function is shown in the following formula (1):

L_t = max(D_ap - D_an + α, 0)    (1)

where L_t denotes the contrast loss function, D_ap denotes the first distance between the reference feature and the positive feature (specifically, it may be the Euclidean distance between them), D_an denotes the second distance between the reference feature and the negative feature (specifically, it may be the Euclidean distance between them), and α denotes a first specified boundary (a positive integer) used to require that the difference between the second distance and the first distance be greater than or equal to α (i.e., the purpose of the contrast loss); for example, if D_an - D_ap ≥ 20 is required, α is 20. The contrast loss determined by the contrast loss function enables the feature extraction model to realize contrast learning.
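For illustration, formula (1) can be rendered directly in Python (PyTorch), assuming batched feature tensors and Euclidean distances; the function name and the margin value are placeholders.

```python
import torch

def contrast_loss(ref, pos, neg, alpha=20.0):
    """Formula (1): L_t = max(D_ap - D_an + alpha, 0), per training sample.

    ref, pos, neg -- feature tensors of shape (B, D)
    alpha         -- first specified boundary (margin)
    """
    d_ap = torch.norm(ref - pos, p=2, dim=1)  # first distance
    d_an = torch.norm(ref - neg, p=2, dim=1)  # second distance
    return torch.clamp(d_ap - d_an + alpha, min=0.0)
```

PyTorch's built-in torch.nn.TripletMarginLoss computes the same hinge, so the explicit version above serves only to mirror formula (1).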
In one embodiment, the computer device determines a similarity loss based on a first distance between the reference feature and the positive feature, the similarity loss being used to limit the distance between the reference feature and the positive feature to within a certain threshold range (i.e., for purposes of similarity learning).
In one embodiment, the computer device determines the similarity loss according to the reference feature and the positive feature, specifically: the first distance between the reference feature and the positive feature is input into a similarity loss function to determine the similarity loss. The similarity loss function is represented by the following formula (2):

L_s = max(D_ap - β, 0)    (2)

where L_s denotes the similarity loss function and β denotes a second specified boundary (a positive integer) used to require that the first distance between the reference feature and the positive feature be less than β; for example, if D_ap < 10 is required, β is 10. The similarity loss determined by the similarity loss function enables the feature extraction model to realize similarity learning.
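Under the same assumptions, formula (2) as reconstructed above becomes:

```python
import torch

def similarity_loss(ref, pos, beta=10.0):
    """Formula (2): L_s = max(D_ap - beta, 0), pushing D_ap below beta."""
    d_ap = torch.norm(ref - pos, p=2, dim=1)  # first distance
    return torch.clamp(d_ap - beta, min=0.0)
```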
S204, superimposing the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, where the target feature extraction model is used for extracting data features of multimedia data.
In the embodiment of the application, the target loss is used for adjusting parameters of the feature extraction model, and the accuracy of the feature extraction model can be improved by reducing the target loss.
The target loss determined by combining the contrast loss and the similarity loss can ensure that the reference feature and the negative feature are dissimilar on the basis of ensuring that the reference feature and the positive feature are similar. For example, if the second specified boundary is 10 (i.e., the first distance D_ap between the reference feature and the positive feature is required to be less than 10) and the first specified boundary is 20 (i.e., the difference between the second distance D_an and the first distance D_ap is required to be greater than or equal to 20), then even for similar samples that are not very similar the first distance must stay below 10, while the second distance between dissimilar samples reaches 30 or more, so that the contrast loss and the similarity loss are both strictly satisfied.
In one embodiment, the computer device may determine a weight for the similarity loss and a weight for the contrast loss, and superimpose the similarity loss and the contrast loss into the target loss using these weights.
Specifically, the weight w_1 of the contrast loss and the weight w_2 of the similarity loss, together with the contrast loss L_t and the similarity loss L_s, are input into the target loss function to determine the target loss L_total. The target loss function is expressed by the following formula (3):

L_total = w_1 · L_t + w_2 · L_s    (3)

The target loss determined by the target loss function enables the feature extraction model to realize joint learning that combines similarity learning and contrast learning.
In order to avoid the negative influence of noise in the training samples on the feature extraction model, the losses of the samples are tracked, and a relaxation learning strategy is executed on the weight of the similarity loss and the weight of the contrast loss so as to determine the final target loss.
In an embodiment, the computer device obtains the trained batch quantity of the feature extraction model. If the trained batch quantity is smaller than a preset value, the weight of the similarity loss and the weight of the contrast loss are both set to a first parameter; if the trained batch quantity is not smaller than the preset value, the weight of the similarity loss and the weight of the contrast loss are determined according to the data type of the positive sample and the data type of the negative sample. The similarity loss and the contrast loss are then superimposed into the target loss according to their weights.
In the embodiment of the present application, one training generation (epoch) performs one full pass of training over all training samples, and all training samples may be divided into N (a positive integer) training batches (batches). If there are S (a positive integer) training generations, there are S × N training batches in total and the number of iterations is S × N; that is, each time training on one training batch is completed, one iteration is completed.
In the embodiment of the present application, the trained batch quantity is the number of training batches for which training has been completed, and the preset value is a positive integer that can be set manually. In an embodiment, the trained batch quantity may be determined from the training batch in which the training sample is located, specifically: if the training sample is in training batch s-n (the n-th training batch of the s-th training generation) and each epoch includes N training batches, the batch label s-n is converted into numerical form as (s-1) × N + (n-1), and the resulting value is taken as the trained batch quantity. For example, if each epoch includes 3 training batches and the training sample is in batch 2-1, the trained batch quantity is (2-1) × 3 + (1-1) = 3. This makes it convenient to judge whether the trained batch quantity is smaller than the preset value, as in the small helper below.
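A tiny helper capturing this conversion (names illustrative):

```python
def trained_batches(s, n, N):
    """Number of training batches completed before batch n of generation s.

    s and n are 1-indexed; N is the number of batches per generation (epoch).
    Example from the text: s=2, n=1, N=3 -> (2-1)*3 + (1-1) = 3.
    """
    return (s - 1) * N + (n - 1)
```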
In an embodiment, when the trained batch quantity is smaller than the preset value (i.e., in the initial stage of training), the weight of the similarity loss and the weight of the contrast loss are both directly set to the first parameter (for example, 1). When the trained batch quantity is not smaller than the preset value, the reference similarity loss and the reference contrast loss of the training batch in which the training sample is located are determined, and the data type of the positive sample and the data type of the negative sample are determined from them, so as to determine the weight of the similarity loss and the weight of the contrast loss.
In an embodiment, the reference similarity loss and the reference contrast loss of the training batch in which the training sample is located are determined as follows. If that training batch is the target training batch, the average similarity loss of the batch is taken as the reference similarity loss and the average contrast loss of the batch as the reference contrast loss. If that training batch is not the target training batch, the reference similarity loss is determined from the average similarity loss of the batch together with that of the adjacent training batch, and the reference contrast loss is determined from the average contrast loss of the batch together with that of the adjacent training batch.
In the embodiment of the application, the target training batch is a designated training batch: for example, the training batch that is k-1 batches after the first training batch of the first training generation (the preset value being denoted k), or the first training batch of each remaining training generation other than the first. The average contrast loss is the mean of the contrast losses of the training samples in a batch, and the average similarity loss is the mean of their similarity losses. The training batch adjacent to the training batch in which the training sample is located is the training batch whose training was completed immediately before it; for convenience of description, it is referred to below as the previous training batch.
For the case where the target training batch is the training batch k-1 batches after the first training batch of the first training generation: if the training batch in which the training sample is located is the target training batch (assuming k is 4, this is batch 1-4), the average similarity loss of that batch is taken as the reference similarity loss and its average contrast loss as the reference contrast loss. If the training batch is not the target training batch (assuming k is 4, batches 1-5, 1-6, 1-7, ...), the reference similarity loss of the batch is determined from its average similarity loss and the reference similarity loss of the previous training batch, and the reference contrast loss of the batch is determined from its average contrast loss and the reference contrast loss of the previous training batch.
For the case where the target training batch is the first training batch of each remaining training generation other than the first: if the training batch in which the training sample is located is the target training batch (batches 2-1, 3-1, 4-1, ...), the average similarity loss of that batch is taken as the reference similarity loss and its average contrast loss as the reference contrast loss. If the training batch is not the target training batch (batches 2-2, 2-3, ..., 3-2, 3-3, ..., 4-2, 4-3, ...), the reference similarity loss of the batch is determined from its average similarity loss and the reference similarity loss of the previous training batch, and the reference contrast loss of the batch is determined from its average contrast loss and the reference contrast loss of the previous training batch.
Referring to fig. 4, memory unit 1 in fig. 4 records the reference contrast loss of the previous training batch and memory unit 2 records its reference similarity loss; the computer device determines the reference similarity loss and reference contrast loss of the previous training batch by reading them from the memory units (memory unit 1 and memory unit 2). In addition, as described above, each training generation other than the first re-records the reference similarity loss and reference contrast loss of the current training batch in the memory units starting from its first training batch, so that the training effect of each newly learned training generation is always better than that of the previous one.
In an embodiment, if the training batch in which the training samples are located is the target training batch, the contrast losses of all training samples in the batch are determined according to formula (1), their mean (i.e., the average contrast loss) is computed, and the result is taken as the reference contrast loss, represented by the following formula (4):

ML_t = avg(L_t)    (4)

where ML_t denotes the reference contrast loss and avg(L_t) denotes the mean of the contrast losses of all training samples.
In an embodiment, if the training batch in which the training samples are located is the target training batch, the similarity losses of all training samples in the batch are determined according to formula (2), their mean (i.e., the average similarity loss) is computed, and the result is taken as the reference similarity loss, represented by the following formula (5):

ML_s = avg(L_s)    (5)

where ML_s denotes the reference similarity loss and avg(L_s) denotes the mean of the similarity losses of all training samples.
In one embodiment, if the training batch in which the training sample is located is not the target training batch, the reference contrast loss and the reference similarity loss can be determined by the following formulas (6) and (7), respectively:

ML_t = 0.95 · ML_{t-1} + 0.05 · avg(L_t)    (6)
ML_s = 0.95 · ML_{s-1} + 0.05 · avg(L_s)    (7)

where ML_{t-1} is the reference contrast loss of the training batch adjacent to (i.e., immediately preceding) the training batch in which the training sample is located, and ML_{s-1} is the reference similarity loss of that adjacent training batch.
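Formulas (4) to (7) amount to an exponential moving average over per-batch mean losses that is re-initialized at each target training batch; the sketch below models the two memory units of fig. 4 as plain attributes, with all names assumed.

```python
class ReferenceLossTracker:
    """Memory units 1 and 2: running reference contrast/similarity losses."""

    def __init__(self):
        self.ml_t = None  # memory unit 1: reference contrast loss
        self.ml_s = None  # memory unit 2: reference similarity loss

    def update(self, avg_lt, avg_ls, is_target_batch):
        if is_target_batch or self.ml_t is None:
            # Formulas (4)/(5): re-initialize from this batch's mean losses
            self.ml_t, self.ml_s = avg_lt, avg_ls
        else:
            # Formulas (6)/(7): exponential moving average
            self.ml_t = 0.95 * self.ml_t + 0.05 * avg_lt
            self.ml_s = 0.95 * self.ml_s + 0.05 * avg_ls
        return self.ml_t, self.ml_s
```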
In a possible embodiment, avg(L_t) in formula (6) and avg(L_s) in formula (7) need not use the losses of all training samples, specifically: if the data type of the positive sample included in a training sample is the noise type, the similarity loss of that training sample is excluded from the average, and if the data type of the negative sample included in a training sample is the noise type, the contrast loss of that training sample is excluded from the average.
In one embodiment, if the contrast loss of the training sample is greater than or equal to the reference contrast loss, the similarity loss of the training sample is greater than the reference similarity loss, and L_t > L_s + 1.5a (where a may be set manually), the data types of the positive sample and the negative sample included in the training sample are both noise types. In this case, the noise-type training sample has a negative effect on model learning and needs to be discarded: both the weight of the similarity loss and the weight of the contrast loss may be set to null, and the target loss obtained by superimposing a similarity loss whose weight is null and a contrast loss whose weight is null is itself null.
In an embodiment, if the similarity loss of the training sample is greater than or equal to the reference similarity loss and the contrast loss of the training sample is less than the reference contrast loss, the data type of the positive sample included in the training sample is the noise type and the data type of the negative sample is the non-noise type. In this case, the weight of the similarity loss is set to the second parameter and the weight of the contrast loss to the first parameter.
In another embodiment, if the similarity loss of the training sample is not less than the reference similarity loss, the data type of the positive sample included in the training sample is the noise type and the data type of the negative sample is the non-noise type. In this case, the weight of the similarity loss is likewise set to the second parameter and the weight of the contrast loss to the first parameter.
It should be noted that the first parameter is greater than the second parameter, for example a similarity-loss weight of 0.2 and a contrast-loss weight of 1. By making the weight of the similarity loss smaller than that of the contrast loss, the influence of noise on the similarity loss can be mitigated and the metric-generalization ability of the feature extraction model improved.
In an embodiment, if the similarity loss of the training sample is less than or equal to the reference similarity loss and the contrast loss of the training sample is greater than or equal to the reference contrast loss, the data type of the positive sample included in the training sample is the non-noise type and the data type of the negative sample is the noise type. In this case, the weight of the similarity loss is set to the first parameter and the weight of the contrast loss to the second parameter.
It should be noted that the first parameter is greater than the second parameter, for example a contrast-loss weight of 0.2 and a similarity-loss weight of 1. By making the weight of the contrast loss smaller than that of the similarity loss, the influence of noise on the contrast loss can be mitigated and the metric-generalization ability of the feature extraction model improved.
In an embodiment, if the similarity loss of the training sample is less than the reference similarity loss and the contrast loss of the training sample is less than the reference contrast loss, the data types of both the positive sample and the negative sample included in the training sample are the non-noise type; the weight of the similarity loss and the weight of the contrast loss are both set to the first parameter, and the relaxation learning strategy of down-weighting is not executed.
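The four cases above reduce to a small decision table. The sketch below assumes the first parameter is 1, the second parameter is 0.2, the discard case returns None weights, and the threshold condition is read as L_t > L_s + 1.5a with a user-set a; these are readings of the text, not definitive values.

```python
def loss_weights(l_t, l_s, ml_t, ml_s, a=1.0, first=1.0, second=0.2):
    """Relaxation strategy: pick (w1 for L_t, w2 for L_s) per training sample."""
    if l_t >= ml_t and l_s > ml_s and l_t > l_s + 1.5 * a:
        return None, None     # both parts noisy: discard the training sample
    if l_s >= ml_s and l_t < ml_t:
        return first, second  # noisy positive: down-weight the similarity loss
    if l_s <= ml_s and l_t >= ml_t:
        return second, first  # noisy negative: down-weight the contrast loss
    return first, first       # clean sample: full weights
```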
Referring to fig. 4, fig. 4 is a schematic flowchart of the data processing method provided in an embodiment of the present application. In each round of training on a training batch, the feature extraction model shown in fig. 4 extracts the reference feature of the reference sample, the positive feature of the positive sample and the negative feature of the negative sample for each triplet sample (corresponding to a training sample) in the current training batch. The contrast loss and the similarity loss are determined according to formulas (1) and (2); the weight of the contrast loss and the weight of the similarity loss are determined according to the reference contrast loss and the reference similarity loss; and the target loss is determined from those weights and the target loss function shown in formula (3). A total target loss is then determined from the target losses of all triplet samples in the current training batch, and the parameters of the feature extraction model are adjusted once according to the total target loss by stochastic gradient descent. After training on one training batch is completed, the next training batch is taken and the parameters of the already-adjusted feature extraction model continue to be adjusted until a training stop condition is met: the stop condition is met when a specified number of iterations is reached, or when the target loss function converges. A sketch of one such per-batch iteration follows.
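Putting the pieces together, one per-batch iteration in the style of fig. 4 might look like the following, reusing the sketches above (model, contrast_loss, similarity_loss, ReferenceLossTracker, loss_weights); this is a schematic reading of the procedure, not the patent's own code.

```python
def train_one_batch(model, optimizer, tracker, batch, is_target_batch):
    ref, pos, neg = batch                # triplet samples of the current batch
    f_r, f_p, f_n = model(ref), model(pos), model(neg)
    l_t = contrast_loss(f_r, f_p, f_n)   # formula (1), shape (B,)
    l_s = similarity_loss(f_r, f_p)      # formula (2), shape (B,)
    ml_t, ml_s = tracker.update(l_t.mean().item(), l_s.mean().item(),
                                is_target_batch)
    total, kept = 0.0, 0
    for lt_i, ls_i in zip(l_t, l_s):
        w1, w2 = loss_weights(lt_i.item(), ls_i.item(), ml_t, ml_s)
        if w1 is None:                   # both parts noisy: drop the sample
            continue
        total = total + w1 * lt_i + w2 * ls_i  # formula (3)
        kept += 1
    if kept:                             # one SGD step per training batch
        optimizer.zero_grad()
        (total / kept).backward()
        optimizer.step()
```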
A target loss combining the contrast loss and the similarity loss ensures that the difference between dissimilar samples is much greater than the difference between similar samples and ensures the similarity between similar samples, but it does not ensure absolute dissimilarity between dissimilar samples. For example, with a first specified boundary of 10 (i.e., D_an - D_ap ≥ 10) and a second specified boundary of 5 (i.e., D_ap < 5), if the distance D_ap between similar samples is 4, then a distance D_an between dissimilar samples of 15, 20 or 30 all satisfy the similarity loss and the contrast loss.
In a feasible embodiment, a dissimilar loss is determined according to the reference feature and the negative feature; the dissimilar loss, the similarity loss, and the contrast loss are superposed into a target loss; and the feature extraction model is trained according to the target loss to obtain the target feature extraction model.
Specifically, the weight of the dissimilar loss, the weight of the similarity loss, and the weight of the contrast loss are determined, and these weights are used to superpose the dissimilar loss, the similarity loss, and the contrast loss into a target loss.
The dissimilar loss is used to make the distance between the reference feature and the negative feature larger than a specified boundary, so as to guarantee absolute dissimilarity between dissimilar samples (i.e., the purpose of dissimilar learning). In one embodiment, the reference feature and the negative feature are input into a dissimilar loss function to determine the dissimilar loss. The dissimilar loss function is represented by the following equation (8):

L_u = max(0, β − D_an) (8)

where L_u represents the dissimilar loss, and β represents a third specified boundary for constraining the distance D_an between the reference feature and the negative feature to be greater than β (the purpose of dissimilar learning); for example, if D_an is defined to be greater than 30, β is 30. When the third specified boundary is greater than or equal to the sum of the first specified boundary and the second specified boundary, it can be ensured that the difference between dissimilar samples is much greater than the difference between similar samples, that similar samples remain similar, and that dissimilar samples are absolutely dissimilar. The dissimilar loss determined by the dissimilar loss function thus enables the feature extraction model to realize dissimilar learning.
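Under the reconstruction of equation (8) above, a minimal PyTorch sketch of the dissimilar loss, assuming a Euclidean distance and the example boundary value of 30, is:

```python
import torch
import torch.nn.functional as F

def dissimilar_loss(reference_feat, negative_feat, third_boundary=30.0):
    """Dissimilar loss of equation (8): penalize the distance D_an between
    the reference feature and the negative feature whenever it falls below
    the third specified boundary. The hinge form and the Euclidean metric
    are an assumed concrete reading of the reconstructed equation."""
    d_an = F.pairwise_distance(reference_feat, negative_feat)  # per-sample D_an
    return torch.clamp(third_boundary - d_an, min=0.0).mean()
```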
In one embodiment, a computer device obtains a query request, invokes the target feature extraction model to perform feature extraction processing on target multimedia data to obtain the data features of the target multimedia data, and invokes the target feature extraction model to perform feature extraction processing on a plurality of reference multimedia data to obtain their data features. The device then determines the similarity between the data features of the target multimedia data and the data features of the plurality of reference multimedia data, selects dissimilar multimedia data from the plurality of reference multimedia data according to the similarity, and deletes the dissimilar multimedia data. When the dissimilar multimedia data is selected, the reference multimedia data having the smallest similarity with the target multimedia data may be selected, or the reference multimedia data whose similarity with the target multimedia data is smaller than a similarity threshold (which may be set manually) may be selected; this improves the accuracy of the retrieval task (here, retrieving dissimilar multimedia data). The data attribute type of the target multimedia data or of the reference multimedia data may be any one of image, text, audio, and video, and is the same as the data attribute type of the training samples. When the data attribute type of the target multimedia data is an image, the data feature of the target multimedia data is an image feature; when the data attribute type of the reference multimedia data is an image, the data feature of the reference multimedia data is an image feature; and so on.
In the embodiment of the application, the feature extraction model can be trained by combining the contrast loss and the similarity loss, so that the feature extraction model ensures that the difference between dissimilar samples is far larger than the difference between similar samples while also ensuring the similarity between similar samples, which effectively improves the accuracy of the feature extraction model. Meanwhile, by adjusting the weight of the contrast loss and the weight of the similarity loss according to the data type, a relaxation learning strategy can be applied to noise, preventing the noise from influencing the learning of the feature extraction model and improving the metric generalization of the feature extraction model.
Referring to fig. 5, fig. 5 is a third schematic flow chart of a data processing method according to an embodiment of the present application. The data processing method includes but is not limited to the following steps:
S501, training samples are obtained, wherein the training samples comprise reference samples, positive samples and negative samples, the reference samples and the positive samples meet a similar relation, and the reference samples and the negative samples meet a dissimilar relation.
S502, calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain the reference feature of the reference sample, the positive feature of the positive sample and the negative feature of the negative sample.
For details of the implementation processes of steps S501 and S502, reference may be made to steps S201 and S202, which are not described again in this embodiment.
S503, determining a quantization loss, determining a similarity loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature.
In the embodiment of the present application, determining the similarity loss according to the reference feature and the positive feature, and determining the contrast loss according to the reference feature, the positive feature and the negative feature, may refer to step S203 described above and are not repeated in this embodiment.
The feature extraction model usually outputs floating-point features. The quantization loss makes the output features closer to binary features that suit the application (i.e., the purpose of quantization learning is to obtain this special data feature). A binary feature has only two element values, for example, -1 and 1, or 1 and 0.
In an embodiment, the computer device determines a feature to be processed from the reference feature, the positive feature and the negative feature, performs binarization processing on elements in the feature to be processed to obtain a binary feature, determines quantization loss according to the feature to be processed and the binary feature, and superimposes the quantization loss, the similarity loss and the contrast loss into a target loss.
In the embodiments of the present application, the feature to be processed may be one or more of the reference feature, the positive feature, and the negative feature. The feature to be processed output by the feature extraction model is converted into a binary feature by binarization with the sgn function, which is shown in the following equation (9):

b_i = sgn(u_i) = 1 if u_i ≥ 0, and b_i = −1 if u_i < 0 (9)

where u_i represents an element of the feature to be processed and b_i represents the element after binarization. (When the binary elements are 1 and 0, the negative branch maps to 0 instead.)
In one embodiment, if the feature to be processed is one of the reference feature, the positive feature and the negative feature, the quantization loss L_c is determined as shown in the following equation (10):

L_c = Σ_i (u_i − b_i)^2 (10)

where the sum runs over all elements of the feature to be processed.
in an embodiment, if the feature to be processed is a plurality of reference features, positive features and negative features, the quantization losses of the plurality of features to be processed may be obtained by equation (10), and the final quantization loss may be obtained according to the sum of the quantization losses of the plurality of features to be processed, or the final quantization loss may be obtained according to the average of the quantization losses of the plurality of features to be processed.
S504, superposing the quantization loss, the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, wherein the target feature extraction model is used for extracting binary features of multimedia data.
In one embodiment, a computer device determines the weight w_3 of the quantization loss, the weight w_2 of the similarity loss, and the weight w_1 of the contrast loss, and inputs the weights w_1, w_2 and w_3 together with the contrast loss L_t, the similarity loss L_s and the quantization loss L_c into the target loss function L_total to determine the target loss. The target loss function L_total is shown in the following equation (11):

L_total = w_1 · L_t + w_2 · L_s + w_3 · L_c (11)
The weight w_2 of the similarity loss and the weight w_1 of the contrast loss are determined as described in step S204, which is not repeated in this embodiment. Because quantization learning does not occupy the dominant learning position, the weight of the quantization loss may be set to a third parameter that is smaller than both the first parameter and the second parameter, for example, 0.1. The target loss determined by this target loss function enables the feature extraction model to jointly perform quantization learning, similarity learning and contrast learning.
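As a short illustration of equation (11) (the default weights mirror the examples in the text and are not prescribed by it):

```python
def target_loss(contrast, similarity, quantization,
                w1=1.0, w2=1.0, w3=0.1):
    """Equation (11): L_total = w1*L_t + w2*L_s + w3*L_c. The default
    w3 = 0.1 follows the example third parameter, since quantization
    learning is not the dominant objective."""
    return w1 * contrast + w2 * similarity + w3 * quantization
```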
Referring to fig. 6, fig. 6 is a flowchart illustrating a fourth data processing method according to an embodiment of the present application. In each round of training on a training batch, the feature extraction model shown in fig. 6 extracts the reference feature of the reference sample, the positive feature of the positive sample, and the negative feature of the negative sample included in each triplet sample (corresponding to the training sample) of the current training batch. The contrast loss, the similarity loss and the quantization loss are determined according to equations (1), (2) and (10); the weight of the contrast loss and the weight of the similarity loss are determined according to the reference contrast loss and the reference similarity loss; and the target loss is determined according to the weight of the contrast loss, the weight of the similarity loss, the weight of the quantization loss, and the target loss function shown in equation (11). A total target loss is then determined from the target losses of all triplet samples included in the current training batch, and the parameters of the feature extraction model are adjusted once according to the total target loss by stochastic gradient descent. After training on one training batch is completed, the next training batch is taken to continue adjusting the feature extraction model on the basis of the adjusted model until a training stop condition is met; the training stop condition is met when a specified number of iterations is reached or when the target loss function converges.
In an embodiment, a computer device obtains a query request, invokes the target feature extraction model to perform feature extraction processing on the target multimedia data carried by the query request to obtain the binary features of the target multimedia data, and invokes the target feature extraction model to perform feature extraction processing on a plurality of multimedia data to be recalled to obtain their binary features. The device then determines the similarity between the binary features of the target multimedia data and the binary features of the plurality of multimedia data to be recalled, selects recalled multimedia data from the plurality of multimedia data to be recalled according to the similarity, and outputs the recalled multimedia data. The data attribute type of the target multimedia data or of the multimedia data to be recalled may be any one of image, text, audio, and video, and is the same as the data attribute type of the training samples. When the data attribute type of the target multimedia data is an image, the data feature of the target multimedia data is an image feature; when the data attribute type of the multimedia data to be recalled is an image, the data feature of the multimedia data to be recalled is an image feature; and so on.
Specifically, the similarity between binary features can be determined by the Hamming distance. For example, if binary feature 1 is (0, 0, 0, 1) and binary feature 2 is (1, 1, 0, 1), the Hamming distance, i.e., the number of positions at which the two codes differ, is 2; a smaller Hamming distance indicates a higher similarity. When the recalled multimedia data is selected, the multimedia data to be recalled having the greatest similarity with the target multimedia data may be selected, or the multimedia data to be recalled whose similarity with the target multimedia data is greater than a similarity threshold (which may be set manually) may be selected, which improves the accuracy of the retrieval task.
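For illustration, the Hamming-distance computation in this worked example can be checked with a short, self-contained Python helper (hypothetical, not part of the claimed method):

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

# Worked example from the text: (0, 0, 0, 1) vs (1, 1, 0, 1) differ in
# the first two positions, so the Hamming distance is 2.
assert hamming_distance((0, 0, 0, 1), (1, 1, 0, 1)) == 2
```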
In a feasible embodiment, a dissimilar loss is determined according to the reference feature and the negative feature; the dissimilar loss, the similarity loss, the quantization loss and the contrast loss are superposed into a target loss; and the feature extraction model is trained according to the target loss to obtain the target feature extraction model.
Specifically, the weight of the dissimilar loss, the weight of the similarity loss, the weight of the quantization loss and the weight of the contrast loss are determined, and these weights are used to superpose the dissimilar loss, the similarity loss, the quantization loss and the contrast loss into a target loss. In this way, the feature extraction model can ensure that the difference between dissimilar samples is far larger than the difference between similar samples, ensure the similarity between similar samples, and ensure the absolute dissimilarity between dissimilar samples; the accuracy of the feature extraction model can thus be effectively improved, binary features close to the application can be obtained, and the application capability of the features can be improved.
In the embodiment of the application, the feature extraction model is trained by combining the contrast loss and the similarity loss, so that the feature extraction model ensures that the difference between dissimilar samples is far larger than the difference between similar samples while also ensuring the similarity between similar samples, which effectively improves the accuracy of the feature extraction model. In addition, by further combining the quantization loss, binary features close to the application can be obtained, improving the application capability of the features.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly. Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application, where the data processing apparatus 70 may include:
an obtaining unit 701, configured to obtain a training sample, where the training sample includes a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relationship, and the reference sample and the negative sample satisfy a dissimilar relationship;
the processing unit 702 is configured to invoke a feature extraction model to perform feature extraction processing on the reference sample, the positive sample, and the negative sample, so as to obtain a reference feature of the reference sample, a positive feature of the positive sample, and a negative feature of the negative sample;
the processing unit 702 is further configured to determine a similarity loss according to the reference feature and the positive feature, and determine a contrast loss according to the reference feature, the positive feature and the negative feature;
the processing unit 702 is further configured to superimpose the similarity loss and the contrast loss into a target loss, train the feature extraction model according to the target loss, and obtain a target feature extraction model, where the target feature extraction model is used to extract data features of multimedia data.
In an embodiment, the obtaining unit 701 is specifically configured to:
acquiring the trained batch quantity of the feature extraction model;
the processing unit 702 is specifically configured to:
if the trained batch quantity is smaller than a preset value, setting the weight of the similar loss and the weight of the contrast loss as first parameters;
if the trained batch quantity is not smaller than the preset value, determining the weight of the similar loss and the weight of the contrast loss according to the data type of the positive sample and the data type of the negative sample;
and according to the weight of the similar loss and the weight of the contrast loss, superposing the similar loss and the contrast loss into a target loss.
In an embodiment, the obtaining unit 701 is specifically configured to:
obtaining reference similarity loss and reference contrast loss of a training batch in which the training sample is positioned;
the processing unit 702 is specifically configured to:
determining the data type of the positive sample and the data type of the negative sample according to the reference similarity loss and the reference contrast loss, wherein the data types comprise a noise type and a non-noise type;
if the data type of the positive sample is the noise type and the data type of the negative sample is the non-noise type, setting the weight of the similar loss as a second parameter and setting the weight of the contrast loss as the first parameter, wherein the first parameter is larger than the second parameter;
if the data type of the positive sample is the non-noise type and the data type of the negative sample is the noise type, setting the weight of the similar loss as the first parameter and setting the weight of the contrast loss as the second parameter.
In an embodiment, the processing unit 702 is specifically configured to:
if the data types of the positive sample and the negative sample are the non-noise types, setting the weight of the similar loss and the weight of the contrast loss as the first parameter;
and if the data types of the positive sample and the negative sample are the noise types, setting the weight of the similar loss and the weight of the contrast loss to be null, and superposing the null similar loss and the null contrast loss to obtain a null target loss.
In an embodiment, the processing unit 702 is specifically configured to:
if the training batch in which the training sample is located is the target training batch, taking the average similarity loss of the training batch in which the training sample is located as the reference similarity loss, and taking the average contrast loss of the training batch in which the training sample is located as the reference contrast loss;
if the training batch in which the training sample is located is not the target training batch, determining the reference similarity loss according to the average similarity loss of the training batch in which the training sample is located and a training batch adjacent to the training batch in which the training sample is located, and determining the reference contrast loss according to the average contrast loss of the training batch in which the training sample is located and the training batch adjacent to the training batch in which the training sample is located.
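For illustration only, one plausible reading of this reference-loss computation can be sketched in Python; the exact rule for combining the current batch with the adjacent batch is an assumption:

```python
def reference_losses(batch_sim_losses, batch_con_losses,
                     prev_ref_sim=None, prev_ref_con=None):
    """One plausible reading: for the target (first) training batch, use
    the batch's average losses directly as the reference losses; for later
    batches, average the current batch's mean losses with the reference
    losses carried over from the adjacent (previous) training batch."""
    mean_sim = sum(batch_sim_losses) / len(batch_sim_losses)
    mean_con = sum(batch_con_losses) / len(batch_con_losses)
    if prev_ref_sim is None:                  # target training batch
        return mean_sim, mean_con
    return (mean_sim + prev_ref_sim) / 2, (mean_con + prev_ref_con) / 2
```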
In an embodiment, the processing unit 702 is specifically configured to:
selecting a feature to be processed from the reference feature, the positive feature and the negative feature;
carrying out binarization processing on elements in the feature to be processed to obtain a binary feature;
determining quantization loss according to the feature to be processed and the binary feature;
and superposing the quantization loss, the similarity loss and the contrast loss into a target loss.
In an embodiment, the processing unit 702 is specifically configured to:
acquiring K groups of sample pairs, wherein each group of sample pairs comprises a reference sample and a positive sample;
determining distances between a reference sample included in a target sample pair and K-1 candidate samples, wherein the target sample pair is any sample pair in the K groups of sample pairs, the candidate samples are any one of K-1 reference samples and K-1 positive samples included in the K-1 reference sample pairs, and the reference sample pair is a sample pair in the K groups of sample pairs except the target sample pair;
and determining M negative samples from the K-1 candidate samples according to the distance, and respectively combining the target sample pair and the M negative samples into training samples, wherein M is a positive integer and is not more than K-1.
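For illustration, this triplet-assembly step can be sketched in Python with PyTorch; the Euclidean metric, the use of the other pairs' positive samples as the candidate pool, and the closest-first (hard-negative) selection are assumptions beyond what the text specifies:

```python
import torch

def mine_training_triplets(anchors, positives, M):
    """Assemble training triplets from K (reference, positive) sample
    pairs: for each target pair, rank the other K-1 pairs' positive
    samples by distance to the target reference and keep the M closest
    as negatives (M <= K-1). anchors, positives: (K, dim) tensors."""
    K = anchors.shape[0]
    triplets = []
    for i in range(K):
        mask = torch.arange(K) != i                       # the other K-1 pairs
        candidates = positives[mask]                      # (K-1, dim) candidate samples
        dists = torch.cdist(anchors[i:i + 1], candidates).squeeze(0)
        for j in torch.argsort(dists)[:M]:                # M nearest candidates
            triplets.append((anchors[i], positives[i], candidates[j]))
    return triplets
```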
In an embodiment, the obtaining unit 701 is specifically configured to:
acquiring a query request, wherein the query request carries target multimedia data;
the processing unit 702 is specifically configured to:
calling the target feature extraction model to extract the data features of the target multimedia data;
determining similarity between the data characteristics of the target multimedia data and the data characteristics of a plurality of multimedia data to be recalled, wherein the data characteristics of each multimedia data to be recalled are extracted by calling the target characteristic extraction model;
and selecting recall multimedia data from the plurality of multimedia data to be recalled according to the similarity, and outputting the recall multimedia data.
It can be understood that the functions of the functional units of the data processing apparatus described in the embodiments of the present application can be specifically implemented according to the method in the foregoing method embodiments, and the specific implementation process of the method can refer to the description related to the foregoing method embodiments, which is not described herein again.
In the embodiment of the application, a feature extraction model is called to perform feature extraction processing on a reference sample, a positive sample and a negative sample included in a training sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample, wherein the reference sample and the positive sample meet a similar relation, the reference sample and the negative sample meet a dissimilar relation, a similarity loss is determined according to the reference feature and the positive feature, a contrast loss is determined according to the reference feature, the positive feature and the negative feature, the similarity loss and the contrast loss are superposed to form a target loss, the feature extraction model is trained according to the target loss to obtain the target feature extraction model, and the accuracy of the feature extraction model can be effectively improved.
As shown in fig. 8, fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application, and an internal structure of the computer device 80 is shown in fig. 8, and includes: one or more processors 801, memory 802, and a communication interface 803. The processor 801, the memory 802 and the communication interface 803 may be connected by a bus 804 or in other ways, and the embodiment of the present application is exemplified by the bus 804.
The processor 801 (or CPU) is a computing core and a control core of the computer device 80, and can analyze various instructions in the computer device 80 and process various data of the computer device 80. For example, the CPU may be configured to analyze a power on/off instruction sent by the user to the computer device 80 and control the computer device 80 to perform a power on/off operation; for another example, the CPU may transfer various types of interactive data between the internal structures of the computer device 80, and so on. The communication interface 803 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, mobile communication interface, etc.), under the control of the processor 801, for transceiving data. The memory 802 (Memory) is a memory device in the computer device 80 for storing the first computer program and data. It is understood that the memory 802 may comprise a built-in memory of the computer device 80, and may also comprise an expansion memory supported by the computer device 80. The memory 802 provides storage space that stores an operating system for the computer device 80, which may include, but is not limited to: a Windows system, a Linux system, etc., which are not limited in this application. Specifically, the processor 801 executes the following operations by executing the first computer program stored in the memory 802:
acquiring a training sample, wherein the training sample comprises a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relation, and the reference sample and the negative sample satisfy a dissimilar relation;
calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample;
determining a similar loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature;
and superposing the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, wherein the target feature extraction model is used for extracting the data features of the multimedia data.
In an embodiment, the processor 801 is specifically configured to:
acquiring the trained batch quantity of the feature extraction model;
if the trained batch quantity is smaller than a preset value, setting the weight of the similar loss and the weight of the contrast loss as first parameters;
if the trained batch quantity is not smaller than the preset value, determining the weight of the similar loss and the weight of the contrast loss according to the data type of the positive sample and the data type of the negative sample;
and according to the weight of the similar loss and the weight of the contrast loss, superposing the similar loss and the contrast loss into a target loss.
In an embodiment, the processor 801 is specifically configured to:
obtaining reference similarity loss and reference contrast loss of a training batch in which the training sample is positioned;
determining the data type of the positive sample and the data type of the negative sample according to the reference similarity loss and the reference contrast loss, wherein the data types comprise a noise type and a non-noise type;
if the data type of the positive sample is the noise type and the data type of the negative sample is the non-noise type, setting the weight of the similar loss as a second parameter and setting the weight of the contrast loss as the first parameter, wherein the first parameter is larger than the second parameter;
if the data type of the positive sample is the non-noise type and the data type of the negative sample is the noise type, setting the weight of the similar loss as the first parameter and setting the weight of the contrast loss as the second parameter.
In an embodiment, the processor 801 is specifically configured to:
if the data types of the positive sample and the negative sample are the non-noise types, setting the weight of the similar loss and the weight of the contrast loss as the first parameter;
and if the data types of the positive sample and the negative sample are the noise types, setting the weight of the similar loss and the weight of the contrast loss to be null, and superposing the null similar loss and the null contrast loss to obtain a null target loss.
In an embodiment, the processor 801 is specifically configured to:
if the training batch in which the training sample is located is the target training batch, taking the average similarity loss of the training batch in which the training sample is located as the reference similarity loss, and taking the average contrast loss of the training batch in which the training sample is located as the reference contrast loss;
if the training batch in which the training sample is located is not the target training batch, determining the reference similarity loss according to the average similarity loss of the training batch in which the training sample is located and a training batch adjacent to the training batch in which the training sample is located, and determining the reference contrast loss according to the average contrast loss of the training batch in which the training sample is located and the training batch adjacent to the training batch in which the training sample is located.
In an embodiment, the processor 801 is specifically configured to:
selecting a feature to be processed from the reference feature, the positive feature and the negative feature;
carrying out binarization processing on elements in the feature to be processed to obtain a binary feature;
determining quantization loss according to the feature to be processed and the binary feature;
and superposing the quantization loss, the similarity loss and the contrast loss into a target loss.
In an embodiment, the processor 801 is specifically configured to:
acquiring K groups of sample pairs, wherein each group of sample pairs comprises a reference sample and a positive sample;
determining distances between a reference sample included in a target sample pair and K-1 candidate samples, wherein the target sample pair is any sample pair in the K groups of sample pairs, the candidate samples are any one of K-1 reference samples and K-1 positive samples included in the K-1 reference sample pairs, and the reference sample pair is a sample pair in the K groups of sample pairs except the target sample pair;
and determining M negative samples from the K-1 candidate samples according to the distance, and respectively combining the target sample pair and the M negative samples into training samples, wherein M is a positive integer and is not more than K-1.
In an embodiment, the processor 801 is specifically configured to:
acquiring a query request, wherein the query request carries target multimedia data;
calling the target feature extraction model to extract the data features of the target multimedia data;
determining similarity between the data characteristics of the target multimedia data and the data characteristics of a plurality of multimedia data to be recalled, wherein the data characteristics of each multimedia data to be recalled are extracted by calling the target characteristic extraction model;
and selecting recall multimedia data from the plurality of multimedia data to be recalled according to the similarity, and outputting the recall multimedia data.
In a specific implementation, the processor 801, the memory 802, and the communication interface 803 described in this embodiment may execute an implementation manner of a computer device described in a data processing method provided in this embodiment, and may also execute an implementation manner described in a data processing apparatus provided in this embodiment, which is not described herein again.
In the embodiment of the application, a feature extraction model is called to perform feature extraction processing on a reference sample, a positive sample and a negative sample included in a training sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample, wherein the reference sample and the positive sample meet a similar relation, the reference sample and the negative sample meet a dissimilar relation, a similarity loss is determined according to the reference feature and the positive feature, a contrast loss is determined according to the reference feature, the positive feature and the negative feature, the similarity loss and the contrast loss are superposed to form a target loss, the feature extraction model is trained according to the target loss to obtain the target feature extraction model, and the accuracy of the feature extraction model can be effectively improved.
Embodiments of the present application further provide a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the data processing method according to the embodiments of the present application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
Embodiments of the present application further provide a computer program product, where the computer program product includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the steps of the data processing method provided in the embodiments of the present application are implemented. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided in the embodiment of the present application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above disclosure is only a few examples of the present application, and certainly should not be taken as limiting the scope of the present application, which is therefore intended to cover all modifications that are within the scope of the present application and which are equivalent to the claims.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a training sample, wherein the training sample comprises a reference sample, a positive sample and a negative sample, the reference sample and the positive sample satisfy a similar relation, and the reference sample and the negative sample satisfy a dissimilar relation;
calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample;
determining a similar loss according to the reference feature and the positive feature, and determining a contrast loss according to the reference feature, the positive feature and the negative feature;
and superposing the similarity loss and the contrast loss into a target loss, and training the feature extraction model according to the target loss to obtain a target feature extraction model, wherein the target feature extraction model is used for extracting the data features of the multimedia data.
2. The method of claim 1, wherein said superimposing said similarity loss and said contrast loss as a target loss comprises:
acquiring the trained batch quantity of the feature extraction model;
if the trained batch quantity is smaller than a preset value, setting the weight of the similar loss and the weight of the contrast loss as first parameters;
if the trained batch quantity is not smaller than the preset value, determining the weight of the similar loss and the weight of the contrast loss according to the data type of the positive sample and the data type of the negative sample;
and according to the weight of the similar loss and the weight of the contrast loss, superposing the similar loss and the contrast loss into a target loss.
3. The method of claim 2, wherein determining the weight of the similar penalty and the weight of the contrast penalty based on the data type of the positive sample and the data type of the negative sample comprises:
obtaining reference similarity loss and reference contrast loss of a training batch in which the training sample is positioned;
determining the data type of the positive sample and the data type of the negative sample according to the reference similarity loss and the reference contrast loss, wherein the data types comprise a noise type and a non-noise type;
if the data type of the positive sample is the noise type and the data type of the negative sample is the non-noise type, setting the weight of the similar loss as a second parameter and setting the weight of the contrast loss as the first parameter, wherein the first parameter is larger than the second parameter;
if the data type of the positive sample is the non-noise type and the data type of the negative sample is the noise type, setting the weight of the similar loss as the first parameter and setting the weight of the contrast loss as the second parameter.
4. The method of claim 3, further comprising:
if the data types of the positive sample and the negative sample are the non-noise types, setting the weight of the similar loss and the weight of the contrast loss as the first parameter;
and if the data types of the positive sample and the negative sample are the noise types, setting the weight of the similar loss and the weight of the contrast loss to be null, and superposing the null similar loss and the null contrast loss to obtain a null target loss.
5. The method of claim 3, wherein obtaining the reference similarity loss and the reference contrast loss for the training batch in which the training samples are located comprises:
if the training batch in which the training sample is located is the target training batch, taking the average similarity loss of the training batch in which the training sample is located as the reference similarity loss, and taking the average contrast loss of the training batch in which the training sample is located as the reference contrast loss;
if the training batch in which the training sample is located is not the target training batch, determining the reference similarity loss according to the average similarity loss of the training batch in which the training sample is located and a training batch adjacent to the training batch in which the training sample is located, and determining the reference contrast loss according to the average contrast loss of the training batch in which the training sample is located and the training batch adjacent to the training batch in which the training sample is located.
6. The method of claim 1, wherein said superimposing said similarity loss and said contrast loss as a target loss comprises:
selecting a feature to be processed from the reference feature, the positive feature and the negative feature;
carrying out binarization processing on elements in the feature to be processed to obtain a binary feature;
determining quantization loss according to the feature to be processed and the binary feature;
and superposing the quantization loss, the similarity loss and the contrast loss into a target loss.
7. The method of claim 1, wherein the obtaining training samples comprises:
acquiring K groups of sample pairs, wherein each group of sample pairs comprises a reference sample and a positive sample;
determining distances between a reference sample included in a target sample pair and K-1 candidate samples, wherein the target sample pair is any sample pair in the K groups of sample pairs, the K-1 candidate samples are K-1 reference samples or K-1 positive samples included in the K-1 reference sample pairs, and the reference sample pair is a sample pair in the K groups of sample pairs except the target sample pair;
and determining M negative samples from the K-1 candidate samples according to the distance, and respectively combining the target sample pair and the M negative samples into training samples, wherein M is a positive integer and is not more than K-1.
8. The method of claim 1, further comprising:
acquiring a query request, wherein the query request carries target multimedia data;
calling the target feature extraction model to extract the data features of the target multimedia data;
determining similarity between the data characteristics of the target multimedia data and the data characteristics of a plurality of multimedia data to be recalled, wherein the data characteristics of each multimedia data to be recalled are extracted by calling the target characteristic extraction model;
and selecting recall multimedia data from the plurality of multimedia data to be recalled according to the similarity, and outputting the recall multimedia data.
9. A data processing apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring training samples, the training samples comprise reference samples, positive samples and negative samples, the reference samples and the positive samples meet a similar relation, and the reference samples and the negative samples meet a dissimilar relation;
the processing unit is used for calling a feature extraction model to perform feature extraction processing on the reference sample, the positive sample and the negative sample to obtain a reference feature of the reference sample, a positive feature of the positive sample and a negative feature of the negative sample;
the processing unit is further used for determining similar loss according to the reference characteristic and the positive characteristic, and determining contrast loss according to the reference characteristic, the positive characteristic and the negative characteristic;
the processing unit is further configured to superimpose the similarity loss and the contrast loss into a target loss, train the feature extraction model according to the target loss, and obtain a target feature extraction model, where the target feature extraction model is used to extract data features of multimedia data.
10. A computer device comprising a memory, a communication interface, and a processor, wherein the memory, the communication interface, and the processor are interconnected; the memory stores a computer program that the processor calls upon for executing the data processing method of any of claims 1-8.
CN202111279272.6A 2021-10-29 2021-10-29 Data processing method, device and equipment Pending CN113705589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111279272.6A CN113705589A (en) 2021-10-29 2021-10-29 Data processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111279272.6A CN113705589A (en) 2021-10-29 2021-10-29 Data processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN113705589A true CN113705589A (en) 2021-11-26

Family

ID=78647531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111279272.6A Pending CN113705589A (en) 2021-10-29 2021-10-29 Data processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN113705589A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444576A (en) * 2021-12-30 2022-05-06 北京达佳互联信息技术有限公司 Data sampling method and device, electronic equipment and storage medium
CN114723576A (en) * 2022-03-31 2022-07-08 腾讯科技(深圳)有限公司 Data processing model generation method, data processing method and device
CN115292541A (en) * 2022-08-04 2022-11-04 腾讯科技(深圳)有限公司 Media data repetition eliminating method, and target model training method and device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140226905A1 (en) * 2013-02-14 2014-08-14 Canon Kabushiki Kaisha Image processing apparatus, image processing method, and storage medium
WO2018081135A1 (en) * 2016-10-25 2018-05-03 Vmaxx Inc. Point to set similarity comparison and deep feature learning for visual recognition
CN108399211A (en) * 2018-02-02 2018-08-14 清华大学 Large-scale image searching algorithm based on binary feature
US20200090516A1 (en) * 2018-09-13 2020-03-19 Volvo Car Corporation Vehicle parking availability map systems and methods
US20210089883A1 (en) * 2019-09-24 2021-03-25 Salesforce.Com, Inc. System and Method for Learning with Noisy Labels as Semi-Supervised Learning
CN111680636A (en) * 2020-06-09 2020-09-18 广州视源电子科技股份有限公司 Model training method and device
CN111831855A (en) * 2020-07-20 2020-10-27 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for matching videos
CN112183729A (en) * 2020-09-30 2021-01-05 腾讯音乐娱乐科技(深圳)有限公司 Neural network model training method and device and computer readable storage medium
CN112329826A (en) * 2020-10-24 2021-02-05 中国人民解放军空军军医大学 Training method of image recognition model, image recognition method and device
CN112381147A (en) * 2020-11-16 2021-02-19 虎博网络技术(上海)有限公司 Dynamic picture similarity model establishing method and device and similarity calculating method and device
CN113377991A (en) * 2021-06-10 2021-09-10 电子科技大学 Image retrieval method based on most difficult positive and negative samples

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU Tao et al.: "Efficient face recognition algorithm of extreme learning machine based on low-rank constraint", Computer Science *

Similar Documents

Publication Publication Date Title
CN113705589A (en) Data processing method, device and equipment
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN113705811A (en) Model training method, device, computer program product and equipment
CN113344016A (en) Deep migration learning method and device, electronic equipment and storage medium
CN113361698A (en) Processing method and device of neural network model, and data processing method and device
CN109145107B (en) Theme extraction method, device, medium and equipment based on convolutional neural network
CN113505797A (en) Model training method and device, computer equipment and storage medium
CN112819050A (en) Knowledge distillation and image processing method, device, electronic equipment and storage medium
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN114510609B (en) Method, apparatus, device, medium and program product for generating structural data
CN113095356B (en) Light-weight neural network system and image processing method and device
CN110135428A (en) Image segmentation processing method and device
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN112132281A (en) Model training method, device, server and medium based on artificial intelligence
CN116957043A (en) Model quantization method, device, equipment and medium
CN114092162B (en) Recommendation quality determination method, and training method and device of recommendation quality determination model
CN113807370A (en) Data processing method, device, equipment, storage medium and computer program product
CN112669270A (en) Video quality prediction method and device and server
CN114863138B (en) Image processing method, device, storage medium and equipment
CN116049660B (en) Data processing method, apparatus, device, storage medium, and program product
CN114067785B (en) Voice deep neural network training method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination