CN114358109A - Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment

Info

Publication number: CN114358109A
Application number: CN202111247520.9A
Authority: CN (China)
Prior art keywords: sample, semantic, loss, initial, feature
Legal status: Pending (status as listed by Google Patents; not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 郭卉
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111247520.9A
Publication of CN114358109A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method, an apparatus, computer equipment and a storage medium for feature extraction model training and sample retrieval. The method comprises the following steps: inputting each sample in a first sample group into an initial feature extraction model to obtain an initial classification feature, an initial semantic feature, an initial non-semantic feature and an initial fusion feature; the first sample group comprises a target sample, a corresponding reference sample and a class label for each sample, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network; obtaining a classification loss based on the initial classification feature and the class label corresponding to the same sample; obtaining a feature loss based on the semantic, non-semantic and fusion features corresponding to the target sample and the reference sample; and adjusting model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met, so as to obtain a target feature extraction model for extracting sample features of an input sample, where the sample features are used for sample retrieval. By adopting the method, the training efficiency of the model can be improved.

Description

Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for feature extraction model training and sample retrieval, a computer device, and a storage medium.
Background
With the development of computer technology, retrieval techniques such as image retrieval have emerged. Retrieval works by extracting features from a query sample and matching them against the features of samples stored in a sample retrieval library, so that samples similar to the query sample can be retrieved from the library.
In the conventional technology, a sample is generally input into a model for feature extraction, and each model extracts only one kind of feature. Different models therefore need to be trained separately for different features, which makes training time-consuming.
Disclosure of Invention
In view of the above, it is necessary to provide a feature extraction model training method, a sample retrieval method, an apparatus, a computer device, and a storage medium, which can improve training efficiency.
A method of feature extraction model training, the method comprising:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting an initial classification feature and an initial semantic feature through the sample classification network, and outputting an initial non-semantic feature through the non-semantic feature extraction network;
fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss;
calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss;
based on the feature loss and the classification loss, adjusting model parameters of the initial feature extraction model until a convergence condition is met to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
In one embodiment, the feature sizes corresponding to the initial semantic features, the initial non-semantic features and the initial fusion features are the same.
A feature extraction model training apparatus, the apparatus comprising:
the first sample group processing module is used for acquiring a first sample group and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
the feature output module is used for outputting initial classification features and initial semantic features through the sample classification network and outputting initial non-semantic features through the non-semantic feature extraction network;
the feature fusion module is used for fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain the initial fusion features corresponding to each sample;
the feature loss determining module is used for calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain a feature loss;
the classification loss determining module is used for calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss;
the model parameter adjusting module is used for adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting an initial classification feature and an initial semantic feature through the sample classification network, and outputting an initial non-semantic feature through the non-semantic feature extraction network;
fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss;
calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss;
based on the feature loss and the classification loss, adjusting model parameters of the initial feature extraction model until a convergence condition is met to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting an initial classification feature and an initial semantic feature through the sample classification network, and outputting an initial non-semantic feature through the non-semantic feature extraction network;
fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss;
calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss;
based on the feature loss and the classification loss, adjusting model parameters of the initial feature extraction model until a convergence condition is met to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
A method of sample retrieval, the method comprising:
acquiring a query sample and a candidate recall sample set;
inputting the query sample and the candidate recall samples in the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples;
determining a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample feature and the recall sample feature;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
A sample retrieval device, the device comprising:
the data acquisition module is used for acquiring a query sample and a candidate recall sample set;
the data processing module is used for inputting the query sample and the candidate recall samples in the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples;
a retrieval result determining module, configured to determine, based on the query sample feature and the recall sample feature, a retrieval result sample corresponding to the query sample from the candidate recall sample set;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a query sample and a candidate recall sample set;
inputting the query sample and the candidate recall samples in the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples;
determining a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample feature and the recall sample feature;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a query sample and a candidate recall sample set;
inputting the query sample and the candidate recall samples in the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples;
determining a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample feature and the recall sample feature;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
According to the above feature extraction model training and sample retrieval method, apparatus, computer equipment and storage medium, a first sample group is obtained and each sample in the first sample group is input into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network; initial classification features and initial semantic features are output through the sample classification network, and initial non-semantic features are output through the non-semantic feature extraction network; the initial semantic features and initial non-semantic features of the same sample are fused through the feature fusion network to obtain the initial fusion features corresponding to each sample; a feature loss is calculated based on the initial semantic features, initial non-semantic features and initial fusion features corresponding to the target sample and the reference sample; a classification loss is calculated based on the initial classification features and class labels corresponding to the same sample; and model parameters of the initial feature extraction model are adjusted based on the feature loss and the classification loss until a convergence condition is met, obtaining a target feature extraction model.

In this way, a unified model is established that learns semantic features and non-semantic features while simultaneously learning fusion features containing both semantic and non-semantic information. The finally trained model can output semantic features and non-semantic features, each carrying a single type of information, as well as fusion features carrying both types of information. Diversified features are obtained by training only one model, which improves training efficiency.

Subsequently, when sample retrieval is performed, the query sample and the candidate recall samples in a candidate recall sample set may be input into the target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples, and a retrieval result sample corresponding to the query sample is determined from the candidate recall sample set based on the query sample features and the recall sample features. Performing sample retrieval with the diversified features output by the model improves both the accuracy and the efficiency of sample retrieval.
Drawings
FIG. 1 is a diagram of an application environment of the feature extraction model training method and the sample retrieval method in one embodiment;
FIG. 2 is a schematic flow chart diagram of a method for training a feature extraction model in one embodiment;
FIG. 3 is a schematic diagram of a model structure in one embodiment;
FIG. 4 is a schematic flow diagram of training a feature extraction model in one embodiment;
FIG. 5 is a schematic flow chart of training a feature extraction model in another embodiment;
FIG. 6 is a schematic flow chart diagram illustrating a sample retrieval method in one embodiment;
FIG. 7 is a schematic diagram of a process for training an image feature extraction model in one embodiment;
FIG. 8 is a flow diagram illustrating image retrieval in one embodiment;
FIG. 9 is a block diagram showing the structure of a feature extraction model training apparatus according to an embodiment;
FIG. 10 is a block diagram showing the structure of a sample search device according to an embodiment;
FIG. 11 is a diagram of the internal structure of a computer device in one embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) encompasses the theories, methods, techniques and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent traffic.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
The solutions provided in the embodiments of this application relate to artificial intelligence technologies such as computer vision, speech and machine learning, and are described by the following embodiments:
the feature extraction model training method and the sample retrieval method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, a laptop, a smartphone, a tablet, a desktop, a smart television, a vehicle terminal, and a portable wearable device. The terminal is provided with an application program, wherein the application program can refer to a client installed in the terminal, and the client (which can be called an application client and an APP client) refers to a program installed and running in the terminal; the application program can also be an installation-free application program, namely, the application program can be used without downloading and installing, and the application program is also commonly called an applet and usually runs in a client as a subprogram; an application may also refer to a web application that is opened through a browser; and so on. The various applications described above are divided according to the application functions they provide, and the types of applications may include, but are not limited to: search applications, instant messaging applications, payment applications, audio and video applications, and the like. The server 104 may be implemented as a stand-alone server or a server cluster consisting of a plurality of servers or a cloud server.
The terminal 102 and the server 104 can be used separately to perform the feature extraction model training and sample retrieval methods provided in the embodiments of the present application.
For example, the terminal obtains a first sample group, where the first sample group includes a target sample, a reference sample corresponding to the target sample, and a category label corresponding to each sample. And the terminal inputs each sample in the first sample group into an initial feature extraction model, wherein the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network. The method comprises the steps of outputting initial classification features and initial semantic features through a sample classification network, outputting initial non-semantic features through a non-semantic feature extraction network, fusing the initial semantic features and the initial non-semantic features of the same sample through a feature fusion network, and obtaining initial fusion features corresponding to the samples respectively. And the terminal calculates to obtain the characteristic loss based on the initial semantic characteristics, the initial non-semantic characteristics and the initial fusion characteristics corresponding to the target sample and the reference sample, and calculates to obtain the classification loss based on the initial classification characteristics and the class labels corresponding to the same sample. And the terminal adjusts the model parameters of the initial feature extraction model based on the feature loss and the classification loss until the convergence condition is met, so as to obtain the target feature extraction model.
The terminal obtains a query sample and a candidate recall sample set, inputs the query sample and the candidate recall samples in the candidate recall sample set into the target feature extraction model, and obtains query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall samples. The terminal then determines a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample features and the recall sample features.
The terminal 102 and the server 104 may also be cooperatively used to perform the feature extraction model training and sample retrieval methods provided in the embodiments of the present application.
For example, the server acquires a first sample group from the terminal, and the server performs model training on the initial feature extraction model based on the first sample group to obtain a target feature extraction model. The server obtains a query sample from the terminal, obtains a candidate recall sample set from the database, carries out sample retrieval based on the target feature extraction model, and determines a retrieval result sample corresponding to the query sample from the candidate recall sample set. And the server sends the retrieval result sample to the terminal.
The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent traffic, driving assistance, and sample retrieval (information search and information recommendation).
In one embodiment, as shown in fig. 2, a feature extraction model training method is provided. The method is described using a computer device as an example; it is understood that the computer device may be the terminal 102 or the server 104 shown in fig. 1. In this embodiment, the feature extraction model training method includes the following steps:
Step S202: a first sample group is obtained, and each sample in the first sample group is input into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network.
The sample refers to an object that conveys and presents information, and may specifically be at least one of an image, a voice, and a text. Samples have corresponding class labels that identify the class of the sample. For example, if the sample is an image, the class label corresponding to the sample may be an animal species such as dog, cat or fish, a plant species such as coral, pine or sweet-scented osmanthus, or an article species such as a magnifying glass, a cabinet or a water bottle. The first sample group comprises a target sample, a reference sample corresponding to the target sample, and a class label corresponding to each sample. The first sample group is a training sample of the model and is used for training the initial feature extraction model. The reference sample corresponding to the target sample comprises at least one of a positive sample and a negative sample corresponding to the target sample. The sample similarity between the target sample and its positive sample is greater than the sample similarity between the target sample and its negative sample. There may be one or more first sample groups.
The feature extraction model is a machine learning and deep learning model and is used for extracting sample features of the input samples. The input data of the feature extraction model is a sample, and the output data is sample features corresponding to the sample. Different feature extraction models may be trained for different types of samples, for example, if the samples are image type samples, the image feature extraction models may be trained, and if the samples are speech type samples, the speech feature extraction models may be trained.
The initial feature extraction model refers to the feature extraction model to be trained. The initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network. The sample classification network is used for classifying input data and extracting semantic features and classification features of the input data; its input data is a sample, and its output data are the classification features and semantic features of the sample. The non-semantic feature extraction network is used for extracting non-semantic features of input data; its input data is a sample, and its output data are the non-semantic features of the sample. The feature fusion network is used for fusing semantic features and non-semantic features; its input data are the semantic features and the non-semantic features, and its output data are the fusion features. Each of these features may be represented in vector or matrix form.
Specifically, the computer device may obtain the first sample group locally or from a terminal or a server, and train the initial feature extraction model based on the first sample group to obtain the target feature extraction model.
Step S204: initial classification features and initial semantic features are output through the sample classification network, and initial non-semantic features are output through the non-semantic feature extraction network.
The non-semantic features refer to features without semantic measurement capability, and the semantic features refer to features with semantic measurement capability. The classification features are used for characterizing the features of the sample class, and the prediction label of the sample, namely the prediction class, can be determined based on the classification features. It can be understood that the semantic features are output by the sample classification network, and the sample classification network not only needs to extract the features, but also needs to classify and predict the features, that is, the sample classification network not only considers the specific content of the sample but also considers the class to which the sample belongs when extracting the features, so that the semantic features have the capability of semantic measurement and are beneficial to determining the class of the sample. The non-semantic feature extraction network only considers the content information of the samples, does not have the capability of semantic measurement, and cannot determine the types of the samples.
Specifically, the computer device inputs each sample in the first sample group into an initial feature extraction model, performs data processing on the input sample through a sample classification network of the initial feature extraction model, outputs an initial classification feature and an initial semantic feature, performs data processing on the input sample through a non-semantic feature extraction network of the initial feature extraction model, and outputs the initial non-semantic feature.
In one embodiment, to improve the training efficiency of the model, the non-semantic feature extraction network and the sample classification network may share underlying network parameters.
Step S206: the initial semantic features and the initial non-semantic features of the same sample are fused through the feature fusion network to obtain initial fusion features corresponding to each sample.
Specifically, after the initial semantic features and the initial non-semantic features are obtained, the initial semantic features and the initial non-semantic features of the same sample are further fused through a feature fusion network of an initial feature extraction model, so that initial fusion features corresponding to the target sample and the reference sample respectively are obtained. The feature fusion specifically may be to splice the features, or to splice the features and then compress the features, so as to reduce the data size of the features.
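As a minimal illustration of the splice-then-compress fusion just described, the following PyTorch sketch (not part of the original disclosure; the variable names and the linear embedding layer are assumptions) shows both fusion options on illustrative 1 x 128 features:

```python
import torch

# Illustrative 1 x 128 semantic and non-semantic features for one sample.
semantic = torch.randn(1, 128)
non_semantic = torch.randn(1, 128)

# Option 1: splice (concatenate) the two features directly.
spliced = torch.cat([semantic, non_semantic], dim=-1)   # shape: 1 x 256

# Option 2: splice and then compress, reducing the data size of the feature.
compress = torch.nn.Linear(256, 128)                    # hypothetical embedding layer
fused = compress(spliced)                               # shape: 1 x 128
```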
Step S208: loss is calculated based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain a feature loss.
Specifically, the computer device may calculate the loss based on the initial semantic features, the initial non-semantic features, and the initial fusion features corresponding to the target sample and the reference sample, to obtain the feature loss. For example, the computer device may obtain a semantic feature loss based on a distance between initial semantic features corresponding to the target sample and the reference sample, obtain a non-semantic feature loss based on a distance between initial non-semantic features corresponding to the target sample and the reference sample, obtain a fusion feature loss based on a distance between initial fusion features corresponding to the target sample and the reference sample, and obtain a feature loss by integrating the semantic feature loss, the non-semantic feature loss, and the fusion feature loss.
Step S210: loss is calculated based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss.
Specifically, the computer device may derive the classification loss based on the difference between the initial classification features and the class label corresponding to the target sample and the difference between the initial classification features and the class label corresponding to the reference sample. For example, the distance between the initial classification features and the class label corresponding to the same sample is calculated, and the classification loss is obtained based on the calculation results for each sample. In one embodiment, the computer device may calculate the classification loss through a cross-entropy loss function.
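A minimal sketch of the cross-entropy variant, assuming hypothetical logits and labels (the class count and label values below are illustrative, not taken from the patent):

```python
import torch
import torch.nn.functional as F

# Hypothetical initial classification features (logits) for the target,
# positive and negative samples, and their integer class labels.
logits = torch.randn(3, 1000)           # 1000 classes is an assumed number
labels = torch.tensor([12, 12, 507])    # illustrative class labels

# Cross-entropy between the classification features and the class labels,
# averaged over the samples, gives the classification loss.
classification_loss = F.cross_entropy(logits, labels)
print(classification_loss.item())
```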
Step S212: model parameters of the initial feature extraction model are adjusted based on the feature loss and the classification loss until a convergence condition is met, obtaining a target feature extraction model; the target feature extraction model is used for extracting sample features of the input sample, and the sample features are used for sample retrieval.
The convergence condition may be at least one of that the number of model iterations reaches a preset number, that the total loss is smaller than a preset loss, that a change rate of the total loss in a continuous preset number of iterations is smaller than a preset threshold, and the like. The target feature extraction model refers to a trained feature extraction model.
Specifically, after the feature loss and the classification loss are obtained, the computer device may perform back propagation based on the feature loss and the classification loss, update the model parameters of the initial feature extraction model to obtain an updated initial feature extraction model, return to the step of inputting each sample in the first sample group into the initial feature extraction model for iterative execution, and continue training until the convergence condition is satisfied, at which point training is complete and the target feature extraction model is obtained.
In one embodiment, the full training sample set may be divided into different batches, a first sample group is obtained for each batch, training is performed using the first sample group of each batch, and multiple passes over the full training sample set are performed. The model parameters may be updated through backpropagation using a gradient descent algorithm; for example, a loss gradient may be calculated by stochastic gradient descent and propagated back to each network to update the model parameters.
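The sketch below shows one such stochastic-gradient-descent iteration. It uses a tiny stand-in linear network and a triplet-style feature loss so that the loop is runnable; the real model, batch loader, margin value and the combination with the classification loss are assumptions of this sketch, not the patented configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in network so the loop runs; in the patent this would be the
# initial feature extraction model described above.
model = nn.Linear(16, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for step in range(100):  # each step consumes one batch of first sample groups
    # Dummy target, positive and negative samples of an assumed shape.
    target, positive, negative = (torch.randn(4, 16) for _ in range(3))
    f_t, f_p, f_n = model(target), model(positive), model(negative)
    # Feature loss: the target should lie closer to the positive than to the
    # negative sample in feature space (margin value is an assumption).
    loss = F.triplet_margin_loss(f_t, f_p, f_n, margin=0.2)
    optimizer.zero_grad()
    loss.backward()      # the loss gradient is propagated back to the network
    optimizer.step()     # model parameters updated by stochastic gradient descent
```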
After the target feature extraction model is obtained, it may be used to extract sample features of an input sample. For example, if the target feature extraction model is an image feature extraction model, an image is input into the model, and the model can output the semantic features, non-semantic features and fusion features of the image.

The sample features extracted by the target feature extraction model can be used for sample retrieval. Sample retrieval refers to retrieving, from a sample library, at least one sample that is most similar to the query sample. Whether two samples are similar can be determined based on the similarity between their sample features; during sample retrieval, the samples in the sample library that are similar to the query sample can be determined based on the similarity between the sample features of the query sample and those of the library samples, and these similar samples serve as the sample retrieval result for the query sample. When performing sample retrieval, all of the sample features may be used, or only one kind of sample feature may be used to improve retrieval efficiency.

Further, sample retrieval may be triggered passively. For example, in a search application, a user inputs search information in a search box for information retrieval; the search application triggers sample retrieval, uses the user's search information as the query sample, determines a sample retrieval result in a search library based on the query sample, and presents the sample retrieval result to the user as the search result. Sample retrieval may also be triggered actively. For example, in an audio-visual application, the user does not need to input search information; the application can automatically recommend information by automatically triggering sample retrieval, using the user's historical search results as the query sample, determining a sample retrieval result in an audio-visual library based on the query sample, and actively presenting the sample retrieval result to the user as a recommendation result. Therefore, the sample features extracted by the target feature extraction model can be applied to information search scenarios as well as information recommendation scenarios.
In one embodiment, the feature sizes corresponding to the initial semantic features, the initial non-semantic features and the initial fusion features are the same.
The feature size refers to the data size and dimensionality of a feature. For example, if a semantic feature is represented by a 1 x 128 vector, the feature size of the semantic feature is 1 x 128.

Specifically, the feature sizes corresponding to the initial semantic feature, the initial non-semantic feature and the initial fusion feature may all be the same. In that case, when sample retrieval is performed, only the fusion feature, which contains both semantic and non-semantic information, may be used; this preserves retrieval accuracy while improving retrieval efficiency by effectively reducing the amount of computation during retrieval. The similarity between the query sample and each sample in the sample library is calculated based on their respective fusion features, and a retrieval result sample corresponding to the query sample is determined from the sample library based on the similarity.
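A brief sketch of retrieval over fusion features alone, under the assumption of cosine similarity on L2-normalised 1 x 128 vectors (the patent does not fix a specific similarity measure here, and the library size is illustrative):

```python
import torch
import torch.nn.functional as F

# Hypothetical fusion features: one 1 x 128 query feature and a library of
# 10,000 sample features of the same size.
query = F.normalize(torch.randn(1, 128), dim=-1)
library = F.normalize(torch.randn(10000, 128), dim=-1)

# Similarity of the query sample to every library sample, computed from the
# fusion features alone (cosine similarity after normalisation).
similarity = query @ library.T
scores, indices = similarity.topk(5)   # top-5 retrieval result samples
```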
In the above feature extraction model training method, a first sample group is obtained and each sample in the first sample group is input into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network; initial classification features and initial semantic features are output through the sample classification network, and initial non-semantic features are output through the non-semantic feature extraction network; the initial semantic features and initial non-semantic features of the same sample are fused through the feature fusion network to obtain the initial fusion features corresponding to each sample; a feature loss is calculated based on the initial semantic features, initial non-semantic features and initial fusion features corresponding to the target sample and the reference sample; a classification loss is calculated based on the initial classification features and class labels corresponding to the same sample; and the model parameters of the initial feature extraction model are adjusted based on the feature loss and the classification loss until a convergence condition is met, obtaining a target feature extraction model. In this way, a unified model is established that learns semantic features and non-semantic features while simultaneously learning fusion features containing both semantic and non-semantic information. The finally trained model can output semantic features and non-semantic features, each carrying a single type of information, as well as fusion features carrying both types of information. Diversified features are obtained by training only one model, which improves training efficiency.
In one embodiment, the sample classification network includes a semantic feature extraction sub-network and a semantic feature classification sub-network, and the semantic feature extraction sub-network and the non-semantic feature extraction network have a shared network layer. Outputting the initial classification features and initial semantic features through the sample classification network, and outputting the initial non-semantic features through the non-semantic feature extraction network, includes the following steps:
performing convolution processing on an input sample through a shared network layer to obtain shared characteristics; performing feature processing on the shared features through a feature processing layer of the semantic feature extraction sub-network to obtain initial semantic features; classifying the initial semantic features through a semantic feature classification sub-network to obtain initial classification features; and carrying out feature processing on the shared features through a feature processing layer of the non-semantic feature extraction network to obtain initial non-semantic features.
The sample classification network comprises a semantic feature extraction sub-network and a semantic feature classification sub-network, wherein the semantic feature classification sub-network is connected behind the semantic feature extraction sub-network, and input data of the semantic feature classification sub-network is output data of the semantic feature extraction sub-network. The input data of the semantic feature extraction sub-network is a sample, and the output data is a semantic feature. The input data of the semantic feature classification sub-network is a semantic feature, and the output data is a classification feature.
Furthermore, the semantic feature extraction sub-network and the non-semantic feature extraction network have a shared network layer; that is, the two networks share an underlying network structure. The shared network layer is the network structure where the semantic feature extraction sub-network and the non-semantic feature extraction network overlap.
Specifically, after a sample is input into an initial feature extraction model, firstly, convolution processing is carried out on the input sample through a shared network layer, depth feature information of the input sample is extracted, shared features are obtained, and then the shared features are respectively input into a semantic feature extraction sub-network and a subsequent network layer in a non-semantic feature extraction network. And performing further feature processing on the shared features through a subsequent feature processing layer of the semantic feature extraction sub-network, compressing the features to obtain initial semantic features, and finally performing classification processing on the initial semantic features through the semantic feature classification sub-network to obtain initial classification features. And further carrying out feature processing on the shared features through a subsequent feature processing layer of the non-semantic feature extraction network, and compressing the features to obtain initial non-semantic features.
Referring to fig. 3, the feature extraction network includes a sample classification network, a non-semantic feature extraction network, and a feature fusion network. The sample classification network comprises a sharing network layer, a first feature processing layer and a semantic feature classification sub-network which are connected in sequence. The shared network layer and the first feature processing layer constitute a semantic feature extraction subnetwork. The non-semantic feature extraction network comprises a shared network layer and a second feature processing layer which are connected in sequence. The output data of the first feature processing layer is semantic features, the output data of the semantic feature classification sub-network is classification features, and the output data of the second feature processing layer is non-semantic features. The first feature processing layer and the second feature processing layer are respectively connected with the feature fusion network, output data of the first feature processing layer and the second feature processing layer are used as input data of the feature fusion network, and output data of the feature fusion network are fusion features.
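The structure just described can be sketched as follows in PyTorch. This is an illustrative skeleton only, not the patented configuration: the tiny stand-in backbone replaces the ResNet-101 shared network layer of tables 1 and 2 below, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class InitialFeatureExtractionModel(nn.Module):
    """Sketch of the structure in FIG. 3: a shared network layer feeding two
    feature processing layers, a classification sub-network, and a fusion
    network. All sizes are illustrative assumptions."""

    def __init__(self, dim: int = 128, num_classes: int = 1000):
        super().__init__()
        # Shared network layer: convolution plus pooling into a shared feature.
        self.shared = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.semantic_head = nn.Linear(64, dim)        # first feature processing layer
        self.non_semantic_head = nn.Linear(64, dim)    # second feature processing layer
        self.classifier = nn.Linear(dim, num_classes)  # semantic feature classification sub-network
        self.fusion = nn.Linear(2 * dim, dim)          # feature fusion network

    def forward(self, x: torch.Tensor):
        shared = self.shared(x)                         # shared features
        semantic = self.semantic_head(shared)           # initial semantic features
        non_semantic = self.non_semantic_head(shared)   # initial non-semantic features
        logits = self.classifier(semantic)              # initial classification features
        fused = self.fusion(torch.cat([semantic, non_semantic], dim=-1))  # initial fusion features
        return logits, semantic, non_semantic, fused

# Illustrative forward pass on a dummy image batch.
model = InitialFeatureExtractionModel()
logits, semantic, non_semantic, fused = model(torch.randn(2, 3, 224, 224))
```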
In one embodiment, the shared network layer comprises the network structures shown in tables 1 and 2, i.e., the shared network layer comprises a convolutional layer for extracting features and a pooling layer for compressing features.
TABLE 1 ResNet-101 feature module structure table

(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
TABLE 2 pooling layer Structure for compressing depth features output by ResNet-101 into one-dimensional vectors
(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
The feature processing layer of the non-semantic feature extraction network includes the network structure shown in table 3; that is, it includes a filter layer for filtering redundant features and an embedding layer for compressing features.

TABLE 3 Non-semantic module structure table, including non-semantic embedding extraction

(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
The feature processing layer of the semantic feature extraction sub-network and the semantic feature classification sub-network include the network structures shown in table 4; that is, the feature processing layer of the semantic feature extraction sub-network includes a filter layer and an embedding layer, and the semantic feature classification sub-network includes a classification layer for performing feature classification.
TABLE 4 semantic Module Structure Table including semantic embedding extraction and semantic Classification
(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
The feature fusion network includes the network structure shown in table 5; that is, it includes a merging layer and an embedding layer, where the merging layer is used for feature concatenation. The merging layer transforms the 1 x 128 feature vectors output by embedding layer 1 of table 3 and embedding layer 2 of table 4, joined end to end, into a 1 x 256 feature vector. Embedding layer 3 in table 5 performs information fusion on the concatenated 1 x 256 feature vector, compressing it to a 1 x 128 feature vector. The initial feature extraction model built from the network layers of tables 1 to 5 is trained to obtain the target feature extraction model. Conv1-Conv5 may be initialized with the parameters of a ResNet-101 pre-trained on the ImageNet data set, and the newly added layers, such as the embedding layers, may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
TABLE 5 fusion Module Structure Table, including fusion embedding
In the above embodiment, the sample classification network includes a semantic feature extraction sub-network and a semantic feature classification sub-network, and the semantic feature extraction sub-network and the non-semantic feature extraction network have a shared network layer, which can reduce the complexity of the model, reduce the data processing amount of the model, and reduce the parameters that the model needs to learn. And finally outputting the initial semantic features and the initial classification features through the cooperation of the shared network layer, the semantic feature extraction sub-network and the semantic feature classification sub-network.
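To make the two-branch structure concrete, the following is a minimal PyTorch sketch under the conventions of tables 1 to 5. It is an illustrative reconstruction, not the patent's reference implementation: the module names are hypothetical, the filter layers of tables 3 and 4 are folded into single linear embedding layers, and torchvision's ResNet-101 stands in for the Conv1-Conv5 backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class TwoBranchFusionModel(nn.Module):
    """Shared backbone with semantic, non-semantic and fusion branches (sketch)."""
    def __init__(self, num_classes: int, emb_dim: int = 128):
        super().__init__()
        backbone = resnet101(weights=None)  # Conv1-Conv5 of tables 1 and 2
        # Shared network layer: convolutional stages plus the pooling layer.
        self.shared = nn.Sequential(*list(backbone.children())[:-1])  # -> (N, 2048, 1, 1)
        # Second feature processing layer (table 3): non-semantic embedding.
        self.non_semantic_head = nn.Sequential(nn.Flatten(), nn.Linear(2048, emb_dim))
        # First feature processing layer (table 4): semantic embedding.
        self.semantic_head = nn.Sequential(nn.Flatten(), nn.Linear(2048, emb_dim))
        # Semantic feature classification sub-network (table 4).
        self.classifier = nn.Linear(emb_dim, num_classes)
        # Feature fusion network (table 5): concatenate 2 x 128 -> 256, compress to 128.
        self.fusion_head = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, x):
        shared = self.shared(x)
        non_sem = self.non_semantic_head(shared)   # non-semantic embedding (1 x 128)
        sem = self.semantic_head(shared)           # semantic embedding (1 x 128)
        logits = self.classifier(sem)              # classification features
        fused = self.fusion_head(torch.cat([sem, non_sem], dim=1))  # fused (1 x 128)
        return sem, non_sem, fused, logits
```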
In one embodiment, the reference samples include positive and negative samples corresponding to the target sample. Calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, wherein the method comprises the following steps:
obtaining positive semantic loss, positive non-semantic loss and positive fusion loss based on the distance between the same type of features corresponding to the target sample and the positive sample; obtaining negative semantic loss, negative non-semantic loss and negative fusion loss based on the distance between the same type of features corresponding to the target sample and the negative sample; obtaining an initial semantic loss based on the distance between the positive semantic loss and the negative semantic loss, obtaining an initial non-semantic loss based on the distance between the positive non-semantic loss and the negative non-semantic loss, and obtaining an initial fusion loss based on the distance between the positive fusion loss and the negative fusion loss; the feature loss is obtained based on the initial semantic loss, the initial non-semantic loss, and the initial fusion loss.
The reference samples corresponding to the target samples include positive samples and negative samples corresponding to the target samples, that is, the first sample group includes three samples. The similarity between the positive sample corresponding to the target sample and the target sample is greater than the similarity between the negative sample corresponding to the target sample and the target sample.
Specifically, if the reference sample includes a positive sample and a negative sample corresponding to the target sample, and the training target of the model may be to make the feature distance between the target sample and the positive sample smaller than the feature distance between the target sample and the negative sample, then, during feature retrieval, a sample more similar to the query sample may be retrieved from the massive samples based on the sample features.
When calculating the feature loss, the computer device may obtain a positive semantic loss, a positive non-semantic loss, and a positive fusion loss based on the distance between the same type of features corresponding to the target sample and the positive sample, and obtain a negative semantic loss, a negative non-semantic loss, and a negative fusion loss based on the distance between the same type of features corresponding to the target sample and the negative sample. For example, the positive fusion loss is obtained based on the distance between the initial fusion features corresponding to the target sample and the positive sample, respectively. The computer device then integrates the positive loss and the negative loss of the same type to obtain the feature loss: it obtains an initial semantic loss based on the distance between the positive semantic loss and the negative semantic loss, obtains an initial non-semantic loss based on the distance between the positive non-semantic loss and the negative non-semantic loss, and obtains an initial fusion loss based on the distance between the positive fusion loss and the negative fusion loss. Finally, the computer device fuses the initial semantic loss, the initial non-semantic loss and the initial fusion loss to obtain the feature loss.
It can be understood that the computer device may calculate the distance between features by using a custom algorithm or formula, or by using standard distance measures such as the Euclidean distance or the Manhattan distance.
In the above embodiment, the reference sample includes a positive sample and a negative sample corresponding to the target sample; a positive loss is obtained based on the distance between the same type of features corresponding to the target sample and the positive sample, a negative loss is obtained based on the distance between the same type of features corresponding to the target sample and the negative sample, and the feature loss is obtained based on the positive loss and the negative loss. By adjusting the model parameters based on the feature loss, the model gains the capability of distinguishing positive samples from negative samples.
In one embodiment, deriving the feature loss based on the initial semantic loss, the initial non-semantic loss, and the initial fusion loss comprises:
updating the initial semantic loss based on the semantic loss adjustment parameter to obtain an intermediate semantic loss, and determining a target semantic loss based on a matching result of the intermediate semantic loss and a preset parameter; updating the initial non-semantic loss based on the non-semantic loss adjustment parameter to obtain an intermediate non-semantic loss, and determining a target non-semantic loss based on a matching result of the intermediate non-semantic loss and the preset parameter; updating the initial fusion loss based on the fusion loss adjustment parameter to obtain an intermediate fusion loss, and determining a target fusion loss based on a matching result of the intermediate fusion loss and the preset parameter; the fusion loss adjustment parameter is greater than the semantic loss adjustment parameter, and the fusion loss adjustment parameter is greater than the non-semantic loss adjustment parameter; generating the feature loss based on the target semantic loss, the target non-semantic loss, and the target fusion loss.
Wherein the loss adjustment parameter is used to adjust the loss, controlling the distance between the positive loss and the negative loss. The loss adjustment parameters may be set as desired. The loss adjustment parameters for different losses may be the same or different. The preset parameter may be set according to actual needs, for example, to 0.
Specifically, when fusing the various losses, in order to further enlarge the feature distance of the target sample between the positive sample and the negative sample, the computer device may adjust the various losses before fusing them. Taking the initial semantic loss as an example, the initial semantic loss can represent the distance between the positive semantic loss and the negative semantic loss; the computer device can obtain the semantic loss adjustment parameter, update the initial semantic loss based on it, and amplify the initial semantic loss to obtain the intermediate semantic loss. The computer device then matches the intermediate semantic loss against the preset parameter, compares their numerical values, and takes the larger value as the target semantic loss. The goal of the target semantic loss is to enlarge the distance between the positive semantic loss and the negative semantic loss, so that similar samples and dissimilar samples can be distinguished based on the semantic features output by the model. Similarly, the computer device may update the initial non-semantic loss based on the non-semantic loss adjustment parameter to obtain an intermediate non-semantic loss, and determine the target non-semantic loss based on the matching result of the intermediate non-semantic loss and the preset parameter. The computer device may update the initial fusion loss based on the fusion loss adjustment parameter to obtain an intermediate fusion loss, and determine the target fusion loss based on the matching result of the intermediate fusion loss and the preset parameter.
Further, the fusion loss adjustment parameter is greater than the semantic loss adjustment parameter, and the fusion loss adjustment parameter is greater than the non-semantic loss adjustment parameter, for example, the fusion loss adjustment parameter is 0.8, and the semantic loss adjustment parameter and the non-semantic loss adjustment parameter are 0.6. That is, after combining the semantic information and the non-semantic information, the fusion features can further enlarge the feature distance between the target sample and the positive and negative samples, and further enlarge the distance between the positive fusion loss and the negative fusion loss. Subsequently, when the sample retrieval is carried out, the retrieval is carried out only based on the fusion characteristics, and an accurate retrieval result can be obtained.
After obtaining the target semantic loss, the target non-semantic loss, and the target fusion loss, the computer device may use a sum of the target semantic loss, the target non-semantic loss, and the target fusion loss as the feature loss, and may also use a weighted sum of the target semantic loss, the target non-semantic loss, and the target fusion loss as the feature loss. The weights respectively corresponding to the target semantic loss, the target non-semantic loss and the target fusion loss can be set according to requirements.
In one embodiment, each target loss is calculated as follows:
ltri = max(||xa - xp|| - ||xa - xn|| + α, 0)
where max (a, b) represents taking the maximum of a and b. x is the number ofaSample features, x, representing a target samplepSample feature, x, representing a positive sample corresponding to a target samplenAnd the sample characteristics of the negative sample corresponding to the target sample are represented. | xa-xpI represents the calculation xaAnd xpThe L2 distance between them, the euclidean distance. α represents a loss adjustment parameter. ltriThe purpose of (a) is to make the ratio of the distance of the target sample to the negative sample greater than a distance to the positive sample. The target semantic loss, the target non-semantic loss and the target fusion loss can be calculated by using the formula, but alpha corresponding to the target fusion loss is larger than alpha corresponding to the target semantic loss and the target non-semantic loss.
In the above embodiment, the feature loss is generated based on the target semantic loss, the target non-semantic loss, and the target fusion loss, and the feature distinguishing capability of the model can be improved by adjusting the model parameters based on the feature loss.
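As a hedged sketch of how the three target losses might be computed with this formula, assuming Euclidean distance and the example margins mentioned above (0.6 for the semantic and non-semantic losses, 0.8 for the fusion loss); the function names and tuple layout are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(xa, xp, xn, alpha):
    """ltri = max(||xa - xp|| - ||xa - xn|| + alpha, 0), averaged over the batch."""
    d_pos = F.pairwise_distance(xa, xp)  # L2 distance to the positive sample
    d_neg = F.pairwise_distance(xa, xn)  # L2 distance to the negative sample
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()

def feature_loss(sem, non_sem, fused,
                 alpha_sem=0.6, alpha_non=0.6, alpha_fuse=0.8):
    """Each argument is an (anchor, positive, negative) tuple of one feature type."""
    l_sem = triplet_loss(*sem, alpha_sem)
    l_non = triplet_loss(*non_sem, alpha_non)
    l_fuse = triplet_loss(*fused, alpha_fuse)  # larger margin for the fused features
    return l_sem + l_non + l_fuse
```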
In one embodiment, calculating a loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss includes:
performing label coding on the category label of each sample to obtain corresponding label characteristics; carrying out normalization processing on the classification features corresponding to the samples to obtain corresponding normalization features; carrying out logarithmic transformation on each normalized feature, and fusing the label feature corresponding to the same sample and the normalized feature after the logarithmic transformation to obtain the corresponding classification sub-loss of each sample; the classification loss is obtained based on the respective classification sub-losses.
The label encoding is used to convert the class label into binary-encoded data. Each vector dimension in the label feature vector corresponds to a label category, and the specific vector value in that dimension indicates whether the sample matches the corresponding label category. The normalization process is used to map the vector values of the feature vectors into a preset numerical range. The preset numerical range may be set as desired, for example, between 0 and 1.
Specifically, to facilitate calculation of the classification loss, the computer device may process the classification features and class labels into data that is easy to compare and calculate. For the class labels, the computer device may perform label encoding on the class labels corresponding to the target sample and the reference sample, converting each class label into a label feature; for example, the label encoding may use a one-hot encoding method. The computer device can respectively normalize the classification features corresponding to the target sample and the reference sample, mapping the vector values of the classification feature vectors into the preset numerical range to obtain the normalized features corresponding to the samples; for example, the normalization may use a softmax function. The computer device then performs a logarithmic transformation on each normalized feature, specifically taking a preset value as the base and the normalized feature as the argument. Finally, the computer device fuses the label features corresponding to the same sample with the log-transformed normalized features to obtain the classification sub-loss corresponding to each sample, and then synthesizes the classification sub-losses to obtain the classification loss.
In one embodiment, the classification loss is calculated as follows:
lclass = -(1/N) Σ(k=1..N) pk · log(qk)
where pk represents the label feature corresponding to the kth sample, qk represents the normalized feature obtained by normalizing the classification feature corresponding to the kth sample, and N represents the number of samples.
In the above embodiment, the label coding is performed on the category label, the normalization processing and the logarithm transformation are performed on the classification characteristic, and the processing result of the normalization processing and the logarithm transformation is fused to obtain the accurate classification loss.
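A minimal sketch of this classification loss, assuming one-hot label encoding, softmax normalization and a natural-logarithm transform; written in PyTorch with hypothetical names:

```python
import torch
import torch.nn.functional as F

def classification_loss(logits, labels, num_classes):
    """lclass = -(1/N) * sum_k pk . log(qk) with one-hot pk and softmax qk."""
    p = F.one_hot(labels, num_classes).float()  # label features (label encoding)
    log_q = F.log_softmax(logits, dim=1)        # normalization + log transform
    return -(p * log_q).sum(dim=1).mean()       # fuse per sample, average over N
```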
In one embodiment, the sample classification network includes a semantic feature extraction sub-network and a semantic feature classification sub-network. Based on the feature loss and the classification loss, adjusting model parameters of the initial feature extraction model until a convergence condition is met to obtain a target feature extraction model, wherein the method comprises the following steps:
obtaining a target loss based on the feature loss and the classification loss, and performing a gradient calculation on the target loss to obtain a loss gradient; updating the loss gradient based on a first adjustment parameter to obtain a first loss, and updating the loss gradient based on a second adjustment parameter to obtain a second loss; the first adjustment parameter is greater than the second adjustment parameter; and adjusting the network parameters of the semantic feature classification sub-network based on the first loss, and adjusting the network parameters of the other networks based on the second loss, until a convergence condition is met, so as to obtain the target feature extraction model.
Specifically, since the semantic feature classification sub-network involves classification processing and is prone to over-fitting, in order to reduce the influence of the semantic feature classification sub-network on the other networks, the semantic feature classification sub-network and the other networks can be updated with different losses. After obtaining the feature loss and the classification loss, the computer device may fuse the feature loss and the classification loss to obtain the target loss; for example, the sum of the feature loss and the classification loss is used as the target loss, or a weighted sum of the feature loss and the classification loss is used as the target loss. A gradient calculation is then performed on the target loss to obtain the loss gradient. The computer device may update the loss gradient based on the first adjustment parameter to obtain the first loss, adjust the network parameters of the semantic feature classification sub-network based on the first loss, update the loss gradient based on the second adjustment parameter to obtain the second loss, and adjust the network parameters of the other networks based on the second loss, until the convergence condition is satisfied, to obtain the target feature extraction model. Updating the loss gradient based on an adjustment parameter may specifically be multiplying the adjustment parameter by the loss gradient.
The first adjustment parameter is greater than the second adjustment parameter, and both can be set as needed; for example, the first adjustment parameter is 10 times the second adjustment parameter, with the first adjustment parameter set to 0.005 and the second adjustment parameter set to 0.0005. It can be understood that because the first adjustment parameter is greater than the second adjustment parameter, the loss generated by classification is transmitted back to the semantic feature classification sub-network in full, and transmitted back to the other networks scaled by a multiple smaller than 1, thereby reducing the influence of the semantic feature classification sub-network on the other networks and ensuring the overall training effect of the model. Furthermore, if the semantic feature extraction sub-network and the non-semantic feature extraction network have a shared network layer, making the first adjustment parameter greater than the second adjustment parameter avoids over-fitting the semantic embedding to the classification information, and also avoids an excessive classification gradient being returned to the shared bottom-layer features and thereby affecting the semantic embedding.
In the above embodiment, the target loss is obtained based on the feature loss and the classification loss, a gradient calculation is performed on the target loss to obtain the loss gradient, the loss gradient is updated based on the first adjustment parameter to obtain the first loss, and the loss gradient is updated based on the second adjustment parameter to obtain the second loss, where the first adjustment parameter is greater than the second adjustment parameter; the network parameters of the semantic feature classification sub-network are adjusted based on the first loss and the network parameters of the other networks are adjusted based on the second loss, which can improve the training effect of the model.
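One illustrative way to realize this asymmetric back-propagation in PyTorch is a gradient-scaling identity placed between the semantic embedding and the classification layer, so that the classification loss reaches the classifier at full strength while reaching the shared layers scaled by a factor below 1 (0.1 here, matching the 0.005/0.0005 example). This is a sketch of one possible mechanism, not necessarily the patent's exact scheme:

```python
import torch

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` backward."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None

# Inside the model's forward pass (hypothetical wiring):
#   logits = self.classifier(GradScale.apply(sem, 0.1))
# The classification gradient reaches self.classifier in full, but is scaled
# by 0.1 before flowing into the semantic branch and the shared layers.
```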
In one embodiment, as shown in fig. 4, before obtaining the first sample group and inputting each sample in the first sample group into the initial feature extraction model, the method further includes:
step S402, a second sample group is obtained, each sample in the second sample group is input into the candidate feature extraction model, and a candidate non-semantic feature set, a candidate semantic feature set and a candidate fusion feature set corresponding to the second sample group are obtained.
And S404, calculating loss based on the candidate non-semantic feature set, the candidate semantic feature set and the candidate fusion feature set to obtain candidate loss.
Step S406, based on the candidate loss, adjusting network parameters of a target network in the candidate feature extraction model until a first condition is met, and obtaining an initial feature extraction model; the target network comprises a sample classification network and a non-semantic feature extraction network.
The second sample group and the first sample group may include the same samples or different samples. There may be at least one second sample group. The candidate feature extraction model is a feature extraction model still to be trained; the initial feature extraction model is obtained after the candidate feature extraction model is pre-trained. Similar to the convergence condition, the first condition may be that the number of model iterations reaches a preset number, that the candidate loss is smaller than a preset loss, that the rate of change of the candidate loss over a consecutive preset number of iterations is smaller than a preset threshold, and so on.
Specifically, since the fusion features are obtained based on the semantic features and the non-semantic features, if the extraction effect of the semantic features and the non-semantic features is good, the fusion features with excellent performance can be obtained more easily. In order to improve the model training efficiency, the network for extracting the semantic features and the non-semantic features can be trained, and on the basis of good performance of the network for extracting the semantic features and the non-semantic features, comprehensive training is performed, and the overall model parameters are finely adjusted, so that the target feature extraction model can be quickly obtained.
The computer device may obtain the second sample group, train the candidate feature extraction model based on the second sample group, and adjust only the network parameters of the sample classification network and the non-semantic feature extraction network in the candidate feature extraction model to obtain the initial feature extraction model. The computer device inputs each sample in the second sample group into the candidate feature extraction model, and the candidate feature extraction model outputs the candidate non-semantic feature, candidate semantic feature and candidate fusion feature corresponding to each sample in the second sample group, thereby obtaining the candidate non-semantic feature set, candidate semantic feature set and candidate fusion feature set corresponding to the second sample group. Referring to the calculation method of the feature loss, the computer device can obtain the candidate loss based on the candidate non-semantic feature set, the candidate semantic feature set and the candidate fusion feature set. The computer device adjusts only the network parameters of the sample classification network and the non-semantic feature extraction network in the feature extraction model based on the candidate loss, and fixes the network parameters of the feature fusion network. The computer device may perform back propagation based on the candidate loss, update the model parameters of the candidate feature extraction model to obtain an updated candidate feature extraction model, return to the step of inputting each sample in the second sample group into the candidate feature extraction model for iterative execution, and continue training until the first condition is satisfied, completing the training to obtain the initial feature extraction model.
In one embodiment, similar to the model parameters adjusted based on the feature loss and the classification loss, when the model parameters are adjusted based on the candidate loss, the loss gradient of the candidate loss may also be adjusted based on a third adjustment parameter to obtain a third loss, the network parameters of the semantic feature classification sub-network are adjusted based on the third loss, the loss gradient of the candidate loss is adjusted based on a fourth adjustment parameter to obtain a fourth loss, and the network parameters of the non-semantic feature extraction network and the semantic feature extraction sub-network are adjusted based on the fourth loss until the first condition is satisfied to obtain the initial feature extraction model. The third adjustment parameter is greater than the fourth adjustment parameter, for example, the third adjustment parameter is 0.005, and the fourth adjustment parameter is 0.0005.
In the above embodiment, the candidate feature extraction model is trained based on the second sample group, the sample classification network and the non-semantic feature extraction network in the model are adjusted to obtain the initial feature extraction model, and then the target feature extraction model can be quickly obtained by fine-tuning the feature extraction model.
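Training only the sample classification network and the non-semantic feature extraction network while fixing the feature fusion network can be sketched by freezing parameters; the attribute names below come from the hypothetical model sketch given earlier:

```python
def set_trainable(module, flag):
    """Enable or disable gradient updates for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(flag)

# Pre-training stage: adjust only the two base branches; fix the fusion network.
set_trainable(model.fusion_head, False)
for part in (model.shared, model.semantic_head,
             model.non_semantic_head, model.classifier):
    set_trainable(part, True)
```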
In one embodiment, as shown in fig. 5, before obtaining the second sample group and inputting each sample in the second sample group into the candidate feature extraction model, the method further includes:
step S502, a third sample group is obtained, each sample in the third sample group is input into the feature extraction model to be trained, and a non-semantic feature set corresponding to the third sample group is obtained.
And step S504, calculating loss based on the non-semantic feature set to obtain initial loss.
And S506, based on the initial loss, adjusting model parameters of the non-semantic feature extraction network in the feature extraction model to be trained until a second condition is met, and obtaining a candidate feature extraction model.
The third sample group and the first sample group may include the same samples or different samples. There may be at least one third sample group. Similar to the convergence condition, the second condition may be that the number of model iterations reaches a preset number, that the initial loss is smaller than a preset loss, that the rate of change of the initial loss over a consecutive preset number of iterations is smaller than a preset threshold, and so on.
Specifically, in order to improve the training effect of the model, the non-semantic feature extraction network may be trained first, the relevant model parameters of the non-semantic feature extraction network may be adjusted first, and then the training of the next stage may be performed.
The computer device may obtain the third sample group, train the feature extraction model to be trained based on the third sample group, and adjust only the model parameters of the non-semantic feature extraction network in the model to obtain the candidate feature extraction model. The computer device inputs each sample in the third sample group into the feature extraction model to be trained, and the feature extraction model to be trained outputs the non-semantic features corresponding to each sample in the third sample group, thereby obtaining the non-semantic feature set corresponding to the third sample group. Referring to the calculation method of the feature loss, the computer device can obtain the initial loss based on the non-semantic feature set. The computer device adjusts only the network parameters of the non-semantic feature extraction network in the feature extraction model based on the initial loss. The computer device can perform back propagation based on the initial loss, update the model parameters of the non-semantic feature extraction network to obtain an updated feature extraction model to be trained, return to the step of inputting each sample in the third sample group into the feature extraction model to be trained for iterative execution, and continue training until the second condition is met, completing the training to obtain the candidate feature extraction model.
In the above embodiment, the feature extraction model to be trained is trained based on the third sample group, and the non-semantic feature extraction network in the model is adjusted, so that the convergence balance between the semantic branches and the non-semantic branches in the subsequent training stage is kept, and the training efficiency is improved.
Of course, the computer device may also obtain a fourth sample group, input each sample in the fourth sample group into the feature extraction model having the initialization parameter, obtain the non-semantic feature set corresponding to the fourth sample group, and calculate loss information based on the non-semantic feature set corresponding to the fourth sample group, to adjust the network parameters of the non-semantic feature extraction network in the feature extraction model until the third condition is satisfied, so as to obtain the initial feature extraction model.
In one embodiment, the current sample group is any one of the first sample group, the second sample group, and the third sample group.
Obtaining a current sample set comprising: obtaining a plurality of similar sample pairs; determining a current sample and a positive sample corresponding to the current sample from the current similar sample pair, and determining a plurality of candidate samples from the rest similar sample pairs; determining at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity between the current sample and each candidate sample; and taking the positive sample and the negative sample corresponding to the current sample as reference samples corresponding to the current sample, and obtaining at least one current sample group based on the current sample and the corresponding reference samples.
Wherein, a similar sample pair refers to a sample pair labeled that two samples are the same or similar samples.
Specifically, when determining the sample triplet, the computer device may obtain a plurality of similar sample pairs, randomly select one similar sample pair from the plurality of similar sample pairs as a current similar sample pair, use one sample in the current similar sample pair as a current sample, and use another sample as a positive sample corresponding to the current sample. The computer device then randomly selects a plurality of samples from the remaining pairs of similar samples other than the current pair of similar samples as candidate samples, for example, randomly selects one sample from each of the remaining pairs of similar samples as a candidate sample. Then, the computer device calculates the sample similarity of the current sample and each candidate sample, and determines at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity. When calculating the sample similarity, the computer device may adopt a custom algorithm, or may also adopt a conventional text similarity calculation algorithm, an image similarity calculation algorithm, a voice similarity calculation algorithm, or the like. When the negative samples are selected, the candidate samples can be ranked according to the sequence of the sample similarity from large to small, and a plurality of candidate samples ranked at the front are obtained as the negative samples. After the positive sample and the negative sample corresponding to the current sample are obtained, the positive sample and the negative sample corresponding to the current sample are used as reference samples corresponding to the current sample, a current sample group is formed based on the current sample and the corresponding reference samples, and finally the current sample group with the same number as the negative samples is obtained.
It is understood that a current sample group includes three samples, namely a current sample, a positive sample corresponding to the current sample, and a negative sample. And if the number of the negative samples corresponding to the current sample is multiple, forming a current sample group by each negative sample and the current sample pair corresponding to the current sample, and finally obtaining multiple current sample groups. In addition, the class labels corresponding to the samples in the similar sample pairs are identical, that is, the samples and the positive samples corresponding to the samples have identical class labels, but the class labels corresponding to the positive samples and the negative samples corresponding to the same samples may be identical or different. The first sample set, the second sample set, and the third sample set may all be obtained by the above-described method. In one embodiment, the first, second and third sample groups may comprise the same pair of similar samples.
In the above embodiment, the negative sample corresponding to the current sample is determined based on the sample similarity from the remaining similar sample pairs, and the negative sample can be quickly determined based on full use of the existing data.
In one embodiment, determining at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity between the current sample and each candidate sample comprises:
inputting the current sample and each candidate sample into a matched current feature extraction model to obtain sample feature sets respectively corresponding to the current sample and each candidate sample; the current feature extraction model matched with the first sample group is an initial feature extraction model, the current feature extraction model matched with the second sample group is a candidate feature extraction model, and the current feature extraction model matched with the third sample group is a feature extraction model to be trained; calculating the sample similarity of the current sample and each candidate sample respectively based on the sample feature set corresponding to the current sample and each candidate sample; dividing each candidate sample into a first type sample and a second type sample based on the sample similarity; the sample similarity corresponding to the first type of sample is greater than the sample similarity corresponding to the second type of sample; and determining at least one negative sample corresponding to the current sample from the first type of samples.
Specifically, the similarity between samples may be calculated from the characteristics of the samples output by the model. Taking the first sample group as an example, when the first sample group is determined, the target sample and each candidate sample may be input into the initial feature extraction model, a sample feature set corresponding to the target sample and each candidate sample respectively is obtained according to output data of the model, the sample feature set includes at least one of semantic features, non-semantic features and fusion features, and further, sample similarities between the target sample and each candidate sample are calculated based on the sample feature sets corresponding to the target sample and each candidate sample, for example, a sample feature distance between two samples is taken as a sample similarity between samples, and the sample similarity is larger as the sample feature distance is smaller. The computer device may divide each candidate sample into a first type sample and a second type sample based on the sample similarity, the sample similarity corresponding to the first type sample being greater than the sample similarity corresponding to the second type sample, and select at least one sample from the first type sample as at least one negative sample corresponding to the target sample. When the samples are classified, the candidate samples can be ranked according to the sequence of the similarity of the samples from large to small, a preset number of candidate samples ranked in the front are obtained as a first type of sample, the remaining candidate samples are used as a second type of sample, and the preset number can be set as required.
It can be understood that the negative samples are determined from the candidate samples with high sample similarity with the target sample, the negative samples selected from the candidate samples comprise a certain proportion of difficult samples, and then, during model training, the difficult samples are beneficial to improving the feature distinguishing capability of the model, so that the performance of the model is improved.
In a similar manner to the determination of the first sample group, when determining the second sample group, the current sample and each candidate sample may be input into the candidate feature extraction model and the sample similarity calculated from the output data of the model, thereby determining the negative sample in the second sample group; when determining the third sample group, the current sample and each candidate sample may be input into the feature extraction model to be trained and the sample similarity calculated from the output data of the model, thereby determining the negative sample in the third sample group.
In the embodiment, different training samples are adopted to perform model training in different training stages, so that the generalization capability of the model can be improved. And calculating the sample similarity according to the sample characteristics output by the model trained in the previous stage, so that the training sample in the next stage can be quickly determined. Negative samples are determined from the first type of samples with larger similarity, difficult samples can be obtained, and the difficult samples are beneficial to improving the model performance in the training process.
In one embodiment, as shown in fig. 6, a sample retrieval method is provided, and is described by taking a computer device as an example, it is understood that the computer device may be the terminal 102 shown in fig. 1 or the server 104. In this embodiment, the sample retrieval method includes the following steps:
step S602, a query sample and a set of candidate recall samples are obtained.
The query sample refers to the sample for which the most similar samples need to be queried. The query sample may be at least one of an image, speech and text. In an information search scenario, the query sample may be the search information submitted by a user through a terminal at the time of searching. In an information recommendation scenario, the query sample may be user reference information autonomously acquired by the computer device. The user reference information may be user attribute information; for example, the interests and hobbies in the user account registration information may be used as user reference information. The user reference information may also be determined based on the user's historical behavior; for example, the user's historical search results and historically browsed samples may be used as user reference information. The user reference information can reflect the user's interests, hobbies and attention hotspots. The candidate recall sample set refers to a sample library, that is, a retrieval library, comprising a large number of candidate recall samples. At least one sample most similar to the query sample needs to be retrieved from the sample library as the retrieval result.
Specifically, when receiving a retrieval task and a query task, the computer device may obtain a query sample and a candidate recall sample set, and determine a retrieval result corresponding to the query sample from the candidate recall sample set by using a trained target feature extraction model.
Step S604, inputting the query sample and the candidate recall sample in the candidate recall sample set into a target feature extraction model to obtain a query sample feature corresponding to the query sample and a recall sample feature corresponding to the candidate recall sample.
The training process of the target feature extraction model is as follows: obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network; outputting initial classification features and initial semantic features through a sample classification network, outputting initial non-semantic features through a non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through a feature fusion network to obtain initial fusion features corresponding to the samples respectively; calculating loss based on initial semantic features, initial non-semantic features and initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on initial classification features and class labels corresponding to the same sample to obtain classification loss; and adjusting model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
It is to be understood that, for the training process of the target feature extraction model, reference may be made to the foregoing embodiments of the feature extraction model training method, and details are not described here again.
Step S606, based on the query sample characteristics and the recall sample characteristics, determining a retrieval result sample corresponding to the query sample from the candidate recall sample set.
Specifically, the computer device may input the query sample and the candidate recall sample in the candidate recall sample set into the target feature extraction model, and obtain a query sample feature corresponding to the query sample and a recall sample feature corresponding to each candidate recall sample according to output data of the model. The query sample features and the recall sample features each include at least one of semantic features, non-semantic features, and fused features. The computer equipment can calculate the similarity between the samples based on the query sample characteristics and the recall sample characteristics, and further obtains the candidate recall sample with higher similarity from the candidate recall sample set as a retrieval result sample corresponding to the query sample. Subsequently, the computer device may return a search result sample to the sender of the search task or the query task.
According to the sample retrieval method, a unified model is established during training to learn the semantic features and the non-semantic features while simultaneously learning the fusion features. The fusion features contain both semantic and non-semantic information, so the finally trained model can output semantic features and non-semantic features containing single-dimensional information as well as fusion features containing information of both dimensions; diversified features are obtained by training only one model, which improves training efficiency. Therefore, when sample retrieval is performed using the diversified features output by the model, the accuracy and efficiency of sample retrieval can be improved.
In one embodiment, the recall sample feature comprises at least one of a semantic feature, a non-semantic feature, and a fused feature, and the query sample feature and the recall sample feature comprise the same type of sample feature. Determining a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample feature and the recall sample feature, wherein the method comprises the following steps:
establishing indexes based on the characteristics of all recall samples corresponding to the same type to obtain sample recall indexes corresponding to all types respectively; and performing sample retrieval from the corresponding sample recall indexes based on the corresponding query sample characteristics of the same type, and determining a retrieval result sample based on the sample retrieval result.
Specifically, when a sample is retrieved, similar samples need to be queried from the same perspective by using the same type of sample features. Thus, the query sample features and the recall sample features need to include sample features of the same type, and the recall sample features may include at least one of semantic features, non-semantic features, and fused features.
Further, in order to improve the retrieval efficiency, the computer device may establish an index for retrieval based on the characteristics of the recall sample, where the index may characterize the distribution of the characteristics of the recall sample in the space, and then, when performing retrieval, by determining a distribution area corresponding to the characteristics of the query sample in the index, a retrieval result sample may be quickly determined in the candidate recall sample corresponding to the distribution area. The retrieval based on the index can avoid the similarity calculation between the query sample and the sample in the sample library in pairs during each retrieval, and avoid a large amount of complex calculation.
The computer device may establish an index based on each recall sample feature corresponding to the same type, to obtain sample recall indexes corresponding to each type, for example, a non-semantic feature corresponds to a first sample recall index, a semantic feature corresponds to a second sample recall index, a fusion feature corresponds to a third sample recall index, and accordingly, sample retrieval needs to be performed from the corresponding sample recall indexes based on query sample features corresponding to the same type. Finally, the computer device may determine a retrieval result sample from the set of candidate recall samples based on the sample retrieval results.
In one embodiment, the computer device may build the sample recall index in advance, before any retrieval task is acquired. Then, after a retrieval task is acquired, the computer device can quickly perform sample retrieval based on the sample recall index and quickly determine the retrieval result sample.
In the above embodiment, the sample retrieval is performed through the sample recall index, so that the retrieval efficiency can be improved.
In one embodiment, the creating an index based on the recall sample features corresponding to the same type to obtain the sample recall indexes corresponding to the types respectively includes:
performing feature clustering on each recall sample feature corresponding to the same type to obtain a plurality of clustering clusters corresponding to each type; each clustering cluster has a corresponding clustering center; and obtaining sample recall indexes respectively corresponding to the types based on all the clustering clusters corresponding to the same type.
Specifically, when the index is established, the computer device may perform feature clustering on each recall sample feature corresponding to the same type, cluster recall sample features adjacent to each other in the space in the same cluster, and cluster recall sample features having a certain distance in the space in different clusters, thereby obtaining a plurality of cluster clusters corresponding to each type. Each cluster is independent, each cluster corresponds to a local vector space, each cluster has a corresponding cluster center, and the cluster centers can also be represented by vectors or matrixes, similar to the characteristics of the recall samples. The computer equipment can establish a sample recall index based on each clustering cluster corresponding to the same type, take the clustering center as a main component of the index, take candidate recall samples of the clustering clusters corresponding to the clustering center with sample characteristics as associated information of the index, and finally obtain the sample recall indexes respectively corresponding to each type.
In one embodiment, if the sample library has additional samples, the sample characteristics of the additional samples and the distances between the clustering centers can be calculated, which clustering cluster the additional samples belong to is determined according to the distances, and the additional samples are associated to the clustering cluster to which the clustering center corresponding to the minimum distance belongs.
In the above embodiment, the index is established based on the cluster, so that the subsequent retrieval efficiency can be improved.
In one embodiment, a sample retrieval is performed from a corresponding sample recall index based on the corresponding query sample features of the same type, and a retrieval result sample is determined based on a sample retrieval result, including:
determining target cluster clusters from corresponding cluster clusters based on the query sample characteristics corresponding to the same type and the distance between the cluster centers to obtain target cluster clusters corresponding to the types respectively; determining target sample characteristics from the recall sample characteristics corresponding to the target clustering cluster based on the distance between the query sample characteristics corresponding to the same type and the recall sample characteristics in the target clustering cluster; and taking the candidate recall sample corresponding to the target sample characteristic as a retrieval result sample.
Specifically, the computer device calculates the distance between the query sample feature corresponding to the same feature type and the corresponding clustering center, and takes the clustering cluster corresponding to the minimum distance value as the target clustering cluster, so as to obtain the target clustering clusters corresponding to each feature type, thereby limiting the retrieval range. And then, the computer equipment searches in each candidate recall sample corresponding to the target clustering cluster, calculates the distance between the query sample characteristic corresponding to the same characteristic type and each recall sample characteristic in the corresponding target clustering cluster, takes at least one recall sample characteristic with the closest distance as a target sample characteristic, and takes the candidate recall sample corresponding to the target sample characteristic as a final retrieval result sample. And aiming at different types of sample characteristics, obtaining retrieval result samples corresponding to the characteristic types respectively.
In the embodiment, the search range is narrowed based on the clustering center, and then the accurate search is performed in the target clustering cluster, so that the search efficiency can be improved.
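A minimal sketch of this cluster-based recall index and two-step search, using scikit-learn's k-means; the number of clusters and the number of probed centers are assumed hyperparameters, not values from the patent:

```python
import numpy as np
from sklearn.cluster import KMeans

class ClusterIndex:
    """Cluster-based recall index: coarse cluster lookup, then exact distances."""
    def __init__(self, features, n_clusters=100):
        self.features = features                          # (num_samples, dim)
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
        self.assignments = self.kmeans.labels_            # cluster id per sample

    def search(self, query, n_centers=3, top_k=10):
        # Step 1: restrict the search range to the nearest cluster centers.
        center_dist = np.linalg.norm(self.kmeans.cluster_centers_ - query, axis=1)
        nearest = np.argsort(center_dist)[:n_centers]
        # Step 2: exact Euclidean distances within the associated clusters only.
        cand_ids = np.where(np.isin(self.assignments, nearest))[0]
        dist = np.linalg.norm(self.features[cand_ids] - query, axis=1)
        return cand_ids[np.argsort(dist)[:top_k]]         # retrieval result samples
```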
In one embodiment, when the query task corresponding to the query sample is a first type task, the recall sample feature comprises a fusion feature, and when the query task corresponding to the query sample is a second type task, the recall sample feature comprises a semantic feature and a non-semantic feature, and the query frequency of the first type task is greater than that of the second type task.
Specifically, since the fusion features include both semantic information and non-semantic information, in order to improve the retrieval efficiency, for a task with high response and high frequency request, the query sample of this type of task is retrieved by using the fusion features. That is, when the query task corresponding to the query sample is a first type task with relatively high query frequency, the recall sample feature and the query sample feature include a fusion feature, and the fusion feature is used for performing rapid retrieval.
In the embodiment, different retrieval modes are used through different tasks, so that the retrieval flexibility can be improved.
In a specific embodiment, the feature extraction model training method and the sample retrieval method can be applied to the image field.
1. Data preparation
Image similarity samples are labeled, and triplet samples are further mined from them. Similar image pairs are acquired in each batch (batch represents a training batch containing bs similar image pairs in total), and image triples are mined within the image pairs of each batch. For example, for an image x in a similar image pair, one image is randomly selected from each of the remaining bs-1 sample pairs, its distance to x is calculated, the distances are sorted from small to large, and the first 10 samples are taken as negative samples; each negative sample forms a triplet with x and the positive sample of x, so each x may generate 10 triples, and the whole batch may obtain 10 × bs triples. The category corresponding to each image in the triples is labeled.
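A hedged sketch of this in-batch mining step, assuming the per-image features have already been extracted and Euclidean distance is used; array shapes and helper names are illustrative:

```python
import numpy as np

def mine_triplets(pair_feats, n_neg=10):
    """pair_feats: (bs, 2, d) array of features for bs similar image pairs."""
    bs = pair_feats.shape[0]
    triplets = []
    for i in range(bs):
        anchor, positive = pair_feats[i, 0], pair_feats[i, 1]
        # One randomly chosen image from each of the remaining bs - 1 pairs.
        others = np.stack([pair_feats[j, np.random.randint(2)]
                           for j in range(bs) if j != i])
        dist = np.linalg.norm(others - anchor, axis=1)
        # Smallest distances first: the most similar candidates become negatives.
        for neg in others[np.argsort(dist)[:n_neg]]:
            triplets.append((anchor, positive, neg))
    return triplets  # up to 10 * bs triplets per batch
```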
2. Model structure
The image feature extraction model is a two-branch fusion model; the model structure can refer to fig. 7. The bottom-layer structure of the model is the shared network layer, which may be a shared CNN network; for example, the CNN network may be a ResNet101 pre-trained on the large-scale open-source ImageNet data set, and the specific structure of the ResNet101 may refer to tables 1 and 2. An image in a triplet is input into the model, processed sequentially by the structures of tables 1 and 2, and then input into the two-branch embedding extraction structures of tables 3 and 4 to generate the non-semantic embedding (i.e., the output data of embedding layer 1) and the semantic embedding (i.e., the output data of embedding layer 2); the two embeddings are then input into the structure of table 5 to generate and output the fused embedding (i.e., the output data of embedding layer 3).
3. Model training
The mined triplet samples and the semantic annotation information are used to perform supervised learning on the model. epoch rounds of iteration (epoch represents a training round) are performed on the full data, with the full set of samples processed once per round. Referring to fig. 7, in the model training process, a loss must be calculated for the semantic and non-semantic embeddings, and a loss must also be calculated for the fused embedding; for the semantic module in table 4, a classification layer is additionally provided above the embedding to supply semantic information, and the classification layer also needs its corresponding loss calculated.
Considering that different branches converge to different degrees (the non-semantic branch converges slowly, the semantic branch converges quickly, and the fused embedding is obtained from the semantic and non-semantic embeddings), the training process is carried out in three stages. First, the non-semantic embedding is trained with Loss1: the non-semantic branch learns for 5 epochs, and at this stage only the loss of the non-semantic branch is calculated and the related model parameters are updated. Then, the two basic embeddings are trained with Loss2: the semantic and non-semantic branches learn simultaneously for 10 epochs, and at this stage only the model parameters of the semantic and non-semantic branches are updated. Subsequently, all branches (semantic + non-semantic + fusion) are trained with Loss3 using the SGD stochastic gradient descent method: the calculated loss is back-propagated to obtain the update values of all model parameters and the model is updated. Model training is stopped, for example, when the model loss has not decreased for 10 consecutive rounds.
loss1=ltriplet1
loss2=ltriplet1+ltriplet2+lclass
loss3=ltriplet1+ltriplet2+ltriplet3+lclass
where ltriplet1 represents the non-semantic loss, ltriplet2 represents the semantic loss, lclass represents the classification loss, and ltriplet3 represents the fusion loss.
In this way, the fused embedding is fused late in the training process: fine-tuning of the fused embedding is carried out only after the basic embeddings have been trained well, which avoids the poor overall effect that results from directly fusing two basic embeddings whose convergence is inconsistent.
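The three-stage schedule might be sketched as follows; the SGD optimizer, the 5/10-epoch stage lengths and the 10-round early-stopping patience follow the description above, while the loss functions and data loader are assumed to exist:

```python
import torch

def train_stage(model, loader, loss_fn, epochs, lr=0.01):
    """Run one training stage with SGD over the parameters left trainable."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()

# Stage 1: non-semantic branch only, loss1 = ltriplet1, 5 epochs.
# Stage 2: semantic + non-semantic branches, loss2, 10 epochs.
# Stage 3: all branches with loss3, trained until the loss has not
#          decreased for 10 consecutive rounds (early stopping).
```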
4. Model application
The trained model can be used to extract the non-semantic embedding, the semantic embedding and the fused embedding of an input image. Referring to fig. 8, the embedding of each image in the image library is first extracted with the model, and clustering centers of the embeddings are then determined. During retrieval, the nearest n clustering centers (i.e., the target clustering centers) are found for the embedding of the query image, the images associated with those clustering centers are taken as candidate images, the Euclidean distance between the embedding of each candidate image and the embedding of the query image is calculated, the candidates are sorted from small to large, and the topK images in the sorted order are taken as the final retrieval result.
Single retrieval system: the model unifies the two embeddings into the fused embedding, so an embedding containing both semantic and non-semantic information can be obtained with a single model inference, and only one retrieval system needs to be established under the fused embedding, thereby reducing the storage space of the retrieval system. This is very helpful when inference or retrieval resources are limited.
Dual retrieval system: a dual retrieval system can also be established based on the semantic and non-semantic embeddings obtained by model inference. The two features are clustered separately, an index system is established for each using the corresponding clustering centers, retrieval is carried out on each index, and the retrieval results are then directly combined.
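As a sketch, a dual retrieval system might merge the two result lists as below. The text says only that the results are "directly combined", so the de-duplicated union used here is an assumption, and the index objects exposing a search method are hypothetical placeholders.

    def dual_search(query_sem, query_nonsem, index_sem, index_nonsem, top_k=10):
        hits_sem = index_sem.search(query_sem, top_k)           # retrieve on the semantic index
        hits_nonsem = index_nonsem.search(query_nonsem, top_k)  # retrieve on the non-semantic index
        merged = list(dict.fromkeys(list(hits_sem) + list(hits_nonsem)))  # combine, drop duplicates
        return merged[:top_k]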
It can be understood that in both the image search scene and the image recommendation scene, either a single retrieval system or a dual retrieval system may be established.
The embodiment has the following beneficial effects:
1. Low inference time and low storage cost: a unified model is established to learn the two features and, at the same time, the combined feature. The advantage of the combined feature is that a single feature vector can express both semantic information and non-semantic information, so the model can finally output one high-quality feature with rich information.
2. Flexible single- or dual-feature retrieval applications: because semantic and non-semantic information are both embodied in one fused embedding, a retrieval system can be established for that embedding alone. On the other hand, when the application service can support a larger retrieval memory, a dual-feature retrieval system can also be established by means of the model's two base embeddings. In short, retrieval systems can be established flexibly with this model.
3. The convergence performance of multi-task learning can be ensured through multi-stage training and fusion-feature fine-tuning.
It should be understood that although the steps in the flowcharts of figs. 2, 4 and 5 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in figs. 2, 4 and 5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a feature extraction model training apparatus, which may be implemented as part of a computer device by a software module, a hardware module, or a combination of the two, and which specifically includes: a first sample group processing module 902, a feature output module 904, a feature fusion module 906, a feature loss determination module 908, a classification loss determination module 910, and a model parameter adjustment module 912, wherein:
a first sample group processing module 902, configured to obtain a first sample group, and input each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
a feature output module 904, configured to output an initial classification feature and an initial semantic feature through a sample classification network, and output an initial non-semantic feature through a non-semantic feature extraction network;
a feature fusion module 906, configured to fuse the initial semantic features and the initial non-semantic features of the same sample through a feature fusion network to obtain initial fusion features corresponding to the samples, respectively;
a feature loss determining module 908, configured to calculate a loss based on the initial semantic features, the initial non-semantic features, and the initial fusion features corresponding to the target sample and the reference sample, so as to obtain a feature loss;
a classification loss determining module 910, configured to calculate a loss based on an initial classification feature and a category label corresponding to the same sample, so as to obtain a classification loss;
a model parameter adjusting module 912, configured to adjust a model parameter of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is satisfied, so as to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of the input sample, and the sample features are used for sample retrieval.
In one embodiment, the sample classification network includes a semantic feature extraction sub-network and a semantic feature classification sub-network, and the semantic feature extraction sub-network and the non-semantic feature extraction network share a network layer. The feature output module is further used for performing convolution processing on the input sample through the shared network layer to obtain shared features; performing feature processing on the shared features through a feature processing layer of the semantic feature extraction sub-network to obtain the initial semantic features; classifying the initial semantic features through the semantic feature classification sub-network to obtain the initial classification features; and performing feature processing on the shared features through a feature processing layer of the non-semantic feature extraction network to obtain the initial non-semantic features.
In one embodiment, the reference samples include a positive sample and a negative sample corresponding to the target sample. The feature loss determining module is further used for obtaining a positive semantic loss, a positive non-semantic loss and a positive fusion loss based on the distance between the same type of features corresponding to the target sample and the positive sample; obtaining a negative semantic loss, a negative non-semantic loss and a negative fusion loss based on the distance between the same type of features corresponding to the target sample and the negative sample; obtaining an initial semantic loss based on the distance between the positive semantic loss and the negative semantic loss, obtaining an initial non-semantic loss based on the distance between the positive non-semantic loss and the negative non-semantic loss, and obtaining an initial fusion loss based on the distance between the positive fusion loss and the negative fusion loss; and obtaining the feature loss based on the initial semantic loss, the initial non-semantic loss and the initial fusion loss.
In one embodiment, the feature loss determining module is further used for updating the initial semantic loss based on a semantic loss adjustment parameter to obtain an intermediate semantic loss, and determining a target semantic loss based on a matching result of the intermediate semantic loss and a preset parameter; updating the initial non-semantic loss based on a non-semantic loss adjustment parameter to obtain an intermediate non-semantic loss, and determining a target non-semantic loss based on a matching result of the intermediate non-semantic loss and the preset parameter; updating the initial fusion loss based on a fusion loss adjustment parameter to obtain an intermediate fusion loss, and determining a target fusion loss based on a matching result of the intermediate fusion loss and the preset parameter; the fusion loss adjustment parameter is greater than the semantic loss adjustment parameter and greater than the non-semantic loss adjustment parameter; and generating the feature loss based on the target semantic loss, the target non-semantic loss and the target fusion loss.
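One plausible reading of this weighting scheme, sketched in PyTorch: each branch loss is scaled by its adjustment parameter (with the fusion weight the largest) and matched against a preset floor of zero. The specific weight values and the choice of zero as the preset parameter are assumptions.

    import torch

    def feature_loss(l_sem, l_nonsem, l_fuse,
                     w_sem=1.0, w_nonsem=1.0, w_fuse=2.0, preset=0.0):
        target_sem = torch.clamp(w_sem * l_sem, min=preset)          # intermediate -> target semantic loss
        target_nonsem = torch.clamp(w_nonsem * l_nonsem, min=preset) # intermediate -> target non-semantic loss
        target_fuse = torch.clamp(w_fuse * l_fuse, min=preset)       # w_fuse exceeds the other two weights
        return target_sem + target_nonsem + target_fuse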
In one embodiment, the classification loss determining module is further configured to perform label coding on the class label of each sample to obtain a corresponding label feature; carrying out normalization processing on the classification features corresponding to the samples to obtain corresponding normalization features; carrying out logarithmic transformation on each normalized feature, and fusing the label feature corresponding to the same sample and the normalized feature after the logarithmic transformation to obtain the corresponding classification sub-loss of each sample; the classification loss is obtained based on the respective classification sub-losses.
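Assembled from the steps named here (label coding, normalization, logarithmic transformation, fusion of the two per sample), the classification loss reads as a standard cross-entropy; a sketch, assuming logits of shape (batch, num_classes):

    import torch
    import torch.nn.functional as F

    def classification_loss(logits, labels, num_classes):
        one_hot = F.one_hot(labels, num_classes).float()  # label features from label coding
        probs = F.softmax(logits, dim=1)                  # normalized features
        log_probs = torch.log(probs + 1e-12)              # logarithmic transformation
        sub_losses = -(one_hot * log_probs).sum(dim=1)    # per-sample classification sub-loss
        return sub_losses.mean()                          # classification loss over the batch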
In one embodiment, the sample classification network includes a semantic feature extraction sub-network and a semantic feature classification sub-network. The model parameter adjusting module is further used for obtaining a target loss based on the feature loss and the classification loss, and performing gradient calculation on the target loss to obtain a loss gradient; updating the loss gradient based on a first adjustment parameter to obtain a first loss, and updating the loss gradient based on a second adjustment parameter to obtain a second loss, the first adjustment parameter being smaller than the second adjustment parameter; and adjusting network parameters of the semantic feature classification sub-network based on the first loss, and adjusting network parameters of the other networks based on the second loss, until a convergence condition is met, so as to obtain the target feature extraction model.
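Under SGD, scaling the gradient by a smaller first adjustment parameter for the classification sub-network is equivalent to giving that sub-network a smaller learning rate; a sketch using optimizer parameter groups, where the 0.1x ratio and the classifier attribute name (taken from the earlier model sketch) are assumptions.

    import torch

    classifier_params = list(model.classifier.parameters())
    other_params = [p for name, p in model.named_parameters()
                    if not name.startswith("classifier")]
    opt = torch.optim.SGD([
        {"params": classifier_params, "lr": 0.001},  # first (smaller) adjustment parameter
        {"params": other_params, "lr": 0.01},        # second adjustment parameter
    ], momentum=0.9)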
In one embodiment, the feature extraction model training apparatus further includes:
the second sample group processing module is used for acquiring a second sample group, inputting each sample in the second sample group into the candidate feature extraction model, and obtaining a candidate non-semantic feature set, a candidate semantic feature set and a candidate fusion feature set corresponding to the second sample group; calculating loss based on the candidate non-semantic feature set, the candidate semantic feature set and the candidate fusion feature set to obtain candidate loss; based on the candidate loss, adjusting network parameters of a target network in the candidate feature extraction model until a first condition is met to obtain an initial feature extraction model; the target network comprises a sample classification network and a non-semantic feature extraction network.
In one embodiment, the feature extraction model training apparatus further includes:
the third sample group processing module is used for acquiring a third sample group, inputting each sample in the third sample group into the feature extraction model to be trained, and obtaining a non-semantic feature set corresponding to the third sample group; calculating loss based on the non-semantic feature set to obtain initial loss; and adjusting model parameters of the non-semantic feature extraction network in the feature extraction model to be trained based on the initial loss until a second condition is met to obtain a candidate feature extraction model.
In one embodiment, the current sample group is any one of the first sample group, the second sample group, and the third sample group. The first sample group processing module, the second sample group processing module and the third sample group processing module are also used for acquiring a plurality of similar sample pairs; determining a current sample and a positive sample corresponding to the current sample from the current similar sample pair, and determining a plurality of candidate samples from the rest similar sample pairs; determining at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity between the current sample and each candidate sample; and taking the positive sample and the negative sample corresponding to the current sample as reference samples corresponding to the current sample, and obtaining at least one current sample group based on the current sample and the corresponding reference samples.
In one embodiment, the first sample group processing module, the second sample group processing module, and the third sample group processing module are further configured to input the current sample and each candidate sample into a matched current feature extraction model, so as to obtain sample feature sets corresponding to the current sample and each candidate sample respectively; the current feature extraction model matched with the first sample group is an initial feature extraction model, the current feature extraction model matched with the second sample group is a candidate feature extraction model, and the current feature extraction model matched with the third sample group is a feature extraction model to be trained; calculating the sample similarity of the current sample and each candidate sample respectively based on the sample feature set corresponding to the current sample and each candidate sample; dividing each candidate sample into a first type sample and a second type sample based on the sample similarity; the sample similarity corresponding to the first type of sample is greater than the sample similarity corresponding to the second type of sample; and determining at least one negative sample corresponding to the current sample from the first type of samples.
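A sketch of this negative mining step: score each candidate against the current sample with the stage-matched model, keep the more similar (first type) candidates, and draw negatives from them. The cosine similarity measure, the 0.5 threshold, and the use of the fused embedding are assumptions; the four outputs of the earlier model sketch are reused.

    import torch
    import torch.nn.functional as F

    def mine_negatives(model, current, candidates, threshold=0.5, k=1):
        with torch.no_grad():
            _, _, cur_emb, _ = model(current.unsqueeze(0))    # fused embedding of the current sample
            _, _, cand_emb, _ = model(candidates)             # fused embeddings of all candidates
        sims = F.cosine_similarity(cur_emb, cand_emb)         # sample similarity per candidate
        first_type = (sims > threshold).nonzero().squeeze(1)  # more similar: first type samples
        order = sims[first_type].argsort(descending=True)
        return first_type[order[:k]]                          # indices of the chosen negatives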
In one embodiment, the feature sizes corresponding to the initial semantic features, the initial non-semantic features and the initial fusion features are the same.
According to the above feature extraction model training apparatus, a unified model is established to learn semantic features and non-semantic features and, at the same time, fusion features that contain both semantic information and non-semantic information. The finally trained model can output semantic and non-semantic features that each carry single-dimensional information, as well as fusion features carrying information of both dimensions; since training one model yields these diversified features, training efficiency is improved.
In one embodiment, as shown in fig. 10, there is provided a sample retrieval apparatus, which may be implemented as part of a computer device by a software module, a hardware module, or a combination of the two, and which specifically includes: a data acquisition module 1002, a data processing module 1004, and a retrieval result determination module 1006, wherein:
a data obtaining module 1002, configured to obtain a query sample and a candidate recall sample set;
the data processing module 1004 is configured to input the query sample and the candidate recall sample in the candidate recall sample set into the target feature extraction model, so as to obtain a query sample feature corresponding to the query sample and a recall sample feature corresponding to the candidate recall sample;
a retrieval result determining module 1006, configured to determine, based on the query sample feature and the recall sample feature, a retrieval result sample corresponding to the query sample from the candidate recall sample set;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network; outputting initial classification features and initial semantic features through a sample classification network, outputting initial non-semantic features through a non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through a feature fusion network to obtain initial fusion features corresponding to the samples respectively; calculating loss based on initial semantic features, initial non-semantic features and initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on initial classification features and class labels corresponding to the same sample to obtain classification loss; and adjusting model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
In one embodiment, the recall sample feature comprises at least one of a semantic feature, a non-semantic feature, and a fused feature, and the query sample feature and the recall sample feature comprise the same type of sample feature. The retrieval result determination module includes:
and the index establishing unit is used for establishing indexes based on the characteristics of all the recall samples corresponding to the same type to obtain the sample recall indexes corresponding to all the types respectively.
And the sample retrieval unit is used for retrieving samples from the corresponding sample recall indexes based on the query sample characteristics corresponding to the same type and determining retrieval result samples based on the sample retrieval results.
In one embodiment, the index establishing unit is further configured to perform feature clustering on each recall sample feature corresponding to the same type to obtain a plurality of clustering clusters corresponding to each type; each clustering cluster has a corresponding clustering center; and obtaining sample recall indexes respectively corresponding to the types based on all the clustering clusters corresponding to the same type.
In one embodiment, the sample retrieval unit is further configured to determine target cluster clusters from the corresponding cluster clusters based on the query sample features corresponding to the same type and the distances between the cluster centers, and obtain target cluster clusters corresponding to the types respectively; determining target sample characteristics from the recall sample characteristics corresponding to the target clustering cluster based on the distance between the query sample characteristics corresponding to the same type and the recall sample characteristics in the target clustering cluster; and taking the candidate recall sample corresponding to the target sample characteristic as a retrieval result sample.
In one embodiment, when the query task corresponding to the query sample is a first type task, the recall sample feature comprises a fusion feature, and when the query task corresponding to the query sample is a second type task, the recall sample feature comprises a semantic feature and a non-semantic feature, and the query frequency of the first type task is greater than that of the second type task.
According to the above sample retrieval apparatus, when the model is trained, a unified model is established to learn semantic features and non-semantic features and, at the same time, fusion features containing both semantic and non-semantic information. The finally trained model can output semantic and non-semantic features carrying single-dimensional information as well as fusion features carrying information of both dimensions, and only one model needs to be trained to obtain these diversified features, which improves training efficiency. Accordingly, performing sample retrieval with the diversified features output by the model can improve both the accuracy and the efficiency of sample retrieval.
For the specific limitations of the feature extraction model training apparatus and the sample retrieval apparatus, reference may be made to the limitations of the feature extraction model training method and the sample retrieval method above, which are not repeated here. All or part of the modules in the feature extraction model training apparatus and the sample retrieval apparatus can be realized by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor of the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as similar sample pairs, candidate recall sample sets, target feature extraction models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a feature extraction model training method, a sample retrieval method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a feature extraction model training method, a sample retrieval method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structures shown in figs. 11 and 12 are block diagrams of only part of the structures relevant to the present application and do not constitute a limitation on the computer device to which the present application is applied; a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered as falling within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent application. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

1. A method for training a feature extraction model, the method comprising:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting an initial classification characteristic and an initial semantic characteristic through the sample classification network, and outputting an initial non-semantic characteristic through the non-semantic characteristic extraction network;
fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss;
calculating loss based on the initial classification characteristics and the class labels corresponding to the same sample to obtain classification loss;
based on the characteristic loss and the classification loss, adjusting model parameters of the initial characteristic extraction model until a convergence condition is met to obtain a target characteristic extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
2. The method of claim 1, wherein the sample classification network comprises a semantic feature extraction sub-network and a semantic feature classification sub-network, the semantic feature extraction sub-network and the non-semantic feature extraction network having a shared network layer, the outputting the initial classification features and the initial semantic features through the sample classification network, and the outputting the initial non-semantic features through the non-semantic feature extraction network comprises:
performing convolution processing on an input sample through the shared network layer to obtain shared characteristics;
performing feature processing on the shared features through a feature processing layer of the semantic feature extraction sub-network to obtain the initial semantic features;
classifying the initial semantic features through the semantic feature classification sub-network to obtain the initial classification features;
and carrying out feature processing on the shared features through a feature processing layer of the non-semantic feature extraction network to obtain the initial non-semantic features.
3. The method of claim 1, wherein the reference samples comprise positive samples and negative samples corresponding to the target samples, and wherein calculating the loss based on the initial semantic features, the initial non-semantic features, and the initial fusion features corresponding to the target samples and the reference samples comprises:
obtaining positive semantic loss, positive non-semantic loss and positive fusion loss based on the distance between the same type of features corresponding to the target sample and the positive sample;
obtaining negative semantic loss, negative non-semantic loss and negative fusion loss based on the distance between the same type of features corresponding to the target sample and the negative sample;
obtaining an initial semantic loss based on a distance between the positive semantic loss and the negative semantic loss, obtaining an initial non-semantic loss based on a distance between the positive non-semantic loss and the negative non-semantic loss, and obtaining an initial fusion loss based on a distance between the positive fusion loss and the negative fusion loss;
and obtaining the characteristic loss based on the initial semantic loss, the initial non-semantic loss and the initial fusion loss.
4. The method of claim 3, wherein deriving the feature loss based on the initial semantic loss, the initial non-semantic loss, and the initial fusion loss comprises:
updating the initial semantic loss based on the semantic loss adjusting parameter to obtain an intermediate semantic loss, and determining a target semantic loss based on the intermediate semantic loss and a matching result of a preset parameter;
updating the initial non-semantic loss based on the non-semantic loss adjusting parameter to obtain intermediate non-semantic loss, and determining target non-semantic loss based on the intermediate non-semantic loss and a matching result of a preset parameter;
updating the initial fusion loss based on the fusion loss adjustment parameter to obtain an intermediate fusion loss, and determining a target fusion loss based on a matching result of the intermediate fusion loss and a preset parameter; the fusion loss adjustment parameter is greater than the semantic loss adjustment parameter, and the fusion loss adjustment parameter is greater than the non-semantic loss adjustment parameter;
and generating the characteristic loss based on the target semantic loss, the target non-semantic loss and the target fusion loss.
5. The method of claim 1, wherein calculating the loss based on the initial classification features and class labels corresponding to the same sample to obtain a classification loss comprises:
performing label coding on the category label of each sample to obtain corresponding label characteristics;
carrying out normalization processing on the classification features corresponding to the samples to obtain corresponding normalization features;
carrying out logarithmic transformation on each normalized feature, and fusing the label feature corresponding to the same sample and the normalized feature after the logarithmic transformation to obtain the corresponding classification sub-loss of each sample;
the classification loss is derived based on the respective classification sub-losses.
6. The method of claim 1, wherein the sample classification network comprises a semantic feature extraction sub-network and a semantic feature classification sub-network, and wherein the adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is satisfied to obtain a target feature extraction model comprises:
obtaining target loss based on the characteristic loss and the classification loss, and performing gradient calculation on the target loss to obtain a loss gradient;
updating the loss gradient based on a first adjustment parameter to obtain a first loss, and updating the loss gradient based on a second adjustment parameter to obtain a second loss; the first adjustment parameter is smaller than the second adjustment parameter;
and adjusting network parameters of the semantic feature classification sub-network based on the first loss, and adjusting network parameters of other networks based on the second loss until a convergence condition is met, so as to obtain the target feature extraction model.
7. The method of any one of claims 1 to 6, wherein before obtaining the first sample group and inputting each sample in the first sample group into the initial feature extraction model, the method further comprises:
acquiring a second sample group, inputting each sample in the second sample group into a candidate feature extraction model, and obtaining a candidate non-semantic feature set, a candidate semantic feature set and a candidate fusion feature set corresponding to the second sample group;
calculating loss based on the candidate non-semantic feature set, the candidate semantic feature set and the candidate fusion feature set to obtain candidate loss;
based on the candidate loss, adjusting network parameters of a target network in the candidate feature extraction model until a first condition is met to obtain the initial feature extraction model; the target network comprises the sample classification network and a non-semantic feature extraction network.
8. The method of claim 7, wherein before obtaining the second sample group and inputting each sample in the second sample group into the candidate feature extraction model, the method further comprises:
acquiring a third sample group, inputting each sample in the third sample group into a feature extraction model to be trained, and acquiring a non-semantic feature set corresponding to the third sample group;
calculating loss based on the non-semantic feature set to obtain initial loss;
and adjusting model parameters of a non-semantic feature extraction network in the feature extraction model to be trained based on the initial loss until a second condition is met to obtain the candidate feature extraction model.
9. The method of claim 8, wherein a current sample group is any one of the first sample group, the second sample group, and the third sample group, and obtaining the current sample group comprises:
obtaining a plurality of similar sample pairs;
determining a current sample and a positive sample corresponding to the current sample from a current similar sample pair, and determining a plurality of candidate samples from the rest similar sample pairs;
determining at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity of the current sample and each candidate sample;
and taking the positive sample and the negative sample corresponding to the current sample as reference samples corresponding to the current sample, and obtaining at least one current sample group based on the current sample and the corresponding reference samples.
10. The method of claim 9, wherein the determining at least one negative sample corresponding to the current sample from each candidate sample based on the sample similarity between the current sample and each candidate sample comprises:
inputting the current sample and each candidate sample into a matched current feature extraction model to obtain sample feature sets respectively corresponding to the current sample and each candidate sample; the current feature extraction model matched with the first sample group is the initial feature extraction model, the current feature extraction model matched with the second sample group is the candidate feature extraction model, and the current feature extraction model matched with the third sample group is the feature extraction model to be trained;
calculating sample similarity of the current sample and each candidate sample respectively based on the sample feature sets corresponding to the current sample and each candidate sample;
dividing each candidate sample into a first type sample and a second type sample based on the sample similarity; the sample similarity corresponding to the first type of sample is greater than the sample similarity corresponding to the second type of sample;
and determining at least one negative sample corresponding to the current sample from the first type of samples.
11. A method for sample retrieval, the method comprising:
acquiring a query sample and a candidate recall sample set;
inputting the candidate recall samples in the query sample and the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall sample;
determining a retrieval result sample corresponding to the query sample from the candidate recall sample set based on the query sample feature and the recall sample feature;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
12. The method of claim 11, wherein the recall sample feature comprises at least one of a semantic feature, a non-semantic feature, and a fused feature, and wherein the query sample feature and the recall sample feature comprise a same type of sample feature;
the determining, from the candidate recall sample set, a retrieval result sample corresponding to the query sample based on the query sample feature and recall sample feature includes:
establishing indexes based on the characteristics of all recall samples corresponding to the same type to obtain sample recall indexes corresponding to all types respectively;
and performing sample retrieval from the corresponding sample recall indexes based on the query sample features corresponding to the same type, and determining the retrieval result sample based on the sample retrieval result.
13. The method of claim 12, wherein the creating an index based on the recall sample features corresponding to the same type to obtain a sample recall index corresponding to each type includes:
performing feature clustering on each recall sample feature corresponding to the same type to obtain a plurality of clustering clusters corresponding to each type; each clustering cluster has a corresponding clustering center;
and obtaining sample recall indexes respectively corresponding to the types based on all the clustering clusters corresponding to the same type.
14. The method of claim 13, wherein the performing sample retrieval from a corresponding sample recall index based on the same type of corresponding query sample features, and determining the retrieval result sample based on a sample retrieval result comprises:
determining target cluster clusters from corresponding cluster clusters based on the query sample characteristics corresponding to the same type and the distance between the cluster centers to obtain target cluster clusters corresponding to the types respectively;
determining target sample characteristics from the recall sample characteristics corresponding to the target clustering cluster based on the distance between the query sample characteristics corresponding to the same type and the recall sample characteristics in the target clustering cluster;
and taking the candidate recall sample corresponding to the target sample characteristic as the retrieval result sample.
15. The method of claim 12, wherein the recall sample feature comprises a fusion feature when the query task corresponding to the query sample is a first type task, and a semantic feature and a non-semantic feature when the query task corresponding to the query sample is a second type task, and wherein a query frequency of the first type task is greater than a query frequency of the second type task.
16. A feature extraction model training apparatus, characterized in that the apparatus comprises:
the first sample group processing module is used for acquiring a first sample group and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
the feature output module is used for outputting initial classification features and initial semantic features through the sample classification network and outputting initial non-semantic features through the non-semantic feature extraction network;
the feature fusion module is used for fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain the initial fusion features corresponding to each sample;
the feature loss determining module is used for calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain a feature loss;
the classification loss determining module is used for calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain a classification loss;
the model parameter adjusting module is used for adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model; the target feature extraction model is used for extracting sample features of an input sample, and the sample features are used for sample retrieval.
17. A sample retrieval apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a query sample and a candidate recall sample set;
the data processing module is used for inputting the candidate recall samples in the query sample and the candidate recall sample set into a target feature extraction model to obtain query sample features corresponding to the query sample and recall sample features corresponding to the candidate recall sample;
a retrieval result determining module, configured to determine, based on the query sample feature and the recall sample feature, a retrieval result sample corresponding to the query sample from the candidate recall sample set;
the training process of the target feature extraction model is as follows:
obtaining a first sample group, and inputting each sample in the first sample group into an initial feature extraction model; the first sample group comprises target samples, reference samples corresponding to the target samples and class labels corresponding to the samples, and the initial feature extraction model comprises a sample classification network, a non-semantic feature extraction network and a feature fusion network;
outputting initial classification features and initial semantic features through the sample classification network, outputting initial non-semantic features through the non-semantic feature extraction network, and fusing the initial semantic features and the initial non-semantic features of the same sample through the feature fusion network to obtain initial fusion features corresponding to the samples respectively;
calculating loss based on the initial semantic features, the initial non-semantic features and the initial fusion features corresponding to the target sample and the reference sample to obtain feature loss, and calculating loss based on the initial classification features and the class labels corresponding to the same sample to obtain classification loss;
and adjusting the model parameters of the initial feature extraction model based on the feature loss and the classification loss until a convergence condition is met to obtain a target feature extraction model.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 10 or 11 to 15.
19. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10 or 11 to 15.
20. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 10 or 11 to 15 when executed by a processor.
CN202111247520.9A 2021-10-26 2021-10-26 Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment Pending CN114358109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111247520.9A CN114358109A (en) 2021-10-26 2021-10-26 Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111247520.9A CN114358109A (en) 2021-10-26 2021-10-26 Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment

Publications (1)

Publication Number Publication Date
CN114358109A true CN114358109A (en) 2022-04-15

Family

ID=81096394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111247520.9A Pending CN114358109A (en) 2021-10-26 2021-10-26 Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment

Country Status (1)

Country Link
CN (1) CN114358109A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897073A (en) * 2022-05-10 2022-08-12 北京百度网讯科技有限公司 Model iteration method and device for intelligent industry and electronic equipment
CN115344728A (en) * 2022-10-17 2022-11-15 北京百度网讯科技有限公司 Image retrieval model training method, image retrieval model using method, image retrieval model training device, image retrieval model using device, image retrieval model equipment and image retrieval model medium
CN117235665A (en) * 2023-09-18 2023-12-15 北京大学 Self-adaptive privacy data synthesis method, device, computer equipment and storage medium
CN117194652A (en) * 2023-11-08 2023-12-08 泸州友信达智能科技有限公司 Information recommendation system based on deep learning
CN117194652B (en) * 2023-11-08 2024-01-23 泸州友信达智能科技有限公司 Information recommendation system based on deep learning

Similar Documents

Publication Publication Date Title
WO2020228376A1 (en) Text processing method and model training method and apparatus
CN111008332B (en) Content item recommendation method, device, server and storage medium
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN113435520A (en) Neural network training method, device, equipment and computer readable storage medium
CN112989212A (en) Media content recommendation method, device and equipment and computer storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN118035945B (en) Label recognition model processing method and related device
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
CN112131261A (en) Community query method and device based on community network and computer equipment
CN114764865A (en) Data classification model training method, data classification method and device
CN114708449B (en) Similar video determination method, and training method and device of example characterization model
CN116957128A (en) Service index prediction method, device, equipment and storage medium
CN115129908A (en) Model optimization method, device, equipment, storage medium and program product
CN114329065A (en) Processing method of video label prediction model, video label prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071012

Country of ref document: HK