CN117009776A - Feature extraction method, model training method, device and electronic equipment - Google Patents

Feature extraction method, model training method, device and electronic equipment

Info

Publication number
CN117009776A
Authority
CN
China
Prior art keywords
loss
feature
determining
feature extraction
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211361033.XA
Other languages
Chinese (zh)
Inventor
孙众毅
陈小双
鄢科
丁守鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211361033.XA priority Critical patent/CN117009776A/en
Publication of CN117009776A publication Critical patent/CN117009776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a feature extraction method, a model training method, an apparatus, and an electronic device. By determining a feature space reservation loss, the distribution of sample features can be constrained with this loss, achieving the effect of feature space reservation and preventing feature conflicts for sample data newly added during incremental training. A feature extraction model is trained according to a self-supervision loss and the feature space reservation loss, which improves the training effect of the feature extraction model in scenarios combining incremental training with self-supervised training, and improves the accuracy of the features extracted by the feature extraction model when the target features of target media data are extracted based on the trained feature extraction model.

Description

Feature extraction method, model training method, device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a feature extraction method, a model training method, a device, and an electronic device.
Background
At present, feature extraction models are widely applied in a variety of common scenarios. When a feature extraction model undergoes incremental training, the features of sample data extracted in different batches may conflict as the number of training batches increases. In the related art, under supervised learning, labeling information can be used to constrain the representation of sample features of different categories, which alleviates the influence of feature conflicts to a certain extent. Under self-supervised training, however, the sample data carries no labeling information, so the feature conflict problem cannot be solved with labeling information; this affects the training effect of the feature extraction model and reduces the accuracy of the features it extracts.
Disclosure of Invention
The following is a summary of the subject matter of the detailed description of the application. This summary is not intended to limit the scope of the claims.
The embodiments of the application provide a feature extraction method, a model training method, a device and an electronic device, which can improve the effect of self-supervised incremental training of a feature extraction model and improve the accuracy of the features extracted by the feature extraction model.
In one aspect, an embodiment of the present application provides a feature extraction method, including:
acquiring sample media data in a current incremental training batch, extracting sample features of the sample media data based on a feature extraction model, and determining a self-supervision loss of the feature extraction model according to the sample features;
determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
training the feature extraction model according to the self-supervision loss and the feature space reservation loss;
and acquiring target media data, and extracting target features of the target media data based on the trained feature extraction model.
In another aspect, an embodiment of the present application further provides a model training method, including:
acquiring sample media data in a current incremental training batch, extracting sample features of the sample media data based on a feature extraction model, and determining a self-supervision loss of the feature extraction model according to the sample features;
Determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
and training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
In another aspect, an embodiment of the present application further provides a feature extraction apparatus, including:
the first processing module is used for acquiring sample media data in the current incremental training batch, extracting sample features of the sample media data based on a feature extraction model, and determining the self-supervision loss of the feature extraction model according to the sample features;
the second processing module is used for determining a plurality of target cluster centers, dividing the plurality of target cluster centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
The first parameter adjustment module is used for training the feature extraction model according to the self-supervision loss and the feature space reservation loss;
and the third processing module is used for acquiring target media data and extracting target features of the target media data based on the trained feature extraction model.
Further, the first processing module is specifically configured to:
acquiring a first reference feature in a current incremental training batch, acquiring contrast data of the sample media data, and extracting contrast features of the contrast data based on the feature extraction model;
determining a first similarity between the sample feature and the first reference feature and a second similarity between the contrast feature and the first reference feature, determining a contrast loss according to the first similarity and the second similarity, and taking the contrast loss as the self-supervision loss of the feature extraction model.
Further, the first processing module is specifically configured to:
acquiring a plurality of second reference features in a previous incremental training batch, wherein the target clustering center is obtained by clustering the plurality of second reference features;
determining the target cluster center as a sampling center, determining a distribution variance of the second reference feature taking the target cluster center as a cluster center, and taking the distribution variance as a sampling variance;
and carrying out Gaussian random sampling according to the sampling center and the sampling variance to obtain a first reference feature in the current incremental training batch.
Further, the first parameter adjustment module is specifically configured to:
determining a clustering loss when clustering the plurality of second reference features;
and determining target loss according to the self-supervision loss, the feature space reservation loss and the clustering loss, and training the feature extraction model according to the target loss.
Further, the first parameter adjustment module is specifically configured to:
determining a plurality of original cluster centers, and determining a third distance between the second reference feature and each original cluster center;
determining a target distance with the largest numerical value from a plurality of third distances;
and determining a first quantity of the second reference features, and determining clustering loss when clustering a plurality of the second reference features based on the original clustering center according to the first quantity and the target distance.
Further, the second processing module is specifically configured to:
taking the distance between the sample media data and the closest target cluster center in the first center set as the first distance between the sample media data and the first center set;
taking the distance between the sample media data and the closest target cluster center in the second center set as the second distance between the sample media data and the second center set;
determining a second quantity of the sample media data and a difference between the first distance and the second distance, and determining the feature space reservation loss based on the second quantity and the difference.
Further, the first parameter adjustment module is specifically configured to:
creating a copy of the feature extraction model at the beginning of the current incremental training batch to obtain a copy model;
acquiring an initialized coding model, wherein the coding model is used for extracting features of the enhanced sample media data in the current incremental training batch;
taking the copy model and the coding model as a teacher network, taking the feature extraction model as a student network, and determining a distillation loss of the feature extraction model;
and determining a target loss according to the self-supervision loss, the feature space reservation loss and the distillation loss, and training the feature extraction model according to the target loss.
Further, the first parameter adjustment module is specifically configured to:
acquiring first enhancement data and second enhancement data obtained by enhancing the sample media data in different ways;
extracting first enhancement features of the first enhancement data based on the feature extraction model, extracting second enhancement features of the first enhancement data based on the copy model, determining a third similarity between the first enhancement features and the second enhancement features, and determining a first loss according to the third similarity;
extracting third enhancement features of the second enhancement data based on the coding model, determining a fourth similarity between the first enhancement features and the third enhancement features, and determining a second loss according to the fourth similarity;
determining a distillation loss of the feature extraction model from the first loss and the second loss.
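By way of illustration only, the distillation loss described above might be sketched in PyTorch-style code as follows; the cosine similarity measure, the loss form 1 − sim, the unweighted sum of the first loss and the second loss, and the omission of the mapping model are all assumptions, since the text does not fix them:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, copy_model, coding_model, first_enh_data, second_enh_data):
    """Hypothetical sketch: the copy model and the coding model act as the
    teacher network and the feature extraction model as the student network."""
    first_feats = student(first_enh_data)            # first enhancement features (student)
    with torch.no_grad():                            # teacher branches are not updated here
        second_feats = copy_model(first_enh_data)   # second enhancement features
        third_feats = coding_model(second_enh_data) # third enhancement features
    sim3 = F.cosine_similarity(first_feats, second_feats, dim=1)  # third similarity
    sim4 = F.cosine_similarity(first_feats, third_feats, dim=1)   # fourth similarity
    first_loss = (1 - sim3).mean()
    second_loss = (1 - sim4).mean()
    return first_loss + second_loss                  # distillation loss
```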
Further, the first parameter adjustment module is specifically configured to:
acquiring an initialized mapping model;
mapping the first enhancement feature to a feature space where the second enhancement feature is located based on the mapping model to obtain a mapping feature;
and regarding the similarity between the mapping feature and the second enhancement feature as a third similarity between the first enhancement feature and the second enhancement feature.
Further, the second processing module is specifically configured to:
determining a segmentation plane of the feature space according to the midpoint between the two target cluster centers farthest from each other;
and adjusting the segmentation plane by a preset rotation angle step until the difference between the numbers of target cluster centers on the two sides of the segmentation plane is smaller than or equal to a preset number threshold, taking the target cluster centers on one side of the segmentation plane as the first center set, and taking the target cluster centers on the other side as the second center set.
In another aspect, an embodiment of the present application further provides a model training apparatus, including:
a fourth processing module, configured to obtain sample media data in a current incremental training batch, extract sample features of the sample media data based on a feature extraction model, and determine self-supervision loss of the feature extraction model according to the sample features;
a fifth processing module, configured to determine a plurality of target cluster centers, divide a plurality of target cluster centers into a first center set and a second center set, determine a first distance between the sample media data and the first center set, and a second distance between the sample media data and the second center set, and determine a feature space reservation loss according to the first distance and the second distance;
And the second parameter adjustment module is used for training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
In another aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the feature extraction method or the model training method described above when executing the computer program.
In another aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the feature extraction method or the model training method described above.
In another aspect, embodiments of the present application also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the feature extraction method or the model training method described above.
The embodiments of the application include at least the following beneficial effects: a plurality of target cluster centers are determined and divided into a first center set and a second center set; a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set are determined; and a feature space reservation loss is determined according to the first distance and the second distance. The feature space reservation loss constrains the distribution of sample features, achieving the effect of feature space reservation and preventing feature conflicts for sample data newly added during incremental training. Finally, the feature extraction model is trained according to the self-supervision loss and the feature space reservation loss, so that the training effect on the feature extraction model is improved in scenarios combining incremental training with self-supervised training, and the accuracy of the extracted features is improved when the target features of the target media data are extracted by the trained feature extraction model.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification; they illustrate the application and do not limit it.
FIG. 1 is a schematic illustration of an alternative implementation environment provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of an alternative feature extraction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative computing architecture for contrast loss provided by embodiments of the present application;
FIG. 4 is a schematic diagram of an alternative feature space reservation provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative confirmation flow of a target cluster center according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an alternative method for obtaining a first reference feature according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative training architecture of a feature extraction model according to an embodiment of the present application;
FIG. 8 is a schematic illustration of an alternative calculation of distillation loss provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of another alternative training architecture of a feature extraction model provided by an embodiment of the present application;
FIG. 10 is a schematic flow chart of an alternative model training method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an alternative complete architecture of a feature extraction method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another alternative complete architecture of the feature extraction method according to the embodiment of the present application;
FIG. 13 is a schematic view of an alternative structure of a feature extraction device according to an embodiment of the application;
FIG. 14 is a schematic view of an alternative structure of a model training device according to an embodiment of the present application;
FIG. 15 is a partial block diagram of a terminal according to an embodiment of the present application;
FIG. 16 is a partial block diagram of a server according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the embodiments of the present application, when processing is performed on data related to the characteristics of a target object, such as its attribute information or attribute information set, the permission or consent of the target object is obtained first, and the collection, use and processing of the data comply with the relevant laws, regulations and standards of the relevant countries and regions; the target object may be a user. In addition, when an embodiment of the application needs to acquire the attribute information of a target object, the individual permission or consent of the target object is obtained through a pop-up window, a jump to a confirmation page, or the like; only after the individual permission or consent of the target object is explicitly obtained is the target-object-related data necessary for the normal operation of the embodiment acquired.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
Catastrophic forgetting: in the incremental training mode, where the task sequence is unlabeled, may switch randomly, and the same task may not reappear for a long time, the knowledge of previous tasks is suddenly lost when the model learns the current task.
SSL (Self-Supervised Learning): the data carries no annotation information; the supervision signal is derived from the content of the data itself.
Incremental training: IL (Incremental Learning); the learning system continually learns new knowledge from new samples while retaining the old knowledge learned before.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
In the related art, under supervised learning, labeling information can be used to constrain the representation of sample features of different categories, which alleviates the influence of feature conflicts to a certain extent. Under self-supervised training, however, the sample data carries no labeling information, so the feature conflict problem cannot be solved with labeling information; this affects the training effect of the feature extraction model and reduces the accuracy of the features it extracts.
Based on the above, the embodiments of the application provide a feature extraction method, a model training method, an apparatus and an electronic device, which can improve the effect of self-supervised incremental training of a feature extraction model and improve the accuracy of the features extracted by the feature extraction model.
Referring to fig. 1, fig. 1 is a schematic diagram of an alternative implementation environment provided in an embodiment of the present application, where the implementation environment includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are connected through a communication network.
For example, the server 102 may acquire sample media data in the current incremental training batch, extract sample features of the sample media data based on a feature extraction model, and determine the self-supervision loss of the feature extraction model according to the sample features. The server may then determine a plurality of target cluster centers, divide them into a first center set and a second center set, determine a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determine the feature space reservation loss according to the first distance and the second distance. The feature extraction model is trained according to the self-supervision loss and the feature space reservation loss; afterwards, the server receives the target media data sent by the terminal 101 and extracts the target features of the target media data based on the trained feature extraction model.
In addition, after training the feature extraction model, the server 102 may transmit the feature extraction model to the terminal 101, and the terminal 101 may apply the feature extraction model. Alternatively still, the terminal 101 obtains sample media data from the server 102, performs training and application of the feature extraction model locally at the terminal 101, and so on.
By determining a plurality of target cluster centers, dividing them into a first center set and a second center set, determining the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set, and determining the feature space reservation loss according to the first distance and the second distance, the server 102 can use the feature space reservation loss to constrain the distribution of sample features, achieving the effect of feature space reservation and preventing feature conflicts for sample data newly added during incremental training. Finally, the feature extraction model is trained according to the self-supervision loss and the feature space reservation loss, so that the training effect on the feature extraction model is improved in scenarios combining incremental training with self-supervised training, and the accuracy of the extracted features is improved when the target features of the target media data are extracted by the trained feature extraction model.
The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and basic cloud computing services such as big data and artificial intelligence platforms. In addition, the server 102 may also be a node server in a blockchain network.
The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, etc. The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, and embodiments of the present application are not limited herein.
The method provided by the embodiment of the application can be applied to various technical fields including, but not limited to, the technical fields of cloud technology, artificial intelligence and the like.
Referring to fig. 2, fig. 2 is a schematic flowchart of an alternative feature extraction method provided in an embodiment of the present application, where the feature extraction method may be performed by a terminal, or may be performed by a server, or may be performed by a terminal and a server in cooperation, and the feature extraction method includes, but is not limited to, the following steps 201 to 204.
Step 201: sample media data in the current incremental training batch is obtained, sample features of the sample media data are extracted based on the feature extraction model, and self-supervision loss of the feature extraction model is determined according to the sample features.
In the incremental training mode, a plurality of training batches are generally performed, and samples for training are updated as the training batches increase, so that the trained model can learn new knowledge.
In a possible implementation, the sample media data is a training sample in the current incremental training batch. After the sample features of the sample media data are extracted based on the feature extraction model, the self-supervision loss of the feature extraction model is determined according to the sample features. Specifically, a decoder may be introduced, where the processing of the decoder is the inverse of the feature extraction model: the sample features are input into the decoder to obtain recovered data, and the self-supervision loss of the feature extraction model is determined according to the difference between the sample media data and the recovered data, achieving the effect of self-supervised training. It will be appreciated that the difference between the sample media data and the recovered data may be measured by mean squared error or the like, which is not limited by the embodiments of the present application. In addition, a generative self-supervised training architecture, such as a generative adversarial network, may also be employed to determine the self-supervision loss of the feature extraction model.
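As an illustrative sketch of the decoder-based variant just described (not code from the application itself; the module names encoder and decoder and the use of mean squared error are assumptions):

```python
import torch.nn.functional as F

def reconstruction_ssl_loss(encoder, decoder, sample_batch):
    """Hypothetical sketch: the decoder inverts the feature extraction
    model, and the difference between the sample media data and the
    recovered data serves as the self-supervision loss (MSE here)."""
    sample_features = encoder(sample_batch)     # sample features
    recovered = decoder(sample_features)        # recovered data
    return F.mse_loss(recovered, sample_batch)  # self-supervision loss
```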
In addition, in one possible implementation, when determining the self-supervision loss of the feature extraction model according to the sample features, a first reference feature in the current incremental training batch may be acquired, contrast data of the sample media data may be acquired, and the contrast features of the contrast data may be extracted based on the feature extraction model; a first similarity between the sample feature and the first reference feature and a second similarity between the contrast feature and the first reference feature are determined, a contrast loss is determined according to the first similarity and the second similarity, and the contrast loss is taken as the self-supervision loss of the feature extraction model.
The first reference feature is a general feature used in the current incremental training batch to represent one category of sample data, and may also be referred to as a prototype variable. The first reference feature and the sample feature belong to the same feature space; the first reference feature may be preset, and each incremental training batch may use the same first reference features. Referring to fig. 3, fig. 3 is a schematic diagram of an alternative computing architecture of the contrast loss provided by an embodiment of the present application. The sample media data and the contrast data form a training data pair; their similarities to the first reference feature are computed, and the contrast loss is then determined from the first similarity and the second similarity, so that the feature extraction model can be trained. Because of the existence of the first reference feature, there is no need to annotate the sample media data with category information and the like, achieving the effect of self-supervised incremental training.
After obtaining the sample features of the two media data, the contrast loss could in fact be calculated directly from the sample feature and the contrast feature, but this focuses only on the local feature structure of the samples. In the embodiment of the application, the contrast loss is not calculated directly from the sample feature and the contrast feature; instead, by introducing the first reference feature, the first similarity between the sample feature and the first reference feature and the second similarity between the contrast feature and the first reference feature are determined respectively, and the contrast loss is determined according to the first similarity and the second similarity. In this way, differences in the local feature structure of the samples can be distinguished within the whole space, improving the robustness and transferability of the feature extraction model.
In one possible implementation, there are a plurality of first reference features; the calculated first similarity and second similarity may then be similarity vectors, whose vector elements are the similarities between the sample feature and the respective first reference features, or between the contrast feature and the respective first reference features. Determining the contrast loss according to the first similarity and the second similarity may consist in calculating the cross-entropy loss between the first similarity and the second similarity and taking the cross-entropy loss as the contrast loss.
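As an illustration of the similarity-vector view just described, a per-sample sketch might read as follows; the cosine similarity, the temperature tau, and the direction of the cross-entropy (contrast side as the target distribution) are assumptions not fixed by the text:

```python
import torch
import torch.nn.functional as F

def prototype_contrast_loss(sample_feat, contrast_feat, ref_feats, tau=0.1):
    """Hypothetical sketch of the contrast loss against the first
    reference features (prototype variables).

    sample_feat:   (D,) sample feature
    contrast_feat: (D,) contrast feature
    ref_feats:     (K, D) first reference features
    """
    # first similarity: similarity vector between the sample feature and each reference feature
    sim1 = F.cosine_similarity(sample_feat.unsqueeze(0), ref_feats, dim=1) / tau
    # second similarity: similarity vector between the contrast feature and each reference feature
    sim2 = F.cosine_similarity(contrast_feat.unsqueeze(0), ref_feats, dim=1) / tau
    # cross-entropy between the two similarity vectors is taken as the contrast loss
    target = F.softmax(sim2, dim=0).unsqueeze(0)       # soft target distribution
    return F.cross_entropy(sim1.unsqueeze(0), target)  # soft targets need PyTorch >= 1.10
```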
In one possible implementation, the sample media data and the contrast data may be two media data that are similar to each other, i.e., the similarity between the sample media data and the contrast data may be greater than or equal to a preset similarity threshold. In addition, the contrast data may be obtained by enhancing the sample media data, or the sample media data and the contrast data may be obtained by enhancing the same media data in different ways. The sample media data may be image data, text data or voice data, which is not limited by the embodiments of the present application; accordingly, the data type of the contrast data is the same as that of the sample media data.
For example, if the contrast data is obtained by enhancement in different ways: when the sample media data is image data, the contrast data can be obtained by transforming the same image data through rotation, cropping, flipping, enlarging, shrinking, and the like; when the sample media data is text data, the contrast data can be obtained by transforming the same text data through word replacement, word deletion, and the like; when the sample media data is voice data, the contrast data can be obtained by transforming the same voice data through speech frame replacement, speech frame deletion, noise addition, and the like.
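For the image case, a minimal sketch of constructing such a training data pair might look as follows; the specific torchvision transforms and their parameters are illustrative assumptions:

```python
import torchvision.transforms as T

# two different enhancement pipelines applied to the same image yield the
# sample media data and its contrast data
augment_a = T.Compose([T.RandomRotation(30), T.RandomResizedCrop(224)])
augment_b = T.Compose([T.RandomHorizontalFlip(p=1.0), T.RandomResizedCrop(224)])

def make_training_pair(image):
    """Returns (sample data, contrast data) derived from one image."""
    return augment_a(image), augment_b(image)
```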
In one possible implementation, the feature extraction model is used to map the sample media data or the contrast data and output the sample features of the sample media data or the contrast features of the contrast data. The specific structure of the feature extraction model may depend on the specific type of sample media data: when the sample media data or the contrast data is image data, the feature extraction model may mainly include structures such as convolution layers; when the sample media data or the contrast data is text data, the feature extraction model may mainly include structures such as Transformer layers; when the sample media data or the contrast data is voice data, the feature extraction model may mainly include structures such as filtering layers. Of course, the embodiments of the present application do not limit the specific structure of the feature extraction model.
Step 202: determining a plurality of target cluster centers, dividing the plurality of target cluster centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining the feature space reservation loss according to the first distance and the second distance.
The target cluster centers and the sample features are located in the same feature space. The target cluster centers can be preset; when setting them, the distance between different target cluster centers can be used as a criterion so that the plurality of target cluster centers are uniformly distributed in the feature space. In addition, the target cluster centers can be obtained by clustering a plurality of first reference features, with the first reference features kept unchanged across different incremental training batches.
In a possible implementation, when the plurality of target cluster centers are divided into a first center set and a second center set, a random division may be adopted; both the first center set and the second center set may contain a plurality of target cluster centers, and the numbers of target cluster centers in the two sets may be equal or unequal. The first distance between the sample media data and the first center set measures the similarity between the sample media data and the features of the target cluster centers in the first center set. When calculating this first distance, the target cluster centers in the first center set can be clustered to obtain a set center of the first center set, and the similarity between the sample media data and the feature of the set center is used as the first distance. Alternatively, the similarity between the sample feature and the feature of each target cluster center in the first center set may be calculated, and the average of these similarities used as the first distance. Alternatively, the similarity between the sample feature and the feature of the closest target cluster center in the first center set may be calculated and used as the first distance between the sample media data and the first center set.
Similarly, the principle of calculating the second distance between the sample media data and the second center set is the same as that of calculating the first distance, and will not be described herein. The first distance or the second distance may be calculated using cosine similarity, Euclidean distance, Manhattan distance, or the like, which is not limited by the embodiments of the present application.
The feature space reservation loss is determined from the first distance and the second distance, and is used to constrain the size relation between them so that the first distance is larger than the second distance, or the second distance is larger than the first distance. Therefore, when the feature space reservation loss converges, sample features tend to be distributed in the part of the feature space close to the first center set or the second center set, which reduces the feature space occupied by the sample features obtained in the current incremental training batch, reserves feature space for subsequent incremental training batches, and reduces the probability of feature conflicts.
In one possible implementation, when determining the feature space reservation loss according to the first distance and the second distance, specifically, the distance between the sample media data and the closest target cluster center in the first center set may be used as the first distance between the sample media data and the first center set; the distance between the sample media data and the closest target cluster center in the second center set may be used as the second distance between the sample media data and the second center set; and a second quantity of the sample media data and the difference between the first distance and the second distance are determined, and the feature space reservation loss is determined based on the second quantity and the difference.
Specifically, this can be expressed as:

$$L_{reserve}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{ReLU}\Big(\lambda\max_{c_j\in C_2}\mathrm{sim}\big(f_i,\mu_{c_j}\big)-\max_{c_k\in C_1}\mathrm{sim}\big(f_i,\mu_{c_k}\big)\Big)$$

where $L_{reserve}$ represents the feature space reservation loss, $N$ represents the second quantity, $C_1$ represents the first center set and $C_2$ the second center set, $f_i$ represents a sample feature, $c_k$ represents a target cluster center in the first center set and $\mu_{c_k}$ its feature, $c_j$ represents a target cluster center in the second center set and $\mu_{c_j}$ its feature, $\mathrm{sim}(f_i,\mu_{c_k})$ and $\mathrm{sim}(f_i,\mu_{c_j})$ represent the distances between the sample media data and the target cluster centers in the first and second center sets respectively, $\max$ selects the nearest center (highest similarity), $\mathrm{ReLU}$ is an activation function, $i$, $j$ and $k$ are positive integers, and $\lambda$ is a preset coefficient.
Referring to FIG. 4, FIG. 4 is an alternative schematic diagram of feature space reservation according to an embodiment of the present application, where $Sim_1$ represents the distance between the sample media data and the nearest target cluster center in the first center set (i.e., the first distance), and $Sim_2$ represents the distance between the sample media data and the nearest target cluster center in the second center set (i.e., the second distance). The feature space reservation loss is calculated by introducing the difference between the first distance and the second distance; when the feature space reservation loss converges, $Sim_1 \geq \lambda Sim_2$ is satisfied, so that the sample features tend to be distributed in the part of the feature space close to the first center set. At the same time, the calculation of the feature space reservation loss is effectively simplified, improving the training efficiency of the feature extraction model.
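An illustrative sketch of this loss in PyTorch-style code follows; the cosine-style similarity and the default value of lambda are assumptions not fixed by the text:

```python
import torch
import torch.nn.functional as F

def feature_space_reserve_loss(feats, centers1, centers2, lam=0.5):
    """Hypothetical sketch of the feature space reservation loss.

    feats:    (N, D) sample features (N is the second quantity)
    centers1: (K1, D) features of target cluster centers in the first center set
    centers2: (K2, D) features of target cluster centers in the second center set
    """
    f = F.normalize(feats, dim=1)
    sim1 = (f @ F.normalize(centers1, dim=1).T).max(dim=1).values  # first distance: nearest center in C1
    sim2 = (f @ F.normalize(centers2, dim=1).T).max(dim=1).values  # second distance: nearest center in C2
    # when this loss converges, Sim1 >= lam * Sim2 holds for every sample
    return F.relu(lam * sim2 - sim1).mean()
```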
In one possible implementation, when the plurality of target cluster centers are divided into the first center set and the second center set, a segmentation plane of the feature space may be determined according to the midpoint between the two target cluster centers farthest from each other; the segmentation plane is then adjusted by a preset rotation angle step until the difference between the numbers of target cluster centers on the two sides of the segmentation plane is smaller than or equal to a preset number threshold, and the target cluster centers on one side of the segmentation plane are taken as the first center set and those on the other side as the second center set.
Specifically, determining the segmentation plane of the feature space according to the midpoint between the two farthest target cluster centers gives an initial division of the feature space. On this basis, the segmentation plane is further adjusted by the preset rotation angle step until the difference between the numbers of target cluster centers on its two sides is smaller than or equal to the preset number threshold, at which point the target cluster centers are distributed relatively evenly on the two sides. The target cluster centers on one side of the segmentation plane are then taken as the first center set and those on the other side as the second center set, which improves the uniformity of the division of the target cluster centers.
It is understood that the rotation angle step may be set according to practical situations, for example, may be set to 5 degrees, 10 degrees, etc., which is not limited in the embodiment of the present application.
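A sketch of this division procedure is given below. Since rotating a hyperplane in a high-dimensional feature space is underspecified, the rotation here is performed in the 2-D plane spanned by the initial normal and a random orthogonal direction, which is one possible reading of "adjusting by a rotation angle step"; this interpretation is an assumption:

```python
import torch

def split_centers(centers, angle_step_deg=5.0, max_count_gap=1):
    """Hypothetical sketch of dividing the target cluster centers with a
    rotating segmentation plane. centers: (K, D) center features."""
    # initial plane: through the midpoint of the two farthest centers,
    # with the line between them as its normal
    d2 = torch.cdist(centers, centers)
    flat = torch.argmax(d2)
    i, j = flat // centers.shape[0], flat % centers.shape[0]
    midpoint = (centers[i] + centers[j]) / 2
    normal = centers[i] - centers[j]
    normal = normal / normal.norm()

    # random direction orthogonal to the initial normal
    rand = torch.randn_like(normal)
    ortho = rand - (rand @ normal) * normal
    ortho = ortho / ortho.norm()

    side = (centers - midpoint) @ normal > 0
    for step in range(int(360 / angle_step_deg)):
        theta = torch.deg2rad(torch.tensor(step * angle_step_deg))
        n = torch.cos(theta) * normal + torch.sin(theta) * ortho
        side = (centers - midpoint) @ n > 0
        # stop once the numbers of centers on the two sides are balanced
        if abs(int(side.sum()) - int((~side).sum())) <= max_count_gap:
            break
    return centers[side], centers[~side]   # first center set, second center set
```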
Step 203: training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
In one possible implementation, after the feature space reservation loss and the self-supervision loss are determined, the feature extraction model may be trained directly on these two losses. Alternatively, a target loss may be determined from the feature space reservation loss and the self-supervision loss, and the feature extraction model trained according to the target loss. When determining the target loss, the feature space reservation loss and the self-supervision loss may be weighted and summed, which can be expressed as:

$$L_{total} = L_{ssl} + \beta L_{reserve}$$

where $L_{total}$ represents the target loss, $L_{ssl}$ represents the self-supervision loss (which may be, for example, the aforementioned contrast loss), $L_{reserve}$ represents the feature space reservation loss, and $\beta$ is the weight of the feature space reservation loss, which can be set according to the actual situation.
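Putting the pieces together, a training step for the current incremental batch might look as follows; this sketch reuses the hypothetical prototype_contrast_loss and feature_space_reserve_loss functions from the earlier illustrations, and the value of beta is an assumption:

```python
import torch

def train_step(encoder, optimizer, samples, contrasts, ref_feats, c1, c2, beta=0.1):
    """One hypothetical optimization step on the target loss
    L_total = L_ssl + beta * L_reserve."""
    feats = encoder(samples)       # (N, D) sample features
    cfeats = encoder(contrasts)    # (N, D) contrast features
    loss_ssl = torch.stack([
        prototype_contrast_loss(f, c, ref_feats) for f, c in zip(feats, cfeats)
    ]).mean()                      # self-supervision (contrast) loss
    loss_reserve = feature_space_reserve_loss(feats, c1, c2)
    loss_total = loss_ssl + beta * loss_reserve
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return loss_total.item()
```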
Step 204: acquiring target media data, and extracting target features of the target media data based on the trained feature extraction model.
The target media data is the media data whose features are to be extracted. Extracting the target features of the target media data based on the trained feature extraction model can be applied in a variety of scenarios.
For example, when the target media data is image data, a user may input the image data through the terminal for image classification; the terminal sends the image data to the server, and the server extracts the target features of the image data based on the trained feature extraction model, obtains the image classification result according to the target features, and sends the result to the terminal for display. Alternatively, the user may input the image data through the terminal for image retrieval; the terminal sends the image data to the server, and the server extracts the target features based on the trained feature extraction model, matches other image data similar to the input according to the target features, and sends the retrieval result to the terminal for display. Alternatively, after the terminal itself extracts the target features of the image data based on the trained feature extraction model, it may match the corresponding image retrieval result in a local database according to the target features.
For another example, when the target media data is text data, the server may acquire text data historically browsed on the terminal, and after extracting the target features of the text data based on the trained feature extraction model, match other text data similar to it according to the target features and push that text data to the terminal as recommended content.
For another example, when the target media data is voice data, the terminal may send the voice data to the server, and after extracting the target features of the voice data based on the trained feature extraction model, the server converts the voice data into text data according to the target features and sends the text data to the terminal.
It will be appreciated that the different scenarios described above are merely illustrative, and are not limiting of the application scenarios for the target features extracted via the feature extraction model.
According to the embodiments of the application, the plurality of target cluster centers are divided into a first center set and a second center set, the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set are determined, and the feature space reservation loss is determined according to the first distance and the second distance. The feature space reservation loss constrains the distribution of sample features, achieving the effect of feature space reservation and preventing feature conflicts for sample data newly added during incremental training. The feature extraction model is then trained according to the self-supervision loss and the feature space reservation loss, which improves the training effect on the feature extraction model in scenarios combining incremental training with self-supervised training, and improves the accuracy of the extracted features when the target features of the target media data are extracted by the trained feature extraction model.
In one possible implementation, based on the architecture shown in fig. 3, the reference features can be updated in each incremental training batch, i.e., the first reference features in the current incremental training batch are not fixed, which promotes the diversity of sample features. In this case, when acquiring the first reference features in the current incremental training batch, a plurality of second reference features in the previous incremental training batch may be acquired; the target cluster center is determined as the sampling center; the distribution variance of the second reference features, with the target cluster center as their cluster center, is determined and taken as the sampling variance; and Gaussian random sampling is performed according to the sampling center and the sampling variance to obtain the first reference features in the current incremental training batch.
The target cluster center is obtained by clustering the plurality of second reference features: a plurality of original cluster centers can be initialized, the second reference features are clustered according to the original cluster centers to obtain a plurality of clusters, and the cluster center of each cluster is used as a target cluster center.
In this case, the target cluster center in the current incremental training batch is determined based on the previous incremental training batch, and the second reference feature is a reference feature in the previous incremental training batch; its principle is similar to that of the first reference feature, namely characterizing general features of the sample data. The original cluster centers are initial cluster centers and can be obtained by random initialization; their number can also be determined according to the actual situation.
For example, referring to fig. 5, fig. 5 is an optional confirmation flow diagram of a target cluster center provided by an embodiment of the present application. Assume the number of original cluster centers is three, namely cluster center A, cluster center B and cluster center C, and the number of second reference features is nine, namely category a1, category a2, category b1, category b2, category b3, category c1, category c2, category c3 and category c4. The second reference features are then clustered: category a1 and category a2 finally form a cluster whose cluster center is cluster center A; category b1, category b2 and category b3 form a cluster whose cluster center is cluster center B; and category c1, category c2, category c3 and category c4 form a cluster whose cluster center is cluster center C. After each cluster center is obtained, Gaussian random sampling can be performed based on each cluster center, so that the first reference features in the current incremental training batch are obtained, namely categories a3 and a4, categories b5 and b6, and categories c5, c6, c7 and c8.
In a possible implementation manner, referring to fig. 6, fig. 6 is a schematic flow chart of an alternative procedure for obtaining a first reference feature provided in an embodiment of the present application. When training starts, a queue is first set up, and the second reference features of the previous incremental training batch may be stored in the queue. After entering the current training batch, the second reference features of the previous incremental training batch are read from the queue, the target cluster center of the second reference features is taken as the sampling center, the distribution variance of the second reference features is taken as the sampling variance, and Gaussian random sampling is performed to obtain the first reference features, which replace the original second reference features as these are dequeued. By analogy, in the next incremental training batch, the target cluster center of the first reference features in the current incremental training batch is taken as the sampling center, and Gaussian random sampling is performed in a similar manner to obtain third reference features, which are used to calculate the contrast loss in the next incremental training batch.
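A minimal sketch of this queue-and-resampling step, reusing the hypothetical helper above (Gaussian sampling with each target cluster center as the mean and the per-cluster distribution variance as the variance):

```python
import numpy as np

def resample_reference_features(queue_feats, centers, assign, seed=0):
    """Replace the dequeued second reference features with first reference
    features drawn by Gaussian random sampling around each target cluster center."""
    rng = np.random.default_rng(seed)
    new_feats = []
    for j, center in enumerate(centers):
        members = queue_feats[assign == j]
        if len(members) == 0:
            continue  # an empty cluster contributes no samples
        var = members.var(axis=0)  # distribution variance of this cluster
        # One sample per original member keeps the queue length unchanged.
        new_feats.append(rng.normal(loc=center, scale=np.sqrt(var + 1e-8), size=members.shape))
    return np.concatenate(new_feats, axis=0)
```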
Therefore, the first reference feature in the current incremental training batch is obtained by carrying out Gaussian random sampling according to the sampling center and the sampling variance, the updating effect of the reference feature in each incremental training batch can be achieved, and the reference feature in each incremental training batch is randomly distributed according to the sampling center and the sampling variance, so that the representation of the reference feature can be effectively enriched, the number of categories indicated by the reference feature is increased, and the accuracy of contrast loss is improved.
In addition, as the first reference feature is updated along with the incremental training batches, the target cluster center is updated along with each incremental training batch, and compared with a mode of presetting a fixed target cluster center, the feature space reservation loss can be more flexible and reliable.
It can be understood that clustering the plurality of second reference features to obtain the target cluster centers may actually involve multiple clustering operations, during which the cluster centers keep changing from the original cluster centers. Based on this, a clustering loss may be introduced when the plurality of second reference features are clustered according to the original cluster centers, and the original cluster centers at which the clustering loss converges are the target cluster centers. Accordingly, when the feature extraction model is trained according to the self-supervision loss and the feature space reservation loss, specifically, the clustering loss when clustering the plurality of second reference features can be determined, the target loss is determined according to the self-supervision loss, the feature space reservation loss and the clustering loss, and the feature extraction model is trained according to the target loss.
By introducing the clustering loss, the accuracy of clustering the plurality of second reference features can be improved. Meanwhile, when determining the target loss, the clustering loss is added on the basis of the feature space reservation loss, so that the target loss is calculated with greater diversity; this improves the reliability of the target loss and, in turn, the training effect of the feature extraction model.
Based on this, the target loss can also be expressed as:

$L_{total} = L_{ssl} + \alpha L_{cluster} + \beta L_{reserve}$

where $L_{total}$ denotes the target loss, $L_{ssl}$ denotes the self-supervision loss (which may be, for example, the aforementioned contrast loss), $L_{reserve}$ denotes the feature space reservation loss, $L_{cluster}$ denotes the clustering loss, $\alpha$ is the weight of the clustering loss, and $\beta$ is the weight of the feature space reservation loss; the weights can be set according to actual conditions.
In one possible implementation manner, when determining the cluster loss when clustering the plurality of second reference features, a plurality of original cluster centers may be specifically determined, and a third distance between the second reference features and each original cluster center is determined; determining a target distance with the largest numerical value from the plurality of third distances; determining a first number of second reference features, and determining a clustering loss when clustering the plurality of second reference features based on the original clustering center according to the first number and the target distance.
The clustering loss can be expressed specifically as:

$L_{cluster} = -\dfrac{1}{n_p}\displaystyle\sum_{i=1}^{n_p}\max_{1\le j\le n_c}\left\langle f_i, c_j\right\rangle$

where $L_{cluster}$ denotes the clustering loss, $n_p$ denotes the first number, $f_i$ denotes a second reference feature, $c_j$ denotes an original cluster center, $\max$ selects the nearest distance (highest similarity), $\langle f_i, c_j\rangle$ denotes a third distance, $n_c$ denotes the number of original cluster centers, and $i$ and $j$ are positive integers.
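Under the reconstruction above, a sketch of this computation (cosine similarity as the "third distance" is an assumption; the helper name is hypothetical):

```python
import torch
import torch.nn.functional as F

def clustering_loss(second_ref_feats, original_centers):
    """L_cluster: average, over the n_p second reference features, of the highest
    similarity to any of the n_c original cluster centers; negated so that
    minimizing the loss pulls each feature toward its nearest center."""
    feats = F.normalize(second_ref_feats, dim=1)    # (n_p, d)
    centers = F.normalize(original_centers, dim=1)  # (n_c, d)
    sims = feats @ centers.t()                      # third distances, shape (n_p, n_c)
    return -sims.max(dim=1).values.mean()           # target distance per feature, averaged
```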
In one possible implementation manner, in order to unify the number of target cluster centers across different incremental training stages, the number of original cluster centers may be fixed. Even with a fixed number of original cluster centers, the clustering loss is still calculated in the above manner, and empty cluster centers or overlapping cluster centers are permitted when clusters are formed during clustering. This makes the method suitable for self-supervised training where the number of categories is uncertain, improves the flexibility of training, and enlarges its range of application.
As incremental batches increase, the amount of sample media data gradually grows. When there is not enough storage space to keep the sample media data of historical incremental training batches, the feature extraction network trained in those historical batches is used to extract sample features of the sample media data in the current incremental training batch; since that network has not seen the newly added sample media data, the extracted features are biased, and the problem of catastrophic forgetting often occurs, which affects the training effect of the feature extraction model. Based on this, in one possible implementation, when training the feature extraction model according to the self-supervision loss and the feature space reservation loss, a copy of the feature extraction model may be created at the beginning of the current incremental training batch to obtain a replica model; an initialized coding model is acquired; the replica model and the coding model are taken as a teacher network and the feature extraction model as a student network, and the distillation loss of the feature extraction model is determined; a target loss is then determined according to the self-supervision loss, the feature space reservation loss and the distillation loss, and the feature extraction model is trained according to the target loss.
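Creating the replica at the start of the current incremental training batch can be sketched as follows (a PyTorch module and a frozen teacher are assumptions consistent with its role):

```python
import copy
import torch.nn as nn

def make_replica(feature_extractor: nn.Module) -> nn.Module:
    """Snapshot the feature extraction model at the start of the current
    incremental training batch to obtain the replica (teacher) model."""
    replica = copy.deepcopy(feature_extractor)
    replica.eval()
    for p in replica.parameters():
        p.requires_grad = False  # the replica is not updated during this batch
    return replica
```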
Specifically, referring to fig. 7, fig. 7 is a schematic diagram of an alternative training architecture of a feature extraction model provided in an embodiment of the present application, where in a current incremental training batch, since a replica model is created at the beginning of the current incremental training batch, that is, the replica model is a feature extraction model in a previous incremental training batch, the feature extraction model is distilled by the replica model, and the replica model can provide distillation information to the feature extraction model in the current incremental training batch at the feature representation level, so that when feature extraction is performed on the same sample media data by using the replica model and the feature extraction model, the mapped regions in the feature space are similar.
On the basis, a coding model is further introduced, wherein the coding model is used for extracting the characteristics of the enhanced sample media data in the current incremental training batch, the coding model is only used in the current incremental training batch, random initialization is carried out in each incremental training batch, and the coding model is trained together with the characteristic extraction model. Because the coding model is used for extracting features of the enhanced sample media data, the coding model can provide distillation information for the feature extraction model in the current incremental training batch at the feature relation level, so that when the sample media data before and after enhancement or the sample media data obtained by different enhancement modes are subjected to feature extraction, the mapped areas in the feature space are similar. Wherein the distillation loss may be divided into two parts, one part being a first loss between the feature extraction model and the replica model and the other part being a second loss between the feature extraction model and the coding model, e.g. the distillation loss may be the sum of the first loss and the second loss.
Therefore, even if there is not enough storage space to store sample media data in the historical incremental training batch, the training architecture of the feature extraction model provided by the embodiment of the application can not only perform distillation training on the feature representation level through the replica model, but also perform distillation training on the feature relation level through the coding model, so that the problem of catastrophic forgetting can be relieved to a certain extent, and the reliability of the training feature extraction model is improved.
Based on this, the target loss can be expressed specifically as:

$L_{total} = L_{ssl} + \beta L_{reserve} + \gamma L_{distill}$

where $L_{total}$ denotes the target loss, $L_{ssl}$ denotes the self-supervision loss (which may be, for example, the aforementioned contrast loss), $L_{reserve}$ denotes the feature space reservation loss, $L_{distill}$ denotes the distillation loss, $\gamma$ is the weight of the distillation loss, and $\beta$ is the weight of the feature space reservation loss; the weights can be set according to practical conditions.
In a possible implementation manner, referring to fig. 8, fig. 8 is a schematic diagram of an alternative calculation flow of distillation loss provided by an embodiment of the present application, and when determining distillation loss of a feature extraction model, specifically, first enhancement data and second enhancement data obtained after enhancement of sample media data based on different manners may be obtained; extracting first enhancement features of the first enhancement data based on the feature extraction model, extracting second enhancement features of the first enhancement data based on the replica model, determining a third similarity between the first enhancement features and the second enhancement features, and determining a first loss according to the third similarity; extracting third enhancement features of the second enhancement data based on the coding model, determining a fourth similarity between the first enhancement features and the third enhancement features, and determining a second loss according to the fourth similarity; determining a distillation loss of the feature extraction model based on the first loss and the second loss.
Similar to the contrast data, when the sample media data is image data, the first enhancement data or the second enhancement data can be obtained by transforming the same image data through rotation, cropping, flipping, enlarging, shrinking and the like. When the sample media data is text data, the first enhancement data or the second enhancement data can be obtained by transforming the same text data through word replacement, word deletion and the like. When the sample media data is voice data, the first enhancement data or the second enhancement data can be obtained by transforming the same voice data through voice frame replacement, voice frame deletion, noise addition and the like. Of course, the enhancement modes used to obtain the first enhancement data and the second enhancement data are different; for example, if the first enhancement data is obtained by rotation, the second enhancement data may be obtained by cropping.
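For image-type sample media data, the two differently enhanced views might be produced as in this sketch (torchvision is one possible choice; the specific transform parameters are placeholders):

```python
from torchvision import transforms

# Two distinct enhancement pipelines, e.g. rotation-based for the first
# enhancement data and crop/flip-based for the second enhancement data.
augment_first = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.ToTensor(),
])
augment_second = transforms.Compose([
    transforms.RandomResizedCrop(size=224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# first_enhanced, second_enhanced = augment_first(image), augment_second(image)
```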
It can be appreciated that under the architecture shown in fig. 3, multiplexing can be performed between the contrast data and the first enhancement data or the second enhancement data, so as to improve the efficiency of data processing.
Specifically, the distillation loss described above can be expressed as:

$L_{distill} = L_1 + L_2,\qquad L_1 = -\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\operatorname{sim}\!\left(f_i^{t}, f_i^{t-1}\right),\qquad L_2 = -\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\log p_{i,i}$

where $L_1$ denotes the first loss, $L_2$ denotes the second loss, $N$ denotes the number of sample media data, $f_i^{t}$ denotes a first enhancement feature, $f_i^{t-1}$ denotes a second enhancement feature, $\operatorname{sim}(f_i^{t}, f_i^{t-1})$ denotes a third similarity, $p_{i,k}$ denotes a fourth similarity, and $i$, $k$ and $t$ are positive integers.
The fourth similarity may be a probability distribution obtained through a Softmax function, which may be specifically expressed as:

$p_{i,k} = \dfrac{\exp\!\left(f_i \cdot f_k' / \tau\right)}{\sum_{j}\exp\!\left(f_i \cdot f_j' / \tau\right)}$

where $p_{i,k}$ denotes the fourth similarity, $f_i$ denotes a first enhancement feature, $f_k'$ denotes a third enhancement feature, $\tau$ is a temperature parameter, and $i$, $j$ and $k$ are positive integers.
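Combining the two parts, a hedged sketch of the distillation loss (cosine similarity for the first loss and the diagonal of the temperature-scaled softmax for the second are assumptions consistent with the reconstruction above; in the fuller architecture of fig. 9, the mapped feature would replace the raw first enhancement feature in the first loss):

```python
import torch
import torch.nn.functional as F

def distillation_loss(first_feats, replica_feats, coder_feats, tau=0.1):
    """first_feats:   first enhancement features from the student, (N, d)
    replica_feats: second enhancement features from the replica teacher, (N, d)
    coder_feats:   third enhancement features from the coding model, (N, d)"""
    # First loss: third similarity between first and second enhancement features.
    loss_first = 1 - F.cosine_similarity(first_feats, replica_feats, dim=1).mean()
    # Fourth similarity p_{i,k}: softmax over temperature-scaled dot products.
    logits = F.normalize(first_feats, dim=1) @ F.normalize(coder_feats, dim=1).t() / tau
    log_p = logits.log_softmax(dim=1)
    # Second loss: each sample should match its own differently-enhanced view.
    loss_second = -log_p.diag().mean()
    return loss_first + loss_second
```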
In a possible implementation manner, referring to fig. 9, fig. 9 is a schematic diagram of another alternative training architecture of a feature extraction model provided by an embodiment of the present application, and a mapping model is further introduced on the basis of the architecture shown in fig. 7, where when determining a third similarity between a first enhancement feature and a second enhancement feature, an initialized mapping model may be specifically obtained, the first enhancement feature is mapped to a feature space where the second enhancement feature is located based on the mapping model, so as to obtain a mapping feature, and the similarity between the mapping feature and the second enhancement feature is used as the third similarity between the first enhancement feature and the second enhancement feature.
Specifically, after the first enhancement feature is extracted by the feature extraction model, the similarity between the first enhancement feature and the second enhancement feature is not determined directly, because training in this manner is equivalent to forcing the first enhancement feature and the second enhancement feature to be similar, which may limit the feature extraction model's ability to learn new knowledge to some extent. Therefore, a mapping model can be further introduced: the first enhancement feature is mapped to the feature space where the second enhancement feature is located based on the mapping model to obtain a mapping feature, and the similarity between the mapping feature and the second enhancement feature is used as the third similarity between the first enhancement feature and the second enhancement feature. In this case, when the first loss converges, the mapping model can map the first enhancement feature well into the feature space where the second enhancement feature is located, which indicates that the features extracted by the feature extraction model have good representation performance. Therefore, by further introducing the mapping model and calculating the third similarity through it, the range of knowledge the feature extraction model can learn is expanded, and the training effect of the feature extraction model is improved.
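The mapping model can thus be a small trainable projector, reminiscent of a BYOL-style predictor (a sketch; the two-layer MLP shape and hidden width are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class MappingModel(nn.Module):
    """Maps first enhancement features into the feature space of the second
    enhancement features; initialized afresh and trained jointly with the student."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

# The third similarity is then taken between the mapped feature and the
# replica feature rather than between the raw features:
#   mapped = mapper(first_feats)
#   third_sim = F.cosine_similarity(mapped, replica_feats, dim=1)
```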
In one possible implementation, since distillation loss is introduced, the network structure of the feature extraction model in the current incremental training batch, which serves as the student network, may be pruned when training the feature extraction model according to the target loss. For example, when the feature extraction model includes convolution layers, the number of convolution layers of the feature extraction model in the current incremental training batch may be reduced; when the feature extraction model includes Transformer layers, the number of Transformer layers may be reduced; when the feature extraction model includes filter layers, the number of filter layers may be reduced; and so on. Pruning the network structure of the feature extraction model achieves the distillation effect while simplifying the network structure of the feature extraction model.
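As one illustration, pruning a Transformer-based student may simply instantiate it with fewer layers than the teacher (layer counts and dimensions below are placeholders):

```python
import torch.nn as nn

# The teacher keeps the full depth; the student (the feature extraction model
# in the current incremental training batch) is built with fewer Transformer layers.
teacher = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=12
)
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=8
)
```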
Referring to fig. 10, fig. 10 is an optional flowchart of a model training method according to an embodiment of the present application. The model training method may be performed by a terminal, by a server, or by a terminal and a server in combination, and includes, but is not limited to, the following steps 1001 to 1003.
Step 1001: acquiring sample media data in a current incremental training batch, extracting sample characteristics of the sample media data based on a characteristic extraction model, and determining self-supervision loss of the characteristic extraction model according to the sample characteristics;
step 1002: determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
step 1003: and training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
The model training method and the feature extraction method are based on the same inventive concept. A plurality of target clustering centers are determined and divided into a first center set and a second center set, a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set are determined, and the feature space reservation loss is determined according to the first distance and the second distance. The feature space reservation loss can constrain the distribution of sample features, achieving the effect of reserving feature space and preventing feature conflicts for sample data newly added during incremental training. Finally, the feature extraction model is trained according to the self-supervision loss and the feature space reservation loss, which improves the training effect on the feature extraction model in scenarios combining incremental training with self-supervised training.
The principles of steps 1001 to 1003 are similar to those of steps 201 to 203, and may be specifically explained with reference to the foregoing principles, which are not repeated herein.
The effect of the model training method provided by the embodiment of the application is described below by taking a classification task as an example. Referring to table 1, table 1 shows the classification accuracy comparison data provided by the embodiment of the application: the feature extraction model trained by the model training method of the embodiment of the application is compared with a feature extraction model trained by a common self-supervised training method in the related art, in 5-task experiments on the CIFAR100 and ImageNet100 data sets, with the top-1 classification accuracy after linear probing as the evaluation index. It can be seen that the accuracy of the feature extraction model trained by the model training method of the embodiment of the application is significantly higher than that of the common self-supervised training method in the related art.
Method                                                        CIFAR100    ImageNet100
Common self-supervision training method in the related art    57.8        66.0
Model training method of the embodiment of the application    64.15       67.92
The principle of the feature extraction method provided by the embodiment of the present application will be described in detail below with an example.
The embodiment of the application can be widely applied to various common application scenarios. For example, in classification scenarios, data of all categories often cannot be acquired at the same time, the data volume is large, and the cost of manual labeling is high. Consider a server receiving online data in real time: it only acquires data of some categories in each period of time, and the data volume is large. When data of new categories arrives after the feature extraction model has been trained, retraining on all past data is very costly, or all past data cannot even be kept due to limited storage space; moreover, when the data volume is very large, the cost of manual annotation is considerable.
Referring to fig. 11, fig. 11 is an optional complete architecture schematic diagram of a feature extraction method according to an embodiment of the present application, specifically:
when the current incremental training batch starts, a copy of the feature extraction model is created to obtain a replica model; meanwhile, the second reference features of the previous incremental training batch are stored in a queue. The second reference features are first acquired from the queue and clustered, and the clustering loss when clustering the plurality of second reference features is determined, denoted $L_{cluster}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
The second reference features are clustered to obtain a plurality of target cluster centers. Each target cluster center is taken as a sampling center, the distribution variance of the second reference features having that target cluster center as their cluster center is determined and taken as the sampling variance, and Gaussian random sampling is performed according to the sampling center and the sampling variance to obtain the first reference features of the current incremental training batch, which replace the original second reference features in the queue.
Next, a sample data pair is obtained, which comprises sample media data and corresponding contrast data, and the sample media data is subjected to data enhancement to obtain first enhancement data. Contrast features of the contrast data are extracted based on the feature extraction model, a first similarity between the sample features and the first reference features and a second similarity between the contrast features and the first reference features are determined, and the contrast loss is determined according to the first similarity and the second similarity. The contrast loss is taken as the self-supervision loss of the feature extraction model, denoted $L_{ssl}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
Next, the plurality of target cluster centers obtained by clustering the plurality of second reference features are divided into a first center set and a second center set, the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set are determined, and the feature space reservation loss is determined according to the first distance and the second distance, denoted $L_{reserve}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
Next, an initialized coding model and an initialized mapping model are obtained; the replica model and the coding model are taken as the teacher network, and the feature extraction model as the student network. Another, different data enhancement is applied to the sample media data to obtain second enhancement data. First enhancement features of the first enhancement data are extracted based on the feature extraction model, and second enhancement features of the first enhancement data are extracted based on the replica model; the first enhancement features are mapped to the feature space where the second enhancement features are located based on the mapping model to obtain mapping features, a third similarity between the mapping features and the second enhancement features is determined, and the first loss is determined according to the third similarity. Third enhancement features of the second enhancement data are extracted based on the coding model, a fourth similarity between the first enhancement features and the third enhancement features is determined, and the second loss is determined according to the fourth similarity. The distillation loss of the feature extraction model is determined from the first loss and the second loss, denoted $L_{distill}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
The target loss is determined according to the clustering loss, the self-supervision loss, the feature space reservation loss and the distillation loss, and the parameters of the feature extraction network, the coding model and the mapping model are adjusted according to the target loss, where the target loss can be expressed as:

$L_{total} = L_{ssl} + \alpha L_{cluster} + \beta L_{reserve} + \gamma L_{distill}$

where $L_{total}$ denotes the target loss, $L_{ssl}$ the self-supervision loss, $L_{cluster}$ the clustering loss, $L_{reserve}$ the feature space reservation loss, and $L_{distill}$ the distillation loss; $\alpha$, $\beta$ and $\gamma$ are the weights of the clustering loss, the feature space reservation loss and the distillation loss respectively, and can be set according to actual conditions.
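A single optimization step over this combined objective might look as follows (the individual loss terms are assumed to have been computed by helpers such as those sketched earlier; the weights are placeholders to be set per actual conditions):

```python
import torch

alpha, beta, gamma = 1.0, 0.5, 0.5  # placeholder weights

def training_step(optimizer, l_ssl, l_cluster, l_reserve, l_distill):
    """Form the target loss and update the feature extraction network,
    coding model and mapping model through a shared optimizer."""
    l_total = l_ssl + alpha * l_cluster + beta * l_reserve + gamma * l_distill
    optimizer.zero_grad()
    l_total.backward()
    optimizer.step()
    return l_total.detach()
```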
At this point, the training of the current incremental training batch is completed, and the first reference features of the current incremental training batch are stored in the queue. When the next incremental training batch starts, the first reference features of the current incremental training batch become the second reference features of the previous incremental training batch, and the next incremental training batch is trained by a similar process. This cycle continues until the number of incremental training batches reaches a preset batch threshold, completing the training of the feature extraction model. After training is completed, the trained feature extraction model can be used for feature extraction to complete downstream tasks.
Therefore, the training architecture divides the plurality of target cluster centers into a first center set and a second center set, determines the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set, and determines the feature space reservation loss according to the first distance and the second distance. The feature space reservation loss constrains the distribution of sample features, achieving the effect of reserving feature space and preventing feature conflicts for sample data newly added during incremental training. By introducing the clustering loss, the target cluster centers are determined, and Gaussian random sampling according to the sampling center and sampling variance yields the first reference features of the current incremental training batch, so that the reference features are updated in each incremental training batch. Meanwhile, by introducing the distillation loss, distillation training is performed at the feature representation level through the replica model and at the feature relation level through the coding model, which alleviates the problem of catastrophic forgetting to a certain extent. Finally, the feature extraction model is trained according to the self-supervision loss, the clustering loss, the feature space reservation loss and the distillation loss, which improves the training effect on the feature extraction model in scenarios combining incremental training with self-supervised training; accordingly, when the target features of the target media data are extracted based on the trained feature extraction model, the accuracy of the extracted features is improved.
Referring to fig. 12, fig. 12 is a schematic diagram of another alternative complete architecture of the feature extraction method according to the embodiment of the present application, specifically:
when the current incremental training batch starts, a copy of the feature extraction model is created to obtain a replica model. A sample data pair is acquired, which comprises sample media data and corresponding contrast data, and the sample media data is subjected to data enhancement to obtain first enhancement data. Sample features of the sample media data and contrast features of the contrast data are extracted based on the feature extraction model, a first similarity between the sample features and the first reference features and a second similarity between the contrast features and the first reference features are determined, and the contrast loss is determined according to the first similarity and the second similarity. The contrast loss is taken as the self-supervision loss of the feature extraction model, denoted $L_{ssl}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
Next, a plurality of preset target cluster centers are acquired and divided into a first center set and a second center set, the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set are determined, and the feature space reservation loss is determined according to the first distance and the second distance, denoted $L_{reserve}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
Next, an initialized coding model and an initialized mapping model are obtained; the replica model and the coding model are taken as the teacher network, and the feature extraction model as the student network. Another, different data enhancement is applied to the sample media data to obtain second enhancement data. First enhancement features of the first enhancement data are extracted based on the feature extraction model, and second enhancement features of the first enhancement data are extracted based on the replica model; the first enhancement features are mapped to the feature space where the second enhancement features are located based on the mapping model to obtain mapping features, a third similarity between the mapping features and the second enhancement features is determined, and the first loss is determined according to the third similarity. Third enhancement features of the second enhancement data are extracted based on the coding model, a fourth similarity between the first enhancement features and the third enhancement features is determined, and the second loss is determined according to the fourth similarity. The distillation loss of the feature extraction model is determined from the first loss and the second loss, denoted $L_{distill}$; the specific calculation principle can be seen in the previous explanation and is not repeated here.
The target loss is determined according to the self-supervision loss, the feature space reservation loss and the distillation loss, and the parameters of the feature extraction network, the decoder, the coding model and the mapping model are adjusted according to the target loss, where the target loss can be expressed as:

$L_{total} = L_{ssl} + \beta L_{reserve} + \gamma L_{distill}$

where $L_{total}$ denotes the target loss, $L_{ssl}$ the self-supervision loss, $L_{reserve}$ the feature space reservation loss, and $L_{distill}$ the distillation loss; $\beta$ and $\gamma$ are the weights of the feature space reservation loss and the distillation loss respectively, and can be set according to practical conditions.
The training of the current incremental training batch is thus completed, and the cycle continues until the number of incremental training batches reaches the preset batch threshold, completing the training of the feature extraction model. After training is completed, the trained feature extraction model can be used for feature extraction to complete downstream tasks.
Therefore, the training architecture divides the plurality of target cluster centers into a first center set and a second center set, determines the first distance between the sample media data and the first center set and the second distance between the sample media data and the second center set, and determines the feature space reservation loss according to the first distance and the second distance. The feature space reservation loss constrains the distribution of sample features, achieving the effect of reserving feature space and preventing feature conflicts for sample data newly added during incremental training. By introducing the distillation loss, distillation training is performed at the feature representation level through the replica model and at the feature relation level through the coding model, which alleviates the problem of catastrophic forgetting to a certain extent. Finally, the feature extraction model is trained according to the self-supervision loss, the feature space reservation loss and the distillation loss, which improves the training effect on the feature extraction model in scenarios combining incremental training with self-supervised training; accordingly, when the target features of the target media data are extracted based on the trained feature extraction model, the accuracy of the extracted features is improved.
It will be appreciated that, although the steps in the flowcharts described above are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order unless explicitly stated in the present embodiment, and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a portion of the steps or stages in other steps or other steps.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an alternative feature extraction device provided in an embodiment of the application, where the feature extraction device 1300 includes:
a first processing module 1301, configured to obtain sample media data in a current incremental training batch, extract sample features of the sample media data based on a feature extraction model, and determine self-supervision loss of the feature extraction model according to the sample features;
A second processing module 1302, configured to determine a plurality of target cluster centers, divide the plurality of target cluster centers into a first center set and a second center set, determine a first distance between the sample media data and the first center set, and a second distance between the sample media data and the second center set, and determine a feature space reservation loss according to the first distance and the second distance;
the first parameter adjustment module 1303 is configured to train the feature extraction model according to the self-supervision loss and the feature space reservation loss;
a third processing module 1304 is configured to obtain target media data, and extract target features of the target media data based on the trained feature extraction model.
Further, the first processing module 1301 is specifically configured to:
acquiring a first reference feature in a current incremental training batch, acquiring contrast data of sample media data, and extracting contrast features of the contrast data based on a feature extraction model;
determining a first similarity between the sample feature and the first reference feature, and a second similarity between the contrast feature and the first reference feature, determining a contrast loss according to the first similarity and the second similarity, and taking the contrast loss as the self-supervision loss of the feature extraction model.
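A sketch of one plausible form of this contrast loss (the description states only that it is determined from the first and second similarities; the softmax cross-entropy between the two similarity distributions used here is an assumption):

```python
import torch
import torch.nn.functional as F

def contrast_loss(sample_feats, contrast_feats, ref_feats, tau=0.1):
    """First similarity: sample features vs. first reference features;
    second similarity: contrast features vs. first reference features.
    The loss pulls the two resulting distributions over the reference
    features toward each other."""
    refs = F.normalize(ref_feats, dim=1)
    p_sample = (F.normalize(sample_feats, dim=1) @ refs.t() / tau).softmax(dim=1)
    log_p_contrast = (F.normalize(contrast_feats, dim=1) @ refs.t() / tau).log_softmax(dim=1)
    return -(p_sample.detach() * log_p_contrast).sum(dim=1).mean()
```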
Further, the first processing module 1301 is specifically configured to:
acquiring a plurality of second reference features in a previous incremental training batch, wherein a target clustering center is obtained by clustering the plurality of second reference features;
determining a target cluster center as a sampling center, determining a distribution variance of a second reference feature taking the target cluster center as a cluster center, and taking the distribution variance as the sampling variance;
and carrying out Gaussian random sampling according to the sampling center and the sampling variance to obtain a first reference characteristic in the current incremental training batch.
Further, the first parameter adjustment module 1303 is specifically configured to:
determining a clustering loss when clustering the plurality of second reference features;
and determining target loss according to the self-supervision loss, the feature space reservation loss and the clustering loss, and training the feature extraction model according to the target loss.
Further, the first parameter adjustment module 1303 is specifically configured to:
determining a plurality of original cluster centers, and determining a third distance between the second reference feature and each original cluster center;
determining a target distance with the largest numerical value from the plurality of third distances;
determining a first number of second reference features, and determining a clustering loss when clustering the plurality of second reference features based on the original clustering center according to the first number and the target distance.
Further, the second processing module 1302 is specifically configured to:
taking the distance between the sample media data and the nearest target clustering center in the first center set as the first distance between the sample media data and the first center set;
taking the distance between the sample media data and the nearest target clustering center in the second center set as a second distance between the sample media data and the second center set;
a second amount of sample media data and a difference between the first distance and the second distance are determined, and a feature space reservation loss is determined based on the second amount and the difference.
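A sketch of this computation (the signed difference averaged over the second amount of samples is one plausible reading; the exact functional form of the loss is an assumption):

```python
import torch

def feature_space_reservation_loss(sample_feats, first_centers, second_centers):
    """d1: distance to the nearest target cluster center in the first center set;
    d2: distance to the nearest center in the second (reserved) set. Minimizing
    the mean difference keeps samples close to the first set and away from the
    reserved second set."""
    d1 = torch.cdist(sample_feats, first_centers).min(dim=1).values   # first distance
    d2 = torch.cdist(sample_feats, second_centers).min(dim=1).values  # second distance
    return (d1 - d2).mean()  # averaged over the second amount (N samples)
```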
Further, the first parameter adjustment module 1303 is specifically configured to:
creating a copy of the feature extraction model at the beginning of the current incremental training batch to obtain a copy model;
acquiring an initialized coding model, wherein the coding model is used for extracting characteristics of the enhanced sample media data in the current incremental training batch;
taking the copy model and the coding model as a teacher network, taking the feature extraction model as a student network, and determining distillation loss of the feature extraction model;
and determining target loss according to the self-supervision loss, the characteristic space reservation loss and the distillation loss, and training a characteristic extraction model according to the target loss.
Further, the first parameter adjustment module 1303 is specifically configured to:
acquiring first enhancement data and second enhancement data obtained after the sample media data are enhanced based on different modes;
extracting first enhancement features of the first enhancement data based on the feature extraction model, extracting second enhancement features of the first enhancement data based on the replica model, determining a third similarity between the first enhancement features and the second enhancement features, and determining a first loss according to the third similarity;
extracting third enhancement features of the second enhancement data based on the coding model, determining a fourth similarity between the first enhancement features and the third enhancement features, and determining a second loss according to the fourth similarity;
determining a distillation loss of the feature extraction model based on the first loss and the second loss.
Further, the first parameter adjustment module 1303 is specifically configured to:
acquiring an initialized mapping model;
mapping the first enhancement feature to a feature space where the second enhancement feature is located based on the mapping model to obtain a mapping feature;
and the similarity between the mapping feature and the second enhancement feature is used as a third similarity between the first enhancement feature and the second enhancement feature.
Further, the second processing module 1302 is specifically configured to:
Determining a segmentation surface of the feature space according to the midpoint between two target cluster centers with the farthest distance;
and adjusting the dividing plane according to the preset rotation angle step length until the difference value of the number of the target clustering centers at two sides of the dividing plane is smaller than or equal to a preset number threshold value, taking the target clustering center at one side of the dividing plane as a first center set, and taking the target clustering center at the other side of the dividing plane as a second center set.
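A 2-D sketch of this splitting procedure (the angle step and balance threshold stand in for the preset values mentioned above; restricting to two dimensions is an illustrative assumption):

```python
import numpy as np

def split_centers(centers, angle_step_deg=5.0, max_diff=1):
    """Split 2-D target cluster centers into a first and second center set by a
    dividing line through the midpoint of the two farthest-apart centers."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    i, j = np.unravel_index(d.argmax(), d.shape)   # two farthest centers
    midpoint = (centers[i] + centers[j]) / 2
    side = np.zeros(len(centers), dtype=bool)
    for k in range(int(360 / angle_step_deg)):
        theta = np.deg2rad(k * angle_step_deg)     # rotate in preset angle steps
        normal = np.array([np.cos(theta), np.sin(theta)])
        side = (centers - midpoint) @ normal > 0
        # Stop once the two sides differ by at most the preset number threshold.
        if abs(int(side.sum()) - int((~side).sum())) <= max_diff:
            break
    return centers[side], centers[~side]
```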
The feature extraction apparatus 1300 and the feature extraction method are based on the same inventive concept, so that the feature extraction apparatus 1300 divides a plurality of target cluster centers into a first center set and a second center set by determining the plurality of target cluster centers, determines a first distance between sample media data and the first center set and a second distance between the sample media data and the second center set, determines a feature space reservation loss according to the first distance and the second distance, and can restrict distribution of sample features by using the feature space reservation loss, achieve the effect of feature space reservation, prevent feature collision of newly added sample data in the incremental training process, and finally train a feature extraction model according to the self-supervision loss and the feature space reservation loss, so that training effect on the feature extraction model can be improved in a scene of combining incremental training with self-supervision training, and then, when extracting target features of target media data based on the trained feature extraction model, accuracy of features extracted by the feature extraction model can be improved.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an alternative model training apparatus provided in an embodiment of the present application, where the model training apparatus 1400 includes:
a fourth processing module 1401, configured to obtain sample media data in a current incremental training batch, extract sample features of the sample media data based on the feature extraction model, and determine self-supervision loss of the feature extraction model according to the sample features;
a fifth processing module 1402, configured to determine a plurality of target cluster centers, divide the plurality of target cluster centers into a first center set and a second center set, determine a first distance between the sample media data and the first center set, and a second distance between the sample media data and the second center set, and determine a feature space reservation loss according to the first distance and the second distance;
the second parameter adjustment module 1403 is configured to train the feature extraction model according to the self-supervision loss and the feature space reservation loss.
The model training apparatus 1400 and the model training method are based on the same inventive concept, so that the model training apparatus 1400 divides a plurality of target cluster centers into a first center set and a second center set by determining the plurality of target cluster centers, determines a first distance between sample media data and the first center set, and a second distance between sample media data and the second center set, determines a feature space reservation loss according to the first distance and the second distance, and can restrict distribution of sample features by using the feature space reservation loss, thereby achieving the effect of feature space reservation, preventing feature conflicts from occurring in newly added sample data in the incremental training process, and finally training the feature extraction model according to the self-supervision loss and the feature space reservation loss, so that the training effect on the feature extraction model can be improved in a scene of combining incremental training with self-supervision training.
The electronic device for executing the feature extraction method or the model training method according to the embodiment of the present application may be a terminal, and referring to fig. 15, fig. 15 is a partial block diagram of the terminal according to the embodiment of the present application, where the terminal includes: radio Frequency (RF) circuitry 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuitry 1560, wireless fidelity (wireless fidelity, wiFi) module 1570, processor 1580, and power supply 1590. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 15 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The RF circuit 1510 may be used for receiving and transmitting signals during a message or a call, and particularly, after receiving downlink information of a base station, the signal is processed by the processor 1580; in addition, the data of the design uplink is sent to the base station.
The memory 1520 may be used to store software programs and modules, and the processor 1580 performs various functional applications and data processing of the terminal by executing the software programs and modules stored in the memory 1520.
The input unit 1530 may be used to receive input numerical or character information and generate key signal inputs related to the setting and function control of the terminal. In particular, the input unit 1530 may include a touch panel 1531 and other input devices 1532.
The display unit 1540 may be used to display input information or provided information and various menus of the terminal. The display unit 1540 may include a display panel 1541.
Audio circuitry 1560, speakers 1561, and microphone 1562 may provide an audio interface.
In this embodiment, the processor 1580 included in the terminal may perform the feature extraction method or the model training method of the previous embodiment.
The electronic device for performing the feature extraction method or the model training method according to the embodiment of the present application may also be a server, and referring to fig. 16, fig. 16 is a partial block diagram of a server according to the embodiment of the present application, where server 1600 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (Central Processing Units, abbreviated as CPU) 1622 (e.g., one or more processors) and a memory 1632, and one or more storage media 1630 (e.g., one or more mass storage devices) storing application programs 1642 or data 1644. Wherein memory 1632 and storage medium 1630 may be transitory or persistent. The program stored on the storage medium 1630 may include one or more modules (not shown), each of which may include a series of instruction operations on the server 1600. Further, the central processor 1622 may be configured to communicate with a storage medium 1630 to execute a series of instruction operations on the storage medium 1630 on the server 1600.
The server 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1658, and/or one or more operating systems 1641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The processor in server 1600 may be used to perform a feature extraction method or a model training method.
The embodiments of the present application also provide a computer readable storage medium storing a program code for executing the feature extraction method or the model training method of the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program stored on a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the feature extraction method or the model training method described above.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It should be understood that in the description of the embodiments of the present application, plural (or multiple) means two or more, and that greater than, less than, exceeding, etc. are understood to not include the present number, and that greater than, less than, within, etc. are understood to include the present number.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should also be appreciated that the various embodiments provided by the embodiments of the present application may be arbitrarily combined to achieve different technical effects.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit and scope of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (16)

1. A feature extraction method, comprising:
acquiring sample media data in a current incremental training batch, extracting sample characteristics of the sample media data based on a characteristic extraction model, and determining self-supervision loss of the characteristic extraction model according to the sample characteristics;
determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
Training the feature extraction model according to the self-supervision loss and the feature space reservation loss;
and acquiring target media data, and extracting target features of the target media data based on the trained feature extraction model.
2. The feature extraction method of claim 1, wherein said determining self-supervised loss of the feature extraction model from the sample features comprises:
acquiring a first reference feature in a current incremental training batch, acquiring contrast data of the sample media data, and extracting contrast features of the contrast data based on the feature extraction model;
determining a first similarity between the sample feature and the first reference feature, and a second similarity between the contrast feature and the first reference feature, determining a contrast loss according to the first similarity and the second similarity, and taking the contrast loss as a self-supervision loss of the feature extraction model.
3. The feature extraction method of claim 2, wherein the acquiring the first reference feature in the current incremental training batch comprises:
acquiring a plurality of second reference features in a previous incremental training batch, wherein the target clustering centers are obtained by clustering the plurality of second reference features;
determining the target clustering center as a sampling center, determining the distribution variance of the second reference features that take the target clustering center as their cluster center, and taking the distribution variance as a sampling variance;
and carrying out Gaussian random sampling according to the sampling center and the sampling variance to obtain the first reference feature in the current incremental training batch.
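A sketch of the Gaussian random sampling in claim 3: each target clustering center is the sampling mean and the per-cluster distribution variance is the sampling variance. The number of samples drawn per center is an illustrative assumption.

```python
import torch

def sample_reference_features(centers: torch.Tensor,    # (C, D) sampling centers
                              variances: torch.Tensor,  # (C, D) sampling variances
                              n_per_center: int = 8) -> torch.Tensor:
    samples = []
    for mu, var in zip(centers, variances):
        std = var.clamp_min(1e-8).sqrt()  # guard against degenerate clusters
        samples.append(mu + std * torch.randn(n_per_center, mu.shape[0]))
    return torch.cat(samples, dim=0)      # first reference features, (C * n, D)
```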
4. The feature extraction method of claim 3, wherein the training the feature extraction model according to the self-supervision loss and the feature space reservation loss comprises:
determining a clustering loss when clustering the plurality of second reference features;
and determining a target loss according to the self-supervision loss, the feature space reservation loss and the clustering loss, and training the feature extraction model according to the target loss.
5. The feature extraction method of claim 4, wherein the determining a clustering loss when clustering the plurality of second reference features comprises:
determining a plurality of original clustering centers, and determining a third distance between the second reference feature and each original clustering center;
determining, from the plurality of third distances, a target distance with the largest value;
and determining a first number of the second reference features, and determining, according to the first number and the target distance, the clustering loss when clustering the plurality of second reference features based on the original clustering centers.
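A literal sketch of claim 5: for each second reference feature, the largest of its third distances is the target distance, and the loss is formed from the first number (the feature count) and these target distances. Averaging over the count is an assumed reading of how the number enters.

```python
import torch

def clustering_loss(second_ref_feats: torch.Tensor,  # (N, D)
                    original_centers: torch.Tensor   # (C, D)
                    ) -> torch.Tensor:
    d = torch.cdist(second_ref_feats, original_centers)  # third distances, (N, C)
    target = d.max(dim=1).values  # per-feature target distance (largest value)
    return target.sum() / second_ref_feats.shape[0]  # averaged over the first number
```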
6. The feature extraction method of claim 1, wherein the determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, determining a feature space reservation loss based on the first distance and the second distance, comprises:
taking the distance between the sample media data and the closest target clustering center in the first center set as the first distance between the sample media data and the first center set;
taking the distance between the sample media data and the closest target clustering center in the second center set as the second distance between the sample media data and the second center set;
determining a second number of the sample media data, and a difference between the first distance and the second distance, and determining the feature space reservation loss based on the second number and the difference.
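A sketch of the feature space reservation loss in claim 6: each sample's distance to its nearest center in the first set, minus its distance to its nearest center in the second set, averaged over the second number of samples. Computing the distances on extracted sample features is an assumption, since the claim phrases them on the sample media data.

```python
import torch

def space_reservation_loss(sample_feats: torch.Tensor,  # (B, D) sample features
                           first_set: torch.Tensor,     # (C1, D) first center set
                           second_set: torch.Tensor     # (C2, D) second center set
                           ) -> torch.Tensor:
    d1 = torch.cdist(sample_feats, first_set).min(dim=1).values   # first distance
    d2 = torch.cdist(sample_feats, second_set).min(dim=1).values  # second distance
    return (d1 - d2).sum() / sample_feats.shape[0]  # averaged over the second number
```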
7. The feature extraction method according to any one of claims 1 to 6, wherein the training the feature extraction model according to the self-supervision loss and the feature space reservation loss comprises:
creating a copy of the feature extraction model at the beginning of the current incremental training batch to obtain a copy model;
acquiring an initialized coding model, wherein the coding model is used for extracting features of the enhanced sample media data in the current incremental training batch;
taking the copy model and the coding model as a teacher network, taking the feature extraction model as a student network, and determining a distillation loss of the feature extraction model;
and determining a target loss according to the self-supervision loss, the feature space reservation loss and the distillation loss, and training the feature extraction model according to the target loss.
8. The feature extraction method of claim 7, wherein the determining a distillation loss of the feature extraction model comprises:
acquiring first enhancement data and second enhancement data obtained by enhancing the sample media data in different manners;
extracting first enhancement features of the first enhancement data based on the feature extraction model, extracting second enhancement features of the first enhancement data based on the copy model, determining a third similarity between the first enhancement features and the second enhancement features, and determining a first loss according to the third similarity;
extracting third enhancement features of the second enhancement data based on the coding model, determining a fourth similarity between the first enhancement features and the third enhancement features, and determining a second loss according to the fourth similarity;
and determining the distillation loss of the feature extraction model according to the first loss and the second loss.
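A sketch of the distillation loss in claims 7 and 8: the copy model and the coding model act as frozen teachers of the student feature extraction model. Cosine similarity as the similarity measure and summing the first and second losses are assumptions; the claims only name "similarity" and do not fix the combination.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, copy_model, coding_model,
                      first_enh: torch.Tensor,
                      second_enh: torch.Tensor) -> torch.Tensor:
    f1 = student(first_enh)            # first enhancement features
    with torch.no_grad():              # teachers only provide targets
        f2 = copy_model(first_enh)     # second enhancement features
        f3 = coding_model(second_enh)  # third enhancement features
    sim3 = F.cosine_similarity(f1, f2, dim=-1)  # third similarity
    sim4 = F.cosine_similarity(f1, f3, dim=-1)  # fourth similarity
    first_loss = (1.0 - sim3).mean()
    second_loss = (1.0 - sim4).mean()
    return first_loss + second_loss    # combination rule is an assumption
```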
9. The feature extraction method of claim 8, wherein the determining a third similarity between the first enhancement feature and the second enhancement feature comprises:
acquiring an initialized mapping model;
mapping the first enhancement feature to a feature space where the second enhancement feature is located based on the mapping model to obtain a mapping feature;
and taking the similarity between the mapping feature and the second enhancement feature as the third similarity between the first enhancement feature and the second enhancement feature.
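A hypothetical mapping model for claim 9: a small MLP that projects the first enhancement feature into the feature space of the second enhancement feature before the third similarity is computed. The architecture and the use of cosine similarity are illustrative, not specified by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingModel(nn.Module):
    # Projects the first enhancement feature into the space of the second.
    def __init__(self, dim_in: int, dim_out: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, hidden),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(hidden, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def third_similarity(mapper: MappingModel,
                     first_feat: torch.Tensor,
                     second_feat: torch.Tensor) -> torch.Tensor:
    mapped = mapper(first_feat)  # the mapping feature
    return F.cosine_similarity(mapped, second_feat, dim=-1)
```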
10. The feature extraction method according to claim 1, wherein the dividing the plurality of target clustering centers into a first center set and a second center set comprises:
determining a segmentation plane of the feature space according to the midpoint between the two target clustering centers farthest from each other;
and adjusting the segmentation plane by a preset rotation angle step until the difference between the numbers of target clustering centers on the two sides of the segmentation plane is less than or equal to a preset number threshold, taking the target clustering centers on one side of the segmentation plane as the first center set, and taking the target clustering centers on the other side as the second center set.
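A rough sketch of claim 10. The segmentation plane passes through the midpoint of the two farthest target clustering centers; here the plane normal is tilted towards a random orthogonal direction in fixed angle steps until the two sides are balanced. This rotation scheme is an assumption: the claim fixes only the midpoint, the angle step, and the stop criterion.

```python
import torch

def split_centers(centers: torch.Tensor,  # (C, D) target clustering centers
                  step_deg: float = 5.0,  # preset rotation angle step
                  max_diff: int = 1):     # preset number threshold
    d = torch.cdist(centers, centers)
    i, j = divmod(int(d.argmax()), centers.shape[0])  # two farthest centers
    midpoint = (centers[i] + centers[j]) / 2
    normal = centers[j] - centers[i]
    normal = normal / normal.norm()
    rand = torch.randn_like(normal)          # direction to tilt the normal towards
    ortho = rand - (rand @ normal) * normal  # make it orthogonal to the normal
    ortho = ortho / ortho.norm()
    best_diff, best_side = None, None
    for angle in torch.arange(0.0, 180.0, step_deg):
        theta = torch.deg2rad(angle)
        n = torch.cos(theta) * normal + torch.sin(theta) * ortho
        side = (centers - midpoint) @ n > 0  # side of the plane per center
        diff = abs(int(side.sum()) - int((~side).sum()))
        if best_diff is None or diff < best_diff:
            best_diff, best_side = diff, side
        if diff <= max_diff:
            break
    return centers[best_side], centers[~best_side]  # first and second center sets
```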
11. A method of model training, comprising:
acquiring sample media data in a current incremental training batch, extracting sample features of the sample media data based on a feature extraction model, and determining a self-supervision loss of the feature extraction model according to the sample features;
determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
and training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
12. A feature extraction device, comprising:
the first processing module is used for acquiring sample media data in a current incremental training batch, extracting sample features of the sample media data based on a feature extraction model, and determining a self-supervision loss of the feature extraction model according to the sample features;
the second processing module is used for determining a plurality of target clustering centers, dividing the plurality of target clustering centers into a first center set and a second center set, determining a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determining a feature space reservation loss according to the first distance and the second distance;
the first parameter adjustment module is used for training the feature extraction model according to the self-supervision loss and the feature space reservation loss;
and the third processing module is used for acquiring target media data and extracting target features of the target media data based on the trained feature extraction model.
13. A model training device, comprising:
a fourth processing module, configured to obtain sample media data in a current incremental training batch, extract sample features of the sample media data based on a feature extraction model, and determine a self-supervision loss of the feature extraction model according to the sample features;
a fifth processing module, configured to determine a plurality of target clustering centers, divide the plurality of target clustering centers into a first center set and a second center set, determine a first distance between the sample media data and the first center set and a second distance between the sample media data and the second center set, and determine a feature space reservation loss according to the first distance and the second distance;
and the second parameter adjustment module is used for training the feature extraction model according to the self-supervision loss and the feature space reservation loss.
14. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the feature extraction method of any one of claims 1 to 10 or the model training method of claim 11 when executing the computer program.
15. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the feature extraction method of any one of claims 1 to 10, or implements the model training method of claim 11.
16. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the feature extraction method of any one of claims 1 to 10 or the model training method of claim 11.
CN202211361033.XA 2022-11-02 2022-11-02 Feature extraction method, model training method, device and electronic equipment Pending CN117009776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211361033.XA CN117009776A (en) 2022-11-02 2022-11-02 Feature extraction method, model training method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211361033.XA CN117009776A (en) 2022-11-02 2022-11-02 Feature extraction method, model training method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117009776A 2023-11-07

Family

ID=88566142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211361033.XA Pending CN117009776A (en) 2022-11-02 2022-11-02 Feature extraction method, model training method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117009776A (en)


Legal Events

Date Code Title Description
PB01 Publication