CN117541804A - Model training method, feature extraction method, device, equipment and medium - Google Patents
- Publication number
- CN117541804A (application CN202311518180.8A)
- Authority
- CN
- China
- Prior art keywords
- feature vector
- feature
- model
- training
- sample image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The application discloses a feature extraction model training method, a convolutional neural network training method, a model training method, a feature extraction method, a device, equipment and a medium, which are used for rapidly and accurately extracting high-precision image feature vectors. The method comprises: obtaining a first feature vector of a first sample image based on a convolutional neural network; processing the first sample image into a one-dimensional sequence and inputting the one-dimensional sequence into a feature extraction model to be trained, wherein the feature extraction model is a model obtained based on the framework structure of the MPCFormer model and the training logic of the SETR model; obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a first reference feature vector; and training the feature extraction model based on the first reference feature vector, so that high-precision image feature vectors can be rapidly and accurately extracted based on the trained feature extraction model.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a feature extraction model training method, a convolutional neural network training method, a model training method, a feature extraction method, a device, equipment, and a medium.
Background
In privacy computing, secure multi-party computation (MPC) means that, without a trusted third party, a plurality of participants cooperatively complete a computation target, and no participant can obtain any input information of the other participants beyond the computation result, so that the privacy information of users is well protected.
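As a toy illustration of this idea (not taken from the patent), a private value can be additively split into random shares so that each participant holds only one share and only the computation result is reconstructed. A minimal sketch, with the modulus and the number of parties chosen arbitrarily:

```python
import random

MOD = 2 ** 32

def share(x, parties=3):
    """Additively split x into `parties` random shares modulo MOD."""
    shares = [random.randrange(MOD) for _ in range(parties - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

a_shares = share(17)
b_shares = share(25)
# each party locally adds its two shares; no party ever sees 17 or 25
sum_shares = [(sa + sb) % MOD for sa, sb in zip(a_shares, b_shares)]
print(sum(sum_shares) % MOD)  # 42 -- only the final result is revealed
```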
If the input information is an image containing the privacy information of the user, the image may be subjected to feature extraction in order to protect the privacy information of the user, and multiparty security calculation may be performed based on the extracted feature vector. However, how to quickly and accurately extract the high-precision image feature vector is a technical problem to be solved at present.
Disclosure of Invention
The application provides a feature extraction model training method, a convolutional neural network training method, a model training method, a feature extraction method, a device, equipment and a medium, which are used for rapidly and accurately extracting high-precision image feature vectors.
In a first aspect, the present application provides a feature extraction model training method, the method comprising:
for any one of the acquired first sample images, acquiring a first feature vector of the first sample image based on a convolutional neural network after training; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
And training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
In a possible implementation manner, after the obtaining of the second feature vector of the first sample image, before the calibrating the first feature vector based on the second feature vector, the method further includes:
carrying out global average pooling on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and obtaining an importance weight coefficient corresponding to each dimension feature in the global feature vector information based on a probability distribution function; calibrating the second feature vector based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated second feature vector; and based on the calibrated second feature vector, carrying out the subsequent step of calibrating the first feature vector based on the second feature vector.
In a possible implementation manner, the first loss function includes an importance weight coefficient corresponding to each dimension feature.
In a possible implementation manner, the calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector includes:
and performing an exclusive nor operation on each one-dimensional feature contained in the second feature vector and each one-dimensional feature contained in the first feature vector to obtain a calibrated first reference feature vector.
In a possible implementation manner, after the performing an exclusive nor operation on each one-dimensional feature included in the second feature vector and each one-dimensional feature included in the first feature vector, before the obtaining the calibrated first reference feature vector, the method further includes:
and carrying out normalization processing on the vector obtained after the exclusive nor operation, and determining a calibrated first reference feature vector based on the vector obtained after the normalization processing.
In one possible implementation, the excitation function used by the feature extraction model includes: a 2Quad excitation function.
In a possible implementation manner, the feature extraction model is a model obtained based on knowledge distillation.
In a second aspect, the present application provides a convolutional neural network training method, the method comprising:
for any acquired second sample image, inputting the second sample image into a convolutional neural network to be trained, and acquiring a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model after training, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
and training the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image and the configured second loss function.
In a possible implementation manner, the calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector includes:
And performing an exclusive nor operation on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector to obtain a calibrated second reference feature vector.
In a possible implementation manner, after the performing an exclusive nor operation on each one-dimensional feature included in the fourth feature vector and each one-dimensional feature included in the third feature vector, before the obtaining the calibrated second reference feature vector, the method further includes:
and carrying out normalization processing on the vector obtained after the exclusive nor operation, and determining a calibrated second reference feature vector based on the vector obtained after the normalization processing.
In a third aspect, the present application provides a model training method, the method comprising:
for any acquired third sample image, acquiring a fifth feature vector of the third sample image based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
Training the feature extraction model based on the third reference feature vector, the tag feature vector carried by the third sample image and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
In a fourth aspect, the present application provides a feature extraction method, the method comprising:
receiving an image to be processed;
the feature vector of the image to be processed is obtained based on the feature extraction model trained by the method of any one of the first aspect and the third aspect, or based on the convolutional neural network trained by the method of any one of the second aspect and the third aspect.
In one possible embodiment, the method further comprises:
and based on the feature vector, performing multiparty security calculation.
In a fifth aspect, the present application provides a feature extraction model training apparatus, the apparatus comprising:
the first calibration module is used for acquiring a first feature vector of any one of the acquired first sample images based on the convolutional neural network after training; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
And the first training module is used for training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
In a possible embodiment, the first calibration module is further configured to:
carrying out global average pooling on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and obtaining an importance weight coefficient corresponding to each dimension feature in the global feature vector information based on a probability distribution function; calibrating the second feature vector based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated second feature vector; and based on the calibrated second feature vector, carrying out the subsequent step of calibrating the first feature vector based on the second feature vector.
In a possible implementation manner, the first loss function includes an importance weight coefficient corresponding to each dimension feature.
In a possible embodiment, the first calibration module is specifically configured to:
and performing an exclusive nor operation on each one-dimensional feature contained in the second feature vector and each one-dimensional feature contained in the first feature vector to obtain a calibrated first reference feature vector.
In a possible embodiment, the first calibration module is further configured to:
and carrying out normalization processing on the vector obtained after the exclusive nor operation, and determining a calibrated first reference feature vector based on the vector obtained after the normalization processing.
In one possible implementation, the excitation function used by the feature extraction model includes: a 2Quad excitation function.
In a possible implementation manner, the feature extraction model is a model obtained based on knowledge distillation.
In a sixth aspect, the present application provides a convolutional neural network training device, the device comprising:
the second calibration module is used for inputting any acquired second sample image into a convolutional neural network to be trained, and acquiring a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model after training, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
And the second training module is used for training the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image and the configured second loss function.
In a possible embodiment, the second calibration module is specifically configured to:
and performing an exclusive nor operation on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector to obtain a calibrated second reference feature vector.
In a possible embodiment, the second calibration module is further configured to:
and carrying out normalization processing on the vector obtained after the exclusive nor operation, and determining a calibrated second reference feature vector based on the vector obtained after the normalization processing.
In a seventh aspect, the present application provides a model training apparatus, the apparatus comprising:
the third calibration module is used for acquiring a fifth feature vector of any one of the acquired third sample images based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR;
The third training module is used for training the feature extraction model based on the third reference feature vector, the label feature vector carried by the third sample image and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
In an eighth aspect, the present application provides a feature extraction apparatus, the apparatus comprising:
the receiving module is used for receiving the image to be processed;
the extraction module is used for obtaining the feature vector of the image to be processed based on the feature extraction model obtained by training the method according to any one of the first aspect and the third aspect or the convolutional neural network obtained by training the method according to any one of the second aspect and the third aspect.
In one possible embodiment, the apparatus further comprises:
and the multiparty security calculation module is used for carrying out multiparty security calculation based on the feature vector.
In a ninth aspect, the present application also provides an electronic device comprising at least a processor and a memory, the processor being adapted to implement the steps of any of the methods described above when executing a computer program stored in the memory.
In a tenth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of any of the methods described above.
In the embodiment of the application, the first sample image can be processed into a one-dimensional sequence, and the one-dimensional sequence is input into the feature extraction model, so that the feature extraction model can perform feature extraction on the first sample image. Because the feature extraction model in the embodiment of the application is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model, the feature extraction model can capture more accurate feature vectors (second feature vectors) of faces and the like based on a self-attention mechanism. The first feature vector acquired by the convolutional neural network is calibrated based on the second feature vector to obtain a first reference feature vector that expresses the image features more accurately, and the feature extraction model is trained based on the first reference feature vector and the configured first loss function, which improves the accuracy of the feature vectors determined (extracted) by the trained feature extraction model and achieves the purpose of rapidly and accurately extracting high-precision image feature vectors.
In addition, because the feature extraction model in the embodiment of the application is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model, the feature extraction model can obtain accurate pixel-level feature vectors, which further improves the accuracy with which the trained feature extraction model determines (extracts) image feature vectors. Moreover, in the embodiment of the application, the unimportant features in the first feature vector can be weakened or removed (ignored) based on the self-attention mechanism, so that fewer and more precise feature vectors are obtained; this improves the accuracy of the obtained feature vectors, reduces calculation time, improves efficiency, introduces no extra calculation overhead, and alleviates the communication bottleneck of MPC.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 illustrates a first feature extraction model training process schematic provided by some embodiments;
FIG. 2A is a schematic view of an image corresponding to a first eigenvector obtained based on a convolutional neural network in the related art according to some embodiments;
FIG. 2B is a schematic view of an image corresponding to a second feature vector obtained based on a feature extraction model according to some embodiments;
FIG. 3 illustrates a second feature extraction model training process schematic provided by some embodiments;
FIG. 4 illustrates a third feature extraction model training process schematic provided by some embodiments;
FIG. 5 illustrates a schematic diagram of communication durations for different excitation functions provided by some embodiments;
FIG. 6 illustrates a schematic diagram of a Transformer model provided by some embodiments;
FIG. 7 illustrates a schematic diagram of an SETR model provided by some embodiments;
FIG. 8 illustrates a convolutional neural network training process schematic provided by some embodiments;
FIG. 9 illustrates a schematic diagram of a multiparty secure computing process provided by some embodiments;
FIG. 10 illustrates a model training process schematic provided by some embodiments;
FIG. 11 illustrates a feature extraction process schematic provided by some embodiments;
FIG. 12 illustrates a schematic diagram of a feature extraction model training apparatus provided by some embodiments;
FIG. 13 illustrates a schematic diagram of a convolutional neural network training device provided by some embodiments;
FIG. 14 illustrates a schematic diagram of a model training apparatus provided by some embodiments;
FIG. 15 illustrates a schematic diagram of a feature extraction apparatus provided by some embodiments;
fig. 16 illustrates a schematic diagram of an electronic device according to some embodiments.
Detailed Description
In order to quickly and accurately extract (acquire) high-precision image feature vectors, the application provides a feature extraction model training method, a convolutional neural network training method, a model training method, a feature extraction method, a device, equipment and a medium.
For purposes of clarity and to enable implementation of the present application, exemplary implementations of the present application are described clearly and completely below with reference to the accompanying drawings, in which exemplary implementations of the present application are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," "second," "third" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar objects or entities and are not necessarily intended to limit a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features thereof may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
In all embodiments of the present application, data is acquired, stored, used and processed in compliance with the relevant national laws and regulations.
Example 1:
FIG. 1 illustrates a schematic diagram of a first feature extraction model training process provided by some embodiments, as shown in FIG. 1, the process comprising the steps of:
S101: for any one of the acquired first sample images, acquiring a first feature vector of the first sample image based on a convolutional neural network after training; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; and calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer model SETR.
In a possible implementation manner, the feature extraction model may be a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the semantic vision Transformer (SETR) model; how the feature extraction model in the embodiment of the present application is obtained based on the framework structure of the MPCFormer model and the SETR model is described in a later embodiment and is not detailed here.
In one possible implementation manner, when training the feature extraction model, in order to improve the accuracy of the feature vector obtained by the feature extraction model, any first sample image may be obtained from the first sample image set, and the obtained first sample image is input into a trained convolutional neural network (Convolutional Neural Network, CNN), such as a deep residual network (ResNet). The convolutional neural network performs feature extraction on the first sample image to obtain the first feature vector of the first sample image. The first sample image may be an image containing user privacy information, such as a facial biometric feature; the type of the first sample image is not specifically limited in this application.
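For illustration only, the following is a minimal sketch (not the patent's implementation) of obtaining the first feature vector F with a trained CNN backbone. The choice of torchvision's ResNet-50, the removal of the last two layers, the input size and the image path are all assumptions made here:

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
# drop the average pooling and classification head so the spatial feature maps are kept
cnn = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-2])
cnn.eval()  # the CNN is already trained and only used for feature extraction here

preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("first_sample_image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    F = cnn(img)  # first feature vector (feature maps), shape (1, 2048, 7, 7) for this backbone
print(F.shape)
```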
In a possible implementation manner, the first sample image may be processed into a one-dimensional sequence based on a preset image processing manner, the one-dimensional sequence is input into a feature extraction model to be trained, and feature extraction is performed on the first sample image based on the feature extraction model to obtain a second feature vector of the first sample image; the first feature vector is calibrated based on the second feature vector, so that a more accurate feature vector (first reference feature vector) can be obtained.
Specifically, the framework structure of the feature extraction model is based on the framework structure of the MPCFormer model, and the MPCFormer model in the related art is generally used for processing natural language. In order to enable the feature extraction model to process images, in the embodiment of the application the first sample image can be processed into a one-dimensional sequence and the one-dimensional sequence input into the feature extraction model, and the feature extraction model is trained with the training logic of the SEgmentation TRansformer (SETR) model, so that the operation logic of the feature extraction model is the same as that of the SETR model. This enables the feature extraction model to process images, to acquire more accurate pixel-level image features based on the operation logic of the SETR model, and to acquire fewer and more precise image features (feature vectors) based on a multi-head self-attention mechanism, thereby further improving the speed and accuracy of secure multi-party computation (MPC). How the first sample image is processed into a one-dimensional sequence, and how the feature extraction model in the embodiment of the present application is obtained based on the framework structure of the MPCFormer model and the training logic of the SETR model, are explained in a later embodiment; the process of calibrating the first feature vector based on the second feature vector is explained first below.
For convenience of understanding, in the embodiment of the present application the first feature vector is denoted by F and the second feature vector by A. When the first feature vector is calibrated based on the second feature vector A, the first feature vector F may be pooled by using the second feature vector A, that is, an exclusive nor operation is performed between each one-dimensional feature contained in the second feature vector (Attention Maps) and each one-dimensional feature contained in the first feature vector (Feature Maps). For example, the vector obtained by the exclusive nor operation for the k-th map (Attention Map) in the second feature vector may be expressed as $v_k=\sum_{i,j}F_{i,j,k}\odot A_{i,j,k}$, where $i,j,k$ denotes the pixel in the i-th row and j-th column of the k-th map (Attention Map).
In one possible implementation manner, when the exclusive nor operation is performed between each one-dimensional feature contained in the second feature vector and each one-dimensional feature contained in the first feature vector, if the pixels of the corresponding feature are high in both the second feature vector and the first feature vector, the pixel of the feature is considered good, and the pixel of the corresponding feature obtained after the exclusive nor operation is configured as a higher pixel. If the pixel of the corresponding feature is poor in either one of them, the pixel of the feature is considered poor, and the pixel of the corresponding feature obtained after the exclusive nor operation is configured as a lower pixel, so that fewer and more precise image features (feature vectors) can be obtained through calibration.
Alternatively, taking the second feature vector as an example, if the pixel of a certain feature is higher than the average pixel of the feature in the second feature vector, the pixel of the feature may be considered to be higher, otherwise, if the pixel of the certain feature is not higher than the average pixel of the feature in the second feature vector, the pixel of the feature may be considered to be worse. Similarly, if the pixel of a feature in the first feature vector is higher than the average pixel of the feature in the first feature vector, the pixel of the feature may be considered higher, whereas if the pixel of the feature is not higher than the average pixel of the feature in the first feature vector, the pixel of the feature may be considered worse. The above-mentioned specific calculation process of performing the exclusive nor operation on each one-dimensional feature included in the second feature vector and each one-dimensional feature included in the first feature vector is merely an exemplary illustration, and the specific calculation process of the exclusive nor operation is not specifically limited in this application.
In one possible implementation, after the exclusive nor operation is performed between each dimension feature contained in the second feature vector and each dimension feature contained in the first feature vector, the vectors obtained after the exclusive nor operation may be normalized into a feature space of the same size based on a fully connected layer (Fully Connected Layer), and the normalized vectors are then averaged to obtain the final calibrated first reference feature vector f.
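A hedged sketch of this attention-based pooling calibration follows. F and A are assumed to have the same number of maps K; the reading of the exclusive nor operation (binarizing each map against its own mean and keeping F where the two binary maps agree), the fully connected layer size and the final normalization are assumptions for illustration, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

def attention_based_pooling(F: torch.Tensor, A: torch.Tensor, fc: nn.Linear) -> torch.Tensor:
    """F, A: (B, K, H, W) feature maps and attention maps; returns f of shape (B, D)."""
    f_high = F > F.mean(dim=(2, 3), keepdim=True)    # pixel higher than the feature's average
    a_high = A > A.mean(dim=(2, 3), keepdim=True)
    xnor = (f_high == a_high).float()                # exclusive nor of the two binary maps
    v = (xnor * F).sum(dim=(2, 3))                   # one reading of v_k = sum_{i,j} F (xnor) A, shape (B, K)
    f = fc(v)                                        # fully connected normalization into a common space
    return f / (f.norm(dim=1, keepdim=True) + 1e-8)  # final normalization/averaging step (assumed)

B, K, H, W, D = 2, 64, 14, 14, 128
fc = nn.Linear(K, D)
f = attention_based_pooling(torch.rand(B, K, H, W), torch.rand(B, K, H, W), fc)
print(f.shape)  # torch.Size([2, 128])
```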
Referring to fig. 2A and fig. 2B, fig. 2A is a schematic image diagram corresponding to a first feature vector obtained based on a convolutional neural network in the related art according to some embodiments, and fig. 2B illustrates a schematic image diagram corresponding to a second feature vector obtained based on the feature extraction model according to some embodiments. In fig. 2B, the black area is an irrelevant feature area. It can be seen that the feature extraction model captures the more specific facial features in the image by means of the self-attention mechanism and ignores the remaining irrelevant feature areas in the image, so that the attention mechanism extracts the truly valuable features of the image subject. For face features in particular, the effect is that the attention mechanism screens out the face features and shields the remaining image information in the environment, such as illumination and background, which is represented by the black area, so that fewer, more precise and more accurate face features can be obtained.
In a possible embodiment, taking the first sample image as an image containing a face, part of the face in the image may be occluded or inclined. In order to weaken this part of the features and obtain more accurate and more focused face features, after the second feature vector A of the first sample image is obtained, the second feature vector A may be calibrated before the first feature vector F is calibrated, and the first feature vector may then be calibrated based on the calibrated second feature vector Ã, which can further improve the accuracy of the obtained first reference feature vector. The process of calibrating the second feature vector A to obtain the calibrated second feature vector Ã may be as follows:
Alternatively, the second feature vector A may first be subjected to global average pooling (Global Average Pooling, GAP) to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and the obtained global feature vector information may be mapped through a softmax probability distribution function (also referred to as a normalized exponential function), $f_{ex}(\cdot)=\mathrm{softmax}(\cdot)$, so that the more important feature information in the second feature vector A is further amplified and an importance weight coefficient s corresponding to each dimension feature in the global feature vector information is obtained, for example $s=f_{ex}(\mathrm{GAP}(A))=\mathrm{softmax}(\mathrm{GAP}(A))$. The importance weight coefficient s corresponding to important features in the second feature vector (such as face features that are not occluded and not inclined) is larger, and the importance weight coefficient s corresponding to less important features (such as occluded or inclined face features) is smaller. In this way, the occluded or inclined parts of features such as the face are weakened and the activation (weight coefficient) of those parts in the Attention Map is set to a lower level, which further improves the accuracy of the determined first reference feature vector.
In one possible implementation, the second feature vector A may be transformed using a sigmoid function, and the transformed second feature vector, e.g., $\mathrm{sigmoid}(A_k)$, is multiplied by the corresponding importance weight coefficient $s_k$ to obtain the calibrated second feature vector Ã. Illustratively, the k-th map in the calibrated second feature vector is $\tilde{A}_k=s_k\cdot\mathrm{sigmoid}(A_k)$.
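A minimal sketch of this recalibration step; the tensor shapes and the dimension over which the softmax is taken are assumptions made here for illustration:

```python
import torch

def recalibrate_attention(A: torch.Tensor) -> torch.Tensor:
    """A: (B, K, H, W) attention maps; returns the calibrated maps of the same shape."""
    g = A.mean(dim=(2, 3))                           # global average pooling, (B, K)
    s = torch.softmax(g, dim=1)                      # importance weight coefficient s_k per map
    return s[:, :, None, None] * torch.sigmoid(A)    # A~_k = s_k * sigmoid(A_k)

A = torch.rand(2, 64, 14, 14)
A_tilde = recalibrate_attention(A)
print(A_tilde.shape)  # torch.Size([2, 64, 14, 14])
```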
After the calibrated second feature vector Ã is obtained, the first feature vector F can be calibrated based on Ã to obtain the first reference feature vector f. The process of calibrating the first feature vector F based on the calibrated second feature vector Ã is similar to the process of calibrating it based on the second feature vector A: for example, an exclusive nor operation may be performed between each one-dimensional feature contained in the calibrated second feature vector (Attention Maps) Ã and each one-dimensional feature contained in the first feature vector, and the vector obtained for the k-th map (Attention Map) after the exclusive nor operation is $v_k=\sum_{i,j}F_{i,j,k}\odot\tilde{A}_{i,j,k}$; details are not repeated here.
S102: and training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
In one possible implementation manner, the first sample image may carry a pre-configured tag feature vector; the configuration process of the tag feature vector is not specifically limited in the present application. The feature extraction model can be trained based on the first reference feature vector f, the tag feature vector carried by the first sample image and the configured first loss function L, so as to obtain a feature extraction model that can be deployed online and used for extracting image feature vectors (such as feature vectors of face images).
Alternatively, the first loss function L may include a cross entropy loss function, a weighted difference regularization loss function, and an L2 regularization loss function. In order to extract high-precision image feature vectors, the first loss function may include the importance weight coefficient corresponding to each dimension feature. Illustratively, the cross entropy loss function in the first loss function L may be determined based on the importance weight coefficient $s_k$ and the cross entropy loss term parameter $L_{CE,k}$, for example by multiplying the importance weight coefficient $s_k$ corresponding to each dimension feature by the cross entropy loss term parameter $L_{CE,k}$ of the corresponding dimension and summing. For ease of understanding, the cross entropy loss function can be expressed as $L_{wCE}=\sum_k s_k\cdot L_{CE,k}$. By employing the cross entropy loss function $L_{wCE}$ in the embodiments of the present application, the loss contributed by the unimportant parts of the Attention Maps (those belonging to occluded or angle-deflected regions) is constrained by their small importance weight coefficients s, so that this part of the loss barely participates in the calculation of model convergence, which improves the accuracy of the feature vectors determined by the trained model (feature extraction model). The cross entropy loss term parameter $L_{CE,k}$ can be determined based on the prior art and is not described in detail herein.
Optionally, the weighted difference regularization loss function in the first loss function L can be determined based on the importance weight coefficient $s_k$ and the weighted difference loss term parameter $P_{i,j,k}$. For ease of understanding, the weighted difference regularization loss function can be expressed as $L_{wDIV}=1-\sum_{i,j}\max_k(s_k\cdot P_{i,j,k})$, where the weighted difference loss term parameter $P_{i,j,k}$ can be regarded as a penalty term for cross-overlapping features. Based on the weighted difference regularization loss function $L_{wDIV}$, feature vectors with cross overlap are weakened or removed, so that the loss caused by this part of the features does not participate in the calculation of model convergence, which improves the accuracy of the feature vectors determined by the trained model (feature extraction model). The weighted difference loss term parameter $P_{i,j,k}$ can be determined based on the prior art and is not described in detail herein. In addition, the L2 regularization loss function $L_{REG}$ can also be determined by adopting the prior art and is not described in detail.
Alternatively, the three sub-loss functions in the first loss function, namely the cross entropy loss function $L_{wCE}$, the weighted difference regularization loss function $L_{wDIV}$ and the L2 regularization loss function $L_{REG}$, may each be configured with a corresponding weight, and the first loss function L is determined as the sum of the products of the sub-loss functions and their corresponding weights. Illustratively, with the weight of the cross entropy loss function $L_{wCE}$ denoted by $\lambda_{wCE}$, the weight of the weighted difference regularization loss function $L_{wDIV}$ denoted by $\lambda_{wDIV}$, and the weight of the L2 regularization loss function $L_{REG}$ denoted by $\lambda_{REG}$, the first loss function is $L=\lambda_{wCE}L_{wCE}+\lambda_{wDIV}L_{wDIV}+\lambda_{REG}L_{REG}$. The specific training process for training the feature extraction model based on the first loss function may use the prior art and is not described in detail herein. The weight corresponding to each sub-loss function is not particularly limited and can be flexibly set according to requirements.
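As a hedged sketch of how such a combined loss could be assembled, assuming the per-map cross entropy terms $L_{CE,k}$, the penalty terms $P_{i,j,k}$ and the weights are supplied by the caller (their exact definitions are not reproduced in this section):

```python
import torch

def first_loss(s, l_ce_k, P, model_params, lam_wce=1.0, lam_wdiv=0.1, lam_reg=1e-4):
    """s: (K,) importance weights; l_ce_k: (K,) per-map CE terms; P: (H, W, K) penalty terms."""
    l_wce = (s * l_ce_k).sum()                                       # L_wCE = sum_k s_k * L_CE,k
    l_wdiv = 1.0 - (s[None, None, :] * P).max(dim=-1).values.sum()   # 1 - sum_{i,j} max_k(s_k * P_{i,j,k})
    l_reg = sum((p ** 2).sum() for p in model_params)                # plain L2 regularization
    return lam_wce * l_wce + lam_wdiv * l_wdiv + lam_reg * l_reg

# illustrative call with random placeholders
K, H, W = 64, 14, 14
loss = first_loss(torch.softmax(torch.rand(K), 0), torch.rand(K), torch.rand(H, W, K),
                  [torch.rand(8, 8)])
print(loss.item())
```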
Alternatively, when the feature extraction model is trained, whether the recognition result of the feature extraction model is accurate may be determined according to whether the first reference feature vector is consistent with the tag feature vector carried by the first sample image. In implementation, if they are inconsistent, the parameters of the feature extraction model need to be adjusted so that the first reference feature vector obtained based on the feature extraction model gradually approaches the tag feature vector, thereby training the feature extraction model.
In specific implementation, when the parameters in the feature extraction model are adjusted, a gradient descent algorithm may be adopted to back-propagate the gradients of the parameters of the feature extraction model, so as to train the feature extraction model.
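A minimal sketch of one such update step, with the model, loss function and optimizer treated as placeholders passed in by the caller:

```python
import torch

def train_step(model, loss_fn, optimizer, one_dim_sequence, label_vector):
    """One gradient-descent update of the feature extraction model (illustrative only)."""
    optimizer.zero_grad()
    f = model(one_dim_sequence)          # calibrated first reference feature vector (assumed output)
    loss = loss_fn(f, label_vector)      # configured first loss function
    loss.backward()                      # back-propagate the parameter gradients
    optimizer.step()                     # gradient descent update
    return loss.item()

# usage with any nn.Module and loss, e.g.:
# opt = torch.optim.SGD(model.parameters(), lr=1e-3)
# train_step(model, loss_fn, opt, sequence_batch, labels)
```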
In one possible implementation, the above operation may be performed on each first sample image in the first sample set, and the feature extraction model training is determined to be completed when a preset convergence condition is satisfied. The meeting of the preset convergence condition may be that the number of the sample images correctly identified by the first sample image set through the original feature extraction model is greater than a set number, or the number of iterations of training the feature extraction model reaches a set maximum number of iterations, etc. The implementation may be flexibly set, and is not particularly limited herein.
In one possible implementation manner, when training the original feature extraction model, the sample images in the sample set may be divided into training sample images and test sample images, the original feature extraction model is trained based on the training sample images, and then the reliability degree of the trained feature extraction model is verified based on the test sample images.
In the embodiment of the application, the first sample image can be processed into a one-dimensional sequence, and the one-dimensional sequence is input into the feature extraction model, so that the feature extraction model can perform feature extraction on the first sample image. Because the feature extraction model in the embodiment of the application is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model, the feature extraction model can capture more accurate feature vectors (second feature vectors) of faces and the like based on a self-attention mechanism. The first feature vector acquired by the convolutional neural network is calibrated based on the second feature vector to obtain a first reference feature vector that expresses the image features more accurately, and the feature extraction model is trained based on the first reference feature vector and the configured first loss function, which improves the accuracy of the feature vectors determined (extracted) by the trained feature extraction model and achieves the purpose of rapidly and accurately extracting high-precision image feature vectors.
In addition, because the feature extraction model in the embodiment of the application is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model, the feature extraction model can obtain accurate pixel-level feature vectors, which further improves the accuracy with which the trained feature extraction model determines (extracts) image feature vectors. Moreover, in the embodiment of the application, the unimportant features in the first feature vector can be weakened or removed (ignored) based on the self-attention mechanism, so that fewer and more precise feature vectors are obtained; this improves the accuracy of the obtained feature vectors, reduces calculation time, improves efficiency, introduces no extra calculation overhead, and alleviates the communication bottleneck of MPC.
For ease of understanding, the feature extraction model training process provided in this application is explained below by way of one specific embodiment. Referring to fig. 3, fig. 3 illustrates a second feature extraction model training process provided by some embodiments, the process comprising the steps of:
S301: for any one of the acquired first sample images, acquiring a first feature vector of the first sample image based on the convolutional neural network after training; and processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model. The feature extraction model is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model.
S302: carrying out global average pooling on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and obtaining an importance weight coefficient corresponding to each dimension feature in the global feature vector information based on a probability distribution function; based on the importance weight coefficient corresponding to each dimension feature, calibrating the second feature vector to obtain a calibrated second feature vector; and calibrating the first feature vector based on the calibrated second feature vector to obtain a calibrated first reference feature vector.
S303: and training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
For ease of understanding, the feature extraction model training process provided in this application is explained in the following by a specific embodiment. Referring to fig. 4, fig. 4 illustrates a schematic diagram of a third feature extraction model training process provided by some embodiments, the process including the steps of:
after any first sample image is obtained, the first sample image can be input into a convolutional neural network (such as a CNN neural network such as ResNet) after training is completed, and Feature extraction is performed on the first sample image based on the convolutional neural network, so as to obtain a Feature map of the first sample image, namely a first Feature vector (Feature Maps) F. Meanwhile, the first sample image may be processed into a one-dimensional sequence, wherein the process of processing the first sample image into the one-dimensional sequence may be as follows:
A 2-dimensional (2D) first sample image (such as a face image) whose length x width x number of color channels is H x W x 3 (unit: pixel) can be split into small patches of 16 x 16 pixels, and these small patches are flattened into a one-dimensional (1D) sequence of length L = HW/256, which is embedded (input Embedding) as the input of the feature extraction model (Transformer) to be trained. Alternatively, the original image position information may also be encoded using a position embedding algorithm: a corresponding position code $p_i$ is added to each vector $e_i$ in the one-dimensional sequence, so that the one-dimensional sequence (input sequence) E finally input into the feature extraction model can be expressed as $E=\{e_1+p_1,e_2+p_2,\ldots,e_L+p_L\}$.
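A hedged sketch of this image-to-sequence step; the embedding dimension, the learned position codes and the use of nn.Unfold are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, channels=3, dim=256):
        super().__init__()
        self.L = (img_size // patch) ** 2                       # L = HW / 256 for 16x16 patches
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)
        self.embed = nn.Linear(patch * patch * channels, dim)   # e_i
        self.pos = nn.Parameter(torch.zeros(1, self.L, dim))    # p_i

    def forward(self, x):                                       # x: (B, 3, H, W)
        patches = self.unfold(x).transpose(1, 2)                # (B, L, 16*16*3) flattened patches
        return self.embed(patches) + self.pos                   # E = {e_1+p_1, ..., e_L+p_L}

E = PatchEmbedding()(torch.rand(2, 3, 224, 224))
print(E.shape)  # torch.Size([2, 196, 256]), i.e. L = 224*224/256 = 196
```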
In one possible implementation, the one-dimensional sequence may be input into the feature extraction model to be trained. The framework of the feature extraction model may be the framework of the MPCFormer model, and its arithmetic logic (which may also be referred to as training logic) may be the arithmetic logic of the SETR model; the second feature vector (Attention Maps) extracted based on the self-attention mechanism can be obtained from the Encoders in the feature extraction model. Alternatively, each encoder in the feature extraction model may consist of a multi-head self-attention (MSA) block and a feed-forward neural network (also referred to as a Multi-Layer Perceptron, MLP); assume there are $L_e$ Encoders in total. The input of the self-attention of the l-th layer is $Z_{l-1}$, from which the triple (query, key, value) is computed: $query=Z_{l-1}W_Q$, $key=Z_{l-1}W_K$, $value=Z_{l-1}W_V$, where $W_Q$, $W_K$, $W_V$ are learnable weight matrices and d is the dimension of the (query, key, value) triple. Self-attention can then be expressed as $SA(Z_{l-1})=\mathrm{softmax}\!\left(\frac{query\cdot key^{T}}{\sqrt{d}}\right)\cdot value$.
Multi-head self-attention is obtained by splicing m SAs: MSA(Z_(l-1)) = [SA_1(Z_(l-1)); SA_2(Z_(l-1)); …; SA_m(Z_(l-1))]·W_O, where W_O ∈ R^(md×C). The output of each Encoder is Z_l = MSA(Z_(l-1)) + MLP(MSA(Z_(l-1))) ∈ R^(L×C), and the total output of all Encoders is {Z_1, Z_2, …, Z_(L_e)}. The conversion of a two-dimensional image into a one-dimensional sequence can thus be considered as converting the image x ∈ R^(C×H×W) into the sequence Z ∈ R^(L×C). Optionally, when feature extraction reasoning is performed based on the feature extraction model, the process of extracting the second feature vector A (Attention Maps) by the feature extraction model may be implemented by a general MPC calculation engine.
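The SA and MSA formulas above can be sketched as follows. This is an illustration only: the dimensions L, C, d and m are example values, and the weight matrices are random rather than learned.

```python
import torch

def self_attention(Z, W_Q, W_K, W_V):
    """SA(Z) = softmax(query . key^T / sqrt(d)) . value, as in the formula above."""
    query, key, value = Z @ W_Q, Z @ W_K, Z @ W_V
    d = query.shape[-1]
    scores = torch.softmax(query @ key.transpose(-2, -1) / d ** 0.5, dim=-1)
    return scores @ value

def multi_head_self_attention(Z, heads, W_O):
    """MSA(Z) = [SA_1(Z); SA_2(Z); ...; SA_m(Z)] . W_O (concatenation of m heads)."""
    return torch.cat([self_attention(Z, *h) for h in heads], dim=-1) @ W_O

# Example dimensions only: L tokens of width C, m heads of width d.
L, C, d, m = 196, 768, 64, 12
Z = torch.randn(L, C)
heads = [tuple(torch.randn(C, d) for _ in range(3)) for _ in range(m)]   # (W_Q, W_K, W_V) per head
W_O = torch.randn(m * d, C)
out = multi_head_self_attention(Z, heads, W_O)
print(out.shape)                                           # torch.Size([196, 768])
```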
Optionally, after the second feature vector A is obtained based on the feature extraction model, the second feature vector A may be calibrated (referred to as Attention-Map-based recalibration in the figure). When the second feature vector A is calibrated, global average pooling (global pooling) may be performed on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and based on a Softmax probability distribution function (the Softmax normalized exponential function, denoted f_ex(·) in the figure), an importance weight coefficient s (which may also be denoted S) corresponding to each dimension feature in the global feature vector information is obtained. The second feature vector A may then be transformed using a sigmoid function, and each transformed dimension feature sigmoid(A_k) is multiplied by the corresponding importance weight coefficient s_k (a calibration product) to obtain the calibrated second feature vector Ã, for example Ã_k = s_k · sigmoid(A_k).
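A minimal sketch of this recalibration step is given below, assuming each "dimension feature" corresponds to one channel of the attention maps; the tensor shapes are illustrative only.

```python
import torch

def recalibrate_attention_maps(A):
    """Channel recalibration sketch: global average pooling, Softmax importance
    weights s_k, then A~_k = s_k * sigmoid(A_k) for each channel k (assumed form)."""
    g = A.mean(dim=(-2, -1))                      # global average pooling: one value per channel
    s = torch.softmax(g, dim=-1)                  # importance weight coefficient per channel
    return s[:, :, None, None] * torch.sigmoid(A)

A = torch.randn(1, 12, 14, 14)                    # e.g. 12 attention maps of 14 x 14
A_tilde = recalibrate_attention_maps(A)
print(A_tilde.shape)                              # torch.Size([1, 12, 14, 14])
```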
After the calibrated second feature vector Ã is obtained, the first feature vector F can be calibrated based on Ã to obtain the calibrated first reference feature vector f. For example, each one-dimensional feature contained in the calibrated second feature vector (Attention Maps) Ã is combined, through an exclusive-nor operation, with each one-dimensional feature contained in the first feature vector F, and the k-th calibrated vector (Attention Map) f_k is obtained after the exclusive-nor operation, which is not described in detail herein. The process of calibrating the first feature vector through the exclusive-nor operation may also be referred to as an attention-based pooling process.
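Since the exact form of the exclusive-nor combination is not reproduced above, the following sketch uses a spatially weighted (attention-based) pooling as an assumed stand-in for combining each calibrated attention map with the CNN feature map; it illustrates the idea rather than the application's precise formula.

```python
import torch

def attention_based_pooling(F_map, A_tilde):
    """Assumed interpretation of the calibration step: each calibrated attention map
    A~_k weights the CNN feature map F spatially, and the weighted responses are pooled
    into one calibrated vector f_k per map (not the application's exact exclusive-nor formula)."""
    _, _, H, W = A_tilde.shape
    return torch.einsum('bkhw,bchw->bkc', A_tilde, F_map) / (H * W)   # (B, K, C)

F_map = torch.randn(1, 256, 14, 14)     # first feature vector F from the CNN
A_tilde = torch.rand(1, 12, 14, 14)     # calibrated second feature vector A~
print(attention_based_pooling(F_map, A_tilde).shape)   # torch.Size([1, 12, 256])
```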
Optionally, after the first reference feature vector is obtained, the feature extraction model may be trained based on the first reference feature vector, the configured first loss function, and the like. The feature extraction model obtained through training can be deployed online and used for extracting feature vectors of images such as face images.
For example, when feature extraction is performed by using the trained feature extraction model, upon receiving an image to be processed for feature extraction, the image to be processed may be input into the trained feature extraction model, and the feature vector of the image to be processed may be obtained based on the output result of the feature extraction model.
In one possible implementation, after the feature vector of the image to be processed is obtained, a multiparty security calculation process may be performed based on the feature vector. For example, the feature extraction model may be encrypted and deployed at a Server (Server), and one or more clients (clients) encrypt and upload data such as images, and perform multiparty collaborative reasoning or extract feature vectors of the images, so as to perform multiparty security calculation, where the process of performing multiparty security calculation based on the feature vectors may use the prior art, which is not described herein.
Example 2:
in order to ensure the execution efficiency of the multiparty security computation and reduce the communication delay of different participants participating in the multiparty security computation, in the embodiment of the present application, the excitation function used by the feature extraction model includes: 2quad excitation functions. In addition, the feature extraction model may be a model obtained based on knowledge distillation.
In one possible implementation, the selection of different excitation functions may have different effects on the computational efficiency of the MPC calculation engine. Referring to fig. 5, fig. 5 illustrates a schematic diagram of communication durations for different excitation functions provided by some embodiments. It can be seen that when the excitation function is the activation function based on the Gaussian error function (Gaussian Error Linear Units, GELU), the total operation duration (referred to as total time in the figure) is 11s, and the communication duration between the participants (referred to as comm. time in the figure) is 9.6s. When the excitation function is quad, the total operation duration is 0.5s, with a communication duration between the participants of 9.6s. When the excitation function is softmax, the total operation duration is 40s, of which the communication duration between the participants is 34.1s. When the excitation function is 2relu, the total operation duration is 14.7s, with a communication duration between the participants of 12.8s. And when the excitation function is 2quad, the total operation duration is 3.2s, with a communication duration between the participants of 1.7s. It can be seen that when the excitation functions of the feature extraction model include the 2quad excitation function, the communication duration between the participants, as well as the total operation duration, can be significantly reduced. Optionally, in addition to the 2quad excitation function, the excitation functions of the feature extraction model may also include the quad excitation function (GeLU(x) ≈ 0.125x² + 0.25x + 0.5).
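The quad approximation quoted above can be checked directly. The sketch below only implements the stated polynomial GeLU(x) ≈ 0.125x² + 0.25x + 0.5; the 2quad replacement of Softmax follows the same idea of replacing exponentials with squares and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def gelu_quad(x):
    """MPC-friendly 'quad' approximation of GELU given above: 0.125*x**2 + 0.25*x + 0.5.
    Only additions and one multiplication are needed, so no high-order Taylor terms under MPC."""
    return 0.125 * x * x + 0.25 * x + 0.5

x = torch.linspace(-2.0, 2.0, 5)
print(gelu_quad(x))     # polynomial approximation
print(F.gelu(x))        # exact GELU, for comparison on a small range
```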
In one possible implementation, to improve the inference speed and accuracy of the feature extraction model, the feature extraction model may be a small model obtained based on knowledge distillation. For example, a larger feature extraction model (an MPCFormer model) may be distilled to obtain a smaller model, and feature extraction may be performed using the small model obtained based on knowledge distillation.
The process of deriving the feature extraction model based on the framework of the MPCFormer model and the training logic of the SETR model is described below.
The Transformer model is the base of ChatGPT-scale large models; it is a deep neural network model based on a self-attention mechanism and can efficiently process sequence data in parallel. The Transformer model is a neural network that learns context and thus meaning by tracking relationships in sequence data (for example, the words in a sentence), and has achieved great success in language understanding, especially in large models. Before the Transformer appeared, users had to train neural networks using large labeled datasets, which were costly and time-consuming to produce. As a base model, the Transformer can, in many cases, replace convolutional neural networks (Convolutional Neural Networks, CNN) and recurrent neural networks (Recurrent Neural Network, RNN). The Transformer model is implemented based on an Encoder-Decoder framework, with large Encoder and Decoder modules that process the data; referring to fig. 6, fig. 6 illustrates a schematic diagram of a Transformer model provided by some embodiments. The Transformer model is roughly divided into two parts, namely an Encoder and a Decoder, corresponding to the left and right parts in fig. 6 (the left part corresponds to the Encoder and the right part corresponds to the Decoder).
Specifically, the Input part (Inputs) of the Transformer model includes two components, namely word embedding (Input Embedding) and position-based embedding (also referred to as position coding, Positional Encoding). Whether the input is text or an image, Input Embedding converts its digital representation into a vector representation, so that relationships between words or between images can be captured in a high-dimensional space. Position coding is added after Input Embedding, so that information produced by the same content at different input positions can be added into the vectors, supplementing the position information. Both the encoder and the decoder contain inputs, and the structure of the inputs of the two parts is the same, but their use in reasoning differs: the encoder performs reasoning only once, while the decoder performs cyclic reasoning like an RNN, continuously generating prediction results.
The encoder section of the Transformer model is formed by stacking N encoder layers, each of which is made up of two sub-layer connection structures. The first sub-layer comprises a multi-head self-attention (MSA), a normalization layer and a residual connection; the second sub-layer comprises a feed-forward fully connected layer (feed forward), a normalization layer and a residual connection. Both sub-layers are followed by an addition-and-normalization layer (Add & Norm). Each encoder layer completes one feature extraction process, i.e., one encoding process, on its input.
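A minimal sketch of one such encoder layer follows; the 768-dimensional width, 12 heads and the exact ordering of the Add & Norm steps are assumptions for illustration, not the exact configuration of the model described here.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer as described above: multi-head self-attention and a feed-forward
    sub-layer, each wrapped with a residual connection and an Add & Norm step (sketch)."""
    def __init__(self, dim=768, heads=12, ffn_dim=3072):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.msa(x, x, x, need_weights=False)[0])  # sub-layer 1: MSA + Add & Norm
        return self.norm2(x + self.ffn(x))                            # sub-layer 2: FFN + Add & Norm

x = torch.randn(1, 196, 768)
print(EncoderLayer()(x).shape)   # torch.Size([1, 196, 768])
```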
The decoder of the Transformer model is likewise stacked with N identical layers, but with a slightly different structure from each layer in the encoder. Each layer of the decoder contains, in addition to the two sub-layers of the encoder, one further sub-layer: a mask-based multi-head attention layer (Masked Multi-Head Attention). As shown in the figure, each sub-layer also uses an addition-and-normalization layer (Add & Norm).
The decoder of the Transformer model performs a feature extraction operation on the target, i.e., a decoding process, based on the result of the encoder and the result of the last prediction. At output time, a Linear layer (fully connected output layer, Linear) performs a linear transformation to obtain an output of the specified dimension, i.e., a dimension conversion. A Softmax layer then scales the numbers in the one-dimensional vector into probability values in the range 0-1 that sum to 1, yielding the prediction result (Output Probabilities) of the model.
Alternatively, the Transformer model may be pre-trained on a large dataset to gain general understanding, and then fine-tuned on a small downstream dataset to learn the characteristics of a particular task.
The reasoning process of the Transformer model can be expressed as a 2-party computation (2PC). In 2PC, for example, the user inputs data and the model provider inputs a Transformer model; together they calculate an inference result. Throughout the reasoning process, 2PC ensures that each party only learns information about its own inputs and the result. Transformer reasoning with MPC multiparty participation can be realized based on secure multiparty computation and knowledge distillation technology. The MPCFormer model is a trained (or fine-tuned) Transformer model with low inference delay and high ML performance: parts of its computation are replaced with given MPC-friendly approximations, and knowledge distillation is used to construct a high-performance approximated Transformer model. During reasoning, the MPCFormer model implements private model reasoning with the MPC engine.
The reasoning process based on the MPCFormer model can be described in two stages: a model conversion stage and an inference stage. The conversion stage can be described as: S = MPCFormer(T, D, A), where T is the trained Transformer model, D is the input dataset, and A is the MPC-based calculation (approximation) process. The inference stage can be described as: Y = MPC_S(X). In MPC-based Transformer model inference, the GeLU function and the Softmax function are the main sources of the communication bottleneck (high communication complexity). GeLU computation is slow (time-consuming) because the error function has to be evaluated through a high-order Taylor expansion (involving a large number of multiplications), and Softmax computation is slow because the exponential function has to be evaluated through multiple squaring iterations, again involving a large number of multiplications.
The SETR model is described below. SETR consists essentially of three parts: input-conversion-output. Referring to fig. 7, fig. 7 illustrates a schematic diagram of an SETR model according to some embodiments, where part (a) in fig. 7 is mainly an input preprocessing and feature extraction process of the SETR model, (b) is mainly a progressive upsampling process, and part (c) is mainly a multi-level feature aggregation process.
First, the original input picture needs to be processed into a format that the Transformer (SETR) can support. That is, the input image is subjected to a slicing process, each 2D image slice (patch) is flattened into part of a one-dimensional (1D) sequence, and the one-dimensional sequence is input into the model as a whole. To encode the spatial information of each slice, a specific embedding may be learned for each local position and added to the linear projection to form the final input sequence. In this way, although the Transformer (SETR) itself is order-agnostic, the corresponding spatial location information (p) can still be retained.
The one-dimensional sequence is input into the SETR model, and a fully connected layer (linear projection layers) in the SETR model processes the one-dimensional sequence to obtain the vector of each slice (Patch encoding) and the position information vector (Position Embedding) corresponding to the slice. The SETR model includes a plurality of Transformer layers (transformer layer). Optionally, each transformer layer mainly comprises two parts: a multi-head self-attention (MSA) and a multi-layer perceptron (Multilayer Perceptron, MLP), both of which are connected to a layer normalization process (Layer Normalization, Layer Norm). Layer Norm is a normalization technique used in deep neural networks that normalizes the output of each neuron in the network so that the output of each layer in the network has a similar distribution.
Feature extraction can be performed by inputting the one-dimensional sequence into the Transformer architecture (SETR model): the Encoder compresses the spatial resolution of the original input image and progressively extracts more advanced abstract semantic features, and the Decoder upsamples the advanced features extracted by the Encoder back to the original input resolution for pixel-level prediction, enabling segmentation of the image.
Referring to part (b) of fig. 7, the object of the Decoder is to generate a segmentation result on the original two-dimensional image (H×W), and the output Z of the Encoder needs to be reshaped (reshape) from the two-dimensional HW/256 × C to the three-dimensional feature map H/16 × W/16 × C.
The decoder may employ Progressive UPsampling (PUP). Specifically, considering that direct one-step upsampling introduces large noise, convolution and upsampling may be performed alternately with the upsampling factor set to 2 to mitigate the noise. Starting from Z_(L_e) of size H/16 × W/16, 4 such upsampling operations are performed to recover the original picture size.
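A minimal sketch of such a PUP decoder follows, under assumed channel widths and an assumed number of output classes; the reshaped H/16 × W/16 encoder output is taken as input.

```python
import torch
import torch.nn as nn

class PUPDecoder(nn.Module):
    """Progressive UPsampling sketch: alternate a convolution with a 2x upsample,
    four times in total, so H/16 x W/16 is restored to H x W (channel widths assumed)."""
    def __init__(self, in_channels=768, mid_channels=256, num_classes=19):
        super().__init__()
        blocks, c = [], in_channels
        for _ in range(4):                                   # 2^4 = 16, the total upsampling factor
            blocks += [nn.Conv2d(c, mid_channels, 3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)]
            c = mid_channels
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(mid_channels, num_classes, 1)  # pixel-level prediction

    def forward(self, z):                                    # z: (B, C, H/16, W/16)
        return self.head(self.blocks(z))

z = torch.randn(1, 768, 14, 14)                              # reshaped encoder output (H = W = 224)
print(PUPDecoder()(z).shape)                                 # torch.Size([1, 19, 224, 224])
```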
See part (c) of fig. 7 for the process of multi-level feature fusion (MLA). Multi-level feature fusion is similar to a feature pyramid, with the difference that the features Z_l come from the individual layers of the Transformer and all have the same resolution. A layer of features is extracted every L_e/M layers, giving {Z_m} (m ∈ {L_e/M, 2L_e/M, …, M·L_e/M}), i.e., the features of M paths are extracted in total. In each path, the feature is first reshaped from HW/256 × C to H/16 × W/16 × C, then processed using a three-layer neural network (convolution kernel sizes 1×1, 3×3), with the number of channels halved at the first and third layers respectively, and 4-fold bilinear interpolation upsampling is performed after the third layer. From the 2nd path onward, the features of the previous paths are fused in sequence to enhance the information interaction between the paths; after passing through another 3×3 convolution layer, the feature maps obtained by all paths are spliced in the channel dimension and upsampled by 4-fold bilinear interpolation to recover the original image size.
The process of obtaining the feature extraction model based on the framework structure of the MPCFormer model and the training logic of the SETR can be understood as replacing the framework structure of the SETR model in part (a) of fig. 7 with the framework structure of the MPCFormer model, and training the MPCFormer model using the training logic (operation logic) of part (b) and part (c) of fig. 7, so that the feature extraction model is obtained.
Example 3:
based on the same technical concept, the present application further provides a convolutional neural network training method, referring to fig. 8, fig. 8 shows a schematic diagram of a convolutional neural network training process provided by some embodiments, where the process includes the following steps:
s801: inputting the second sample image into a convolutional neural network to be trained aiming at any acquired second sample image, and acquiring a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model after training, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; and calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR.
In one possible implementation manner, when training a convolutional neural network (Convolutional Neural Network, CNN), such as a depth residual network (res net), any one of the second sample images may be acquired from the second sample image set in order to improve the accuracy of the feature vector obtained by the convolutional neural network, and the acquired second sample image is input into the convolutional neural network to be trained. The second sample image may be the same as or different from the first sample image, which is not specifically limited in this application. For example, the second sample image and the first sample image may be images containing an occluded face or an angularly tilted face, or the like.
Optionally, the convolutional neural network performs feature extraction on the second sample image to obtain a third feature vector F of the second sample image, where, for convenience of description, the third feature vector and the first feature vector obtained by the convolutional neural network are both denoted by F.
In one possible implementation manner, the second sample image may be further processed into a one-dimensional sequence based on a preset image processing manner, the one-dimensional sequence is input into a feature extraction model after training, and feature extraction is performed on the second sample image based on the feature extraction model, so as to obtain a fourth feature vector of the second sample image. The feature extraction model in the embodiment of the present application is the same as the feature extraction model in the above embodiments, and is a model obtained based on the framework structure of the mpcfomer model and the training logic of the SETR model, which is not described herein again. For convenience of description, the fourth feature vector and the second feature vector obtained by the feature extraction model are denoted by a.
Optionally, in order to improve the accuracy of the feature vector obtained by the convolutional neural network, the third feature vector F may be calibrated based on the fourth feature vector A, to obtain a calibrated second reference feature vector f. The process of calibrating the third feature vector F based on the fourth feature vector A is similar to the process of calibrating the first feature vector F based on the second feature vector A in the above embodiment. For example, an exclusive-nor operation may be performed on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector, so as to obtain the calibrated second reference feature vector. Alternatively, the exclusive-nor operation may be performed on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector, the vector obtained after the exclusive-nor operation may be normalized, and the calibrated second reference feature vector may be determined based on the normalized vector. In addition, global average pooling can be carried out on the fourth feature vector to obtain global feature vector information corresponding to each dimension feature in the fourth feature vector, an importance weight coefficient corresponding to each dimension feature in the global feature vector information can be obtained based on a probability distribution function, and the fourth feature vector can be calibrated based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated fourth feature vector (Ã); the third feature vector is then calibrated based on the calibrated fourth feature vector, which is not described herein again.
S802: and training the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image and the configured second loss function.
In one possible implementation manner, the second sample image may carry a pre-configured tag feature vector, where the configuration process of the tag feature vector is not specifically limited in the present application. The convolutional neural network can be trained based on the second reference feature vector f, the tag feature vector carried by the second sample image and the configured second loss function, so as to obtain a convolutional neural network which can be deployed online and used for extracting image feature vectors (such as feature vectors of face images).
Alternatively, the second loss function may include a cross entropy loss function, a weighted difference regularized loss function, and an L2 regularized loss function. Wherein, the second loss function may be configured by adopting the prior art, and will not be described herein.
Optionally, when training the convolutional neural network, whether the recognition result of the convolutional neural network is accurate may be determined according to whether the second reference feature vector is consistent with the tag feature vector carried by the second sample image. In the implementation, if the identification result of the convolutional neural network is inconsistent, the parameters of the convolutional neural network need to be adjusted so that the second reference feature vector obtained based on the convolutional neural network can gradually approach the label feature vector, and the convolutional neural network is trained.
In specific implementation, when the parameters in the convolutional neural network are adjusted, a gradient descent algorithm can be adopted to counter-propagate the gradient of the parameters of the convolutional neural network, so that the convolutional neural network is trained.
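A minimal sketch of this parameter-update step is shown below; the toy backbone, the MSE loss standing in for the configured second loss function, and the learning rate are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy backbone standing in for the convolutional neural network to be trained.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128))
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                        # placeholder for the configured second loss function

image = torch.randn(8, 3, 112, 112)           # a batch of second sample images
label_vec = torch.randn(8, 128)               # label feature vectors carried by the samples

reference_vec = backbone(image)               # stands in for the calibrated second reference vector f
loss = loss_fn(reference_vec, label_vec)      # drives f toward the label feature vector
optimizer.zero_grad()
loss.backward()                               # back-propagate the parameter gradients
optimizer.step()                              # gradient-descent parameter update
```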
In one possible implementation, the above operation may be performed on each second sample image in the second sample set, and the convolutional neural network training is determined to be completed when a preset convergence condition is satisfied. The meeting of the preset convergence condition may be that the number of the second sample images in the second sample set passing through the original convolutional neural network is greater than a set number, or the number of iterations of training the convolutional neural network reaches a set maximum number of iterations, etc. The implementation may be flexibly set, and is not particularly limited herein.
In one possible implementation manner, when the original convolutional neural network is trained, the sample images in the sample set can be divided into training sample images and test sample images, the original convolutional neural network is trained based on the training sample images, and then the trained convolutional neural network is verified to be reliable based on the test sample images.
In the embodiment of the application, the second sample image can be input into the feature extraction model, so that the feature extraction model performs feature extraction on the second sample image. Because the feature extraction model in the embodiment of the application is a model obtained by training based on the framework structure of the MPCFormer model and the training logic of the SETR model, it can capture more accurate feature vectors (the fourth feature vector) of faces and the like based on a self-attention mechanism. The third feature vector obtained by the convolutional neural network can therefore be calibrated based on the fourth feature vector, giving a second reference feature vector that expresses the image features more accurately. Training the convolutional neural network based on the second reference feature vector and the configured second loss function can improve the accuracy with which the trained convolutional neural network determines (extracts) image feature vectors, thereby achieving the purpose of rapidly and accurately extracting high-accuracy image feature vectors.
For example, when feature extraction is performed by using the trained convolutional neural network, upon receiving an image to be processed for feature extraction, the image to be processed may be input into the trained convolutional neural network, and the feature vector of the image to be processed may be obtained based on the output result of the convolutional neural network. After the feature vector of the image to be processed is obtained, a multiparty security calculation process can be performed based on the feature vector. Illustratively, referring to FIG. 9, FIG. 9 illustrates a schematic diagram of a multiparty secure computing process provided by some embodiments. The convolutional neural network may be deployed at a Server, and after the feature vector of the image to be processed is obtained through the trained convolutional neural network, the feature vector may be split into secret shares and sent to a plurality of clients (Clients) that cooperate to perform multiparty security computation, as sketched below. For example, the feature vector may be divided into secret sharing segment 1 and secret sharing segment 2; secret sharing segment 1 may be distributed to the Client of participant 1 in the multiparty security computation, and secret sharing segment 2 may be distributed to the Client of participant 2. Each Client then cooperatively extracts features based on the convolutional neural network (CNN) in a secret sharing manner so as to perform multiparty security calculation. This multiparty security calculation mode can prevent biometric privacy information such as the user's face from being revealed, and improves the security of user data. The process of performing multiparty security computation based on the feature vector may adopt the prior art and will not be described herein.
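A minimal sketch of splitting a feature vector into two additive secret shares is given below; this is a simplification, as a real MPC engine would work over a fixed-point ring and add further protections.

```python
import torch

def split_into_shares(feature_vec):
    """Additive secret sharing sketch: the feature vector is split into two random-looking
    shares whose sum reconstructs it; each participating client receives one share."""
    share_1 = torch.randint(-2**31, 2**31, feature_vec.shape, dtype=torch.int64)
    share_2 = feature_vec - share_1
    return share_1, share_2

feature_vec = torch.randint(-1000, 1000, (128,), dtype=torch.int64)   # quantized feature vector
s1, s2 = split_into_shares(feature_vec)
assert torch.equal(s1 + s2, feature_vec)     # only the sum of both shares reveals the original vector
```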
Example 4:
based on the same technical concept, the present application further provides a model training method, referring to fig. 10, fig. 10 shows a schematic diagram of a model training process provided by some embodiments, where the process includes the following steps:
s1001: aiming at any acquired third sample image, acquiring a fifth feature vector of the third sample image based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; and calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR.
In one possible implementation, the convolutional neural network and the feature extraction model may be trained simultaneously, respectively. For example, when the convolutional neural network and the feature extraction model are trained, in order to improve the accuracy of feature vectors obtained by the convolutional neural network and the feature extraction model after the training is completed, any one of the third sample images may be obtained from the third sample image set, and the obtained third sample image is input into the convolutional neural network to be trained. The third sample image may be the same as or different from the first sample image and the second sample image, which is not specifically limited in this application. The third sample image may be, for example, an image containing an occluded face or an angularly tilted face, etc.
Optionally, the convolutional neural network performs feature extraction on the third sample image to obtain a fifth feature vector F of the third sample image, where, for convenience of description, the fifth feature vector, the third feature vector and the first feature vector obtained by the convolutional neural network are all denoted by F.
In one possible implementation manner, the third sample image may be processed into a one-dimensional sequence based on a preset image processing manner, the one-dimensional sequence is input into a feature extraction model to be trained, and feature extraction is performed on the third sample image based on the feature extraction model, so as to obtain a sixth feature vector of the third sample image. The feature extraction model in the embodiment of the present application is the same as the feature extraction model in the above embodiments, and is a model obtained based on the framework structure of the mpcfomer model and the training logic of the SETR model, which is not described herein again. For convenience of description, the sixth feature vector, the fourth feature vector, and the second feature vector obtained by the feature extraction model are denoted by a.
Optionally, in order to improve the accuracy of the feature vectors obtained by the convolutional neural network and the feature extraction model, the fifth feature vector F may be calibrated based on the sixth feature vector A, to obtain a calibrated third reference feature vector f. The process of calibrating the fifth feature vector F based on the sixth feature vector A is similar to the process of calibrating the first feature vector F based on the second feature vector A and the process of calibrating the third feature vector F based on the fourth feature vector A in the above embodiments. For example, an exclusive-nor operation may be performed on each one-dimensional feature contained in the sixth feature vector and each one-dimensional feature contained in the fifth feature vector, so as to obtain the calibrated third reference feature vector. Alternatively, the exclusive-nor operation may be performed on each one-dimensional feature contained in the sixth feature vector and each one-dimensional feature contained in the fifth feature vector, the vector obtained after the exclusive-nor operation may be normalized, and the calibrated third reference feature vector may be determined based on the normalized vector. In addition, global average pooling can be carried out on the sixth feature vector to obtain global feature vector information corresponding to each dimension feature in the sixth feature vector, an importance weight coefficient corresponding to each dimension feature in the global feature vector information can be obtained based on a probability distribution function, and the sixth feature vector can be calibrated based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated sixth feature vector (Ã); the fifth feature vector is then calibrated based on the calibrated sixth feature vector, and the like, which is not described herein again.
S1002: training the feature extraction model based on the third reference feature vector, the tag feature vector carried by the third sample image and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
The process of training the feature extraction model based on the third reference feature vector, the tag feature vector carried by the third sample image, and the configured first loss function is similar to the process of training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image, and the configured first loss function in the above embodiment, and is not repeated herein.
The training process of the convolutional neural network based on the third reference feature vector, the tag feature vector carried by the third sample image and the configured second loss function is similar to the training process of the convolutional neural network based on the second reference feature vector, the tag feature vector carried by the second sample image and the configured second loss function in the above embodiment, and is not repeated herein.
Example 5:
based on the same technical concept, the present application further provides a feature extraction method, referring to fig. 11, fig. 11 shows a schematic view of a feature extraction process provided by some embodiments, where the process includes the following steps:
s1101: an image to be processed is received.
S1102: and obtaining the feature vector of the image to be processed based on the feature extraction model or the convolutional neural network obtained by training by any one of the methods.
In one possible embodiment, the method further comprises:
and based on the feature vector, performing multiparty security calculation.
Example 6:
based on the same technical concept, the present application provides a feature extraction model training apparatus, referring to fig. 12, fig. 12 shows a schematic diagram of a feature extraction model training apparatus provided in some embodiments, where the apparatus includes:
a first calibration module 1201, configured to obtain, for any one of the obtained first sample images, a first feature vector of the first sample image based on the convolutional neural network after training is completed; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
A first training module 1202, configured to train the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image, and a configured first loss function.
In a possible implementation manner, the first calibration module 1201 is further configured to:
carrying out global average pooling on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and obtaining an importance weight coefficient corresponding to each dimension feature in the global feature vector information based on a probability distribution function; calibrating the second feature vector based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated second feature vector; and based on the calibrated second feature vector, carrying out the subsequent step of calibrating the first feature vector based on the second feature vector.
In a possible implementation manner, the first loss function includes an importance weight coefficient corresponding to each dimension feature.
In one possible implementation, the first calibration module 1201 is specifically configured to:
and performing an exclusive nor operation on each one-dimensional feature contained in the second feature vector and each one-dimensional feature contained in the first feature vector to obtain a calibrated first reference feature vector.
In a possible implementation manner, the first calibration module 1201 is further configured to:
and carrying out normalization processing on the vector obtained after the exclusive OR operation, and determining a calibrated first reference feature vector based on the vector obtained after the normalization processing.
In one possible implementation, the excitation function used by the feature extraction model includes: 2quad excitation functions.
In a possible implementation manner, the feature extraction model is a model obtained based on knowledge distillation.
Example 7:
based on the same technical concept, the present application provides a convolutional neural network training device, referring to fig. 13, fig. 13 shows a schematic diagram of a convolutional neural network training device provided in some embodiments, where the device includes:
a second calibration module 1301, configured to input, for any second sample image obtained, the second sample image into a convolutional neural network to be trained, and obtain a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model after training, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
A second training module 1302, configured to train the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image, and a configured second loss function.
In a possible implementation manner, the second calibration module 1302 is specifically configured to:
and performing an exclusive nor operation on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector to obtain a calibrated second reference feature vector.
In a possible implementation, the second calibration module 1302 is further configured to:
and carrying out normalization processing on the vector obtained after the exclusive OR operation, and determining a calibrated second reference feature vector based on the vector obtained after the normalization processing.
Example 8:
based on the same technical concept, the present application provides a model training apparatus, referring to fig. 14, fig. 14 shows a schematic diagram of a model training apparatus provided in some embodiments, where the apparatus includes:
a third calibration module 1401, configured to obtain, for any one of the obtained third sample images, a fifth feature vector of the third sample image based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on a frame structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
A third training module 1402, configured to train the feature extraction model based on the third reference feature vector, the tag feature vector carried by the third sample image, and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
Example 9:
based on the same technical concept, the present application provides a feature extraction device, referring to fig. 15, fig. 15 shows a schematic diagram of a feature extraction device provided by some embodiments, where the device includes:
a receiving module 1501 for receiving an image to be processed;
an extraction module 1502, configured to obtain a feature vector of the image to be processed based on a feature extraction model trained by the method according to any one of the first aspect and the third aspect, or a convolutional neural network trained by the method according to any one of the second aspect and the third aspect.
In one possible embodiment, the apparatus further comprises:
and the multiparty safety calculation module is used for carrying out multiparty safety calculation based on the feature vector.
Example 10:
based on the same technical concept, the present application further provides an electronic device, fig. 16 shows a schematic structural diagram of an electronic device provided by some embodiments, and as shown in fig. 16, the electronic device includes: the device comprises a processor 1601, a communication interface 1602, a memory 1603 and a communication bus 1604, wherein the processor 1601, the communication interface 1602 and the memory 1603 are in communication with each other through the communication bus 1604;
the memory 1603 stores a computer program that, when executed by the processor 1601, causes the processor 1601 to perform the steps of any of the method embodiments described above, which are not described herein.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the figure is drawn with only one thick line, but this does not mean that there is only one bus or only one type of bus.
The communication interface 1602 is used for communication between the electronic device and other devices described above.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; but also digital instruction processors (Digital Signal Processing, DSP), application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
Example 11:
based on the same technical concept, the embodiments of the present application provide a computer readable storage medium, in which a computer program executable by an electronic device is stored, where when the program runs on the electronic device, the program causes the electronic device to implement the steps in any of the method embodiments described above when the program is executed, which is not described herein again.
Based on the same technical idea, the present application provides a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the method embodiments described above as applied to an electronic device.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof, and may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions that, when loaded and executed on a computer, fully or partially produce a process or function in accordance with embodiments of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
Claims (19)
1. A method of training a feature extraction model, the method comprising:
for any one of the acquired first sample images, acquiring a first feature vector of the first sample image based on a convolutional neural network after training; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
and training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
2. The method of claim 1, wherein after the obtaining the second feature vector of the first sample image, the method further comprises, prior to calibrating the first feature vector based on the second feature vector:
Carrying out global average pooling on the second feature vector to obtain global feature vector information corresponding to each dimension feature in the second feature vector, and obtaining an importance weight coefficient corresponding to each dimension feature in the global feature vector information based on a probability distribution function; calibrating the second feature vector based on the importance weight coefficient corresponding to each dimension feature to obtain a calibrated second feature vector; and based on the calibrated second feature vector, carrying out the subsequent step of calibrating the first feature vector based on the second feature vector.
3. The method according to claim 2, wherein the first loss function includes importance weight coefficients corresponding to each of the dimension features.
4. The method of claim 1, wherein calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector comprises:
and performing an exclusive nor operation on each one-dimensional feature contained in the second feature vector and each one-dimensional feature contained in the first feature vector to obtain a calibrated first reference feature vector.
5. The method of claim 4, wherein after the performing the exclusive nor operation on each of the dimensional features included in the second feature vector and each of the dimensional features included in the first feature vector, the method further comprises, prior to the obtaining the calibrated first reference feature vector:
and carrying out normalization processing on the vector obtained after the exclusive OR operation, and determining a calibrated first reference feature vector based on the vector obtained after the normalization processing.
6. The method of claim 1, wherein the feature extraction model uses an excitation function comprising: 2quad excitation functions.
7. The method of claim 1, wherein the feature extraction model is a model derived based on knowledge distillation.
8. A convolutional neural network training method, the method comprising:
inputting the second sample image into a convolutional neural network to be trained aiming at any acquired second sample image, and acquiring a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model after training, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
And training the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image and the configured second loss function.
9. The method of claim 8, wherein calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector comprises:
and performing an exclusive nor operation on each one-dimensional feature contained in the fourth feature vector and each one-dimensional feature contained in the third feature vector to obtain a calibrated second reference feature vector.
10. The method of claim 9, wherein after the performing the exclusive nor operation on each of the dimensional features included in the fourth feature vector and each of the dimensional features included in the third feature vector, the method further comprises, prior to the obtaining the calibrated second reference feature vector:
and carrying out normalization processing on the vector obtained after the exclusive OR operation, and determining a calibrated second reference feature vector based on the vector obtained after the normalization processing.
11. A method of model training, the method comprising:
Aiming at any acquired third sample image, acquiring a fifth feature vector of the third sample image based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on a frame structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
training the feature extraction model based on the third reference feature vector, the tag feature vector carried by the third sample image and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
12. A method of feature extraction, the method comprising:
Receiving an image to be processed;
the feature vector of the image to be processed is obtained based on a feature extraction model trained by the method of any one of claims 1 to 7, 11, or based on a convolutional neural network trained by the method of any one of claims 8 to 11.
13. The method according to claim 12, wherein the method further comprises:
and based on the feature vector, performing multiparty security calculation.
14. A feature extraction model training apparatus, the apparatus comprising:
the first calibration module is used for acquiring a first feature vector of any one of the acquired first sample images based on the convolutional neural network after training; processing the first sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a second feature vector of the first sample image based on the feature extraction model; calibrating the first feature vector based on the second feature vector to obtain a calibrated first reference feature vector, wherein the feature extraction model is a model obtained by training based on a framework structure of an MPCFomer model and training logic of a semantic visual transducer model SETR;
And the first training module is used for training the feature extraction model based on the first reference feature vector, the tag feature vector carried by the first sample image and the configured first loss function.
15. A convolutional neural network training device, the device comprising:
the second calibration module is used for inputting any acquired second sample image into a convolutional neural network to be trained, and obtaining a third feature vector of the second sample image based on the convolutional neural network; processing the second sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a trained feature extraction model, and obtaining a fourth feature vector of the second sample image based on the feature extraction model; and calibrating the third feature vector based on the fourth feature vector to obtain a calibrated second reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of an MPCFormer model and the training logic of the semantic visual Transformer model SETR;
and the second training module is used for training the convolutional neural network based on the second reference feature vector, the label feature vector carried by the second sample image and the configured second loss function.
16. A model training apparatus, the apparatus comprising:
the third calibration module is used for obtaining, for any acquired third sample image, a fifth feature vector of the third sample image based on a convolutional neural network to be trained; processing the third sample image into a one-dimensional sequence based on a preset image processing mode, inputting the one-dimensional sequence into a feature extraction model to be trained, and obtaining a sixth feature vector of the third sample image based on the feature extraction model; and calibrating the fifth feature vector based on the sixth feature vector to obtain a calibrated third reference feature vector, wherein the feature extraction model is a model obtained by training based on the framework structure of an MPCFormer model and the training logic of the semantic visual Transformer model SETR;
the third training module is used for training the feature extraction model based on the third reference feature vector, the label feature vector carried by the third sample image and the configured first loss function; and training the convolutional neural network based on the third reference feature vector, the label feature vector carried by the third sample image and the configured second loss function.
17. A feature extraction apparatus, the apparatus comprising:
the receiving module is used for receiving the image to be processed;
an obtaining module, configured to obtain a feature vector of the image to be processed based on the feature extraction model trained by the method of any one of claims 1 to 7 and 11, or based on the convolutional neural network trained by the method of any one of claims 8 to 11.
18. An electronic device comprising at least a processor and a memory, the processor being adapted to implement the steps of the method according to any one of claims 1 to 13 when executing a computer program stored in the memory.
19. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311518180.8A CN117541804A (en) | 2023-11-14 | 2023-11-14 | Model training method, feature extraction method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117541804A (en) | 2024-02-09
Family
ID=89785488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311518180.8A Pending CN117541804A (en) | 2023-11-14 | 2023-11-14 | Model training method, feature extraction method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117541804A (en) |
2023-11-14: application CN202311518180.8A filed in China (CN); publication CN117541804A, status pending.
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |