CN110009013B - Encoder training and representation information extraction method and device - Google Patents

Encoder training and representation information extraction method and device

Info

Publication number
CN110009013B
CN110009013B CN201910219343.XA
Authority
CN
China
Prior art keywords
loss
sample data
image
data
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910219343.XA
Other languages
Chinese (zh)
Other versions
CN110009013A (en)
Inventor
Jiao Jianbo
Bao Linchao
Wei Yunchao
Shi Honghui
Liu Yongxiong
Liu Wei
Huang Xutao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910219343.XA
Publication of CN110009013A
Application granted
Publication of CN110009013B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The method comprises: for original sample data and at least two loss data of the original sample data, using encoders with the same model parameters to obtain corresponding encoding features; decoding each encoding feature with a corresponding decoder; and obtaining a predicted loss based on the encoding features, the original sample data, and the decoding features. If the predicted loss meets a preset convergence condition, a target encoder is initialized with the model parameters and used to obtain the characterization information of data. This improves the efficiency and effect of encoder training and the validity of the extracted characterization information.

Description

Encoder training and representation information extraction method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for encoder training and representation information extraction.
Background
Machine learning: an algorithm family that automatically analyzes data to obtain rules and uses those rules to predict unknown data. Because machine learning algorithms involve a large number of statistical theories, machine learning is particularly closely associated with inferential statistics and is also called statistical learning theory. In terms of algorithm design, machine learning theory is concerned with learning algorithms that are realizable and effective. Machine learning is the core of artificial intelligence and the fundamental approach to making computers intelligent; it is applied in all fields of artificial intelligence and mainly uses induction and synthesis rather than deduction.
Machine learning tasks, such as classification problems, typically require input that is easy to process mathematically or computationally. However, real-world data such as pictures, video, and sensor measurements are complex, redundant, and variable, so how to extract and express features effectively is very important.
Since traditional manual feature extraction requires a great deal of manpower, relies on highly specialized knowledge, and is hard to generalize, characterization learning emerged. So-called characterization learning is a collection of techniques for learning features, i.e., for converting original sample data into a form that machine learning can exploit efficiently. It avoids the trouble of manually extracting features and allows a computer to learn how to extract features while learning to use them.
In the prior art, during characterization learning, an encoder is usually trained by learning multiple tasks simultaneously or by means of discriminant learning. The characterization information of data is then extracted by the trained encoder, a required target model is built on top of the trained encoder, and data processing is performed with that target model; for example, transfer learning can be carried out further using the characterization information.
Because an encoder for extracting the representation information is a key link of data processing in machine learning, how to improve the training efficiency and effect of the encoder is a problem to be considered at present.
Disclosure of Invention
The embodiment of the application provides a method and a device for training an encoder and extracting characterization information, which are used for improving the training efficiency and effect of the encoder and the effectiveness of the extracted characterization information.
In one aspect, an encoder training method is provided, including:
carrying out noise superposition processing on original sample data to obtain at least two loss data;
aiming at original sample data and at least two loss data, respectively adopting encoders with the same model parameters to carry out encoding processing to obtain corresponding encoding characteristics;
decoding the obtained coding features by adopting a corresponding decoder to obtain corresponding decoding features;
obtaining a discrimination loss based on each coding feature, and obtaining a reconstruction loss based on original sample data and each decoding feature;
obtaining corresponding triple training data according to original sample data;
respectively adopting an encoder with model parameters to carry out feature extraction processing on each training data in the triple training data of the original sample data to obtain corresponding feature vectors;
determining a triplet loss representing the distance relationship between the feature vectors;
obtaining a predicted loss based on the reconstruction loss, the discrimination loss and the triple loss, wherein the predicted loss is positively correlated with the reconstruction loss, the discrimination loss and the triple loss;
and if the predicted loss conforms to the preset convergence condition, determining the model parameter as a reference value of the target parameter of the encoder, and if the predicted loss does not conform to the preset convergence condition, adjusting the model parameter until the predicted loss conforms to the preset convergence condition.
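As an illustration only, the training objective described in the steps above can be sketched in NumPy with a toy one-layer encoder. The patent does not fix the concrete loss functions at this level of generality, so the encoder form, the particular mean-squared losses, the noise scales, and the 0.5 convergence threshold are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    # shared-parameter encoder: a single tanh-activated linear map
    return np.tanh(x @ w)

def decode(z, w):
    # decoder tied to the encoder weights, purely for illustration
    return z @ w.T

def triplet_loss(anchor, positive, negative, margin=1.0):
    # hinge on squared distances between the three feature vectors
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# original sample data and two noise-superimposed loss data
x = rng.normal(size=(1, 16))
corrupted = [x + rng.normal(scale=0.1, size=x.shape) for _ in range(2)]

w = rng.normal(scale=0.1, size=(16, 8))  # model parameters shared by all encoders

# encoding features for the original sample data and each loss data
z_all = [encode(v, w) for v in [x] + corrupted]

# reconstruction loss: decoding features vs. the original sample data
rec_loss = sum(np.mean((decode(z, w) - x) ** 2) for z in z_all)

# discrimination loss: encodings of clean and corrupted data should agree
disc_loss = sum(np.mean((z - z_all[0]) ** 2) for z in z_all[1:])

# triplet loss over (anchor, positive, negative) feature vectors
pos = encode(x + rng.normal(scale=0.05, size=x.shape), w)  # distorted original
neg = encode(rng.normal(size=(1, 16)), w)                  # a different sample
tri_loss = triplet_loss(z_all[0], pos, neg)

# predicted loss, positively correlated with all three component losses
predicted_loss = rec_loss + disc_loss + tri_loss
converged = predicted_loss < 0.5  # assumed preset convergence threshold
```

In a real training loop, when `converged` is false the model parameters `w` would be adjusted (e.g. by gradient descent) and the computation repeated until the predicted loss meets the convergence condition.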
In one aspect, a method for extracting characterization information is provided, including:
obtaining a target model parameter of a target encoder by adopting a reference value of the target parameter of the encoder obtained by the encoder training method;
initializing a target encoder according to the target model parameters;
and adopting the target encoder to obtain the characterization information of the data.
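A minimal sketch of this extraction procedure, assuming for illustration that the trained reference value is a single weight matrix and that the target encoder is a one-layer map:

```python
import numpy as np

def init_target_encoder(reference_params):
    # copy the converged reference value into a fresh encoder; in a real
    # system this would load weights into a network, here the "encoder"
    # is a one-layer map closed over the parameter matrix
    w = np.array(reference_params, dtype=float)
    return lambda data: np.tanh(data @ w)

# stand-in for the reference value produced by the training method
trained_w = np.eye(4)

target_encoder = init_target_encoder(trained_w)
features = target_encoder(np.ones((2, 4)))  # characterization information
```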
In one aspect, an encoder training apparatus is provided, including:
the superposition unit is used for carrying out noise superposition processing on the original sample data to obtain at least two loss data;
the encoding unit is used for respectively adopting encoders with the same model parameters to carry out encoding processing on the original sample data and the at least two loss data to obtain corresponding encoding characteristics;
the decoding unit is used for decoding the obtained coding characteristics by adopting a corresponding decoder to obtain corresponding decoding characteristics;
the first obtaining unit is used for obtaining the discrimination loss based on each coding characteristic and obtaining the reconstruction loss based on the original sample data and each decoding characteristic;
the second obtaining unit is used for obtaining corresponding triple training data according to the original sample data;
the extracting unit is used for respectively adopting an encoder with model parameters to carry out feature extraction processing on each training data in the triple training data of the original sample data to obtain corresponding feature vectors;
the first determining unit is used for determining the triple loss representing the distance relation among the characteristic vectors;
the prediction unit is used for obtaining prediction loss based on the reconstruction loss, the discrimination loss and the triple loss, and the prediction loss is positively correlated with the reconstruction loss, the discrimination loss and the triple loss;
and the second determining unit is used for determining the model parameter as the reference value of the target parameter of the encoder if the prediction loss conforms to the preset convergence condition, and adjusting the model parameter until the prediction loss conforms to the preset convergence condition if the prediction loss does not conform to the preset convergence condition.
In one aspect, a representation information extraction apparatus is provided, including:
an obtaining unit, configured to obtain a target model parameter of a target encoder from a reference value of an encoder target parameter obtained by the encoder training method;
a setting unit for initializing the target encoder according to the target model parameters;
and the extraction unit is used for acquiring the characterization information of the data by adopting the target encoder.
In one aspect, there is provided a control apparatus comprising:
at least one memory for storing program instructions;
and the at least one processor is used for calling the program instructions stored in the memory and executing the steps of any one of the encoder training methods or the characterization information extraction method according to the obtained program instructions.
In one aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of any of the encoder training methods or the characterization information extraction methods described above.
In the encoder training and characterization information extraction method and device, corresponding encoding features and decoding features are obtained for original sample data and for at least two loss data of the original sample data; a discrimination loss is obtained based on the encoding features, and a reconstruction loss is obtained based on the original sample data and the decoding features. A feature vector is obtained for each training data in the triple training data of the original sample data, and a triplet loss characterizing the distance relationship between the feature vectors is determined. A predicted loss is obtained based on the reconstruction loss, the discrimination loss and the triplet loss; if the predicted loss meets the preset convergence condition, a target encoder is initialized with the model parameters and used to obtain the characterization information of data. This improves the efficiency and effect of encoder training; no special processing is needed for the data whose characterization information is to be extracted, so various data formats and modalities can be handled, the application range is wide, and the validity of the extracted characterization information is improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an encoder training in an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of a method for training an encoder according to an embodiment of the present disclosure;
FIG. 3a is a schematic diagram of a loss data acquisition in an embodiment of the present application;
FIG. 3b is a schematic diagram of noise superposition according to an embodiment of the present disclosure;
FIG. 3c is a schematic diagram illustrating a noise superposition effect according to an embodiment of the present disclosure;
FIG. 3d is a diagram illustrating an image random warping process according to an embodiment of the present disclosure;
FIG. 3e is a diagram illustrating a comparison of characterization learning results according to an embodiment of the present disclosure;
fig. 4 is a flowchart of an implementation of a method for extracting characterization information according to an embodiment of the present application;
FIG. 5a is a schematic structural diagram of an encoder training apparatus according to an embodiment of the present disclosure;
FIG. 5b is a schematic structural diagram of a representation information extraction apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a control device in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solution and beneficial effects of the present application more clear and more obvious, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Machine learning: mainly the design and analysis of algorithms that allow computers to "learn" automatically. A machine learning algorithm automatically analyzes data to obtain rules and uses those rules to predict unknown data. Because machine learning algorithms involve a large number of statistical theories, machine learning is particularly closely related to inferential statistics and is also called statistical learning theory. In terms of algorithm design, machine learning theory is concerned with learning algorithms that are realizable and effective. Machine learning is the core of artificial intelligence and the fundamental approach to making computers intelligent; it is applied in all fields of artificial intelligence and mainly uses induction and synthesis rather than deduction.
Characterization learning: a collection of techniques for learning features, i.e., for converting original sample data into a form that machine learning can exploit efficiently. It avoids the trouble of manually extracting features and allows a computer to learn how to extract features while learning to use them.
Laplace transform: an integral transform commonly used in engineering mathematics. The Laplace transform is a linear transform that converts a function of a real variable t (t ≥ 0) into a function of a complex variable s.
Supervised learning: a machine learning task of inferring a function from labeled training data. The task of supervised learning is to learn a model that predicts the corresponding output for a given input. This model typically takes the form of a decision function Y = f(X) or a conditional probability distribution P(Y|X).
Unsupervised learning: machine learning without manually labeled data, in contrast to supervised learning.
Spatial domain: also known as data space (image space), the space made up of data pixels. Processing pixel values directly in data space, with length (distance) as the argument, is called spatial-domain processing.
Gaussian pyramid: a technique used in data processing, computer vision, and signal processing. A Gaussian pyramid is essentially a multi-scale representation of a signal: the same signal or picture is Gaussian-blurred and down-sampled multiple times to generate multiple sets of signals or pictures at different scales for subsequent processing.
Laplacian pyramid: obtained by subtracting, from each layer of the Gaussian pyramid, the prediction formed by upsampling and Gaussian-convolving the next (coarser) layer, yielding a series of difference data. During the operation of the Gaussian pyramid, the convolution and downsampling operations lose part of the high-frequency detail information; the Laplacian pyramid is defined in order to describe this high-frequency information.
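The two pyramid constructions can be sketched as follows. For simplicity this uses 2x2 average pooling in place of a true Gaussian blur (an assumption for illustration), which makes the Laplacian reconstruction exact:

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling (stands in for Gaussian blur + subsampling)
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # nearest-neighbour 2x upsampling
    return np.kron(img, np.ones((2, 2)))

def build_laplacian_pyramid(img, levels):
    gaussian = [img]
    for _ in range(levels - 1):
        gaussian.append(downsample(gaussian[-1]))
    # each layer minus the upsampled prediction from the coarser layer
    laplacian = [g - upsample(g_next)
                 for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    laplacian.append(gaussian[-1])  # coarsest layer is kept as-is
    return laplacian

def reconstruct(laplacian):
    img = laplacian[-1]
    for diff in reversed(laplacian[:-1]):
        img = diff + upsample(img)
    return img

img = np.arange(64, dtype=float).reshape(8, 8)
pyr = build_laplacian_pyramid(img, 3)
restored = reconstruct(pyr)  # identical to img by construction
```

Because each difference layer stores exactly what the upsampled coarser layer misses, summing back up the pyramid recovers the original data, which is why the Laplacian pyramid preserves the high-frequency detail that the Gaussian pyramid alone discards.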
Affine transformation: the affine transformation between two vector spaces consists of a non-singular linear transformation and a translation transformation.
Discriminant model: a method of modeling the relationship between unobserved and observed data. In a probabilistic framework, given the input variable x, a discriminant model predicts the output y by solving the conditional probability distribution P(y|x).
Convolutional Neural Network (CNN): a feedforward neural network whose artificial neurons can respond to surrounding units, suitable for large-scale data processing. A convolutional neural network includes convolutional layers and pooling layers.
Generative Adversarial Network (GAN): consists of a generator network and a discriminator network. The generator takes a random sample from a latent space as input, and its output needs to mimic the real samples in the training set as closely as possible. The discriminator's input is either a real sample or the generator's output, and its purpose is to distinguish the generator's output from real samples as well as possible, while the generator tries to fool the discriminator. The two networks compete with each other and continuously adjust their parameters; the final goal is that the discriminator cannot tell whether the generator's output is real.
The design concept of the embodiment of the present application is described below.
As society moves into the digital information age, real-world data (e.g., pictures, video, and sensor measurements) are becoming more complex and varied, which presents significant challenges to data management and analysis. For example, machine learning tasks typically require input data that is convenient to process mathematically or computationally, so valid features must be extracted and expressed in advance.
Since traditional manual feature extraction requires a great deal of manpower, relies on highly specialized knowledge, and is hard to generalize, characterization learning emerged. It avoids the trouble of manually extracting features and allows a computer to learn how to extract features while learning to use them. Visual characterization learning, for example, uses a camera and a computer in place of human eyes to perform machine-vision tasks such as identification, tracking, and measurement of a target, followed by graphics processing, so that the computer turns the data into a form more suitable for human observation or for transmission to an instrument for detection. It can be applied to visual object identification, such as automatic Web data labeling, mass data search, data content filtering, and remote medical consultation; to visual object detection, such as industrial robots and driverless cars; and to visual object tracking, such as identifying and tracking people in video surveillance.
In the traditional scheme, the following modes are mainly adopted during characterization learning:
the first mode is as follows: by reconstructing the original sample data, the compressed features are learned. However, in this way, the learned characterization effect is weak because the task of reconstructing data is simple.
The second way is: the characterization learning is performed by defining different related tasks, such as the relative position relationship of the predicted data blocks, the rotation angle of the predicted data, and the like. However, in this way, strong a priori knowledge is required and there are specific requirements on the format and modality of the input data.
The third mode is as follows: and the characterization learning is realized by integrating multiple tasks and learning at the same time. For example, the relative relationship task, the coloring task, the template task, and the motion segmentation task are merged into one framework. However, since each task corresponds to a respective objective function, input data requires special processing for multi-task learning.
The fourth mode is as follows: and (4) realizing characterization learning by adopting discriminant learning. For example, a twin network or ternary twin network structure is used to distinguish between different samples. However, this method requires large-scale labeling, has a small application range, and consumes a lot of manpower and material resources.
The applicant analyzes the traditional technology and finds that an encoder for extracting the representation information is a key link of data processing, but the traditional technology does not provide a technical scheme of the encoder capable of directly extracting the effective representation information of the original data, so that the training efficiency and the training effect of the encoder are problems to be considered.
In view of this, the applicant considers that laplace transform and noise superposition may be adopted to obtain multiple damaged data of original sample data, and a discriminant inference method may be adopted to perform random distortion processing on the original sample data, so as to obtain triple training data including the original data, and an encoder created based on a convolutional neural network is trained by using the original sample data, the damaged data, and the triple training data, so as to obtain a target encoder, so that the characterization information of the data may be extracted according to the target encoder.
In view of the above analysis, the embodiments of the present application provide a technical scheme for encoder training and characterization information extraction. Laplace transform and noise superposition are applied to original sample data to obtain multiple damaged data of the original sample data; random distortion processing is applied to the original sample data using a discriminant inference method to obtain positive sample data, yielding triple training data containing anchor sample data (i.e., the original sample data), the positive sample data, and negative sample data. According to the original sample data and its at least two damaged data, encoders with the same model parameters are used to obtain the discrimination loss and the reconstruction loss; encoders with the same model parameters are likewise used to obtain the feature vector of each training data in the triple training data, and a triplet loss characterizing the distance relationship between the feature vectors is determined. If the predicted loss obtained from the reconstruction loss, the discrimination loss, and the triplet loss meets the convergence condition, the target encoder is obtained from the encoding parameters; otherwise the model parameters are adjusted and the procedure returns to the step of applying Laplace transform and noise superposition to the original sample data. Further, the target encoder is used to extract the characterization information of data. This improves the efficiency and effect of encoder training; no special processing is needed for the data whose characterization information is to be extracted, so various data formats and modalities can be handled, the application range is wide, and the validity of the extracted characterization information is improved.
According to the technical scheme for encoder training and representation information extraction, the target encoder used for accurately extracting the representation information can be obtained, and further, a target model applied to the fields of image classification, target detection, automatic driving, robots and the like can be built based on the target encoder.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings. Although the embodiments provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on conventional or non-inventive labor. For steps that have no necessary logical causal relationship, the order of execution is not limited to that provided by the embodiments of the present application. In an actual processing procedure or device, the method can be executed sequentially or in parallel according to the embodiments or the figures.
Referring to fig. 1, a schematic diagram of an encoder training scheme provided in the present application is shown. The main principles of encoder training are as follows:
s101: the encoding characteristics and the decoding characteristics of original sample data and damaged data are obtained through the Laplace distillation module 101, and the feature vectors corresponding to the training data in the triple training data are obtained through the discriminant reasoning module 102.
The damaged data is obtained by performing laplacian transform and noise superposition on original sample data. The triplet training data includes: anchor sample data, positive sample data, and negative sample data. The anchor sample data is the original sample data. The positive sample data is data obtained by randomly distorting original sample data. The negative sample data is data different from the original sample data.
S102: obtaining the discrimination loss through each coding feature; obtaining reconstruction loss through each decoding characteristic and original sample data; and obtaining the triple loss according to each feature vector.
S103: and obtaining the predicted loss according to the discrimination loss, the reconstruction loss and the triple loss.
S104: if the prediction loss meets the preset convergence condition, obtaining a target encoder based on the model parameters of each encoder, otherwise, adjusting the model parameters of the laplacian distillation module 101 and the discriminant inference module 102 according to the prediction loss. Optionally, the predicted loss meets a preset convergence condition, and may be that the predicted loss is lower than a preset threshold.
Wherein the laplace distillation module 101: the method comprises the steps of performing Laplace transform and noise superposition processing on original sample data to obtain at least two damaged data; respectively adopting encoders with the same model parameters to carry out encoding processing on original sample data and each damaged data to obtain corresponding encoding characteristics; and respectively adopting a corresponding decoder to decode each coding characteristic to obtain corresponding decoding characteristics.
Wherein the discriminant inference module 102: the method comprises the steps of performing random distortion processing on original sample data to obtain positive sample data; combining the positive sample data, the negative sample data and the anchor point data into triple training data; and respectively carrying out coding processing and full connection on each training data in the triple training data by adopting a coder to obtain corresponding characteristic vectors.
In fig. 1, an image of a dog is used as the original sample data, an image of a cat as the negative sample data, and random noise, information-removal noise, and blurring noise as three different superimposed noises, by way of example. In practical applications, the original sample data, the negative sample data, and the noise types can be selected according to actual requirements; for example, the noise type may also be true random noise, multi-scale blurring, information loss, and so on, which is not limited herein. The model parameters of the encoders in the encoder set 103 are shared.
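The discriminant inference path of Fig. 1 can be sketched as follows. The random horizontal flip standing in for random warping, and the single shared matrix standing in for the encoder plus full connection, are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_distort(img):
    # stand-in for random warping: random horizontal flip plus small jitter
    out = img[:, ::-1] if rng.random() < 0.5 else img
    return out + rng.normal(scale=0.01, size=out.shape)

def encode(img, w):
    # shared-parameter encoder followed by a full connection,
    # collapsed into a single matrix for illustration
    return img.reshape(-1) @ w

anchor = rng.random((4, 4))        # original sample data (the anchor)
positive = random_distort(anchor)  # randomly distorted original
negative = rng.random((4, 4))      # data different from the original

w = rng.normal(scale=0.1, size=(16, 8))  # parameters shared by all three paths
feats = [encode(v, w) for v in (anchor, positive, negative)]

d_pos = np.linalg.norm(feats[0] - feats[1])  # anchor-positive distance
d_neg = np.linalg.norm(feats[0] - feats[2])  # anchor-negative distance
```

Training would then push `d_pos` below `d_neg` via the triplet loss; before training there is no guarantee on their relative order.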
Referring to fig. 2, a flowchart of an implementation of an encoder training method provided in the present application is shown. The method comprises the following specific processes:
step 201: the control equipment acquires original sample data and negative sample data to be processed.
Specifically, when step 201 is executed, the negative sample data is data different from the original sample data. Optionally, any data different from the original sample data may be selected from the data set. The original sample data can be data in image, video, multi-frame data and other formats.
For example, the original sample data is a peony image, and the negative sample data is a rose image.
Step 202: the control equipment performs Laplace transform and noise superposition on the original sample data to obtain at least two damaged data.
Specifically, referring to fig. 3a, a schematic diagram of the acquisition of loss data is shown.
S2021: and the control equipment performs Gaussian transformation on the original sample data to obtain a Gaussian pyramid.
The gaussian pyramid is a technique used in encoder training, computer vision, and signal processing. The gaussian pyramid is essentially a multi-scale representation of the signal, i.e., the same signal or picture is gaussian blurred multiple times and down-sampled to generate multiple sets of signals or pictures at different scales for subsequent processing.
S2022: the control device performs laplacian transformation on the gaussian pyramid to obtain a laplacian pyramid.
In the operation process of the gaussian pyramid, partial high-frequency detail information can be lost by data through convolution and downsampling operations, and in order to describe the high-frequency information, the laplacian pyramid is defined. The laplacian pyramid is: and subtracting the predicted data after the upsampling and Gaussian convolution of the data of the upper layer from each layer of data of the Gaussian pyramid to obtain a series of difference data. The laplacian pyramid contains at least two layers of sampled data.
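The Gaussian and Laplacian pyramid construction, and the inverse transformation used later, can be sketched as follows. This is a minimal NumPy sketch: the helper names, the 5-tap binomial blur kernel, and the zero-insertion upsampling are illustrative assumptions, not the patent's exact operators.

```python
import numpy as np

KERNEL = np.array([1., 4., 6., 4., 1.]) / 16.0  # binomial approximation of a Gaussian

def blur(img):
    # separable convolution with reflect padding along each axis
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, 2, mode="reflect"), KERNEL, mode="valid"),
            axis, img)
    return img

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        img = blur(img)[::2, ::2]          # blur then downsample by 2
        pyr.append(img)
    return pyr

def upsample(img, shape):
    up = np.zeros(shape)
    up[::2, ::2] = img[:(shape[0] + 1) // 2, :(shape[1] + 1) // 2]
    return blur(up) * 4.0                  # compensate for the inserted zeros

def laplacian_pyramid(img, levels):
    gauss = gaussian_pyramid(img, levels)
    lap = [gauss[i] - upsample(gauss[i + 1], gauss[i].shape)
           for i in range(levels - 1)]
    lap.append(gauss[-1])                  # coarsest residual kept as-is
    return lap

def inverse_laplacian(lap):
    img = lap[-1]
    for level in reversed(lap[:-1]):
        img = level + upsample(img, level.shape)
    return img
```

Because each band stores exactly the difference against the upsampled coarser level, summing the bands back up reconstructs the input exactly; this is what allows noise added in the Laplacian domain to be mapped back to a spatial-domain loss image.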
S2023: the control device performs the following steps for each noise type in the noise set: superimpose noise of that type on a randomly selected level of sampled data in the Laplacian pyramid, and perform an inverse Laplacian transformation on the noise-superimposed Laplacian pyramid to obtain the corresponding loss data.
Optionally, for different superimposed noise types, when obtaining loss data, the following formula may also be adopted:
Lap_c = Lap⁻¹(P̃_{l,c}(x)), c ∈ C;

wherein Lap_c is the loss data, x is the original sample data, P̃_{l,c}(x) is the Laplacian pyramid of x after noise of type c is superimposed on a randomly selected level l, Lap⁻¹(·) denotes the inverse Laplacian transformation, and c is a noise type in the noise set C.
Specifically, when obtaining the loss data based on the random noise, the following formula may be adopted:
Lap_Dn = Lap⁻¹(P̃_{l,Dn}(x));

wherein Lap_Dn is the loss data, x is the original sample data, P̃_{l,Dn}(x) is the Laplacian pyramid of x after noise is superimposed on a randomly selected level l, and Dn denotes random noise.
Specifically, when obtaining the loss data based on information-removal noise, the following formula may be adopted:

Lap_In = Lap⁻¹(P̃_{l,In}(x));

wherein Lap_In is the loss data, x is the original sample data, P̃_{l,In}(x) is the Laplacian pyramid of x after noise is superimposed on a randomly selected level l, and In denotes information-removal noise.
Specifically, when obtaining the loss data based on the blurring noise, the following formula may be adopted:
Lap_SR = Lap⁻¹(P̃_{l,SR}(x));

wherein Lap_SR is the loss data, x is the original sample data, P̃_{l,SR}(x) is the Laplacian pyramid of x after noise is superimposed on a randomly selected level l, and SR denotes blurring noise.
Optionally, if the original sample data is image data, the original sample data may be resized to a set length and width and then randomly cropped. For example, the image is resized to 256×256 and randomly cropped to 227×227.
In the embodiments of the present application, the noise types in the noise set are described by taking random noise, information-removal noise, and blurring noise as examples. The noise type may also be true random noise, multi-scale blurring, information loss, and the like, which is not limited herein.
Optionally, the random noise may be gaussian random noise with a set variance (e.g., 25), and when the random noise is superimposed, a layer of sampling images is randomly selected from the laplacian pyramid for superimposition.
Optionally, for information-removal noise, a set percentage of pixel points can be randomly removed; when superimposing information-removal noise, a level of sampled images is randomly selected from the Laplacian pyramid for the superposition.
Blurring noise removes the high-frequency information by discarding the information in the bottom (finest) level of the Gaussian pyramid.
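The three corruption types described above can be sketched as operations on a list of pyramid levels. The following NumPy sketch is illustrative: the function names, the 30% removal fraction, and the choice to exclude the coarsest residual from random level selection are assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(level, sigma=25.0):
    # "Dn" type: Gaussian random noise with a set variance
    return level + rng.normal(0.0, sigma, size=level.shape)

def remove_information(level, fraction=0.3):
    # "In" type: zero out a set percentage of randomly chosen pixels
    mask = rng.random(level.shape) >= fraction
    return level * mask

def blur_pyramid(pyramid):
    # one possible reading of the "SR" blurring noise:
    # drop the finest (highest-frequency) band entirely
    out = list(pyramid)
    out[0] = np.zeros_like(out[0])
    return out

def corrupt(pyramid, noise_fn):
    # apply noise_fn to one randomly selected level, leave the rest intact
    out = list(pyramid)
    l = rng.integers(len(out) - 1)   # exclude the coarsest residual level
    out[l] = noise_fn(out[l])
    return out
```

Each corrupted pyramid would then be passed through the inverse Laplacian transformation to obtain spatial-domain loss data.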
In the embodiment of the application, original sample data is constructed into the laplacian pyramid, noise superposition is respectively carried out in the laplacian pyramid through noises of various noise types, and the laplacian pyramid after the noises are superposed is reconstructed into lost data. The original sample data of the spatial domain is converted into a Laplacian pyramid of the Laplacian domain through Laplacian transformation, and then is inversely transformed into lost data of the spatial domain.
In this way, noise superposition is performed in the laplacian domain, rather than spatial domain superposition noise in the traditional approach, such that changes to the data are accompanied by global semantic information, rather than just local semantic information. Because it is difficult to capture the non-local semantic concepts only by the local semantic information, in the embodiment of the application, better representation can be learned by the global semantic information.
Further, in the embodiment of the application, noise superposition is performed by parallelly adopting the noises of multiple noise types, so that an encoder can learn a more difficult task, obtain stronger learning capability and learn better characterization information.
Referring to fig. 3b, a schematic diagram of noise superposition is shown; fig. 3b shows the results of superimposing noise at different Laplacian pyramid levels. The images shown in fig. 3b are, in order: the original sample data, conventional data obtained by conventionally superimposing noise (i.e., directly in the spatial domain), loss data with the noise-superposition level (LPS) of 4, loss data with LPS of 6, and loss data with LPS of 8.
As can be seen from fig. 3b, compared with the conventional method in which noise is directly superimposed in the spatial domain, the loss data obtained by superimposing noise in the laplacian transform domain focuses not only on the local information but also on the global information. And loss data obtained by superposing noise at different LPS levels also shows different ranges on interference scale, so that a better encoder for extracting characterization information can be obtained in subsequent steps.
Fig. 3c is a schematic diagram illustrating the noise superposition effect. The images shown in fig. 3c are, in order: the original sample data, loss data with random noise superimposed, loss data with information-removal noise superimposed, and loss data with blurring noise superimposed. As can be seen in fig. 3c, superimposing noise of different noise types produces different noise effects, but each image reflects features combining local and global information.
Step 203: and the control equipment acquires triple training data according to the original sample data and the negative sample data.
Specifically, the triplet training data includes: anchor sample data, positive sample data, and negative sample data. The anchor sample data is the original sample data. The positive sample data is obtained by randomly distorting original sample data. The negative sample data is data different from the original sample data. The random warping process may be implemented by perspective transformation, affine transformation, rotation transformation, and the like, which is not limited herein.
When the control device obtains positive sample data, the following steps can be adopted:
s2031: and randomly sampling the original sample data to obtain randomly sampled data.
Specifically, the original sample data is normalized, and random sampling is performed in a designated area to obtain each random sampling data.
S2032: and obtaining an affine transformation matrix according to the random sampling data and the target data.
Specifically, the affine transformation matrix satisfies the following conditions: and multiplying the affine transformation matrix and the random sampling data into target data.
If the original sample data is an original image, the length and width of the original image are normalized (for example, 256 × 256), and random sampling is performed in designated areas (for example, 100 × 100 at four corners) of the original image respectively, so as to obtain random sampling coordinate points, and obtain a quadrilateral area. The affine transformation matrix satisfies the following formula:
(t·x_i′, t·y_i′, t)ᵀ = M · (x_i, y_i, 1)ᵀ, i = 0, 1, 2, 3;

wherein M is the affine transformation matrix, i is the sequence number of a randomly sampled coordinate point, t is a transformation coefficient, the randomly sampled coordinate point is src(i) = (x_i, y_i) with x_i, y_i its abscissa and ordinate, and the corresponding target point is dst(i) = (x_i′, y_i′) with x_i′, y_i′ its abscissa and ordinate.
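Given the four sampled point pairs, the transformation matrix can be recovered by least squares. A minimal NumPy sketch for the affine case (the helper names are illustrative; a production system would typically use an OpenCV routine such as `getPerspectiveTransform` for the four-point perspective case):

```python
import numpy as np

def solve_affine(src, dst):
    """Least-squares 2x3 affine matrix M with dst_i ~= M @ [x_i, y_i, 1]."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    A = np.hstack([src, np.ones((len(src), 1))])   # homogeneous source points
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) coefficient block
    return M.T                                      # (2, 3) affine matrix

def apply_affine(M, pts):
    pts = np.asarray(pts, float)
    return (M @ np.hstack([pts, np.ones((len(pts), 1))]).T).T
```

With four non-collinear point pairs the system is consistent and the least-squares solution reproduces the warp exactly.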
S2033: and randomly distorting the original sample data according to the affine transformation matrix, and cutting and scaling the randomly distorted original sample data to obtain positive sample data.
Specifically, since the affine transformation matrix satisfies the condition that its product with the random sampling data is the target data, applying the affine transformation matrix to the original sample data realizes the random warping. The edges of the randomly warped data can then be cropped and padded, and the data scaled back to the original size.
Optionally, when positive sample data is obtained, the following formula may be adopted:
x_p = Pers(x);

wherein x is the original sample data, x_p is the positive sample data, and Pers(·) is the random warping function. Optionally, the random warping function may be implemented with an affine transformation matrix or a perspective transformation.
For example, referring to fig. 3d, a schematic diagram of an image random warping process is shown. The images shown in fig. 3d are, in order: and carrying out random sampling on the original sample data, and carrying out perspective transformation on the original sample data to obtain positive sample data.
As shown in fig. 3d, the control device performs random sampling on the original sample data to obtain each random sampling coordinate point, and performs perspective transformation on the original sample data according to the random sampling coordinate point and the coordinate point of the target point to obtain the positive sample data.
In the embodiment of the application, original sample data is used as anchor data, the original sample data is transformed to obtain positive sample data, and a sample different from the original sample data is selected as negative sample data. And combining anchor point data, positive sample data and negative sample data into triple training data. In this way, after the original sample data is randomly warped, although the positive sample data is deformed and warped more greatly than the original sample data (e.g. dog in the image in fig. 3 d), the main semantic information in the original sample data is retained in the positive sample data.
In the embodiment of the present application, only the step 202 is executed first and then the step 203 is executed as an example for description, in practical applications, the execution sequence of the step 202 and the step 203 may be executed sequentially or simultaneously, which is not limited to this.
Step 204: the control device obtains the encoding features and decoding features of the original sample data and each loss data, and obtains the feature vector of each training data in the triplet training data.

Specifically, the control device builds a CNN model, uses the CNN model to obtain the encoding features and decoding features of the original sample data and each loss data, and obtains the feature vector of each training data in the triplet training data. The CNN model mainly comprises an encoder and a decoder.

When obtaining the encoding features of the original sample data and each loss data, the following step can be adopted: encoders with the same model parameters are used to encode each loss data and the original sample data respectively, obtaining the corresponding encoding features.

When obtaining the decoding features of the original sample data and each loss data, the following step can be adopted: for each encoding feature, a corresponding decoder is used for decoding, obtaining the corresponding decoding feature.
When obtaining the feature vector of each training data in the triplet training data, the following steps may be adopted:
and respectively carrying out coding processing and full-connection processing on the characteristics by adopting encoders with the same model parameters aiming at each training data in the triple training data to obtain corresponding characteristic vectors. In the embodiment of the present application, the model parameters of each encoder are shared.
The CNN model body may have any structure, and in the embodiment of the present application, an AlexNet structure is taken as an example for description. The encoder adopts AlexNet, and the decoder is a three-layer deconvolution (deconv) layer and is used for decoding and reconstructing the coding characteristics obtained by the encoder into a data structure with the same size as the original sample data. The encoder is also used to extract feature vectors of the training data.
As shown in fig. 1, in the embodiment of the present application, since noise superposition processing is performed on original sample data by using three types of noise, three AlexNet with the same structure are used to perform encoding processing on each loss data, and three decoders are used to decode each obtained encoding feature. Wherein, the model parameters in each encoder are shared, and the model parameters in each decoder may not be shared. And fully connecting the feature vectors output by the encoder through a full connection layer aiming at each training data in the triple training data to obtain the fully connected feature vectors.
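Parameter sharing across the encoder branches simply means one set of weights is applied to every input. A toy NumPy stand-in for the shared encoder plus fully connected head — the layer sizes and the linear+ReLU structure are illustrative assumptions, not the AlexNet architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(64, 128)) * 0.01   # one shared set of "encoder" weights
W_fc  = rng.normal(size=(128, 32)) * 0.01   # shared fully connected head

def encode(x):
    # the SAME parameters are applied to anchor, positive and negative samples,
    # which is what "model parameters of each encoder are shared" means here
    h = np.maximum(x @ W_enc, 0.0)           # toy stand-in for conv layers: linear + ReLU
    return h @ W_fc                          # fully connected feature vector

anchor, pos, neg = (rng.normal(size=64) for _ in range(3))
f_a, f_p, f_n = encode(anchor), encode(pos), encode(neg)
```

Because the three branches reference the same weight arrays, a gradient update to the shared parameters affects all branches identically.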
The characterization learned by the model is mainly embodied in the model parameters of the encoder, and the quality of the characterization can be verified by verifying the performance of the model parameters.
Step 205: and the control equipment obtains the prediction loss according to the original sample data, each coding characteristic, each decoding characteristic and each characteristic vector.
Specifically, the control device obtains the discrimination loss through each coding feature, obtains the reconstruction loss through each decoding feature and the original sample data, obtains the triple loss according to each feature vector, and obtains the prediction loss according to the discrimination loss, the reconstruction loss and the triple loss.
The discriminant loss represents the similarity of the encoding characteristics output by the encoder and the encoding characteristics of the original sample data in the characteristic distribution. The reconstruction loss is used for judging the similarity degree of the output data of the decoder and the original sample data in a spatial domain. The triplet penalty is used to represent: and (4) triplet loss of distance relation between feature vectors of training data in the triplet training data.
Wherein, when the discriminant loss is obtained, a discriminant sub-function may be employed:
L_D = E_x[log D(G(x))] + ∑_{c∈C} E_c[log(1 − D(G(Lap_c)))];

wherein L_D is the discrimination loss, x is the original sample data, G(x) is the encoding feature of the original sample data, G(Lap_c) is the encoding feature of the loss data, D(·) is the discriminator network, E denotes the data expectation, and c is a noise type in the noise set C.
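The discrimination subfunction can be evaluated directly from discriminator outputs. A hedged NumPy sketch — batch means stand in for the expectations, and the epsilon guard is an implementation detail, not part of the formula:

```python
import numpy as np

def discrimination_loss(d_real, d_lossy):
    """L_D = E_x[log D(G(x))] + sum_c E_c[log(1 - D(G(Lap_c)))].

    d_real:  discriminator outputs on features of original samples, values in (0, 1)
    d_lossy: dict mapping noise type c -> discriminator outputs on lossy-sample features
    """
    eps = 1e-12  # numerical guard for log near 0
    loss = np.mean(np.log(np.asarray(d_real) + eps))
    for d_c in d_lossy.values():
        loss += np.mean(np.log(1.0 - np.asarray(d_c) + eps))
    return loss
```

An undecided discriminator (all outputs 0.5) yields log(0.5) per term, which is the usual sanity check for GAN-style losses.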
In the embodiment of the application, the discriminant subfunction refers to the concept of GAN, a CNN model is used as a generator G, the generator G is implemented by adopting 4 layers of convolution (conv), and the output of an encoder is used as the input of the discriminant subfunction. Conventionally, GAN networks usually use discriminators for image domains, and in the embodiment of the present application, discriminators are used for feature domains, so as to expect to obtain similarity of feature planes. Therefore, the coding characteristics obtained by the coder and the coding characteristics obtained by the original sample data can be ensured to keep consistency in characteristic distribution, namely the similarity of data distribution.
Wherein, when the reconstruction loss is obtained, a reconstruction subfunction may be adopted:
L_rec = ∑_{c∈C} E_x‖x − z_c‖² + E_x‖x − z_x‖²;

wherein L_rec is the reconstruction loss, E is the mathematical expectation, x is the original sample data, z_c is the decoding feature of the loss data of noise type c, z_x is the decoding feature of the original sample data, and c is a noise type in the noise set C.
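The reconstruction subfunction is a sum of squared-error terms. In this minimal NumPy sketch, a per-element mean squared error stands in for the expected squared norm E‖·‖² (an assumption about normalization; the formula itself does not fix it):

```python
import numpy as np

def reconstruction_loss(x, z_by_noise, z_x):
    """sum_c E||x - z_c||^2 + E||x - z_x||^2, with MSE standing in for E||.||^2.

    x:          original sample data
    z_by_noise: dict mapping noise type c -> reconstruction z_c from the lossy branch
    z_x:        reconstruction of the original sample data itself
    """
    loss = sum(np.mean((x - z_c) ** 2) for z_c in z_by_noise.values())
    return loss + np.mean((x - z_x) ** 2)
```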
Therefore, the reconstruction sub-function comprehensively judges the performance of all reconstruction processes according to each loss data and the reconstruction data of the original sample data, namely the decoding characteristics.
When the triple loss is obtained, a triple loss function can be adopted:
L_trip = |d(F_θ(x), F_θ(x_p)) − d(F_θ(x), F_θ(y)) + δ|₊;

wherein L_trip is the triplet loss, x is the original sample data, y is the negative sample data, x_p is the positive sample data, F_θ(·) is the feature-vector mapping, |·|₊ denotes taking the positive part (i.e., the value is 0 when the argument is negative and unchanged when it is non-negative), d(·,·) is a distance function (optionally the Euclidean distance), and δ denotes the minimum margin between the feature vector of the positive sample data and the feature vector of the negative sample data; optionally, δ may be set to 20.
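The triplet subfunction is a hinge on the difference of two distances. A minimal NumPy sketch, using the Euclidean distance and the margin δ = 20 mentioned in the text:

```python
import numpy as np

def triplet_loss(f_anchor, f_pos, f_neg, delta=20.0):
    """|d(anchor, pos) - d(anchor, neg) + delta|_+ with Euclidean d."""
    d_pos = np.linalg.norm(f_anchor - f_pos)   # anchor-positive distance
    d_neg = np.linalg.norm(f_anchor - f_neg)   # anchor-negative distance
    return max(d_pos - d_neg + delta, 0.0)     # positive-part clamp
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is exactly the geometry the triplet training data is meant to enforce.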
Wherein the predicted loss can be obtained by using the following formula:
L = min_G max_D (L_trip + L_rec + L_D);

wherein L is the prediction loss, L_trip is the triplet loss, L_rec is the reconstruction loss, L_D is the discrimination loss, G is the generator used to obtain the encoding features, and D is the discriminator network.
Step 206: the control device determines whether the predicted loss meets a predetermined convergence condition, if so, performs step 207, otherwise, performs step 208.
Step 207: the control device determines the model parameters as reference values for the encoder target parameters.
Step 208: the control device adjusts the model parameters of the encoder and decoder according to the prediction loss, and step 201 is performed.
Specifically, when the steps 206 to 208 are executed, if the prediction loss meets the preset convergence condition, the control device determines the model parameter as the reference value of the encoder target parameter. And if the predicted loss does not accord with the preset convergence condition, the control equipment adjusts the model parameters until the predicted loss accords with the preset convergence condition.
After obtaining the reference value of the target parameter of the encoder, the target encoder may be initialized according to the reference value of the target parameter of the encoder, and the target encoder is used to obtain the characterization information of the data. Fig. 4 is a flowchart of an implementation of the method for extracting the characterization information. The method comprises the following specific processes:
step 401: and controlling the reference value of the target parameter of the encoder of the equipment to obtain the target model parameter of the target encoder.
Step 402: and the control equipment initializes the target encoder according to the reference value of the encoder target parameter and adopts the target encoder to obtain the characterization information of the data.
Further, the control device may build a desired target model according to the target encoder, and perform data processing using the target model.
The target model is mainly a model which needs to extract the representation information of data and process the data according to the representation information, and can be applied to the fields of image classification, target detection, automatic driving, visual object tracking, Web data automatic labeling, mass data searching, data content filtering, medical remote consultation, robots and the like.
For example, the object model may be a classification model, an object detection model, a semantic segmentation model, and the like. The target tasks may be classification tasks, target detection tasks, and semantic segmentation tasks.
In the embodiment of the application, the validity of the characterization information extracted by the target encoder is evaluated from the perspective of convolutional layer output, model initialization and transfer learning.
Evaluation scenario one: evaluation based on the encoding features output by the convolutional layers. FIG. 3e is a diagram comparing characterization learning results. Graph (a) in fig. 3e shows the encoding features output by the first convolutional layer when a conventional fully supervised characterization learning method extracts the encoding features of an image. Graph (b) in fig. 3e shows the encoding features output by the first convolutional layer when the target encoder obtained with the present scheme extracts the encoding features of the image.
As can be seen, graph (b) in fig. 3e is similar to graph (a): the target encoder obtains encoding features comparable to those of the conventional fully supervised characterization learning method, and learns accurate edge filters and color filters.
Table 1.
[Table 1 image: classification accuracy of the five initialization modes at each convolutional layer]
Evaluation scenario two: evaluation from the perspective of model initialization. Referring to table 1, a model initialization evaluation table is shown. Table 1 contains 5 initialization modes, in order: random initialization, spatial-domain initialization, Laplacian initialization, discriminative inference initialization, and the target encoder of the present scheme, i.e., encoders obtained by training respectively with random parameters, in the spatial domain, in the Laplacian domain, with discriminative inference, and with the present scheme.
Specifically, each encoder is obtained by one of the 5 modes in table 1, and a linear classifier is attached to each of the 5 convolutional layers of each encoder (e.g., AlexNet) to evaluate the classification performance, i.e., the classification accuracy, of the encoder on a data set such as the ImageNet data set.
As can be seen from table 1, the target encoder obtained by the present scheme is significantly higher than the encoders obtained by other schemes in the classification performance.
Evaluation scenario three: evaluation from the perspective of transfer learning, i.e., testing whether the obtained encoder can help characterization learning on other data and tasks.
Referring to table 2, a migration learning evaluation table is shown. The values in table 2 represent the task scores. The 5 methods shown in table 2 are sequentially adopted to obtain each encoder, and a corresponding classification model, a target detection model and a semantic segmentation model are obtained based on each encoder respectively, so as to execute the tasks of classification, target detection and semantic segmentation.
Table 2.
[Table 2 image: task scores of the five methods on the classification, target detection, and semantic segmentation tasks]
Fc6-8 means that, when training the classification model's encoder, the model parameters of the first 5 convolutional layers are fixed and not updated, and only the coefficients of the fully connected layers Fc6-8 are updated. Correspondingly, ALL means that all model parameters are updated and learned when training the encoder. As can be seen from table 2, the task scores of models built on the target encoder of the present scheme are significantly higher than those of the other modes, i.e., the present scheme is significantly better than the other schemes.
In the embodiment of the application, on one hand, the original sample data is subjected to Laplace transformation, the original sample data in a spatial domain is converted into a Laplace pyramid in a Laplace domain, noise is superimposed on a random layer of the Laplace pyramid to obtain loss data, combination of bottom-layer representation and high-layer representation is achieved, and features sensitive to edge features can be learned;
on the other hand, random warping is applied to the data through a discriminative inference method, and the triplet loss characterizing the distance relationship between the feature vectors of the triplet training data is obtained; this increases the distance between different contents in the feature space while reducing the difference between similar contents, so that the encoder can also capture high-level semantic information of the data.
And moreover, discrimination loss and reconstruction loss of original sample data and triple loss of triple training data are obtained, and prediction loss is obtained based on the discrimination loss, the reconstruction loss and the triple loss, so that the scheme simultaneously considers the similarity of the spatial domain and the feature domain distribution and the feature vector similarity between positive sample data and negative sample data, jointly restricts the training process, and adopts multi-task learning (such as superposition of various noises) to make the representation information obtained by training more robust.
The embodiments of the present application impose no strong constraints on the input data: various data formats and modalities can be used, no special processing of the input data is required, the applicability is wide, more low-level semantic information can be extracted, and more robust and representative model parameters are obtained for subsequent applications. For example, instead of performing model training on a large-scale labeled data set, the encoder training method provided in the embodiments of the present application can be used to build and initialize a target model applied to the fields of image classification, target detection, automatic driving, visual object tracking, Web data automatic labeling, mass data search, data content filtering, medical remote consultation, robots, and the like.
Based on the same inventive concept, the embodiment of the present application further provides an encoder training apparatus, and as the principle of the apparatus and the device for solving the problem is similar to an encoder training method, the implementation of the apparatus can refer to the implementation of the method, and repeated details are omitted.
Fig. 5a is a schematic structural diagram of an encoder training apparatus according to an embodiment of the present application. An encoder training apparatus comprising:
a superposition unit 510, configured to perform noise superposition processing on original sample data to obtain at least two loss data;
the encoding unit 511 is configured to perform encoding processing on the original sample data and the at least two pieces of loss data by using encoders with the same model parameter, respectively, to obtain corresponding encoding characteristics;
a decoding unit 512, configured to perform decoding processing on the obtained encoding features by using corresponding decoders, so as to obtain corresponding decoding features;
a first obtaining unit 513, configured to obtain a discrimination loss based on each coding feature, and obtain a reconstruction loss based on original sample data and each decoding feature;
a second obtaining unit 514, configured to obtain, according to the original sample data, corresponding triple training data;
an extracting unit 515, configured to perform, for each training data in the triple training data of the original sample data, feature extraction processing by using an encoder with model parameters, respectively, to obtain a corresponding feature vector;
a first determining unit 516, configured to determine a triplet loss characterizing a distance relationship between feature vectors;
the prediction unit 517 is configured to obtain a prediction loss based on the reconstruction loss, the discrimination loss, and the triplet loss, where the prediction loss is positively correlated with the reconstruction loss, the discrimination loss, and the triplet loss;
a second determining unit 518, configured to determine the model parameter as a reference value of the target parameter of the encoder if the prediction loss meets a preset convergence condition, and adjust the model parameter until the prediction loss meets the preset convergence condition if the prediction loss does not meet the preset convergence condition.
Preferably, the at least two loss data are obtained by performing Laplacian transformation on the original sample data and superimposing noise;
the triplet training data includes: the anchor point sample data is original sample data, the positive sample data is obtained by randomly distorting the original sample data, and the negative sample data is different from the original sample data.
Preferably, the first determining unit 516 is configured to:
determining a first distance between the feature vector of the anchor point sample data and the feature vector of the positive sample data;
determining a second distance between the feature vector of the anchor point sample data and the feature vector of the negative sample data;
a triplet penalty is determined based on a difference between the first distance and the second distance.
Preferably, the first obtaining unit 513 is configured to:
respectively obtaining, with a preset discrimination function, an original discrimination value for the encoding feature of the original sample data and a loss discrimination value for the encoding feature of each loss data;
determining a discrimination loss based on the original discrimination value and each loss discrimination value;
the discrimination loss represents the similarity degree of the encoding characteristics output by the encoder and the encoding characteristics of the original sample data on the characteristic distribution, and the discrimination loss is positively correlated with the original discrimination value and negatively correlated with the loss discrimination value.
Preferably, the first obtaining unit 513 is configured to:
respectively determining a decoding difference value between each decoding characteristic and original sample data;
obtaining a reconstruction loss based on each decoded difference;
the reconstruction loss is used for judging the similarity degree of the output data of the decoder and the original sample data in a spatial domain.
Based on the same inventive concept, the embodiment of the present application further provides a device for extracting characterization information, and since the principle of the device and the apparatus for solving the problem is similar to that of a method for extracting characterization information, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Fig. 5b is a schematic structural diagram of a characterization information extraction apparatus according to an embodiment of the present application. The characterization information extraction apparatus includes:
an obtaining unit 521, configured to obtain a target model parameter of a target encoder according to a reference value of a target parameter of the encoder obtained by the above-mentioned encoder training method;
a setting unit 522 for initializing the target encoder according to the target model parameters;
an extracting unit 523, configured to obtain the characterization information of the data by using the target encoder.
In the encoder training and characterization information extraction methods and apparatuses above, corresponding coding features and decoding features are obtained for the original sample data and for at least two pieces of loss data derived from it; a discrimination loss is obtained from the coding features, and a reconstruction loss is obtained from the original sample data and the decoding features. A feature vector is obtained for each item of training data in the triple training data of the original sample data, and a triplet loss representing the distance relationship among the feature vectors is determined. A predicted loss is then obtained based on the reconstruction loss, the discrimination loss and the triplet loss. If the predicted loss meets the preset convergence condition, the model parameters are used to initialize a target encoder, and the target encoder is used to obtain the characterization information of the data. This improves both the efficiency and the effect of encoder training; the data whose characterization information is to be extracted needs no special processing, so the scheme applies to a wide range of data formats and modalities and improves the effectiveness of the extracted characterization information.
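The combination of the three losses into the predicted loss can be sketched as a weighted sum with positive weights, which preserves the required positive correlation with each component; the weights themselves are illustrative and not given in the text:

```python
def predicted_loss(recon, disc, triplet,
                   w_recon=1.0, w_disc=1.0, w_triplet=1.0):
    """Weighted sum of the three component losses; any positive weights
    keep the predicted loss positively correlated with each component."""
    return w_recon * recon + w_disc * disc + w_triplet * triplet
```

During training, the model parameters would be adjusted until this value meets the preset convergence condition.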
Fig. 6 is a schematic structural diagram of a control device. Based on the same technical concept, the embodiment of the present application further provides a control device, which may include a memory 601 and a processor 602.
The memory 601 is used for storing the computer program executed by the processor 602. The memory 601 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created during use of the device, and the like. The processor 602 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 601 and the processor 602 is not limited in the embodiments of the present application. In Fig. 6, the memory 601 and the processor 602 are connected by a bus 603, which is represented by a thick line; the connection manner between other components is shown merely for illustration and is not limiting. The bus 603 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory 601 may be a volatile memory, such as a random-access memory (RAM); the memory 601 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 601 may also be a combination of the above memories.
The processor 602 is configured to, when invoking the computer program stored in the memory 601, execute the encoder training method provided in the embodiment shown in Fig. 2 and the characterization information extraction method provided in the embodiment shown in Fig. 4.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the encoder training method and the representation information extraction method in any of the above method embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Based on this understanding, the technical solutions above, in essence or in the parts contributing to the related art, may be embodied in the form of a software product. The software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a control device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. An encoder training method for image processing, comprising:
performing noise superposition processing on image sample data to obtain at least two image loss data, wherein the at least two image loss data are obtained by performing Laplace transform on the image sample data and superimposing noise;
aiming at the image sample data and the at least two image loss data, respectively adopting encoders with the same model parameters to carry out encoding processing to obtain corresponding image encoding characteristics;
adopting a corresponding decoder to decode the obtained image coding features to obtain corresponding image decoding features;
obtaining a discrimination loss based on each image coding feature, and obtaining a reconstruction loss based on the image sample data and each image decoding feature;
obtaining corresponding triple image training data according to the image sample data;
respectively adopting an encoder with the model parameters to carry out feature extraction processing on each training data in the triple image training data of the image sample data to obtain corresponding feature vectors;
determining a triplet loss representing the distance relationship between the feature vectors;
obtaining a predicted loss based on the reconstruction loss, the discrimination loss and the triplet loss, wherein the predicted loss is positively correlated with the reconstruction loss, the discrimination loss and the triplet loss;
and if the prediction loss conforms to the preset convergence condition, determining the model parameter as a reference value of the target parameter of the encoder, and if the prediction loss does not conform to the preset convergence condition, adjusting the model parameter until the prediction loss conforms to the preset convergence condition.
2. The method of claim 1, wherein the triple image training data comprises anchor point sample data, positive sample data and negative sample data, wherein the anchor point sample data is the image sample data, the positive sample data is obtained by performing random warping on the image sample data, and the negative sample data is data different from the image sample data.
3. The method of claim 2, wherein determining a triplet of penalties that characterize the distance relationship between feature vectors comprises:
determining a first distance between the feature vector of the anchor point sample data and the feature vector of the positive sample data;
determining a second distance between the feature vector of the anchor point sample data and the feature vector of the negative sample data;
determining a triplet penalty based on a difference between the first distance and the second distance.
4. The method of claim 1, 2 or 3, wherein obtaining a discriminant loss based on each image coding feature comprises:
respectively obtaining, by using a preset discrimination function, an original discrimination value of the image coding feature of the image sample data and a loss discrimination value of the image coding feature of each image loss data;
determining a discrimination loss based on the original discrimination value and each loss discrimination value;
wherein the discrimination loss represents the degree of similarity, in feature distribution, between the image coding features output by the encoder and the image coding features of the image sample data, and the discrimination loss is positively correlated with the original discrimination value and negatively correlated with each loss discrimination value.
5. The method of claim 1, 2 or 3, wherein obtaining a reconstruction loss based on the image sample data and image decoding features comprises:
respectively determining a decoding difference value between each image decoding feature and the image sample data;
obtaining a reconstruction loss based on the decoding difference values;
wherein the reconstruction loss is used for judging the degree of similarity, in the spatial domain, between the output data of the decoder and the image sample data.
6. An image representation information extraction method is characterized by comprising the following steps:
obtaining target model parameters of a target encoder by using the reference values of the target parameters of the encoder obtained by the method according to any one of claims 1 to 5;
initializing the target encoder according to the target model parameters;
and obtaining image representation information of the image data by adopting the target encoder.
7. An apparatus for training an encoder for image processing, comprising:
the superposition unit is used for performing noise superposition processing on image sample data to obtain at least two image loss data, wherein the at least two image loss data are obtained by performing Laplace transform on the image sample data and superimposing noise;
the encoding unit is used for respectively adopting encoders with the same model parameters to carry out encoding processing on the image sample data and the at least two image loss data to obtain corresponding image encoding characteristics;
the decoding unit is used for decoding the obtained image coding features by adopting a corresponding decoder to obtain corresponding image decoding features;
the first obtaining unit is used for obtaining discrimination loss based on each image coding characteristic and obtaining reconstruction loss based on the image sample data and each image decoding characteristic;
a second obtaining unit, configured to obtain, according to the image sample data, corresponding triple image training data;
an extracting unit, configured to perform feature extraction processing on each training data in the triple image training data of the image sample data by using an encoder having the model parameter, respectively, to obtain a corresponding feature vector;
the first determining unit is used for determining the triple loss representing the distance relation among the characteristic vectors;
a prediction unit, configured to obtain a prediction loss based on the reconstruction loss, the discrimination loss, and the triplet loss, where the prediction loss is positively correlated with the reconstruction loss, the discrimination loss, and the triplet loss;
and the second determining unit is used for determining the model parameter as a reference value of the target parameter of the encoder if the prediction loss meets a preset convergence condition, and adjusting the model parameter until the prediction loss meets the preset convergence condition if the prediction loss does not meet the preset convergence condition.
8. The apparatus of claim 7, wherein the triple image training data comprises anchor point sample data, positive sample data and negative sample data, wherein the anchor point sample data is the image sample data, the positive sample data is obtained by performing random warping on the image sample data, and the negative sample data is data different from the image sample data.
9. The apparatus of claim 8, wherein the first determination unit is to:
determining a first distance between the feature vector of the anchor point sample data and the feature vector of the positive sample data;
determining a second distance between the feature vector of the anchor point sample data and the feature vector of the negative sample data;
determining a triplet penalty based on a difference between the first distance and the second distance.
10. The apparatus of claim 7, 8 or 9, wherein the first obtaining unit is to:
respectively obtaining, by using a preset discrimination function, an original discrimination value of the image coding feature of the image sample data and a loss discrimination value of the image coding feature of each image loss data;
determining a discrimination loss based on the original discrimination value and each loss discrimination value;
wherein the discrimination loss represents the degree of similarity, in feature distribution, between the image coding features output by the encoder and the image coding features of the image sample data, and the discrimination loss is positively correlated with the original discrimination value and negatively correlated with each loss discrimination value.
11. The apparatus of claim 7, 8 or 9, wherein the first obtaining unit is to:
respectively determining a decoding difference value between each image decoding feature and the image sample data;
obtaining a reconstruction loss based on the decoding difference values;
wherein the reconstruction loss is used for judging the degree of similarity, in the spatial domain, between the output data of the decoder and the image sample data.
12. An image representation information extraction apparatus characterized by comprising:
an obtaining unit, configured to obtain target model parameters of a target encoder by using the reference values of the encoder target parameters obtained by the method according to any one of claims 1 to 5;
a setting unit, configured to initialize the target encoder according to the target model parameter;
and the extraction unit is used for obtaining the image representation information of the image data by adopting the target encoder.
13. A control apparatus, characterized by comprising:
at least one memory for storing program instructions;
at least one processor for calling program instructions stored in said memory and for executing the steps of the method of any of the preceding claims 1-5 or 6 according to the obtained program instructions.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5 or 6.
CN201910219343.XA 2019-03-21 2019-03-21 Encoder training and representation information extraction method and device Active CN110009013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910219343.XA CN110009013B (en) 2019-03-21 2019-03-21 Encoder training and representation information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910219343.XA CN110009013B (en) 2019-03-21 2019-03-21 Encoder training and representation information extraction method and device

Publications (2)

Publication Number Publication Date
CN110009013A CN110009013A (en) 2019-07-12
CN110009013B true CN110009013B (en) 2021-04-27

Family

ID=67167770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910219343.XA Active CN110009013B (en) 2019-03-21 2019-03-21 Encoder training and representation information extraction method and device

Country Status (1)

Country Link
CN (1) CN110009013B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442804A (en) * 2019-08-13 2019-11-12 北京市商汤科技开发有限公司 A kind of training method, device, equipment and the storage medium of object recommendation network
CN110910982A (en) * 2019-11-04 2020-03-24 广州金域医学检验中心有限公司 Self-coding model training method, device, equipment and storage medium
CN110889338A (en) * 2019-11-08 2020-03-17 中国铁道科学研究院集团有限公司基础设施检测研究所 Unsupervised railway track bed foreign matter detection and sample construction method and unsupervised railway track bed foreign matter detection and sample construction device
CN111046655B (en) * 2019-11-14 2023-04-07 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN117197615A (en) * 2019-12-09 2023-12-08 杭州海康威视数字技术股份有限公司 Model training method, feature extraction method and device
CN113159288B (en) * 2019-12-09 2022-06-28 支付宝(杭州)信息技术有限公司 Coding model training method and device for preventing private data leakage
CN111400754B (en) * 2020-03-11 2021-10-01 支付宝(杭州)信息技术有限公司 Construction method and device of user classification system for protecting user privacy
CN111291190B (en) * 2020-03-23 2023-04-07 腾讯科技(深圳)有限公司 Training method of encoder, information detection method and related device
CN111489803B (en) * 2020-03-31 2023-07-21 重庆金域医学检验所有限公司 Report form coding model generation method, system and equipment based on autoregressive model
CN111768457B (en) * 2020-05-14 2022-10-04 北京航空航天大学 Image data compression method, device, electronic equipment and storage medium
CN111639684B (en) * 2020-05-15 2024-03-01 北京三快在线科技有限公司 Training method and device for data processing model
CN111723812B (en) * 2020-06-05 2023-07-07 南强智视(厦门)科技有限公司 Real-time semantic segmentation method based on sequence knowledge distillation
CN111680787B (en) * 2020-06-12 2022-12-09 中国人民解放军战略支援部队信息工程大学 Side channel curve processing method and device and electronic equipment
CN111710346B (en) * 2020-06-18 2021-07-27 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN111738351B (en) * 2020-06-30 2023-12-19 创新奇智(重庆)科技有限公司 Model training method and device, storage medium and electronic equipment
CN112288699B (en) * 2020-10-23 2024-02-09 北京百度网讯科技有限公司 Method, device, equipment and medium for evaluating relative definition of image
CN112565763A (en) * 2020-11-30 2021-03-26 北京达佳互联信息技术有限公司 Abnormal image sample generation method and device, and image detection method and device
CN112541944B (en) * 2020-12-10 2022-07-12 山东师范大学 Probability twin target tracking method and system based on conditional variational encoder
CN114625871B (en) * 2020-12-14 2023-06-23 四川大学 Ternary grouping method based on attention position joint coding
CN113268631B (en) * 2021-04-21 2024-04-19 北京点众快看科技有限公司 Video screening method and device based on big data
CN113240021B (en) * 2021-05-19 2021-12-10 推想医疗科技股份有限公司 Method, device and equipment for screening target sample and storage medium
CN113836866B (en) * 2021-06-04 2024-05-24 腾讯科技(深圳)有限公司 Text encoding method, text encoding device, computer readable medium and electronic equipment
CN113378921A (en) * 2021-06-09 2021-09-10 北京百度网讯科技有限公司 Data screening method and device and electronic equipment
CN113592769B (en) * 2021-06-23 2024-04-12 腾讯医疗健康(深圳)有限公司 Abnormal image detection and model training method, device, equipment and medium
CN113470758B (en) * 2021-07-06 2023-10-13 北京科技大学 Chemical reaction yield prediction method based on causal discovery and multi-structure information coding
CN114429179B (en) * 2022-01-11 2024-02-09 中国人民解放军国防科技大学 Unmanned platform-oriented capability computing method and system
CN114418069B (en) * 2022-01-19 2024-06-14 腾讯科技(深圳)有限公司 Encoder training method, encoder training device and storage medium
CN114490950B (en) * 2022-04-07 2022-07-12 联通(广东)产业互联网有限公司 Method and storage medium for training encoder model, and method and system for predicting similarity
CN114915786B (en) * 2022-04-26 2023-07-28 哈尔滨工业大学(深圳) Asymmetric semantic image compression method for Internet of things scene
CN115116451A (en) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 Audio decoding method, audio encoding method, audio decoding device, audio encoding device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437077A (en) * 2017-08-04 2017-12-05 深圳市唯特视科技有限公司 A kind of method that rotation face based on generation confrontation network represents study
CN107862668A (en) * 2017-11-24 2018-03-30 河海大学 A kind of cultural relic images restored method based on GNN
CN109002488A (en) * 2018-06-26 2018-12-14 北京邮电大学 A kind of recommended models training method and device based on first path context

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980641B (en) * 2017-02-09 2020-01-21 上海媒智科技有限公司 Unsupervised Hash quick picture retrieval system and unsupervised Hash quick picture retrieval method based on convolutional neural network
US10255681B2 (en) * 2017-03-02 2019-04-09 Adobe Inc. Image matting using deep learning
US10574959B2 (en) * 2017-07-05 2020-02-25 Qualcomm Incorporated Color remapping for non-4:4:4 format video content
US11734955B2 (en) * 2017-09-18 2023-08-22 Board Of Trustees Of Michigan State University Disentangled representation learning generative adversarial network for pose-invariant face recognition
CN108537742B (en) * 2018-03-09 2021-07-09 天津大学 Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN108428221A (en) * 2018-03-26 2018-08-21 广东顺德西安交通大学研究院 A kind of neighborhood bivariate shrinkage function denoising method based on shearlet transformation
CN108226892B (en) * 2018-03-27 2021-09-28 天津大学 Deep learning-based radar signal recovery method in complex noise environment
CN108600750A (en) * 2018-04-10 2018-09-28 山东师范大学 Multiple description coded, coding/decoding method based on KSVD and system
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN109033938A (en) * 2018-06-01 2018-12-18 上海阅面网络科技有限公司 A kind of face identification method based on ga s safety degree Fusion Features
CN108875818B (en) * 2018-06-06 2020-08-18 西安交通大学 Zero sample image classification method based on combination of variational self-coding machine and antagonistic network
CN109063731B (en) * 2018-06-26 2020-11-10 北京航天自动控制研究所 Scene adaptability criterion training sample set generation method
CN109145129B (en) * 2018-09-07 2020-03-31 深圳码隆科技有限公司 Depth measurement learning method and device based on hierarchical triple loss function

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437077A (en) * 2017-08-04 2017-12-05 深圳市唯特视科技有限公司 A kind of method that rotation face based on generation confrontation network represents study
CN107862668A (en) * 2017-11-24 2018-03-30 河海大学 A kind of cultural relic images restored method based on GNN
CN109002488A (en) * 2018-06-26 2018-12-14 北京邮电大学 A kind of recommended models training method and device based on first path context

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Exploring Asymmetric Encoder-Decoder Structure for Context-based Sentence Representation Learning"; Shuai Tang, et al.; 《arXiv》; 20180601; full text *
"Recent Advances in Autoencoder-Based Representation Learning"; Michael Tschannen, et al.; 《arXiv》; 20181212; full text *

Also Published As

Publication number Publication date
CN110009013A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110009013B (en) Encoder training and representation information extraction method and device
CN111209952B (en) Underwater target detection method based on improved SSD and migration learning
Fu et al. Removing rain from single images via a deep detail network
Zhang et al. Adaptive residual networks for high-quality image restoration
CN109948796B (en) Self-encoder learning method, self-encoder learning device, computer equipment and storage medium
CN112132959B (en) Digital rock core image processing method and device, computer equipment and storage medium
Tewari et al. Diffusion with forward models: Solving stochastic inverse problems without direct supervision
Shi et al. Unsharp mask guided filtering
Zhao et al. A deep cascade of neural networks for image inpainting, deblurring and denoising
CN114170184A (en) Product image anomaly detection method and device based on embedded feature vector
Zhang et al. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement
CN111199197A (en) Image extraction method and processing equipment for face recognition
Kratzwald et al. Improving video generation for multi-functional applications
Wang et al. An efficient remote sensing image denoising method in extended discrete shearlet domain
Zin et al. Local image denoising using RAISR
Yang et al. Infrared image super-resolution with parallel random Forest
Tsuji et al. Non-guided depth completion with adversarial networks
Viriyavisuthisakul et al. Parametric regularization loss in super-resolution reconstruction
Yang et al. Single frame image super resolution via learning multiple anfis mappings
Chen et al. A deep motion deblurring network using channel adaptive residual module
Pahwa et al. LVRNet: Lightweight image restoration for aerial images under low visibility
Zhang et al. Se-dcgan: a new method of semantic image restoration
Wyzykowski et al. A Universal Latent Fingerprint Enhancer Using Transformers
Han et al. Semantic-Aware Face Deblurring with Pixel-Wise Projection Discriminator
Khan et al. Perceptual adversarial non-residual learning for blind image denoising

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40008583

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant