CN117437440A - Model training method, image processing method and related device


Info

Publication number
CN117437440A
Authority
CN
China
Prior art keywords
image
feature
statistical information
model
network
Legal status
Pending
Application number
CN202210801213.9A
Other languages
Chinese (zh)
Inventor
陈鑫
田奇
张恒亨
王延峰
姜宇轩
张娅
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202210801213.9A
Publication of CN117437440A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/72 Data preparation, e.g. statistical preprocessing of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A model training method is applied to the field of artificial intelligence and can improve the domain generalization capability of models. In the method, during training the model predicts the feature statistical information of the domain where the input image is located based on the feature statistical information of the input image and the feature statistical information of images from other domains, and then processes the input image based on the feature statistical information of the domain where the input image is located. In predicting the feature statistical information, the model considers the commonality between the input image and the images of other domains while also taking into account the uniqueness of the input image's own feature distribution, so that the model can effectively predict the feature distribution information of the domain where the input image is located. In this way, once the model has learned to predict the feature distribution of the domain where the input image is located, the input image can be processed based on that feature distribution, which effectively improves the domain generalization capability of the model.

Description

Model training method, image processing method and related device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a model training method, an image processing method, and a related device.
Background
With the development of deep learning, deep learning models have made considerable progress in various computer vision tasks. Currently, deep learning models are usually trained on labeled source domain data, so they often perform well only when processing data whose distribution is close to that of the source domain data. When the feature distribution of the target domain data to be processed differs from that of the source domain data, the performance of a deep learning model trained on the source domain data may drop drastically on the target domain data. For example, a face recognition model trained using images of eastern people may exhibit significantly degraded performance when used to recognize images of western people.
To solve the problem of feature distribution deviation between source domain data and target domain data, domain generalization techniques have emerged. The goal of domain generalization is to train a robust model on multiple sets of source domain data, so that the model generalizes well to target domain data with unknown feature distributions. Existing domain generalization methods mainly start from the parameters of the model and pursue domain invariance from the perspective of the model parameter distribution.
However, existing domain generalization methods focus on the invariance among different domains while ignoring the uniqueness of each domain, resulting in poor generalization capability of the model.
Disclosure of Invention
The present application provides a model training method that enables a model to learn to predict the feature distribution of the domain where the input image is located, so that the input image is processed based on that feature distribution, effectively improving the domain generalization capability of the model.
The first aspect of the present application provides a model training method, including the following steps: firstly, a training set is obtained, wherein the training set comprises a plurality of groups of images belonging to different source domains, the images in the different source domains have different characteristic distributions, and each group of images is from the same source domain. That is, the images in the training set are divided into a plurality of groups of images according to the source domain to which they belong, each group of images corresponds to one source domain, and a plurality of images are included in each group of images.
Then, based on a first image and at least one group of images in the training set, first feature statistical information corresponding to the source domain where the first image is located is obtained through model prediction, wherein the first feature statistical information is used for indicating the feature distribution of the images in the source domain where the first image is located, and the at least one group of images are images in the training set that belong to source domains different from that of the first image. Then, based on the feature map of the first image and the first feature statistical information, a first processing result corresponding to the first image is obtained through the model.
Finally, a target loss value is determined based on the first processing result, and the model is trained based on the target loss value. Wherein the target loss value can be used to characterize a difference between the first processing result and a truth result corresponding to the first image.
In this scheme, during model training, the model predicts the feature statistical information of the domain where the input image is located based on the input image and images of other domains, and then processes the input image based on the feature statistical information of the domain where the input image is located. In predicting the feature statistical information, the model considers the commonality between the input image and the images of other domains while also taking into account the uniqueness of the input image's own feature distribution, so that the model can effectively predict the feature distribution information of the domain where the input image is located. In this way, once the model has learned to predict the feature distribution of the domain where the input image is located, the input image can be processed based on that feature distribution, which effectively improves the domain generalization capability of the model.
Optionally, based on the first image and at least one group of images, the specific process of obtaining the first feature statistical information corresponding to the source domain where the first image is located through model prediction includes: extracting second characteristic statistical information corresponding to a first image and third characteristic statistical information corresponding to at least one group of images in the training set, wherein the second characteristic statistical information and the third characteristic statistical information are used for indicating characteristic distribution of the images; inputting the second characteristic statistical information and the third characteristic statistical information into a first sub-network in a model, and predicting to obtain the first characteristic statistical information corresponding to the source domain where the first image is located. The first feature statistical information is used for indicating feature distribution of the image in the source domain where the first image is located.
Optionally, based on the feature map of the first image and the first feature statistical information, a first processing result corresponding to the first image is obtained through the model, which specifically includes: inputting the first image and the first characteristic statistical information into a second sub-network in the model to obtain a first processing result corresponding to the first image; the second sub-network is used for extracting a feature map of the first image and carrying out normalization processing on the feature map based on the first feature statistical information.
Optionally, the second feature statistical information includes a mean and a variance of a feature map of the first image; the third feature statistics include a mean and variance of feature maps of the at least one set of images.
Optionally, the target loss value includes a first loss value and a second loss value. The determining a target loss value based on the first processing result includes: determining the first loss value based on the first processing result and a true value result corresponding to the first image; acquiring fourth characteristic statistical information corresponding to the image in the source domain where the first image is located based on the training set, and determining the second loss value based on the first characteristic statistical information and the fourth characteristic statistical information; and determining the target loss value according to the first loss value and the second loss value.
In this scheme, a loss value determined from the predicted feature statistical information and the true feature statistical information is introduced, which effectively guides the first sub-network to predict the feature statistical information of the domain where the input image is located, improving the prediction accuracy of the first sub-network and thereby the generalization performance of the model.
Optionally, the model further includes a third sub-network, and the target loss value further includes a third loss value. The method further includes: inputting the images in the source domain where the first image is located into the third sub-network to obtain a second processing result corresponding to the first image, wherein the images in the source domain where the first image is located include the first image, and the third sub-network is used for extracting the fourth feature statistical information based on the images in the source domain where the first image is located and performing batch normalization processing on the feature map of the first image based on the fourth feature statistical information. The determining the target loss value from the first loss value and the second loss value includes: determining the third loss value based on the second processing result and the truth result corresponding to the first image; and determining the target loss value according to the first loss value, the second loss value, and the third loss value.
In this scheme, the third sub-network performs a conventional processing flow on the input image, and the loss value obtained from the output of the third sub-network is introduced to constrain the training of the first and second sub-networks in the model, effectively improving the prediction performance of the first and second sub-networks and thereby the generalization performance of the model.
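As a hedged illustration of how the loss values could be combined (this application only states that the target loss value is determined from the individual loss values; the weighting coefficients λ1 and λ2 below are assumptions):

```latex
\mathcal{L}_{\mathrm{target}} \;=\; \mathcal{L}_{1} \;+\; \lambda_{1}\,\mathcal{L}_{2} \;+\; \lambda_{2}\,\mathcal{L}_{3}
```

where L1 is the task loss between the first processing result and the truth result, L2 measures the discrepancy between the predicted first feature statistical information and the true fourth feature statistical information, and L3 is the task loss on the output of the third sub-network.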
Optionally, the method further comprises: acquiring a first translation parameter and a first scaling parameter, wherein the first translation parameter and the first scaling parameter are the parameters used by the third sub-network for performing batch normalization processing on the feature map of the first image; and predicting a second translation parameter and a second scaling parameter according to the second feature statistical information, the first translation parameter, and the first scaling parameter. The second sub-network is used for normalizing the feature map of the first image based on the first feature statistical information, the second translation parameter, and the second scaling parameter.
In this scheme, the parameters used to normalize the input image based on the predicted feature statistical information are themselves predicted by combining the feature statistical information of the input image with the parameters used for batch normalization, so that parameters well matched to the input image can be obtained for the normalization process, improving the model's processing of the input image and thus effectively improving the generalization performance of the model.
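A minimal sketch of this parameter prediction, assuming a single fully connected predictor (the class name, the use of one linear layer, and the input layout are our assumptions, not specified in this application):

```python
import torch
import torch.nn as nn

class AffinePredictor(nn.Module):
    """Hypothetical predictor: maps the first image's feature statistics
    plus the batch-norm translation/scaling parameters of the third
    sub-network to a second translation and scaling parameter."""

    def __init__(self, channels):
        super().__init__()
        # Input: per-channel mean, variance, first translation, first scaling.
        self.fc = nn.Linear(4 * channels, 2 * channels)

    def forward(self, mean, var, first_shift, first_scale):
        z = torch.cat([mean, var, first_shift, first_scale], dim=-1)
        second_shift, second_scale = self.fc(z).chunk(2, dim=-1)
        return second_shift, second_scale
```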
Optionally, the first sub-network includes an encoder and a decoder, and both the encoder and the decoder are fully connected layer structures.
Optionally, a rectified linear unit (ReLU) activation function is further connected between the encoder and the decoder, and a hyperbolic tangent (Tanh) activation function is further connected after the decoder.
Optionally, the second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract a feature map of the first image, the normalization layer is configured to normalize the feature map based on the first feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the first processing result.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
A second aspect of the present application provides an image processing method, including: acquiring a target domain image; extracting first feature statistical information corresponding to the target domain image, wherein the first feature statistical information is used for indicating the feature distribution of the target domain image; obtaining third feature statistical information corresponding to the target domain where the target domain image is located based on the first feature statistical information and second feature statistical information, wherein the second feature statistical information is the feature statistical information corresponding to the source domain images used by the model in the training process, and the target domain image and the source domain images have different feature distributions; and obtaining a processing result corresponding to the target domain image based on the target domain image and the third feature statistical information.
Optionally, the obtaining, based on the target domain image and the third feature statistical information, a processing result corresponding to the target domain image includes: and extracting a feature map of the target domain image, and carrying out normalization processing on the feature map based on the third feature statistical information to obtain a processing result corresponding to the target domain image.
Optionally, the first feature statistical information includes a mean and a variance of a feature map of the target domain image; the second feature statistics include a mean and a variance of feature maps of the source domain image.
Optionally, the model includes a first sub-network, the first sub-network is configured to obtain the third feature statistical information based on the first feature statistical information and the second feature statistical information, the first sub-network includes an encoder and a decoder, and both the encoder and the decoder are fully connected layer structures.
Optionally, a ReLU activation function is further connected between the encoder and the decoder, and a Tanh activation function is further connected behind the decoder.
Optionally, the model includes a second sub-network, where the second sub-network is configured to obtain the processing result corresponding to the target domain image based on the target domain image and the third feature statistical information, and the second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract a feature map of the target domain image, the normalization layer is configured to normalize the feature map based on the third feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the processing result.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
A third aspect of the present application provides a model training apparatus, comprising: the acquisition unit is used for acquiring a training set, wherein the training set comprises a plurality of groups of images belonging to different source domains, the images in the different source domains have different characteristic distribution, and each group of images is from the same source domain; the processing unit is used for obtaining first characteristic statistical information corresponding to a source domain where the first image is located through a model based on the first image and at least one group of images in the training set, wherein the at least one group of images are images belonging to different source domains with the first image in the training set; the processing unit is further used for obtaining a first processing result corresponding to the first image through the model based on the first image and the first feature statistical information; the processing unit is further configured to determine a target loss value based on the first processing result, and train the model based on the target loss value.
Optionally, the processing unit is further configured to: extracting second characteristic statistical information corresponding to a first image and third characteristic statistical information corresponding to at least one group of images in the training set, wherein the second characteristic statistical information and the third characteristic statistical information are used for indicating characteristic distribution of the images; inputting the second feature statistical information and the third feature statistical information into a first sub-network in the model, and predicting to obtain the first feature statistical information corresponding to the source domain where the first image is located, where the first feature statistical information is used for indicating feature distribution of the image in the source domain where the first image is located.
Optionally, the processing unit is further configured to: inputting the first image and the first characteristic statistical information into a second sub-network in the model to obtain a first processing result corresponding to the first image; the second sub-network is used for extracting a feature map of the first image and carrying out normalization processing on the feature map based on the first feature statistical information.
Optionally, the second feature statistical information includes a mean and a variance of a feature map of the first image; the third feature statistics include a mean and variance of feature maps of the at least one set of images.
Optionally, the target loss value includes a first loss value and a second loss value; the processing unit is further configured to: determine the first loss value based on the first processing result and the truth result corresponding to the first image; acquire fourth feature statistical information corresponding to the images in the source domain where the first image is located based on the training set, and determine the second loss value based on the first feature statistical information and the fourth feature statistical information; and determine the target loss value according to the first loss value and the second loss value.
Optionally, the model further includes a third sub-network, and the target loss value further includes a third loss value; the processing unit is further configured to: input the images in the source domain where the first image is located into the third sub-network to obtain a second processing result corresponding to the first image, wherein the images in the source domain where the first image is located include the first image, and the third sub-network is used for extracting the fourth feature statistical information based on the images in the source domain where the first image is located and performing batch normalization processing on the feature map of the first image based on the fourth feature statistical information; determine the third loss value based on the second processing result and the truth result corresponding to the first image; and determine the target loss value according to the first loss value, the second loss value, and the third loss value.
Optionally, the acquiring unit is further configured to acquire a first translation parameter and a first scaling parameter, where the first translation parameter and the first scaling parameter are parameters used for performing batch normalization processing on the feature map of the first image in the third sub-network; the processing unit is further configured to predict a second translation parameter and a second scaling parameter according to the second feature statistical information, the first translation parameter and the first scaling parameter; the second sub-network is used for normalizing the feature map of the first image based on the first feature statistical information, the second translation parameter and the second scaling parameter.
Optionally, the first sub-network includes an encoder and a decoder, and both the encoder and the decoder are fully connected layer structures.
Optionally, a ReLU activation function is further connected between the encoder and the decoder, and a Tanh activation function is further connected behind the decoder.
Optionally, the second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract a feature map of the first image, the normalization layer is configured to normalize the feature map based on the first feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the first processing result.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
A fourth aspect of the present application provides an image processing apparatus, comprising: an acquisition unit configured to acquire a target domain image; the extraction unit is used for extracting first feature statistical information corresponding to the target domain image, wherein the first feature statistical information is used for indicating feature distribution of the target domain image; the processing unit is used for obtaining third characteristic statistical information corresponding to a target domain where the target domain image is located based on the first characteristic statistical information and the second characteristic statistical information, wherein the second characteristic statistical information is characteristic statistical information corresponding to a source domain image adopted by the model in a training process, and the target domain image and the source domain image have different characteristic distribution; the processing unit is further configured to obtain a processing result corresponding to the target domain image based on the target domain image and the third feature statistical information.
Optionally, the processing unit is specifically configured to extract a feature map of the target domain image, and normalize the feature map based on the third feature statistical information to obtain a processing result corresponding to the target domain image.
Optionally, the first feature statistical information includes a mean and a variance of a feature map of the target domain image;
the second feature statistics include a mean and a variance of feature maps of the source domain image.
Optionally, the model includes a first sub-network, the first sub-network is configured to obtain the third feature statistical information based on the first feature statistical information and the second feature statistical information, the first sub-network includes an encoder and a decoder, and both the encoder and the decoder are fully connected layer structures.
Optionally, a ReLU activation function is further connected between the encoder and the decoder, and a Tanh activation function is further connected behind the decoder.
Optionally, the model includes a second sub-network, where the second sub-network is configured to obtain the processing result corresponding to the target domain image based on the target domain image and the third feature statistical information, and the second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract a feature map of the target domain image, the normalization layer is configured to normalize the feature map based on the third feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the processing result.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
A fifth aspect of the present application provides an electronic device, which may include a processor, the processor being coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method according to any one of the implementations of the first or second aspect. For the steps performed by the processor in each possible implementation of the first or second aspect, reference may be made to the first or second aspect; details are not repeated here.
A sixth aspect of the present application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of any implementation of the first or second aspect described above.
A seventh aspect of the present application provides a circuit system comprising a processing circuit configured to perform the method of any implementation of the first or second aspect described above.
An eighth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of any of the implementations of the first or second aspect above.
A ninth aspect of the present application provides a chip system, the chip system including a processor configured to support a server or threshold value acquisition device in implementing the functions involved in any of the implementations of the first or second aspect, for example, sending or processing the data and/or information involved in the above method. In one possible design, the chip system further includes a memory for holding the program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The advantages of the second to ninth aspects may be referred to the description of the first aspect, and are not repeated here.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic diagram of comparing source domain data and target domain data according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of processing images in a model training process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another flow of processing images in a model training process according to an embodiment of the present application;
Fig. 6 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of model training according to an embodiment of the present disclosure;
fig. 8 is a schematic flow chart of an image processing according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a model training device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a chip according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems.
The terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances, and that this is merely a way of distinguishing objects of the same nature when describing the embodiments of the present application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, technical terms related to embodiments of the present application are described below.
(1) Meta Learning
Meta-learning, also commonly referred to as learning to learn, aims to give a model the ability to "learn how to learn," so that the model can use past knowledge and experience to guide the learning of new tasks, and thus quickly master new tasks on the basis of the knowledge it has already acquired.
(2) Domain generalization
Domain generalization refers to learning a model with strong generalization capability from multiple data sets (domains) with different data distributions, so as to obtain better results on test sets with unknown data distributions. In domain generalization, the domains used for training are generally referred to as source domains, and the unseen domain to be generalized to is generally referred to as the target domain. The task of domain generalization is to transfer knowledge learned from source domain data with abundant samples to a target domain with fewer samples and lacking labels, thereby learning a model with strong generalization capability.
(3) Normalization
Normalization refers to processing data with an algorithm so that it is limited to a certain range, for example, within [0,1]. In general, normalization is performed for the convenience of subsequent data processing and to accelerate convergence during model training. Overall, the specific role of normalization is to unify the statistical distributions of the samples.
(4) Batch normalization (Batch Normalization, BN)
Batch normalization refers to normalizing each batch of data input to the model during training. The batch normalization process relies on the per-channel mean and variance of the batch data, together with the per-channel translation and scaling parameters learned in the model.
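As an illustration of this definition, a minimal batch-normalization sketch in Python (function and variable names and the eps constant are ours; this is not the implementation of this application):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of feature maps x of shape (N, C, H, W).

    gamma: learned per-channel scaling parameters, shape (C,)
    beta:  learned per-channel translation parameters, shape (C,)
    """
    # Per-channel mean and variance computed over the whole batch.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```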
(5) Feature map
A feature map is the result obtained after an input image is processed by a neural network. The feature map characterizes features of the image in feature space, and its resolution depends on the stride of the convolution kernels in the neural network. Typically, there are multiple convolution kernels between layers of the neural network, and each feature map in the upper layer is convolved with each convolution kernel to generate a feature map of the next layer.
(6) Source domain
In domain generalization, a source domain refers to the domain in which labeled training data is located, i.e., the feature distribution space of the training data. The source domain data refers to the labeled training data.
(7) Target domain
In domain generalization, a target domain refers to a domain in which data without labels and with unknown feature distribution is located. Generally, the target domain data refer to the data that the model needs to process after training is completed and whose feature distribution differs from that of the training data.
(8) Feature statistics
Feature statistics refer to the per-channel mean and variance of the feature maps of a single image.
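In code, the feature statistics of a single image's feature map could be computed as follows (a sketch under the definition above):

```python
import numpy as np

def feature_statistics(feature_map):
    """Per-channel mean and variance of one image's feature map (C, H, W)."""
    mu = feature_map.mean(axis=(1, 2))       # mean of each channel
    sigma2 = feature_map.var(axis=(1, 2))    # variance of each channel
    return mu, sigma2
```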
(9) Convolutional Neural Network (CNN)
A convolutional neural network is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor can be seen as a filter, and the convolution process can be seen as convolving an input image or a convolutional feature plane (feature map) with a trainable filter. A convolutional layer refers to a neuron layer in the convolutional neural network that performs convolution processing on an input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neurons arranged in a rectangular array. Neurons in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as making the way image information is extracted independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, meaning that image information learned in one part can also be used in another part; the same learned image information can therefore be used for all locations on the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; in general, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be obtained through learning during the training of the convolutional neural network. In addition, a direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while also reducing the risk of overfitting.
The convolutional neural network can adopt a back propagation (BP) algorithm to correct the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
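To make the weight-sharing and back-propagation description concrete, a minimal PyTorch sketch follows (the network shape, learning rate, and classification task are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A convolutional layer: 16 kernels shared across all spatial locations
# (weight sharing), each kernel producing one feature plane.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3, 32, 32)           # a batch of 8 RGB images
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(x), labels)
loss.backward()                          # back-propagate the error loss
optimizer.step()                         # update weights so the loss converges
```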
(10) ReLU function
The ReLU function represents a linear rectification function, also known as a rectified linear unit, and is a commonly used activation function in artificial neural networks.
(11) Sigmoid function
The Sigmoid function is an S-shaped function. Because both the function and its inverse are monotonically increasing, it is used as an activation function for neural networks to map variables into the interval [0,1].
(12) Tanh function
The Tanh function is an activation function for neural networks. Compared with the Sigmoid function, the output mean of the Tanh function is 0, so the Tanh function converges faster than the Sigmoid function.
(13) Softmax function
The Softmax function, also called the normalized exponential function, is a generalization of the logistic function. The Softmax function can transform a K-dimensional vector z containing arbitrary real numbers into another K-dimensional vector σ(z), such that each element of σ(z) lies in the range (0,1) and all elements sum to 1.
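For reference, the standard definitions of the activation functions in (10) to (13) can be written as:

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K.
```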
(14) Residual network (ResNet)
A traditional neural network consists of convolutional layers, pooling layers, fully connected layers, and a softmax layer. A residual network adds residual structures on top of the traditional neural network. Specifically, to address the degradation problem in deep networks, certain layers in a residual network skip the next layer of neurons and connect to deeper layers, weakening the strong coupling between adjacent layers.
For easy understanding, a scenario in which the model training method and the image processing method provided in the embodiments of the present application are applied will be described below.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application. As shown in fig. 1, the model training method provided in the embodiment of the present application trains a model to be trained using training data to obtain a trained model. The training data used in the model training stage include image data from a plurality of different source domains, and the image data in different source domains have different feature distributions. For example, the training data include photo-style images, cartoon-style images, and art-style images, where the photo-style images are images in one source domain, the cartoon-style images are images in another source domain, and the art-style images are images in yet another source domain.
After the trained model is obtained, the image processing method provided in the embodiment of the present application uses the trained model to process target domain data and obtain a processing result corresponding to the target domain data. The target domain data have a feature distribution different from that of the source domain data used in the training stage. Referring to fig. 2, fig. 2 is a schematic diagram illustrating a comparison of source domain data and target domain data according to an embodiment of the present application. As shown in fig. 2, in the model training stage, the training data include images from three different source domains (i.e., source domain 1, source domain 2, and source domain 3), which are photo-style images, cartoon-style images, and art-style images, respectively. In the model inference stage, the target domain data may be sketch-style images.
The model training method and the image processing method provided by the embodiment of the application can be applied to one or more of the following tasks: image classification, image recognition, image enhancement, object detection, image segmentation, and the like. The present embodiment does not limit the task to which the model training method and the image processing method are applied.
Image classification is an important support for object detection and semantic segmentation, and aims to divide different images into different categories with minimal classification error. Today, image classification is applied in a wide range of scenarios, such as automatic album classification on smartphones, product defect identification, autonomous driving, etc. Where the method provided in the embodiment of the present application is applied to image classification, images from different source domains, such as photo-style images, cartoon-style images, and art-style images, can be used to train the model, and the trained model can effectively classify images in a target domain (such as sketch-style images).
The purpose of image recognition is to determine whether an object of interest exists in an image, and further to recognize the object of interest in the image (for example, recognizing the identity of a face in face recognition). Where the method provided in the embodiment of the present application is applied to image recognition, images from different source domains, such as images of Asian faces, images of European faces, and images of African faces, can be used to train the model, and the trained model can effectively recognize images in a target domain (for example, images of American faces).
Image enhancement can generally remove image noise effectively, enhance image edges, highlight the important information required in an image, and remove or weaken unimportant information, thereby improving the visual quality of the image and making it more suitable for human observation or machine recognition.
Target detection, also called target extraction, is image segmentation based on the geometric and statistical features of targets. Target detection combines target recognition and segmentation, and is usually used to recognize a specific target in an image and segment the target from the image.
The goal of image segmentation is to identify and divide objects of different categories in an image. Briefly, for an image, there may be multiple objects, multiple people, and even multiple layers of background in the image, and image segmentation is used to predict the category to which each pixel in the image belongs.
The scenario where the method provided by the embodiment of the present application is applied is described above, and the device where the method provided by the embodiment of the present application is applied will be described below.
The model training method and the image processing method provided by the embodiment of the application can be applied to electronic equipment. The model training method and the image processing method can be applied to the same electronic equipment; alternatively, the model training method and the image processing method are applied to different electronic devices, for example, the model training method is applied to a server, and the image processing method is applied to a mobile phone. By way of example, the terminal may be, for example, a mobile phone, a personal computer (personal computer, PC), a notebook computer, a server, a tablet, a smart television, a mobile internet device (mobile internet device, MID), a wearable device, a Virtual Reality (VR) device, an augmented reality (augmented reality, AR) device, a wireless terminal in industrial control (industrial control), a wireless terminal in unmanned driving (self driving), a wireless terminal in teleoperation (remote medical surgery), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation security (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), etc. For convenience of description, the model training method and the image processing method provided in the embodiments of the present application will be described below by taking an example in which the model training method and the image processing method are applied to a server.
Referring to fig. 3, fig. 3 is a flow chart of a model training method according to an embodiment of the present application. As shown in fig. 3, the model training method provided in the embodiment of the present application includes the following steps 301 to 304.
Step 301, a training set is obtained, wherein the training set comprises a plurality of groups of images belonging to different source domains, and the images in the different source domains have different feature distributions.
The server may obtain a training set for training the model before performing model training. Wherein the training set includes a plurality of images for training, and the images in the training set correspond to different source fields. Specifically, the images in the training set are divided into a plurality of groups of images according to the source domain to which the images belong, each group of images corresponds to one source domain, and each group of images includes a plurality of images. Images in different source domains have different feature distributions, i.e. images in different groups have different feature distributions.
For example, in the case where the images in the training set correspond to three source fields, the images in the training set may be divided into three groups of images, and 10000 images are included in each group of images. Specifically, among the three sets of images in the training set, the first set of images is a photograph-style image, the second set of images is a cartoon-style image, and the third set of images is an artistic-style image.
Step 302, obtaining, through a model, first feature statistical information corresponding to a source domain where the first image is located based on the first image and at least one group of images in the training set, where the first feature statistical information is used to indicate feature distribution of the image in the source domain where the first image is located, and the at least one group of images are images belonging to different source domains with the first image in the training set.
In this embodiment, the first image is an image used in the model training iterative process, and the first image may be any image in the training set. In the model training process, the server determines at least one group of images in the training set according to the first image, wherein the at least one group of images are images belonging to different source domains from the first image in the training set, and each group of images in the at least one group of images comprises a plurality of images.
In short, in the case where M groups of images belonging to different source domains are included in the training set, the server may determine N groups of images belonging to source domains different from that of the first image, and extract the corresponding feature statistical information based on the N groups of images. M is an integer greater than or equal to 2, and N is an integer greater than 0 and less than M. For example, when M is 3, N may be 2.
For example, assume that images in a training set are divided into three groups of images, namely, images of photo style sets, images of cartoon style sets, and images of art style sets, according to the source domain to which the images belong. In the case where the first image is an image in a photo stylegroup, the server may determine that at least one of the images is an image in a cartoon stylegroup and an image in an artistic stylegroup.
Optionally, in this embodiment, after extracting second feature statistics corresponding to the first image and third feature statistics corresponding to the at least one group of images in the training set, the second feature statistics and the third feature statistics may be input into a first sub-network in the model, so as to predict and obtain the first feature statistics corresponding to a source domain where the first image is located.
The second feature statistical information corresponding to the first image may include the mean and variance of the feature map of the first image, and the second feature statistical information can indicate the feature distribution of the first image. The third feature statistical information corresponding to the at least one group of images includes the mean and variance of the feature maps of the at least one group of images, and the third feature statistical information can indicate the overall feature distribution of the images in the at least one group of images.
Specifically, in this embodiment, the first sub-network in the model may process the first image and at least one group of images in the training set to obtain the first feature statistical information corresponding to the source domain where the first image is located. The first sub-network is used for predicting the characteristic statistical information of the source domain where the first image is located based on the characteristic statistical information of the first image and the characteristic statistical information of other images belonging to different source domains with the first image. In brief, the first sub-network combines the feature distribution of the first image itself and the feature distribution of the images of other source domains to infer the overall feature distribution of the image of the source domain where the first image is located.
Alternatively, the first sub-network may be a self-encoder, i.e., the first sub-network includes an encoder and a decoder, and both the encoder and the decoder are fully connected layer structures. In addition, a ReLU activation function is connected between the encoder and the decoder, and a Tanh activation function is connected after the decoder; that is, the encoder, the ReLU activation function, the decoder, and the Tanh activation function are connected in sequence. In this embodiment, the first sub-network may also be implemented with another structure, as long as it can fuse the second feature statistical information and the third feature statistical information and predict the first feature statistical information corresponding to the source domain where the first image is located; the specific structure of the first sub-network is not limited here.
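As one possible reading of this self-encoder structure, a hedged PyTorch sketch of the first sub-network follows (the layer widths and the concatenation of the two sets of statistics are assumptions; this application does not fix them):

```python
import torch
import torch.nn as nn

class DomainStatsPredictor(nn.Module):
    """Sketch of the first sub-network: fully connected encoder -> ReLU ->
    fully connected decoder -> Tanh, fusing the second and third feature
    statistical information to predict the first feature statistical
    information of the source domain where the first image is located."""

    def __init__(self, stat_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(2 * stat_dim, hidden_dim)  # fully connected
        self.decoder = nn.Linear(hidden_dim, stat_dim)      # fully connected
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

    def forward(self, second_stats, third_stats):
        z = torch.cat([second_stats, third_stats], dim=-1)
        return self.tanh(self.decoder(self.relu(self.encoder(z))))
```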
Step 303, obtaining a first processing result corresponding to the first image through a model based on the first image and the first feature statistical information.
For example, in this embodiment, the first image and the first feature statistical information may be input into the second sub-network in the model, so as to obtain a first processing result corresponding to the first image. The second sub-network is used for extracting a feature map of the first image and carrying out normalization processing on the feature map based on the first feature statistical information.
In this embodiment, the model to be trained includes the first sub-network and the second sub-network described above. After the first sub-network in the model predicts the first feature statistical information corresponding to the source domain where the first image is located, the second sub-network normalizes the feature map of the first image based on the first feature statistical information, finally obtaining the first processing result corresponding to the first image.
The first processing result corresponding to the first image is related to the image processing task executed by the model. For example, when the image processing task executed by the model is image classification, the first processing result is a classification result of the first image; or when the image processing task executed by the model is image recognition, the first processing result is a recognition result of the first image.
Optionally, the second sub-network includes a convolution layer, a normalization layer and a full connection layer, the convolution layer is used for extracting a feature map of the first image, the normalization layer is used for normalizing the feature map based on the third feature statistical information, and the full connection layer is used for processing a normalization result output by the normalization layer to obtain a first processing result.
It can be understood that the second sub-network may include a plurality of convolution layers and a plurality of normalization layers, where each convolution layer is connected to one normalization layer; that is, one convolution layer and one normalization layer form a convolution-normalization structure, and the second sub-network includes a plurality of such convolution-normalization structures connected in sequence. The last normalization layer in the second sub-network is connected to the fully connected layer and outputs the extracted feature map to it. Each normalization layer in the second sub-network normalizes the feature map output by its convolution layer based on the first feature statistical information.
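A structural sketch of such a second sub-network is given below, under the same caveat that all names and dimensions are illustrative assumptions; each convolution layer is paired with a normalization layer that consumes externally supplied (predicted) statistics instead of batch statistics.

```python
import torch
import torch.nn as nn

class PredictedStatsNorm(nn.Module):
    # Normalizes a feature map with externally supplied per-channel statistics.
    def forward(self, x, mean, var, eps: float = 1e-5):
        return (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)

class SecondSubNetwork(nn.Module):
    # Hypothetical skeleton: convolution/normalization pairs connected in
    # sequence, the last normalization layer feeding the fully connected layer.
    def __init__(self, channels=(3, 32, 64), num_classes: int = 10):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(cin, cout, kernel_size=3, padding=1)
            for cin, cout in zip(channels[:-1], channels[1:])
        )
        self.norm = PredictedStatsNorm()
        self.fc = nn.Linear(channels[-1], num_classes)

    def forward(self, x, stats):
        # stats: one (mean, var) pair per convolution layer, predicted
        # by the first sub-network.
        for conv, (mean, var) in zip(self.convs, stats):
            x = self.norm(conv(x), mean, var)
        return self.fc(x.mean(dim=(2, 3)))  # global average pooling, then FC
```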
Alternatively, the second subnetwork may be a residual network, i.e. a structure with cross-layer connections between network layers in the second subnetwork.
Step 304, determining a target loss value based on the first processing result, and training the model based on the target loss value.
In this embodiment, after the first processing result output by the second sub-network in the model is obtained, the target loss value may be determined based on the first processing result and the truth result corresponding to the first image, where the target loss value may be used to characterize the difference between the first processing result and the truth result corresponding to the first image. And finally, the server carries out iterative training on the model based on the target loss value until the model reaches a convergence condition.
The truth result corresponding to the first image is related to the task executed by the model, and the truth result corresponding to the first image represents a real result obtained after the task executed by the model is executed on the first image. For example, when the image processing task executed by the model is image classification, the true value result corresponding to the first image is the true classification result of the first image; or when the image processing task executed by the model is image recognition, the true value result corresponding to the first image is the true recognition result of the first image.
Referring to fig. 4, fig. 4 is a schematic flow chart of processing an image in a model training process according to an embodiment of the present application. As shown in fig. 4, the model includes a first sub-network and a second sub-network. After the second feature statistical information corresponding to the first image and the third feature statistical information corresponding to the images of other source domains are extracted, the second feature statistical information and the third feature statistical information are input into the first sub-network, and the first feature statistical information corresponding to the source domain where the first image is located is predicted by passing through the encoder and the decoder in the first sub-network in sequence (the activation functions in the first sub-network are not shown in the figure). Then, the first feature statistical information is input into the normalization layer of the second sub-network, and the first image is input into the convolution layer of the second sub-network. After the convolution layer of the second sub-network extracts the feature map of the first image, the normalization layer in the second sub-network normalizes the feature map of the first image based on the first feature statistical information, and the normalized feature map is processed by the fully connected layer to obtain the first processing result corresponding to the first image.
In this embodiment, during model training, the model predicts the feature statistical information of the domain where the input image is located based on the feature statistical information of the input image and the feature statistical information of images from other domains, and then processes the input image based on the feature statistical information of the domain where the input image is located. In predicting the feature statistical information, the model considers the commonality between the input image and the images of other domains while accounting for the uniqueness of the input image's own feature distribution, so the model can effectively predict the feature distribution information of the domain where the input image is located. In this way, on the basis that the model learns to predict the feature distribution of the domain where the input image is located, the input image can be processed based on the feature distribution of that domain, which effectively improves the domain generalization capability of the model.
Alternatively, the target loss value for training the model may be determined from a first loss value and a second loss value. Specifically, the server determines the target loss value based on the first processing result through the following steps.
First, the server determines a first loss value based on the first processing result and the true value result corresponding to the first image, where the first loss value is used to characterize the difference between the first processing result and the true value result corresponding to the first image.
Then, the server acquires, based on the training set, fourth feature statistical information corresponding to the images in the source domain where the first image is located, and determines a second loss value based on the first feature statistical information and the fourth feature statistical information. Because the first feature statistical information is predicted by the first sub-network in the model, in order to guide the first sub-network to predict more accurate feature statistical information, the server obtains a feature statistics ground truth (i.e., the fourth feature statistical information) corresponding to the images of the source domain where the first image is located, based on those images in the training set. The server then constructs the second loss value from the first feature statistical information predicted by the first sub-network and the fourth feature statistical information obtained from the images in the training set; the second loss value is used to characterize the difference between the feature statistical information predicted by the first sub-network and the actual feature statistical information.
Finally, the server determines the target loss value according to the first loss value and the second loss value. For example, the server performs a weighted summation of the first loss value and the second loss value to obtain the target loss value. The weights of the first loss value and the second loss value in the weighted summation can be adjusted according to actual application requirements and are not specifically limited herein.
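For instance, the weighted summation could be sketched as follows, with the weights left as free parameters as described above; the function name and default weights are illustrative assumptions.

```python
def target_loss(first_loss, second_loss, w1: float = 1.0, w2: float = 1.0):
    # Weighted summation of the first and second loss values; the weights
    # are adjusted according to actual application requirements.
    return w1 * first_loss + w2 * second_loss
```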
In this scheme, a loss value determined from the predicted feature statistical information and the ground-truth feature statistical information is introduced, which effectively guides the first sub-network to predict the feature statistical information of the domain where the input image is located, improves the prediction accuracy of the first sub-network, and further improves the generalization performance of the model.
In some possible embodiments, the model further includes a third subnetwork, and the target loss value used to train the model further includes a third loss value.
Specifically, the model training method provided in the embodiment corresponding to fig. 3 further includes the following steps: the server inputs the images in the source domain where the first image is located into a third sub-network to obtain a second processing result corresponding to the first image. The images in the source domain where the first image is located include the first image, and the third sub-network is used for extracting the fourth feature statistical information based on the images in that source domain and performing batch normalization on the feature map of the first image based on the fourth feature statistical information.
In addition, in the process of determining the target loss value, the server determines a third loss value based on the second processing result output by the third sub-network and the true value result corresponding to the first image, wherein the third loss value is used for representing the difference between the second processing result and the true value result corresponding to the first image. Finally, the server determines a target loss value according to the first loss value, the second loss value and the third loss value, for example, the server performs weighted summation on the first loss value, the second loss value and the third loss value to obtain the target loss value.
It will be appreciated that in a conventional model training process, the model usually processes the input batch images in parallel, and in the process of processing each image, batch normalization processing is performed on the image based on the feature statistical information corresponding to the input batch image, so as to finally obtain a processing result corresponding to the image.
Since in this embodiment, the second sub-network of the model processes the input image based on the predicted feature statistics, a situation in which the training direction shifts may occur during the model training process. Therefore, in this embodiment, the third sub-network is used to execute a conventional processing flow on the input image, and the loss value obtained from the output result of the third sub-network is introduced to constrain the training of the first sub-network and the second sub-network in the model, so as to effectively improve the prediction performance of the first sub-network and the second sub-network in the model, and improve the generalization performance of the model.
Referring to fig. 5, fig. 5 is another flow chart of processing an image in a model training process according to an embodiment of the present application. As shown in fig. 5, the model includes a third sub-network in addition to the first sub-network and the second sub-network shown in fig. 4. The third sub-network is similar in structure to the second sub-network and also comprises a convolution layer, a normalization layer, and a fully connected layer. After the first image and the other images of the source domain where the first image is located are input into the third sub-network, the third sub-network extracts the feature maps of these images and thereby obtains the fourth feature statistical information corresponding to all the images of that source domain. The normalization layer then performs batch normalization on the first image based on the fourth feature statistical information, and finally the fully connected layer processes the feature map output by the normalization layer to obtain the second processing result corresponding to the first image.
In some possible embodiments, before the server normalizes the feature map of the first image through the second sub-network in the model, the server may further predict a translation parameter and a scaling parameter required in the normalization process, so that the second sub-network normalizes the feature map of the first image based on the predicted translation parameter and scaling parameter.
Specifically, the server firstly acquires a first translation parameter and a first scaling parameter, wherein the first translation parameter and the first scaling parameter are parameters for carrying out batch normalization processing on the feature map of the first image in the third sub-network. And, the first translation parameter and the first scaling parameter may be learned during training of the third subnetwork. And then, the server predicts the second translation parameter and the second scaling parameter according to the second characteristic statistical information, the first translation parameter and the first scaling parameter. That is, the server predicts the second translation parameter and the second scaling parameter based on the second feature statistical information corresponding to the first image and the translation parameter and the scaling parameter used to perform the batch normalization processing on the first image.
In this way, in the training process of the model, the second subnetwork normalizes the feature map of the first image based on the first feature statistical information, the second translation parameter and the second scaling parameter.
In this scheme, the parameters used when normalizing the input image based on the predicted feature statistical information are predicted by combining the feature statistical information of the input image with the parameters used for batch normalization of the input image. Parameters well matched to the input image can thus be obtained for the normalization process, which improves the model's processing performance on the input image and further effectively improves the generalization performance of the model.
It should be noted that the second sub-network in the model may also normalize the feature map of the first image based on preset translation and scaling parameters, which is not limited herein.
The model training method provided by the embodiment of the application is introduced above, and the process of processing the image by using the model after training to obtain the model based on the model training method is introduced below.
Referring to fig. 6, fig. 6 is a flowchart of an image processing method according to an embodiment of the present application. As shown in fig. 6, the image processing method provided in the embodiment of the present application includes the following steps 601 to 604.
Step 601, a target domain image is acquired.
After the model is trained, the server processes the target domain image with the trained model. The target domain image and the source domain images adopted in model training correspond to different domains, i.e., the target domain image and the source domain images have different feature distributions.
Step 602, extracting first feature statistical information corresponding to the target domain image, where the first feature statistical information is used to indicate feature distribution of the target domain image.
After the target domain image is obtained, the feature maps of all channels of the target domain image are extracted, and the mean and variance of these feature maps are calculated, yielding the first feature statistical information corresponding to the target domain image.
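A minimal sketch of this statistics extraction step, assuming a PyTorch feature map tensor of shape C×H×W, is given below.

```python
import torch

def image_feature_statistics(feature_map: torch.Tensor):
    # feature_map: (C, H, W) feature maps of all channels of one image.
    mean = feature_map.mean(dim=(1, 2))                 # per-channel mean
    var = feature_map.var(dim=(1, 2), unbiased=False)   # per-channel variance
    return mean, var
```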
Step 603, obtaining third feature statistics corresponding to the target domain where the target domain image is located based on the first feature statistics and the second feature statistics, where the second feature statistics is feature statistics corresponding to the source domain image adopted by the model in the training process.
In this embodiment, the first feature statistical information and the second feature statistical information may be input into the first subnetwork in the model to predict and obtain the third feature statistical information corresponding to the target domain where the target domain image is located.
Wherein second feature statistics calculated during the model training phase may be stored in the model. In this way, after the first feature statistical information corresponding to the target domain image is calculated, third feature statistical information corresponding to the target domain where the target domain image is located is predicted based on the first feature statistical information and the second feature statistical information stored in the model.
Step 604, obtaining a processing result corresponding to the target domain image based on the target domain image and the third feature statistical information.
In this embodiment, the target domain image and the third feature statistical information may be input into the second sub-network in the model, so as to obtain a processing result corresponding to the target domain image. The second sub-network is used for extracting a feature map of the target domain image and carrying out normalization processing on the feature map based on third feature statistical information to obtain a processing result corresponding to the target domain image.
The first sub-network and the second sub-network of the model in the embodiment of the present application are similar to the first sub-network and the second sub-network described in the above model training method, and specific reference is made to the description of the above embodiment, and details are not repeated here.
In this scheme, the feature statistical information of the target domain is predicted from the feature statistical information of the target domain image and the feature statistical information of the source domain images, and the target domain image is processed based on the feature statistical information of the target domain to finally obtain the processing result. In predicting the feature statistical information, the model considers the commonality between the target domain image and images of other domains while accounting for the uniqueness of the target domain image's own feature distribution, so the model can effectively predict the feature distribution information of the domain where the target domain image is located. In this way, on the basis that the model learns to predict the feature distribution of the domain where the target domain image is located, the target domain image can be processed based on that feature distribution, which avoids the calculation deviation previously caused by normalizing the target domain image with the source domain feature statistical information, and effectively improves the domain generalization capability of the model.
The above describes a model training method and an image processing method provided by the embodiments of the present application. For easy understanding, a model training method and an image processing method provided in the embodiments of the present application will be described in detail below with reference to specific examples.
Referring to fig. 7, fig. 7 is a schematic flow chart of model training according to an embodiment of the present application. As shown in fig. 7, the model for training includes a first sub-network, a second sub-network, and a third sub-network. Wherein the third sub-network is used for executing a conventional image processing flow.
Specifically, in the training process of the model, the input of the third sub-network is batch image data, and the third sub-network processes each image in the batch image data in parallel to obtain a processing result corresponding to each image. Wherein, the images in the batch image data input into the third sub-network all belong to the same source domain.
Taking the first image as an example, the first image and the other images in the source domain where the first image is located are input into the third sub-network as one batch of image data. The convolution layer in the third sub-network extracts the feature maps of the images in this batch of image data. Then, the first calculation storage module in the third sub-network calculates and stores the feature statistical information of the first image (i.e., the second feature statistical information) and the feature statistical information of the source domain where the first image is located, based on the feature maps output by the convolution layer.
Then, the normalization layer in the third sub-network performs batch normalization on the feature map of the first image based on the feature statistical information of the source domain where the first image is located, and the batch-normalized feature map is processed by the fully connected layer to finally obtain the second processing result. Specifically, the batch normalization performed in the third sub-network is as shown in equation 1:
$$\hat{x}_d = \gamma_d \cdot \frac{x_d - \bar{\mu}_d}{\sqrt{\bar{\sigma}_d^2 + \epsilon}} + \beta_d \quad \text{(Equation 1)}$$

where $\hat{x}_d$ represents the feature map batch-normalized with the feature statistics of the d-domain (i.e., the source domain where the first image is located), $x_d$ represents the feature map of the first image from the d-domain, $\gamma_d$ represents the scaling parameter of the d-domain, $\beta_d$ represents the translation parameter of the d-domain, $\epsilon$ represents a small constant preventing a zero denominator, and $\bar{\mu}_d$ and $\bar{\sigma}_d^2$ represent the mean and variance of the batch feature maps from domain d, respectively.
Notably, the scaling parameter $\gamma_d$ and the translation parameter $\beta_d$ may be two learnable parameters, i.e., both change as the model trains. The learnable parameters $\gamma_d$ and $\beta_d$ are used to preserve the learning results of each network layer during training. Without $\gamma_d$ and $\beta_d$, batch normalization degrades into a plain normalization, so that during training the parameters of the network layers are updated but their output distribution remains almost unchanged (always mean 0 and standard deviation 1), and effective learning cannot take place. With the learnable $\gamma_d$ and $\beta_d$, the network can adaptively learn a tailored distribution for each neuron (mean $\beta$, standard deviation $\gamma$), thereby preserving the learning outcome of each neuron.
For a B×C×H×W batch feature map (B represents the batch size, C represents the number of channels, and H and W represent the height and width of the feature map, respectively), $\bar{\mu}_d$ and $\bar{\sigma}_d^2$ are calculated per channel as shown in the following equations 2 and 3:

$$\bar{\mu}_d = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{b,h,w} \quad \text{(Equation 2)}$$

$$\bar{\sigma}_d^2 = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x_{b,h,w} - \bar{\mu}_d\right)^2 \quad \text{(Equation 3)}$$
It will be appreciated that the amount of data in one source domain of the training data may be relatively large, and each batch input to the model may contain only part of the data of that source domain; for example, a source domain may contain 10000 images while each batch input to the model contains 50 images. Therefore, in order to save the cost of calculating the feature statistical information of the image data, this embodiment uses an exponential moving average to calculate the feature statistical information of each source domain, as shown in equations 4 and 5:
$$\mu_{d'} = \lambda\, \mu_d + (1 - \lambda)\, \bar{\mu}_d \quad \text{(Equation 4)}$$

$$\sigma_{d'}^2 = \lambda\, \sigma_d^2 + (1 - \lambda)\, \bar{\sigma}_d^2 \quad \text{(Equation 5)}$$

where $\mu_{d'}$ represents the mean over all d-domain images input to the model up to the current moment, $\sigma_{d'}^2$ represents the corresponding variance, $\mu_d$ and $\sigma_d^2$ represent the mean and variance over all d-domain images input to the model up to the previous moment, $\lambda$ is a constant (typically 0.9), and $\bar{\mu}_d$ and $\bar{\sigma}_d^2$ represent the mean and variance of the d-domain batch of images input to the model at the current moment.
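Equations 4 and 5 amount to the familiar running-statistics update; a sketch with λ = 0.9, as in the text, is given below.

```python
def ema_update(running_mean, running_var, batch_mean, batch_var, lam: float = 0.9):
    # Exponential moving average over successive batches of the same source
    # domain (equations 4 and 5): the previous running statistics are kept
    # with weight lambda, the current batch statistics contribute 1 - lambda.
    new_mean = lam * running_mean + (1.0 - lam) * batch_mean
    new_var = lam * running_var + (1.0 - lam) * batch_var
    return new_mean, new_var
```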
Further, the process by which the model calculates the feature statistical information of a single image may be as shown in equations 6 and 7:
$$\hat{\mu}_d^i = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{d,h,w}^i \quad \text{(Equation 6)}$$

$$\left(\hat{\sigma}_d^i\right)^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x_{d,h,w}^i - \hat{\mu}_d^i\right)^2 \quad \text{(Equation 7)}$$

where $x_d^i$ represents the feature map of a certain channel of the i-th image from source domain d, and $\hat{\mu}_d^i$ and $\left(\hat{\sigma}_d^i\right)^2$ represent its mean and variance, respectively.
For the first sub-network: it acquires the feature statistics of the first image (i.e., the second feature statistical information) and the feature statistics of the other source domain images (i.e., the third feature statistical information) output by the third sub-network, and processes them through its connected encoder and decoder structure, thereby predicting the feature statistics of the source domain where the first image is located (i.e., the first feature statistical information). Specifically, the process by which the first sub-network predicts the feature statistical information of the source domain where the first image is located is shown in equations 8 and 9.
where $\mu'_d$ and $\sigma'_d$ respectively represent the mean and variance in the predicted feature statistical information of the source domain where the first image is located, $N_S$ represents the number of source domains, the source domain where the single image is located is denoted d, the other $N_S - 1$ source domains are denoted p, $\hat{\mu}_d^i$ and $\hat{\sigma}_d^i$ represent the feature statistics of the i-th image of domain d, $\mu_p$ and $\sigma_p$ represent the feature statistics of source domain p, and $g_s$ represents the autoencoder used for predicting the feature statistics. The encoder and the decoder in the autoencoder $g_s$ are both composed of fully connected layers and are connected in the middle by a ReLU activation function; the decoder is followed by a Sigmoid activation function, whose role is to limit the output to [0,1]. $\epsilon$ represents a small constant preventing zero division.
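Because the exact fusion performed inside g_s is given only by equations 8 and 9 of the original publication, the following sketch should be read as an assumption about its shape: the single image's statistics are paired with the statistics of each of the other N_S - 1 source domains, passed through g_s, and the outputs are averaged. The input/output convention assumed for g_s here also differs from the structural sketch given earlier.

```python
import torch

def predict_domain_statistics(g_s, img_mean, img_var, other_domain_stats, eps: float = 1e-6):
    # Hypothetical realization of equations 8 and 9: g_s is assumed to map the
    # concatenated statistics to a joint (mean, variance) prediction, and the
    # predictions over the other source domains are averaged.
    preds = []
    for mu_p, sigma_p in other_domain_stats:
        fused = torch.cat([img_mean, img_var, mu_p, sigma_p], dim=-1)
        preds.append(g_s(fused))
    out = torch.stack(preds).mean(dim=0)
    mu_pred, var_pred = out.chunk(2, dim=-1)  # split the joint output
    return mu_pred, var_pred + eps            # eps prevents zero division later
```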
Finally, the convolution layer in the second sub-network extracts the feature map of the first image, and the normalization layer normalizes this feature map based on the predicted feature statistical information of the source domain where the first image is located, obtaining a normalized feature map. After the normalized feature map is processed by the fully connected layer in the second sub-network, the first processing result corresponding to the first image is obtained.
In addition, the translation parameter and the scaling parameter employed for normalizing the feature map of the first image at the second sub-network may be predicted based on the following equations 10 and 11.
where $\gamma'_d$ and $\beta'_d$ represent the predicted value of the scaling parameter and the predicted value of the translation parameter (i.e., the second scaling parameter and the second translation parameter described above), $\gamma_p$ and $\beta_p$ represent the true values of the scaling parameter and the translation parameter of source domain p (i.e., the first scaling parameter and the first translation parameter described above), and $g_r$ represents the autoencoder used for inferring the scaling statistics. The autoencoder $g_r$ has the same structure as the autoencoder $g_s$ described above, which is not repeated here.
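By analogy with g_s, a hedged sketch of the parameter inference of equations 10 and 11 is given below; the exact fusion form is again an assumption.

```python
import torch

def predict_affine_parameters(g_r, img_mean, img_var, other_domain_params):
    # Hypothetical realization of equations 10 and 11: g_r fuses the first
    # image's statistics with each other source domain's learned (gamma, beta)
    # pair, and the predictions are averaged.
    preds = []
    for gamma_p, beta_p in other_domain_params:
        fused = torch.cat([img_mean, img_var, gamma_p, beta_p], dim=-1)
        preds.append(g_r(fused))
    gamma_pred, beta_pred = torch.stack(preds).mean(dim=0).chunk(2, dim=-1)
    return gamma_pred, beta_pred
```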
The process of normalizing the feature map of the first image by the second sub-network is shown in equations 12 and 13:

$$\hat{x}_d = \frac{x_d - \mu'_d}{\sqrt{(\sigma'_d)^2 + \epsilon}} \quad \text{(Equation 12)}$$

$$\tilde{x}_d = \gamma'_d \cdot \hat{x}_d + \beta'_d \quad \text{(Equation 13)}$$
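Put together, the normalization step of the second sub-network can be sketched as follows, with per-channel statistics and parameters and the broadcasting conventions assumed above.

```python
import torch

def normalize_with_predictions(x, mu_pred, var_pred, gamma_pred, beta_pred, eps: float = 1e-5):
    # Equations 12 and 13: normalize the feature map with the predicted domain
    # statistics, then rescale and shift with the predicted affine parameters.
    x_hat = (x - mu_pred[None, :, None, None]) / torch.sqrt(var_pred[None, :, None, None] + eps)
    return gamma_pred[None, :, None, None] * x_hat + beta_pred[None, :, None, None]
```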
Finally, the model is trained by establishing loss values based on the outputs of the first, second, and third sub-networks, respectively. Specifically, the classification losses of the two branches are established as shown in equations 14 and 15, and the overall loss is established as shown in equation 16:

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{i,c}\, \log p_{i,c} \quad \text{(Equation 14)}$$

$$L'_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{i,c}\, \log p'_{i,c} \quad \text{(Equation 15)}$$
$$L = L_{cls} + \alpha \cdot \left(L'_{cls} + L_{dis}\right) \quad \text{(Equation 16)}$$
where $L_{dis}$ represents the loss value corresponding to the first sub-network (i.e., the second loss value described above), $L'_{cls}$ represents the loss value corresponding to the second sub-network (i.e., the first loss value described above), $L_{cls}$ represents the loss value corresponding to the third sub-network (i.e., the third loss value described above), $L$ represents the loss value corresponding to the model (i.e., the target loss value), $N$ represents the number of images, $M$ represents the number of categories, $y$ represents the label, $p$ represents the second processing result output by the third sub-network, $p'$ represents the first processing result output by the second sub-network, and $\alpha$ is a hyperparameter. The distance loss $L_{dis}$ measures the difference between the feature statistical information predicted by the first sub-network and the actual feature statistical information, as described above.
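A sketch of the loss construction is given below; the L1 distance used for L_dis is an illustrative assumption, since the text only requires that it measure the gap between the predicted and the actual feature statistical information.

```python
import torch
import torch.nn.functional as F

def model_loss(p_logits, p_prime_logits, y, mu_pred, mu_true, var_pred, var_true,
               alpha: float = 1.0):
    # p_logits / p_prime_logits: outputs of the third / second sub-network
    # (treated here as logits); y: class labels.
    l_cls = F.cross_entropy(p_logits, y)              # equation 14
    l_cls_prime = F.cross_entropy(p_prime_logits, y)  # equation 15
    # L_dis: difference between predicted and ground-truth statistics
    # (L1 chosen here as an illustrative assumption).
    l_dis = (mu_pred - mu_true).abs().mean() + (var_pred - var_true).abs().mean()
    return l_cls + alpha * (l_cls_prime + l_dis)      # equation 16
```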
Referring to fig. 8, fig. 8 is a schematic flow chart of image processing according to an embodiment of the present application. As shown in fig. 8, after the model is trained by the training method shown in fig. 7, a trained image processing model is obtained. Compared with the model trained in fig. 7, the trained image processing model in fig. 8 includes only the first sub-network and the second sub-network. In the image processing process, the second sub-network performs feature extraction on the input target domain image to obtain its feature map, and the feature statistical information of the target domain image is calculated from this feature map. The first sub-network predicts the feature statistical information of the target domain where the target domain image is located based on the feature statistical information of the target domain image and the feature statistical information corresponding to all source domain images from the training process. Then, the normalization layer in the second sub-network normalizes the feature map of the target domain image based on the predicted feature statistical information of the target domain, and the normalized feature map is input into the fully connected layer to obtain the final output image processing result.
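Reusing the helper sketches above, the inference flow of fig. 8 can be tied together roughly as follows; backbone_features, g_s, gamma_pred, and beta_pred are hypothetical attribute names.

```python
def infer(model, target_image, stored_source_stats):
    # Hypothetical end-to-end inference flow of fig. 8, reusing the helper
    # sketches defined earlier; target_image is assumed to be a batch of one.
    feat = model.backbone_features(target_image)      # conv layers of the second sub-network
    mu_t, var_t = image_feature_statistics(feat[0])   # statistics of the single target image
    mu_pred, var_pred = predict_domain_statistics(    # first sub-network prediction
        model.g_s, mu_t, var_t, stored_source_stats)
    normalized = normalize_with_predictions(
        feat, mu_pred, var_pred, model.gamma_pred, model.beta_pred)
    return model.fc(normalized.mean(dim=(2, 3)))      # fully connected head outputs the result
```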
In actual model training experiments, the technical effects of the method provided by the embodiments of the present application are as follows: on the four domain generalization tasks of the PACS dataset, the average accuracy of a model trained with the method provided by the embodiments of the present application reaches 87.05 with a ResNet18 backbone, 6.03 percentage points higher than the baseline model, and 89.56 with a ResNet50 backbone, 5.48 percentage points higher than the baseline model; on the four domain generalization tasks of the VLCS dataset, its average accuracy with a ResNet18 backbone reaches 77.75, 5.27 percentage points higher than the baseline model; on the Office-Home dataset, its average accuracy with a ResNet18 backbone reaches 66.60, 1.7 percentage points higher than the baseline model. The experimental results on multiple datasets fully verify the improvement in model generalization brought by the method of the embodiments of the present application.
In general, in this scheme, meta learning is used during training to simulate inferring the feature statistics of the domain where an image is located from the feature statistics of a single image and of other domains, so that at the inference stage the model can infer the feature statistics of the target domain from a single target domain image and the source domain statistics stored in the model's batch normalization layers. The model can therefore normalize the feature map of the target domain image with the inferred feature statistical information, which avoids the calculation deviation caused by normalizing the target domain image with the source domain statistics and fully mines, at the inference stage, the target domain feature information contained in a single target domain image, thereby improving the generalization of the model to unknown target domain images.
In addition, the statistics of the whole target domain can be inferred with only a single forward pass of a single target domain image, and the model adapts at the inference stage without training. This avoids the increase in test time and the catastrophic forgetting caused by updating model parameters with target domain data, and also removes the dependence on batches of target domain data and on target domain labels at test time.
Having described the methods provided by embodiments of the present application, an apparatus for performing the methods is described below.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a model training device according to an embodiment of the present application. As shown in fig. 9, the model training apparatus provided in the embodiment of the present application includes: an acquisition unit 901 and a processing unit 902. The acquiring unit 901 is configured to acquire a training set, where the training set includes multiple groups of images belonging to different source domains, the images in different source domains have different feature distributions, and each group of images comes from the same source domain; the processing unit 902 is configured to obtain, through a model, based on a first image in the training set and at least one group of images, first feature statistical information corresponding to the source domain where the first image is located, where the first feature statistical information is used to indicate the feature distribution of the images in the source domain where the first image is located, and the at least one group of images are images in the training set belonging to source domains different from that of the first image; the processing unit 902 is further configured to obtain, through the model, based on the first image and the first feature statistical information, a first processing result corresponding to the first image, where the second sub-network is configured to extract a feature map of the first image and normalize the feature map based on the first feature statistical information, and the model includes the first sub-network and the second sub-network; the processing unit 902 is further configured to determine a target loss value based on the first processing result, and train the model based on the target loss value.
Optionally, the processing unit 902 is further configured to: extracting second characteristic statistical information corresponding to a first image and third characteristic statistical information corresponding to at least one group of images in the training set, wherein the second characteristic statistical information and the third characteristic statistical information are used for indicating characteristic distribution of the images; inputting the second feature statistical information and the third feature statistical information into a first sub-network in the model, and predicting to obtain the first feature statistical information corresponding to the source domain where the first image is located, where the first feature statistical information is used for indicating feature distribution of the image in the source domain where the first image is located.
Optionally, the processing unit 902 is further configured to: inputting the first image and the first characteristic statistical information into a second sub-network in the model to obtain a first processing result corresponding to the first image; the second sub-network is used for extracting a feature map of the first image and carrying out normalization processing on the feature map based on the first feature statistical information.
Optionally, the second feature statistical information includes a mean and a variance of a feature map of the first image; the third feature statistics include a mean and variance of feature maps of the at least one set of images.
Optionally, the target loss value includes a first loss value and a second loss value; the processing unit 902 is further configured to: determining the first loss value based on the first processing result and a true value result corresponding to the first image; acquiring fourth characteristic statistical information corresponding to the image in the source domain where the first image is located based on the training set, and determining the second loss value based on the first characteristic statistical information and the fourth characteristic statistical information; and determining the target loss value according to the first loss value and the second loss value.
Optionally, the model further includes a third sub-network, and the target loss value further includes a third loss value; the processing unit 902 is further configured to: inputting the image in the source domain where the first image is located into the third sub-network to obtain a second processing result corresponding to the first image, wherein the image in the source domain where the first image is located comprises the first image, and the third sub-network is used for extracting the fourth characteristic statistical information based on the image in the source domain where the first image is located and carrying out batch normalization processing on the characteristic graph of the first image based on the fourth characteristic statistical information; determining the third loss value based on the second processing result and a true value result corresponding to the first image; and determining the target loss value according to the first loss value, the second loss value and the third loss value.
Optionally, the acquiring unit 901 is further configured to acquire a first translation parameter and a first scaling parameter, where the first translation parameter and the first scaling parameter are parameters used for performing batch normalization processing on the feature map of the first image in the third sub-network; the processing unit 902 is further configured to predict a second translation parameter and a second scaling parameter according to the second feature statistics, the first translation parameter and the first scaling parameter; the second sub-network is used for normalizing the feature map of the first image based on the first feature statistical information, the second translation parameter and the second scaling parameter.
Optionally, the first subnetwork includes an encoder and a decoder, and the encoder and the decoder are all fully connected layer structures.
Optionally, a ReLU activation function is further connected between the encoder and the decoder, and a Tanh activation function is further connected behind the decoder.
Optionally, the second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract the feature map of the first image, the normalization layer is configured to normalize the feature map based on the first feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the first processing result.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 10, an image processing apparatus provided in an embodiment of the present application includes: an acquisition unit 1001, an extraction unit 1002, and a processing unit 1003. Wherein, the acquiring unit 1001 is configured to acquire a target domain image; an extracting unit 1002, configured to extract first feature statistics corresponding to the target domain image, where the first feature statistics is used to indicate feature distribution of the target domain image; a processing unit 1003, configured to obtain third feature statistics corresponding to a target domain where the target domain image is located based on the first feature statistics and second feature statistics, where the second feature statistics is feature statistics corresponding to a source domain image adopted by the model in a training process, and the target domain image and the source domain image have different feature distributions; the processing unit 1003 is further configured to obtain a processing result corresponding to the target domain image based on the target domain image and the third feature statistical information.
Optionally, the processing unit 1003 is specifically configured to extract a feature map of the target domain image, and normalize the feature map based on the third feature statistical information to obtain a processing result corresponding to the target domain image.
Optionally, the first feature statistical information includes a mean and a variance of a feature map of the target domain image;
the second feature statistics include a mean and a variance of feature maps of the source domain image.
Optionally, the model includes a first sub-network, the first sub-network is configured to obtain the third feature statistical information based on the first feature statistical information and the second feature statistical information, the first sub-network includes an encoder and a decoder, and the encoder and the decoder are all in a full-connection layer structure.
Optionally, a ReLU activation function is further connected between the encoder and the decoder, and a Tanh activation function is further connected behind the decoder.
Optionally, the model includes a second sub-network, where the second sub-network is configured to obtain the processing result corresponding to the target domain image based on the target domain image and the third feature statistical information. The second sub-network includes a convolution layer, a normalization layer, and a fully connected layer, where the convolution layer is configured to extract the feature map of the target domain image, the normalization layer is configured to normalize the feature map based on the third feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the processing result corresponding to the target domain image.
Optionally, the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
Next, referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device 1100 may specifically be a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, or the like, which is not limited herein. Specifically, the electronic device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the number of processors 1103 in the electronic device 1100 may be one or more; one processor is taken as an example in fig. 11), where the processor 1103 may comprise an application processor 11031 and a communication processor 11032. In some embodiments of the present application, the receiver 1101, the transmitter 1102, the processor 1103, and the memory 1104 may be connected by a bus or by other means.
The memory 1104 may include read-only memory and random access memory and provides instructions and data to the processor 1103. A portion of the memory 1104 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1104 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
The processor 1103 controls the operation of the electronic device. In a specific application, the various components of the electronic device are coupled together by a bus system that may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the embodiments of the present application may be applied to the processor 1103 or implemented by the processor 1103. The processor 1103 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method described above may be performed by integrated logic circuitry in hardware or instructions in software in the processor 1103. The processor 1103 may be a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The processor 1103 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, in a decoded processor, or in a combination of hardware and software modules in a decoded processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1104, and the processor 1103 reads information in the memory 1104, and in combination with the hardware, performs the steps of the method described above.
The receiver 1101 may be used to receive input numeric or character information and to generate signal inputs related to the relevant settings and function control of the electronic device. The transmitter 1102 may be used to output numeric or character information through a first interface; the transmitter 1102 may also be configured to send instructions to the disk stack via the first interface to modify data in the disk stack; the transmitter 1102 may also include a display device such as a display screen.
The electronic device provided in this embodiment of the present application may specifically be a chip, where the chip includes: a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the electronic device performs the image processing method described in the above embodiments, or so that the chip in the training device performs the model training method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 12, fig. 12 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 1200, and the NPU 1200 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an operation circuit 1203, and the operation circuit 1203 is controlled by the controller 1204 to extract matrix data in the memory and perform multiplication operation.
In some implementations, the operation circuit 1203 internally includes a plurality of processing units (PEs). In some implementations, the operational circuit 1203 is a two-dimensional systolic array. The operation circuit 1203 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1203 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1202 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 1201 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 1208.
The unified memory 1206 is used to store input data and output data. Weight data is carried directly into the weight memory 1202 through the direct memory access controller (DMAC) 1205. Input data is also carried into the unified memory 1206 through the DMAC.
The bus interface unit (BIU) 1212 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1209; it is used by the instruction fetch memory 1209 to fetch instructions from the external memory, and is further used by the memory unit access controller 1205 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1206 or to transfer weight data to the weight memory 1202 or to transfer input data to the input memory 1201.
The vector calculation unit 1207 includes a plurality of operation processing units, which, as needed, perform further processing on the output of the operation circuit 1203, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/non-fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 1207 can store the processed output vector to the unified memory 1206. For example, the vector calculation unit 1207 may apply a linear function or a nonlinear function to the output of the operation circuit 1203, for example linear interpolation of the feature plane extracted by the convolution layer, or, for example, a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 1207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1203, for example for use in subsequent layers of the neural network.
An instruction fetch memory (instruction fetch buffer) 1209 connected to the controller 1204, for storing instructions used by the controller 1204;
the unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch memory 1209 are all on-chip memories. The external memory is a memory external to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application. The present application also provides a computer readable storage medium, in some embodiments, the method disclosed in fig. 3 above may be implemented as computer program instructions encoded on a computer readable storage medium in a machine readable format or encoded on other non-transitory media or articles of manufacture.
Fig. 13 schematically illustrates a conceptual partial view of an example computer-readable storage medium comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.
In one embodiment, computer-readable storage medium 1300 is provided using signal bearing medium 1301. The signal bearing medium 1301 may include one or more program instructions 1302 that when executed by one or more processors may provide the functionality or portions of the functionality described above with respect to fig. 3. Thus, for example, referring to the embodiment shown in fig. 3, one or more features of the method shown in fig. 3 may be carried by one or more instructions associated with the signal bearing medium 1301. Further, program instructions 1302 in fig. 13 also describe example instructions.
In some examples, signal bearing medium 1301 may comprise a computer readable medium 1303, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, memory, ROM, or RAM, and the like.
In some implementations, the signal bearing medium 1301 may comprise a computer recordable medium 1304, such as, but not limited to, memory, a read/write (R/W) CD, an R/W DVD, and the like. In some implementations, the signal bearing medium 1301 may include a communication medium 1305, such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communications link, etc.). Thus, for example, the signal bearing medium 1301 may be conveyed by a communication medium 1305 in wireless form (e.g., a wireless communication medium compliant with the IEEE 802.11 standard or another transmission protocol).
The one or more program instructions 1302 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device of the computing device may be configured to provide various operations, functions, or actions in response to program instructions 1302 communicated to the computing device through one or more of computer-readable medium 1303, computer-recordable medium 1304, and/or communication medium 1305.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection therebetween, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general purpose hardware, or of course may be implemented by dedicated hardware including application specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions can be varied, such as analog circuits, digital circuits, or dedicated circuits. However, a software program implementation is a preferred embodiment in many cases for the present application. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a training device, or a network device, etc.) to perform the method described in the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a training device, a data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedures in the foregoing method embodiments for the specific working procedures of the above-described systems, apparatuses, and units, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

Claims (21)

1. A method of model training, comprising:
acquiring a training set, wherein the training set comprises a plurality of groups of images belonging to different source domains, images in different source domains have different feature distributions, and each group of images comes from the same source domain;
obtaining, through a model, first feature statistical information corresponding to a source domain where a first image is located, based on the first image and at least one group of images in the training set, wherein the at least one group of images are images in the training set belonging to source domains different from that of the first image;
based on the feature map of the first image and the first feature statistical information, obtaining a first processing result corresponding to the first image through the model;
determining a target loss value based on the first processing result, and training the model based on the target loss value.
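For illustration only, one training step following the four steps of claim 1 might be sketched in PyTorch as follows. The model interface (predict_domain_stats, process, target_loss) and the optimizer handling are hypothetical assumptions of this sketch, not taken from the claims.

import torch

def training_step(model, optimizer, first_image, other_domain_groups, ground_truth):
    # Predict the feature statistical information for the source domain of the
    # first image from the first image and groups of other-domain images.
    predicted_stats = model.predict_domain_stats(first_image, other_domain_groups)
    # Obtain the first processing result, normalizing the feature map of the
    # first image with the predicted statistics.
    first_result = model.process(first_image, predicted_stats)
    # Determine the target loss value and train the model based on it.
    loss = model.target_loss(first_result, ground_truth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()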
2. The method according to claim 1, wherein the obtaining, based on the first image and at least one group of images, the first feature statistical information corresponding to the source domain where the first image is located through a model includes:
extracting second feature statistical information corresponding to the first image and third feature statistical information corresponding to the at least one group of images in the training set, wherein the second feature statistical information and the third feature statistical information are used for indicating the feature distributions of the images;
inputting the second feature statistical information and the third feature statistical information into a first sub-network in the model, and predicting the first feature statistical information corresponding to the source domain where the first image is located, wherein the first feature statistical information is used for indicating the feature distribution of images in the source domain where the first image is located.
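A minimal PyTorch sketch of a first sub-network consistent with claims 2, 7, and 8 (a fully connected encoder and decoder with a ReLU between them and a Tanh after the decoder) is given below; the hidden width, the concatenation of the two statistics vectors, and the class name are assumptions of this sketch.

import torch
import torch.nn as nn

class StatsPredictor(nn.Module):
    def __init__(self, stat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Fully connected encoder over the concatenated second and third
        # feature statistical information.
        self.encoder = nn.Linear(2 * stat_dim, hidden_dim)
        self.relu = nn.ReLU()               # activation between encoder and decoder
        self.decoder = nn.Linear(hidden_dim, stat_dim)
        self.tanh = nn.Tanh()               # activation after the decoder

    def forward(self, second_stats: torch.Tensor, third_stats: torch.Tensor) -> torch.Tensor:
        x = torch.cat([second_stats, third_stats], dim=-1)
        return self.tanh(self.decoder(self.relu(self.encoder(x))))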
3. The method according to claim 1, wherein the obtaining, by the model, the first processing result corresponding to the first image based on the feature map of the first image and the first feature statistics includes:
inputting the first image and the first characteristic statistical information into a second sub-network in the model to obtain a first processing result corresponding to the first image;
the second sub-network is used for extracting a feature map of the first image and carrying out normalization processing on the feature map based on the first feature statistical information.
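The normalization of claim 3 replaces batch statistics with externally supplied statistics. A hedged sketch, assuming the first feature statistical information is a pair of per-channel mean and variance vectors:

import torch

def normalize_with_stats(feature_map: torch.Tensor, mean: torch.Tensor,
                         var: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # feature_map has shape (N, C, H, W); mean and var are length-C vectors
    # supplied externally rather than computed from the current batch.
    return (feature_map - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + eps)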
4. A method according to any one of claims 1-3, characterized in that the target loss value comprises a first loss value and a second loss value;
the determining a target loss value based on the first processing result includes:
determining the first loss value based on the first processing result and a ground-truth result corresponding to the first image;
acquiring fourth feature statistical information corresponding to the image in the source domain where the first image is located based on the training set, and determining the second loss value based on the first feature statistical information and the fourth feature statistical information;
and determining the target loss value according to the first loss value and the second loss value.
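One possible composition of the target loss value in claim 4, assuming cross-entropy as the first loss, mean-squared error between the predicted and observed domain statistics as the second loss, and an assumed scalar weight on the second term:

import torch
import torch.nn.functional as F

def target_loss(first_result: torch.Tensor, ground_truth: torch.Tensor,
                first_stats: torch.Tensor, fourth_stats: torch.Tensor,
                lambda_stats: float = 1.0) -> torch.Tensor:
    first_loss = F.cross_entropy(first_result, ground_truth)   # first loss value
    second_loss = F.mse_loss(first_stats, fourth_stats)        # second loss value
    # The weighted sum (and the weight itself) is an assumption of this sketch.
    return first_loss + lambda_stats * second_loss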
5. The method of claim 4, wherein the model further comprises a third sub-network, and the target loss value further comprises a third loss value;
the method further comprises the steps of:
inputting the image in the source domain where the first image is located into the third sub-network to obtain a second processing result corresponding to the first image, wherein the image in the source domain where the first image is located comprises the first image, and the third sub-network is used for extracting the fourth feature statistical information based on the image in the source domain where the first image is located and carrying out batch normalization processing on the feature map of the first image based on the fourth feature statistical information;
the determining the target loss value from the first loss value and the second loss value includes:
determining the third loss value based on the second processing result and the ground-truth result corresponding to the first image;
and determining the target loss value according to the first loss value, the second loss value and the third loss value.
6. The method of claim 5, wherein the method further comprises:
acquiring a first translation parameter and a first scaling parameter, wherein the first translation parameter and the first scaling parameter are parameters used by the third sub-network for carrying out batch normalization processing on the feature map of the first image;
predicting a second translation parameter and a second scaling parameter according to the second feature statistical information, the first translation parameter, and the first scaling parameter;
the second sub-network is used for normalizing the feature map of the first image based on the first feature statistical information, the second translation parameter and the second scaling parameter.
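Claim 6's prediction of the second translation and scaling parameters could be sketched as follows, assuming a single fully connected layer and a simple concatenation of its inputs; both choices are assumptions of this sketch.

import torch
import torch.nn as nn

class AffinePredictor(nn.Module):
    def __init__(self, stat_dim: int, num_channels: int):
        super().__init__()
        # Maps the second feature statistical information plus the first
        # translation and scaling parameters to new per-channel parameters.
        self.fc = nn.Linear(stat_dim + 2 * num_channels, 2 * num_channels)

    def forward(self, second_stats, first_shift, first_scale):
        x = torch.cat([second_stats, first_shift, first_scale], dim=-1)
        second_shift, second_scale = self.fc(x).chunk(2, dim=-1)
        return second_shift, second_scale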
7. The method of claim 2, wherein the first sub-network comprises an encoder and a decoder, and the encoder and the decoder are both fully connected layer structures.
8. The method of claim 7, wherein a rectified linear unit (ReLU) activation function is further connected between the encoder and the decoder, and a hyperbolic tangent (Tanh) activation function is further connected after the decoder.
9. A method according to claim 3, wherein the second sub-network comprises a convolution layer, a normalization layer, and a fully connected layer; the convolution layer is used for extracting the feature map of the first image, the normalization layer is used for normalizing the feature map based on the first feature statistical information, and the fully connected layer is used for processing the normalization result output by the normalization layer to obtain the first processing result.
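A sketch matching claim 9's structure of a convolution layer, a normalization layer, and a fully connected layer; the channel widths and the global average pooling before the fully connected layer are assumptions of this sketch.

import torch
import torch.nn as nn

class TaskNetwork(nn.Module):
    def __init__(self, in_channels: int = 3, width: int = 64, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, width, kernel_size=3, padding=1)
        self.fc = nn.Linear(width, num_classes)

    def forward(self, image, mean, var, eps: float = 1e-5):
        feature_map = self.conv(image)                  # convolution layer
        normed = (feature_map - mean.view(1, -1, 1, 1)) / torch.sqrt(
            var.view(1, -1, 1, 1) + eps)                # normalization with external stats
        pooled = normed.mean(dim=(2, 3))                # assumed global average pooling
        return self.fc(pooled)                          # fully connected layer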
10. The method according to any of claims 1-9, wherein the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
11. An image processing method, comprising:
acquiring a target domain image;
extracting first feature statistical information corresponding to the target domain image, wherein the first feature statistical information is used for indicating feature distribution of the target domain image;
obtaining third feature statistical information corresponding to a target domain where the target domain image is located based on the first feature statistical information and second feature statistical information, wherein the second feature statistical information is feature statistical information corresponding to a source domain image used by a model during training, and the target domain image and the source domain image have different feature distributions;
and obtaining a processing result corresponding to the target domain image based on the target domain image and the third feature statistical information.
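The inference flow of claim 11 might be sketched as follows, assuming the first feature statistical information is the per-channel mean and variance of the feature map (as in claim 13) and that the source-domain statistics were saved at training time; the model interface (backbone, stats_predictor, process) is hypothetical.

import torch

@torch.no_grad()
def infer(model, target_image: torch.Tensor, second_stats: torch.Tensor):
    # target_image: (1, C, H, W); second_stats: statistics saved from training.
    feature_map = model.backbone(target_image)
    first_stats = torch.cat([feature_map.mean(dim=(0, 2, 3)),
                             feature_map.var(dim=(0, 2, 3))])
    # Predict the third feature statistical information for the target domain.
    third_stats = model.stats_predictor(first_stats, second_stats)
    # Obtain the processing result with the feature map normalized by the
    # predicted target-domain statistics.
    return model.process(target_image, third_stats)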
12. The method according to claim 11, wherein the obtaining the processing result corresponding to the target domain image based on the target domain image and the third feature statistical information includes:
extracting a feature map of the target domain image, and carrying out normalization processing on the feature map based on the third feature statistical information to obtain the processing result corresponding to the target domain image.
13. The method according to claim 11 or 12, wherein the first feature statistical information comprises the mean and variance of the feature map of the target domain image;
the second feature statistical information comprises the mean and variance of the feature map of the source domain image.
14. The method according to any one of claims 11-13, wherein the model comprises a first sub-network, the first sub-network is used for obtaining the third feature statistical information based on the first feature statistical information and the second feature statistical information, the first sub-network comprises an encoder and a decoder, and the encoder and the decoder are both fully connected layer structures.
15. The method of claim 14, wherein a rectified linear unit (ReLU) activation function is further connected between the encoder and the decoder, and a hyperbolic tangent (Tanh) activation function is further connected after the decoder.
16. The method according to any one of claims 11-15, wherein the model comprises a second sub-network, the second sub-network is configured to obtain the processing result corresponding to the target domain image based on the target domain image and the third feature statistical information, the second sub-network comprises a convolution layer, a normalization layer, and a fully connected layer, the convolution layer is configured to extract a feature map of the target domain image, the normalization layer is configured to normalize the feature map based on the third feature statistical information, and the fully connected layer is configured to process the normalization result output by the normalization layer to obtain the processing result.
17. The method according to any of claims 11-16, wherein the model is used to perform one or more of the following tasks: image classification, image recognition, image enhancement, object detection, and image segmentation.
18. A model training apparatus, comprising a memory and a processor, wherein the memory stores code and the processor is configured to execute the code; when the code is executed, the model training apparatus performs the method of any one of claims 1 to 10.
19. An image processing apparatus, comprising a memory and a processor, wherein the memory stores code and the processor is configured to execute the code; when the code is executed, the image processing apparatus performs the method of any one of claims 11 to 17.
20. A computer storage medium storing instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 17.
21. A computer program product, characterized in that it stores instructions that, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 17.
CN202210801213.9A 2022-07-08 2022-07-08 Model training method, image processing method and related device Pending CN117437440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210801213.9A CN117437440A (en) 2022-07-08 2022-07-08 Model training method, image processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210801213.9A CN117437440A (en) 2022-07-08 2022-07-08 Model training method, image processing method and related device

Publications (1)

Publication Number Publication Date
CN117437440A true CN117437440A (en) 2024-01-23

Family

ID=89552129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210801213.9A Pending CN117437440A (en) 2022-07-08 2022-07-08 Model training method, image processing method and related device

Country Status (1)

Country Link
CN (1) CN117437440A (en)

Similar Documents

Publication Publication Date Title
Oyedotun et al. Deep learning in vision-based static hand gesture recognition
CN109949255B (en) Image reconstruction method and device
WO2022017245A1 (en) Text recognition network, neural network training method, and related device
CN111401406B (en) Neural network training method, video frame processing method and related equipment
US20230095606A1 (en) Method for training classifier, and data processing method, system, and device
CN111507993A (en) Image segmentation method and device based on generation countermeasure network and storage medium
CN113065636B (en) Pruning processing method, data processing method and equipment for convolutional neural network
CN113326930B (en) Data processing method, neural network training method, related device and equipment
US11574198B2 (en) Apparatus and method with neural network implementation of domain adaptation
CN111444807B (en) Target detection method, device, electronic equipment and computer readable medium
CN111738403B (en) Neural network optimization method and related equipment
CN113256592A (en) Training method, system and device of image feature extraction model
CN111680757A (en) Zero sample image recognition algorithm and system based on self-encoder
CN111079753A (en) License plate recognition method and device based on deep learning and big data combination
CN112364916A (en) Image classification method based on transfer learning, related equipment and storage medium
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
CN112085175A (en) Data processing method and device based on neural network calculation
CN113920382A (en) Cross-domain image classification method based on class consistency structured learning and related device
WO2024078112A1 (en) Method for intelligent recognition of ship outfitting items, and computer device
WO2022052647A1 (en) Data processing method, neural network training method, and related device
CN117437440A (en) Model training method, image processing method and related device
Munir et al. Background subtraction in videos using LRMF and CWM algorithm
CN115409159A (en) Object operation method and device, computer equipment and computer storage medium
WO2022133814A1 (en) Omni-scale convolution for convolutional neural networks
CN114692745A (en) Data processing method and device, integrated chip, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication