CN115294429A - Feature domain network training method and device

Info

Publication number
CN115294429A
Authority
CN
China
Prior art keywords
network
image
domain
feature
feature map
Prior art date
Legal status
Pending
Application number
CN202110422769.2A
Other languages
Chinese (zh)
Inventor
赵寅
张恋
毛珏
韩凯
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110422769.2A priority Critical patent/CN115294429A/en
Publication of CN115294429A publication Critical patent/CN115294429A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a feature domain network training method and device, relating to the field of neural-network-based image visual tasks within Artificial Intelligence (AI)-based image technology. The method comprises the following steps: inputting a training image into a codec network for processing to obtain a reconstructed feature map; inputting the training image, or a reconstructed image of the training image, into an image domain network to obtain a first feature map; and inputting the reconstructed feature map into a feature domain network to obtain a second feature map. A loss function for training the feature domain network is constructed based on the first feature map and the second feature map: the distance between the first feature map output by the image domain network and the second feature map output by the feature domain network is used as one of the loss terms of the loss function when training the feature domain network. In this way, the convergence speed and the convergence accuracy of the feature domain network can be improved.

Description

Feature domain network training method and device
Technical Field
The embodiment of the invention relates to the technical field of Artificial Intelligence (AI) -based network training, in particular to a feature domain-based network training method and device.
Background
Image codecs are widely used in digital video or image applications, such as broadcast digital television, audio-video or image transmission over the internet and mobile networks, real-time conversational applications such as audio-video chat and audio-video conferencing, DVDs and Blu-ray discs, audio-video or image content capture and editing systems, and camcorders for security applications.
Even a short movie requires a large amount of audio-visual data to represent, which may cause difficulties when the data is to be sent or otherwise transmitted over a network with limited bandwidth capacity. Therefore, audio-visual or image data is typically compressed before being transmitted over modern telecommunication networks. As memory resources may be limited, the size of the audio-video or images may also become an issue when storing them on a storage device. Audio-visual or image compression devices typically use software and/or hardware on the source side to encode the data prior to transmission or storage, thereby reducing the amount of data required to represent the digital audio-video or image. The compressed data is then received at the destination side by a decompression device. With limited network resources and ever-increasing demand for higher video quality, there is a need for improved compression and decompression techniques that increase the compression ratio with little impact on audio or image quality.
Since 2017, artificial-neural-network-based image codecs have developed rapidly, evolving from early autoencoder-based and Recurrent Neural Network (RNN) structures to variational autoencoder (VAE) structures, whose compression performance is now comparable to that of H.266/VVC coding. One of the classical network models is published in the paper "Variational image compression with a scale hyperprior". Meanwhile, the neural networks of many machine tasks are image-domain visual task neural networks: they take an image as input and output an analysis result, for example ResNet, Faster-RCNN, Mask-RCNN, YOLO, etc. Considering that an artificial-neural-network-based image decoder first parses the code stream to generate a reconstructed feature map and then inputs the reconstructed feature map into an image reconstruction network to generate a reconstructed image, the ICLR2018 conference paper "Towards Image Understanding from Deep Compression without Decoding" proposes a feature-map-domain visual task neural network that takes the feature map of an artificial-neural-network codec as input. The advantage of this approach is that the image decoding network is skipped and the visual task is performed directly, which can significantly reduce the required computing power. Taking classification and segmentation tasks as examples, the classification accuracy of the network cResnet, trained using the feature map recovered by the decoder as input, can be close to (slightly lower than) the classification accuracy obtained by passing the decoder's recovered image through the Resnet network, when the same loss function is used.
However, the feature maps used by current artificial-neural-network-based image codecs have a large number of channels (e.g. 192 or 320 dimensions) and are more abstract. With a similar network structure, similar computing power and the same number of training gradient back-propagation passes, the convergence speed and final accuracy of a feature-map-domain visual task network (such as a classification task network) are weaker than those of an image-domain visual task network. This shows that a better method is needed to train the feature-map-domain classification task network to achieve better visual task accuracy.
Disclosure of Invention
The application provides a feature domain network training method and device, which can significantly improve the network convergence speed and the accuracy at convergence while ensuring the accuracy of the visual task.
In a first aspect, a method for training a network based on a feature domain is provided, which includes: inputting a training image into a coding and decoding network for processing to obtain a reconstructed characteristic diagram, wherein the training image is any image in a training image set; inputting a reconstructed image of the training image or the training image into an image domain network to obtain a first feature map; inputting the reconstructed feature map into a feature domain network to obtain a second feature map; determining a loss function of the feature domain network based on the first feature map and the second feature map; and updating the model parameters based on the feature domain network according to the loss function based on the feature domain network.
The codec network specifically comprises an encoding network and a decoding network. In one implementation, the encoding network includes an image transformation network and an entropy encoding network, and the decoding network includes an entropy decoding network and an image reconstruction network. In another implementation, the encoding network includes an image transformation network and an entropy encoding network, and the decoding network includes only an entropy decoding network. In yet another possible implementation manner, the codec network may specifically adopt an autoencoder structure.
The image domain network specifically refers to a network whose input signal is training image or reconstructed image data, and the feature domain network specifically refers to a network whose input signal is feature map data.
The image domain based network is a convolutional neural network. The image domain based network may include a plurality of sub-networks, each sub-network containing one or more convolutional layers. The network structures between the sub-networks may be the same or different from each other.
The characteristic domain based network is a convolutional neural network. The feature domain based network may comprise a plurality of sub-networks, each sub-network comprising one or more convolutional layers. The network structures between the sub-networks may be the same or different from each other.
In this technical solution, the first feature map produced by the image domain network is introduced into the loss function of the feature domain network as a guide, so that the feature signal of the image domain is exploited. On the one hand, correlation exists between the image domain and feature domain signals; on the other hand, the trained image domain based network can generate more accurate feature signals. The loss function of the feature domain network is therefore optimized using this correlation and this accurate feature signal reference. Training with the optimized loss function can thus improve the convergence speed of the feature domain network and the accuracy of its output analysis result.
The loss terms of the loss function of the feature domain network include a distance between the first feature map and the second feature map.

In particular, the loss function may consist of at least one loss term. The distance between the first feature map and the second feature map can be measured by an L1 norm loss (also known as absolute error) or an L2 norm loss (also known as mean squared error, MSE).
In one possible implementation, the inputting the reconstructed image of the training image or the training image into an image domain network to obtain a first feature map includes: when the code rate point corresponding to the coding and decoding network meets a first condition, the reconstructed image is used as the input of the image domain network; or when the code rate point meets a second condition, the training image is used as the input of the image domain network.
The first condition and the second condition are configured flexibly according to an actual application scenario, and are not limited specifically here.
In one possible implementation, the size of the first feature map matches the size of the second feature map.
In a possible implementation manner, the method further includes inputting the training image into the codec network for processing to obtain the reconstructed image.
In one possible implementation, the loss function of the feature domain network further includes a loss term from the loss function used for training the image domain network.

Optimizing the loss function of the feature domain network by introducing a loss term from the image-domain training loss further increases the correlation with the image domain network, because that loss term is strongly correlated with the first feature map introduced as a guide. The convergence speed of the feature domain network and the accuracy of its output analysis result are thereby improved.
In a possible implementation manner, updating the parameters of the feature domain network according to the loss function of the feature domain network includes: calculating the gradient of each network layer parameter of the feature domain network according to the result of the loss function of the feature domain network, and updating the parameters of the feature domain network according to the gradients.
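As an illustration of this parameter update, the following is a minimal PyTorch-style sketch (the optimizer choice, learning rate and names such as feature_domain_net are assumptions for illustration, not specified by this application):

```python
import torch

def update_feature_domain_net(feature_domain_net: torch.nn.Module,
                              loss: torch.Tensor,
                              optimizer: torch.optim.Optimizer) -> None:
    """One update of the feature domain network from the computed loss value."""
    optimizer.zero_grad()  # clear gradients left over from the previous iteration
    loss.backward()        # compute the gradient of each network layer parameter
    optimizer.step()       # update the feature domain network parameters from the gradients

# Example (illustrative): optimizer = torch.optim.Adam(feature_domain_net.parameters(), lr=1e-4)
```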
In one possible implementation, the image domain based network is a trained network with fixed model parameters. In particular, the training process of the image domain based network has been completed, and the parameters in each of its neural network layers have been determined and remain unchanged.
In one possible implementation, the analysis result based on the feature domain network is used to determine a class of the training image.
In a second aspect, a method for applying an image to a feature domain based network is provided, which includes: inputting an original image into a codec network for processing to obtain a reconstructed feature map; and inputting the reconstructed feature map into a trained feature domain based network for processing to obtain an analysis result, wherein the trained feature domain based network is obtained through N iterative updates; the loss function of the n-th iterative update is obtained based on a first feature map and a second feature map, wherein the first feature map is obtained by processing the original image, or a reconstructed image of the original image, with the image domain based network, and the second feature map is obtained by processing the reconstructed feature map with the feature domain based network after the (n-1)-th iteration; and n and N are both positive integers, with 1 <= n <= N.
During training of the feature domain network, the first feature map produced by the image domain network is introduced into the loss function of the feature domain network as a guide, so that the feature signal of the image domain is exploited. On the one hand, correlation exists between the image domain and feature domain signals; on the other hand, the trained image domain based network can generate more accurate feature signals. The loss function of the feature domain network is therefore optimized using this correlation and this accurate feature signal reference, and training with the optimized loss function improves the accuracy of the analysis result of the feature domain network. Therefore, in practical application, analyzing the reconstructed feature map of the original image with the trained feature domain network model improves the accuracy of the analysis result.
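For illustration only, a sketch of this application flow under the assumption of PyTorch-style modules (codec_net here stands for the part of the codec network that outputs the reconstructed feature map; all names are hypothetical):

```python
import torch

@torch.no_grad()
def analyze_image(original_image: torch.Tensor,
                  codec_net: torch.nn.Module,
                  feature_domain_net: torch.nn.Module) -> torch.Tensor:
    """Apply the trained feature domain network to an original image."""
    recon_feature_map = codec_net(original_image)             # original image -> reconstructed feature map
    analysis_result = feature_domain_net(recon_feature_map)   # image reconstruction is skipped entirely
    return analysis_result
```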
In one possible implementation, the loss function of the feature domain network includes a distance between the first feature map and the second feature map.

In particular, the loss function may consist of at least one loss term. The distance between the first feature map and the second feature map can be measured by an L1 norm loss (also known as absolute error) or an L2 norm loss (also known as mean squared error, MSE).
In one possible implementation, the method further includes determining a category of the original image according to the analysis result output by the feature domain network. The analysis result may be, for example, an object category, an object detection frame, an object segmentation map, a local visual feature, or the like, and may be single data, one-dimensional data, two-dimensional image data, or three-dimensional or multi-dimensional image data.
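A minimal sketch of mapping the analysis result to a category for a classification task (assuming the analysis result is a vector of per-class scores; this is only one possible form of the analysis result):

```python
import torch

def predict_category(analysis_result: torch.Tensor) -> int:
    """Return the index of the most probable class from a vector of per-class scores."""
    probs = torch.softmax(analysis_result, dim=-1)  # per-class probability information
    return int(torch.argmax(probs, dim=-1))
```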
In a possible implementation manner, the reconstructed image is obtained by inputting the original image into the codec network for processing.
In a third aspect, a device for training a feature domain based network is provided, including: the acquisition module is used for inputting a training image into the coding and decoding network for processing to obtain a reconstructed characteristic image, wherein the training image is any image in a training image set; inputting a reconstructed image of the training image or the training image into an image domain network to obtain a first feature map; inputting the reconstructed feature map into a feature domain network to obtain a second feature map; a determining module for determining the feature domain network based loss function based on the first feature map and the second feature map; and the updating module is used for updating the model parameters based on the characteristic domain network according to the loss function based on the characteristic domain network.
For further implementation functions of the obtaining module, the determining module, and the updating module, reference may be made to the first aspect or any implementation manner of the first aspect, which is not described herein again.
In a fourth aspect, a device for training based on a feature domain network is provided, which includes: a data module, configured to provide the data needed for training, including a training image, a reconstructed feature map obtained by inputting the training image into a codec network for processing, a first feature map obtained by inputting the training image into an image domain network, and a second feature map obtained by inputting the reconstructed feature map into a feature domain network; and a training module, configured to determine the loss function of the feature domain network based on the data output by the data module, and to update the model parameters of the feature domain network according to that loss function.
In a possible implementation manner, the data provided by the data module further includes a reconstructed image of the training image, where the reconstructed image is obtained by inputting the training image into the codec network for processing.
For further implementation functions of the data module and the training module, reference may be made to the first aspect or any implementation manner of the first aspect, which is not described herein again.
In a fifth aspect, a feature domain based network application apparatus is provided, including: an obtaining module, configured to input an original image into a codec network to obtain a reconstructed feature map, and to input the reconstructed feature map into a trained feature domain based network to obtain an analysis result, where the trained feature domain based network is a network obtained through N iterative updates, and the loss function of the n-th iterative update is obtained based on a first feature map and a second feature map, the first feature map being obtained by processing the original image, or a reconstructed image of the original image, with the image domain based network, and the second feature map being obtained by processing the reconstructed feature map with the feature domain based network after the (n-1)-th iteration; and an output module, configured to output the analysis result.
In a possible implementation manner, the application device further includes a determination module, configured to determine a category of the image according to the analysis result.
For further implementation functions of the obtaining module, the output module, or the determining module, reference may be made to any implementation manner of the second aspect or the second aspect, which is not described herein again.
In a sixth aspect, the present application provides a trainer comprising processing circuitry for performing the method according to any of the first and second aspects above.
In a seventh aspect, the present application provides a training device, comprising: one or more processors; and a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the training device to perform the method of the first aspect or any possible implementation of the first aspect.
In an eighth aspect, the present application provides a non-transitory computer readable storage medium comprising program code that, when executed by a computer device, performs the method of the first aspect or any possible implementation thereof, or of the second aspect or any possible implementation thereof.
In a ninth aspect, the present application provides a computer program product comprising program code for performing the method of the first aspect or any possible implementation thereof, or of the second aspect or any possible implementation thereof, when the computer program product is executed on a computer or processor.
In a tenth aspect, the present invention relates to a training apparatus having the functionality to implement the actions in the method embodiments of the first aspect or any possible implementation of the first aspect. The functionality may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functionality. The modules may perform the corresponding functions in any of the method examples of the first aspect or its implementations; for details, refer to the method examples, which are not repeated here for brevity.
In an eleventh aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in the first aspect.
A twelfth aspect provides an electronic device, which includes the apparatus in any one of the third aspect to the eighth aspect.
The artificial-neural-network-based image codec uses feature maps with a large number of channels (e.g., 192 or 320 dimensions) that are more abstract. If the network parameters are initialized randomly and cross entropy loss is used as the loss function, then with a similar network structure, similar computing power and the same number of training gradient back-propagation passes, the convergence speed and final accuracy of the feature-domain visual task network (such as a classification task network) are weaker than those of the image-domain visual task network. By introducing the feature map produced by the image domain network into the loss function of the feature domain network as a guide, the feature signal of the image domain is exploited. On the one hand, correlation exists between the image domain and feature domain signals; on the other hand, the trained image domain based network can generate more accurate feature signals. The loss function of the feature domain network is therefore optimized using this correlation and this accurate feature signal reference. Training with the optimized loss function can thus improve the convergence speed of the feature domain network and the accuracy of its output analysis result.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
The drawings used in the embodiments of the present application are described below.
FIG. 1 is a schematic block diagram of an exemplary feature-domain-based network training system;
FIG. 2 is a schematic block diagram of an exemplary feature-domain-based network application system;
FIGS. 3A-3B are diagrams illustrating exemplary structures of codec networks;
FIG. 4A is an exemplary diagram of an image-domain-based network structure;
FIG. 4B is an exemplary diagram of a sub-network structure of an image-domain-based or feature-domain-based network;
FIG. 4C is an exemplary diagram of the head network structure of an image-domain-based or feature-domain-based network;
FIGS. 4D-E are exemplary diagrams of the output feature map structure of an image-domain-based network;
FIG. 5A is an exemplary diagram of a feature-domain-based network structure;
FIGS. 5B-C are exemplary diagrams of the output feature map structure of a feature-domain-based network;
FIG. 6 is a block diagram illustrating the principle of the loss function of a feature-domain-based network;
FIGS. 7A-9 are schematic flowcharts of feature domain network training methods according to embodiments of the present application;
FIGS. 10A-B are schematic flowcharts of a feature-domain-network-based application method according to an embodiment of the present application;
FIGS. 11A-B are schematic block diagrams of a feature-domain-network-based training device according to an embodiment of the present application;
FIG. 12 is a schematic block diagram of a feature-domain-based network application apparatus according to an embodiment of the present application;
FIG. 13 is a schematic diagram of the hardware structure of a feature domain network training apparatus according to an embodiment of the present application;
FIG. 14 is a possible system architecture 1400 according to an embodiment of the present application;
FIG. 15 is a system architecture 1500 in another possible cloud scenario according to an embodiment of the present application;
FIG. 16 is a schematic diagram of the hardware structure of a chip of the feature domain network training apparatus according to an embodiment of the present application.
Detailed Description
The terms "first", "second", and the like, referred to in the embodiments of the present application, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance, nor order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" is used to describe the association relationship of the associated object, indicating that there may be three relationships, for example, "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The embodiment of the application provides an AI-based feature domain network training technology, in particular provides a neural network-based visual task network training technology of an image feature map, and particularly provides a feature domain network-based training system.
In the field of image coding, the terms "picture" and "image" may be used as synonyms. Image coding (or coding in general) includes two parts: image encoding and image decoding. A video is composed of a plurality of images and is a representation of consecutive images. Image encoding is performed on the source side and typically involves processing (e.g., compressing) the original video image to reduce the amount of data required to represent the image (and thus store and/or transmit it more efficiently). Image decoding is performed on the destination side and typically involves inverse processing with respect to the encoder to reconstruct the image. In the embodiments, "coding" of an image should be understood as referring to "encoding" or "decoding" of the image. The encoding part and the decoding part are also collectively referred to as a CODEC (encoding and decoding).
In the case of lossless image coding, the original image or training image may be reconstructed, i.e., the reconstructed image is of the same quality as the original image or training image (assuming no transmission loss or other data loss during storage or transmission). In the case of conventional lossy image coding, further compression is performed by quantization or the like to reduce the amount of data required to represent the image, while the decoder side cannot reconstruct the image completely, i.e., the reconstructed image has a lower or poorer quality than the original image or the training image.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the sake of understanding, the following description will be made first of all with respect to terms and concepts of the neural networks to which the embodiments of the present application may relate.
(1) Convolutional Neural Network (CNN)
A convolutional neural network comprises convolutional layers, and may further comprise modules such as activation layers (e.g., ReLU, PReLU), pooling layers, batch normalization (BN) layers, fully connected layers, etc. Typical convolutional neural networks include LeNet, AlexNet, VGGNet, ResNet, etc. A basic CNN can be composed of a backbone network and a head network; a complex CNN consists of a backbone network, a neck network and a head network.
(2) Characteristic diagram (feature map)
The three dimensions of the three-dimensional data output by a convolutional layer, activation layer, pooling layer or batch normalization layer in a convolutional neural network are called Width, Height and Channel, respectively.
(3) Backbone network (backbone network)
The first part of the convolutional neural network, whose function is to extract feature maps of multiple scales from an input image. It is generally composed of convolutional layers, pooling layers, activation layers and the like, and does not include fully connected layers. Typically, the feature maps output by layers closer to the input image in the backbone network have a larger resolution (width, height) but a smaller number of channels. Typical backbone networks include VGG-16, ResNet-50, ResNeXt-101, and the like.
(4) Head network (head network)
The final part of the convolutional neural network has the function of processing the characteristic diagram to obtain a prediction result output by the neural network, and a common head network comprises a full connection layer, a softmax module and the like.
(5) Neck network (neck network)
The middle part of the convolutional neural network, whose function is to further integrate and process the feature maps generated by the backbone to obtain new feature maps. A common neck network is, for example, the Feature Pyramid Network (FPN) of Faster RCNN.
(6) Bottleneck structure (bottle structure)
The input data of the network first passes through one or more neural network layers to obtain intermediate data, and the intermediate data then passes through one or more neural network layers to obtain output data, where the data volume (i.e., the product of width, height and number of channels) of the intermediate data is lower than both the input and output data volumes.
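For illustration, a toy PyTorch sketch of such a bottleneck structure (the channel counts are arbitrary assumptions): the intermediate tensor has a lower data volume than both the input and the output.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Input -> reduced intermediate data -> output with the original data volume."""
    def __init__(self, channels: int = 256, mid_channels: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_channels, kernel_size=1)  # shrink channel count
        self.act = nn.ReLU(inplace=True)
        self.expand = nn.Conv2d(mid_channels, channels, kernel_size=1)  # restore channel count

    def forward(self, x):
        return self.expand(self.act(self.reduce(x)))
```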
(7) GDN layer (generalized divisive normalization)
The ICLR2016 paper "Density Modeling of Images Using a Generalized Normalization Transformation" proposes the GDN layer, a normalization layer more suitable for image reconstruction. The same authors use the GDN layer and the Inverse GDN (IGDN) layer in the image compression algorithm of the ICLR2017 paper "End-to-End Optimized Image Compression", to which reference is made for the specific implementation formulas and procedures.
In the present application, there is no limitation on the type of training atlas and the number of training images included in each type of training atlas.
Fig. 1 illustrates a possible feature domain based network training system 100 provided by an embodiment of the present application.
In fig. 1, codec network 104 in feature domain network-based training system 100 represents, among other things, a device that may be used to perform techniques in accordance with various examples described herein.
As shown in FIG. 1, the feature domain based network training system 100 includes an image source 102, a codec network 104, an image domain based network 106, a feature domain based network 108, and a loss function 110.
The image source 102 may include or may be any type of image capture device for capturing a real-world image, and/or any type of image generation device, such as a computer graphics processor for generating a computer-animated image, or any type of device for acquiring and/or providing a real-world image or a computer-generated image (e.g., screen content, a Virtual Reality (VR) image, and/or any combination thereof, such as an Augmented Reality (AR) image).
The image (or image data) may also be referred to as a training image (or training image data).
The codec network 104 is configured to receive image data and provide decoded image data. The image domain based network 106 is configured to perform data processing on the decoded image data (also referred to as reconstructed image data) generated by the codec network 104, or on a training image, to obtain processed feature data. The data processing performed by the image domain based network 106 may include any other processing, such as feature signal extraction.
The feature-based domain network 108 is configured to perform data processing on the decoded feature data (also referred to as reconstructed feature data) generated by the codec network 104 to obtain feature data after data processing. The data post-processing performed based on the feature domain network 108 may include any other processing, such as feature signal extraction.
The loss function 110 is configured to receive the feature data generated by the image domain based network 106 and the feature domain based network 108 and to calculate a loss based on the feature data to indicate an update to the feature domain based network to handle a particular visual task (e.g., processing an input image or image region or image patch to generate a classification of the input image or image region or image patch). The training data may be stored in a database (not shown), and the loss function 110 may generate feature data based on the training data to derive distance loss terms for training to derive a target model (e.g., a network that may be an image feature map for target recognition or classification, etc.). It should be noted that, in the embodiment of the present application, a source of the training data is not limited, and for example, the training data may be obtained from a cloud or other places to perform model training.
Fig. 2 shows a possible application system 200 based on a feature domain network according to an embodiment of the present application.
In fig. 2, codec network 204 in feature domain based network application system 200 represents, among other things, a device that may be used to perform techniques in accordance with various examples described in this application.
As shown in FIG. 2, the feature domain based network application system 200 includes an image source 202, a codec network 204, a feature domain based network 206, and an output interface 208.
The image source 202 may include or may be any type of image capture device for capturing a real-world image, and/or any type of image generation device, such as a computer graphics processor for generating a computer-animated image, or any type of device for acquiring and/or providing a real-world image or a computer-generated image (e.g., screen content, a Virtual Reality (VR) image, and/or any combination thereof, such as an Augmented Reality (AR) image).
The image (or image data) may also be referred to as an original image (or original image data).
The codec network 204 is configured to receive image data and provide decoded image data.
The feature domain based network 206 is configured to perform data processing on the decoding feature data (also referred to as reconstructed feature data) generated by the codec network 204 to obtain an analysis result after the data processing. The data post-processing performed based on the feature domain network 206 may include any other processing, such as feature signal extraction.
The output interface 208 is configured to receive the analysis result output by the feature domain based network 206. For example, the output interface 208 may be configured to encapsulate the analysis result into a suitable format such as a message, and/or process the decoded feature data using any type of transport encoding or processing for transmission over a communication link or communication network.
It will be obvious to the skilled person from the description that the presence and (exact) division of the different units or functions shown in fig. 1, 2 may differ depending on the actual device and application.
The codec network 104 (e.g., image codec), the image domain based network 106, the feature domain based network 108 or the loss function 110 of fig. 1, or the codec network 204 (e.g., image codec), the feature domain based network 206, or the output interface 208 of fig. 2 may be implemented by processing circuitry, such as one or more microprocessors, Digital Signal Processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), graphics processors (GPUs), embedded neural network processors (NPUs), discrete logic, hardware, image coding specific processors, or any combination thereof. If portions of the techniques are implemented in software, the device may store the instructions of the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the techniques of the present invention.
The various modules or units in the feature domain based network training system 100, the feature domain based network application system 200 may comprise any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a cell phone, a smart phone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (e.g., a content services server or a content distribution server), a broadcast receiving device, a broadcast transmitting device, etc., and may be used without or with any type of operating system. In some cases, feature domain based network application system 200 may be equipped with components for wireless communications. Thus, the feature domain based network training system 100, the feature domain based network application system 200 may be a wireless communication device.
Based on the system framework of fig. 1, the following description focuses on a feature domain-based network training structure in conjunction with fig. 3-6. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, a convolutional neural network is a feed-forward artificial neural network in which individual neurons can respond to an image input thereto.
The codec network may receive images (image data) via an input, for example, images forming a video or video sequence. The received image or image data may be a preprocessed image (or preprocessed image data). For simplicity, the following description uses the term image. An image may also be referred to as a current image or an image to be coded.
The (digital) image is or can be considered as a two-dimensional array or matrix of pixel points with intensity values. A pixel point in the array may also be called a pixel (or pel, short for picture element). The number of pixel points in the array or image in the horizontal and vertical directions (or axes) determines the size and/or resolution of the image. To represent color, three color components are typically employed, i.e., the image may be represented as or include three arrays of pixel points. In the RGB format or color space, an image includes corresponding arrays of red, green and blue pixel points. However, in video or image coding, each pixel is typically represented in a luminance/chrominance format or color space, such as YCbCr, comprising a luminance component indicated by Y (sometimes also denoted L) and two chrominance components denoted Cb and Cr. The luminance (luma) component Y represents the luminance or gray-level intensity (e.g., both are the same in a gray-scale image), while the two chrominance (chroma) components Cb and Cr represent the chrominance or color information components. Accordingly, an image in YCbCr format includes a luminance pixel point array of luminance values (Y) and two chrominance pixel point arrays of chrominance values (Cb and Cr). An image in RGB format may be converted or transformed into YCbCr format and vice versa, a process also known as color transformation or conversion. If an image is black and white, it may include only an array of luminance pixel points. Accordingly, an image may be, for example, an array of luminance pixel points in monochrome format, or an array of luminance pixel points and two corresponding arrays of chrominance pixel points in a 4:2:0, 4:2:2 or 4:4:4 color format.
In one possibility, the image source may divide the image into a plurality of (typically non-overlapping) image blocks. These blocks may also be referred to as root blocks, macroblocks (h.264/AVC), or Coding Tree Blocks (CTBs), or Coding Tree Units (CTUs) in the h.265/HEVC and VVC standards. The segmentation unit may be adapted to use the same block size for all images in the video sequence and to use a corresponding grid defining the block sizes, or to change the block sizes between images or subsets or groups of images and to segment each image into corresponding blocks.
One possible codec network structure in the embodiment of the present application may be as shown in fig. 3A. In fig. 3A, codec network 300 includes an encoding network 315 and a decoding network 335. The encoding network 315 includes an image transformation network 310 and an entropy encoding network 320, and the decoding network 335 includes an entropy decoding network 330 and an image reconstruction network 340.
The image is input into the image transformation network 310, the entropy coding network 320 and the entropy decoding network 330 to obtain a reconstructed feature map. The image is input into the encoding network 315 and the decoding network 335 to obtain a reconstructed image. The output of the coding and decoding network comprises a reconstruction characteristic map and a reconstruction image.
Another possible codec network structure in the embodiment of the present application may be as shown in fig. 3B. In fig. 3B, the codec network includes an encoding network 315 and a decoding network 335. The encoding network 315 includes an image transformation network 310 and an entropy encoding network 320, and the decoding network 335 includes an entropy decoding network 330.
The image is input into the image transformation network 310, the entropy encoding network 320 and the entropy decoding network 330 to obtain a reconstructed feature map. The codec network output includes only the reconstructed feature map.
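The composition of the codec network in FIG. 3A/3B can be sketched as follows (a non-authoritative outline assuming PyTorch-style sub-modules; the internals of the sub-networks are placeholders supplied by the caller):

```python
import torch.nn as nn

class CodecNetwork(nn.Module):
    """Sketch of the codec network: FIG. 3A when a reconstruction network is given, FIG. 3B otherwise."""
    def __init__(self, transform: nn.Module, entropy_enc: nn.Module,
                 entropy_dec: nn.Module, reconstruct: nn.Module = None):
        super().__init__()
        self.transform = transform      # image transformation network 310
        self.entropy_enc = entropy_enc  # entropy encoding network 320
        self.entropy_dec = entropy_dec  # entropy decoding network 330
        self.reconstruct = reconstruct  # image reconstruction network 340 (absent in FIG. 3B)

    def forward(self, image):
        bitstream = self.entropy_enc(self.transform(image))
        recon_feature_map = self.entropy_dec(bitstream)
        recon_image = self.reconstruct(recon_feature_map) if self.reconstruct is not None else None
        return recon_feature_map, recon_image
```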
One possible image domain based network architecture in the embodiments of the present application may be as shown in fig. 4A. In fig. 4A, convolutional neural network 400 may include 5 sub-networks L1 to L5 as shown in fig. 4A, each of which contains 1 or more convolutional layers, and a header network H including an average pooling layer and a full connection layer.
In one possibility, the training images serve as input based on the image domain network.
In another possibility, the reconstructed image output by the codec network serves as an input to the image domain based network.
In another possibility, the training image or the reconstructed image output by the codec network can be adaptively selected as the input of the image domain based network. For example, the type of input image may be selected according to the code rate point corresponding to the codec network: when the code rate is smaller than a threshold (e.g., 0.5 bpp), the reconstructed image output by the codec network is selected as the input of the image domain based network; otherwise, the training image is selected as the input. Other adaptive selection schemes are not limited here.
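This adaptive selection can be expressed as a simple rule; the 0.5 bpp threshold below follows the example above, while everything else is an illustrative assumption:

```python
def select_image_domain_input(training_image, reconstructed_image, bpp: float,
                              threshold: float = 0.5):
    """Choose the image domain network input according to the codec's rate point (bits per pixel)."""
    # At a low rate point the reconstructed image is used, so the guidance features
    # reflect the coding distortion; otherwise the original training image is used.
    return reconstructed_image if bpp < threshold else training_image
```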
In one possibility, the L1 network structure is shown in fig. 4B, and it can be seen that the L1 example specifically includes 3 convolutional layers.
In one possibility, the structure of the header network is shown in fig. 4C, and it can be seen that the header network in the example includes 2 network layers, specifically, a pooling layer and a full connection layer.
In one possibility, the network structures between the sub-networks may be the same or different from each other.
In one possibility, the image domain based network output comprises the analysis results.
At least one feature map is output by the image domain based network, as shown in fig. 4D and 4E: fig. 4D shows the case of a single output feature map, and fig. 4E corresponds to the case of 3 output feature maps.
The size and the dimension of the feature maps output based on the image domain network may be the same or different, and a certain feature map may be a feature vector, or may be single data, one-dimensional data, two-dimensional image data, three-dimensional or multi-dimensional image data.
Notably, the image domain based network may be a convolutional neural network such as ResNet, YOLOv4, Faster RCNN, Mask RCNN, ASLFeat, and the like. The analysis result output by the image domain based network may be an object category, an object detection frame, an object segmentation map, a local visual feature, and the like, which is not limited here.
In one possibility, when the analysis result is applied to the identification of object classes, the analysis result may be a set of vector data expressing probability information that the training or original image belongs to the respective object class.
One possible feature domain based network structure in the embodiments of the present application may be as shown in fig. 5A. In fig. 5A, the convolutional neural network 500 may include 3 sub-networks L1', L4' and L5' as shown in fig. 5A, and a head network H ', wherein each sub-network includes 1 or more convolutional layers, the head network H ' includes an average pooling layer and a full connection layer, and a reconstructed feature map output by the codec network is used as an input based on the feature domain network.
In one possibility, the L1 'network structure is shown in fig. 4B, and it can be seen that the L1' example specifically includes 3 convolutional layers.
In one possibility, the structure of the head network H 'is shown in fig. 4C, and it can be seen that the head network H' in the example includes 2 network layers, specifically, a pooling layer and a full connection layer.
In one possibility, the network structures between the sub-networks may be the same or different from each other.
In one possibility, the feature domain based network output includes analysis results.
At least one feature map is output by the feature domain based network, as shown in fig. 5B and 5C: fig. 5B shows the case of a single output feature map, and fig. 5C corresponds to the case of 3 output feature maps.
The size and the dimension of the feature maps output based on the feature domain network may be the same or different, and a certain feature map may be a feature vector, or may be single data, one-dimensional data, two-dimensional image data, three-dimensional or multi-dimensional image data.
Notably, the feature-based domain network may be a convolutional neural network such as Resnet, yolov4, faster RCNN, mask RCNN, ASLFeat, and the like. The analysis result output based on the feature domain network may be an object category, an object detection frame, an object segmentation graph, a local visual feature, and the like, which is not limited herein.
In one possibility, when the analysis result is applied to the identification of object classes, the analysis result may be a set of vector data expressing probability information that the training or original image belongs to the respective object class.
In one possibility, the feature domain based network structure may be configured with reference to the image domain based network structure. Part of the feature domain based network can be the same as the corresponding part of the image domain based network, while network layers of the image domain network whose feature map width and height are larger than those of the network input feature map are not introduced. As shown in fig. 4A, L1 to L5 represent the 5 parts of the ResNet-50 backbone network, and H represents the head network of ResNet-50. The feature domain based network can be made to skip the L2 and L3 parts of the image domain based network structure, because the feature maps of these two parts are wider and taller than the network input feature map.
In one possibility, in order to increase the network capacity and computing power to achieve better task accuracy, the number of network layers of a sub-network (e.g., L4') in the feature domain based network may also be increased, for example from 6 cascaded residual units (ResBlocks) to 10 cascaded residual units.
In one possibility, to adapt to the size and number of channels of the input feature map, the L1' part of the feature domain based network may contain no downsampling (i.e., convolution stride = 1), and the number of output channels may be configured flexibly according to the acceptable computing budget, which is not limited by the present invention.
In one possibility, a convolutional layer and an Inverse GDN (iGDN) layer may be added before the feature domain network-based subnetwork (e.g., L1').
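Putting the above points together, a hedged sketch of assembling the feature domain based network of FIG. 5A (the sub-networks L1', L4', L5', the head network and the iGDN layer are passed in as modules; their internals and the channel counts are assumptions):

```python
import torch.nn as nn

def build_feature_domain_net(l1_prime: nn.Module, l4: nn.Module, l5: nn.Module,
                             head: nn.Module, igdn: nn.Module,
                             in_channels: int = 192, stem_channels: int = 256) -> nn.Module:
    """Assemble the feature domain based network of FIG. 5A.

    L2 and L3 of the image domain structure are skipped because their feature
    maps are wider/taller than the network input feature map; an optional
    convolution + iGDN stem adapts the reconstructed feature map before L1'.
    """
    stem = nn.Sequential(
        nn.Conv2d(in_channels, stem_channels, kernel_size=3, stride=1, padding=1),  # no downsampling
        igdn,  # inverse GDN layer (implementation not shown here)
    )
    return nn.Sequential(stem, l1_prime, l4, l5, head)
```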
The implementation of the loss function of the feature domain network in the embodiment of the present application may be as shown in fig. 6. In fig. 6, with the image domain based network already trained and its model fixed, the training image or the reconstructed image output by the codec network is used as the input of the image domain based network, and the reconstructed feature map is used as the input of the feature domain based network. The distance between at least one feature map output by the image domain based network and at least one feature map output by the feature domain based network is calculated and used as one of the loss terms of the loss function in the training process of the feature domain based network.
In one possibility, the distance between feature maps may be measured by an L1-norm loss or an L2-norm loss (also known as mean squared error, MSE).
The L2 norm loss is the sum of the squares of the differences between corresponding points of the feature map output by the image domain based network and the feature map output by the feature domain based network:

D_{L2} = \sum_{i=0}^{m} \left\| y_i - \hat{y}_i \right\|_2^2

The L1 norm loss, also referred to as absolute deviation, is the sum of the absolute values of the differences between corresponding points of the two feature maps:

D_{L1} = \sum_{i=0}^{m} \left\| y_i - \hat{y}_i \right\|_1

where y_i denotes a feature map output by the image domain based network, \hat{y}_i denotes the corresponding feature map output by the feature domain based network, and m+1 is the number of feature maps included in the first feature map or the second feature map.
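A minimal sketch of this distance term over the m+1 feature map pairs (assuming PyTorch tensors with matching sizes; the choice between L1 and L2 is a configuration option):

```python
import torch

def feature_distance(first_maps, second_maps, norm: str = "l2") -> torch.Tensor:
    """Distance between the first feature map (image domain) and the second feature map (feature domain)."""
    terms = []
    for y_i, y_hat_i in zip(first_maps, second_maps):  # the map sizes are assumed to match
        diff = y_i - y_hat_i
        if norm == "l2":
            terms.append((diff ** 2).sum())   # L2 norm loss: sum of squared differences
        else:
            terms.append(diff.abs().sum())    # L1 norm loss: sum of absolute differences
    return torch.stack(terms).sum()
```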
Specifically, the distance between a first feature map based on the image domain network and a second feature map based on the feature domain network is introduced as one of the loss terms of the loss function based on the feature domain network. The specific feature domain based network loss function can be exemplified as follows:
Cost=P1 x D1+P2 x D2(x,y)
The above loss function of the feature domain based network includes two loss terms, D1 and D2. P1 and P2 are the weighting coefficients of the two loss terms, from which the magnitude of the loss Cost is finally determined. The D2 loss term specifically calculates the distance between the first feature map x and the second feature map y.
In one possibility, the loss function based on the feature domain network may further include a loss term in the loss function used for training the image domain network-based process, such as cross entropy loss for the classification task; it may be a regression loss of the detection box or the like for the target detection task. The loss function, as for training an image domain based network, can be exemplified as follows:
Cost=P3 x D3+P4 x D4
the D4 term in the image domain network-based loss function can be introduced into the feature domain network-based loss function, and the feature domain network-based loss function can be shown as follows:
Cost=P1 x D1+P2 x D2(x,y)+P5 x D4
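Using the feature_distance sketch above, the combined loss Cost = P1·D1 + P2·D2(x, y) + P5·D4 could be assembled as follows; which concrete terms play the roles of D1 and D4 is a design choice not fixed by the text, so here D1 is taken as a cross-entropy task loss of the feature domain network for a classification task, and D4 is passed in already computed from the image-domain training loss:

```python
import torch
import torch.nn.functional as F

def feature_domain_loss(logits: torch.Tensor, labels: torch.Tensor,
                        first_maps, second_maps, d4: torch.Tensor,
                        p1: float = 1.0, p2: float = 1.0, p5: float = 1.0) -> torch.Tensor:
    """Cost = P1 * D1 + P2 * D2(x, y) + P5 * D4 (weights P1, P2, P5 are illustrative)."""
    d1 = F.cross_entropy(logits, labels)            # assumed task loss of the feature domain network
    d2 = feature_distance(first_maps, second_maps)  # feature map distance (sketch above)
    return p1 * d1 + p2 * d2 + p5 * d4
```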
fig. 7A is a schematic flowchart 700 of a feature domain network training method according to an embodiment of the present application. Referring to fig. 7A and 7B, the method may include steps 1-4, which are described in detail below with respect to steps 1-4, respectively.
Step 1: input the training image into the image domain based network to obtain a first feature map.
Specifically, the image domain based network is a network whose model parameters have already been obtained through training, and whose model is fixed, that is, the parameters in each of its neural network layers have been determined and remain unchanged. The network structure may be a convolutional neural network such as ResNet, YOLOv4, Faster RCNN, Mask RCNN, ASLFeat, etc. Taking the structure of the image domain based network 400 shown in fig. 4A as an example, the image domain based network includes 5 sub-networks L1 to L5 and a head network H. Each sub-network contains 1 or more convolutional layers, and the head network H includes an average pooling layer and a fully connected layer. For a detailed description of the image domain based network, please refer to the description of fig. 4A, which is not repeated here.
The first feature map includes at least one feature map generated based on the image domain network, and as shown in fig. 4A as an example, the first feature map includes a feature map output after the L5 sub-network in the image domain network and a feature map output after the head network (the position of the oval dotted line indicated in step 1).
It should be understood that the feature maps generated by the image domain network may also be taken after other sub-networks, and the number of feature maps is not limited. For example, only the feature map output by L5 may be used; the feature maps output by L4 and L5 may be used; or the feature map output by L5 together with the feature vector output by the head network may be used.
It should be understood that the training image may be in RGB format or YUV, RAW, etc., and the training image may be subjected to preprocessing operations before being input into the codec network, where the preprocessing operations may include conversion, block division, filtering, pruning, etc.
It should be appreciated that multiple training images, or multiple training image blocks, may be input to the codec network at the same time (within the same timestamp) for processing to obtain reconstructed feature maps.
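The following sketch illustrates step 1 under the assumption that a torchvision ResNet-50 stands in for the trained image domain network; the chosen tap points (layer4 and the classification head) mirror the "L5 output plus head output" example above and are not mandated by the embodiments:

import torch
import torchvision

image_domain_net = torchvision.models.resnet50(weights=None).eval()
for p in image_domain_net.parameters():
    p.requires_grad_(False)  # the image domain model parameters are fixed

captured = {}
def save_as(name):
    def hook(module, args, output):
        captured[name] = output
    return hook

image_domain_net.layer4.register_forward_hook(save_as("L5"))
image_domain_net.fc.register_forward_hook(save_as("head"))

training_image = torch.randn(1, 3, 224, 224)  # placeholder preprocessed RGB input
with torch.no_grad():
    image_domain_net(training_image)
first_feature_map = [captured["L5"], captured["head"]]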
Step 2: and inputting the reconstructed feature map obtained by inputting the training image into the coding and decoding network based on the feature domain network to obtain a second feature map.
Specifically, the feature domain network structure may be a convolutional neural network such as ResNet, YOLOv4, Faster RCNN, Mask RCNN, or ASLFeat. Taking the structure of the feature domain network 500 shown in fig. 5A as an example, the feature domain network includes 3 sub-networks L1', L4', and L5' and a head network H'. Each sub-network contains one or more convolutional layers, and the head network H' includes an average pooling layer and a fully connected layer. For a detailed description of the feature domain network, please refer to the description of fig. 5A, which is not repeated here.
The second feature map includes at least one feature map generated by the feature domain network. Taking fig. 5A as an example, the second feature map includes the feature map output by the L5' sub-network of the feature domain network and the feature map output by the head network H' (the position of the oval dashed line indicated in step 2).
It should be understood that the feature maps generated by the feature domain network may also be taken after other sub-networks, and the number of feature maps is not limited. For example, only the feature map output by L5' may be used; the feature maps output by L4' and L5' may be used; or the feature map output by L5' together with the feature vector output by the head network H' may be used.
It should be noted that the second feature map obtained from the feature domain network must match in size the first feature map obtained from the image domain network in step 1. Specifically, size matching means that each feature map in the second feature map has the same number of dimensions, and the same size in every dimension, as the corresponding feature map in the first feature map. For example, suppose the first feature map includes feature maps y1 (three-dimensional data), y2 (two-dimensional data), and y3 (one-dimensional data), and the second feature map includes feature maps ŷ1 (three-dimensional data), ŷ2 (two-dimensional data), and ŷ3 (one-dimensional data), where y1 corresponds to ŷ1, y2 corresponds to ŷ2, and y3 corresponds to ŷ3. Then y1 and ŷ1 must have the same sizes (e.g., width, height, and number of channels), y2 and ŷ2 must have the same sizes (e.g., width and height), and y3 and ŷ3 must have the same size (e.g., vector length).
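A minimal sketch of this size-matching check, assuming the two feature maps are given as lists of tensors; the function name is illustrative:

def check_size_match(first_maps, second_maps):
    # Each feature map of the second feature map must have the same number of
    # dimensions, and the same size in every dimension, as its counterpart.
    if len(first_maps) != len(second_maps):
        raise ValueError("different number of feature maps")
    for y, y_hat in zip(first_maps, second_maps):
        if y.shape != y_hat.shape:
            raise ValueError(f"size mismatch: {tuple(y.shape)} vs {tuple(y_hat.shape)}")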
Step 3: Determine the loss function of the feature domain network based on the first feature map from the image domain network and the second feature map from the feature domain network.
Specifically, the distance between a first feature map based on the image domain network and a second feature map based on the feature domain network is introduced as one of the loss terms of the feature domain network-based loss function. The specific characteristic domain-based network loss function can be exemplified as follows:
Cost=P1 x D1+P2 x D2(x,y)
The above feature domain network loss function includes two loss terms, D1 and D2. P1 and P2 are the weighting coefficients of the two loss terms; together they determine the final value of the loss Cost. The D2 loss term computes the distance between the first feature map x and the second feature map y.
For a detailed description of the distance calculation, please refer to the above-mentioned implementation process of the loss function, which is not described herein again.
Meanwhile, determining the loss function of the feature domain network also includes calculating the value of the loss function.
Step 4: Update the model parameters of the feature domain network according to the loss function of the feature domain network.
Specifically, the value of the loss is obtained from the loss function of the feature domain network, the gradient of the parameters of each neural network layer in the feature domain network is computed, and the parameters of the feature domain network are updated according to the gradients.
It should be noted that, in the present application, one training period (epoch) refers to passing all images in the training set through steps 1-4 above, and the entire training process of the feature domain network may require multiple epochs (that is, steps 1-4 are performed on the images of the training set multiple times). The training method of the present application may be used in every training period or only in some training periods, and its scope of use during training is not particularly limited. For example, the training method of the present application may be used in the first N epochs (N is, for example, 50) to converge quickly to a model with good performance, and not used in the last M epochs (M is, for example, 20), in which case the loss function of the feature domain network no longer needs to constrain the distance between the first feature map and the second feature map. The loss function of the feature domain network for the first N epochs is:
Cost=P1 x D1+P2 x D2(x,y)
the loss function of the last M epochs based on the feature domain network is:
Cost=D1
The gradient of each neural network layer parameter is calculated from the loss computed by the feature domain network loss function, and the parameters of the feature domain network are updated according to the gradients. In addition, in one possibility, clipping or other processing may be applied to the gradients. For example, this can be implemented with functions such as loss.backward() and optimizer.step() in PyTorch, where loss is the computed value of the loss function and optimizer is the optimizer.
In one possibility, the updating of the feature domain network parameters is stopped when the loss function of the feature domain network converges to a first threshold, or when the current number of training iterations of the feature domain network is greater than or equal to a second threshold.
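Putting steps 1-4 together, a hedged training-loop sketch might look as follows; feature_domain_net, codec_net, image_domain_forward, task_loss and train_loader, as well as every weight and threshold value, are placeholders assumed for the example and are not the reference implementation of the embodiments:

import torch

optimizer = torch.optim.Adam(feature_domain_net.parameters(), lr=1e-4)
N_DISTANCE_EPOCHS = 50   # use the D2 distance term only in the first N epochs
MAX_EPOCHS = 70          # second threshold: maximum number of training passes
LOSS_THRESHOLD = 1e-3    # first threshold: stop once the loss converges below this

for epoch in range(MAX_EPOCHS):
    for image, label in train_loader:
        first_maps = image_domain_forward(image)                   # step 1 (frozen network)
        recon_feat = codec_net(image)                               # reconstructed feature map
        second_maps, prediction = feature_domain_net(recon_feat)    # step 2
        d1 = task_loss(prediction, label)                           # e.g. cross entropy
        cost = 1.0 * d1                                             # Cost = D1 in the last M epochs
        if epoch < N_DISTANCE_EPOCHS:                               # Cost = P1*D1 + P2*D2 early on
            d2 = sum(((y - y_hat) ** 2).sum()
                     for y, y_hat in zip(first_maps, second_maps))
            cost = cost + 0.1 * d2
        optimizer.zero_grad()
        cost.backward()                                             # gradients of every layer parameter
        torch.nn.utils.clip_grad_norm_(feature_domain_net.parameters(), 1.0)  # optional clipping
        optimizer.step()                                            # step 4: update the parameters
    if cost.item() <= LOSS_THRESHOLD:
        break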
Fig. 8 is a schematic flowchart 800 of a feature domain network training method according to a second embodiment of the present application. Referring to fig. 8 and 7B, the method may include steps 1-4, which are described in detail below for steps 1-4, respectively.
Step 1: and inputting a reconstructed image of the training image into the image domain network to obtain a first feature map.
The difference between step 1 of this embodiment and step 1 of the first embodiment is that the input of the image domain network is the reconstructed image of the training image. Specifically, the reconstructed image of the training image is obtained by inputting the training image into the codec network for processing.
Step 2 to step 4: same as step 2 to step 4 of the first embodiment.
Fig. 9 is a schematic flowchart 900 of a feature domain network training method according to a third embodiment of the present application. Referring to fig. 9 and 7B, the method may include steps 1-4, which are described in detail below for steps 1-4, respectively.
Step 1: and inputting a reconstructed image or a training image of the training image into the image-based domain network to obtain a first feature map.
Specifically, the reconstructed image of the training image is obtained by inputting the training image into the codec network. The type of input image is selected according to the code rate point of the codec network: when the code rate is less than a threshold (for example 0.5 bpp), the reconstructed image output by the codec network is selected as the input of the image domain network; otherwise, the training image is selected as the input of the image domain network.
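A short sketch of this selection rule; the threshold value and function name are assumptions for illustration:

BPP_THRESHOLD = 0.5  # example code rate threshold from the text

def select_image_domain_input(training_image, reconstructed_image, bpp):
    # Third embodiment, step 1: choose the image domain network input by code rate.
    return reconstructed_image if bpp < BPP_THRESHOLD else training_image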
Step 2 to step 4: same as step 2 to step 4 of the first embodiment.
The image feature domain network-based training method provided by the fourth embodiment of the present application may include steps 1 to 4, and the steps 1 to 4 are described in detail below.
Step 1 to step 2: same as step 1 to step 2 of the first embodiment.
Step 3: Determine the loss function of the feature domain network based on the first feature map from the image domain network and the second feature map from the feature domain network.
Specifically, the distance between the first feature map from the image domain network and the second feature map from the feature domain network is introduced as one of the loss terms of the loss function of the feature domain network. Meanwhile, the D4 loss term from the loss function of the image domain network is also introduced into the loss function of the feature domain network. The loss function of the image domain network can be exemplified as follows:
Cost (image domain) = P3 x D3 + P4 x D4
Then the specific feature domain based network loss function can be exemplified as follows:
Cost=P1 x D1+P2 x D2(x,y)+P5 x D4
The above feature domain network loss function includes three loss terms: D1, D2 and D4. P1, P2 and P5 are the weighting coefficients of the three loss terms; together they determine the value of the loss Cost of the feature domain network. The D2 loss term computes the distance between the first feature map x and the second feature map y. D4 is a loss term from the loss function of the image domain network.
For a detailed description of the distance calculation, please refer to the above-mentioned implementation process of the loss function, which is not described herein again.
Meanwhile, determining the loss function of the feature domain network also includes calculating the value of the loss function.
Step 4: Same as step 4 of the first embodiment.
It should be understood that, in various training embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, the execution sequence of each process should be determined by its function and inherent logic, and the steps in the above-mentioned embodiments may be combined with each other to implement, but should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 10A and 10B are schematic flow charts 1000 of a feature domain network application-based method according to an embodiment of the present application. Referring to fig. 10A and 10B, the method may include steps 1-2, and steps 1-2 are described in detail below, respectively.
Step 1: and inputting the original image into a coding and decoding network for processing to obtain a reconstruction characteristic diagram.
Specifically, the original image may be in RGB format or YUV, RAW, or other representation formats, and the original image may be subjected to a preprocessing operation before being input into the codec network, where the preprocessing operation may include conversion, block division, filtering, pruning, and other operations.
It should be appreciated that multiple original images, or multiple original image blocks, may be input to the codec network at the same time (within the same timestamp) for processing to obtain reconstructed feature maps.
Step 2: and inputting the reconstructed feature map into a trained feature domain-based network for processing to obtain an analysis result.
Specifically, after the training of the feature domain network is completed according to the training method described in the present application, a fixed model based on the feature domain network is obtained. At this time, the parameters of each neural network layer in the network are fixed and do not change. And inputting the reconstructed feature map into the feature domain-based network to obtain an analysis result. When the analysis result is applied to the identification of the object class, the analysis result may be a set of vector data expressing probability information that the original image or the image block belongs to each object class so as to determine the object class of the original image or the image block.
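A hedged inference sketch of steps 1-2, assuming codec_net and feature_domain_net denote the trained codec network and the trained feature domain network (names chosen for the example, not by the disclosure):

import torch

with torch.no_grad():
    recon_feat = codec_net(original_image)    # step 1: codec network
    logits = feature_domain_net(recon_feat)   # step 2: trained feature domain network
    probs = torch.softmax(logits, dim=-1)     # probability of each object class
    predicted_class = probs.argmax(dim=-1)    # category of the original image or block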
An embodiment of the training apparatus of the present application will be described in detail below with reference to figs. 11-12. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments; therefore, reference may be made to the method embodiments above for parts that are not described in detail.
Fig. 11A is a schematic block diagram of a feature domain network training apparatus 1100 according to an embodiment of the present application. The training processing device 1100 may be implemented as part or all of a device by software, hardware, or a combination of both.
The training processing apparatus 1100 includes: an obtaining module 1110, a determining module 1120, and an updating module 1130, wherein:
an obtaining module 1110, configured to input a training image into an encoding and decoding network to be processed to obtain a reconstructed feature map, where the training image is any image in a training image set; inputting a reconstructed image of the training image or the training image into an image domain network to obtain a first feature map; and inputting the reconstructed feature map into a feature domain network to obtain a second feature map.
A determining module 1120 configured to determine the feature domain network-based loss function based on the first feature map and the second feature map.
An updating module 1130, configured to update the model parameters of the feature domain based network according to the loss function of the feature domain based network.
The feature domain network based loss function in the determination module 1120 comprises a distance between the first feature map and the second feature map.
Optionally, the obtaining module 1110 is specifically configured such that inputting the reconstructed image of the training image or the training image into the image domain network to obtain the first feature map includes: when the code rate point corresponding to the codec network meets a first condition, using the reconstructed image as the input of the image domain network; or, when the code rate point meets a second condition, using the training image as the input of the image domain network.
The size of the first feature map in the capture module 1110 matches the size of the second feature map.
Optionally, the obtaining module 1110 is specifically configured to: the training image is input into a coding and decoding network to obtain a reconstructed image.
Optionally, the determining module 1120 is specifically configured such that the loss function of the feature domain network further includes a loss term of the loss function used to train the image domain network.
Optionally, the update module 1130 is specifically configured to: updating the characteristic domain network-based parameters according to the characteristic domain network-based loss function comprises calculating the gradient of each network layer parameter of the characteristic domain network according to the characteristic domain network-based loss function result, and updating the characteristic domain network-based parameters according to the gradient.
The image domain network used by the obtaining module 1110 is a trained network with fixed model parameters.
Fig. 11B is a schematic block diagram of another feature domain network training apparatus 1150 provided in this embodiment of the present application. The training processing device 1150 may be implemented as part or all of a device by software, hardware, or a combination of both.
The training processing device 1150 includes: data module 1180, training module 1190, wherein:
A data module 1180, configured to provide the data prepared for training, including the training image, the reconstructed feature map obtained by inputting the training image into the codec network, the first feature map obtained by inputting the training image into the image domain network, and the second feature map obtained by inputting the reconstructed feature map into the feature domain network.
A training module 1190, configured to determine the loss function based on the feature domain network based on the data output by the data module, and update the model parameter based on the feature domain network according to the loss function based on the feature domain network.
The feature domain network based loss function in the training module 1190 comprises a distance between the first feature map and the second feature map.
Optionally, the data module 1180 is specifically configured such that inputting the reconstructed image of the training image or the training image into the image domain network to obtain the first feature map includes: when the code rate point corresponding to the codec network meets a first condition, using the reconstructed image as the input of the image domain network; or, when the code rate point meets a second condition, using the training image as the input of the image domain network.
The size of the first feature map in the data module 1180 matches the size of the second feature map.
Optionally, the data module 1180 is specifically configured to: the training image is input into a coding and decoding network to obtain a reconstructed image.
Optionally, the training module 1190 is specifically configured such that the loss function of the feature domain network further includes a loss term of the loss function used to train the image domain network.
Optionally, the training module 1190 is specifically configured to: according to the loss function based on the feature domain network, updating the parameters of the feature domain network comprises calculating the gradient of each network layer parameter of the feature domain network according to the result of the loss function based on the feature domain network, and updating the parameters of the feature domain network according to the gradient.
The image domain network used by the data module 1180 is a trained network with fixed model parameters.
Fig. 12 is a schematic block diagram of another feature domain based network application apparatus 1200 according to an embodiment of the present application. The application device 1200 may be implemented as part or all of a device by software, hardware, or a combination of both.
The application apparatus 1200 includes: an obtaining module 1210 and an outputting module 1220, wherein:
An obtaining module 1210, configured to input an original image into a codec network for processing to obtain a reconstructed feature map, and to input the reconstructed feature map into a trained feature domain network for processing to obtain an analysis result, where the trained feature domain network is obtained through N iterations of updating; the loss function of the nth iteration of updating is obtained based on a first feature map and a second feature map, where the first feature map is obtained by processing the original image or a reconstructed image of the original image with the image domain network, and the second feature map is obtained by processing the reconstructed feature map with the feature domain network after the (n-1)th iteration.
The output module 1220 is configured to output the analysis result.
The loss function includes a distance between the first feature map and the second feature map.
Optionally, the application apparatus 1200 further includes a determining module 1230, configured to determine the category of the image according to the analysis result.
Optionally, the obtaining module 1210 is specifically configured to obtain the reconstructed image by inputting the original image into the codec network for processing.
Fig. 13 is a schematic hardware structure diagram of an image feature-based domain network training apparatus 1300 according to an embodiment of the present application. An image based feature domain network training apparatus 1300 (the apparatus 1300 may specifically be a computer device) shown in fig. 13 includes a memory 1301, a processor 1302, a communication interface 1303, and a bus 1304. The memory 1301, the processor 1302, and the communication interface 1303 are communicatively connected to each other through a bus 1304.
The memory 1301 may be a read-only memory (ROM), a static memory device, a dynamic memory device, or a random access memory (RAM). The memory 1301 may store a program; when the program stored in the memory 1301 is executed by the processor 1302, the processor 1302 is configured to perform the steps of the feature domain network training method of the embodiments of the present application, for example, the steps shown in fig. 7A and fig. 7B.
It should be understood that the training processing apparatus 1300 shown in the embodiment of the present application may be a server, for example, a server in the cloud, or may also be a chip configured in the server in the cloud.
The processor 1302 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), a neural Network Processing Unit (NPU), or one or more integrated circuits, and is configured to execute related programs to implement the training method according to the embodiment of the present invention.
The processor 1302 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1302.
The processor 1302 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or EPROM, or a register. The storage medium is located in the memory 1301; the processor 1302 reads the information in the memory 1301 and, in combination with its hardware, completes the functions to be performed by the units included in the feature domain network training apparatus 1300 shown in fig. 13, or performs the feature domain network training methods shown in figs. 7A to 9 of the method embodiments of the present application.
Communication interface 1303 enables communication between apparatus 1300 and other devices or communication networks using transceiver means, such as, but not limited to, a transceiver.
Bus 1304 may include pathways for communicating information between various components of device 1300, such as memory 1301, processor 1302, and communication interface 1303.
It should be noted that although the apparatus 1300 described above shows only memories, processors, and communication interfaces, in a particular implementation, those skilled in the art will appreciate that the apparatus 1300 may also include other devices necessary to achieve normal operation. Also, those skilled in the art will appreciate that the apparatus 1300 described above may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that the apparatus 1300 described above may also include only those components necessary to implement the embodiments of the present application, and need not include all of the components shown in FIG. 13.
Fig. 14 shows a possible system architecture 1400 provided by the embodiment of the present application.
In fig. 14, a data acquisition device 260 is used to acquire training data. For the training method of the embodiment of the present application, the training data acquired by the data acquisition device 260 may be sample data or may also be a sample image, and the present application is not limited specifically.
After the training data is collected, the data collection device 260 stores the training data in the database 230, and the training device 220 trains the target model/rule 201 based on the training data maintained in the database 230.
The above-described target model/rule 201 can be used to achieve various purposes, such as image classification, and the like. The target model/rule 201 in the embodiment of the present application may specifically be a convolutional neural network.
It should be noted that, in practical applications, the training data maintained in the database 230 may not necessarily all come from the collection of the data collection device 260, and may also be received from other devices. It should be noted that, the training device 220 may not necessarily perform the training of the target model/rule 201 based on the training data maintained by the database 230, and may also obtain the training data from the cloud or other places for performing the model training, and the above description should not be taken as a limitation to the embodiment of the present application.
The target model/rule 201 obtained by training according to the training device 220 may be applied to different systems or devices, for example, the execution device 210 shown in fig. 14, where the execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, or a server, or a cloud. In fig. 14, the execution device 210 configures an input/output (I/O) interface 212 for data interaction with an external device, and a user may input data to the I/O interface 212 through the client device 240, where the input data may include: sample data input by the client device.
The preprocessing module 213 and the preprocessing module 214 are configured to perform preprocessing according to input data (e.g., sample data) received by the I/O interface 212, and in this embodiment, the input data may be processed directly by the computing module 211 without the preprocessing module 213 and the preprocessing module 214 (or without one of them).
In the process that the execution device 210 preprocesses the input data or in the process that the calculation module 211 of the execution device 210 executes the calculation or other related processes, the execution device 210 may call the data, the code, and the like in the data storage system 250 for corresponding processes, or store the data, the instruction, and the like obtained by corresponding processes in the data storage system 250.
Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.
It is worth noting that the training device 220 may generate corresponding target models/rules 201 for different targets or different tasks based on different training data, thereby providing the user with desired results.
In the case shown in fig. 14, in one case, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 212.
Alternatively, the client device 240 may automatically send the input data to the I/O interface 212; if automatic sending by the client device 240 requires authorization from the user, the user may set the corresponding permission in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form may be display, sound, action, and the like. The client device 240 may also serve as a data collection terminal, collecting the input data of the I/O interface 212 and the output results of the I/O interface 212 as new sample data and storing them in the database 230. Of course, the input data of the I/O interface 212 and the output results of the I/O interface 212 shown in the figure may instead be stored directly in the database 230 as new sample data by the I/O interface 212, without being collected by the client device 240.
It should be noted that fig. 14 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 14, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
As shown in fig. 14, a target model/rule 201 is trained according to a training device 220, where the target model/rule 201 may be a Convolutional Neural Network (CNN) in the embodiment of the present application, or may be a Deep Convolutional Neural Network (DCNN), and the present application is not limited in this respect.
Fig. 15 shows a system architecture 1500 in another possible cloud scenario provided in an embodiment of the present application.
The system architecture 1500 includes a local device 520, a local device 530, and an execution device 510 and a data storage system 550, wherein the local device 520 and the local device 530 are connected to the execution device 510 via a communication network.
The execution device 510 may be implemented by one or more servers. Optionally, the execution device 510 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The execution device 510 may be disposed on one physical site or distributed across multiple physical sites. The execution device 510 may use data in the data storage system 550 or call program code in the data storage system 550 to implement the feature domain based network training or application method of the embodiments of the present application.
It should be noted that the execution device 510 may also be referred to as a cloud device, and in this case, the execution device 510 may be deployed in the cloud.
The user may operate respective user devices (e.g., local device 520 and local device 530) to interact with the execution device 510. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, and so forth.
Each user's local device may interact with the execution device 510 via a communication network using any communication mechanism/communication standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof.
In one implementation, the local devices 520 and 530 may obtain relevant parameters based on the feature domain network model from the execution device 510, deploy the feature domain based network on the local devices 520 and 530, and perform data processing and the like by using the feature domain based network.
In another implementation, a feature domain based network may be directly deployed on the execution device 510, and the execution device 510 performs data processing and the like on the image to be processed by acquiring the image to be processed from the local device 520 and the local device 530 and according to the feature domain based network.
In another implementation, the executing device 510 may execute a training process based on the feature domain network, and send the updated model network parameters based on the feature domain network to the local device 520 and the local device 530 through the communication network, and the local device 520 and the local device 530 perform data processing and the like according to the updated feature domain network after training.
Fig. 16 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processing unit (NPU) 400. The chip may be provided in the execution device 210 shown in fig. 14 to complete the calculation work of the calculation module 211. The chip may also be disposed in the training device 220 shown in fig. 14 to complete the training work of the training device 220 and output the target model/rule 201. The algorithms of the various layers of the convolutional neural networks involved in figs. 7A-10 may all be implemented in a chip as shown in fig. 16.
The NPU400 is mounted as a coprocessor on a main processing unit (CPU), and tasks are allocated by the main CPU. The core part of the NPU400 is an arithmetic circuit 403, and the controller 404 controls the arithmetic circuit 403 to extract data in a memory (a weight memory or an input memory) and perform an arithmetic operation.
It should be noted that the NPU400 may also be understood as a neural network accelerator, an AI accelerator, or an AI chip.
In some implementations, the arithmetic circuit 403 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 403 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 403 fetches the data of matrix B from the weight memory 402 and buffers it at each PE in the arithmetic circuit 403. The arithmetic circuit 403 takes the data of matrix A from the input memory 401, performs a matrix operation with matrix B, and stores the partial or final result of the obtained matrix in the accumulator 408; for example, the arithmetic circuit 403 stores the output result of a convolutional layer of a convolutional neural network in the accumulator 408.
It should be noted that the accumulator 408 generally has a function of storing or caching, and the local cache mentioned in the embodiment of the present application may be located in the accumulator 408, or the local cache mentioned in the embodiment of the present application refers to the accumulator 408, so that the arithmetic circuit 403 stores the output result of the convolutional layer in the accumulator 408, and the vector calculation unit 407 obtains the data result of the convolutional layer from the accumulator 408 to perform the processing of obtaining the first characteristic diagram or the second characteristic diagram, and the like.
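As a purely illustrative software analogy of the accumulation behaviour described above (not a description of the NPU hardware), a tiled matrix multiply whose partial products are summed into an accumulator C can be sketched as follows:

import torch

A = torch.rand(4, 8)    # input matrix data, analogous to the input memory 401
B = torch.rand(8, 6)    # weight matrix data, analogous to the weight memory 402
C = torch.zeros(4, 6)   # accumulator holding partial results, analogous to 408
tile = 2
for k in range(0, A.shape[1], tile):
    C += A[:, k:k + tile] @ B[k:k + tile, :]  # partial product accumulated into C
assert torch.allclose(C, A @ B)               # final result equals the full product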
The vector calculation unit 407 may further process the output of the operation circuit 403, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and so on. For example, the vector calculation unit 407 may be used for network calculation of the non-convolution/non-FC layer in the neural network, such as pooling (Pooling), batch Normalization (BN), local response normalization (local response normalization), and the like. In some implementations, the vector calculation unit 407 can store the processed output vector to the unified memory 406. For example, the vector calculation unit 407 may apply a non-linear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate the activation value.
In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 403, for example for use in subsequent layers in a neural network.
In this embodiment, a computer-readable storage medium is also provided, which stores instructions that, when executed on a computing device, cause the computing device to perform the method provided above.
In this embodiment, there is also provided a computer program product containing instructions which, when run on a computing device, cause the computing device to perform the method provided above.
In this embodiment, a chip is further provided, where the chip includes a processor and a data interface, and the processor reads an instruction stored in a memory through the data interface to execute the method provided above.
In a specific implementation process, the chip may be implemented in the form of a Central Processing Unit (CPU), a Micro Controller Unit (MCU), a microprocessor unit (MPU), a Digital Signal Processor (DSP), a system on chip (SoC), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA), or a Programmable Logic Device (PLD).
Optionally, in a specific implementation, the number of the processors is not limited. The processor is a general purpose processor, which can alternatively be implemented in hardware or in software. When implemented in hardware, the processor is a logic circuit, an integrated circuit, or the like; when implemented in software, the processor is a general-purpose processor implemented by reading software code stored in a memory integrated with the processor, located external to the processor, and residing independently.
The above-described embodiments may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments are implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer.
Alternatively, the computer is a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions can be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, by wire (e.g., infrared, wireless, microwave, etc.) from one website, computer, server, or data center to another website, computer, server, or data center.
The computer-readable storage medium is any available medium that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains one or more collections of available media. The usable medium is a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium is a solid state disk.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects before and after it, but may also indicate an "and/or" relationship, which can be understood with reference to the context.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply any order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method can be implemented in other ways.
For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a logical division, and in actual implementation, there can be another division, for example, multiple units or components are combined or integrated into another system, or some features can be omitted, or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed is an indirect coupling or communication connection through some interfaces, devices or units, and is in an electrical, mechanical or other form.
Optionally, the units described as separate parts are physically separated or not, and the parts displayed as units are physically separated or not, i.e. located in one place, or distributed on a plurality of network units. Some or all of the units are selected according to actual needs to achieve the purpose of the scheme of the embodiment.
In addition, functional units in the embodiments of the present application can be integrated into one processing unit, and optionally, each unit exists alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, can be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (26)

1. A network training method based on a feature domain is characterized by comprising the following steps:
inputting a training image into an encoding and decoding network for processing to obtain a reconstructed feature map, wherein the training image is any image in a training image set;
inputting a reconstructed image of the training image or the training image into an image domain network to obtain a first feature map;
inputting the reconstructed feature map into a feature domain network for processing to obtain a second feature map;
determining a loss function of the feature domain network based on the first feature map and the second feature map;
and updating the model parameters based on the characteristic domain network according to the loss function based on the characteristic domain network.
2. The method of claim 1, wherein the loss terms of the loss function of the feature domain network comprise the distance between the first feature map and the second feature map.
3. The method of claim 1 or 2, wherein the inputting the reconstructed image of the training image or the training image into the image domain network to obtain the first feature map comprises:
when the code rate point corresponding to the coding and decoding network meets a first condition, inputting the reconstructed image of the training image into the image domain network for processing to obtain the first feature map; or
when the code rate point meets a second condition, inputting the training image into the image domain network for processing to obtain the first feature map.
4. The method of any of claims 1-3, wherein the size of the first feature map matches the size of the second feature map.
5. The method according to any one of claims 1-4, further comprising inputting the training image into the codec network for processing to obtain the reconstructed image.
6. The method of claim 2, wherein the loss terms of the loss function of the feature domain network further comprise a loss term of the loss function used to train the image domain network.
7. The method of any one of claims 1-6, wherein the updating the model parameters of the feature domain network according to the loss function of the feature domain network comprises:
calculating the gradient of each network layer parameter of the feature domain network according to the value of the loss function of the feature domain network, and updating the model parameters of the feature domain network according to the gradients.
8. The method of any one of claims 1-7, wherein the image domain based network is a trained fixed model parameter network.
9. An image feature domain-based network application method is characterized by comprising the following steps:
inputting an original image into a coding and decoding network for processing to obtain a reconstruction characteristic diagram;
inputting the reconstructed feature map into a trained feature domain-based network for processing to obtain an analysis result, wherein the trained feature domain-based network is obtained by performing iteration updating for N times; the loss function updated by the nth iteration is obtained based on a first feature map and a second feature map, wherein the first feature map is obtained by processing a reconstructed image of the original image or the original image based on an image domain network; the second feature map is obtained by processing the reconstructed feature map based on the feature domain network after the (n-1) th iteration updating;
wherein n and N are both positive integers, and 1 <= n <= N.
10. The method of claim 9, wherein the loss terms of the loss function of the feature domain network comprise the distance between the first feature map and the second feature map.
11. The method according to claim 9 or 10, characterized in that the method further comprises: and determining the category of the original image according to the analysis result.
12. The method according to any one of claims 9-11, wherein the reconstructed image is obtained by inputting the original image into the codec network for processing.
13. A training device based on a feature domain network, comprising:
the acquisition module is used for inputting a training image into the coding and decoding network for processing to obtain a reconstructed feature map, wherein the training image is any image in a training image set; inputting a reconstructed image of the training image or the training image into an image domain network to obtain a first feature map; and inputting the reconstructed feature map into a feature domain network to obtain a second feature map;
a determining module for determining the feature domain network based loss function based on the first feature map and the second feature map;
and the updating module is used for updating the model parameters based on the characteristic domain network according to the loss function based on the characteristic domain network.
14. The apparatus of claim 13, wherein the feature domain network based loss function comprises a distance between the first feature map and the second feature map.
15. The apparatus of claim 13 or 14, wherein the inputting the reconstructed image of the training image or the training image into the image domain network to obtain the first feature map comprises:
when the code rate point corresponding to the coding and decoding network meets a first condition, the reconstructed image is used as the input of the image domain network; or
And when the code rate point meets a second condition, the training image is used as the input of the image domain network.
16. The apparatus of any of claims 13-15, wherein the size of the first feature map matches the size of the second feature map.
17. The apparatus according to any of claims 13-16, wherein the obtaining module is further configured to input the training image into the codec network for processing to obtain the reconstructed image.
18. The apparatus of any of claims 13-17, wherein the feature domain network based loss function further comprises a loss term for training the image domain network based loss function.
19. The apparatus of any one of claims 13-18, wherein the update module is configured to: and calculating the gradient of each network layer parameter of the characteristic domain-based network according to the loss function result of the characteristic domain-based network, and updating the characteristic domain-based network parameter according to the gradient.
20. The apparatus of any of claims 13-19, wherein the image domain based network is a trained fixed model parameter network.
21. An image feature domain-based network application device, comprising:
the acquisition module is used for inputting the original image into the coding and decoding network for processing to obtain a reconstruction characteristic map; inputting the reconstructed feature map into a trained feature domain-based network for processing to obtain an analysis result, wherein the trained feature domain-based network is obtained by performing iteration updating for N times; the loss function updated by the nth iteration is obtained based on a first feature map and a second feature map, wherein the first feature map is obtained by processing a reconstructed image of the original image or the original image based on an image domain network; the second characteristic diagram is obtained by processing the reconstructed characteristic diagram based on the characteristic domain network after the (n-1) th iteration;
and the output module is used for outputting the analysis result.
wherein n and N are both positive integers, and 1 <= n <= N.
22. The apparatus of claim 21, wherein the feature domain network based loss function comprises a distance between the first feature map and the second feature map.
23. The apparatus of any one of claims 21-22, further comprising:
and the determining module is used for determining the category of the image according to the analysis result.
24. The apparatus according to any one of claims 21-23, wherein the reconstructed image is obtained by inputting the original image into the codec network for processing.
25. A computer device, comprising: a processor and a memory, the memory for storing a program, the processor for invoking and running the program from the memory to perform the method of any one of claims 1 to 20.
26. A computer-readable storage medium, comprising a computer program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 20.
CN202110422769.2A 2021-04-17 2021-04-17 Feature domain network training method and device Pending CN115294429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110422769.2A CN115294429A (en) 2021-04-17 2021-04-17 Feature domain network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110422769.2A CN115294429A (en) 2021-04-17 2021-04-17 Feature domain network training method and device

Publications (1)

Publication Number Publication Date
CN115294429A true CN115294429A (en) 2022-11-04

Family

ID=83818893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110422769.2A Pending CN115294429A (en) 2021-04-17 2021-04-17 Feature domain network training method and device

Country Status (1)

Country Link
CN (1) CN115294429A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452696A (en) * 2023-06-16 2023-07-18 山东省计算中心(国家超级计算济南中心) Image compressed sensing reconstruction method and system based on double-domain feature sampling

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452696A (en) * 2023-06-16 2023-07-18 山东省计算中心(国家超级计算济南中心) Image compressed sensing reconstruction method and system based on double-domain feature sampling
CN116452696B (en) * 2023-06-16 2023-08-29 山东省计算中心(国家超级计算济南中心) Image compressed sensing reconstruction method and system based on double-domain feature sampling

Similar Documents

Publication Publication Date Title
CN111868751B (en) Using non-linear functions applied to quantization parameters in machine learning models for video coding
US11310498B2 (en) Receptive-field-conforming convolutional models for video coding
US11025907B2 (en) Receptive-field-conforming convolution models for video coding
US10805629B2 (en) Video compression through motion warping using learning-based motion segmentation
WO2020191402A1 (en) Video compression using deep generative models
KR20200044652A (en) Method and apparatus for assessing subjective quality of a video
US11893762B2 (en) Method and data processing system for lossy image or video encoding, transmission and decoding
WO2022021938A1 (en) Image processing method and device, and neutral network training method and device
WO2020061008A1 (en) Receptive-field-conforming convolution models for video coding
CN116803079A (en) Scalable coding of video and related features
US20240037802A1 (en) Configurable positions for auxiliary information input into a picture data processing neural network
US20240161488A1 (en) Independent positioning of auxiliary information in neural network based picture processing
CN115294429A (en) Feature domain network training method and device
TWI826160B (en) Image encoding and decoding method and apparatus
WO2023193629A1 (en) Coding method and apparatus for region enhancement layer, and decoding method and apparatus for area enhancement layer
EP4231644A1 (en) Video frame compression method and apparatus, and video frame decompression method and apparatus
CN114979711A (en) Audio/video or image layered compression method and device
TWI834087B (en) Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product
CN117714702A (en) Video encoding method, apparatus and storage medium
CN115994956A (en) Image coding and decoding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination