CN112825132A - Method, apparatus and readable storage medium for generating feature map - Google Patents


Info

Publication number
CN112825132A
CN112825132A (application CN202011282774.XA)
Authority
CN
China
Prior art keywords
dnn
training
regularization
feature
smoothness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011282774.XA
Other languages
Chinese (zh)
Other versions
CN112825132B
Inventor
蒋薇 (Wei Jiang)
王炜 (Wei Wang)
刘杉 (Shan Liu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent America LLC
Original Assignee
Tencent America LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/063,111 (granted as US11544569B2)
Application filed by Tencent America LLC filed Critical Tencent America LLC
Publication of CN112825132A
Application granted
Publication of CN112825132B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

Embodiments of the application provide a method, an apparatus, and a readable storage medium for generating a feature map. The method comprises: receiving an image through a deep neural network (DNN) and, while the DNN is in a trained state, generating a first feature map based on the image, wherein the DNN is configured to perform a task based on the image and is trained using a training image by a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.

Description

Method, apparatus and readable storage medium for generating feature map
Priority is claimed to U.S. provisional application No. 62/938,672, filed on November 21, 2019, and U.S. non-provisional application No. 17/063,111, filed on October 5, 2020, the disclosures of both of which are incorporated by reference in their entirety.
Technical Field
Embodiments of the present application relate to video encoding and decoding technologies, and more particularly, to a method, an apparatus, and a readable storage medium for generating a feature map.
Background
ISO/IEC MPEG (JTC 1/SC 29/WG 11) (International Organization for Standardization/International Electrotechnical Commission Moving Picture Experts Group (Joint Technical Committee 1/Subcommittee 29/Working Group 11)) has been actively searching for potential needs for standardization of future video codec technology for visual analysis and understanding. ISO adopted the Compact Descriptors for Visual Search (CDVS) standard as a still-image standard in 2015, which extracts feature representations for image similarity matching. The Compact Descriptors for Video Analysis (CDVA) standard, listed as Part 15 of MPEG-7 and ISO/IEC 15938-15 and completed in 2018, extracts global and local, hand-designed and deep neural network (DNN)-based feature descriptors for video segments. The success of DNNs in a wide range of video applications, such as semantic classification, object detection/recognition, object tracking, and video quality enhancement, creates a strong need for compressed DNN models. The Moving Picture Experts Group (MPEG) is also working on the coded representation of neural networks (NNR), which encodes DNN models to save both storage and computation.
In July 2019, an ad hoc group was established for the Video Coding for Machines (VCM) standard to explore the topic of "compression coding for machine vision and compression for human-machine hybrid systems", aiming to develop a standard that can be implemented on chip for widespread use with any video-related Internet of Things (IoT) device. In contrast to the earlier Compact Descriptors for Video Analysis (CDVA) and CDVS, VCM is an emerging standard for machine-oriented video that can be considered a superset of CDVA. By combining multiple feature maps of a neural network backbone, VCM can handle more advanced visual analysis tasks such as semantic segmentation and video restoration. There remains a need to improve the compression efficiency of feature maps to further save storage space and transmission resources.
Disclosure of Invention
Embodiments of the application provide a method, an apparatus, and a readable storage medium for training a deep neural network, generating a feature map through the trained deep neural network, and compressing the feature map, wherein the deep neural network is trained using a sparsity regularization process with smoothness.
According to at least one embodiment, a method for generating a feature map is provided. The method comprises the following steps: receiving an image through a Deep Neural Network (DNN); and generating, by the DNN, a first feature map based on the image while the DNN is in a trained state, wherein the DNN is configured to perform a task based on the image, and the DNN is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
According to one embodiment, the method further comprises: training the DNN, the training process comprising: generating a second feature map based on the training image through a network forward computation performed by the DNN, the network forward computation including the feature sparsity regularization process with smoothness; calculating a regularization loss of the second feature map based on the training images; calculating a smoothing loss of the second feature map based on the training image; calculating a total gradient based on the calculated regularization and smoothing losses; and updating the network coefficients of the DNN by performing the back propagation and weight update procedures based on the calculated overall gradient.
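The training steps enumerated in this embodiment (forward computation, regularization loss, smoothing loss, overall gradient, back propagation and weight update) can be sketched as follows. This is a minimal illustration and not the claimed implementation: the "DNN" is a single linear layer with ReLU, the regularization loss is assumed to be an L1 norm, the smoothing loss is assumed to be a squared difference between neighboring responses, and a finite-difference gradient stands in for back propagation. All names and hyper-parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-layer "DNN": feature map f = relu(W @ x), 1-D for brevity.
W = rng.normal(scale=0.5, size=(8, 4))

def regularization_loss(f):
    # Assumed L1 sparsity regularization on the feature map.
    return np.abs(f).sum()

def smoothing_loss(f):
    # Assumed smoothing loss: penalize differences between neighboring responses.
    return ((f[1:] - f[:-1]) ** 2).sum()

def total_loss(Wm, x, lam_r, lam_s):
    f = np.maximum(0.0, Wm @ x)            # network forward computation
    return lam_r * regularization_loss(f) + lam_s * smoothing_loss(f)

def train_step(x, lam_r=0.1, lam_s=0.1, lr=0.05, eps=1e-6):
    """One iteration: forward, losses, overall gradient, weight update."""
    global W
    grad = np.zeros_like(W)                # finite differences stand in for
    for idx in np.ndindex(*W.shape):       # back-propagating the overall gradient
        Wp = W.copy()
        Wp[idx] += eps
        grad[idx] = (total_loss(Wp, x, lam_r, lam_s)
                     - total_loss(W, x, lam_r, lam_s)) / eps
    W = W - lr * grad                      # back propagation and weight update
    return total_loss(W, x, lam_r, lam_s)

x = rng.normal(size=4)
losses = [train_step(x) for _ in range(30)]
print(losses[0], losses[-1])               # the combined loss should not increase
```

In a real system the empirical task loss of the next embodiment would be added to the two regularization terms before computing the overall gradient.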
According to one embodiment, the training the DNN further comprises: calculating, based on the training image, a loss of empirical data for the task performed by the DNN, and the calculating the overall gradient comprises: calculating the overall gradient based on the calculated empirical data loss, regularization loss, and smoothing loss.
According to one embodiment, the training the DNN further comprises: accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients including the overall gradient, wherein the updating the network coefficients of the DNN is performed based on the accumulated plurality of overall gradients.
According to one embodiment, the training of the DNN occurs over a plurality of iterations, and the training further comprises: varying the hyper-parameters across the iterations such that the training first emphasizes learning a sparse feature map and then, in subsequent iterations, emphasizes smoothing of the feature responses.
According to one embodiment, the training of the DNN occurs over a plurality of iterations, and the training further comprises: varying the hyper-parameters across the iterations such that the training first emphasizes smoothness within the spatial dimensions and then, in subsequent iterations, emphasizes smoothness along the channel dimension.
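The two hyper-parameter schedules above can be sketched with a simple linear interpolation between a starting and an ending pair of loss weights. The linear form is an assumption for illustration only; the embodiments state only that the emphasis shifts across iterations.

```python
def hyperparameters(step, total_steps, start=(1.0, 0.1), end=(0.1, 1.0)):
    """Linearly shift emphasis from the first loss term to the second.

    For the sparsity-then-smoothness schedule the pair is (sparsity weight,
    smoothness weight); for the spatial-then-channel schedule it is
    (spatial smoothness weight, channel smoothness weight)."""
    t = step / max(1, total_steps - 1)
    return tuple((1 - t) * a + t * b for a, b in zip(start, end))

early = hyperparameters(0, 100)     # first term dominates in early iterations
late = hyperparameters(99, 100)     # second term dominates in later iterations
print(early, late)
```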
According to one embodiment, the back propagation and weight update process updates the network coefficients of the DNN during the training process based on regularization and smoothing penalties calculated based on the output of the feature sparse regularization process with smoothness.
According to one embodiment, the back propagation and weight update process updates the network coefficients of the DNN in the training process based on: a computed regularization and smoothing loss, wherein the computed regularization and smoothing loss is computed based on an output of the feature sparse regularization with smoothness, and a computed empirical data loss for the task performed by the DNN.
According to one embodiment, the method further comprises compressing the first feature map.
According to one embodiment, the DNN is configured to perform at least one of semantic segmentation, image or video classification, object detection, and image or video super resolution as the task.
In accordance with one or more embodiments, a system is provided. The system comprises: at least one memory configured to store computer program code; and a Deep Neural Network (DNN) implemented by at least one processor configured to access the computer program code and to operate as directed by the computer program code. The computer program code includes: obtaining code configured to cause the at least one processor to generate a first feature map based on an image input into the DNN while the DNN is in a trained state, wherein the DNN is configured to perform a task based on the image and the DNN is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
According to at least one embodiment, an apparatus for generating a feature map is provided. The device comprises: an acquisition module to generate a first feature map based on an image input into a Deep Neural Network (DNN) while the DNN is in a trained state, wherein the DNN is configured to perform a task based on the image and is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
According to one embodiment, the DNN is trained by: generating a second feature map based on the training image by performing a network forward computation by the DNN, the network forward computation including the feature sparsity regularization process with smoothness; calculating a regularization loss of the second feature map based on the training images; calculating a smoothing loss of the second feature map based on the training image; calculating a total gradient based on the calculated regularization and smoothing losses; and updating network coefficients of the DNN by performing the back propagation and weight update processes based on the calculated overall gradient.
According to one embodiment, the DNN is further trained by: calculating a loss of empirical data for the task performed by the DNN based on the training images, wherein the overall gradient is calculated based on the calculated loss of empirical data, regularization loss, and smoothing loss.
According to one embodiment, the DNN is further trained by accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients including the overall gradient, wherein the network coefficients of the DNN are updated based on the accumulated plurality of overall gradients.
According to one embodiment, the DNN is trained by: updating the DNN over a plurality of iterations and changing the hyper-parameters across the iterations, such that the training first emphasizes learning a sparse feature map and then, in subsequent iterations, emphasizes smoothing of the feature responses.
According to one embodiment, the DNN is trained by: updating the DNN over a plurality of iterations and changing the hyper-parameters across the iterations, such that the training first emphasizes smoothness within the spatial dimensions and then, in subsequent iterations, emphasizes smoothness along the channel dimension.
According to one embodiment, the back propagation and weight update process updates the network coefficients of the DNN during the training process based on regularization and smoothing penalties calculated based on the output of the feature sparse regularization process with smoothness.
According to one embodiment, the back propagation and weight update process updates the network coefficients of the DNN in the training process based on: a computed regularization and smoothing loss, wherein the computed regularization and smoothing loss is computed based on an output of the feature sparse regularization with smoothness, and a computed empirical data loss for the task performed by the DNN.
According to one embodiment, the apparatus further comprises a compression module for compressing the first profile.
According to an embodiment of the application, a non-transitory computer-readable storage medium storing computer instructions configured to, when executed by at least one processor implementing a Deep Neural Network (DNN), cause the at least one processor to: generating a first feature map based on an image input into the DNN while the DNN is in a trained state, wherein the DNN is configured to perform tasks based on the image, and the DNN is trained using training images by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
The smoothness regularization and sparsity regularization of the embodiments of the application can further improve the compression efficiency of the feature map extracted from the input image, so that the feature map can be efficiently stored and transmitted.
Drawings
Further features, nature, and various advantages of the presently disclosed subject matter will become apparent from the following detailed description and the accompanying drawings, wherein:
FIG. 1 is a schematic diagram of an environment in which methods, apparatus, and systems described herein may be implemented according to embodiments of the application.
FIG. 2 is a block diagram of example components of at least one of the devices of FIG. 1.
FIG. 3 is a block diagram of a DNN according to an embodiment of the present application;
FIG. 4 is a schematic diagram indicating DNN training according to an embodiment of the present application;
FIG. 5A is a block diagram of a DNN training system according to an embodiment of the present application;
FIG. 5B is a flow diagram of a method for generating a feature map according to an embodiment of the present application;
FIG. 5C is a flow diagram of another method for generating a feature map according to an embodiment of the present application;
FIG. 6 is a block diagram of a computing system according to an embodiment of the present application.
FIG. 7 is a block diagram of a computing system according to an embodiment of the present application.
Detailed Description
Fig. 1 is a schematic diagram of an environment 100 in which methods, apparatus, and systems described herein may be implemented, according to an embodiment. As shown in FIG. 1, environment 100 may include user device 110, platform 120, and network 130. The devices of environment 100 may be interconnected by wired connections, wireless connections, or a combination of wired and wireless connections.
User device 110 comprises at least one device capable of receiving, generating, storing, processing, and/or providing information related to platform 120. For example, the user device 110 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smartphone, a wireless phone, etc.), a wearable device (e.g., smart glasses or a smart watch), or similar device. In some implementations, user device 110 may receive information from platform 120 and/or transmit information to platform 120.
The platform 120 includes at least one device as described elsewhere herein. In some implementations, the platform 120 may include a cloud server or a group of cloud servers. In some embodiments, the platform 120 may be designed to be modular such that software components may be swapped in and out according to particular needs. In this way, platform 120 may be easily and/or quickly reconfigured to have a different purpose.
In some implementations, as shown, the platform 120 may be hosted in a cloud computing environment 122. Notably, although the embodiments described herein describe the platform 120 as being hosted in the cloud computing environment 122, in some embodiments the platform 120 is not cloud-based (i.e., may be implemented outside of the cloud computing environment) or may be partially cloud-based.
Cloud computing environment 122 comprises an environment hosting the platform 120. The cloud computing environment 122 may provide computing, software, data access, storage, etc. services that do not require an end user (e.g., user device 110) to know the physical location and configuration of the systems and/or devices hosting the platform 120. As shown, the cloud computing environment 122 may include a set of computing resources 124 (referred to collectively as "computing resources 124" and individually as "computing resource 124").
Computing resources 124 include at least one personal computer, workstation computer, server device, or other type of computing and/or communication device. In some implementations, the computing resources 124 may host the platform 120. Cloud resources may include computing instances executing in computing resources 124, storage devices provided in computing resources 124, data transfer devices provided by computing resources 124, and so forth. In some implementations, the computing resources 124 may communicate with other computing resources 124 through wired connections, wireless connections, or a combination of wired and wireless connections.
As further shown in FIG. 1, computing resources 124 include a set of cloud resources, such as at least one application ("APP") 124-1, at least one virtual machine ("VM") 124-2, virtualized storage ("VS") 124-3, at least one hypervisor ("HYP") 124-4, and so forth.
The application 124-1 includes at least one software application that may be provided to or accessed by the user device 110 and/or the platform 120. The application 124-1 may eliminate the need to install and execute software applications on the user device 110. For example, the application 124-1 may include software related to the platform 120 and/or any other software capable of being provided through the cloud computing environment 122. In some embodiments, one application 124-1 may send/receive information to/from at least one other application 124-1 through the virtual machine 124-2.
The virtual machine 124-2 comprises a software implementation of a machine (e.g., a computer) that executes programs like a physical machine. The virtual machine 124-2 may be a system virtual machine or a process virtual machine, depending on the use of, and degree of correspondence to, a real machine by the virtual machine 124-2. A system virtual machine may provide a complete system platform that supports execution of a complete operating system ("OS"). A process virtual machine may execute a single program and may support a single process. In some implementations, the virtual machine 124-2 may execute on behalf of a user (e.g., the user device 110) and may manage the infrastructure of the cloud computing environment 122, such as data management, synchronization, or long-duration data transfers.
Virtualized storage 124-3 includes at least one storage system and/or at least one device that uses virtualization technology within the storage system or device of computing resources 124. In some embodiments, within the context of a storage system, the types of virtualization may include block virtualization and file virtualization. Block virtualization may refer to the abstraction (or separation) of logical storage from physical storage so that a storage system may be accessed without regard to physical storage or heterogeneous structure. The separation may allow an administrator of the storage system to flexibly manage end-user storage. File virtualization may eliminate dependencies between data accessed at the file level and the location where the file is physically stored. This may optimize performance of storage usage, server consolidation, and/or uninterrupted file migration.
Hypervisor 124-4 may provide hardware virtualization techniques that allow multiple operating systems (e.g., "guest operating systems") to execute concurrently on a host computer, such as computing resources 124. Hypervisor 124-4 may provide a virtual operating platform to the guest operating systems and may manage the execution of the guest operating systems. Multiple instances of various operating systems may share virtualized hardware resources.
The network 130 includes at least one wired and/or wireless network. For example, the Network 130 may include a cellular Network (e.g., a fifth generation (5G) Network, a Long Term Evolution (LTE) Network, a third generation (3G) Network, a Code Division Multiple Access (CDMA) Network, etc.), a Public Land Mobile Network (PLMN), a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a Telephone Network (e.g., a Public Switched Telephone Network (PSTN)), a private Network, an ad hoc Network, an intranet, the internet, a fiber-based Network, etc., and/or a combination of these or other types of networks.
The number and arrangement of devices and networks shown in fig. 1 are provided as examples. In practice, there may be more devices and/or networks, fewer devices and/or networks, different devices and/or networks, or a different arrangement of devices and/or networks than those shown in FIG. 1. Further, two or more of the devices shown in fig. 1 may be implemented within a single device, or a single device shown in fig. 1 may be implemented as multiple distributed devices. Additionally or alternatively, a set of devices (e.g., at least one device) of environment 100 may perform at least one function described as being performed by another set of devices of environment 100.
FIG. 2 is a block diagram of example components of at least one of the devices of FIG. 1. Device 200 may correspond to user device 110 and/or platform 120. As shown in fig. 2, device 200 may include a bus 210, a processor 220, a memory 230, a storage component 240, an input component 250, an output component 260, and a communication interface 270.
Bus 210 includes components that allow communication among the components of device 200. The processor 220 is implemented in hardware, firmware, or a combination of hardware and software. Processor 220 is a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Accelerated Processing Unit (APU), a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), or another type of processing component. In some embodiments, processor 220 includes at least one processor that can be programmed to perform functions. Memory 230 includes a Random Access Memory (RAM), a Read Only Memory (ROM), and/or another type of dynamic or static storage device (e.g., flash memory, magnetic memory, and/or optical memory) that stores information and/or instructions for use by processor 220.
The storage component 240 stores information and/or software related to the operation and use of the device 200. For example, storage component 240 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optical disk, and/or a solid state disk), a Compact Disc (CD), a Digital Versatile Disc (DVD), a floppy disk, a cassette tape, a magnetic tape, and/or another type of non-volatile computer-readable storage medium, and a corresponding drive.
Input components 250 include components that allow device 200 to receive information, such as through user input, for example, a touch screen display, a keyboard, a keypad, a mouse, buttons, switches, and/or a microphone. Additionally or alternatively, input component 250 may include sensors for sensing information (e.g., Global Positioning System (GPS) components, accelerometers, gyroscopes, and/or actuators). Output components 260 include components that provide output information from device 200, such as a display, a speaker, and/or at least one Light Emitting Diode (LED).
Communication interface 270 includes transceiver-like components (e.g., a transceiver and/or a separate receiver and transmitter) that enable device 200 to communicate with other devices, e.g., over a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 270 may allow device 200 to receive information from another device and/or provide information to another device. For example, the communication interface 270 may include an ethernet interface, an optical interface, a coaxial interface, an infrared interface, a Radio Frequency (RF) interface, a Universal Serial Bus (USB) interface, a Wi-Fi interface, a cellular network interface, and/or the like.
Device 200 may perform at least one process described herein. Device 200 may perform these processes in response to processor 220 executing software instructions stored by a non-transitory computer-readable storage medium (e.g., memory 230 and/or storage component 240). A computer-readable storage medium is defined herein as a non-volatile memory device. The memory device includes storage space within a single physical storage device or storage space distributed across multiple physical storage devices.
The software instructions may be read into memory 230 and/or storage component 240 from another computer-readable storage medium or from another device via communication interface 270. When executed, software instructions stored in memory 230 and/or storage component 240 may cause processor 220 to perform at least one process described herein. Additionally or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement at least one process described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
The number and arrangement of components shown in fig. 2 are provided as examples. In practice, the device 200 may include more components, fewer components, different components, or a different arrangement of components than those shown in FIG. 2. Additionally or alternatively, a set of components (e.g., at least one component) of device 200 may perform at least one function described as being performed by another set of components of device 200.
The embodiments described below may be implemented by at least one component of the environment 100. For example, the embodiments described below may be implemented by the user device 110 or the platform 120. The embodiments described below may be implemented by at least one processor and memory storing computer instructions. The computer instructions may be configured to cause the functions of embodiments of the present application to be performed when executed by at least one processor. For example, the computer instructions may be configured to cause the at least one processor to implement at least one DNN, train the at least one DNN, and compress an output of the at least one DNN as described below.
[ feature map sparsification and smoothing ]
With reference to fig. 3-5A, a DNN training system including feature map sparsification with smoothing is described.
Let D = {(x, y)} denote a data set, where an input x is passed through a DNN (e.g., the DNN in diagram 300 of FIG. 3) to generate a feature map f. The DNN is a part of another DNN N0, where N0 is trained for a task using a data set D0 = {(x0, y0)} in which each input x0 is associated with a label y0. For example, for a semantic segmentation task, each input x0 can be a color image, and the label y0 can be a segmentation map with the same resolution as the input x0, where each item in the segmentation map gives the index of the semantic category assigned to the corresponding pixel in the input x0. For a super-resolution task, the input x0 may be a low-resolution image generated from the label y0, where y0 is the ground-truth high-resolution image. According to an embodiment, the DNN may be a part of DNN N0 and may be referred to as the backbone (or feature extraction) network of N0. The data set D may be the same as the data set D0. It may also be a data set different from D0 but with a similar data distribution (e.g., the input x and the input x0 have the same dimensions, and p(x, y) = p(x0, y0), where y is the underlying annotation associated with the input x).
The feature map f may be a general 3D tensor of size (c, h, w), where h, w, and c are the height, width, and depth of the feature map f. For example, for semantic segmentation or super-resolution tasks, h and w may be the same as the original height and width of the input x (e.g., an image), which determines the resolution of the feature map f. Due to the nature of DNN computations, e.g., convolution followed by nonlinear activation operations like ReLU and pooling operations like maxpooling, the feature map f may be quite sparse. That is, many entries of the feature map f (i.e., the 3D tensor) are zero, and many entries are close to zero or may be set to zero with little effect on the subsequent computation in DNN N0. Thus, further compression processes such as quantization and entropy coding may be applied to the sparse feature map f to substantially reduce its size for efficient storage and transmission of the feature map f.
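The benefit of sparsity for the subsequent compression steps can be checked with a toy experiment (sizes and the compressor choice are illustrative assumptions): serializing a mostly-zero feature map and entropy coding it with a general-purpose compressor yields a much smaller bitstream than for a dense map with the same number of entries.

```python
import random
import struct
import zlib

random.seed(0)

def serialize(values):
    # 32-bit float serialization; zlib stands in for an entropy coder.
    return b"".join(struct.pack("<f", v) for v in values)

# Dense feature map: every entry carries a non-zero response.
dense = [random.uniform(-1.0, 1.0) for _ in range(4096)]
# Sparse feature map: ~90% of entries are exactly zero, as after ReLU
# and setting near-zero responses to zero.
sparse = [v if random.random() < 0.1 else 0.0 for v in dense]

dense_size = len(zlib.compress(serialize(dense)))
sparse_size = len(zlib.compress(serialize(sparse)))
print(dense_size, sparse_size)    # the sparse map compresses far better
```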
[ smoothing for compression ]
From the compression point of view, compression efficiency can be greatly improved when the feature map f has smoothness properties. For example, when nearby entries of the feature map f have similar values, these similar values may be quantized to the same value without large influence on subsequent calculations. Because the original input x (e.g., the original input image) is naturally smooth and the convolution operations of the DNN preserve smoothness, the feature map f should be naturally smooth along the h and w dimensions, and it is therefore reasonable to pursue smoothness in the spatial domain. From a feature extraction perspective, the features of different channels extract different aspects of information to represent the input x, and the feature map f typically has low responses (low feature values) in most regions. For example, for a semantic segmentation task, the feature map for one channel may have a high response over objects of one semantic class (e.g., cars) and a low response over all other regions. Therefore, for the vast majority of entries of a sparse feature map, it is also reasonable to pursue local smoothness along the channel dimension.
Embodiments of the present application may implement a loss function L_S(f) to measure the above smoothness property, as shown in the following equation (1):

L_S(f) = Σ_{g(l,m,n)∈G} S(f, g(l,m,n))    (1)

where g(l,m,n) ∈ G defines a local neighborhood of size (N1(l,m,n), N2(l,m,n), N3(l,m,n)) centered at (l, m, n); G is the group of such local neighborhoods; and S(f, g(l,m,n)) is a smoothness metric for each local neighborhood, measured on the feature map f. In one embodiment, for each location (l, m, n), its local neighborhood may have a fixed size, e.g., a 3D blob centered at (l, m, n). When N1(l,m,n) = c, N2(l,m,n) = h and N3(l,m,n) = w, g(l,m,n) covers the entire feature map f.
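To make the neighborhood definition concrete, the following sketch (function and parameter names are hypothetical) extracts a fixed-size 3D blob g(l, m, n) from a (c, h, w) feature map, clipping at the tensor borders; passing half-sizes that span the whole tensor reproduces the case where g(l, m, n) covers the entire feature map:

```python
import numpy as np

def neighborhood(f, center, half_sizes):
    """Return the 3D blob g(l, m, n): all entries within half_sizes of
    center along each axis, clipped to the bounds of f."""
    slices = tuple(
        slice(max(c - h, 0), min(c + h + 1, dim))
        for c, h, dim in zip(center, half_sizes, f.shape)
    )
    return f[slices]

f = np.arange(2 * 4 * 4).reshape(2, 4, 4)
g = neighborhood(f, center=(0, 1, 1), half_sizes=(0, 1, 1))
print(g.shape)  # a 1 x 3 x 3 blob around position (0, 1, 1)
```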
The smoothness metric may take a variety of forms. One aspect of the smoothness metric may be to bring the feature responses defined in the neighborhood g(l, m, n) as close to each other as possible (ideally the same). In one example embodiment, the smoothness metric is defined as equation (2):

S(f, g(l,m,n)) = Σ_{(p,i,j)∈g(l,m,n)} (β_l·∇_l f(p,i,j) + β_m·∇_m f(p,i,j) + β_n·∇_n f(p,i,j))^ρ    (2)

where ∇_l f(p,i,j), ∇_m f(p,i,j) and ∇_n f(p,i,j) are respectively the absolute values of the gradients of the feature map f along the three axes, measured at the position (p, i, j), and β_l, β_m, β_n are hyper-parameters for balancing the contributions of the gradients along the different axes. Intuitively, the smoothness metric of equation (2) causes neighboring feature responses along different axes within the local neighborhood to be similar, without significant variation. ρ > 0 is a hyper-parameter, and in one example embodiment, ρ is empirically set to 1. In one embodiment, the gradients are simply calculated as:

∇_l f(p,i,j) = |f(p+1, i, j) − f(p, i, j)|
∇_m f(p,i,j) = |f(p, i+1, j) − f(p, i, j)|
∇_n f(p,i,j) = |f(p, i, j+1) − f(p, i, j)|
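The smoothness loss of equations (1)-(2) can be sketched as follows for the case where a single neighborhood covers the entire feature map. Function and parameter names are illustrative; applying ρ per gradient entry, as done here, matches equation (2) exactly only for the ρ = 1 setting the text describes:

```python
import numpy as np

def smoothness_loss(f, beta=(1.0, 1.0, 1.0), rho=1.0):
    """L_S(f) with one neighborhood g covering the whole (c, h, w) map.

    Gradients are absolute forward differences along the channel (l),
    height (m) and width (n) axes, weighted by beta_l, beta_m, beta_n.
    """
    loss = 0.0
    for axis, b in enumerate(beta):
        grad = np.abs(np.diff(f, axis=axis))  # |f(next) - f(current)| along one axis
        loss += b * (grad ** rho).sum()
    return loss

f = np.arange(8.0).reshape(2, 2, 2)
print(smoothness_loss(f))  # nonzero: this toy map varies along every axis
```

A constant feature map yields a loss of zero, which is why minimizing this term pushes neighboring responses toward equal values.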
[ sparsification with smoothing ]
According to an embodiment, the smoothness loss in equation (1) may be combined with the sparsification regularization loss L_R(f) to promote compression-friendly feature maps during the training of the DNN. Given the data set D = {(x, y)}, the feature map f is actually computed based on the DNN and the input x, and may be expressed as a function of the input x and the DNN N: f = F(x, N). As described above, the DNN may be part of a larger network N0 whose purpose is to predict an output ŷ for each respective input x. Given a pre-trained DNN N0, where the network coefficients of the DNN within N0 are denoted N(0), embodiments of the present application may be configured to find updated optimal network coefficients for the DNN (denoted as DNN N* in graph 400 of FIG. 4) by minimizing a training loss L(D|N). In particular, L(D|N) may be defined as the combination of an empirical data loss L_D(D|N) and the regularization losses, as shown in equation (3) below:

L(D|N) = λ_D·L_D(D|N) + λ_R·L_R(f|N) + λ_S·L_S(f|N)    (3)
all the penalty terms in equation (3) may be adjusted according to the network coefficients of the DNN to emphasize the dependence of the penalty calculations on the network. Loss of empirical data
Figure BDA00027813346300001112
The form of which may be determined by the specific task defined by the data set D. For example, cross-entropy loss may be used for classification tasks, while for semantic segmentation, per-pixel cross-entropy loss may be used. Sparse regularization loss £R(f | N) may take a variety of forms, and in one example embodiment, the L1L2 norm may be used, as shown in equation (4) below:
L_R(f|N) = ‖f‖₂ + η·|f|    (4)

where ‖f‖₂ = (Σ_k f_k²)^(1/2), |f| = Σ_k |f_k|, and k traverses all data entries f_k in the feature map f. L_S(f|N) may be the same as defined in equation (1), as described above, with the notation 'N' added here to emphasize its computational dependence on the DNN. The hyper-parameters λ_D, λ_R, λ_S and η are used to balance the contributions of the different terms.
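The sparsity term of equation (4) and the weighted combination of equation (3) can be sketched as follows. Function names, and the exact placement of η relative to the L1 and L2 norms, are assumptions for illustration:

```python
import numpy as np

def sparsity_loss(f, eta=0.5):
    """L_R(f): L2 norm plus eta-weighted L1 norm over all feature entries
    (one plausible reading of the 'L1L2 norm' of equation (4))."""
    l2 = np.sqrt((f ** 2).sum())
    l1 = np.abs(f).sum()
    return l2 + eta * l1

def joint_loss(data_loss, f, lam_d=1.0, lam_r=0.1, lam_s=0.1, smoothness=0.0):
    """Equation (3): weighted sum of data, sparsity and smoothness losses."""
    return lam_d * data_loss + lam_r * sparsity_loss(f) + lam_s * smoothness

f = np.array([[[3.0, 4.0]]])
print(sparsity_loss(f, eta=1.0))  # 5.0 (L2) + 7.0 (L1) = 12.0
```

Minimizing the L1 component drives many entries of f exactly toward zero, which is what makes the resulting feature map compression-friendly.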
FIG. 5A depicts an example framework of a training system 500 for training the optimal DNN N* using the feature sparsification process with smoothness on the data set D. In particular, to train the optimal network, a label y is given for each respective input x (e.g., D = {(x, y)}). Through the network forward computation process 510, each input x may be passed through a DNN that has the same network structure as the original larger network N0, where the network coefficients of the sub-network corresponding to the feature extraction DNN are denoted N(t), and the remaining network coefficients are denoted N0(t). This process may generate an estimated output ŷ and the feature map f = F(x, N(t)). For the first iteration, N(t) is initialized as N(1) = N(0). The network coefficients N0(t) may be initialized from the pre-trained N0, and, in one example embodiment, the network coefficients N0(t) may also be updated during the training process. Based on the annotation y (e.g., a ground truth annotation) and the estimated output ŷ, the data loss L_D(D|N) in equation (3) may be calculated by a compute data loss process 520. By the compute regularization loss process 530, the regularization loss L_R(f|N) in equation (4) may be calculated based on the generated feature map f. By the compute smoothness loss process 540, the smoothness loss L_S(f|N) in equation (1) may also be calculated based on the generated feature map f. The gradient of the total loss in equation (3) may then be calculated using the compute total gradient process 550 to obtain a total gradient G_total(N). Here, G_total(N) can be computed using the automatic gradient computation methods provided by deep learning frameworks such as TensorFlow or PyTorch. Based on the total gradient G_total(N), updated network coefficients N(t+1) are obtained through back propagation (BP) by a back propagation and weight update process 560, and the training system 500 proceeds to the next iteration. In the back propagation and weight update process 560, embodiments of the present application may choose to accumulate the total gradients G_total(N) of a batch of inputs and update the network coefficients only with the accumulated total gradients. For example, for each input in the batch, a total gradient may be calculated correspondingly, and then the total gradients corresponding to the batch of inputs may be added together to update the network coefficients. The input batch size may be a predefined hyper-parameter, and embodiments of the present application may perform multiple passes over all of the training data, where each pass is referred to as an epoch. Embodiments of the present application may run multiple epochs until the loss converges to an optimal value. In another embodiment, the network coefficients N0(t) may be updated at a different frequency from the network coefficients N(t), or may even remain unchanged.
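The accumulate-then-update scheme of process 560 can be sketched on a toy one-parameter model (a hypothetical stand-in for the DNN: a squared-error data term plus an optional L1-style regularization term; all names and values are illustrative):

```python
import numpy as np

def train(xs, ys, w=0.0, lr=0.01, batch_size=4, epochs=10, lam_r=0.0):
    """Accumulate per-input total gradients over each batch, then apply
    a single weight update per batch, as in process 560."""
    for _ in range(epochs):                      # one epoch = one pass over the data
        for start in range(0, len(xs), batch_size):
            g_acc = 0.0
            for x, y in zip(xs[start:start + batch_size],
                            ys[start:start + batch_size]):
                pred = w * x
                # total gradient: data term plus a sparsity (L1) term
                g_total = 2.0 * (pred - y) * x + lam_r * np.sign(pred)
                g_acc += g_total                 # accumulate within the batch
            w -= lr * g_acc                      # single update per batch
    return w

w = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(w)  # converges toward 2.0, the slope that fits y = 2x
```

The same shape carries over to a real framework: per-input backward passes accumulate gradients, and the optimizer step is taken once per batch.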
During the iterative training process, the hyper-parameters λ_D, λ_R, λ_S, η, β_l, β_m, β_n may be preset and fixed, or may change adaptively as the training process progresses. For example, in one exemplary embodiment, λ_S may be set smaller in early iterations and larger in later iterations, so that the training process first emphasizes learning a sparse feature map and then smoothing the remaining feature responses. In addition, β_l may also be set smaller in the early iterations and larger in the later ones, so that smoothness in the spatial dimensions is emphasized first, and smoothness along the channel dimension is then pursued on the spatially smooth feature map.
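The adaptive schedule described above can be sketched as a simple linear ramp (the ramp shape and endpoint values are assumptions; the embodiments only require "smaller early, larger later"):

```python
def ramp(epoch, total_epochs, low=0.01, high=1.0):
    """Linearly increase a hyper-parameter (e.g., lambda_S or beta_l)
    from `low` at the first epoch to `high` at the last epoch."""
    if total_epochs <= 1:
        return high
    t = epoch / (total_epochs - 1)
    return low + t * (high - low)

schedule = [round(ramp(e, 5), 3) for e in range(5)]
print(schedule)  # small at epoch 0, reaching the maximum at the final epoch
```

Applying this ramp to λ_S lets the sparsity term dominate early training, with the smoothness term taking over once the feature map is already sparse.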
Based on the above description of fig. 5A, as shown in fig. 5B, an embodiment of the present application provides a method for generating a feature map. The method comprises the following steps:
step S501, receiving an image through a Deep Neural Network (DNN); and
step S502, generating a first feature map based on the image by the DNN when the DNN is in a trained state, wherein the DNN is configured to execute a task based on the image, and the DNN is trained by using a feature sparseness regularization process with smoothness and a back propagation and weight update process based on an output of the feature sparseness regularization process with smoothness.
According to an embodiment of the present application, as shown in fig. 5C, the method for generating a feature map further includes: step S503, training the DNN, wherein the training process includes: generating a second feature map based on the training image by performing a network forward computation by the DNN, the network forward computation including the feature sparsity regularization process with smoothness; calculating a regularization loss of the second feature map based on the training images; calculating a smoothing loss of the second feature map based on the training image; calculating a total gradient based on the calculated regularization and smoothing losses; and updating the network coefficients of the DNN by performing the back propagation and weight update procedures based on the calculated overall gradient.
According to an embodiment of the present application, the step S503 of training the DNN further includes: calculating, based on the training image, a loss of empirical data for the task performed by the DNN, and the calculating the overall gradient comprises: calculating the overall gradient based on the calculated empirical data loss, regularization loss, and smoothing loss.
According to an embodiment of the present application, the step S503 of training the DNN further includes: accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients including the overall gradient, wherein the updating the network coefficients of the DNN is performed based on the accumulated plurality of overall gradients.
According to an embodiment of the present application, the training of the DNN is performed through a plurality of iterations, and the step S503 of training the DNN further includes: varying the hyper-parameters during the iterations such that the training emphasizes learning a sparse feature map, and then emphasizing smoothing of feature responses in subsequent ones of the iterations.
According to an embodiment of the present application, the training of the DNN is performed through a plurality of iterations, and the step S503 of training the DNN further includes: varying the hyperparameter in the iterations such that the training emphasizes smoothness within a spatial dimension, and then emphasizing smoothness of a channel width in subsequent ones of the iterations.
According to an embodiment of the application, the back propagation and weight update process updates the network coefficients of the DNN based on regularization and smoothing losses calculated based on the output of the feature sparse regularization process with smoothness in the training process.
According to an embodiment of the application, the back propagation and weight update process updates the network coefficients of the DNN in the training process based on: a computed regularization and smoothing loss, wherein the computed regularization and smoothing loss is computed based on an output of the feature sparse regularization with smoothness, and a computed empirical data loss for the task performed by the DNN.
According to an embodiment of the application, the method further comprises: s504, compressing the first feature map.
According to an embodiment of the application, the DNN is configured to perform at least one of semantic segmentation, image or video classification, object detection, and image or video super resolution as the task.
Referring to FIG. 6, embodiments of the present application may be implemented by a computing system 600. The computing system 600 may include at least one processor and memory storing computer instructions. The computer instructions may be configured to, when executed by the at least one processor, cause the at least one processor to implement the DNN of the present application. The computer instructions may include input code 610, acquisition code 620, and compression code 630.
The input code 610 may be configured to cause the at least one processor to receive an image via a DNN implemented by the at least one processor. The obtaining code 620 may be configured to cause the at least one processor to generate a first feature map based on the image via the DNN. The compression code 630 may be configured to cause the at least one processor to compress the first feature map.
The DNN implemented by the at least one processor may be configured to perform a task based on the image. In addition, the DNN may be in a trained state based on the training process described with reference to fig. 5.
For example, with reference to fig. 7, embodiments of the present application may be implemented by a computing system 700, the computing system 700 configured to train a DNN as described with reference to fig. 5-6. According to embodiments, computing system 600 and computing system 700 may refer to the same computing system or different computing systems. The computing system 700 may include at least one processor and memory storing computer instructions. The computer instructions may be configured to, when executed by the at least one processor, cause the at least one processor to implement the DNNs of the present application and train the DNNs. For example, computer instructions may include training code 710, where training code 710 includes acquisition code 720, calculation code 730, and update code 740.
The training code 710 may be configured to train a DNN with at least one training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
For example, the obtaining code 720 of the training code 710 may be configured to cause the at least one processor to generate a feature map by the DNN based on the training image by performing a network forward computation that includes the feature sparsity regularization process with smoothness. The acquisition code 720 may refer to the same code as the acquisition code 620 or a different code.
The calculation code 730 of the training code 710 may be configured to cause the at least one processor to calculate a regularization loss of the feature map, calculate a smoothing loss of the feature map, calculate an empirical data loss of a task performed by the DNN, and calculate an overall gradient based on the calculated empirical data loss, regularization loss, and smoothing loss.
According to an embodiment, the calculation code 730 may be configured to cause the at least one processor to accumulate a plurality of overall gradients of a batch of training data, including the overall gradient described above.
The update code 740 of the training code 710 may be configured to cause the at least one processor to update the network coefficients of the DNN by performing the back propagation and weight update processes based on the calculated overall gradient (or cumulative total gradients).
According to an embodiment, the update code 740 may be configured to cause the at least one processor to update the DNN over a plurality of iterations, and to change the hyper-parameters over the course of the iterations, such that the training emphasizes learning a sparse feature map, and then emphasizing smoothing of feature responses in subsequent ones of the iterations.
According to an embodiment, the update code 740 may be configured to cause the at least one processor to update the DNN over a plurality of iterations, and update the hyperparameter over the iterations, such that the training emphasizes smoothness within a spatial dimension, and then emphasize channel width smoothness in subsequent ones of the iterations.
Accordingly, an embodiment of the present application provides an apparatus, including: an acquisition module to generate a first feature map based on an image input into a Deep Neural Network (DNN) while the DNN is in a trained state, wherein the DNN is configured to perform a task based on the image and is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
According to an embodiment of the application, the DNN is trained by: generating a second feature map based on the training image by performing a network forward computation by the DNN, the network forward computation including the feature sparsity regularization process with smoothness; calculating a regularization loss of the second feature map based on the training images; calculating a smoothing loss of the second feature map based on the training image; calculating a total gradient based on the calculated regularization and smoothing losses; and updating the network coefficients of the DNN by performing the back propagation and weight update procedures based on the calculated overall gradient.
According to an embodiment of the application, the DNN is further trained by: calculating a loss of empirical data for the task performed by the DNN based on the training images, wherein the overall gradient is calculated based on the calculated loss of empirical data, regularization loss, and smoothing loss.
According to an embodiment of the application, the DNN is further trained by accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients including the overall gradient, wherein network coefficients of the DNN are updated based on the accumulated plurality of overall gradients.
According to an embodiment of the application, the DNN is trained by: the DNN is updated over a number of iterations, and the hyperparameters are changed over the iterations, such that the training emphasizes learning a sparse feature map, and then emphasizing smoothing of feature responses in subsequent ones of the iterations.
According to an embodiment of the application, the DNN is trained by: the DNN is updated over a plurality of iterations, and the hyperparameters are changed over the iterations, such that the training emphasizes smoothness within spatial dimensions, and then channel width smoothness is emphasized in subsequent ones of the iterations.
According to an embodiment of the application, the back propagation and weight update process updates the network coefficients of the DNN based on regularization and smoothing losses calculated based on the output of the feature sparse regularization process with smoothness in the training process.
According to an embodiment of the application, the back propagation and weight update process updates the network coefficients of the DNN in the training process based on: a computed regularization and smoothing loss, wherein the computed regularization and smoothing loss is computed based on an output of the feature sparse regularization with smoothness, and a computed empirical data loss for the task performed by the DNN.
According to an embodiment of the application, the apparatus further includes a compression module configured to compress the first feature map.
The embodiments of the present application may be used alone or in any order in combination. Further, each of the methods and systems (e.g., encoder and decoder) may be implemented by a processing circuit (e.g., at least one processor or at least one integrated circuit). In one example, at least one processor executes a program stored in a non-transitory computer readable storage medium.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the foregoing various alternative implementations.
Embodiments of the application may have at least one of the following advantages.
The smoothness regularization and sparsity regularization of the embodiment of the application can improve the efficiency of further compressing the extracted feature map. The learned DNN can be customized by a training process that optimizes the joint loss of both the original learning objective and the sparsification regularization with smoothing to extract a feature map that is efficient for performing the original task and suitable for subsequent compression.
The methods of embodiments of the present application may be generally applied to data sets having different data formats. The input data x may be a generic 4D tensor, which may be a video clip, a color image or a grayscale image.
The framework of the embodiments of the present application can be generally applied to different tasks of extracting feature maps from a trained backbone network, such as semantic segmentation, image/video classification, object detection, image/video super-resolution, and the like.
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the embodiments.
As used in this application, the term "component" is intended to be broadly interpreted as hardware, firmware, or a combination of hardware and software.
It is to be understood that the systems and/or methods described herein may be implemented in various forms of hardware, firmware, or combinations of hardware and software. The actual specialized control hardware or software code for implementing the systems and/or methods is not limited to these embodiments. Thus, it should be understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.
While combinations of features are recited in the claims and/or described in the specification, these combinations are not intended to limit the disclosure to possible embodiments. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed in the present application may be directly dependent on only one claim, the disclosure of possible embodiments includes each dependent claim in combination with every other claim in the set of claims.
No element, act, or instruction used in the present application should be construed as critical or essential unless explicitly described as such. In addition, as used herein, the articles "a" and "an" are intended to include at least one item, and may be used interchangeably with "at least one". Further, as used herein, the term "set" is intended to include at least one item (e.g., related items, unrelated items, combinations of related and unrelated items, etc.) and may be used interchangeably with "at least one". When only one item is intended, the term "one" or similar language is used. Further, as used herein, the terms "having," "containing," and the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.

Claims (20)

1. A method for generating a feature map, comprising:
receiving an image through a Deep Neural Network (DNN); and
generating, by the DNN, a first feature map based on the image while the DNN is in a trained state, wherein,
the DNN is configured to perform a task based on the image, and the DNN is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
2. The method of claim 1, further comprising: training the DNN, the training process comprising:
generating a second feature map based on the training image by performing a network forward computation by the DNN, the network forward computation including the feature sparsity regularization process with smoothness;
calculating a regularization loss of the second feature map based on the training images;
calculating a smoothing loss of the second feature map based on the training image;
calculating a total gradient based on the calculated regularization and smoothing losses; and
updating network coefficients of the DNN by performing the back propagation and weight update processes based on the calculated overall gradient.
3. The method of claim 2,
the training the DNN further comprises: calculating a loss of empirical data for the task performed by the DNN based on the training images, an
The calculating the overall gradient comprises: calculating the overall gradient based on the calculated empirical data loss, regularization loss, and smoothing loss.
4. The method according to claim 2 or 3,
the training the DNN further comprises: accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients including the overall gradient,
wherein the updating of the network coefficients of the DNN is performed based on the accumulated plurality of overall gradients.
5. The method according to claim 2 or 3,
the training the DNN is over a plurality of iterations, and the training the DNN further comprises: varying the hyper-parameters during the iterations such that the training emphasizes learning a sparse feature map, and then emphasizing smoothing of feature responses in subsequent ones of the iterations.
6. The method according to claim 2 or 3,
the training the DNN is over a plurality of iterations, and the training the DNN further comprises: varying the hyperparameter in the iterations such that the training emphasizes smoothness within a spatial dimension, and then emphasizing smoothness of a channel width in subsequent ones of the iterations.
7. The method of any of claims 1 to 3, wherein the back propagation and weight update process updates the network coefficients of the DNN during the training process based on regularization and smoothing losses calculated based on the output of the feature sparse regularization process with smoothness.
8. The method according to any of claims 1 to 3, wherein the back-propagation and weight update procedure updates the network coefficients of the DNN during the training procedure based on:
calculated regularization and smoothing losses, wherein the calculated regularization and smoothing losses are calculated based on an output of the feature sparse regularization with smoothness, an
A calculated loss of empirical data for the task performed by the DNN.
9. The method of any one of claims 1 to 3, further comprising compressing the first feature map.
10. The method of any of claims 1 to 3, wherein the DNN is configured to perform at least one of semantic segmentation, image or video classification, object detection, and image or video super resolution as the task.
11. An apparatus for generating a feature map, comprising:
an acquisition module for generating a first feature map based on images input into a Deep Neural Network (DNN) while the DNN is in a trained state, wherein,
the DNN is configured to perform a task based on the image, and the DNN is trained using a training image by using a feature sparse regularization process with smoothness and a back propagation and weight update process that updates the DNN based on an output of the feature sparse regularization process with smoothness.
12. The apparatus of claim 11, wherein the DNN is trained by:
generating a second feature map based on the training image by performing a network forward computation by the DNN, the network forward computation including the feature sparsity regularization process with smoothness;
calculating a regularization loss of the second feature map based on the training images;
calculating a smoothing loss of the second feature map based on the training image;
calculating a total gradient based on the calculated regularization and smoothing losses; and
updating network coefficients of the DNN by performing the back propagation and weight update processes based on the calculated overall gradient.
13. The apparatus of claim 12, wherein the DNN is further trained by: calculating, based on the training images, a loss of empirical data for the task performed by the DNN,
wherein the overall gradient is calculated based on the calculated empirical data loss, regularization loss, and smoothing loss.
14. The apparatus of claim 12 or 13,
the DNN is further trained by accumulating a plurality of overall gradients of a batch of training data, the plurality of overall gradients comprising the overall gradient,
wherein the network coefficients of the DNN are updated based on the accumulated plurality of overall gradients.
15. The apparatus of claim 12 or 13, wherein the DNN is trained by: the DNN is updated over a number of iterations, and the hyperparameters are changed over the iterations, such that the training emphasizes learning a sparse feature map, and then emphasizing smoothing of feature responses in subsequent ones of the iterations.
16. The apparatus of claim 12 or 13, wherein the DNN is trained by: the DNN is updated over a plurality of iterations, and the hyperparameters are changed over the iterations, such that the training emphasizes smoothness within spatial dimensions, and then channel width smoothness is emphasized in subsequent ones of the iterations.
17. The apparatus of any of claims 11 to 13, wherein the back propagation and weight update process updates the network coefficients of the DNN during the training process based on regularization and smoothing losses calculated based on the output of the feature sparse regularization process with smoothness.
18. The apparatus of any of claims 11 to 13, wherein the back propagation and weight update process updates the network coefficients of the DNN during the training process based on:
the calculated regularization and smoothness losses, wherein the calculated regularization and smoothness losses are calculated based on an output of the feature map sparsification with smoothness regularization, and
a calculated empirical data loss for the task performed by the DNN.
19. The apparatus of any one of claims 11 to 13, further comprising a compression module for compressing the first feature map.
20. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to, when executed by at least one processor implementing a Deep Neural Network (DNN), cause the at least one processor to perform the method of any one of claims 1 to 10.
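The training recited in claims 12-16 combines an empirical data loss with a sparsity regularization loss and a smoothness loss on the feature map, and shifts the hyperparameter emphasis over the iterations. A minimal sketch of that loss combination follows; the L1 sparsity term, the squared-difference smoothness term, the linear schedule, and all function names are illustrative assumptions, not the patented formulation.

```python
import numpy as np

def sparsity_loss(fmap):
    """L1 regularization loss encouraging a sparse feature map (illustrative)."""
    return np.abs(fmap).sum()

def smoothness_loss(fmap):
    """Squared differences between neighbouring feature responses along the
    channel, height and width axes of a (C, H, W) map (illustrative)."""
    dc = np.diff(fmap, axis=0) ** 2  # across the channel dimension
    dh = np.diff(fmap, axis=1) ** 2  # along height
    dw = np.diff(fmap, axis=2) ** 2  # along width
    return dc.sum() + dh.sum() + dw.sum()

def total_loss(fmap, empirical_loss, step, total_steps):
    """Combine the three losses with a simple linear schedule that emphasizes
    sparsity early in training and smoothness later (assumed schedule)."""
    t = step / total_steps
    lam_sparse, lam_smooth = 1.0 - t, t
    return (empirical_loss
            + lam_sparse * sparsity_loss(fmap)
            + lam_smooth * smoothness_loss(fmap))
```

In a full training loop, the gradient of this combined loss with respect to the network coefficients would be accumulated over a batch and applied in the back propagation and weight update step, as in claim 14.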
CN202011282774.XA 2019-11-21 2020-11-17 Method, apparatus and readable storage medium for generating feature map Active CN112825132B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962938672P 2019-11-21 2019-11-21
US62/938,672 2019-11-21
US17/063,111 2020-10-05
US17/063,111 US11544569B2 (en) 2019-11-21 2020-10-05 Feature map sparsification with smoothness regularization

Publications (2)

Publication Number Publication Date
CN112825132A true CN112825132A (en) 2021-05-21
CN112825132B CN112825132B (en) 2022-06-03

Family

ID=75907811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282774.XA Active CN112825132B (en) 2019-11-21 2020-11-17 Method, apparatus and readable storage medium for generating feature map

Country Status (1)

Country Link
CN (1) CN112825132B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792848A (en) * 2021-08-26 2021-12-14 Hohai University Memory optimization method for pipeline model-parallel training based on feature map coding
CN117216596A (en) * 2023-08-16 2023-12-12 Chinese PLA General Hospital Federated learning optimized communication method, system and storage medium based on gradient clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058509A1 (en) * 2016-09-30 2018-04-05 Intel Corporation Dynamic neural network surgery
CN107909061A (en) * 2017-12-07 2018-04-13 University of Electronic Science and Technology of China Head pose tracking device and method based on incomplete features
US20190138896A1 (en) * 2017-11-03 2019-05-09 Samsung Electronics Co., Ltd. Method for Optimizing Neural Networks
US20190156211A1 (en) * 2017-11-21 2019-05-23 International Business Machines Corporation Feature extraction using multi-task learning


Also Published As

Publication number Publication date
CN112825132B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN106664467B (en) Method, system, medium and the equipment of video data stream capture and abstract
CN112825132B (en) Method, apparatus and readable storage medium for generating feature map
CN111369602A (en) Point cloud data processing method and device, electronic equipment and readable storage medium
JP7321372B2 (en) Method, Apparatus and Computer Program for Compression of Neural Network Models by Fine-Structured Weight Pruning and Weight Integration
EP3278238A1 (en) Fast orthogonal projection
CN113822427A (en) Model training method, image matching device and storage medium
CN112488999A (en) Method, system, storage medium and terminal for detecting small target in image
CN116994022A (en) Object detection method, model training method, device, electronic equipment and medium
CN115147680A (en) Pre-training method, device and equipment of target detection model
JP7374340B2 (en) Methods, apparatus and computer programs for task-adaptive preprocessing for neural image compression
CN106875396A (en) The extracting method and device in the notable area of video based on kinetic characteristic
CN114616576A (en) Method and apparatus for rate adaptive neural image compression with a challenge generator
US11544569B2 (en) Feature map sparsification with smoothness regularization
Lim et al. Point cloud generation using deep local features for augmented and mixed reality contents
CN115391310A (en) Data migration method, device, equipment and storage medium
EP4101164A1 (en) Substitutional input optimization for adaptive neural image compression with smooth quality control
CN114078097A (en) Method and device for acquiring image defogging model and electronic equipment
Hao et al. Fusion-based approach to change detection to reduce the effect of the trade-off parameter in the active contour model
CN113255878A (en) Method and device for determining escape reordering mode for neural network model compression
JP7434605B2 (en) Method, apparatus, non-transitory computer-readable medium, and computer program for adaptive neural image compression using smooth quality control with meta-learning
Hornauer et al. Visual domain adaptation for monocular depth estimation on resource-constrained hardware
JP7463012B2 (en) Method, apparatus and computer program for integration-based coding for neural network model compression
Ma et al. Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning
Park et al. Codebook-based background subtraction to generate photorealistic avatars in a walkthrough simulator
Meena et al. Point based interactive image segmentation using multiquadrics splines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40043987)
GR01 Patent grant