CN109982088B - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN109982088B
Authority
CN
China
Prior art keywords
image
feature extraction
layer
execution
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711456179.1A
Other languages
Chinese (zh)
Other versions
CN109982088A
Inventor
董晓
卢兴敬
刘雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Computing Technology of CAS filed Critical Huawei Technologies Co Ltd
Priority to CN201711456179.1A priority Critical patent/CN109982088B/en
Priority to PCT/CN2018/121110 priority patent/WO2019128735A1/en
Publication of CN109982088A publication Critical patent/CN109982088A/en
Application granted granted Critical
Publication of CN109982088B publication Critical patent/CN109982088B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Abstract

The embodiments of the present application provide an image processing method and apparatus. The method includes: converting a foreground image corresponding to an original image into a compressed image; estimating a first extraction time required to perform depth feature extraction using the original image and a second extraction time required to perform depth feature extraction using the compressed image; and determining, according to the first extraction time and the second extraction time, whether to perform depth feature extraction using the original image or the compressed image. By adopting the embodiments of the present application, the amount of computation in the image depth feature extraction process can be effectively reduced, thereby shortening the execution time of video object detection.

Description

Image processing method and device
Technical Field
The embodiments of the present application relate to the technical field of image processing, and in particular to an image processing method and apparatus.
Background
Deep learning is a relatively new field in machine learning research. Its motivation is to build and train neural networks that analyze and learn in a way inspired by the human brain, simulating the mechanisms the brain uses to interpret data such as images, sound, and text. With the rapid development of artificial intelligence technologies represented by deep learning, deep learning techniques are beginning to be applied to real-world scenarios such as image classification and object detection. Object detection refers to determining the positions of all objects contained in a given image and assigning a class to each object. Video-based object detection and recognition has a wide range of application scenarios: for example, vehicle recognition in traffic monitoring videos can be used for traffic flow estimation and traffic accident identification, and pedestrian identification in monitoring videos from road checkpoints and transportation hubs is of great significance for public security.
The video object detection process can be divided into three stages: video preprocessing, image feature extraction, and object position and category determination. The video preprocessing stage decodes the video and converts it into images frame by frame. Image feature extraction transforms an image from pixel space into a suitable feature space, and must produce features with sufficient discriminative power for subsequent tasks such as classification and regression. Depending on the method used, image feature extraction can be divided into two modes: depth feature extraction and traditional feature extraction. Traditional feature extraction refers to color features, texture features, shape features, and the like obtained with classical computer vision algorithms, such as image histograms and scale-invariant features; it is simple and cheap to compute but has limited discriminative power. Depth feature extraction inputs the image into a deep neural network (DNN) and takes the computation result of some intermediate layer of the DNN as the feature. Depth features are highly discriminative, but computing them is expensive and time-consuming.
To accelerate the computation of a deep neural network, the amount of computation it performs needs to be reduced. A current approach reduces the number of parameters by compressing the parameter scale of the deep neural network model, thereby reducing the model's computation. In this approach, threshold pruning is first applied to the model parameters: the threshold is a manually set fixed value, and all parameters below the threshold are set to zero; since zeros do not affect the network's output, the zeroed parameters can be discarded. The remaining parameters are then clustered with a clustering algorithm, and all parameters falling into one cluster use the cluster center as their new value, so that parameters within a cluster are shared and only the cluster-center values need to be kept, reducing the actual number of parameters. Finally, the parameters are encoded with Huffman coding to reduce the storage space occupied by the model parameters and thus the model's computation. However, pruning based on a threshold produces irregular, unstructured sparsity, and such sparsity is difficult to translate into a reduction of the deep neural network's execution time, which in turn limits the execution time of video object detection.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide an image processing method and an image processing apparatus that can effectively reduce the amount of computation in the image depth feature extraction process, thereby shortening the execution time of video object detection.
A first aspect of an embodiment of the present application provides an image processing method, including:
converting a foreground image corresponding to an original image into a compressed image;
estimating a first extraction time required to perform depth feature extraction using the original image and a second extraction time required to perform depth feature extraction using the compressed image;
and determining, according to the first extraction time and the second extraction time, whether to perform depth feature extraction using the original image or using the compressed image.
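As a minimal, non-normative sketch of the selection described in the first aspect (the function and parameter names below are illustrative and not defined by the present application), the decision between the two extraction paths can be expressed as follows:

```python
# Minimal sketch of the selection logic described in the first aspect.
# estimate_time_original / estimate_time_compressed and the two extractors
# are hypothetical placeholders, not interfaces defined by this application.

def extract_depth_features(original_image, compressed_image,
                           estimate_time_original, estimate_time_compressed,
                           extract_from_original, extract_from_compressed):
    t1 = estimate_time_original(original_image)      # first extraction time
    t2 = estimate_time_compressed(compressed_image)  # second extraction time
    if t1 > t2:
        # the compressed image is expected to be faster
        return extract_from_compressed(compressed_image)
    # the original image is faster, or the two estimates are equal
    return extract_from_original(original_image)
```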
A second aspect of the embodiments of the present application provides an image processing apparatus comprising units (means) for performing the steps of the first aspect above.
A third aspect of the embodiments of the present application provides an image processing apparatus, including at least one processing element and at least one memory element, where the at least one memory element is configured to store a program and data, and the at least one processing element is configured to perform the method provided in the first aspect of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides an image processing apparatus comprising at least one processing element (or chip) for performing the method of the first aspect above.
A fifth aspect of embodiments of the present application provides a computer program product, which when executed by a processor is configured to perform the method of the first aspect.
A sixth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the aforementioned first aspect.
It can be seen that, in the first to sixth aspects above, by estimating the first extraction time for performing depth feature extraction using the original image and the second extraction time for performing depth feature extraction using the compressed image, and selecting whether to use the original image or the compressed image by comparing the two, the amount of computation in the image depth feature extraction process can be effectively reduced, thereby shortening the execution time of video object detection.

In a possible implementation manner, before the foreground image corresponding to the original image is converted into the compressed image, a to-be-processed video input to the terminal is decoded, and the decoded video is decomposed to obtain the original images. Decoding and decomposing the to-be-processed video yields multiple frames of original images, and the background regions of previous frames can be referenced when performing background segmentation on a given frame, which improves the accuracy of the background segmentation.

In a possible implementation manner, background segmentation is performed on the original image to obtain the foreground image corresponding to it, and the foreground image is then converted to obtain the compressed image. The foreground image retains the valid information in the original image, i.e., the information related to object detection, and removes the background information. However, the foreground image only removes background information; its data scale is unchanged. Converting the foreground image into a compressed image changes the data scale, so the execution time of depth feature extraction can be shortened when the foreground region is relatively small.
In a possible implementation manner, execution environment information is obtained, and the first extraction time for performing depth feature extraction using the original image and the second extraction time for performing depth feature extraction using the compressed image are estimated according to performance records corresponding to that execution environment information. This may include the following steps:

acquiring execution environment information, which may be the execution environment information of the terminal and includes hardware platform information and/or software stack information, where the hardware platform information describes the type and model of the hardware platform devices and the software stack information describes the version information of the dependent libraries; the execution environment information can be obtained through interfaces of the operating system or of a function library;

searching a database for a first performance record and a second performance record corresponding to the execution environment information, where the database includes at least one first performance record for the original image format and at least one second performance record for the compressed image format;

and estimating, according to the first performance record, the first extraction time for performing depth feature extraction using the original image, and estimating, according to the second performance record, the second extraction time for performing depth feature extraction using the compressed image.

The extraction time with the original image and the extraction time with the compressed image are estimated according to the correspondence between execution environment information and performance records in the database, so that the faster mode, i.e., depth feature extraction using the original image or using the compressed image, can be chosen. The first and second performance records are obtained based on the terminal's execution environment information; in other words, they match the current operating environment of the image processing apparatus, so the estimated extraction times differ little from the real extraction times.
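A possible way to realize the lookup described above is sketched below; the record fields (hardware, software_stack, image_format) are assumptions made only for illustration, not a layout prescribed by the present application.

```python
# Sketch of looking up performance records that match the current execution
# environment; the record fields are illustrative assumptions.

def find_performance_records(database, env_info):
    """Return (first_records, second_records) matching the execution environment."""
    first_records, second_records = [], []
    for record in database:
        if (record["hardware"] == env_info["hardware"]
                and record["software_stack"] == env_info["software_stack"]):
            if record["image_format"] == "original":
                first_records.append(record)    # first performance records
            elif record["image_format"] == "compressed":
                second_records.append(record)   # second performance records
    return first_records, second_records
```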
In a possible implementation manner, the first performance record includes, for each of multiple layers, a first normalized execution time and the corresponding execution environment information, where a layer may be a convolutional layer, an activation layer, a splice (concatenation) layer, or the like. The data scale of each layer is calculated from the original image; the first execution time of each layer is estimated as the first normalized execution time of that layer multiplied by its data scale; and the first extraction time for performing depth feature extraction using the original image is then estimated from the per-layer first execution times, i.e., the first execution times of all layers are summed to obtain the first extraction time. Estimating the per-layer first execution times for the original image format from the first performance record and summing them to obtain the first extraction time is simple and involves little computation.

In a possible implementation manner, the second performance record likewise includes, for each of multiple layers, a second normalized execution time and the corresponding execution environment information, where a layer may be a convolutional layer, an activation layer, a splice (concatenation) layer, or the like. The data scale of each layer is calculated from the compressed image; the second execution time of each layer is estimated as the second normalized execution time of that layer multiplied by its data scale; and the second extraction time for performing depth feature extraction using the compressed image is then estimated by summing the per-layer second execution times. Estimating the per-layer second execution times for the compressed image format from the second performance record and summing them to obtain the second extraction time is simple and involves little computation.
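The per-layer estimation described in the two implementations above can be sketched as follows; the record layout and the data_scale_fn helper are assumptions for illustration only.

```python
# Sketch of the per-layer estimation: each layer's execution time is the
# normalized execution time from the matching performance record multiplied
# by the data scale computed for that layer, and the extraction time is the
# sum over all layers.

def estimate_extraction_time(layers, records_by_type, data_scale_fn):
    total = 0.0
    for layer in layers:
        normalized_time = records_by_type[layer["type"]]["normalized_time"]
        scale = data_scale_fn(layer)       # e.g. h*w*c*C for a convolutional layer
        total += normalized_time * scale   # estimated execution time of this layer
    return total                           # first or second extraction time
```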
In a possible implementation manner, once the first extraction time and the second extraction time have been estimated, whether to use the original image or the compressed image for depth feature extraction is determined by comparing the two. If the first extraction time is longer than the second extraction time, the compressed image is used for depth feature extraction; if the first extraction time is shorter than the second extraction time, the original image is used, so that the selected extraction mode takes the least time, which helps shorten the execution time of video object detection. If the two extraction times are equal, either the compressed image or the original image may be used, since the two choices have the same effect in that case.
In a possible implementation manner, if it is determined that the original image is to be used for depth feature extraction, the original image is input into the original deep neural network model for depth feature extraction, where the original deep neural network model is an existing deep neural network model; in this case depth feature extraction with the original image and this model takes less time.

In a possible implementation manner, if it is determined that the compressed image is to be used for depth feature extraction, the compressed image is input into a specific deep neural network model for depth feature extraction, where the specific deep neural network model supports depth feature extraction on images in the compressed image format and is obtained by improving and optimizing the existing deep neural network model; in this case depth feature extraction with the compressed image and this model takes less time.

In a possible implementation manner, after depth feature extraction is performed using the original image, the third execution time of each layer during that extraction is recorded, and the first performance record corresponding to the execution environment information in the database is updated according to the per-layer third execution times; that is, recording the per-layer execution times observed while extracting depth features from the original image and updating the corresponding first performance record avoids drift between the offline data and the actual application scenario.

In a possible implementation manner, after depth feature extraction is performed using the compressed image, the fourth execution time of each layer during that extraction is recorded, and the second performance record corresponding to the execution environment information in the database is updated according to the per-layer fourth execution times; that is, recording the per-layer execution times observed while extracting depth features from the compressed image and updating the corresponding second performance record avoids drift between the offline data and the actual application scenario.
In a possible implementation manner, before the foreground image corresponding to the original image is converted into the compressed image, a first performance record and a second performance record corresponding to the execution environment information are generated in advance, and the first performance record and the second performance record are added to the database so as to perform pre-estimation according to the performance records recorded by the database. The pre-generation process may be an offline phase, i.e., generating and recording performance records before formal depth feature extraction.
In a possible implementation manner, before the foreground image corresponding to the original image is converted into the compressed image, whether the image feature extraction algorithm is a depth feature extraction algorithm or a non-depth feature extraction algorithm is determined, and if the image feature extraction algorithm is the depth feature extraction algorithm, the step of converting the foreground image corresponding to the original image into the compressed image is performed, so that the extraction time of the depth feature extraction of the compressed image and the original image is estimated. Whether the image feature extraction algorithm is a depth feature extraction algorithm or a non-depth feature extraction algorithm may be determined according to the selection instruction.
In a possible implementation manner, if the image feature extraction algorithm is determined to be a non-depth feature extraction algorithm, the data scale of the original image is obtained, the execution mode with the shortest extraction time is determined according to the data scale of the original image and the database, and non-depth feature extraction is performed on the original image in that execution mode. The database includes the basic execution time of each execution mode under each non-depth feature extraction algorithm; multiplying the data scale of the original image by the basic execution time of each execution mode gives the extraction time of that mode. The execution mode with the shortest extraction time is selected and used for non-depth feature extraction, so that non-depth feature extraction takes the least time, which helps shorten the execution time of video object detection. A sketch of this selection follows below.
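The selection of the execution mode for a non-depth feature extraction algorithm can be sketched as follows, assuming (purely for illustration) that the database is represented as a mapping from execution mode to basic execution time:

```python
# Sketch of choosing the execution mode for a non-depth (traditional) feature
# extraction algorithm: extraction time of a mode = data scale of the original
# image * that mode's basic execution time from the database.

def choose_execution_mode(data_scale, basic_times_by_mode):
    best_mode, best_time = None, float("inf")
    for mode, base_time in basic_times_by_mode.items():
        t = data_scale * base_time         # estimated extraction time of this mode
        if t < best_time:
            best_mode, best_time = mode, t
    return best_mode, best_time
```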
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
FIG. 1 is a schematic diagram of a system architecture to which embodiments of the present application are applied;
fig. 2 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another image processing method according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an example of a convolution calculation process according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a deep neural network model provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a logic structure of an image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic physical structure diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The following explains terms related to embodiments of the present application:
the original image is any one of a plurality of frames of images obtained by decoding and decomposing an input video, and the plurality of frames of images are continuous in time. It is understood that the original image is the image before the background segmentation and compression processing. The image format of the original image is the original image format.
The foreground image is obtained by segmenting the foreground area and the background area of the original image by using a background modeling method, the information related to object detection in the original image is reserved, and the background information is removed. For example, the original image is a road traffic monitoring image, and the foreground image may include pedestrians, motor vehicles, non-motor vehicles, and the like.
The background image is obtained by segmenting a foreground region and a background region of the original image by using a background modeling method, and comprises information irrelevant to object detection. For example, the original image is a road traffic monitoring image, and the background image may include roads, buildings, trees, and the like.
The compressed image is obtained by compressing the foreground image using a predefined compressed storage format. The image format of the compressed image is the compressed image format.

Depth feature extraction refers to inputting an image into a deep neural network and taking the computation result of some intermediate layer of the network as the feature. The deep neural network may be, for example, a convolutional neural network (CNN) or a recurrent neural network (RNN). The deep neural network model in the embodiments of the present application is exemplified by a convolutional neural network model, which comprises multiple layers including convolutional layers; the number of convolutional layers is not limited.
In a convolutional neural network, a convolution kernel can be thought of as representing one "feature extraction pattern": a convolution kernel extracts only one kind of feature and produces one feature map, i.e., one feature map represents one feature. If only one convolution kernel is applied to an image, only one feature map is obtained, meaning only one kind of feature can be extracted, whereas the final classification task needs different features from the image. For example, a common classification task is to distinguish dogs from cats: one convolution kernel may extract the dog's head, another the dog's tail, a third the dog's body, and the remaining kernels may extract features of cat parts. Different feature maps thus contain different features, and the final classifier decides between dog and cat according to the responses of these different features; for instance, if the three feature maps for the dog's head, tail, and body all have high responses while the feature maps describing cat features have low responses, the classifier decides the image is a dog. Using more feature maps serves to extract more features that may be useful.
A convolution kernel: given an input image, each pixel in the output image is a weighted average of the pixels in a small region of the input image, where the weights are defined by a function referred to as the convolution kernel. The convolution kernel is typically represented by a matrix.
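The following sketch illustrates this definition with a plain single-channel 2-D convolution in which each output pixel is a weighted combination of a small input region, the weights being the kernel matrix; it is an illustration only and not code from the present application:

```python
# Illustrative single-channel 2-D convolution (valid padding): each output
# pixel is a weighted combination of a small region of the input image, with
# the weights given by the kernel matrix.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```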
The original deep neural network model is designed for the original image format; the operations of the layers in the network support the regular rectangular shape of the original image format.

The specific deep neural network model is obtained by processing the original deep neural network model. It is designed for the compressed image format: the operations of the layers in the network are converted into computations on regular matrices, and the overhead of this conversion is kept as small as possible. Specifically, the computation of the layers in the original deep neural network model is designed for the original image; the layers support the regular rectangular shape of the original image well and compute efficiently. Since the compressed image contains only the foreground region, whose shape is not guaranteed to be regular, using the original deep neural network model directly on a compressed image does not yield the best performance.
Non-depth feature extraction, i.e., traditional feature extraction, refers to color features, texture features, shape features, and the like, such as image histograms, scale invariant features, and the like, obtained using various classical computer vision related algorithms.
The number of channels refers to the number of image channels, for example the three channels red, green, and blue.
An image processing method and an apparatus thereof provided by the embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a system architecture of a terminal 100 where an image processing apparatus according to an embodiment of the present application is located. As shown in fig. 1, the terminal 100 includes a software layer and a hardware layer.
In the embodiment of the present application, the application 101 in the software layer may be a video object detection application, which is configured to detect objects in an input video, to identify categories of the objects in the video, determine positions of the objects in the video, and so on, for example, to identify vehicles in a traffic monitoring video, so as to facilitate traffic flow estimation and traffic accident recognition; for another example, the identity of the pedestrian in the monitoring video of the road gate and the traffic hub is identified, so that the public security and the safety are facilitated. The video object detection application program is applied to road traffic and can detect pedestrians, motor vehicles, non-motor vehicles and the like.
The software layer of the terminal provides a software stack/software library 102 to the video object detection application for invocation during execution. The software stack/software library 102 may include, but is not limited to, the open source computer vision library (OpenCV), a basic linear algebra subprograms library (BLAS), a GPU computing platform (e.g., NVIDIA's Compute Unified Device Architecture (CUDA)), and so on. OpenCV implements many general algorithms in image processing and computer vision. BLAS holds a large number of routines written for linear algebra operations. CUDA enables graphics processing units (GPUs) to solve complex computational problems.
In this embodiment, the video object detection application program needs to run on the hardware platform, and the hardware layer may include, but is not limited to, a Central Processing Unit (CPU) 103, a Graphics Processing Unit (GPU) 105, a memory 104, and other devices. The hardware platform provides a hardware infrastructure for the video object detection application to support the operation of the video object detection application.
In view of the long extraction time of conventional image feature extraction and the resulting long execution time of video object detection, embodiments of the present application provide an image processing method and apparatus that estimate the feature extraction time of the original image and of the compressed image and select the scheme with the shorter execution time. This effectively reduces the amount of computation in image depth feature extraction and shortens the time it takes, thereby shortening the execution time of video object detection.
The terminal related to the embodiment of the present application may include, but is not limited to, an electronic device such as a desktop computer, a notebook computer, a mobile phone, a tablet computer, a portable device, and a vehicle-mounted terminal.
The image processing apparatus according to the embodiment of the present application may be all or part of the terminal device 100 described above. Fig. 2 is a schematic view of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus can be divided into three modules to perform video object detection, a video preprocessing module 201, an image feature extraction module 202 and an object positioning and classification module 203. When the image processing apparatus receives input video data, the video preprocessing module 201 decodes the video data, and decomposes the video data into images frame by frame; the image feature extraction module 202 performs depth feature extraction/traditional feature extraction on the image; the object location and classification module 203 determines the specific type of object and the location where it appears in the image.
The original image and the compressed image output by the video preprocessing module 201 can be used as input to the image feature extraction module 202, and the features output by the image feature extraction module 202 can be used as input to the object positioning and classification module 203. The flow shown by the solid line in fig. 2 corresponds to the case where the image feature extraction algorithm is a depth feature extraction algorithm; the flow marked ① corresponds to the case of a non-depth feature extraction algorithm, and the flows marked ② and ③ correspond to the case of a depth feature extraction algorithm, where ② is the flow for depth feature extraction using the original image and ③ is the flow for depth feature extraction using the compressed image. The non-depth feature extraction algorithms in the embodiments of the present application are traditional feature extraction algorithms, such as image histogram extraction, scale-invariant feature extraction, and texture feature extraction.
The video pre-processing module 201 includes a decoding decomposer 2011 and a video background analyzer 2012.
The decoding decomposer 2011 may be a program that processes the input video data; it is responsible for decoding the input video and decomposing it into images frame by frame. The input video processed by the decoding decomposer 2011 yields the original images, i.e., the output of the decoding decomposer 2011 is the original images. If the image feature extraction algorithm is determined to be a traditional feature extraction algorithm, the image feature extraction algorithm performance predictor 2021 can provide the basic execution time of the traditional feature extraction algorithm in different execution modes, so that the execution mode with the shortest execution time can be determined. If the image feature extraction algorithm is determined to be a depth feature extraction algorithm, the image feature extraction algorithm performance predictor 2021 can provide the extraction time for the original image format and the extraction time for the compressed image format.
The video background analyzer 2012 processes the input video data, models the input video by using a computer vision background modeling method, and can segment the foreground region and the background region of each frame of the original image of the video by using the model, that is, segment the foreground region and the background region of the original image input by the decoding decomposer 2011, so as to obtain the foreground image.
The image feature extraction module 202 includes an image feature extraction algorithm performance predictor 2021, a compressed image generator 2022, a compressed image depth feature extraction algorithm optimizer 2023, a depth neural network module 2024, and a performance data corrector 2025. The image feature extraction module 202 may call the database 3013.
The image feature extraction algorithm performance predictor 2021 makes predictions about the performance of each alternative implementation of the current image feature extraction algorithm using the data in the database 3013, predicting the execution times of the flows ①, ② and ③ so as to determine the mode used for feature extraction.
The compressed image generator 2022 generates a compressed image corresponding to the original image according to the predefined compressed storage format and the segmentation result of the video background analyzer 2012 on the foreground and background areas of the image, that is, processes the foreground image to obtain a compressed image.
The compressed image depth feature extraction algorithm optimizer 2023 optimizes the depth feature extraction algorithm for the scenario in which the compressed image is the input, i.e., it improves and optimizes the deep neural network model in the deep neural network module 2024 so that the improved model can efficiently support depth feature extraction on compressed images.
The deep neural network module 2024 includes the structure of the deep neural network model used in the deep feature extraction algorithm and the values of the parameters of the deep neural network model.
The compressed image depth feature extraction algorithm optimizer 2023 acts on the depth neural network module 2024 to improve and optimize the depth neural network module 2024 so that the improved depth neural network module 2024 can support depth feature extraction of the compressed image. If deep neural network module 2024 is not modified, deep neural network module 2024 can support deep feature extraction of the original image.
Performance data corrector 2025: in the running process of the video object detection application program, performance data realized by using an image feature extraction algorithm is dynamically collected and transmitted to the database 3013, so that the database 3013 updates the corresponding performance data record established in the offline stage, and the situation that data offset exists between the offline data and the actual application scene is avoided.
Database 3013: the performance database of the image feature extraction algorithm is an input of the performance predictor 2021 of the image feature extraction algorithm, and includes performance data realized by all feature extraction algorithms obtained in an off-line stage and operating environment features when acquiring the performance data. The database 3013 may also receive the performance data transmitted by the performance data corrector 2025, and update the performance data obtained in the corresponding offline stage according to the performance data. The database 3013 may be stored in a memory of a terminal where the image processing apparatus is located, or may be stored in a cloud server, and the image feature extraction module 202 may download the performance database from the cloud server when it needs to call the performance database.
The performance data obtained in the offline stage may be collected by the performance data collection module 3011, and the operating environment features obtained in the offline stage may be collected by the operating environment feature collection module 3012. The performance data collection module 3011 and the environmental characteristic collection module 3012 may be located on a cloud server.
Performance data collection module 3011: to predict the performance of the different implementations of the image feature extraction algorithms, performance data for these implementations must be obtained at run time. In the offline stage, each image feature extraction algorithm implementation is run, its real performance data at run time is recorded in the system, and the required database 3013 is finally output. In the formal operation stage of the video object detection application, the generated database 3013 is passed to the image feature extraction algorithm performance predictor 2021 so that the performance of each implementation can be predicted. The performance data can also be provided to the performance data corrector 2025 at the same time, which ensures that the corrector can correct the relevant performance data.
The operating environment feature collection module 3012: the performance of the same image feature extraction algorithm can change under different hardware platforms, software stacks, and input data scales. In the offline stage, this module complements the performance data output by the performance data collection module: it adds hardware platform information and software stack information to each performance record and normalizes the performance data by the data scale of the run, eliminating the influence of data scale on the performance data and making the records easier for the image feature extraction algorithm performance predictor 2021 to use.
The off-line stage or off-line process refers to a software development stage before deployment of a video object detection application, and the main processing flow of the stage aims to collect operation performance data records realized by an image feature extraction algorithm reflecting actual execution conditions as much as possible and establish a database 3013.
The offline process may include the steps of:
Step A: the performance data collection module 3011 obtains the execution time of running the different feature extraction algorithm implementations.
Different procedures are used for different image feature extraction algorithms. For a traditional feature extraction algorithm, the corresponding function in the algorithm library is called, the timestamps before and after the call are recorded, and their difference gives the execution time of this run of the feature extraction algorithm implementation.
For a depth feature extraction algorithm, because the structure of the deep neural network model varies, it is infeasible to enumerate all possible model structures in the offline process if the whole deep neural network model is taken as the basic unit of the performance record. In the embodiments of the present application, the deep neural network model is therefore decomposed into the layers that compose it, and the execution time of each layer is recorded. Since the types of layers that make up deep neural networks are limited, the deep neural network models that may be encountered during online execution can be covered better at the abstraction level of layers. The timestamps before and after the execution of each layer are recorded, and their difference gives the execution time of that layer.
In an embodiment of the present application, the deep neural network model may be a convolutional neural network model, which may include convolutional layers, activation layers, and splice layers. For each layer, the execution time T of the layer is recorded. To obtain the execution time of a layer, the two timestamps before and after the execution of the layer are recorded, denoted t1 and t2 respectively; then T is:

T = t2 - t1
Assuming that T is 2 milliseconds, denoting the type of the layer by type, and taking a convolutional layer with a 1 × 1 convolution window as an example, the performance record output by this step is shown in Table 1 below.
TABLE 1
Type | Execution time | Hardware platform information | Software stack information
1 × 1 convolution | 2 milliseconds | (not yet recorded) | (not yet recorded)
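The per-layer timing of step A can be sketched as below, where run_layer is a hypothetical stand-in for executing a single layer of the network:

```python
# Sketch of the per-layer timing in step A: record timestamps before and
# after the layer executes and take the difference, T = t2 - t1.
import time

def time_layer(run_layer, layer_input):
    t1 = time.perf_counter()
    output = run_layer(layer_input)
    t2 = time.perf_counter()
    return output, (t2 - t1) * 1000.0      # execution time T in milliseconds
```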
Step B: the operating environment feature collection module 3012 records the operating environment features related to the performance of the feature extraction algorithm at the time the performance data is recorded by the performance data collection module.
According to the execution scene of the video object detection application program, the running environment features needing to be collected comprise: the type and model of hardware platform device being executed, and the version information of the feature extraction algorithm dependent library. The above information can be conveniently obtained using an interface provided by a hardware device or a software library.
For example, the obtained GPU model is Nvidia Pascal P4; for the convolutional layer, the execution time related software comprises CUDA and cudnn, and version information of the CUDA and cudnn is respectively obtained: CUDA8.0, cudnn 5.0.15. In conjunction with the performance record in step A, after step B, the performance records are shown in Table 2 below.
TABLE 2
Type | Execution time | Hardware platform information | Software stack information
1 × 1 convolution | 2 milliseconds | Nvidia Pascal P4 | CUDA8.0, cudnn5.0.15
Step C: the performance data recorded by the performance data collection module 3011 is normalized according to the data scale and stored into the database 3013.
The execution time of an image feature extraction algorithm is closely related to the data size of the input image. The data scale of a run is calculated according to the characteristics of the image feature extraction algorithm. For example, for an algorithm that processes pixel by pixel, such as color space conversion, the data scale is the number of pixels of the input image, i.e., the product of the image width and height. For a sliding-window-based algorithm, such as 2-dimensional convolution, the data scale is the number of windows. The execution time recorded in step A is then divided by the data scale of that run to obtain performance data normalized by data scale, so that the performance record can be used for different data scales.
The execution time in the performance record from step A is the actual execution time of the layer, which is a function of the input data scale S; for the performance prediction of the feature extraction algorithm performance predictor, the execution time T in the performance record needs to be converted into a value independent of the data scale. Assuming that the height of the output feature map of a layer is h, its width is w, the number of channels of the output feature map is c, and the number of channels of the input image is C, the data scale of a convolutional layer is calculated as:
Sconv=h*w*c*C
For an activation layer, the data scale is equal to the number of output elements, calculated as:
Srelu=h*w*c
Assuming that the output feature map of the convolutional layer is 64 × 64 with 16 channels and the input data has 3 channels, the normalized execution time T'conv can be calculated from the normalization formula:

T'conv = T / Sconv = 2 / (64 * 64 * 16 * 3) ≈ 0.00001017 milliseconds
The normalized execution time is calculated and the performance record is modified accordingly. The record after modification is shown in Table 3 below.
TABLE 3
Type | Execution time | Hardware platform information | Software stack information
1 × 1 convolution | 0.00001017 milliseconds | Nvidia Pascal P4 | CUDA8.0, cudnn5.0.15
The performance record is added to the database 3013.
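A sketch of the normalization in step C, reproducing the worked example above (the helper name data_scale is illustrative only):

```python
# Sketch of the normalization in step C: divide the recorded execution time
# by the data scale of the run (Sconv = h*w*c*C for a convolutional layer,
# Srelu = h*w*c for an activation layer).

def data_scale(layer_type, h, w, c, C=None):
    if layer_type == "conv":
        return h * w * c * C
    if layer_type == "relu":
        return h * w * c
    raise ValueError("unsupported layer type")

T = 2.0                                    # measured execution time in milliseconds
S = data_scale("conv", h=64, w=64, c=16, C=3)
normalized_T = T / S                       # ~0.00001017 milliseconds, as in Table 3
```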
In one possible implementation, Table 3 above further includes an image format field; the image format can be either the original image format or the compressed image format, as shown in Table 3.1 below.
TABLE 3.1
Type | Execution time | Hardware platform information | Software stack information | Image format
1 × 1 convolution | 0.00001017 ms | Nvidia Pascal P4 | CUDA8.0, cudnn5.0.15 | Original image format
In the off-line stage, the performance record of the depth feature extraction by adopting the original depth neural network model and the performance record of the depth feature extraction by adopting the specific depth neural network model are respectively recorded. Since the specific deep neural network model has a process of converting an irregular shape into a regular matrix, there is a certain time overhead compared with the original deep neural network model, and there may be a difference in the normalized execution time between the two models.
The above steps a to C describe an offline process, and the following describes in detail the image processing method provided in the embodiment of the present application, that is, an online process, with reference to fig. 2 to 4.
Referring to fig. 3, a flowchart of an image processing method according to an embodiment of the present disclosure is shown, where the method may include, but is not limited to, step S301 to step S303.
Step S301, converting the foreground image corresponding to the original image into a compressed image.
In a possible implementation manner, before converting a foreground image corresponding to an original image into a compressed image, the image processing apparatus may decode a to-be-processed video of the input terminal, and decompose the decoded to-be-processed video to obtain the original image. The video to be processed is input by the user selection input terminal, and can be a traffic monitoring video, a road gate and traffic hub monitoring video, a cell entrance and exit monitoring video and the like.
It should be noted that, the image processing apparatus decomposes the decoded video to be processed to obtain multiple frames of original images, and the multiple frames of original images are consecutive in time. In the embodiment of the present application, a frame of original image is taken as an example for introduction, and a processing flow of each frame of original image is the same as that of the frame of original image.
With reference to the schematic diagram shown in fig. 2, the decoding decomposer 2011 decodes the to-be-processed video input to the terminal and decomposes the decoded video to obtain the original images. Specifically, the video preprocessing module may decompose the video frame by frame into three-channel red-green-blue (RGB) original images and provide the original images to the video background analyzer 2012.
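A sketch of the decoding and frame-by-frame decomposition performed by the decoding decomposer 2011, assuming OpenCV is used (the present application does not prescribe a particular library):

```python
# Sketch of the decoding decomposer 2011: decode the input video and
# decompose it into per-frame RGB original images.
import cv2

def decode_video(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to the RGB original image format
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```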
In one possible implementation manner, the image processing apparatus determines whether the image feature extraction algorithm is a depth feature extraction algorithm or a conventional feature extraction algorithm before converting a foreground image corresponding to the original image into a compressed image. In a first mode, the image processing apparatus may determine whether the image feature extraction algorithm is a depth feature extraction algorithm or a conventional feature extraction algorithm according to a selection instruction input by a user. The selection instruction can be input when the video to be processed is input, namely, the selection instruction and the video to be processed are input simultaneously, or after the video to be processed is input, an image feature extraction algorithm dialog box is output, and the image feature extraction algorithm is determined according to the selection instruction input aiming at the dialog box. In the second mode, the image processing device can randomly select to determine or autonomously determine whether the image feature extraction algorithm is a depth feature extraction algorithm or a traditional feature extraction algorithm. The manner of determining the image feature extraction algorithm is not limited in the embodiments of the present application.
And under the condition that the image feature extraction algorithm is determined to be the depth feature extraction algorithm, the image processing device performs background segmentation on the original image to obtain a foreground image, and converts the foreground image into a compressed image.
In a possible implementation manner, the image processing apparatus may segment the foreground and background regions of the original image by using a background modeling method, which is expressed as segmentation of the foreground and background regions of each frame of the original image in practical applications. For the road traffic monitoring video, the background area may include areas such as roads, buildings, trees, and the like which are not related to object detection, that is, areas with no or little change in the video, and the foreground area may include areas such as pedestrians, motor vehicles, non-motor vehicles, and the like, that is, areas with large change in the video. It can be understood that the foreground image retains the information related to object detection in the original image, and removes the background information.
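A sketch of foreground/background segmentation with a background modeling method; OpenCV's MOG2 background subtractor is used here purely as an example, since the present application does not prescribe a specific background model:

```python
# Sketch of foreground/background segmentation with a background modeling
# method (MOG2 used as an illustrative choice).
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2()

def foreground_image(original_rgb):
    mask = subtractor.apply(original_rgb)        # foreground mask for this frame
    mask = (mask > 0).astype(np.uint8)           # shadows treated as foreground here
    return original_rgb * mask[:, :, None]       # background pixels set to zero
```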
Although the foreground image retains the information related to object detection and removes the background information, it has the same structure as the original image: it is still a regular two-dimensional structure, and its data scale is unchanged. If the foreground image were used directly for depth feature extraction, its execution time would be no different from that of depth feature extraction on the original image, and the execution time would not be shortened. The foreground image therefore needs to be compressed, changing the regular two-dimensional structure and reducing the data scale, so that the execution time can be shortened when the foreground region occupies a relatively small proportion of the image.
The image processing device compresses the foreground image to obtain a compressed image, and stores the compressed image. It is understood that the storage format of the compressed image is a compressed format and the storage format of the original image is an original image format. The image processing device may convert the foreground image using a predefined compressed storage format.
With reference to the schematic diagram shown in fig. 2, the video background analyzer 2012 performs background segmentation on the original image to obtain a foreground image. The objects concerned by the object detection are all located in the foreground part, the background part of the original image does not contain effective information, the effective information is the information related to the object detection, and taking a road traffic image as an example, the effective information can comprise pedestrians, motor vehicles, non-motor vehicles and the like.
The compressed image generator 2022 stores the foreground image in the compressed image format using the predefined compressed storage format, and at the same time can establish the auxiliary information required by the subsequent depth feature extraction algorithm, which may include the height and width of the compressed image, the position of the foreground image in the original image, and so on. The predefined compressed storage format defines the data structure and storage of image data that discards the background region in a video object detection application. The format includes a foreground data portion and an image metadata portion. The foreground data portion contains only the data of the foreground region of the original image, and how much the data size shrinks relative to the original image depends on the background segmentation result output by the video background analyzer 2012. The image metadata portion maintains the metadata information of the image, i.e., information about the organization of the data, its fields, and their relationships; the image metadata includes the size of the image, the number of channels, and the like.
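An illustrative data structure for the predefined compressed storage format described above; the concrete field layout is an assumption, since the present application only specifies that the format contains a foreground data portion and an image metadata portion:

```python
# Illustrative data structure for the predefined compressed storage format:
# a foreground data portion plus an image metadata portion.
from dataclasses import dataclass
import numpy as np

@dataclass
class CompressedImage:
    foreground_data: np.ndarray   # pixels of the foreground region only
    height: int                   # metadata: original image height
    width: int                    # metadata: original image width
    channels: int                 # metadata: number of channels
    fg_top: int                   # metadata: position of the foreground in the original image
    fg_left: int
    fg_height: int
    fg_width: int
```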
Step S302, estimating a first extraction time required for extracting depth features by using the original image and a second extraction time required for extracting depth features by using the compressed image.
The image processing apparatus estimates a first extraction time for performing depth feature extraction using an original image and a second extraction time for performing depth feature extraction using a compressed image.
In one possible implementation manner, the image processing apparatus may obtain execution environment information, which may be the execution environment information of the terminal, search a database for a first performance record and a second performance record corresponding to the execution environment information, estimate, according to the first performance record, the first extraction time required for performing depth feature extraction using the original image, and estimate, according to the second performance record, the second extraction time required for performing depth feature extraction using the compressed image. That is, in this manner, the performance records corresponding to the execution environment information are searched while taking into account the time difference between the original deep neural network model and the specific deep neural network model. Because the first performance record and the second performance record are acquired based on the execution environment information of the terminal, in other words, they fit the current use environment of the image processing apparatus, the estimated extraction time differs little from the real extraction time.
The image processing apparatus can acquire the execution environment information of the terminal in the following two ways. In the first mode, the image processing apparatus may obtain the execution environment information by calling a hardware device interface and a function library interface of the terminal, for example, calling a CPU interface to obtain the type and model of a CPU, and calling a software library interface to obtain the version number of the CUDA. In the second mode, the image processing apparatus may acquire the execution environment information by sending an inquiry instruction to a system of the terminal, the inquiry instruction being for inquiring the execution environment information of the terminal. The two modes do not constitute limitations to the embodiments of the present application, and the execution environment information of the terminal may also be obtained in other modes.
The execution environment information includes hardware platform information and software stack information. The hardware platform information is used for indicating the type and model of the hardware platform device, for example, the GPU model is Nvidia Pascal P4; the software stack information is used to indicate the name of the related software and its version number, e.g., CUDA8.0, cudnn 5.0.15.
The first performance record is obtained by performing depth feature extraction based on the original image format in the offline stage; specifically, it is obtained by inputting an image in the original image format into the original depth neural network model for depth feature extraction in the offline stage. The first performance record includes the layer type of each layer of the plurality of layers, a first normalized execution time corresponding to each layer, and execution environment information corresponding to each layer. The first performance record further indicates that the image format is the original image format. A layer may be a convolution layer, an activation layer, a splicing layer, or the like. Similarly, the second performance record is obtained by performing depth feature extraction based on the compressed image format in the offline stage; specifically, it is obtained by inputting an image in the compressed image format into the specific depth neural network model for depth feature extraction in the offline stage. The second performance record includes the layer type of each layer of the plurality of layers, a second normalized execution time corresponding to each layer, and execution environment information corresponding to each layer. The second performance record further indicates that the image format is the compressed image format. It will be appreciated that, excluding hardware and/or software updates, the execution environment information of different layers is the same for the same terminal. The database records and stores the performance records of different image formats separately.
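As an illustration, the performance records could be organized as shown below, keyed by layer type, execution environment, and image format, with the normalized execution time as the stored value. The schema and the numeric values are assumptions made for the sake of the later sketches; the embodiment only requires that these fields be recorded.

# database: {(layer_type, execution_environment, image_format): normalized_execution_time}
# execution_environment combines hardware platform information and software stack information.
database = {
    ("conv",   ("Nvidia Pascal P4", "CUDA8.0", "cudnn5.0.15"), "original"):   1.2e-8,
    ("conv",   ("Nvidia Pascal P4", "CUDA8.0", "cudnn5.0.15"), "compressed"): 1.5e-8,
    ("relu",   ("Nvidia Pascal P4", "CUDA8.0", "cudnn5.0.15"), "original"):   2.0e-9,
    ("concat", ("Nvidia Pascal P4", "CUDA8.0", "cudnn5.0.15"), "original"):   1.0e-9,
}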
Specifically, the image processing device calculates a data scale corresponding to each layer according to the original image, that is, calculates a data scale corresponding to each layer of the original image in the original deep neural network model, estimates a first execution time corresponding to each layer according to a first normalized execution time corresponding to each layer and the data scale corresponding to each layer, and estimates a first extraction time for performing depth feature extraction by using the original image according to the first execution time corresponding to each layer. Similarly, the image processing device calculates the data scale corresponding to each layer according to the compressed image, that is, calculates the data scale corresponding to each layer of the compressed image in the specific depth neural network model, estimates the second execution time corresponding to each layer according to the second normalized execution time corresponding to each layer and the data scale corresponding to each layer, and estimates the second extraction time for extracting the depth features by using the compressed image according to the second execution time corresponding to each layer.
In conjunction with the deployment diagram shown in fig. 2, the image feature extraction algorithm performance predictor 2021 predicts a first extraction time for performing depth feature extraction using an original image and a second extraction time for performing depth feature extraction using a compressed image. The image feature extraction algorithm performance predictor 2021 needs to process the traditional features and the depth features in different ways. The conventional feature extraction process often corresponds to the invocation of an algorithm library, and the prediction performance can be calculated based on the record of the database 3013 and in combination with the data scale of the current image, that is, the execution time of the conventional feature extraction algorithm is estimated. The performance of the depth feature extraction is closely related to the structure of the depth neural network model, and all the depth neural network models which may be used in the video object detection application program are difficult to cover in the off-line process, so that the running time of each layer of the depth neural network model needs to be predicted from a lower abstract layer, and finally the prediction performance of the depth feature extraction process is obtained, namely the execution time of the depth feature extraction algorithm is predicted.
For example, the convolutional neural network model is decomposed into multiple layers. Assume the set of layers obtained by the decomposition is L, with L = {conv1, conv1_1, conv5_1, relu1, relu3_1, relu5_1, conv3_2, conv5_2, relu3_2, relu5_2, conv5_3, relu5_3, concat}, where conv denotes a convolution layer, relu denotes an activation layer, and concat denotes a splicing layer. The hardware platform information is hardware = (Nvidia Pascal P4), the software stack information is software = (CUDA8.0, cudnn5.0.15), and plat denotes the execution environment information consisting of the hardware platform information and the software stack information, i.e., plat = hardware + software. An image format is denoted by form: form1 denotes the original image format and form2 denotes the compressed image format. For a certain layer l in the set L, the first performance record corresponding to the execution environment information and the image format is searched in the database 3013 (denoted DB) according to the execution environment information of the terminal:
σ1(l, plat, form1) = DB(l, plat, form1), if the database 3013 contains a record matching (l, plat, form1);
σ1(l, plat, form1) = max over all plat' of DB(l, plat', form1), otherwise.
as can be seen from the formula, for execution environments not covered by the offline process, i.e., no performance record matching the execution environment and the image format is found in the database 3013, the image feature extraction algorithm performance predictor conservatively uses the slowest execution time of the layer in the database 3013 in all execution environments. Subsequent performance data correctors would dynamically add records in the database 3013 for this execution environment.
The σ1(l, plat, form1) obtained from the database 3013 is the execution time normalized by data scale for the original image format, i.e., the first normalized execution time corresponding to layer l; the data scale of each layer of the original image also needs to be calculated. For conv1, conv1_1 and conv5_1, the input data scale is the data scale of the input image; the data scale of each subsequent layer is the scale of the output data of the preceding layer, and can be computed in sequence according to the data dependences between layers. Assuming the data scale of layer l is s_l, l ∈ L, the first execution time t_l corresponding to layer l can be calculated, giving the estimated execution time of each layer:
t_l = σ1(l, plat, form1) × s_l
because the deep neural network model can be calculated layer by layer during execution, the estimated execution time of the whole depth feature extraction process can be calculated:
t_sum1 = Σ_(l ∈ L) t_l
The above t_sum1 is the first extraction time. Following the same calculation procedure for the compressed image, the second normalized execution time σ2(l, plat, form2) corresponding to layer l is obtained from the database 3013; if the data scale of layer l is s_l', the second execution time corresponding to layer l is t_l' = σ2(l, plat, form2) × s_l', from which the second extraction time t_sum2 is calculated.
According to the calculation, the first extraction time required by the original image and the second extraction time required by the compressed image can be estimated, and the estimation of two depth feature extraction algorithms is realized.
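A sketch of this estimation procedure, using the database layout assumed earlier: look up the normalized time of each layer (falling back conservatively to the slowest recorded environment when the current one was not covered offline), multiply it by the layer's data scale, and sum over all layers. The helper names are illustrative.

def lookup_sigma(database, layer_type, plat, image_format):
    key = (layer_type, plat, image_format)
    if key in database:
        return database[key]
    # conservative fallback: slowest normalized time of this layer type over all recorded
    # environments (assumes the layer type itself was covered in the offline stage)
    return max(t for (l, _, f), t in database.items() if l == layer_type and f == image_format)

def estimate_extraction_time(database, layers, data_scales, plat, image_format):
    # layers: {layer_name: layer_type}; data_scales[layer_name] is the input data scale of
    # that layer, derived from the image and the data dependences between layers
    return sum(lookup_sigma(database, layer_type, plat, image_format) * data_scales[name]
               for name, layer_type in layers.items())

# t_sum1 and t_sum2 are obtained by calling the estimator with "original" and "compressed".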
In a possible implementation manner, the image processing apparatus may first obtain the execution environment information of the terminal, search a database for the performance record corresponding to the execution environment information, and estimate, according to the performance record, the first extraction time required for performing depth feature extraction using the original image and the second extraction time required for performing depth feature extraction using the compressed image. That is, in this manner, the performance record corresponding to the execution environment information is searched without considering the time difference between the original deep neural network model and the specific deep neural network model.
The performance record is obtained by performing depth feature extraction in the offline stage and includes the layer type of each layer, the normalized execution time corresponding to each layer, and the execution environment information corresponding to each layer.
Specifically, the image processing device calculates the data scale corresponding to each layer according to the original image, that is, calculates the data scale corresponding to each layer of the original image in the original deep neural network model, estimates the first execution time corresponding to each layer according to the normalized execution time corresponding to each layer and the data scale corresponding to each layer, and estimates the first extraction time for performing depth feature extraction using the original image according to the first execution time corresponding to each layer. Similarly, the image processing device calculates the data scale corresponding to each layer according to the compressed image, that is, calculates the data scale corresponding to each layer of the compressed image in the specific depth neural network model, estimates the second execution time corresponding to each layer according to the normalized execution time corresponding to each layer and the data scale corresponding to each layer, and estimates the second extraction time for performing depth feature extraction using the compressed image according to the second execution time corresponding to each layer.
For example, for a certain layer l in the set L, the performance record corresponding to the execution environment information is searched in the database 3013 (DB) according to the execution environment information of the terminal:
σ(l, plat) = DB(l, plat), if the database 3013 contains a record matching (l, plat);
σ(l, plat) = max over all plat' of DB(l, plat'), otherwise.
The σ(l, plat) obtained from the database 3013 is the execution time normalized by data scale, i.e., the normalized execution time corresponding to layer l; the data scale of each layer of the original image and the data scale of each layer of the compressed image also need to be calculated.
Assume that for the original image the data scale of layer l is s_l, l ∈ L; then the execution time t_l corresponding to layer l can be calculated:

t_l = σ(l, plat) × s_l

Suppose that for the compressed image the data scale of layer l is s_l', l ∈ L; then the execution time t_l' corresponding to layer l can be calculated:

t_l' = σ(l, plat) × s_l'
Further, the first extraction time t_sum1 required by the original image and the second extraction time t_sum2 required by the compressed image can be estimated:
t_sum1 = Σ_(l ∈ L) t_l,    t_sum2 = Σ_(l ∈ L) t_l'
Step S303, determining to adopt the original image to carry out depth feature extraction or adopt the compressed image to carry out depth feature extraction according to the first extraction time and the second extraction time.
And the image processing device determines to adopt the original image to carry out depth feature extraction or adopt the compressed image to carry out depth feature extraction according to the first extraction time and the second extraction time.
The image processing device judges whether the first extraction time is longer than the second extraction time. If the first extraction time is longer than the second extraction time, that is, the estimated time based on the compressed image format is shorter, the compressed image is used for depth feature extraction; if the first extraction time is shorter than the second extraction time, that is, the estimated time based on the original image format is shorter, the original image is used for depth feature extraction; if the two are equal, either the original image or the compressed image may be used. For a scene in which the foreground area accounts for only about 20% of the image, the execution time of depth feature extraction using the compressed image is shorter; for a scene with a large foreground proportion, the execution time of depth feature extraction using the original image is shorter.
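A minimal sketch of this selection step: the input whose estimated extraction time is shorter is chosen, and on a tie either may be used.

def choose_input(t_sum1, t_sum2, original_image, compressed_image):
    if t_sum1 > t_sum2:
        return compressed_image      # compressed-format extraction is estimated to be faster
    return original_image            # original-format extraction is faster, or the two are equal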
In one implementation, if the original image is used for depth feature extraction, the original image is input into the original depth neural network model for depth feature extraction. The original deep neural network model is an existing deep neural network model that has not been specially processed; in other words, it is a deep neural network model that is widely used at present.
In one implementation, if the compressed image is used for depth feature extraction, the compressed image is input into the specific depth neural network model for depth feature extraction. The specific deep neural network model is obtained by processing an existing deep neural network model so that it can support depth feature extraction in the compressed image format. Where the formats of adjacent layers do not match, a data format conversion layer is automatically inserted between them, and the layers that take the compressed image format as input are then optimized so that the reduction of the input data size is converted into an execution-time gain, that is, the execution time is shortened as much as possible.
The compressed image format retains only the foreground part of the original image and breaks the original two-dimensional structure of the image. Computation processes such as convolution and downsampling in the original deep neural network model rely on the regular two-dimensional structure of the original image to execute efficiently. For compressed images, the optimized specific depth neural network model adopts a new calculation method to achieve efficient depth feature extraction. Since the downsampling process is very similar to convolution, the new calculation method is described only by taking the convolution calculation process as an example: the operation of a convolution layer on the compressed image is converted into a regular matrix multiplication. First, the parameters of the convolution filters are converted into a parameter matrix according to the shape of the convolution layer filters; then, according to the foreground region data and metadata information recorded in the compressed image, the image data required for the convolution calculation of the foreground region is organized into an image matrix. The convolution result is then obtained by a matrix multiplication between the image matrix and the parameter matrix. Because the compressed image contains only the foreground part, its data size is smaller than that of the original image, which is directly reflected in the dimensions of the image matrix, and the matrix multiplication can effectively convert this into a performance gain. In addition, while organizing the compressed input data into an image matrix, a mapping from the position of each foreground datum in the compressed image to its coordinates in the image matrix needs to be established; this mapping depends only on the input data shape and the convolution operation shape (convolution filter shape, convolution step size, data padding size), so multiple convolution layers in the convolutional neural network model have the opportunity to share the same mapping relationship. The optimized specific depth neural network model can identify the convolution layers that can share a mapping, avoid repeated computation in this process, and further improve its performance in depth feature extraction on compressed images.
The above process can be seen in the exemplary diagram of the convolution calculation process shown in fig. 5, which takes a 2 × 2 convolution as an example. The left side of fig. 5 shows the process of compressing an image. The background area of the original image is white and the foreground area is gray. From the distribution of the foreground regions it can be seen that most convolution windows contain only the invalid background region, and only 4 convolution windows contain a valid foreground region, labeled 1-4 respectively. Combining the foreground data of the compressed image with the metadata information, the position of each foreground datum in the original image can be determined. Each valid convolution window is stretched into a row and arranged into a 4 × 4 image matrix. The three 2 × 2 convolution kernels on the right of fig. 5 are first transformed into a 4 × 3 parameter matrix, each kernel being stretched into one column of the parameter matrix. The convolution result is obtained by multiplying the image matrix by the parameter matrix. Each element of the foreground data of the compressed image is arranged to a specific position in the image matrix, and the elements connected by the curve give an example of the mapping from a position in the foreground data to a position in the image matrix. The specific depth neural network model can determine which convolution layers can share this mapping, so that the mapping can be reused and the performance of the convolution calculation is improved.
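The sketch below mirrors this calculation for a single-channel input with stride 1 and no padding, for brevity. Only the convolution windows that touch the foreground are gathered into an image matrix, which is multiplied by the parameter matrix; for simplicity the image matrix is built from the original image and mask rather than from the compressed foreground data itself. The window-to-matrix mapping depends only on the input shape and the convolution shape, which is what allows layers with identical shapes to share it.

import numpy as np

def build_mapping(mask, kh, kw):
    """Top-left coordinates of the convolution windows that contain foreground pixels."""
    H, W = mask.shape
    return [(r, c)
            for r in range(H - kh + 1)
            for c in range(W - kw + 1)
            if mask[r:r + kh, c:c + kw].any()]

def conv_compressed(image, mask, kernels, mapping):
    """kernels: (num_filters, kh, kw); returns (num_valid_windows, num_filters)."""
    kh, kw = kernels.shape[1:]
    image_matrix = np.stack([image[r:r + kh, c:c + kw].ravel() for r, c in mapping])
    param_matrix = kernels.reshape(kernels.shape[0], -1).T   # each kernel becomes one column
    return image_matrix @ param_matrix

In the fig. 5 example, a 2 × 2 convolution with 4 valid windows and 3 kernels yields a 4 × 4 image matrix, a 4 × 3 parameter matrix, and a 4 × 3 result.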
With reference to the deployment diagram shown in fig. 2, the compressed image depth feature extraction algorithm optimizer 2023, acting on the depth neural network model 2024, can convert the computation performed by the layers of the network on the compressed image into a regular calculation process according to the structural characteristics of the original depth neural network model and the data distribution features of the compressed image, and reduce the overhead of this conversion as much as possible, thereby effectively converting the reduction of the input image data size in the compressed image format into a performance gain of the image depth feature extraction process, i.e., shortening the execution time of depth feature extraction.
Please refer to fig. 6, which is a schematic structural diagram of a deep neural network model according to an embodiment of the present disclosure, where the schematic structural diagram of the deep neural network model may be the specific deep neural network model. In fig. 6, conv1, conv1_1 and conv5_1 are convolution layers with convolution windows of 1 × 1, the convolution step is 1, and the data padding size is 0. conv3_2, conv5_2 and conv5_3 are convolution layers with convolution windows of 3 × 3, convolution step size is 1 and data padding size is 1. relu is the active layer using a linear rectifying function as the active function. concat is a splicing layer, and outputs from three layers of relu1, relu3_2 and relu5_3 are spliced together to form a complete output result. It should be noted that fig. 6 is for example and does not limit the embodiments of the present application.
A format conversion layer is added before the splicing layer to convert the output data in the compressed image format into data in the original image format.
The convolution layers conv1, conv1_1 and conv5_1 can share the mapping required by the convolution calculation, i.e., they can share the mapping from the position of each foreground datum in the compressed image to its coordinates in the image matrix; for example, the mapping from positions in the foreground data to positions in the image matrix marked by the curve in fig. 5 can be shared. Likewise, the convolution layers conv3_2, conv5_2 and conv5_3 can share a mapping. This reduces the overhead of computing on data in the compressed image format and helps obtain a better performance gain.
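A sketch of how such mapping sharing might be arranged: convolution layers are grouped by (filter shape, stride, padding), one mapping is built per group, and every layer in the group reuses it. The layer shapes below mirror fig. 6 and are illustrative.

from collections import defaultdict

layers = {
    "conv1":   (1, 1, 1, 0),   # (kernel_h, kernel_w, stride, padding)
    "conv1_1": (1, 1, 1, 0),
    "conv5_1": (1, 1, 1, 0),
    "conv3_2": (3, 3, 1, 1),
    "conv5_2": (3, 3, 1, 1),
    "conv5_3": (3, 3, 1, 1),
}

groups = defaultdict(list)
for name, shape in layers.items():
    groups[shape].append(name)       # every layer in a group shares one foreground-to-matrix mapping

# groups[(1, 1, 1, 0)] -> ['conv1', 'conv1_1', 'conv5_1']
# groups[(3, 3, 1, 1)] -> ['conv3_2', 'conv5_2', 'conv5_3']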
In the embodiment shown in fig. 3, the first extraction time of depth feature extraction using the original image and the second extraction time of depth feature extraction using the compressed image are estimated, and whichever of the original image and the compressed image has the shorter estimated time is selected for depth feature extraction. This effectively reduces the amount of computation in the image depth feature extraction process and shortens the execution time of video object detection.
As a possible embodiment, after step S303, the image processing apparatus records a third execution time corresponding to each layer for performing depth feature extraction using the original image, and updates the first performance record corresponding to the execution environment information in the database according to the third execution time corresponding to each layer; or recording fourth execution time corresponding to each layer for depth feature extraction by using the compressed image, and updating a second performance record corresponding to the execution environment information in the database according to the fourth execution time corresponding to each layer.
In conjunction with the deployment diagram shown in fig. 2, the performance data corrector 2025 records the execution time of the image feature extraction process and updates the performance records in the database 3013.
The performance data corrector records the execution time of each layer, and the recording method is the same as the step A in the off-line performance data collection process. Following the notation of step S302, the performance data corrector updates the database 3013 according to the following formula:
DB(l, plat, form) ← w × DB(l, plat, form) + (1 − w) × t_l / s_l
where t_l is the execution time of layer l recorded during online execution and s_l is its data scale.
wherein w is an update speed parameter that can be set by the user: when w is smaller, the updated value tends toward the execution time obtained during online execution; when w is larger, it tends toward the execution time obtained by the offline performance data collection process. In addition, when there is no record for the current situation in the DB, w = 0, which is equivalent to inserting a performance record representing the current situation into the database 3013 for the image feature extraction algorithm performance predictor to use in subsequent performance prediction.
The first performance record and the second performance record can be updated according to the formula so as to avoid the situation that data deviation exists between the offline data and the actual application scene, and further to better select to adopt the original image for depth feature extraction or adopt the compressed image for depth feature extraction next time.
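A sketch of this online correction, using the database layout assumed earlier: the measured execution time of a layer is normalized by its data scale and blended into the stored record with the weight w; when the current combination has no record yet, w is set to 0 so the measurement is inserted as a new record.

def update_record(database, key, measured_time, data_scale, w):
    measured_normalized = measured_time / data_scale
    if key not in database:
        w = 0.0                                   # no prior record: insert the measurement directly
    database[key] = w * database.get(key, 0.0) + (1.0 - w) * measured_normalized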
Referring to fig. 4, a schematic flow chart of another image processing method provided in the embodiment of the present application may include, but is not limited to, steps S401 to S403.
Step S401, the data size of the original image is acquired.
Before determining the image feature extraction algorithm, the image processing apparatus may decode a to-be-processed video of the input terminal, and decompose the decoded to-be-processed video to obtain an original image, which may specifically refer to the corresponding description in step S301.
And if the image feature extraction algorithm is determined to be a non-depth feature extraction algorithm, namely a traditional feature extraction algorithm, the image processing device acquires the data scale of the original image. The image processing device can calculate the data scale of the original image according to the characteristics of a specific traditional feature extraction algorithm. For example, the conventional feature extraction algorithm is an algorithm for processing pixel by pixel, such as color space conversion, and the data scale of the original image is the number of pixels of the original image, that is, the product of the width and the height of the original image.
Step S402, determining an execution mode corresponding to the shortest extraction time according to the data scale of the original image and a database;
The image processing device determines the execution mode corresponding to the shortest conventional execution time according to the data scale of the original image and the database. The database includes the basic execution time of each conventional feature extraction algorithm in each execution mode; for example, the basic execution time of conventional feature extraction algorithm 1 in execution mode A is t1, and its basic execution time in execution mode B is t2. For a conventional feature extraction algorithm, different manufacturers may change the algorithm (e.g., change the execution environment and/or change the algorithm itself) to meet their performance or product requirements, for example by changing the version of the called software library, which results in different execution modes of the conventional feature extraction algorithm. The database collects the basic execution time of each conventional feature extraction algorithm in different execution modes, and each conventional feature extraction algorithm has at least one execution mode. The multiple basic execution time records included in the database are collected, processed, and recorded in the offline stage. Note that the basic execution time is an execution time that is independent of the data scale.
In one possible implementation manner, the image processing apparatus may first determine a conventional feature extraction algorithm, and then determine the extraction time of each execution mode by multiplying the basic execution time of that algorithm in each execution mode by the data scale of the original image, thereby obtaining the conventional execution time of each execution mode; the shortest extraction time is then selected and the execution mode corresponding to it is determined.
In another possible implementation manner, the image processing apparatus does not decide in advance which conventional feature extraction algorithm to use; it calculates the extraction time of each conventional feature extraction algorithm in each execution mode according to the data scale of the original image and the basic execution time of each conventional feature extraction algorithm in each execution mode, selects the shortest extraction time among them, and determines the conventional feature extraction algorithm corresponding to the shortest extraction time and the execution mode corresponding to that algorithm.
The image processing device pre-estimates the extraction time of each execution mode under a specific traditional feature extraction algorithm according to the data scale of the original image and the database, or pre-estimates the extraction time of each execution mode under each traditional feature extraction algorithm, so that the calculation amount of the pre-estimated extraction time is reduced, and the image processing device is favorable for quickly determining the execution mode to be adopted for traditional feature extraction.
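The sketch below shows this selection for the second variant, in which both the algorithm and the execution mode are chosen at once; restricting the candidate set to one algorithm gives the first variant. The basic-execution-time table and its values are illustrative assumptions.

def pick_execution_mode(basic_times, data_scale):
    """basic_times: {(algorithm, execution_mode): basic execution time per unit of data scale}."""
    estimates = {key: t * data_scale for key, t in basic_times.items()}
    return min(estimates, key=estimates.get)       # (algorithm, mode) with the shortest estimated time

# Example: a pixel-wise algorithm whose data scale is width * height of the original image
best = pick_execution_mode({("color_space_conversion", "mode_A"): 2e-9,
                            ("color_space_conversion", "mode_B"): 3e-9},
                           1920 * 1080)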
And step S403, performing non-depth feature extraction on the original image by adopting the execution mode.
And under the condition that the execution mode corresponding to the shortest conventional execution time is determined, performing conventional feature extraction by using the execution mode, so that the time consumption of performing conventional feature extraction on the original image by using the execution mode is shortest.
In the embodiment shown in fig. 4, the data scale of the original image is acquired, the execution mode corresponding to the shortest extraction time is determined according to the database, and non-depth feature extraction is then performed using that execution mode. This effectively reduces the amount of computation in the non-depth feature extraction process and shortens the execution time of video object detection.
It should be noted that fig. 3 is a schematic flow chart for the case where the image feature extraction algorithm is a depth feature extraction algorithm, which corresponds to the flows shown in (ii) and (iii) of fig. 2. Fig. 4 is a schematic flow chart for the case where the image feature extraction algorithm is a conventional feature extraction algorithm, which corresponds to the flow shown in (i) of fig. 2.
The following describes an image processing apparatus provided in an embodiment of the present application, the image processing apparatus is used to implement the method embodiments shown in fig. 3 and fig. 4, and a detailed description of the method embodiments may be referred to for non-described portions of the apparatus embodiments.
Referring to fig. 7, a schematic diagram of the logical structure of an image processing apparatus according to an embodiment of the present disclosure is shown; the image processing apparatus 60 includes a conversion unit 602, an estimating unit 603, and an execution unit 604.
The converting unit 602 converts the foreground image corresponding to the original image into a compressed image.
The estimating unit 603 is configured to estimate a first extraction time required for performing depth feature extraction using the original image and a second extraction time required for performing depth feature extraction using the compressed image.
An executing unit 604, configured to determine to perform depth feature extraction by using the original image or perform depth feature extraction by using the compressed image according to the first extraction time and the second extraction time.
In one implementation manner, the image processing apparatus further includes a decoding unit 601, configured to decode a to-be-processed video of an input terminal, and decompose the to-be-processed video after decoding to obtain an original image.
In one implementation, the estimating unit 603 includes:
an environment information acquisition unit for acquiring execution environment information;
the performance record searching unit is used for searching a first performance record and a second performance record corresponding to the execution environment information in a database according to the execution environment information, wherein the database comprises at least one first performance record based on an original image format and at least one second performance record based on a compressed image format;
and the execution time pre-estimating unit is used for pre-estimating first extraction time for performing depth feature extraction by adopting the original image according to the first performance record and pre-estimating second extraction time for performing depth feature extraction by adopting the compressed image according to the second performance record.
In one implementation, the first performance record includes a first normalized execution time corresponding to each layer and execution environment information corresponding to each layer;
the execution time pre-estimation unit is specifically configured to calculate a data scale corresponding to each layer according to the original image; pre-estimating a first execution time corresponding to each layer according to the first normalized execution time corresponding to each layer and the data scale corresponding to each layer; and estimating the first extraction time for extracting the depth features by adopting the original image according to the first execution time corresponding to each layer.
In one implementation, the second performance record includes a second normalized execution time corresponding to each layer and execution environment information corresponding to each layer;
the execution time pre-estimation unit is specifically configured to calculate a data scale corresponding to each layer according to the compressed image; pre-estimating a second execution time corresponding to each layer according to the second normalized execution time corresponding to each layer and the data scale corresponding to each layer; and calculating second extraction time for extracting the depth features by adopting the compressed image according to the second execution time corresponding to each layer.
In one implementation, the execution unit 604 includes:
a judging unit for judging whether the first extraction time is longer than the second extraction time;
the compression execution unit is used for extracting the depth features by adopting the compressed image if the first extraction time is longer than the second extraction time;
and the original execution unit is used for adopting the original image to carry out depth feature extraction if the first extraction time is less than the second extraction time.
In one implementation, the original execution unit is specifically configured to input the original image into an original deep neural network model for depth feature extraction.
In one implementation, the compression execution unit is specifically configured to input the compressed image into a depth-specific neural network model for depth feature extraction.
In one implementation, the image processing apparatus 60 further includes:
the recording unit is used for recording a third execution time corresponding to a layer for performing depth feature extraction by adopting the original image;
the updating unit is used for updating a first performance record corresponding to the execution environment information in the database according to the third execution time;
or, the recording unit is used for recording a fourth execution time corresponding to a layer for performing depth feature extraction by using the compressed image;
and the updating unit is used for updating a second performance record corresponding to the execution environment information in the database according to the fourth execution time.
In one implementation, the image processing apparatus 60 further includes:
the generating unit is used for generating a first performance record and a second performance record corresponding to the execution environment information;
an adding unit to add the first performance record and the second performance record to the database.
The decoding unit 601 may correspond to the decoding decomposer 2011 shown in fig. 2; the conversion unit 602 is used to execute step S301 in the embodiment shown in fig. 3, and may correspond to the video background analyzer 2012 and the compressed image generator 2022 shown in fig. 2; the estimating unit 603 is configured to execute step S302 in the embodiment shown in fig. 3, and may correspond to the image feature extraction algorithm performance predictor 2021 shown in fig. 2; the execution unit 604 is configured to execute step S303 in the embodiment shown in fig. 3, and may correspond to the compressed image depth feature extraction algorithm optimizer 2023 and the deep neural network module 2024 shown in fig. 2, or to the deep neural network module 2024 shown in fig. 2 alone. The above-described recording unit and updating unit may correspond to the performance data corrector 2025 shown in fig. 2. For the specific implementation processes of the above units, reference may be made to the detailed description of the embodiment shown in fig. 3, which is not repeated here.
Referring to fig. 8, an entity structure diagram of the image processing apparatus provided in the embodiment of the present application is shown. The image processing apparatus 70 may include a processor 701 and a memory 702, and may further include an input device 703 and an output device 704. These components may be interconnected via a bus 705 or connected in other ways. The related functions implemented by the decoding unit 601, the conversion unit 602, the estimating unit 603, and the execution unit 604 shown in fig. 7 may be implemented by one or more processors 701.
The processor 701 may include one or more processors, such as one or more central processing units (CPUs) or one or more GPUs; when the processor 701 is a CPU, the CPU may be a single-core CPU or a multi-core CPU. In this embodiment, the processor 701 is configured to execute steps S301 to S304 in the embodiment shown in fig. 3, which may specifically refer to the description of the embodiment shown in fig. 3 and is not repeated here. The processor 701 may be configured to implement the decoding decomposer 2011, the video background analyzer 2012, the compressed image generator 2022, the compressed image depth feature extraction algorithm optimizer 2023, the image feature extraction algorithm performance predictor 2021, the deep neural network module 2024, and the performance data corrector 2025 shown in fig. 2, and may also be used to implement the object location and classification module 203 shown in fig. 2.
The memory 702 includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), and the memory 702 is used for storing related program codes and data. In the embodiment of the present application, if the image processing apparatus can locally store the database 3013, the memory 702 may include the database 3013 shown in fig. 2.
The input device 703 may include, but is not limited to, a display screen, a stylus pen, a keyboard, a mouse, a microphone, and the like, for receiving an input operation by a user. In this embodiment, the input device 703 is used to receive an input video to be processed.
Output device 704 may include, but is not limited to, a display screen, speakers, etc., for outputting audio files, video files, image files, etc. In the embodiment of the present application, the output device 704 is used for outputting the processed image and video.
It will be appreciated that fig. 8 only shows a simplified design of the image processing apparatus. In practical applications, the image processing apparatus may further include other necessary components, including but not limited to any number of transceivers, communication units, and the like; all devices that can implement the present application fall within its protection scope.
One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media capable of storing program codes, such as ROM or RAM, magnetic or optical disks, etc. Accordingly, a further embodiment of the present application provides a computer-readable storage medium having stored therein instructions, which, when executed on a computer, cause the computer to perform the method of the above aspects.
Yet another embodiment of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above aspects.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Claims (18)

1. An image processing method, comprising:
converting a foreground image corresponding to an original image into a compressed image;
estimating first extraction time required by adopting the original image to extract depth features and second extraction time required by adopting the compressed image to extract the depth features;
if the first extraction time is longer than the second extraction time, inputting the compressed image into a deep neural network model for depth feature extraction;
and if the first extraction time is less than the second extraction time, inputting the original image into a deep neural network model for depth feature extraction.
2. The method of claim 1, wherein estimating a first extraction time for depth feature extraction using the original image and a second extraction time for depth feature extraction using the compressed image comprises:
acquiring execution environment information;
searching a first performance record and a second performance record corresponding to the execution environment information in a database according to the execution environment information, wherein the database comprises at least one first performance record suitable for an original image format and at least one second performance record suitable for a compressed image format;
and estimating first extraction time for extracting depth features by adopting the original image according to the first performance record, and estimating second extraction time for extracting depth features by adopting the compressed image according to the second performance record.
3. The method of claim 2, wherein the first performance record comprises a first normalized execution time corresponding to each of a plurality of layers and execution environment information corresponding to each of the layers;
the pre-estimating the first extraction time for extracting the depth features by adopting the original image according to the first performance record comprises the following steps:
calculating the data scale corresponding to each layer according to the original image;
pre-estimating the first execution time corresponding to each layer according to the first normalized execution time corresponding to each layer and the data scale corresponding to each layer;
and estimating the first extraction time for extracting the depth features by adopting the original image according to the first execution time corresponding to each layer.
4. The method of claim 2, wherein the second performance record comprises a second normalized execution time corresponding to each of a plurality of layers and execution environment information corresponding to each of the layers;
the estimating, according to the second performance record, a second extraction time for performing depth feature extraction using the compressed image, including:
calculating the data scale corresponding to each layer according to the compressed image;
pre-estimating a second execution time corresponding to each layer according to the second normalized execution time corresponding to each layer and the data scale corresponding to each layer;
and calculating second extraction time for extracting the depth features by adopting the compressed image according to the second execution time corresponding to each layer.
5. The method according to any one of claims 1 to 4, wherein the inputting the original image into a deep neural network model for depth feature extraction comprises:
and inputting the original image into an original depth neural network model for depth feature extraction.
6. The method according to any one of claims 1 to 4, wherein the inputting the compressed image into a deep neural network model for depth feature extraction comprises:
and inputting the compressed image into a specific depth neural network model for depth feature extraction.
7. The method according to any one of claims 1-4, wherein after determining whether to perform depth feature extraction using the original image or to perform depth feature extraction using the compressed image according to the first extraction time and the second extraction time, further comprising:
recording third execution time corresponding to each layer for performing depth feature extraction by using the original image, and updating a first performance record corresponding to the execution environment information in the database according to the third execution time corresponding to each layer;
or recording a fourth execution time corresponding to each layer for depth feature extraction by using the compressed image, and updating a second performance record corresponding to the execution environment information in the database according to the fourth execution time corresponding to each layer.
8. The method according to any one of claims 1-4, wherein before converting the foreground image corresponding to the original image into the compressed image, the method further comprises:
and generating a first performance record and a second performance record corresponding to the execution environment information, and adding the first performance record and the second performance record to the database.
9. An image processing apparatus characterized by comprising:
the conversion unit is used for converting a foreground image corresponding to the original image into a compressed image;
the pre-estimation unit is used for pre-estimating first extraction time required by adopting the original image to extract the depth features and second extraction time required by adopting the compressed image to extract the depth features;
the compression execution unit is used for inputting the compressed image into a deep neural network model for depth feature extraction if the first extraction time is longer than the second extraction time;
and the original execution unit is used for inputting the original image into a deep neural network model for depth feature extraction if the first extraction time is less than the second extraction time.
10. The apparatus of claim 9, wherein the estimating unit comprises:
an environment information acquisition unit for acquiring execution environment information of the terminal;
the performance record searching unit is used for searching a first performance record and a second performance record corresponding to the execution environment information in a database according to the execution environment information, wherein the database comprises at least one first performance record suitable for an original image format and at least one second performance record suitable for a compressed image format;
and the execution time pre-estimating unit is used for pre-estimating first extraction time for performing depth feature extraction by adopting the original image according to the first performance record and pre-estimating second extraction time for performing depth feature extraction by adopting the compressed image according to the second performance record.
11. The apparatus of claim 10, wherein the first performance record comprises a first normalized execution time corresponding to each of a plurality of layers and execution environment information corresponding to each of the layers;
the execution time pre-estimation unit is specifically configured to calculate a data scale corresponding to each layer according to the original image; pre-estimating the first execution time corresponding to each layer according to the first normalized execution time corresponding to each layer and the data scale corresponding to each layer; and estimating the first extraction time for extracting the depth features by adopting the original image according to the first execution time corresponding to each layer.
12. The apparatus of claim 10, wherein the second performance record comprises a second normalized execution time corresponding to each of a plurality of layers and execution environment information corresponding to each of the layers;
the execution time pre-estimation unit is specifically configured to calculate a data scale corresponding to each layer according to the compressed image; pre-estimating a second execution time corresponding to each layer according to the second normalized execution time corresponding to each layer and the data scale corresponding to each layer; and calculating second extraction time for extracting the depth features by adopting the compressed image according to the second execution time corresponding to each layer.
13. The apparatus according to any one of claims 9 to 12, wherein the raw execution unit is specifically configured to input the raw image into a raw deep neural network model for depth feature extraction.
14. The apparatus according to any one of claims 9 to 12, wherein the compression execution unit is specifically configured to input the compressed image into a depth-specific neural network model for depth feature extraction.
15. The apparatus according to any one of claims 9-12, further comprising:
the recording unit is used for recording a third execution time corresponding to each layer for performing depth feature extraction by adopting the original image;
the updating unit is used for updating a first performance record corresponding to the execution environment information in the database according to the third execution time corresponding to each layer;
or, the recording unit is configured to record a fourth execution time corresponding to each layer, where the depth feature extraction is performed using the compressed image;
and the updating unit is used for updating a second performance record corresponding to the execution environment information in the database according to the fourth execution time corresponding to each layer.
16. The apparatus according to any one of claims 9-12, further comprising:
the generating unit is used for generating a first performance record and a second performance record corresponding to the execution environment information;
an adding unit to add the first performance record and the second performance record to the database.
17. An image processing apparatus, comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 8.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 8.
CN201711456179.1A 2017-12-28 2017-12-28 Image processing method and device Active CN109982088B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711456179.1A CN109982088B (en) 2017-12-28 2017-12-28 Image processing method and device
PCT/CN2018/121110 WO2019128735A1 (en) 2017-12-28 2018-12-14 Imaging processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711456179.1A CN109982088B (en) 2017-12-28 2017-12-28 Image processing method and device

Publications (2)

Publication Number Publication Date
CN109982088A CN109982088A (en) 2019-07-05
CN109982088B true CN109982088B (en) 2021-07-16

Family

ID=67066442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711456179.1A Active CN109982088B (en) 2017-12-28 2017-12-28 Image processing method and device

Country Status (2)

Country Link
CN (1) CN109982088B (en)
WO (1) WO2019128735A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941994B (en) * 2019-10-30 2021-05-04 杭州电子科技大学 Pedestrian re-identification integration method based on meta-class-based learner
CN115361536B (en) * 2022-07-26 2023-06-06 鹏城实验室 Panoramic image compression method and device, intelligent device and storage medium
CN115482147B (en) * 2022-09-14 2023-04-28 中国人民大学 Efficient parallel graph processing method and system based on compressed data direct calculation
CN115713103B (en) * 2022-11-24 2023-08-18 辉羲智能科技(上海)有限公司 On-chip-inter-chip interconnected neural network chip joint performance prediction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663362A (en) * 2012-04-09 2012-09-12 宁波中科集成电路设计中心有限公司 Moving target detection method based on gray features
CN102737251A (en) * 2011-03-31 2012-10-17 索尼公司 Image processing apparatus, image processing method, program, and recording medium
CN106657718A (en) * 2016-11-07 2017-05-10 金陵科技学院 Data transfer system to realize virtual reality and the method thereof
CN106897722A (en) * 2015-12-18 2017-06-27 南京财经大学 A kind of trademark image retrieval method based on region shape feature

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770566B (en) * 2008-12-30 2012-05-30 复旦大学 Quick three-dimensional human ear identification method
KR101927886B1 (en) * 2012-03-19 2018-12-11 삼성전자주식회사 Image processing apparatus and method thereof

Also Published As

Publication number Publication date
WO2019128735A1 (en) 2019-07-04
CN109982088A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
EP4198820A1 (en) Training method for semi-supervised learning model, image processing method, and device
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
CN111797893B (en) Neural network training method, image classification system and related equipment
CN109982088B (en) Image processing method and device
CN109934285B (en) Deep learning-based image classification neural network compression model system
CN112990211B (en) Training method, image processing method and device for neural network
WO2021155792A1 (en) Processing apparatus, method and storage medium
CN111126258A (en) Image recognition method and related device
CN111931764B (en) Target detection method, target detection frame and related equipment
CN112668588B (en) Parking space information generation method, device, equipment and computer readable medium
WO2024041479A1 (en) Data processing method and apparatus
EP4105828A1 (en) Model updating method and related device
CN112419202B (en) Automatic wild animal image recognition system based on big data and deep learning
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
WO2022111387A1 (en) Data processing method and related apparatus
CN112068555A (en) Voice control type mobile robot based on semantic SLAM method
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN113011568A (en) Model training method, data processing method and equipment
CN115577768A (en) Semi-supervised model training method and device
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
CN114332553A (en) Image processing method, device, equipment and storage medium
CN114529750A (en) Image classification method, device, equipment and storage medium
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN117036658A (en) Image processing method and related equipment
CN114065915A (en) Network model construction method, data processing method, device, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant