CN115439692A - Image processing method and device, electronic equipment and medium - Google Patents

Image processing method and device, electronic equipment and medium

Info

Publication number
CN115439692A
Authority
CN
China
Prior art keywords
image
expansion coefficient
feature map
processed
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211106978.7A
Other languages
Chinese (zh)
Inventor
夏春龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Apollo Zhixing Technology Guangzhou Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Apollo Zhixing Technology Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd, Apollo Zhixing Technology Guangzhou Co Ltd filed Critical Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202211106978.7A priority Critical patent/CN115439692A/en
Publication of CN115439692A publication Critical patent/CN115439692A/en
Pending legal-status Critical Current

Classifications

    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06N 3/02, G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/20, G06V 10/34: Image preprocessing; smoothing or thinning of the pattern; morphological operations; skeletonisation
    • G06V 10/77, G06V 10/80: Processing image or video features in feature spaces; fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/50, G06V 20/56, G06V 20/58: Scenes; context or environment of the image exterior to a vehicle using on-vehicle sensors; recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing method and apparatus, an electronic device and a medium, relating to the field of artificial intelligence and in particular to computer vision. The scheme is as follows: an image to be processed is acquired, and feature extraction is performed on it using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in a pre-trained feature extraction network, yielding a target feature map, output by the network, that contains the features of the image to be processed. The processing of each layer of dynamic convolution operators comprises predicting a dynamic expansion coefficient from the received feature map and convolving the received feature map with that coefficient to obtain an output feature map; the processing of each layer of static convolution operators comprises convolving the received feature map with a default expansion coefficient. A recognition result for the image to be processed is then determined from the target feature map. The accuracy of image recognition can thereby be improved.

Description

Image processing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of computer vision technology.
Background
In the field of computer vision, computer vision algorithms are commonly used for image detection or image classification. In either case, features are first extracted from the image; a target in the image is then detected, or the category of the image determined, from the extracted feature map.
Disclosure of Invention
The embodiment of the disclosure provides an image processing method, an image processing device, electronic equipment and a medium.
In a first aspect of the embodiments of the present disclosure, an image processing method is provided, including:
acquiring an image to be processed;
performing feature extraction on the image to be processed by using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in a pre-trained feature extraction network, to obtain a target feature map, output by the feature extraction network, that comprises features of the image to be processed; wherein the processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; and the processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map;
and determining the recognition result of the image to be processed based on the target feature map.
In a second aspect of the disclosed embodiments, there is provided an image processing apparatus comprising:
the acquisition module is used for acquiring an image to be processed;
the feature extraction module is used for performing feature extraction on the image to be processed acquired by the acquisition module, using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in a pre-trained feature extraction network, to obtain a target feature map, output by the feature extraction network, that comprises features of the image to be processed; wherein the processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; and the processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map;
and the identification module is used for determining the identification result of the image to be processed based on the target feature map extracted by the feature extraction module.
In a third aspect of the disclosed embodiment, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
In a fourth aspect of the disclosed embodiments, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the first aspect.
In a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspects.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary diagram illustrating a convolution operation provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a first image processing method provided by an embodiment of the present disclosure;
FIG. 3 is an exemplary schematic diagram of receptive field sizes provided by an embodiment of the present disclosure;
FIG. 4 is an exemplary diagram of the processing of a dynamic convolution operator provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of a second image processing method provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of a third image processing method provided by an embodiment of the present disclosure;
FIG. 7 is a flowchart of a method for training a feature extraction network provided by an embodiment of the present disclosure;
FIG. 8 is a flowchart of a fourth image processing method provided by an embodiment of the present disclosure;
FIG. 9 is a flowchart of a fifth image processing method provided by an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present disclosure;
FIG. 11 is a block diagram of an electronic device for implementing an image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, in the field of computer vision, computer vision algorithms can be modeled by manually extracting image features. However, manual feature extraction depends on human experience and knowledge, so extraction precision is hard to guarantee and efficiency is low, which hinders algorithm iteration; the accuracy of image classification or image detection with such an algorithm is therefore low.
Alternatively, image features can be extracted by a convolutional neural network for computer vision modeling. However, the receptive field of a convolutional neural network is fixed during feature extraction, while the targets to be recognized in different images differ in size. A fixed receptive field cannot suit targets of all sizes, so the extracted features are less accurate and the accuracy of image classification or detection with such an algorithm is also low.
The receptive field referred to in the embodiments of the present disclosure is first described below:
The receptive field of a feature point in a feature map output by a layer of a convolutional neural network is the size of the region of the original input image from which that feature point is computed.
With reference to FIG. 1, the left image in FIG. 1 is the original input image, and the dashed box in it represents the convolution kernel of the first convolution. After a convolution with a kernel size of 3 × 3 and a stride of 2, the intermediate image in FIG. 1 is obtained. Each feature point in the intermediate image is computed from 3 × 3 pixels of the original input image, so after the first convolution the receptive field of the resulting feature map is 3 × 3.
The dashed box in the intermediate image in FIG. 1 represents the convolution kernel of the second convolution. After a convolution with a kernel size of 2 × 2 and a stride of 1 is applied to the intermediate image, the right image in FIG. 1 is obtained. Each feature point in the right image is computed from 2 × 2 feature points of the intermediate image, and each of those feature points is in turn computed from 3 × 3 pixels of the original input image, so after the second convolution the receptive field of the resulting feature map is 5 × 5.
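The receptive-field growth illustrated in FIG. 1 follows the standard recurrence: starting from RF = 1 with jump j = 1, each layer with kernel k and stride s updates RF to RF + (k - 1) × j and j to j × s. A minimal sketch that reproduces the two numbers above (the layer parameters are those of the FIG. 1 example; the code is illustrative only and not part of the patent):

    def receptive_field(layers):
        # layers: list of (kernel_size, stride) pairs, applied in order
        rf, jump = 1, 1
        for k, s in layers:
            rf += (k - 1) * jump   # each layer widens the receptive field by (k - 1) * jump
            jump *= s              # jump = product of all strides seen so far
        return rf

    print(receptive_field([(3, 2)]))          # 3 -> the 3 x 3 receptive field after the first convolution
    print(receptive_field([(3, 2), (2, 1)]))  # 5 -> the 5 x 5 receptive field after the second convolution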
Therefore, the larger the receptive field of the feature map is, the larger the corresponding area of each feature point in the feature map on the original input image is, so that the method is more favorable for identifying the target with larger size; conversely, the smaller the receptive field of the feature map is, the smaller the corresponding area of each feature point in the feature map on the original input image is, so that the smaller size of the object can be more favorably identified. For example, the size of a house in an image is generally larger than the size of a person in the image, and thus a larger field of view is required to identify the house, and a smaller field of view is required to identify the person.
In autonomous driving and vehicle-road cooperation scenarios, obstacles can be classified or located from collected road images to assist driving decisions. Owing to the shooting angle or the actual size of each obstacle, obstacles in a road image differ in size; for example, a car in the image usually occupies a larger pixel area than a pedestrian. A fixed receptive field cannot match both the size of the car and the size of the pedestrian, resulting in inaccurate classification or localization of obstacles in the road.
To solve the above problem, embodiments of the present disclosure provide an image processing method applied to an electronic device, for example a desktop computer, a mobile phone, or a server with image processing capability. As shown in FIG. 2, the method comprises the following steps:
s201, acquiring an image to be processed.
Optionally, the image to be processed may be a video frame acquired by a camera in a vehicle, or a video frame acquired by a camera installed on a roadside, or a road image input by a user, or a road image specified by the user, or a road image in a road image set specified by the user.
S202, performing feature extraction on the image to be processed by utilizing at least one layer of dynamic convolution operator and multiple layers of static convolution operators in the pre-trained feature extraction network to obtain a target feature map which is output by the feature extraction network and comprises the features of the image to be processed.
The processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map. In this convolution operation, the kernel size, stride, number of kernels, and amount of padding of the dynamic convolution operator are all default values.
The processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map. In this convolution operation, the kernel size, stride, number of kernels, and amount of padding of the static convolution operator are likewise default values.
And S203, determining the recognition result of the image to be processed based on the target characteristic diagram.
After the target feature map is obtained, the category of the image to be processed and/or the position of the target area in the image to be processed and the like can be identified based on the target feature map.
In the embodiments of the present disclosure, the feature extraction network is used to extract features from the image to be processed, a target feature map comprising those features is obtained, and the recognition result of the image is determined from the target feature map. The feature extraction network contains at least one layer of dynamic convolution operators, each of which predicts a dynamic expansion coefficient for the feature map it receives and convolves that feature map using the predicted coefficient. In other words, the expansion coefficient is adapted to each received feature map. Because adjusting the expansion coefficient adjusts the receptive field accordingly, the embodiments of the present disclosure can adaptively adjust the receptive field of feature extraction to the image being processed, without manual intervention; this improves the accuracy of feature extraction and, in turn, the accuracy of image recognition.
The expansion coefficient of a convolution (i.e., its dilation rate) determines the spacing between adjacent convolution points in the kernel: the interval between adjacent points equals the expansion coefficient minus 1. Adjusting the expansion coefficient therefore dynamically adjusts the receptive field of feature extraction; the larger the expansion coefficient, the larger the receptive field, and vice versa.
For example, referring to fig. 3, fig. 3 includes three images in which regions of size 7 × 7 each represent an input image or feature map and dark squares represent convolution kernels. The left image in fig. 3 represents a convolution operation with K =3 and S =1, that is, a convolution operation with a convolution kernel size (K) of 3 × 3 and an expansion coefficient (S) of 1, and the size of the receptive field is 3 × 3 in the case where a 7 × 7 region represents the input image.
The intermediate image in FIG. 3 represents a convolution with K = 3 and S = 2; the interval between adjacent convolution points in the kernel is 2 - 1 = 1, i.e., adjacent convolution points are separated by one pixel or one feature point. When the 7 × 7 region represents the input image, the receptive field is 5 × 5, because the single-side size of the equivalent convolution kernel is 2 × (3 - 1) + 1 = 5 (the single-side size being the length of one side).
The right image in FIG. 3 represents a convolution with K = 3 and S = 3; the interval between adjacent convolution points in the kernel is 3 - 1 = 2, i.e., adjacent convolution points are separated by two pixels or two feature points. When the 7 × 7 region represents the input image, the receptive field is 7 × 7, because the single-side size of the equivalent convolution kernel is 3 × (3 - 1) + 1 = 7.
It can be seen that, in the case of the same convolution kernel size, the larger the expansion coefficient, the larger the size of the equivalent convolution kernel, and thus the larger the receptive field.
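Written compactly, for a kernel size K and an expansion coefficient S the equivalent single-side kernel size is S × (K - 1) + 1. A quick check of the three cases of FIG. 3 (illustrative code, not part of the patent):

    def equivalent_kernel_size(K: int, S: int) -> int:
        # single-side size of the equivalent kernel of a dilated convolution
        return S * (K - 1) + 1

    for S in (1, 2, 3):
        print(f"K=3, S={S} -> equivalent single-side size {equivalent_kernel_size(3, S)}")
    # K=3, S=1 -> 3; K=3, S=2 -> 5; K=3, S=3 -> 7,
    # matching the 3 x 3, 5 x 5 and 7 x 7 receptive fields shown in FIG. 3.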
In the embodiments of the present disclosure, the dynamic convolution operator described above predicts the dynamic expansion coefficient based on the received feature map in any of the following three modes:
Mode one specifically comprises the following three steps:
step 1, carrying out global pooling operation on the received feature map.
In the embodiment of the present disclosure, in a case that the dynamic convolution operator is the first operator of the feature extraction network, the feature map received by the dynamic convolution operator is the image to be processed.
When the dynamic convolution operator is not the first operator of the feature extraction network, the feature map it receives includes the feature map output by the previous operator. The previous operator may be, for example, a dynamic convolution operator, a static convolution operator, or a pooling operator; that is, the feature extraction network may also contain other types of operators, such as pooling operators.
Alternatively, when the dynamic convolution operator is not the first operator of the feature extraction network, the feature map it receives may include both the feature map output by the previous operator and feature maps output by operators before it. For example, if the first static convolution operator of one residual block of the ResNet18 feature extraction network is replaced by a dynamic convolution operator, and the resulting network is used as the feature extraction network in S202, then the feature map received by that dynamic convolution operator includes both the feature map output by the previous residual block and the feature map input to that previous residual block.
The feature map received by the dynamic convolution operator depends on the specific structure of the feature extraction network, which is not specifically limited in the present disclosure.
Optionally, the global pooling operation may be global average pooling (GAP), global max pooling, or global min pooling, among others.
Step 2, performing fully connected processing on the global pooling result to obtain the probability that the received feature map matches each preset expansion coefficient.
Each preset expansion coefficient may be determined according to the size of the feature map processed by the convolution operation, the size of the convolution kernel, and/or the size range of the target to be identified.
For example, a plurality of preset expansion coefficients may be set on the basis of the principle that, after a preset expansion coefficient is added to the convolution kernel, the equivalent size of the convolution kernel does not exceed the feature map size of the convolution operation processing.
For another example, when the target to be recognized is a vehicle license plate in a road image, since the size of the license plate is generally small, a plurality of small preset expansion coefficients may be set.
The specific value of each preset expansion coefficient may be set according to actual conditions, which is not specifically limited in the embodiment of the present disclosure.
Step 3, selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
The preset expansion coefficient with the highest probability matches the received feature map best, and can therefore be used as the dynamic expansion coefficient.
In this way, the global pooling operation and the fully connected operation predict how well each preset expansion coefficient matches the received feature map, and the coefficient with the highest probability is selected. The received feature map is thus convolved with the most suitable expansion coefficient, so the receptive field of the convolution better fits the received feature map and the accuracy of image recognition is improved.
Mode two: performing a depthwise separable convolution on the received feature map, then performing fully connected processing on the result to obtain the probability that the received feature map matches each preset expansion coefficient, and selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
Because a depthwise separable convolution needs fewer parameters than an ordinary convolution, the dynamic convolution operator is more efficient in this mode. The depthwise separable convolution and the fully connected operation predict how well each preset expansion coefficient matches the received feature map, and the coefficient with the highest probability is selected; the received feature map is thus convolved with the most suitable expansion coefficient, so the receptive field of the convolution better fits the received feature map and the accuracy of image recognition is improved.
Mode three: performing a global pooling operation on the received feature map, performing a depthwise separable convolution on the pooling result, then performing fully connected processing on the result of the depthwise separable convolution to obtain the probability that the received feature map matches each preset expansion coefficient, and selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
Because the global pooling operation reduces the amount of data in the received feature map, the depthwise separable convolution has less data to process and its computation is reduced, making the dynamic convolution operator more efficient. The global pooling, depthwise separable convolution and fully connected operations predict how well each preset expansion coefficient matches the received feature map, and the coefficient with the highest probability is selected; the received feature map is thus convolved with the most suitable expansion coefficient, so the receptive field of the convolution better fits the received feature map and the accuracy of image recognition is improved.
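A minimal PyTorch sketch of the coefficient predictor used in modes two and three (the class name, kernel size, and the extra global pooling before the fully connected layer, which fixes the input size, are illustrative assumptions, not taken from the patent):

    import torch.nn as nn
    import torch.nn.functional as F

    class SeparablePredictor(nn.Module):
        """Predicts the match probability of each preset expansion coefficient using a
        depthwise separable convolution followed by a fully connected layer (mode two);
        mode three additionally applies global pooling before the separable convolution."""
        def __init__(self, in_ch, num_coeffs, pool_first=False):
            super().__init__()
            self.pool_first = pool_first
            self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # depthwise part
            self.pointwise = nn.Conv2d(in_ch, in_ch, 1)                           # pointwise part
            self.fc = nn.Linear(in_ch, num_coeffs)

        def forward(self, x):                                  # x: (B, C, H, W)
            if self.pool_first:                                # mode three
                x = F.adaptive_avg_pool2d(x, 1)
            x = self.pointwise(self.depthwise(x))              # depthwise separable convolution
            x = F.adaptive_avg_pool2d(x, 1).flatten(1)         # (B, C)
            return self.fc(x).softmax(dim=1)                   # match probability per coefficient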
The description of the related steps in the second and third modes can refer to the first mode, and is not repeated here.
With reference to FIG. 4, the processing of the dynamic convolution operator is described below, taking mode one as an example:
as shown in fig. 4, the dynamic convolution operator first receives the feature map, and the size of the received feature map is C × H × W, where C denotes the number of channels, H denotes the height of the feature map, and W denotes the width of the feature map. And the dynamic convolution operator performs global pooling on the received feature map to obtain a pooled feature vector with the size of C multiplied by 1. And then carrying out full-connection processing on the pooled feature vectors to obtain full-connection feature vectors with the size of Nx 1 x 1, wherein N represents the number of preset expansion coefficients, and each numerical value in the full-connection feature vectors represents the probability of matching one preset expansion coefficient with the received feature map. And then selecting a preset expansion coefficient with the maximum probability as a dynamic expansion coefficient, and performing convolution operation on the received characteristic diagram according to the dynamic expansion coefficient to obtain an output characteristic diagram.
The image processing method provided by the embodiment of the disclosure can be applied to an image classification scene, and at this time, the feature extraction network belongs to a pre-trained image classification model, and the image classification model further includes a classification sub-network.
The image classification model may be built on models such as ResNet18, DenseNet, GoogLeNet, MobileNet, or the Squeeze-and-Excitation network (SENet).
Referring to fig. 5, the method for determining the recognition result of the image to be processed based on the target feature map in S203 includes the following steps:
S2031, inputting the target feature map into the global pooling layer of the classification sub-network to obtain a pooled feature map of the image to be processed.
Optionally, the operation of the global pooling layer on the target feature map may be: global average pooling, global maximum pooling, or global minimum pooling, etc.
S2032, inputting the pooled feature map into the fully connected layer of the classification sub-network to obtain the probability that the image to be processed belongs to each preset category.
The preset category may be determined according to requirements in an actual application scenario.
For example, in the field of smart driving, the preset categories may include vehicle images, tree images, curb images, traffic light images, pedestrian images, and the like.
In the field of traffic management of public transportation, the preset categories may include bus images, truck images, taxi images, bicycle images, electric bicycle images, motorcycle images, and the like.
In the smart home field, the preset categories may include a television image, a refrigerator image, a washing machine image, a stereo image, a dishwasher image, and the like.
And S2033, selecting the preset category with the maximum probability as the category of the image to be processed.
The preset category with the highest probability represents the category to which the image to be processed most likely belongs, and therefore the preset category is taken as the category of the image to be processed.
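A minimal sketch of the classification sub-network described in S2031-S2033 (the class name and layer sizes are illustrative assumptions):

    import torch.nn as nn
    import torch.nn.functional as F

    class ClassificationHead(nn.Module):
        """Global pooling layer followed by a fully connected layer."""
        def __init__(self, feat_ch, num_classes):
            super().__init__()
            self.fc = nn.Linear(feat_ch, num_classes)

        def forward(self, target_feature_map):                                # (B, C, H, W)
            pooled = F.adaptive_avg_pool2d(target_feature_map, 1).flatten(1)  # S2031: pooled feature map
            probs = self.fc(pooled).softmax(dim=1)                            # S2032: probability per preset category
            return probs.argmax(dim=1), probs                                 # S2033: category with the highest probability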
The embodiments of the present disclosure can thus be applied to image classification scenarios to identify the category of the image to be processed. Because a dynamic convolution operator is used during feature extraction, the receptive field of feature extraction better fits the image to be processed, so the target feature map obtained is more accurate and classification based on it is more accurate, i.e., the category of the image to be processed can be determined more precisely.
The image processing method provided by the embodiment of the disclosure can be applied to an image detection scene, and at this time, the feature extraction network belongs to a pre-trained image detection model, and the image detection model further includes a target detection sub-network.
The image detection model may be built on models such as a Region-based Convolutional Neural Network (R-CNN), a Single Shot multibox Detector (SSD), You Only Look Once (YOLO), or a Feature Pyramid Network (FPN).
Referring to fig. 6, the manner of determining the recognition result of the image to be processed based on the target feature map in S203 may include the following steps:
S2034, performing feature fusion on the target feature map to obtain multi-scale features of the target feature map.
For example, a feature pyramid network can be used to fuse the target feature map and obtain its multi-scale features.
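A minimal sketch of this feature-fusion step using the feature pyramid network module in torchvision (the channel sizes and the assumption that the backbone exposes three intermediate feature maps are illustrative):

    from collections import OrderedDict
    import torch
    from torchvision.ops import FeaturePyramidNetwork

    # Stand-ins for three intermediate feature maps of the feature extraction network.
    feats = OrderedDict(
        c3=torch.randn(1, 128, 80, 80),
        c4=torch.randn(1, 256, 40, 40),
        c5=torch.randn(1, 512, 20, 20),
    )
    fpn = FeaturePyramidNetwork(in_channels_list=[128, 256, 512], out_channels=256)
    multi_scale = fpn(feats)                      # multi-scale features, 256 channels per level
    print({name: tuple(f.shape) for name, f in multi_scale.items()})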
S2035, detecting the position of the target area in the image to be processed and the target in the target area based on the multi-scale features of the target feature map.
The target area position may be a bounding box (bounding box) position where the target is located, and may be represented by a center point position of the bounding box and a bounding box size, for example.
Alternatively, the target area position may be the position of each pixel point occupied by the target.
When the target in the target area is detected, the probability that the target belongs to each preset category is obtained, and the preset category with the highest probability is selected as the category of the target.
The embodiments of the present disclosure can thus be applied to image detection scenarios to identify the position of a target area in the image to be processed and the target within it. Because a dynamic convolution operator is used during feature extraction, the receptive field of feature extraction better fits the image to be processed, so the target feature map is more accurate and detection based on it is more accurate, i.e., the position of the target area and the target within it can be determined more precisely.
Referring to FIG. 7, the feature extraction network in the embodiments of the present disclosure may be obtained by training with the following steps:
s701, obtaining a sample image and an annotation result of the sample image.
In the case that the disclosed embodiments are applied to different fields, the selected sample images may be different.
For example, in the field of smart driving, the sample image may be an image captured by an onboard camera, such as a video frame of a video taken by a vehicle recorder installed in a vehicle, including various types of obstacles, such as vehicles, trees, curbs, traffic lights, pedestrians, and the like.
In the field of traffic management of public transportation, the sample image may be a road image taken by a roadside unit, including various types of vehicles, such as buses, vans, taxis, bicycles, electric bicycles, motorcycles, and the like.
In the smart home field, the sample image may be an image collected by a sweeping robot or a home camera, including various types of home appliances, such as a television, a refrigerator, a washing machine, a stereo, a dishwasher, and the like.
The embodiment of the present disclosure may also be applied to other fields, and accordingly, the type of the target to be included in the sample image may also be determined according to actual requirements, which is not specifically limited by the embodiment of the present disclosure.
In the case of training the image classification model, the labeling result of the sample image represents the actual category of the sample image. In the case of training the image detection model, the labeling result of the sample image indicates the actual target region position in the sample image and the actual class of the target in the target region.
S702, performing feature extraction on the sample image by using at least one layer of dynamic convolution operator and multiple layers of static convolution operators in the feature extraction network to obtain a target feature map which is output by the feature extraction network and comprises the features of the sample image.
When training an image classification model, the feature extraction network may be obtained by replacing at least one static convolution operator of the feature extraction network in ResNet18, DenseNet, GoogLeNet, MobileNet, or SENet with a dynamic convolution operator.
When training an image detection model, the feature extraction network may be obtained by replacing at least one static convolution operator of the feature extraction network in R-CNN, SSD, YOLO, or FPN with a dynamic convolution operator.
Optionally, the position and the number of the static convolution operators to be replaced may be set according to actual requirements, for example, according to accuracy requirements and computational efficiency requirements, which is not specifically limited in the embodiment of the present disclosure.
The size and number of convolution kernels of convolution operation in the replaced dynamic convolution operator may be the same as or different from those of the replaced static convolution operator, and this is not specifically limited in the embodiments of the present disclosure.
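As an illustration of such a replacement, the sketch below swaps one static convolution operator of a torchvision ResNet18 backbone for the DynamicDilationConv2d sketched earlier (the choice of layer3[1].conv1 and the preset coefficients are assumptions made for this example only):

    import torchvision

    backbone = torchvision.models.resnet18(weights=None)
    old = backbone.layer3[1].conv1                       # first static convolution operator of a residual block
    backbone.layer3[1].conv1 = DynamicDilationConv2d(
        in_ch=old.in_channels, out_ch=old.out_channels,
        kernel_size=3, dilations=(1, 2, 3),
    )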
The processing procedure of the sample image by the feature extraction network is the same as the processing procedure of the image to be processed, and reference may be made to the above description, which is not repeated herein.
And S703, determining the recognition result of the sample image based on the target feature map.
The process of determining the recognition result of the sample image is the same as the process of determining the recognition result of the image to be processed, and reference may be made to the above description, which is not repeated herein.
And S704, calculating a loss function value based on the labeling result and the identification result.
The loss function value represents the error between the labeling result and the identification result, and can reflect the accuracy of the identification result.
The loss function value may be calculated based on a preset loss function, the annotation result, and the recognition result. For example, the preset loss function may be: a mean square error loss function, a mean absolute error loss function, or a cross entropy loss function, etc.
S705, training the feature extraction network by using the loss function value to obtain the trained feature extraction network.
In the embodiments of the present disclosure the feature extraction network belongs to an image detection model or an image classification model, so the network parameters of that model may be adjusted based on the loss function value. It is then judged whether the current model meets a preset saving condition; if so, the current model is saved, and iterative training continues with the next batch of sample images until training ends. For each saved model, a validation image is input into the model to obtain the recognition result it outputs, and the loss function value between the labeling result of that image and the model's recognition result is computed. The model with the smallest loss function value is selected as the image detection model or image classification model, and the feature extraction network it contains is used as the trained feature extraction network.
The preset saving condition indicates that the model's recognition accuracy is high. For example, the preset saving condition may be that the loss function value is smaller than a preset threshold, or that the loss function value is the smallest so far in the iteration, which is not specifically limited in the embodiments of the present disclosure.
Optionally, a maximum iteration number may be set, and when the training iteration number reaches the maximum iteration number, it is determined that the training is finished. Or, a preset threshold value may be set, and when the loss function value is smaller than the preset threshold value, it is determined that the training is finished. The convergence condition of the model is not particularly limited in the embodiments of the present disclosure.
Because the model with the smallest loss function value has the highest recognition accuracy, using it as the image classification model or image detection model makes subsequent classification or detection of images with that model more accurate.
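A minimal training-loop sketch for S704-S705 (the optimizer, loss function and checkpointing policy are illustrative choices consistent with, but not prescribed by, the text above):

    import torch

    def train(model, loader, epochs=10, lr=1e-3, ckpt="best.pt"):
        """Adjust the network parameters with the loss function value and keep the
        checkpoint whose loss is smallest (one possible saving condition)."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()            # one of the preset loss functions mentioned above
        best = float("inf")
        for _ in range(epochs):
            for images, labels in loader:                # sample images and their labeling results
                logits = model(images)                   # recognition result of the sample images
                loss = loss_fn(logits, labels)           # S704: loss between labeling and recognition results
                opt.zero_grad()
                loss.backward()
                opt.step()                               # S705: train the network with the loss value
                if loss.item() < best:                   # saving condition: smallest loss so far
                    best = loss.item()
                    torch.save(model.state_dict(), ckpt)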
Referring to fig. 8, the following describes an image classification process in the embodiment of the present disclosure, taking an application in the field of intelligent driving as an example:
s801, acquiring a road image acquired by the vehicle-mounted camera.
S802, performing feature extraction on the road image by using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in the pre-trained feature extraction network to obtain a target feature map, output by the feature extraction network, that comprises the features of the road image. The processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; the processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map.
S803, inputting the target feature map into the global pooling layer of the classification sub-network of the image classification model to obtain a pooled feature map of the road image.
S804, inputting the pooled feature map into the fully connected layer of the classification sub-network to obtain the probability that the road image belongs to each preset obstacle category.
The preset barrier categories comprise vehicles, trees, kerbs, traffic lights, pedestrians and the like.
And S805, selecting the preset obstacle type with the maximum probability as the type of the road image.
In this way, the embodiments of the present disclosure can extract features of the road image more accurately, so that the obstacle category contained in the road image is identified from a more accurate target feature map. Obstacles such as vehicles and pedestrians on the road can therefore be recognized more accurately in intelligent driving scenarios.
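Putting the illustrative sketches above together, S801-S805 might look roughly as follows (the category list, input size and 512-channel feature map are assumptions tied to the ResNet18-based example, not to the patent itself; backbone and ClassificationHead are reused from the earlier sketches):

    import torch

    categories = ["vehicle", "tree", "curb", "traffic light", "pedestrian"]
    head = ClassificationHead(feat_ch=512, num_classes=len(categories))

    road_image = torch.randn(1, 3, 224, 224)                      # stand-in for an onboard-camera frame (S801)
    trunk = torch.nn.Sequential(*list(backbone.children())[:-2])  # backbone without its own pooling/fc head
    target_feature_map = trunk(road_image)                        # S802: (1, 512, 7, 7)
    category_idx, probs = head(target_feature_map)                # S803-S805
    print(categories[int(category_idx)])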
Referring to fig. 9, the following describes an image detection process in the embodiment of the present disclosure, taking an application in the field of smart driving as an example:
and S901, acquiring a road image acquired by the vehicle-mounted camera.
S902, performing feature extraction on the road image by using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in the pre-trained feature extraction network to obtain a target feature map, output by the feature extraction network, that comprises the features of the road image. The processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; the processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map.
And S903, performing feature fusion on the target feature map to obtain the multi-scale features of the target feature map.
And S904, detecting the position of the obstacle region in the road image and the obstacle in the obstacle region based on the multi-scale features of the target feature map.
In this way, the embodiments of the present disclosure can extract features of the road image more accurately, so that the positions of obstacle regions in the road image and the obstacles within them are identified from a more accurate target feature map. Obstacles such as vehicles and pedestrians on the road, and their positions, can therefore be recognized more accurately in intelligent driving scenarios.
Based on the same inventive concept, corresponding to the above method embodiment, the disclosed embodiment provides an image processing apparatus, as shown in fig. 10, the apparatus including: an acquisition module 1001, a feature extraction module 1002 and an identification module 1003;
an obtaining module 1001 configured to obtain an image to be processed;
the feature extraction module 1002 is configured to perform feature extraction on the image to be processed acquired by the acquisition module 1001, using at least one layer of dynamic convolution operators and multiple layers of static convolution operators in a pre-trained feature extraction network, to obtain a target feature map, output by the feature extraction network, that comprises features of the image to be processed; wherein the processing of each layer of dynamic convolution operators comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; and the processing of each layer of static convolution operators comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map;
and the identifying module 1003 is configured to determine an identification result of the image to be processed based on the target feature map extracted by the feature extracting module 1002.
Optionally, the feature extraction module 1002 is specifically configured to:
perform a global pooling operation on the received feature map;
perform fully connected processing on the global pooling result to obtain the probability that the received feature map matches each preset expansion coefficient; and
select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
Optionally, the feature extraction module 1002 is specifically configured to:
perform a depthwise separable convolution operation on the received feature map;
perform fully connected processing on the result of the depthwise separable convolution operation to obtain the probability that the received feature map matches each preset expansion coefficient; and
select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
Optionally, the feature extraction module 1002 is specifically configured to:
perform a global pooling operation on the received feature map;
perform a depthwise separable convolution operation on the result of the global pooling operation;
perform fully connected processing on the result of the depthwise separable convolution operation to obtain the probability that the received feature map matches each preset expansion coefficient; and
select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
Optionally, the feature extraction network belongs to a pre-trained image classification model, and the image classification model further includes a classification sub-network;
the identifying module 1003 is specifically configured to:
inputting the target feature map into the global pooling layer of the classification sub-network to obtain a pooled feature map of the image to be processed;
inputting the pooled feature map into the fully connected layer of the classification sub-network to obtain the probability that the image to be processed belongs to each preset category; and
selecting the preset category with the highest probability as the category of the image to be processed.
Optionally, the feature extraction network belongs to a pre-trained image detection model, and the image detection model further includes a target detection subnetwork;
the identifying module 1003 is specifically configured to:
performing feature fusion on the target feature map to obtain multi-scale features of the target feature map;
and detecting the position of a target area in the image to be processed and a target in the target area based on the multi-scale features of the target feature map.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the images involved comply with relevant laws and regulations and do not violate public order and good morals.
It should be noted that the sample image in the present embodiment may be from a public data set.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 11, the device 1100 comprises a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. Various programs and data necessary for the operation of the electronic device 1100 may also be stored in the RAM 1103. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to one another by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in electronic device 1100 are connected to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, and the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the electronic device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 can be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the methods and processes described above, such as the image processing method. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the image processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuits, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. An image processing method comprising:
acquiring an image to be processed;
performing feature extraction on the image to be processed by using at least one dynamic convolution operator layer and multiple static convolution operator layers in a pre-trained feature extraction network, to obtain a target feature map that is output by the feature extraction network and comprises features of the image to be processed; wherein the processing of each dynamic convolution operator layer comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; and the processing of each static convolution operator layer comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map;
and determining the recognition result of the image to be processed based on the target feature map.
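To make the claimed pipeline concrete, the following is a minimal, illustrative PyTorch sketch of the feature extraction of claim 1; it is not the patented implementation. The class names, the 3×3 kernel size, the channel counts, and the trivial fixed predictor used only to make the example runnable are assumptions; learned coefficient predictors corresponding to claims 2-4 are sketched after those claims.

```python
# Illustrative sketch of claim 1 (assumed PyTorch implementation, not the patented code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConvOperator(nn.Module):
    """Dynamic convolution operator: a predictor estimates an expansion (dilation)
    coefficient from the received feature map, which is then convolved with it."""

    def __init__(self, in_ch, out_ch, predictor):
        super().__init__()
        self.predictor = predictor  # e.g. one of the predictor sketches after claims 2-4
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)

    def forward(self, x):
        d = self.predictor(x)  # predicted dynamic expansion coefficient
        return F.conv2d(x, self.weight, padding=d, dilation=d)


class StaticConvOperator(nn.Module):
    """Static convolution operator: convolves with a fixed default expansion coefficient."""

    def __init__(self, in_ch, out_ch, default_dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3,
                              padding=default_dilation, dilation=default_dilation)

    def forward(self, x):
        return self.conv(x)


# Feature extraction network: multiple static layers plus at least one dynamic layer.
# A fixed predictor (always 2) is used here only so the sketch runs end to end.
backbone = nn.Sequential(
    StaticConvOperator(3, 16), nn.ReLU(),
    StaticConvOperator(16, 32), nn.ReLU(),
    DynamicConvOperator(32, 64, predictor=lambda feat: 2), nn.ReLU(),
)
target_feature_map = backbone(torch.randn(1, 3, 224, 224))  # shape (1, 64, 224, 224)
```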
2. The method of claim 1, wherein the predicting a dynamic expansion coefficient based on the received feature map comprises:
performing a global pooling operation on the received feature map;
performing fully connected processing on the result of the global pooling operation to obtain a probability that the received feature map matches each preset expansion coefficient;
and selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
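A minimal sketch of the claim-2 predictor follows, again assuming PyTorch. The class name, the preset coefficient set (1, 2, 3), and the use of a softmax to turn the fully connected output into matching probabilities are assumptions; the claim only requires global pooling, fully connected processing, and selection of the highest-probability preset coefficient.

```python
# Sketch of the claim-2 coefficient predictor (illustrative names and choices).
import torch.nn as nn
import torch.nn.functional as F


class PoolFCPredictor(nn.Module):
    """Global pooling -> fully connected layer -> preset expansion coefficient
    with the highest matching probability."""

    def __init__(self, in_ch, preset_dilations=(1, 2, 3)):
        super().__init__()
        self.preset_dilations = preset_dilations
        self.fc = nn.Linear(in_ch, len(preset_dilations))

    def forward(self, feat):
        pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)  # global pooling operation
        probs = self.fc(pooled).softmax(dim=1)              # matching probability per preset coefficient
        return self.preset_dilations[int(probs.mean(0).argmax())]
```

Such a predictor can be plugged in as the predictor argument of the DynamicConvOperator sketched after claim 1, e.g. DynamicConvOperator(32, 64, PoolFCPredictor(32)).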
3. The method of claim 1, wherein the predicting a dynamic expansion coefficient based on the received feature map comprises:
performing a depthwise separable convolution operation on the received feature map;
performing fully connected processing on the result of the depthwise separable convolution operation to obtain a probability that the input feature map matches each preset expansion coefficient;
and selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
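The claim-3 variant replaces the global pooling step with a depthwise separable convolution applied directly to the received feature map. The claim does not say how the convolution output is reduced to a vector before the fully connected layer; the sketch below assumes a global average pooling for that reduction, and all names remain illustrative.

```python
# Sketch of the claim-3 coefficient predictor (illustrative; the pooling before
# the fully connected layer is an implementation assumption, not claim language).
import torch.nn as nn
import torch.nn.functional as F


class DSConvFCPredictor(nn.Module):
    """Depthwise separable convolution -> fully connected layer -> preset
    expansion coefficient with the highest matching probability."""

    def __init__(self, in_ch, preset_dilations=(1, 2, 3)):
        super().__init__()
        self.preset_dilations = preset_dilations
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # depthwise part
        self.pointwise = nn.Conv2d(in_ch, in_ch, 1)                           # pointwise part
        self.fc = nn.Linear(in_ch, len(preset_dilations))

    def forward(self, feat):
        y = self.pointwise(self.depthwise(feat))    # depthwise separable convolution
        y = F.adaptive_avg_pool2d(y, 1).flatten(1)  # assumed reduction to a vector
        probs = self.fc(y).softmax(dim=1)           # matching probability per preset coefficient
        return self.preset_dilations[int(probs.mean(0).argmax())]
```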
4. The method of claim 1, wherein the predicting a dynamic expansion coefficient based on the received feature map comprises:
performing a global pooling operation on the received feature map;
performing a depthwise separable convolution operation on the result of the global pooling operation;
performing fully connected processing on the result of the depthwise separable convolution operation to obtain a probability that the input feature map matches each preset expansion coefficient;
and selecting the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
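The claim-4 variant chains all three steps: global pooling, a depthwise separable convolution on the pooled result, and fully connected processing. Because the globally pooled map is 1×1 in the spatial dimensions, the sketch below uses 1×1 kernels for the depthwise separable convolution; that kernel size, like the other names, is an assumption.

```python
# Sketch of the claim-4 coefficient predictor (illustrative names and kernel sizes).
import torch.nn as nn
import torch.nn.functional as F


class PoolDSConvFCPredictor(nn.Module):
    """Global pooling -> depthwise separable convolution -> fully connected
    layer -> preset expansion coefficient with the highest matching probability."""

    def __init__(self, in_ch, preset_dilations=(1, 2, 3)):
        super().__init__()
        self.preset_dilations = preset_dilations
        self.depthwise = nn.Conv2d(in_ch, in_ch, 1, groups=in_ch)  # 1x1 kernel: the pooled map is 1x1
        self.pointwise = nn.Conv2d(in_ch, in_ch, 1)
        self.fc = nn.Linear(in_ch, len(preset_dilations))

    def forward(self, feat):
        pooled = F.adaptive_avg_pool2d(feat, 1)                # global pooling operation
        y = self.pointwise(self.depthwise(pooled)).flatten(1)  # depthwise separable convolution
        probs = self.fc(y).softmax(dim=1)                      # matching probability per preset coefficient
        return self.preset_dilations[int(probs.mean(0).argmax())]
```

Any of the three predictor sketches can serve as the predictor of the dynamic convolution operator sketched after claim 1.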
5. The method of any of claims 1-4, wherein the feature extraction network belongs to a pre-trained image classification model, the image classification model further comprising a classification sub-network;
the determining the recognition result of the image to be processed based on the target feature map comprises:
inputting the target feature map into a global pooling layer of the classification sub-network to obtain a pooled feature map of the image to be processed;
inputting the pooled feature map into a fully connected layer of the classification sub-network to obtain a probability that the image to be processed belongs to each preset category;
and selecting the preset category with the highest probability as the category of the image to be processed.
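For claim 5, a minimal sketch of the classification sub-network is given below, assuming PyTorch; the head name, input channel count, and number of preset categories are assumptions chosen only for illustration.

```python
# Sketch of the claim-5 classification sub-network (illustrative sizes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassificationHead(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_ch, num_classes)  # fully connected layer

    def forward(self, target_feature_map):
        pooled = F.adaptive_avg_pool2d(target_feature_map, 1).flatten(1)  # global pooling layer
        probs = self.fc(pooled).softmax(dim=1)  # probability per preset category
        return probs.argmax(dim=1)              # preset category with the highest probability


head = ClassificationHead(in_ch=64, num_classes=10)
category = head(torch.randn(1, 64, 56, 56))     # e.g. tensor([3])
```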
6. The method of any of claims 1-4, wherein the feature extraction network belongs to a pre-trained image detection model that further includes a target detection sub-network;
the determining the recognition result of the image to be processed based on the target feature map comprises:
performing feature fusion on the target feature map to obtain multi-scale features of the target feature map;
and detecting the position of a target area in the image to be processed and a target in the target area based on the multi-scale features of the target feature map.
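Claim 6's detection path can be sketched in a similar spirit; the FPN-style top-down fusion, the number of scales, and the single shared prediction convolution below are assumptions rather than details taken from the patent.

```python
# Sketch of the claim-6 multi-scale detection path (illustrative, FPN-style assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDetectionHead(nn.Module):
    def __init__(self, in_ch, num_classes=10, num_scales=3):
        super().__init__()
        self.downsample = nn.ModuleList(
            [nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1) for _ in range(num_scales - 1)]
        )
        # Per location: 4 box offsets (target-area position) + class scores (target category).
        self.pred = nn.Conv2d(in_ch, 4 + num_classes, 1)

    def forward(self, target_feature_map):
        # Derive coarser scales from the target feature map.
        feats = [target_feature_map]
        for down in self.downsample:
            feats.append(down(feats[-1]))
        # Top-down feature fusion: upsample coarse features and add them to finer ones.
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1], size=feats[i].shape[-2:], mode="nearest")
        # Detect on every fused scale: box regression and classification per location.
        return [self.pred(f) for f in feats]


head = MultiScaleDetectionHead(in_ch=64)
multi_scale_outputs = head(torch.randn(1, 64, 56, 56))  # list of (1, 14, H_s, W_s) maps
```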
7. An image processing apparatus comprising:
an acquisition module configured to acquire an image to be processed;
a feature extraction module configured to perform feature extraction on the image to be processed acquired by the acquisition module by using at least one dynamic convolution operator layer and multiple static convolution operator layers in a pre-trained feature extraction network, to obtain a target feature map that is output by the feature extraction network and comprises features of the image to be processed; wherein the processing of each dynamic convolution operator layer comprises: predicting a dynamic expansion coefficient based on the received feature map, and performing a convolution operation on the received feature map according to the dynamic expansion coefficient to obtain an output feature map; and the processing of each static convolution operator layer comprises: performing a convolution operation on the received feature map according to a default expansion coefficient to obtain an output feature map;
and an identification module configured to determine a recognition result of the image to be processed based on the target feature map extracted by the feature extraction module.
8. The apparatus of claim 7, wherein the feature extraction module is specifically configured to:
perform a global pooling operation on the received feature map;
perform fully connected processing on the result of the global pooling operation to obtain a probability that the received feature map matches each preset expansion coefficient;
and select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
9. The apparatus of claim 7, wherein the feature extraction module is specifically configured to:
perform a depthwise separable convolution operation on the received feature map;
perform fully connected processing on the result of the depthwise separable convolution operation to obtain a probability that the input feature map matches each preset expansion coefficient;
and select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
10. The apparatus of claim 7, wherein the feature extraction module is specifically configured to:
perform a global pooling operation on the received feature map;
perform a depthwise separable convolution operation on the result of the global pooling operation;
perform fully connected processing on the result of the depthwise separable convolution operation to obtain a probability that the input feature map matches each preset expansion coefficient;
and select the preset expansion coefficient with the highest probability as the dynamic expansion coefficient.
11. The apparatus according to any of claims 7-10, wherein the feature extraction network belongs to a pre-trained image classification model, the image classification model further comprising a classification sub-network;
the identification module is specifically configured to:
input the target feature map into a global pooling layer of the classification sub-network to obtain a pooled feature map of the image to be processed;
input the pooled feature map into a fully connected layer of the classification sub-network to obtain a probability that the image to be processed belongs to each preset category;
and select the preset category with the highest probability as the category of the image to be processed.
12. The apparatus according to any one of claims 7-10, wherein the feature extraction network belongs to a pre-trained image detection model, the image detection model further comprising a target detection sub-network;
the identification module is specifically configured to:
perform feature fusion on the target feature map to obtain multi-scale features of the target feature map;
and detect the position of a target area in the image to be processed and a target in the target area based on the multi-scale features of the target feature map.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202211106978.7A 2022-09-09 2022-09-09 Image processing method and device, electronic equipment and medium Pending CN115439692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106978.7A CN115439692A (en) 2022-09-09 2022-09-09 Image processing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211106978.7A CN115439692A (en) 2022-09-09 2022-09-09 Image processing method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115439692A true CN115439692A (en) 2022-12-06

Family

ID=84247585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106978.7A Pending CN115439692A (en) 2022-09-09 2022-09-09 Image processing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115439692A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151355A (en) * 2023-04-19 2023-05-23 之江实验室 Method, device, medium and equipment for model training and service execution

Similar Documents

Publication Publication Date Title
CN113902897B (en) Training of target detection model, target detection method, device, equipment and medium
CN112417953B (en) Road condition detection and map data updating method, device, system and equipment
CN114118124B (en) Image detection method and device
CN113936256A (en) Image target detection method, device, equipment and storage medium
CN113299073B (en) Method, device, equipment and storage medium for identifying illegal parking of vehicle
CN114332977A (en) Key point detection method and device, electronic equipment and storage medium
CN115861380A (en) End-to-end unmanned aerial vehicle visual target tracking method and device in foggy low-light scene
CN116310993A (en) Target detection method, device, equipment and storage medium
CN115439692A (en) Image processing method and device, electronic equipment and medium
CN114005095A (en) Vehicle attribute identification method and device, electronic equipment and medium
CN113569912A (en) Vehicle identification method and device, electronic equipment and storage medium
CN113920158A (en) Training and traffic object tracking method and device of tracking model
CN113569911A (en) Vehicle identification method and device, electronic equipment and storage medium
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium
CN114549961B (en) Target object detection method, device, equipment and storage medium
CN115995075A (en) Vehicle self-adaptive navigation method and device, electronic equipment and storage medium
CN115527187A (en) Method and device for classifying obstacles
CN115909253A (en) Target detection and model training method, device, equipment and storage medium
CN115761698A (en) Target detection method, device, equipment and storage medium
CN110634155A (en) Target detection method and device based on deep learning
CN113989300A (en) Lane line segmentation method and device, electronic equipment and storage medium
CN114708498A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114267021A (en) Object recognition method and device, storage medium and electronic equipment
CN113344121A (en) Method for training signboard classification model and signboard classification
CN111753960A (en) Model training and image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination