CN109447169B - Image processing method, training method and apparatus for a model thereof, and electronic system

Info

Publication number: CN109447169B
Application number: CN201811306459.9A
Authority: CN (China)
Prior art keywords: positioning, network, segmentation, classification, image
Legal status: Active (granted)
Other versions: CN109447169A (application publication)
Original language: Chinese (zh)
Inventors: 黎泽明, 俞刚
Current Assignee: Beijing Kuangshi Technology Co Ltd
Original Assignee: Beijing Kuangshi Technology Co Ltd
Application filed by Beijing Kuangshi Technology Co Ltd; priority to CN201811306459.9A

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/21 Design or setup of recognition systems or techniques; G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques; G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V10/00 Arrangements for image or video recognition or understanding; G06V10/20 Image preprocessing; G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Abstract

The invention provides an image processing method, a training method and apparatus for a model thereof, and an electronic system. The training method comprises the following steps: acquiring candidate regions of a target training image through a feature extraction network and a region candidate network; carrying out example positioning and example segmentation on the candidate regions through a positioning segmentation network and calculating loss values, to obtain a positioning region and a segmentation region which contain an example, together with a positioning loss value and a segmentation loss value; classifying the candidate regions through a classification network and calculating a loss value, to obtain classification results and a classification loss value for the candidate regions; and training the networks according to the loss values until the loss values converge, to obtain an image processing model. In the invention, example positioning and example segmentation are realized by the same branch network, so that the two tasks can share feature information and promote each other; this is favorable for improving the accuracy of example positioning and example segmentation, and further improves the overall accuracy of example positioning, segmentation and classification.

Description

Image processing method, training method and apparatus for a model thereof, and electronic system
Technical Field
The invention relates to the technical field of image processing, and in particular to an image processing method, a training method and apparatus for a model thereof, and an electronic system.
Background
Instance Segmentation is an important task in computer vision that provides instance-level detection and segmentation for the various objects in a picture. Instance segmentation provides important clues for a computer to understand a picture more accurately, and plays an important role in fields such as automatic driving. In the related art, instance segmentation is mainly realized on the basis of the classic target detection method FPN (Feature Pyramid Network), by extending a branch for instance segmentation on top of the FPN. In this approach, instance segmentation is split into a detection part and a segmentation part; the detection part comprises a positioning task and a classification task, which are realized through the same branch network; the segmentation part is realized by a separate branch network.
However, the above method realizes segmentation by simply adding a branch network, and does not allocate the tasks according to their characteristics. For example, the classification task and the positioning task differ greatly: the classification task needs global semantic information, while the positioning task needs local edge information; realizing the two through the same branch network easily causes information loss, and results in poor accuracy of the final instance segmentation.
Disclosure of Invention
In view of the above, the present invention provides an image processing method, a training method and apparatus for a model thereof, and an electronic system, so as to improve the accuracy of example positioning and example segmentation, and further improve the overall accuracy of example positioning, segmentation and classification.
In a first aspect, an embodiment of the present invention provides a method for training an image processing model, including: acquiring a candidate region of a target training image through a preset feature extraction network and a region candidate network; carrying out example positioning and example segmentation on the candidate area through a preset positioning segmentation network, and calculating loss values of the example positioning and example segmentation to obtain a positioning area, a segmentation area, a positioning loss value and a segmentation loss value which contain an example; classifying the candidate regions through a preset classification network, and calculating a classification loss value to obtain a classification result and a classification loss value of the candidate regions; and training the feature extraction network, the regional candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain an image processing model.
In a preferred embodiment of the present invention, the positioning segmentation network comprises a convolutional network; the classification network comprises a fully connected network.
In a preferred embodiment of the present invention, the step of obtaining the candidate region of the target training image through the preset feature extraction network and the region candidate network includes: carrying out feature extraction processing on the target training image through a preset feature extraction network to obtain an initial feature map of the target training image; performing feature fusion processing on the initial feature map to obtain a fusion feature map; and extracting a candidate region from the fusion characteristic diagram through a preset region candidate network.
In a preferred embodiment of the present invention, the step of performing instance location and instance segmentation on the candidate area through a preset location segmentation network includes: adjusting the size of the candidate area to a size matched with the convolution network; performing example detection processing and example segmentation processing on the adjusted candidate area through a convolutional network to obtain a positioning area and a segmentation area which contain complete examples; the positioning area is marked by a detection frame; the divided regions are identified by color.
In a preferred embodiment of the present invention, the target training image carries a positioning label and a segmentation label corresponding to each instance; the step of calculating loss values for instance localization and instance segmentation comprises: substituting the positioning area and the positioning label corresponding to the instance contained in the positioning area into a preset positioning loss function to obtain a positioning loss value; and substituting the segmentation area and the segmentation label corresponding to the instance contained in the segmentation area into a preset segmentation loss function to obtain a segmentation loss value.
In a preferred embodiment of the present invention, the step of classifying the candidate regions through a preset classification network includes: adjusting the size of the candidate area to a size matching the fully connected network; and inputting the adjusted candidate area into the full-connection network, and outputting the classification result of the candidate area.
In a preferred embodiment of the present invention, the target training image carries a classification label corresponding to each instance; a step of calculating a categorical loss value, comprising: and substituting the classification result of the candidate region and the classification label corresponding to the example contained in the candidate region into a preset classification loss function to obtain a classification loss value.
In a second aspect, an embodiment of the present invention provides an image processing method, which is applied to an apparatus configured with an image processing model; the image processing model is obtained by training the training method of the image processing model; the method comprises the following steps: acquiring an image to be processed; and inputting the image to be processed into an image processing model, and outputting the positioning area, the segmentation area and the classification result of each instance in the image to be processed.
In a preferred embodiment of the present invention, the step of obtaining the image to be processed includes: acquiring an image to be processed through a camera device of a vehicle; after the step of outputting the positioning area, the segmentation area and the classification result of each instance in the image to be processed, the method further comprises: and generating a driving command according to the positioning area, the segmentation area and the classification result so as to enable the vehicle to automatically drive according to the driving command.
In a third aspect, an embodiment of the present invention provides a training apparatus for an image processing model, including: the region acquisition module is used for acquiring a candidate region of the target training image through a preset feature extraction network and a region candidate network; the positioning and dividing module is used for carrying out example positioning and example dividing on the candidate area through a preset positioning and dividing network, and calculating loss values of the example positioning and example dividing to obtain a positioning area, a dividing area, a positioning loss value and a dividing loss value which contain an example; the classification module is used for classifying the candidate regions through a preset classification network and calculating the classification loss value to obtain the classification result and the classification loss value of the candidate regions; and the training module is used for training the feature extraction network, the regional candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain an image processing model.
In a fourth aspect, an embodiment of the present invention provides an image processing apparatus, where the apparatus is disposed in a device configured with an image processing model; the image processing model is obtained by training the training method of the image processing model; the device comprises: the image acquisition module is used for acquiring an image to be processed; and the image input module is used for inputting the image to be processed into the image processing model and outputting the positioning area, the segmentation area and the classification result of each example in the image to be processed.
In a fifth aspect, an embodiment of the present invention provides an electronic system, including: the device comprises an image acquisition device, a processing device and a storage device; the image acquisition equipment is used for acquiring preview video frames or image data; the storage means has stored thereon a computer program which, when run by the processing apparatus, performs the training method of the image processing model as described above, or performs the image processing method as described above.
In a sixth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processing device to execute the training method of the image processing model or execute the steps of the image processing method.
The embodiment of the invention has the following beneficial effects:
after the candidate area of the target training image is obtained through the preset feature extraction network and the area candidate network, the candidate area is subjected to example positioning and example segmentation through the positioning segmentation network, and the corresponding loss values are calculated to obtain a positioning area and a segmentation area containing an example; the candidate area is classified through the classification network and the corresponding loss value is calculated to obtain the classification result of the candidate area; and the feature extraction network, the area candidate network, the positioning segmentation network and the classification network are trained according to the positioning loss value, the segmentation loss value and the classification loss value until all the loss values converge, so as to obtain an image processing model. In this method, example positioning and example segmentation are realized by the same branch network, so that the two tasks can share feature information and promote each other; this is favorable for improving the accuracy of example positioning and example segmentation, and further improves the overall accuracy of example positioning, segmentation and classification.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training an image processing model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an initial feature diagram provided in an embodiment of the present invention;
FIG. 4 is a diagram illustrating an image processing model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an image processing model in the prior art;
FIG. 6 is a flowchart of an image processing method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a training apparatus for an image processing model according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In view of the problem that the final example segmentation accuracy is poor due to unreasonable task allocation in the existing example segmentation mode, embodiments of the present invention provide an image processing method and a training method, apparatus, and electronic system for a model thereof, which may be applied to various devices such as a server, a computer, a camera, a mobile phone, a tablet computer, a vehicle central control device, and the like, and may be implemented by using corresponding software and hardware, and the following describes embodiments of the present invention in detail.
The first embodiment is as follows:
first, an example electronic system 100 for implementing an image processing method and a training method, apparatus, and electronic system of a model thereof according to an embodiment of the present invention will be described with reference to fig. 1.
As shown in FIG. 1, an electronic system 100 includes one or more processing devices 102, one or more memory devices 104, an input device 106, an output device 108, and one or more image capture devices 110, which are interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and structure of the electronic system 100 shown in fig. 1 are exemplary only, and not limiting, and that the electronic system may have other components and structures as desired.
The processing device 102 may be a gateway or an intelligent terminal, or a device including a Central Processing Unit (CPU) or other form of processing unit having data processing capability and/or instruction execution capability, and may process data of other components in the electronic system 100 and may control other components in the electronic system 100 to perform desired functions.
The storage 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processing device 102 to implement client functionality (implemented by the processing device) and/or other desired functionality in embodiments of the present invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image capture device 110 may capture preview video frames or image data and store the captured preview video frames or image data in the storage 104 for use by other components.
For example, the devices in the example electronic system for implementing the image processing method and the training method and apparatus for the model thereof according to the embodiment of the present invention and the electronic system may be integrally arranged, or may be dispersedly arranged, such as integrally arranging the processing device 102, the storage device 104, the input device 106 and the output device 108, and arranging the image capturing device 110 at a designated position where a target image can be captured. When the above-described devices in the electronic system are integrally provided, the electronic system may be implemented as, for example, a camera, a smart phone, a tablet computer, a vehicle central control device, or the like.
Example two:
the embodiment provides a training method of an image processing model, which is executed by a processing device in the electronic system; the processing device may be any device or chip having data processing capabilities. The processing equipment can independently process the received information, can also be connected with the server, jointly analyzes and processes the information, and uploads a processing result to the cloud.
The image processing model is mainly used for example segmentation, and the example segmentation generally comprises operations such as positioning, segmentation and classification of examples; as shown in fig. 2, the training method of the image processing model includes the following steps:
step S202, acquiring a candidate region of a target training image through a preset feature extraction network and a region candidate network;
the feature extraction network can be obtained by training through a VGG16 network, an RseNet network and the like. Generally, the target training image includes various examples, such as a person, an animal, a still, and the like; each instance may have multiple instances, such as three people included in the image, person 1, person 2, person 3, etc. The training purpose of the image processing model is to locate, segment and identify classes the various instances in the image, and each instance in each instance.
The candidate regions can be identified by candidate frames of preset sizes; a plurality of regions which possibly contain instances are selected from the target training image or its feature map for subsequent example positioning, classification and segmentation. When extracting candidate regions, the candidate frame may take multiple specifications: for example, for a certain pixel point in the target training image or its feature map, the pixel point is taken as the candidate frame center, and the size of the candidate frame may take multiple values, such as 2 × 7, 3 × 6, 5 × 5, 6 × 3, 7 × 2 and the like, so as to obtain multiple image regions centered on that pixel point; the other pixel points are then taken as candidate frame centers in turn to obtain multiple image regions centered on each of them.
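As an illustration only (the patent does not specify an implementation), the following Python sketch enumerates candidate frames of the example sizes above, centered on each pixel of a feature map; the function name and the 14 × 14 feature-map size are hypothetical.

```python
# Illustrative sketch (not from the patent): enumerate candidate frames of
# several preset sizes centered on every pixel of a feature map.
from itertools import product

def generate_candidate_boxes(height, width,
                             sizes=((2, 7), (3, 6), (5, 5), (6, 3), (7, 2))):
    """Return (x1, y1, x2, y2) frames of every preset size centered on every pixel."""
    boxes = []
    for cy, cx in product(range(height), range(width)):
        for h, w in sizes:
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# Example: a 14 x 14 feature map with 5 frame sizes yields 14 * 14 * 5 = 980 frames.
print(len(generate_candidate_boxes(14, 14)))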
After obtaining the plurality of image regions, it is generally further necessary to classify and screen these image regions so as to obtain the image regions which may include instances; these image regions are the candidate regions. This process may be implemented by a pre-trained neural network; for example, when extracting candidate regions from the feature map of the target training image, the neural network may be an RPN (Region Proposal Network), i.e. the region candidate network mentioned above.
Step S204, carrying out example positioning and example segmentation on the candidate area through a preset positioning segmentation network, and calculating loss values of the example positioning and example segmentation to obtain a positioning area, a segmentation area, a positioning loss value and a segmentation loss value containing an example;
Considering that the tasks of example positioning and example segmentation need to use position-sensitive information, such as edge feature information and local edge feature information, while the classification task needs to use global semantic information, in this embodiment the image processing model is divided into two branch networks: one is the positioning segmentation network used for example positioning and example segmentation, and the other is the classification network used for classification.
Under this division of tasks, the positioning segmentation network can concentrate on extracting the position-sensitive information in each candidate area, and can share the extracted feature maps, feature information and the like between the tasks of example positioning and example segmentation; the two tasks therefore promote each other, for example by jointly improving the network's ability to find boundaries. The classification network can concentrate on extracting the global semantic information in each candidate area without having to extract position-sensitive information.
By comparison, in the existing related approach, example positioning and classification are realized through the same network branch, which needs to extract position-sensitive information and global semantic information at the same time; this easily causes the loss of part of the feature information. For example, if this network branch is realized through a fully connected network, the edge information used for example positioning is easily lost, resulting in poor positioning accuracy; if the network branch is realized through a convolutional network, the extraction of global semantic information is hindered, resulting in poor classification accuracy. In addition, since example segmentation is implemented by another network branch that needs to re-extract feature information from the candidate area, it is difficult to share information with the feature information related to example positioning.
Specifically, in S204, the positioning segmentation network may be implemented by using a neural network, such as a convolutional network, which is favorable for extracting position-sensitive information; after the candidate area is input into the positioning and dividing network, the positioning and dividing network can perform example positioning and example dividing on the candidate area at the same time, or perform example positioning first and then perform example dividing to finally obtain a positioning area and a dividing area containing examples; typically, the positioning area is identified by a detection box, such as a rectangular detection box, which usually contains the complete instance; the edges of the divided areas are the edges of the examples, and the divided areas of different examples can be distinguished by different colors.
In the training process of the image processing model, the accuracy of the model needs to be evaluated, so that the target training image is usually marked with a standard positioning area and a standard segmentation area of each instance in advance, which can also be called as a positioning label and a segmentation label; after example positioning and example segmentation are completed on each candidate area of the target training image by the positioning segmentation network, the positioning area and the segmentation area of each example are output, the difference between the positioning area and the positioning label of each example is calculated through a preset loss function to obtain the positioning loss value, and the difference between the segmentation area and the segmentation label of each example is calculated to obtain the segmentation loss value.
Step S206, classifying the candidate regions through a preset classification network, and calculating a classification loss value to obtain a classification result and a classification loss value of the candidate regions;
the classification network can be implemented by using a neural network which is beneficial to extracting global semantic information, such as a fully-connected network and the like; after the candidate area is input into the classification network, the classification network obtains semantic information of the context of the candidate area in a semantic segmentation mode and the like, and then obtains global semantic information; classifying the candidate region based on the global semantic information to obtain a classification result; the classification result may be a classification identifier such as a person, a ground, a cup, or the like. In order to evaluate the classification result of the model, the classification result may be compared with the classification labels (i.e., standard classification results) of each instance carried in the target training image, and specifically, a difference between the classification result and the classification label may be calculated through a preset classification loss function, so as to obtain a loss classification loss value.
And S208, training the feature extraction network, the regional candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain an image processing model.
In the training process, parameters of the feature extraction network, the area candidate network, the positioning segmentation network and the classification network can be modified through the positioning loss value, the segmentation loss value and the classification loss value, so that all the loss values are converged, and the training process is finished. In the training process, the number of target training images used may be multiple, for example, the same target training image is used for repeated training, when the loss values are all converged, another target training image is used for repeated training, after the loss values are all converged, a third target training image is used for repeated training, and so on, the performance of the model is further stabilized.
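A minimal training-loop sketch is given below, assuming a PyTorch implementation in which the four sub-networks are wrapped in one module that returns the three loss values; the module interface, optimizer and hyperparameters are assumptions and are not taken from the patent.

```python
# Minimal sketch (assumed PyTorch implementation, not the patent's code):
# jointly optimize all four sub-networks with the sum of the three losses.
import torch

def train(model, data_loader, epochs=12, lr=0.02):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    for _ in range(epochs):
        for images, targets in data_loader:
            # the wrapped model is assumed to return the three loss values of steps S204 and S206
            loc_loss, seg_loss, cls_loss = model(images, targets)
            total_loss = loc_loss + seg_loss + cls_loss
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
```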
According to the above training method of the image processing model, the candidate region of the target training image is obtained through a preset feature extraction network and a region candidate network; the candidate region is subjected to example positioning and example segmentation through a positioning segmentation network, and the corresponding loss values are calculated to obtain a positioning region and a segmentation region containing an example; the candidate region is classified through a classification network and the corresponding loss value is calculated to obtain the classification result of the candidate region; and the feature extraction network, the region candidate network, the positioning segmentation network and the classification network are trained according to the positioning loss value, the segmentation loss value and the classification loss value until all the loss values converge, to obtain an image processing model. In this method, example positioning and example segmentation are realized by the same branch network, so that the two tasks can share feature information and promote each other; this is favorable for improving the accuracy of example positioning and example segmentation, and further improves the overall accuracy of example positioning, segmentation and classification.
Example three:
the embodiment provides another training method of an image processing model, which is implemented on the basis of the above embodiment; in this embodiment, a specific implementation manner of obtaining a candidate region of a target training image is described in a focused manner; the method comprises the following steps:
step 302, performing feature extraction processing on a target training image through a preset feature extraction network to obtain an initial feature map of the target training image;
wherein, the sample images used for training the feature extraction network can be obtained from the ImageNet data set or other data sets. In the process of training the feature extraction network, the performance of the network can be evaluated through the Top-1 classification error, which can be expressed as: Top1 = (number of samples whose correct label differs from the best label output by the network) / (total number of samples). After the training of the feature extraction network is finished, the target training image is input into the feature extraction network, and an initial feature map of the target training image is output. Specifically, this step 302 can also be implemented by the following steps:
step 1, adjusting the size of a target training image to a preset size, and whitening the adjusted target training image;
for most neural networks, it is common to receive only fixed-size image data; therefore, the size of the target training image needs to be adjusted before being input into the feature extraction network; the specific adjustment mode can be as follows: if the length or width of the target training image is larger than the preset size, compressing the target training image to the preset size, or deleting redundant image areas; and if the length or the width of the target training image is smaller than the preset size, stretching the target training image to the preset size or filling the vacant image area.
In general, a target training image is affected during imaging by multiple factors such as ambient illumination intensity, object reflection and the shooting camera; in order to remove these factors from the target training image, the target training image needs to be whitened so that it contains constant information that is not affected by the outside. The whitening process on the adjusted target training image can thus be understood as a process of reducing the dimension of the adjusted target training image. During the whitening process, the pixel values of the target training image are typically converted to zero mean and unit variance. Specifically, the mean value μ and the standard deviation σ of all pixel values of the target training image are first calculated, and each pixel of the target training image is converted through the following formula: xij = (pij − μ) / σ; wherein pij is the original pixel value of the pixel in the i-th row and j-th column of the target training image, and xij is the converted pixel value of the pixel in the i-th row and j-th column of the target training image.
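A minimal sketch of this whitening step is given below, assuming a NumPy implementation; the small epsilon guard is an added assumption for numerical safety and is not part of the formula above.

```python
# Illustrative sketch of the per-image whitening step: convert every pixel to
# zero mean and unit variance, xij = (pij - mu) / sigma (NumPy assumed).
import numpy as np

def whiten(image: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    mu = image.mean()
    sigma = image.std()
    return (image - mu) / (sigma + eps)   # eps only guards against a constant image
```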
And 2, inputting the processed target training image into a preset feature extraction network, and outputting an initial feature map with the specified number of levels.
In practical implementation, the number of levels of the feature map output by the feature extraction network may be preset, and for example, the number of levels may be five, which are respectively denoted as Conv1, Conv2, Conv3, Conv4 and Conv 5. The initial feature map of the current layer is obtained by performing convolution calculation on the initial feature map of the lower layer of the current layer through a preset convolution kernel (the initial feature map of the bottommost layer is obtained by performing convolution calculation on the target training image), and the scale of the initial feature map of the current layer is smaller than that of the initial feature map of the lower layer; therefore, the scale of the initial feature map with the specified number of levels output by the feature extraction network is changed from large to small from the bottom level to the top level, and the scales are different from each other.
FIG. 3 is a schematic structural diagram of the initial feature maps. In FIG. 3, five levels of initial feature maps are illustrated as an example; following the direction of the arrow, the initial feature map at the bottom is the bottommost level and the initial feature map at the top is the topmost level. The feature extraction network is generally provided with a plurality of convolutional layers. After the target training image is input into the feature extraction network, the first convolutional layer performs a convolution operation to obtain the initial feature map of the bottommost level; the second convolutional layer performs a convolution operation on the initial feature map of the bottommost level to obtain the initial feature map of the second level, and so on, until the initial feature map of the topmost level is obtained through the last convolutional layer. In general, the convolution kernel used for the convolution operation may be different for each convolutional layer. In addition to the convolutional layers, the feature extraction network is generally also provided with a pooling layer, a fully connected layer, and the like.
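The sketch below illustrates, under assumed layer counts and channel sizes, how a stack of convolutional stages can produce five initial feature maps of decreasing scale; it is a toy stand-in for the VGG16/ResNet backbone mentioned above, not the patent's network.

```python
# Simplified sketch (not the patent's backbone): each stage halves the spatial
# scale, producing the five initial feature maps Conv1 ... Conv5.
import torch.nn as nn

class TinyBackbone(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch

    def forward(self, x):
        feature_maps = []
        for stage in self.stages:
            x = stage(x)               # each stage halves the spatial scale
            feature_maps.append(x)
        return feature_maps            # [Conv1, Conv2, Conv3, Conv4, Conv5], bottom to top
```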
Step 304, performing feature fusion processing on the initial feature map to obtain a fusion feature map;
the initial characteristic diagram of each level is obtained by carrying out convolution operation through different convolution kernels, so that the initial characteristic diagram of each level comprises different types or different dimensions of characteristics of the target training image; in order to enrich the features included in the initial feature maps of the respective levels, it is necessary to perform fusion processing on the initial feature maps of the respective levels. The specific fusion process can have various forms, such as fusing the initial feature diagram of the current layer with the initial feature diagram of the previous layer of the current layer to obtain a fusion feature diagram of the current layer; for another example, before the initial feature map of the current layer is fused with the initial feature map of the previous layer of the current layer, the initial feature map of the other layer or the combined initial feature maps of the other layers may be fused, and then the fused initial feature map is fused with the initial feature map of the previous layer.
Because the scales of the initial feature maps are different, the initial feature maps to be fused generally need to be preprocessed (for example by a convolution operation, an interpolation operation, and the like) before fusion is carried out, so that their scales match each other; when the initial feature maps are fused, point-wise multiplication, point-wise addition or other logic operations can be performed between corresponding feature points.
Specifically, the step 304 may be implemented by:
step 1, determining an initial feature map of a topmost level as a fused feature map of the topmost level;
because the initial feature map of the topmost level does not have the initial feature map of the previous level, in the process of fusing the initial feature maps of each level, the initial feature map of the topmost level is not fused any more, and the initial feature map is directly determined as the fused feature map of the topmost level.
And 2, fusing the initial feature map of the current level and the fused feature map of the previous level of the current level except for the topmost level to obtain the fused feature map of the current level.
During actual implementation, convolution operation can be performed on the initial feature map of the current level through a preset convolution kernel to obtain an initial feature map after the convolution operation; the convolution kernel may be a 3 × 3 convolution kernel, but a larger convolution kernel may be used, for example, a 5 × 5 convolution kernel, a 7 × 7 convolution kernel, or the like. And then carrying out interpolation operation on the fusion feature map of the previous level of the current level according to the scale of the initial feature map of the current level to obtain the fusion feature map of the previous level of the current level matched with the scale of the initial feature map of the current level.
Since the fused feature map of the level above the current level is smaller than the initial feature map of the current level, in order to facilitate the fusion, the fused feature map of the level above needs to be "stretched" to the same scale as the initial feature map of the current level, and this "stretching" process can be realized through an interpolation operation. Taking linear interpolation as an example, the interpolation process is briefly illustrated as follows: suppose the values of three local feature points in the initial feature map are 5, 7 and 9 respectively, and that, in order to extend the initial feature map to a preset scale, these three feature points need to be extended to five feature points. In this case, the mean of feature point 5 and feature point 7, that is, a feature point 6, may be inserted between feature point 5 and feature point 7, and the mean of feature point 7 and feature point 9, that is, a feature point 8, may be inserted between feature point 7 and feature point 9, so that the three local feature points are extended to five feature points, namely 5, 6, 7, 8 and 9.
In addition to the linear interpolation described above, other interpolation algorithms may be used, such as bilinear interpolation, which usually performs interpolation in the x direction and the y direction respectively. Specifically, four feature points Q11, Q12, Q21 and Q22, distributed in a rectangle, are selected from the initial feature map; in the x direction, an interpolation point R1 is obtained by linearly interpolating between Q11 and Q21, and an interpolation point R2 is obtained by linearly interpolating between Q12 and Q22; in the y direction, the interpolation point R1 and the interpolation point R2 are linearly interpolated to obtain the final interpolation point P, which is the feature point newly added by one bilinear interpolation.
After the interpolation operation is completed, the fused feature map of the previous level of the current level after the interpolation operation and the initial feature map of the current level are subjected to point-by-point addition operation between corresponding feature points to obtain the fused feature map of the current level. Of course, the fused feature map of the previous level of the current level and the initial feature map of the current level may be subjected to point-by-point multiplication operation or other logic operation between the corresponding feature points.
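The following sketch shows the top-down fusion just described (the topmost map kept as-is, each lower level fused with the bilinearly upsampled fused map of the level above by point-wise addition); it assumes PyTorch tensors with matching channel counts across levels, which is an assumption not stated in the patent.

```python
# Sketch of the top-down fusion described above (assumed FPN-style logic).
import torch.nn.functional as F

def fuse_feature_maps(initial_maps):
    """initial_maps: list ordered bottom -> top; all levels assumed to share one channel count."""
    fused = [None] * len(initial_maps)
    fused[-1] = initial_maps[-1]                          # topmost level: kept as its own fused map
    for level in range(len(initial_maps) - 2, -1, -1):
        upper = F.interpolate(fused[level + 1],
                              size=initial_maps[level].shape[-2:],
                              mode="bilinear", align_corners=False)
        fused[level] = initial_maps[level] + upper        # point-wise addition of corresponding points
    return fused
```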
Step 306, extracting a candidate region from the fusion feature map through a preset region candidate network.
The region candidate network may specifically be an RPN (Region Proposal Network), which may be implemented in the following manner: on the fused feature map of each level, an n × n sliding window (for example, a 3 × 3 sliding window when n is 3) generates a fully connected feature with a length of 256 or 512 dimensions, and two branched fully connected layers or convolutional layers, namely a reg-layer and a cls-layer, are then generated after the 256-dimensional or 512-dimensional feature. The reg-layer is used for predicting the coordinates x and y and the width and height w and h of the candidate region corresponding to the anchor point at the center of the candidate region; the cls-layer is used for judging whether the candidate region is foreground or background, so that the candidate regions which may contain instances are obtained through screening. Such a candidate region may also be referred to as an RoI (Region of Interest).
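A sketch of such an RPN-style head is given below, assuming a PyTorch implementation; the channel count, the number of anchors per position and the layer names are illustrative assumptions, not values taken from the patent.

```python
# Sketch of the region candidate (RPN-style) head described above: a 3 x 3
# sliding-window convolution produces a shared feature, followed by a cls
# branch (foreground/background) and a reg branch (x, y, w, h per anchor).
import torch.nn as nn

class RegionProposalHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=5):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)   # n x n sliding window
        self.cls_layer = nn.Conv2d(256, num_anchors * 2, kernel_size=1)       # foreground / background
        self.reg_layer = nn.Conv2d(256, num_anchors * 4, kernel_size=1)       # x, y, w, h per anchor

    def forward(self, fused_map):
        shared = self.shared(fused_map).relu()
        return self.cls_layer(shared), self.reg_layer(shared)
```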
Step 308, carrying out example positioning and example segmentation on the candidate area through a preset positioning segmentation network, and calculating loss values of the example positioning and example segmentation to obtain a positioning area, a segmentation area, a positioning loss value and a segmentation loss value containing an example;
step 310, classifying the candidate regions through a preset classification network, and calculating a classification loss value to obtain a classification result and a classification loss value of the candidate regions;
and step 312, training the feature extraction network, the area candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain the image processing model.
According to the training method of the image processing model, after an initial feature map of a target training image is extracted through a feature extraction network, feature fusion processing is carried out on the initial feature map to obtain a fusion feature map; extracting a candidate region from the fusion characteristic graph through a regional candidate network; and then training the positioning segmentation network and the classification network based on the candidate region to obtain an image processing model. In the method, the example positioning and the example segmentation are realized by adopting the same branch network, so that the example positioning and the example segmentation can share the characteristic information and are mutually promoted, the accuracy of the example positioning and the example segmentation is favorably improved, and the overall accuracy of the example positioning, the segmentation and the classification is further improved.
Example four:
the embodiment provides another training method of an image processing model, which is implemented on the basis of the above embodiments; in this embodiment, a specific implementation manner of performing example positioning, example segmentation and classification on the candidate regions is described in detail. A convolutional network is more favorable for acquiring position-sensitive information in the candidate area, such as edge context information; a fully connected network is more favorable for acquiring global semantic information in the candidate area. Because the positioning segmentation network in this embodiment is implemented by a convolutional network and the classification network is implemented by a fully connected network, the problem of missing edge context information caused by performing example positioning with a fully connected network is avoided.
Step 402, performing feature extraction processing on a target training image through a preset feature extraction network to obtain an initial feature map of the target training image;
step 404, performing feature fusion processing on the initial feature map to obtain a fusion feature map;
step 406, extracting a candidate region from the fusion feature map through a preset region candidate network.
Step 408, adjusting the size of the candidate area to the size matched with the convolution network;
generally, a convolutional network requires that the input image data have a fixed size, such as 14 × 14, 7 × 7, etc.; as described in the above embodiments, the size of the candidate region may be adjusted by stretching, compressing, deleting redundant regions, filling in vacant regions, and the like, so that the size of the candidate region matches the size of the convolutional network.
Step 410, performing instance detection processing and instance segmentation processing on the adjusted candidate area through a convolutional network to obtain a positioning area and a segmentation area which contain complete instances; the positioning area is marked by a detection frame; the segmented regions are identified by color.
After the candidate area with the adjusted size is input into the convolutional network, the convolutional network usually extracts the position information in the candidate area to obtain the edge information of the instances possibly included in the candidate area, and then performs example positioning and segmentation on the candidate area through the acquired edge information; in most cases, the tasks of example positioning and example segmentation can be carried out simultaneously. In addition, for a larger instance, the candidate area may not include the complete instance; in this case, the convolutional network may search for candidate areas that share the same anchor (which may be understood as the center point of the candidate area) as the current candidate area, or candidate areas with adjacent anchors, and merge the candidate areas whose edge information is strongly correlated, or stretch the current candidate area based on those candidate areas, so as to obtain an area including the complete instance. This area may be larger than the instance; if the area contains too much background around the instance, the area needs to be adjusted again so that the edge of the instance is close to the edge of the area, and the final detection frame thus contains the complete instance.
The positioning area is identified by a detection frame, which may be a rectangular frame containing the instance and the background area around it. The edge of the segmentation area is usually the edge contour of the instance, and the instances are usually distinguished by color filling; for example, if the target training image includes a person 1, a person 2, a cup and an animal, then after example segmentation, person 1 is identified in blue, person 2 in red, the cup in green, and the animal in purple.
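The sketch below shows one possible form of the positioning segmentation branch, loosely following the layout of FIG. 4 (four convolutional layers plus a deconvolution); the channel sizes, the 28 × 28 mask resolution and the box-regression head are assumptions and are not specified by the patent.

```python
# Sketch of the positioning segmentation branch (assumed layout): a stack of
# 3 x 3 convolutions over the 14 x 14 candidate-area features, a deconvolution
# that upsamples the mask, and a small regressor for the detection frame.
import torch.nn as nn

class LocSegBranch(nn.Module):
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_channels, in_channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(4)])                                              # CONV1 .. CONV4
        self.deconv = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)  # DCONV: 14x14 -> 28x28
        self.mask_out = nn.Conv2d(in_channels, num_classes, 1)              # per-class segmentation masks
        self.box_out = nn.Linear(in_channels * 14 * 14, 4)                  # detection-frame regression

    def forward(self, roi_feat):                                            # roi_feat: (N, C, 14, 14)
        shared = self.convs(roi_feat)                                       # shared position-sensitive features
        masks = self.mask_out(self.deconv(shared).relu())                   # (N, num_classes, 28, 28)
        boxes = self.box_out(shared.flatten(1))                             # (N, 4)
        return boxes, masks
```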
In the training process of the convolutional network, the loss value of the output result of the convolutional network needs to be calculated through a loss function so as to evaluate the performance of the convolutional network. Therefore, the target training image usually carries the positioning label and the segmentation label corresponding to each instance; the location tag may also be identified with a detection box to indicate the exact location of the instance; the segmentation label may indicate the edge profile of the instance by lines that make up the area occupied by the instance, which may also be filled in with color.
Specifically, in order to evaluate the example positioning performance of the convolutional network, a positioning loss value needs to be calculated, and specifically, a positioning area and a positioning label corresponding to an example included in the positioning area may be substituted into a preset positioning loss function to obtain the positioning loss value; the location Loss function may be a Bbox Loss function or other functions that may be used to evaluate location Loss.
In order to evaluate the example segmentation performance of the convolutional network, a segmentation loss value needs to be calculated, and specifically, the segmentation loss value can be obtained by substituting a segmentation region and a segmentation label corresponding to an example included in the segmentation region into a preset segmentation loss function. The segmentation Loss function can be a cross entropy Loss function, such as a Mask Sigmoid Loss function; it will be appreciated that the cross entropy loss function may also be used to evaluate the localization loss of the localization area as described above.
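Assuming smooth-L1 and per-pixel binary cross-entropy as concrete stand-ins for the Bbox Loss and Mask Sigmoid Loss named above (the patent does not fix their exact forms), the two loss terms might be computed as follows.

```python
# Sketch of the two loss terms with assumed concrete forms.
import torch.nn.functional as F

def localization_loss(pred_boxes, gt_boxes):
    # pred_boxes, gt_boxes: (N, 4) detection-frame parameters vs. positioning labels
    return F.smooth_l1_loss(pred_boxes, gt_boxes)

def segmentation_loss(pred_mask_logits, gt_masks):
    # pred_mask_logits, gt_masks: (N, H, W) mask logits vs. binary segmentation labels
    return F.binary_cross_entropy_with_logits(pred_mask_logits, gt_masks.float())
```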
Step 412, adjusting the size of the candidate area to a size matching the fully connected network;
typically, a fully connected network requires that the input image data also have a fixed size, such as 7 x 7, 14 x 14, etc.; the size of the candidate area can be adjusted by stretching, compressing, deleting redundant area, filling vacant area, etc. to match the size of the candidate area with the size of the fully connected network.
And step 414, inputting the adjusted candidate area into the full-connection network, and outputting the classification result of the candidate area.
After the candidate area with the adjusted size is input into the fully connected network, the fully connected network usually extracts the semantic information in the candidate area and classifies the candidate area through the acquired semantic information; the classification of a candidate area is mostly based on the instances it contains. Since candidate areas of the same anchor point or of neighboring anchor points overlap, such candidate areas are likely to contain the same instance, and in this case they are classified into the same category.
The classification result output by the fully connected network is usually expressed by a classification identifier, which can be displayed near the detection frame corresponding to the positioning area of each instance. Therefore, when determining which classification identifier each detection frame corresponds to, after the positioning area or the segmentation area is determined, the category of the candidate area whose position is the same as or similar to that of the positioning area or the segmentation area may be looked up in the classification results, and this category is determined as the classification identifier of the detection frame corresponding to the positioning area or the segmentation area.
In addition, if there are multiple candidate-area categories whose positions are the same as or similar to those of the positioning area or the segmentation area, the category with a higher weight may be selected from these categories as the classification identifier of the detection frame corresponding to the positioning area or the segmentation area.
In the training process of the fully connected network, the loss value of the output result of the fully connected network needs to be calculated through a loss function so as to evaluate the performance of the fully connected network. Therefore, the target training image usually carries the classification label corresponding to each instance; the classification label may be set corresponding to the positioning label of the instance, or may be set corresponding to the segmentation label of the instance.
In order to evaluate the classification performance of the fully-connected network, a classification loss value needs to be calculated: the classification result of the candidate region and the classification label corresponding to the instance contained in the candidate region are substituted into a preset classification loss function to obtain the classification loss value. The classification loss function may be a log loss function, a squared loss function, an exponential loss function, or the like.
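A minimal sketch of the log-loss variant, realised as a softmax cross-entropy over the candidate-region class scores; names and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def classification_loss(class_logits: torch.Tensor, class_labels: torch.Tensor) -> torch.Tensor:
    """Log loss (softmax cross-entropy) between the classification result of each
    candidate region and the classification label of the instance it contains.
    class_logits: (N, num_classes); class_labels: (N,) integer class indices."""
    return F.cross_entropy(class_logits, class_labels)
```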
Step 416, training the feature extraction network, the region candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value all converge, to obtain the image processing model.
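A minimal sketch of one joint training step for this stage, assuming the four sub-networks are wrapped in a single `model` that returns the three loss values in training mode; the wrapper, optimizer choice and return format are assumptions, not details fixed here.

```python
import torch

def train_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
               image: torch.Tensor, targets: dict) -> float:
    """One joint training step: the positioning, segmentation and classification
    losses are summed and all sub-networks are updated together; the step is
    repeated over the training set until the loss values converge."""
    loc_loss, seg_loss, cls_loss = model(image, targets)  # assumed return format
    total_loss = loc_loss + seg_loss + cls_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```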
FIG. 4 illustrates an example of the image processing model. The classification network in the model includes two fully-connected layers, and the positioning segmentation network includes five convolutional layers, namely CONV1, CONV2, CONV3, CONV4 and DCONV (both layer counts are merely examples and do not limit the present embodiment). In FIG. 4 the candidate region matching the classification network has a size of 7 × 7, and the candidate region matching the positioning segmentation network has a size of 14 × 14. After the candidate region is resized to 7 × 7 it is input into the classification network, which outputs the classification result and the classification loss value after processing by the two fully-connected layers; after the candidate region is resized to 14 × 14 it is input into the positioning segmentation network, which outputs the positioning region, the segmentation region, the positioning loss value and the segmentation loss value after processing by the five convolutional layers.
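A rough sketch of such a two-branch head is given below, assuming 256-channel candidate-region features; the channel widths, the single-box regression layout and the mask resolution are illustrative assumptions, and only the overall branch structure follows FIG. 4.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Sketch of a head in the spirit of FIG. 4: two fully-connected layers
    classify the 7x7 candidate region, while one convolutional branch
    (CONV1-CONV4 plus a deconvolution DCONV) handles both instance positioning
    and instance segmentation on the 14x14 candidate region."""

    def __init__(self, in_channels: int = 256, num_classes: int = 81):
        super().__init__()
        # Classification branch: two fully-connected layers on the 7x7 region.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes)

        # Positioning + segmentation branch: CONV1-CONV4 followed by DCONV.
        convs = []
        for _ in range(4):
            convs += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                      nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.dconv = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)
        self.mask_pred = nn.Conv2d(in_channels, num_classes, 1)   # segmentation region
        self.box_pred = nn.Linear(in_channels * 14 * 14, 4)       # positioning region

    def forward(self, roi7: torch.Tensor, roi14: torch.Tensor):
        # roi7: (N, C, 7, 7) candidate regions for classification.
        # roi14: (N, C, 14, 14) candidate regions for positioning/segmentation.
        cls_logits = self.cls_score(self.fc(roi7))
        feat = self.convs(roi14)
        boxes = self.box_pred(feat.flatten(1))
        masks = self.mask_pred(torch.relu(self.dconv(feat)))      # (N, K, 28, 28)
        return cls_logits, boxes, masks
```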
As can be seen from FIG. 4, in the image processing model provided in this embodiment the instance positioning and instance segmentation tasks are implemented by the same convolutional network branch, while the classification task is implemented by a separate fully-connected network. In contrast, FIG. 5 shows a prior-art image processing model, which may be implemented with a Mask R-CNN network model; unlike the model in FIG. 4, it performs classification and instance positioning of the candidate regions in the same fully-connected network, while instance segmentation is performed by a separate convolutional network.
In order to further verify the performance of the two models shown in FIG. 4 and FIG. 5, a verification experiment was carried out for this embodiment, and Table 1 below shows the experimental results. Here AP denotes the mask average precision, and mmAP is an evaluation metric of MSCOCO (a database name) obtained by averaging the AP over different categories and different scales. In Table 1, the segmentation mmAP is the mmAP of the instance segmentation task, and the detection mmAP is the mmAP of the instance positioning and classification task. Since the classification scheme of the two models is unchanged, the comparison in Table 1 shows that the image processing model of this embodiment clearly improves the accuracy of the instance positioning and instance segmentation tasks compared with the prior-art Mask R-CNN network model.
TABLE 1
Model                                             Segmentation mmAP    Detection mmAP
Mask R-CNN network model                          34.4                 37
Image processing model in the present embodiment  35.4                 38.7
Generally, a fully-connected network can integrate the global semantic information in a candidate region, but it damages the spatial positioning information in that region; because the model in FIG. 5 performs instance positioning and classification in the same fully-connected network, the two tasks easily conflict with each other, which leads to poor positioning results and low accuracy. Compared with a fully-connected network, a convolutional network is friendlier to instance positioning and better suited to instance positioning tasks. On this basis, and considering that instance segmentation and instance positioning both depend on object edge features, this embodiment implements instance segmentation and instance positioning in the same convolutional network and uses loss functions to supervise the model's performance on both tasks, so that the instance segmentation and instance positioning tasks promote each other and the accuracy of both is further improved.
In the above training method of the image processing model, after a candidate region of the target training image is obtained through the preset feature extraction network and region candidate network, instance positioning and instance segmentation are performed on the candidate region by the convolutional network to obtain the positioning region and segmentation region containing an instance; the candidate region is classified by the fully-connected network to obtain its classification result; and the feature extraction network, the region candidate network, the positioning segmentation network and the classification network are trained according to the positioning loss value, the segmentation loss value and the classification loss value until all the loss values converge, giving the image processing model. Because instance positioning and instance segmentation are implemented by the same branch network, they can share feature information and promote each other, which helps to improve the accuracy of instance positioning and instance segmentation and thereby the overall accuracy of instance positioning, segmentation and classification.
Example five:
corresponding to the training method of the image processing model provided in the above embodiment, the present embodiment provides an image processing method applied to a device configured with an image processing model, the image processing model being obtained by training according to the above embodiment; as shown in FIG. 6, the method includes the following steps:
step S602, acquiring an image to be processed;
step S604, inputting the image to be processed into the image processing model, and outputting the positioning area, the segmentation area and the classification result of each instance in the image to be processed.
Based on this image processing method, the present embodiment further provides a specific application scenario, namely an automatic driving scenario. The image to be processed can be acquired by a camera device of the vehicle, and the image processing model can be configured in the central control system of the vehicle. After the camera device acquires the image to be processed, the central control system inputs it into the image processing model and obtains the positioning region, segmentation region and classification result of each instance in the image, such as driving lanes, traffic signs and traffic lights; according to the positioning regions, segmentation regions and classification results of these instances, the central control system can analyze the current road conditions and generate a corresponding driving command, so that the vehicle drives automatically according to the driving command.
In this image processing method, the image processing model used implements instance positioning and instance segmentation by the same branch network, so that the two tasks can share feature information and promote each other; this helps to improve the accuracy of instance positioning and instance segmentation, and thereby the overall accuracy of instance positioning, segmentation and classification.
Example six:
corresponding to the above method embodiments, FIG. 7 shows a schematic structural diagram of a training apparatus for an image processing model; the apparatus includes:
a region obtaining module 70, configured to obtain a candidate region of the target training image through a preset feature extraction network and a region candidate network;
the positioning segmentation module 71 is configured to perform instance positioning and instance segmentation on the candidate region through a preset positioning segmentation network, and to calculate the loss values of the instance positioning and the instance segmentation, so as to obtain a positioning region containing an instance, a segmentation region containing an instance, a positioning loss value and a segmentation loss value;
the classification module 72 is configured to classify the candidate regions through a preset classification network, and calculate a classification loss value to obtain a classification result and a classification loss value of the candidate regions;
and the training module 73 is configured to train the feature extraction network, the regional candidate network, the positioning segmentation network, and the classification network according to the positioning loss value, the segmentation loss value, and the classification loss value until the positioning loss value, the segmentation loss value, and the classification loss value are all converged, so as to obtain an image processing model.
In the above training apparatus for the image processing model, after the candidate region of the target training image is obtained through the preset feature extraction network and region candidate network, instance positioning and instance segmentation are performed on the candidate region by the positioning segmentation network and the corresponding loss values are calculated, giving the positioning region and segmentation region containing an instance; the candidate region is classified by the classification network and the corresponding loss value is calculated, giving the classification result of the candidate region; and the feature extraction network, the region candidate network, the positioning segmentation network and the classification network are trained according to the positioning loss value, the segmentation loss value and the classification loss value until all the loss values converge, giving the image processing model. Because instance positioning and instance segmentation are implemented by the same branch network, they can share feature information and promote each other, which helps to improve the accuracy of instance positioning and instance segmentation and thereby the overall accuracy of instance positioning, segmentation and classification.
Further, the positioning segmentation network comprises a convolutional network; the classification network comprises a fully connected network.
Further, the region obtaining module is further configured to: perform feature extraction processing on the target training image through a preset feature extraction network to obtain an initial feature map of the target training image; perform feature fusion processing on the initial feature map to obtain a fusion feature map; and extract a candidate region from the fusion feature map through a preset region candidate network.
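A minimal sketch of one possible form of such feature fusion (projecting a coarse feature map, upsampling it and adding it element-wise to a finer one); this particular top-down scheme is an assumption offered only as an illustration of what the fusion processing could look like.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse a coarse (deep) feature map into a finer (shallow) one by 1x1
    projection, upsampling, and element-wise addition, yielding a fusion feature map."""

    def __init__(self, coarse_channels: int, fine_channels: int, out_channels: int = 256):
        super().__init__()
        self.proj_coarse = nn.Conv2d(coarse_channels, out_channels, kernel_size=1)
        self.proj_fine = nn.Conv2d(fine_channels, out_channels, kernel_size=1)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(self.proj_coarse(coarse), size=fine.shape[-2:], mode="nearest")
        return self.proj_fine(fine) + up   # fusion feature map
```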
Further, the positioning segmentation module is further configured to: adjust the size of the candidate region to a size matching the convolutional network; and perform instance detection processing and instance segmentation processing on the adjusted candidate region through the convolutional network to obtain a positioning region and a segmentation region containing a complete instance; the positioning region is identified by a detection frame, and the segmentation region is identified by color.
Further, the target training image carries a positioning label and a segmentation label corresponding to each instance; the positioning segmentation module is further configured to: substitute the positioning region and the positioning label corresponding to the instance contained in the positioning region into a preset positioning loss function to obtain the positioning loss value; and substitute the segmentation region and the segmentation label corresponding to the instance contained in the segmentation region into a preset segmentation loss function to obtain the segmentation loss value.
Further, the classification module is further configured to: adjusting the size of the candidate area to a size matching the fully connected network; and inputting the adjusted candidate area into the full-connection network, and outputting the classification result of the candidate area.
Further, the target training image carries a classification label corresponding to each instance; the classification module is further configured to: and substituting the classification result of the candidate region and the classification label corresponding to the example contained in the candidate region into a preset classification loss function to obtain a classification loss value.
The training apparatus for an image processing model provided in this embodiment has the same implementation principle and technical effect as the foregoing method embodiments; for the sake of brevity, where this apparatus embodiment is not described in detail, reference may be made to the corresponding content in the foregoing method embodiments.
This embodiment also provides an image processing device, which is arranged on equipment configured with an image processing model; the image processing model is obtained by training with the above training method of the image processing model; the device includes:
the image acquisition module is used for acquiring an image to be processed;
and the image input module is used for inputting the image to be processed into the image processing model and outputting the positioning area, the segmentation area and the classification result of each example in the image to be processed.
Further, the image obtaining module is further configured to acquire the image to be processed through a camera device of a vehicle;
the above device further includes: a command generation module, configured to generate a driving command according to the positioning region, the segmentation region and the classification result, so that the vehicle drives automatically according to the driving command.
In this image processing device, the image processing model used implements instance positioning and instance segmentation by the same branch network, so that the two tasks can share feature information and promote each other; this helps to improve the accuracy of instance positioning and instance segmentation, and thereby the overall accuracy of instance positioning, segmentation and classification.
Example seven:
an embodiment of the present invention provides an electronic system, including: an image acquisition device, a processing device and a storage device; the image acquisition device is used for acquiring preview video frames or image data; the storage device has stored thereon a computer program which, when run by the processing device, performs the above-described training method of an image processing model, or performs the above-described image processing method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic system described above may refer to the corresponding process in the foregoing method embodiments, and is not described herein again.
Further, the present embodiment also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processing device, it performs the above-mentioned training method of the image processing model, or performs the above-mentioned image processing method.
The computer program product of the image processing method, the training method of its model, the training apparatus of its model and the electronic system provided by the embodiments of the present invention comprises a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the methods described in the foregoing method embodiments, and the specific implementation can be found in the method embodiments and is not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for training an image processing model, comprising:
acquiring a candidate region of a target training image through a preset feature extraction network and a region candidate network; wherein the target training image comprises a plurality of kinds of instances, and there are a plurality of instances of each kind;
carrying out instance positioning and instance segmentation on the candidate region through a preset positioning segmentation network, and calculating loss values of the instance positioning and the instance segmentation, to obtain a positioning region containing an instance, a segmentation region containing an instance, a positioning loss value and a segmentation loss value;
classifying the candidate regions through a preset classification network, and calculating the classified loss values to obtain the classification results and the classification loss values of the candidate regions;
training the feature extraction network, the region candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain an image processing model;
the step of obtaining the candidate region of the target training image through the preset feature extraction network and the region candidate network includes:
carrying out feature extraction processing on a target training image through a preset feature extraction network to obtain an initial feature map of the target training image; performing feature fusion processing on the initial feature map to obtain a fusion feature map; and extracting a candidate region from the fusion feature map through a preset region candidate network.
2. The method of claim 1, wherein the localization segmentation network comprises a convolutional network; the classification network comprises a fully connected network.
3. The method of claim 2, wherein the step of performing instance location and instance segmentation on the candidate region through a preset location segmentation network comprises:
adjusting the size of the candidate region to a size matching the convolutional network;
performing example detection processing and example segmentation processing on the adjusted candidate area through the convolutional network to obtain a positioning area and a segmentation area which contain complete examples; the positioning area is identified through a detection frame; the segmented regions are identified by color.
4. The method according to claim 3, wherein the target training image carries a positioning label and a segmentation label corresponding to each instance;
the step of calculating loss values of the instance positioning and the instance segmentation comprises: substituting the positioning region and a positioning label corresponding to an instance contained in the positioning region into a preset positioning loss function to obtain a positioning loss value;
and substituting the segmentation region and a segmentation label corresponding to an instance contained in the segmentation region into a preset segmentation loss function to obtain a segmentation loss value.
5. The method of claim 2, wherein the step of classifying the candidate regions through a preset classification network comprises:
resizing the candidate region to a size that matches the fully connected network;
inputting the adjusted candidate region into the full-connection network, and outputting the classification result of the candidate region.
6. The method of claim 5, wherein the target training image carries a classification label corresponding to each instance;
the step of calculating the classified loss value comprises: and substituting the classification result of the candidate region and the classification label corresponding to the example contained in the candidate region into a preset classification loss function to obtain a classification loss value.
7. An image processing method, characterized in that the method is applied to a device configured with an image processing model; the image processing model is obtained by training according to the method of any one of claims 1 to 6; the method comprises the following steps:
acquiring an image to be processed;
and inputting the image to be processed into the image processing model, and outputting the positioning area, the segmentation area and the classification result of each instance in the image to be processed.
8. The method of claim 7, wherein the step of obtaining the image to be processed comprises: acquiring an image to be processed through a camera device of a vehicle;
after the step of outputting the positioning region, the segmentation region and the classification result of each instance in the image to be processed, the method further comprises: and generating a driving command according to the positioning area, the segmentation area and the classification result so as to enable the vehicle to automatically drive according to the driving command.
9. An apparatus for training an image processing model, comprising:
the region acquisition module is used for acquiring a candidate region of the target training image through a preset feature extraction network and a region candidate network; wherein the target training image comprises a plurality of kinds of instances, and there are a plurality of instances of each kind;
the positioning segmentation module is used for carrying out instance positioning and instance segmentation on the candidate region through a preset positioning segmentation network, and calculating loss values of the instance positioning and the instance segmentation, to obtain a positioning region containing an instance, a segmentation region containing an instance, a positioning loss value and a segmentation loss value;
the classification module is used for classifying the candidate regions through a preset classification network and calculating the classified loss values to obtain the classification results and the classification loss values of the candidate regions;
the training module is used for training the feature extraction network, the region candidate network, the positioning segmentation network and the classification network according to the positioning loss value, the segmentation loss value and the classification loss value until the positioning loss value, the segmentation loss value and the classification loss value are converged to obtain an image processing model;
the region acquisition module is further configured to: carry out feature extraction processing on a target training image through a preset feature extraction network to obtain an initial feature map of the target training image; perform feature fusion processing on the initial feature map to obtain a fusion feature map; and extract a candidate region from the fusion feature map through a preset region candidate network.
10. An image processing apparatus, characterized in that the apparatus is provided in a device provided with an image processing model; the image processing model is obtained by training according to the method of any one of claims 1 to 6; the device comprises:
the image acquisition module is used for acquiring an image to be processed;
and the image input module is used for inputting the image to be processed into the image processing model and outputting the positioning area, the segmentation area and the classification result of each instance in the image to be processed.
11. An electronic system, characterized in that the electronic system comprises: the device comprises an image acquisition device, a processing device and a storage device;
the image acquisition equipment is used for acquiring preview video frames or image data;
the storage means having stored thereon a computer program which, when executed by the processing device, performs the method of any of claims 1 to 6, or performs the method of claim 7 or 8.
12. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processing device, is adapted to carry out the method of any one of claims 1 to 6 or the steps of the method of claim 7 or 8.
CN201811306459.9A 2018-11-02 2018-11-02 Image processing method, training method and device of model thereof and electronic system Active CN109447169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811306459.9A CN109447169B (en) 2018-11-02 2018-11-02 Image processing method, training method and device of model thereof and electronic system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811306459.9A CN109447169B (en) 2018-11-02 2018-11-02 Image processing method, training method and device of model thereof and electronic system

Publications (2)

Publication Number Publication Date
CN109447169A CN109447169A (en) 2019-03-08
CN109447169B true CN109447169B (en) 2020-10-27

Family

ID=65550456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811306459.9A Active CN109447169B (en) 2018-11-02 2018-11-02 Image processing method, training method and device of model thereof and electronic system

Country Status (1)

Country Link
CN (1) CN109447169B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070056B (en) * 2019-04-25 2023-01-10 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, storage medium, and device
CN110110799B (en) * 2019-05-13 2021-11-16 广州锟元方青医疗科技有限公司 Cell sorting method, cell sorting device, computer equipment and storage medium
CN110210544B (en) * 2019-05-24 2021-11-23 上海联影智能医疗科技有限公司 Image classification method, computer device, and storage medium
CN110276765B (en) * 2019-06-21 2021-04-23 北京交通大学 Image panorama segmentation method based on multitask learning deep neural network
US20230119593A1 (en) * 2019-06-21 2023-04-20 One Connect Smart Technology Co., Ltd. Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
CN110427819B (en) * 2019-06-26 2022-11-29 深圳职业技术学院 Method for identifying PPT frame in image and related equipment
CN110458218B (en) * 2019-07-31 2022-09-27 北京市商汤科技开发有限公司 Image classification method and device and classification network training method and device
CN110516671B (en) * 2019-08-27 2022-06-07 腾讯科技(深圳)有限公司 Training method of neural network model, image detection method and device
CN110598711B (en) * 2019-08-31 2022-12-16 华南理工大学 Target segmentation method combined with classification task
CN110874594B (en) * 2019-09-23 2023-06-30 平安科技(深圳)有限公司 Human body appearance damage detection method and related equipment based on semantic segmentation network
CN110659726B (en) * 2019-09-24 2022-05-06 北京达佳互联信息技术有限公司 Image processing method and device, electronic equipment and storage medium
CN111027617A (en) * 2019-12-06 2020-04-17 北京市商汤科技开发有限公司 Neural network training and image recognition method, device, equipment and storage medium
CN111178253B (en) * 2019-12-27 2024-02-27 佑驾创新(北京)技术有限公司 Visual perception method and device for automatic driving, computer equipment and storage medium
CN111192277A (en) * 2019-12-31 2020-05-22 华为技术有限公司 Instance partitioning method and device
CN111325221B (en) * 2020-02-25 2023-06-23 青岛海洋科技中心 Image feature extraction method based on image depth information
CN111368923B (en) * 2020-03-05 2023-12-19 上海商汤智能科技有限公司 Neural network training method and device, electronic equipment and storage medium
CN111340195B (en) * 2020-03-09 2023-08-22 创新奇智(上海)科技有限公司 Training method and device for network model, image processing method and storage medium
CN111462094A (en) * 2020-04-03 2020-07-28 联觉(深圳)科技有限公司 PCBA component detection method and device and computer readable storage medium
CN111739025B (en) * 2020-05-08 2024-03-19 北京迈格威科技有限公司 Image processing method, device, terminal and storage medium
CN111860522B (en) * 2020-07-23 2024-02-02 中国平安人寿保险股份有限公司 Identity card picture processing method, device, terminal and storage medium
CN112001401B (en) * 2020-07-29 2022-12-09 苏州浪潮智能科技有限公司 Model and training method for example segmentation, and example segmentation network
CN112163634B (en) * 2020-10-14 2023-09-05 平安科技(深圳)有限公司 Sample screening method and device for instance segmentation model, computer equipment and medium
CN113408564A (en) * 2020-10-21 2021-09-17 腾讯科技(深圳)有限公司 Graph processing method, network training method, device, equipment and storage medium
US20220176998A1 (en) * 2020-12-08 2022-06-09 Guangzhou Automobile Group Co., Ltd. Method and Device for Loss Evaluation to Automated Driving
CN112651938B (en) * 2020-12-24 2023-12-19 平安科技(深圳)有限公司 Training method, device, equipment and storage medium for video disc image classification model
CN112966730A (en) * 2021-03-01 2021-06-15 创新奇智(上海)科技有限公司 Vehicle damage identification method, device, equipment and storage medium
CN113223025A (en) * 2021-06-03 2021-08-06 新东方教育科技集团有限公司 Image processing method and device, and neural network training method and device
CN113486892B (en) * 2021-07-02 2023-11-28 东北大学 Production information acquisition method and system based on smart phone image recognition
CN113792738A (en) * 2021-08-05 2021-12-14 北京旷视科技有限公司 Instance splitting method, instance splitting apparatus, electronic device, and computer-readable storage medium
CN113724269A (en) * 2021-08-12 2021-11-30 浙江大华技术股份有限公司 Example segmentation method, training method of example segmentation network and related equipment
CN115223015B (en) * 2022-09-16 2023-01-03 小米汽车科技有限公司 Model training method, image processing method, device and vehicle

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236065A (en) * 2013-05-09 2013-08-07 中南大学 Biochip analysis method based on active contour model and cell neural network
CN103425985A (en) * 2013-08-28 2013-12-04 山东大学 Method for detecting forehead wrinkles on face
CN105095865A (en) * 2015-07-17 2015-11-25 广西师范大学 Directed-weighted-complex-network-based cervical cell recognition method and a cervical cell recognition apparatus
CN105528601A (en) * 2016-02-25 2016-04-27 华中科技大学 Identity card image acquisition and recognition system as well as acquisition and recognition method based on contact type sensor
CN107016409A (en) * 2017-03-20 2017-08-04 华中科技大学 A kind of image classification method and system based on salient region of image
CN107133616A (en) * 2017-04-02 2017-09-05 南京汇川图像视觉技术有限公司 A kind of non-division character locating and recognition methods based on deep learning
CN107203754A (en) * 2017-05-26 2017-09-26 北京邮电大学 A kind of license plate locating method and device based on deep learning
CN107679250A (en) * 2017-11-01 2018-02-09 浙江工业大学 A kind of multitask layered image search method based on depth own coding convolutional neural networks
CN107680678A (en) * 2017-10-18 2018-02-09 北京航空航天大学 Based on multiple dimensioned convolutional neural networks Thyroid ultrasound image tubercle auto-check system
CN108446659A (en) * 2018-03-28 2018-08-24 百度在线网络技术(北京)有限公司 Method and apparatus for detecting facial image
CN108648233A (en) * 2018-03-24 2018-10-12 北京工业大学 A kind of target identification based on deep learning and crawl localization method
CN108717868A (en) * 2018-04-26 2018-10-30 博众精工科技股份有限公司 Glaucoma eye fundus image screening method based on deep learning and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697233B2 (en) * 2014-08-12 2017-07-04 Paypal, Inc. Image processing and matching
US9904849B2 (en) * 2015-08-26 2018-02-27 Digitalglobe, Inc. System for simplified generation of systems for broad area geospatial object detection


Also Published As

Publication number Publication date
CN109447169A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109447169B (en) Image processing method, training method and device of model thereof and electronic system
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
WO2020207166A1 (en) Object detection method and apparatus, electronic device, and storage medium
US11455805B2 (en) Method and apparatus for detecting parking space usage condition, electronic device, and storage medium
CN109377445B (en) Model training method, method and device for replacing image background and electronic system
CN110245662A (en) Detection model training method, device, computer equipment and storage medium
CN111951212A (en) Method for identifying defects of contact network image of railway
CN111680746B (en) Vehicle damage detection model training, vehicle damage detection method, device, equipment and medium
CN111222395A (en) Target detection method and device and electronic equipment
CN110222686B (en) Object detection method, object detection device, computer equipment and storage medium
CN109978918A (en) A kind of trajectory track method, apparatus and storage medium
CN108305260B (en) Method, device and equipment for detecting angular points in image
CN110717366A (en) Text information identification method, device, equipment and storage medium
CN112418278A (en) Multi-class object detection method, terminal device and storage medium
CN109934216B (en) Image processing method, device and computer readable storage medium
WO2020258077A1 (en) Pedestrian detection method and device
CN111767878A (en) Deep learning-based traffic sign detection method and system in embedded device
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
CN108133235A (en) A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN113344000A (en) Certificate copying and recognizing method and device, computer equipment and storage medium
US20170053172A1 (en) Image processing apparatus, and image processing method
CN112800978A (en) Attribute recognition method, and training method and device for part attribute extraction network
CN111340124A (en) Method and device for identifying entity category in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant