CN113378948A - Image mask generation method and device, electronic equipment and storage medium - Google Patents


Publication number
CN113378948A
Authority
CN
China
Prior art keywords
processing
image data
mask
image
data
Prior art date
Legal status
Pending
Application number
CN202110685248.6A
Other languages
Chinese (zh)
Inventor
崔致豪
王子芃
王正
耿嘉
丁有爽
邵天兰
Current Assignee
Mech Mind Robotics Technologies Co Ltd
Original Assignee
Mech Mind Robotics Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Mech Mind Robotics Technologies Co Ltd
Priority to CN202110685248.6A
Publication of CN113378948A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Abstract

The application discloses an image mask generation method and device, an electronic device and a storage medium. The image mask generation method comprises the following steps: acquiring image data containing mask features, wherein the mask features are the mask features of an image comprising an article to be identified; acquiring center point information and identification frame parameters of the article to be identified on the image data; extracting the mask features of the article to be identified from the image data based on the center point information and the identification frame parameters; and generating an image mask of the article based on the mask features extracted from the image data. The invention pools image data of multiple feature dimensions based on the key point information and identification frame parameters of the article extracted by the backbone network, so that the mask features belonging to the identified article can be picked out from the complete image and the mask of the identified article can be extracted accurately.

Description

Image mask generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image mask generation method and apparatus, an electronic device, and a storage medium.
Background
Image recognition technology has found wide application in the commercial field. Currently common image recognition methods fall into several camps. Some generate instance segments by selecting and assembling the outputs of a score map with sliding windows. Others directly predict proposal-free bounding boxes with detectors; these rely heavily on predefined anchors, and hyper-parameter tuning (e.g. anchor ratio, anchor step size) is crucial and differs across data sets and box scales. Still others borrow the idea of keypoint detection, obtaining the four extreme points of an object to generate a mask, or segment instances with a polar representation, predicting the centroid of the object and then predicting the recognition box from the distances between the centroid and dense contour points. In addition, after the recognition box is obtained, conventional methods further eliminate category-level redundant detection boxes to avoid multiple recognition boxes appearing on the same article. Moreover, current mainstream image recognition algorithms usually process image data with a deep backbone network. Common backbone networks spend huge numbers of parameters on accuracy, so model inference is slow and deployment on low-memory devices is severely limited; other backbone networks concentrate on improving inference speed but sacrifice accuracy.
These conventional methods either involve complicated parameter tuning or are suited only to generating recognition boxes or only to generating masks, and the redundant-box suppression they employ causes serious missed detections in certain special industrial scenarios, such as multiple side-by-side tilted articles or articles under occlusion. In an industrial scenario, however, such as one in which a robot grasps articles, missed detection of an article is intolerable, and both the mask and the recognition box are necessary information for subsequent processing. Many problems therefore remain to be overcome before existing image recognition techniques can be applied to industrial scenes.
Disclosure of Invention
In view of the above, the present invention has been made to overcome the above problems or at least partially solve them. Specifically, firstly, the method can execute the process of generating the identification frame and the process of generating the mask in parallel based on the key point information and identification frame parameters of the article; it identifies accurately, generates no redundant identification frames, produces both the identification frame and the mask, and is highly practical in industrial scenes. Secondly, the backbone network provided by the invention processes the input image data at multiple resolutions and multiple feature dimensions across the stages of two processing procedures, and performs only up-sampling in the second procedure, thereby guaranteeing the high resolution of the output feature image data. Thirdly, besides the data processing flow, the backbone network of the invention comprises a feature transition procedure and a residual procedure, which ensure stable feature transition at high-level features, avoid gradient vanishing in a deep network, and improve the accuracy of backbone network inference. Fourthly, the mask generation procedure of the invention obtains feature image data of several high-level feature dimensions and extracts mask features from each feature dimension by pooling, thereby ensuring the integrity of the generated image mask and avoiding mask defects. Fifthly, pooling the image data of multiple feature dimensions based on the key point information and identification frame parameters of the article extracted by the backbone network makes it possible to pick out the mask features belonging to the identified article from the complete image and to extract the mask of the identified article accurately. Sixthly, on the basis of the general image recognition method, the invention provides an image recognition method particularly suitable for identifying multiple tilted side-by-side articles, which improves the accuracy of identifying multiple articles and causes no missed detection.
All the solutions disclosed in the claims and the description of the present application have one or more of the above innovations and can accordingly solve one or more of the above technical problems. Specifically, the application provides an image mask generation method, an image mask generation device, an electronic device and a storage medium.
The image mask generation method of the embodiment of the application comprises the following steps:
acquiring image data containing mask features, wherein the mask features are mask features of an image comprising an article to be identified;
acquiring center point information and identification frame parameters of the article to be identified on the image data;
extracting the mask features of the article to be identified from the image data based on the center point information and the identification frame parameters;
and generating an image mask of the article based on the mask features of the article to be identified that were extracted from the image data.
In some embodiments, the identification box parameter includes width information of the identification box and height information of the identification box.
In some embodiments, the center point comprises a ground-truth (GT) center point.
In some embodiments, the image data including mask features includes feature image data obtained from a backbone network and fused.
In some embodiments, the fusing includes channel-level addition operations and/or channel-level superposition operations.
In some embodiments, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
In some embodiments, in the first data processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
In some embodiments, in the second data processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
In some embodiments, the last processing stage of the first procedure has the same number of processing branches as the first processing stage of the second procedure.
An image mask generation device according to an embodiment of the present application includes:
the image data acquisition module is used for acquiring image data containing mask features, wherein the mask features are the mask features of an image comprising an article to be identified;
the information acquisition module is used for acquiring the center point information and identification frame parameters of the article to be identified on the image data;
the mask feature acquisition module is used for extracting the mask features of the article to be identified from the image data based on the center point information and the identification frame parameters;
and the mask generation module is used for generating an image mask of the article based on the mask features of the article to be identified that were extracted from the image data.
In some embodiments, the identification box parameter includes width information of the identification box and height information of the identification box.
In some embodiments, the center point comprises a ground-truth (GT) center point.
In some embodiments, the image data including mask features includes feature image data obtained from a backbone network and fused.
In some embodiments, the fusing includes channel-level addition operations and/or channel-level superposition operations.
In some embodiments, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
In some embodiments, in the first data processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
In some embodiments, in the second data processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
In some embodiments, the last processing stage of the first procedure has the same number of processing branches as the first processing stage of the second procedure.
The electronic device of the embodiments of the present application includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the image mask generating method of any of the above embodiments when executing the computer program.
The computer-readable storage medium of the embodiments of the present application has stored thereon a computer program that, when executed by a processor, implements the image mask generation method of any of the embodiments described above.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an image recognition method according to some embodiments of the present application;
FIG. 2 is a schematic flow chart diagram of an image data processing method according to some embodiments of the present application;
FIG. 3 is a schematic diagram of a backbone network according to some embodiments of the present application;
FIG. 4 is a flow chart illustrating a mask generation method for mask branching in accordance with certain embodiments of the present disclosure;
FIG. 5 is a schematic flow chart diagram of a mask generation method according to some embodiments of the present application;
FIG. 6 is a schematic flow chart diagram of a method of image recognition for a plurality of tilted side-by-side articles according to certain embodiments of the present application;
FIG. 7 is a schematic illustration of image recognition results of certain embodiments of the present application and image recognition results of the prior art;
FIG. 8 is a schematic illustration of another set of image recognition results of certain embodiments of the present application and image recognition results of the prior art;
FIG. 9 is a schematic view of the regions where the recognition boxes intersect and the regions formed after the recognition boxes are merged;
FIG. 10 is a schematic diagram of an image recognition device according to some embodiments of the present application;
FIG. 11 is a schematic diagram of an image data processing apparatus according to some embodiments of the present application;
FIG. 12 is a block diagram of an image data processing apparatus including a feature transition module and a residual connection module according to some embodiments of the present disclosure;
FIG. 13 is a schematic diagram of a mask generation apparatus for mask branching in accordance with certain embodiments of the present disclosure;
FIG. 14 is a schematic block diagram of a mask generation apparatus according to some embodiments of the present application;
FIG. 15 is a schematic diagram of an image recognition device for multiple tilted side-by-side articles according to some embodiments of the present application;
FIG. 16 is a schematic diagram of an electronic device according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a schematic flow diagram of an image recognition method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step S100, acquiring image data containing an article to be identified;
step S110, processing the image data to identify the article to be identified in the image data, and acquiring key point information and identification frame parameters of the article to be identified;
step S120, aiming at the identified article, generating an identification frame on the image data based on the key point information and identification frame parameters; and is
Step S130, for the identified article, generating a mask for the identified article based on the key point information and the identification frame parameter.
For step S100, the article to be identified in the present invention may be any object placed in any manner; compared with other existing methods, the present invention shows a markedly better detection effect on side-by-side tilted articles in dense scenes. The image data in the invention may be shot on site, or may be pre-stored, manually annotated data.
Compared with traditional methods, the image recognition method provided by this embodiment uses neither predefined anchor boxes nor the complex parameters and computation associated with them; instead, it generates the identification frame marking the article efficiently and accurately from the key point information and identification frame parameters of the article to be identified. The method of this embodiment generates a single identification frame for a single article, produces no redundant identification frames, and removes the need for redundant-box suppression, so it can be applied to all industrial scenes, including scenes containing multiple side-by-side tilted articles or occluded articles, without missed detection. Moreover, the method executes the identification frame generation operation and the mask generation operation in parallel, which makes it highly practical in industrial scenes. In a preferred embodiment, the key point may be the midpoint of the article, and the identification frame parameters may include the width and the length of the identification frame.
For step S110, the image data is input into a backbone network for processing, so as to identify the article in the image data and obtain the key point information and identification frame parameters of the article to be identified. Backbone networks process input data, and for different task targets a backbone network suited to the target can be selected: some backbone networks are suited to recognizing graphics, some to recognizing human faces, and others to recognizing characters. As described above, the emphasis of this embodiment is performing the article identification operation and the article mask generation operation in parallel based on the key point information and the identification frame parameters; the backbone network is used to identify the article in the image data and to acquire the key point information and identification frame parameters of the article to be identified. Any backbone network that can implement these functions can be used in the image recognition method of this embodiment, and this embodiment does not limit the choice of backbone network.
Some backbone networks commonly used in current image recognition methods spend a huge number of parameters on high performance, which severely limits deployment of the models on low-memory devices; conversely, other backbone networks focus on improving inference speed at the cost of accuracy. Therefore, a novel backbone network for image recognition is provided that noticeably improves processing speed while preserving accuracy. This new backbone network is one of the key points of the present invention and can be used in any image recognition method. The image recognition method of the present invention preferably processes the input image data using this novel backbone network.
Fig. 2 shows a flow diagram of image data processing using a novel backbone network according to an embodiment of the present invention. As shown in fig. 2, the method includes:
step S200, receiving image data to be processed;
step S210, processing the image data to be processed by using a first data processing process;
in step S220, the image data processed by the first data processing process is processed using the second data processing process.
For convenience of explanation, fig. 3 schematically shows the structure of the novel backbone network of the present invention. As shown in fig. 3, the network comprises two main parts: a first data processing procedure and a second data processing procedure. Each data processing procedure may include one or more processing stages, and each processing stage may include one or more parallel processing branches, as required by the actual application scenario (the processing branches are drawn as "blocks" in fig. 3; references to "blocks" and "convolution blocks" hereinafter mean processing branches). As an example, the first data processing procedure may include three stages: a first stage, a second stage and a third stage. The first stage includes one processing branch, which applies several equal-resolution convolutions with 3x3 kernels and stride 1, followed by down-sampling with a 1x1 convolutional layer. The second stage comprises two processing branches: one receives the feature image data output by the branch of the first stage and repeats the processing of the first stage, while the other convolves the down-sampled feature image data output by the first stage. The third stage includes three processing branches, which both repeat the operation of the branches of the previous stage and convolve the down-sampled/up-sampled feature image data output by the two branches of the second stage. In addition, the data input to the third stage undergoes multi-resolution, multi-feature-dimension fusion.
After the third stage, the second data processing procedure begins; it may likewise include three stages: a fourth stage, a fifth stage and a sixth stage. Like the third stage, the fourth stage includes three processing branches, which both repeat the operation of the branches of the previous stage and convolve the down-sampled/up-sampled feature image data of the three branches of the third stage; the data input to the fourth stage likewise undergoes multi-resolution, multi-feature-dimension fusion. The image data is convolved at every stage, and the more convolutions are applied, the higher the level of the features contained in the output feature image data. By the time the feature image data enters the second data processing procedure it has passed through three stages, so the feature image data output by the fourth stage contains relatively high-level features. To improve the accuracy of image recognition, it is desirable to increase the feature dimension smoothly so that the processed feature image data loses as little information as possible. From the fourth stage on, a feature transition module is therefore added to the backbone network: before the current processing branch outputs data to a branch with more feature dimensions, the output of the current branch is adjusted by the feature transition module. That is, an additional feature transition module is appended after the main convolution block, and it doubles the preceding feature dimension. In one embodiment, to enrich feature diversity, a deformable convolution is employed as the convolutional layer of the feature transition module.
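By way of illustration, the following PyTorch sketch shows the feature transition idea just described: a small block appended after a processing branch that doubles the feature (channel) dimension before the data is handed to a branch with more feature dimensions. The class name `FeatureTransition` and the 3x3 convolution/BatchNorm/ReLU layer choices are illustrative assumptions, not taken from the patent; the patent additionally suggests a deformable convolution (e.g. torchvision.ops.DeformConv2d) as the convolutional layer.

```python
import torch
import torch.nn as nn

class FeatureTransition(nn.Module):
    # Doubles the channel (feature) dimension of a branch's output before it
    # is passed onward. Layer choices here are assumptions for illustration.
    def __init__(self, in_channels: int):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.transition(x)

# Usage: a 64-dimension feature map in, a 128-dimension feature map out.
features = torch.randn(1, 64, 64, 64)
doubled = FeatureTransition(64)(features)
print(doubled.shape)  # torch.Size([1, 128, 64, 64])
```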
Because the method of the present invention generates the identification box from the key point information of the article, the resolution of the image data must not be too low, especially for two objects close to each other. Therefore, no operation that would lower the resolution of the picture data is performed in the second procedure. In other words, in the first data processing procedure some processing stages include processing that raises the resolution of the image data and some include processing that lowers it, while in the second data processing procedure every processing stage includes only processing that raises the resolution of the image data, to enlarge lower-level features, and none that lowers it. Consequently, after the fourth and fifth stages the feature image data is only up-sampled, and the fifth and sixth stages include only the operation of repeating the branch of the previous stage and the operation of convolving the up-sampled feature image data output by a branch, not the operation of convolving down-sampled feature image data. Viewed as a whole, the processing branches of the stages in the first data processing procedure of the novel backbone network increase gradually, forming a "descending triangle" structure, while those of the second data processing procedure decrease gradually, forming a "rising triangle" structure. The novel backbone network cross-fuses the data output by the several processing branches across feature dimensions and resolutions; the architecture builds high-to-low and low-to-high convolutions in parallel, and repeatedly fuses features of different dimensions while keeping high resolution throughout.
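The cross fusion just described can be pictured with a minimal sketch such as the following, in which each parallel branch's output is brought to a target branch's channel count with a 1x1 convolution and to its resolution by resizing, and the results are summed. The function and variable names are hypothetical, and using bilinear resizing in both directions is a simplification of the patent's 1x1-convolution down-sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_branches(branch_outputs, target_index, channel_convs):
    # Fuse the feature maps of parallel branches into the channel count and
    # resolution of one target branch, then sum them element-wise.
    target = branch_outputs[target_index]
    fused = target
    for i, feat in enumerate(branch_outputs):
        if i == target_index:
            continue
        feat = channel_convs[i](feat)  # 1x1 conv to match channel count
        feat = F.interpolate(feat, size=target.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = fused + feat
    return fused

# Three parallel branches as in the example below: 128x128/32-dim,
# 64x64/64-dim and 32x32/128-dim. Fuse everything into the middle branch.
branches = [torch.randn(1, 32, 128, 128),
            torch.randn(1, 64, 64, 64),
            torch.randn(1, 128, 32, 32)]
convs = nn.ModuleList([nn.Conv2d(32, 64, 1), nn.Identity(), nn.Conv2d(128, 64, 1)])
out = fuse_branches(branches, target_index=1, channel_convs=convs)
print(out.shape)  # torch.Size([1, 64, 64, 64])
```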
In the following, an example of input image data with a resolution of 512 × 512 is taken to explain how the backbone network of the present invention processes image data.
For step S210, the image data with a resolution of 512x512 is input into the processing branch of the first processing stage of the first procedure, which yields feature image data of 32 feature dimensions at a resolution of 128x128. On one hand, this feature image data is input to the upper processing branch of the second processing stage; on the other hand, it is down-sampled by a 1x1 convolution and output to the middle processing branch of the second processing stage. It should be noted that "upper" and "middle" here refer only to the position of a branch in fig. 3 and do not mean that the upper branch processes data before the middle branch; in fact, the processing branches of each processing stage execute in parallel, with no ordering. Throughout the backbone network, the upper processing branches work on image data at a resolution of 128x128, the middle branches at 64x64, and the lower branches at 32x32.
After the upper processing branch of the second processing stage, feature image data of 32 feature dimensions is obtained, and after the middle processing branch, feature image data of 64 feature dimensions. First, the feature image data output by the upper branch is fused with the 1x1-convolution up-sampled output of the middle branch and input to the upper branch of the third processing stage. Second, the 1x1-convolution down-sampled output of the upper branch is fused with the output of the middle branch and input to the middle branch of the third processing stage. Third, the 1x1-convolution down-sampled output of the upper branch is fused with the 1x1-convolution down-sampled output of the middle branch and input to the lower branch of the third processing stage.
After the upper processing branch of the third processing stage, feature image data of 32 feature dimensions is obtained; after the middle branch, 64 feature dimensions; and after the lower branch, 128 feature dimensions. First, the output of the upper branch is fused with the 1x1-convolution up-sampled output of the middle branch and the 1x1-convolution up-sampled output of the lower branch, and input to the upper branch of the fourth processing stage. Second, the 1x1-convolution down-sampled output of the upper branch is fused with the output of the middle branch and the 1x1-convolution up-sampled output of the lower branch, and input to the middle branch of the fourth processing stage. Third, the 1x1-convolution down-sampled output of the upper branch is fused with the 1x1-convolution down-sampled output of the middle branch and input to the lower branch of the fourth processing stage.
In step S220, feature image data of 32 feature dimensions is obtained after the upper processing branch of the fourth processing stage and becomes 64 feature dimensions after the feature transition module; the middle branch yields 64 feature dimensions, which become 128 after the feature transition module; and the lower branch yields 128 feature dimensions. First, the feature-transition output of the upper branch is fused with the 1x1-convolution up-sampled output of the middle branch and the 1x1-convolution up-sampled output of the lower branch, and input to the upper branch of the fifth processing stage. Second, the feature-transition output of the middle branch is fused with the 1x1-convolution up-sampled output of the lower branch and input to the middle branch of the fifth processing stage.
After the upper processing branch of the fifth processing stage, feature image data of 64 feature dimensions is obtained, which becomes 128 feature dimensions after the feature transition module; the middle branch yields feature image data of 128 feature dimensions. The feature-transition output of the upper branch is fused with the 1x1-convolution up-sampled output of the middle branch and input to the upper branch of the sixth processing stage. After the upper processing branch of the sixth processing stage, feature image data of 256 feature dimensions is obtained.
Since the backbone network of the present invention may comprise many processing stages, the more stages there are, the "deeper" the network is and the more likely information is lost during image data processing. In one embodiment, to prevent gradient vanishing in the deep network and to strengthen the head-to-tail features, the novel backbone network may further include residual connection modules: one or more processing branches of the first data processing procedure may feed data, processed by the residual procedure, into one or more processing branches of the second data processing procedure. As shown in fig. 3, as an example, two residual connection modules are drawn above the whole network: one connects the processing branch of the first processing stage with the processing branch of the sixth processing stage, and the other connects the upper branch of the second processing stage with the upper branch of the fifth processing stage. This breaks the information barrier between low-level and high-level features and avoids information loss during processing.
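As an illustration of such a long-range residual connection, the sketch below projects an early-stage feature map to the channel count of a late-stage branch and adds it element-wise. The class name `StageSkip` and the 1x1 projection are assumptions; the patent states only that the early and late branches are connected residually.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageSkip(nn.Module):
    # Residually connects an early processing stage to a late one: the early
    # feature map is projected with a 1x1 conv to match the late stage's
    # channel count, resized if needed, and added element-wise.
    def __init__(self, early_channels: int, late_channels: int):
        super().__init__()
        self.project = nn.Conv2d(early_channels, late_channels, kernel_size=1)

    def forward(self, early: torch.Tensor, late: torch.Tensor) -> torch.Tensor:
        skip = self.project(early)
        if skip.shape[-2:] != late.shape[-2:]:
            skip = F.interpolate(skip, size=late.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return late + skip

# E.g. the stage-2 upper branch (32 dims) feeding the stage-5 upper branch (64 dims).
early, late = torch.randn(1, 32, 128, 128), torch.randn(1, 64, 128, 128)
print(StageSkip(32, 64)(early, late).shape)  # torch.Size([1, 64, 128, 128])
```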
For step S120, based on the recognition result of the backbone network and the data obtained during backbone network processing, the center point information of the article to be identified is extracted and the length and width information of the identification frame marking the article is calculated, and the identification frame is generated on the image data to mark the identified article.
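The geometry of this step is straightforward; a minimal sketch of turning a predicted center point and identification frame parameters (width, height) into the corner coordinates of the identification frame might look like this (illustrative code, not taken from the patent):

```python
def box_from_center(cx: float, cy: float, width: float, height: float):
    # Corner coordinates of an identification frame centered at (cx, cy).
    x0, y0 = cx - width / 2.0, cy - height / 2.0
    x1, y1 = cx + width / 2.0, cy + height / 2.0
    return x0, y0, x1, y1

# An 80x40 identification frame centered at (120, 60):
print(box_from_center(120, 60, 80, 40))  # (80.0, 40.0, 160.0, 80.0)
```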
For step S130, the image recognition method of the present invention consists of two parts: generating the recognition frame and predicting the article mask. Both parts use the feature image data that the backbone network obtains while processing the data. Thus, in addition to the main procedure of generating the identification box with the backbone network, the present invention runs the mask branch procedure synchronously, and in one embodiment the operation of generating the identification box shares feature weights with the operation of generating the mask. The mask branch makes full use of the feature image data generated while the backbone network processes the image data and fuses feature image data of several feature dimensions to extract the mask of the article; this is one of the key points of the invention.
Fig. 4 illustrates a mask generation method used by a mask generation component in the image recognition method according to an embodiment of the present invention. As shown in fig. 4, the method includes:
step S300, receiving image data to be processed;
step S310, inputting image data to be processed into a data processing process comprising M processing stages for processing, wherein M is an integer greater than or equal to 2;
step S320, acquiring data output by N processing stages in M processing stages, wherein N is an integer greater than or equal to 2 and less than or equal to M;
step S330, fusing the data output by the N processing stages;
step S340, performing pooling processing on the fused data to obtain an image mask.
As for step S310, as mentioned above, the present invention inputs the image data into a backbone network for processing, and the backbone network may preferably be the novel backbone network of the present invention. The backbone network includes a plurality of processing stages; the specific processing procedure is described in the backbone network embodiments above and is not repeated here.
For step S320, in order to extract the mask of the article accurately and with good quality, this embodiment obtains several pieces of feature image data generated during the data processing of the backbone network. For example, the backbone network shown in fig. 3 includes six data processing stages, and the mask branch receives the feature image data generated in three of them, specifically stages 4, 5 and 6.
As for step S330, an example of the feature image data fusion performed by the mask branch is shown in the lower part of fig. 3. The 4th processing stage inputs feature image data of 128 feature dimensions to the mask branch, and the 5th stage likewise inputs 128 feature dimensions; between the 4th and 5th stages, channel-level superposition is used, yielding feature image data of 256 feature dimensions. The 6th processing stage inputs feature image data of 256 feature dimensions to the mask branch, and between the 5th and 6th stages channel-level addition is used, yielding new feature image data of 256 feature dimensions. That is, channel-level superposition is performed between data below the maximum feature dimension of the backbone network, and channel-level addition between data equal to the maximum feature dimension.
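A minimal sketch of this fusion rule follows, assuming (for illustration) that the three stage outputs have already been brought to a common 128x128 resolution: feature maps below the maximum feature dimension are concatenated along the channel axis (channel-level superposition), and feature maps already at the maximum dimension are added element-wise (channel-level addition).

```python
import torch

f4 = torch.randn(1, 128, 128, 128)  # stage-4 output, 128 feature dimensions
f5 = torch.randn(1, 128, 128, 128)  # stage-5 output, 128 feature dimensions
f6 = torch.randn(1, 256, 128, 128)  # stage-6 output, 256 feature dimensions

fused_45 = torch.cat([f4, f5], dim=1)  # superposition: 128 + 128 -> 256 channels
mask_features = fused_45 + f6          # addition: stays at 256 channels
print(mask_features.shape)             # torch.Size([1, 256, 128, 128])
```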
For step S340, the present invention processes the fused feature image data by pooling to obtain the mask of the article. Various pooling methods for obtaining an object mask exist in the prior art; the emphasis of this embodiment is on obtaining the image mask by pooling the new mask-feature-bearing feature image data extracted and fused from multiple data processing stages, not on a specific pooling method, and any suitable pooling method for mask generation may be used in this embodiment.
Although any pooling method can be used, to improve the accuracy of mask generation the invention develops a center-point-based mask generation method, which is particularly suited to center-point-based image recognition and is also one of the key points of the invention. FIG. 5 illustrates an image mask generation method according to one embodiment of the present invention. As shown in fig. 5, the method includes:
Step S400, acquiring image data containing mask features, wherein the mask features are the mask features of the image of the article to be identified;
Step S410, acquiring the center point information and identification frame parameters of the article to be identified on the image data;
Step S420, extracting the mask features of the article to be identified from the image data based on the center point information and the identification frame parameters;
Step S430, generating an image mask of the article to be identified based on the mask features of the article to be identified extracted from the image data.
For step S400, the image data processed by the backbone network is image data that contains the article to be masked together with other articles, background imagery, and so on. The method of this embodiment extracts the article's mask features from as many high-level feature dimensions as possible; therefore, preferably, as shown in fig. 3, the invention extracts image data of 256 feature dimensions from multiple processing stages of the second data processing procedure, and this feature image data contains the mask features of the article to be identified.
For step S410, based on the data obtained during backbone network processing, the center point information of the article to be identified is extracted and the length and width information of the identification frame used to mark the article is calculated. This information can be used both to generate the identification frame and to generate the mask of the article to be identified; the backbone network's data processing is not repeated here. In a preferred embodiment, the center point may be the ground-truth center point, abbreviated the GT center point.
For step S420, as described above, the acquired image data is complete image data containing the article to be identified and comprises feature image data of 256 feature dimensions. In this step, on one hand, the position of the article to be identified must be located from the center point information and the length and width information of the identification frame; on the other hand, the article's mask features must be picked out of the feature image data. In the next step, the mask of the article to be identified is then generated from the acquired mask features, after which various operations can conveniently be performed using the mask.
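A minimal sketch of this step, under assumptions not fixed by the patent (a 28x28 pooled output size and bilinear resampling), locates the identification frame around the center point, crops the fused feature map to it, and pools the crop to a fixed size:

```python
import torch
import torch.nn.functional as F

def extract_mask_features(feature_map, center, box_wh, out_size=28):
    # Crop the fused feature map to the identification frame defined by the
    # center point and the frame's width/height, then pool to a fixed size.
    _, _, h, w = feature_map.shape
    cx, cy = center
    bw, bh = box_wh
    x0, y0 = max(int(cx - bw / 2), 0), max(int(cy - bh / 2), 0)
    x1, y1 = min(int(cx + bw / 2), w), min(int(cy + bh / 2), h)
    crop = feature_map[:, :, y0:y1, x0:x1]
    return F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)

# 256-dimension fused features; an article centered at (64, 80) in a 40x56 frame.
features = torch.randn(1, 256, 128, 128)
roi = extract_mask_features(features, center=(64, 80), box_wh=(40, 56))
print(roi.shape)  # torch.Size([1, 256, 28, 28])
```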
The image recognition method of the invention is particularly suitable for industrial scenes containing, for example, multiple tilted or occluded articles, and how to use the image recognition method in such industrial scenes is also one of the key points of the invention.
FIG. 6 illustrates a method of identifying and tagging image data comprising a plurality of items, according to one embodiment of the invention. As shown in fig. 6, the method includes:
step S500, acquiring image data containing an article group to be identified;
step S510, identifying each item in the group of items to be identified based on the image data;
step S520, for each item identified, generating an identification frame on the image data to mark the item.
With respect to step S500, figs. 7 and 8 exemplarily show image data in which a plurality of articles to be manipulated are placed tilted side by side; these articles constitute an article group.
For step S510, every article must be identified from the image data. Identifying all the articles with the image recognition method of the present invention requires inputting the image data into the backbone network for processing; the data processing method of the backbone network is not repeated here.
For step S520, in the industrial scenario all the articles of the article group in the image data are operation targets: they may need to be grasped or painted, so generally every article in the group must be identified, with no omissions. Figs. 7(a) and 8(a) show the results of a conventional recognition method, and figs. 7(b) and 8(b) show the results of the recognition method of the present invention. The invention generates an identification frame for each identified article from the article's center point information and the length and width information of the identification frame, so the center of every generated identification frame lies on the identified article. In the prior art, when the degree of overlap between two recognition frames exceeds a certain threshold, one of the frames is judged redundant and deleted from the output image, which produces the recognition results shown in figs. 7(a) and 8(a), that is, many missed detections. With the recognition method of the present invention, the results shown in figs. 7(b) and 8(b) are produced instead: the method tolerates partial or even complete overlap of recognition frames, and even when the overlap degree exceeds 60% it still correctly identifies each article without deleting any recognition frame, so no missed detection occurs. The overlap degree of two recognition frames equals the area of the region where the two frames intersect divided by the area of the region formed by merging the two frames. As shown in fig. 9, the black portion in fig. 9(a) is the intersection area of the two recognition frames, and the black portion in fig. 9(b) is the merged area.
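The overlap degree defined above (intersection area divided by merged area) can be computed directly from the frame corners; a small sketch with frames given as (x0, y0, x1, y1) tuples:

```python
def overlap_degree(box_a, box_b):
    # Area of the intersection region divided by the area of the merged region.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two heavily overlapping frames: conventional redundancy suppression would
# delete one of them, whereas the method described here keeps both.
print(overlap_degree((0, 0, 100, 100), (30, 0, 130, 100)))  # ~0.538
```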
In addition, it should be noted that although each embodiment of the present invention has a specific combination of features, further combinations and cross-combinations of these features between embodiments are also possible.
According to the embodiments above, firstly, the processing of generating the identification frame and the processing of generating the mask can be executed in parallel based on the key point information and identification frame parameters of the article; when the method is used in an industrial scene, no redundant identification frames are generated, and the practicability is strong. Secondly, the backbone network provided by the invention processes the input image data across the stages of two processing procedures and performs only up-sampling in the second procedure, thereby guaranteeing the high resolution of the output feature image data. Thirdly, besides the data processing flow, the backbone network of the invention comprises a feature transition procedure and a residual procedure, which ensure stable feature transition at high-level features and avoid gradient vanishing in a deep network. Fourthly, the mask generation procedure of the invention obtains feature image data of several high-level feature dimensions and extracts mask features from each feature dimension by pooling, so the image mask can be generated completely, without omission. Fifthly, pooling the image data of multiple feature dimensions based on the key point information and identification frame parameters of the article extracted by the backbone network allows the mask of the identified article to be extracted accurately from the complete image. Sixthly, on the basis of the general image recognition method, the invention provides a method dedicated to identifying multiple tilted side-by-side articles, which improves recognition accuracy and avoids missed detection.
Fig. 10 shows an image recognition apparatus according to still another embodiment of the present invention, the apparatus including:
an image data obtaining module 600, configured to obtain image data including an article to be identified, that is, to implement step S100;
an image data processing module 610, configured to process the image data to identify an article to be identified in the image data, and obtain key point information and an identification frame parameter of the article to be identified, that is, to implement step S110;
an identification frame generation module 620, configured to generate, for the identified article, an identification frame on the image data based on the key point information and the identification frame parameters, that is, to implement step S120; and
a mask generation module 630, configured to generate, for the identified article, a mask based on the key point information and the identification frame parameters, that is, to implement step S130.
Fig. 11 shows an apparatus for processing image data according to still another embodiment of the present invention, the apparatus including:
an image data receiving module 700, configured to receive image data to be processed, that is, to implement step S200;
a first data processing module 710, configured to process the image data to be processed by using a first data processing procedure, that is, to implement step S210;
a second data processing module 720, configured to process the image data processed by the first data processing process using a second data processing process, that is, to implement step S220;
in the first data processing procedure, some processing stages include processing that raises the resolution of the image data and some include processing that lowers it; in the second data processing procedure, every processing stage includes processing that raises the resolution of the image data and none includes processing that lowers it.
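As a one-line illustration of the second-procedure rule, using bilinear up-sampling as an assumed resizing choice (the patent fixes only the direction of the resolution change):

```python
import torch
import torch.nn.functional as F

f = torch.randn(1, 128, 64, 64)
up = F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
print(up.shape)  # torch.Size([1, 128, 128, 128]) -- resolution raised, never lowered
```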
Fig. 12 shows an apparatus for processing image data according to still another embodiment of the present invention, the apparatus including:
an image data receiving module 800, which receives image data to be processed, that is, is used for implementing step S200;
a first data processing module 810, for processing the image data to be processed by using a first data processing procedure, that is, for implementing step S210;
a second data processing module 820 for processing the image data processed by the first data processing process using a second data processing process, i.e., for implementing step S220;
wherein the data processing process comprises one or more processing stages, each processing stage comprising one or more processing branches;
the processing apparatus further comprises:
a residual processing module 830, configured to connect one or more processing branches in the first data processing procedure with one or more processing branches in the second data processing procedure through a residual processing procedure;
the feature transition module 840 is configured to perform feature transition processing on the data to be output before one or more processing branches in the second processing procedure output data in a next processing stage.
For the residual processing module 830: since the backbone network of the present invention may include many processing stages, the more stages there are, the "deeper" the network is and the more likely information is lost during image data processing. In one embodiment, to prevent gradient vanishing in the deep network and to strengthen the head-to-tail features, the novel backbone network may further include residual connection modules, through which one or more processing branches of the first data processing procedure feed residual-processed data into one or more processing branches of the second data processing procedure. As shown in fig. 3, as an example, two residual connection modules are drawn above the whole network: one connects the processing branch of the first processing stage with the processing branch of the sixth processing stage, and the other connects the upper branch of the second processing stage with the upper branch of the fifth processing stage, breaking the information barrier between low-level and high-level features and avoiding information loss during processing. The residual processing module 830 is used to implement the above method steps.
For the feature transition module 840: the image data is convolved at every stage, and the more convolutions are applied, the higher the level of the features contained in the output feature image data. By the time the feature image data enters the second data processing procedure it has passed through three stages, so the feature image data output by the fourth stage contains relatively high-level features. To improve the accuracy of image recognition, it is desirable to increase the feature dimension smoothly so that the processed feature image data loses as little information as possible. From the fourth stage on, a feature transition module is therefore added to the backbone network: before the current processing branch outputs data to a branch with more feature dimensions, its output is adjusted by the feature transition module; that is, an additional feature transition module is appended after the main convolution block, and it doubles the preceding feature dimension. In one embodiment, to enrich feature diversity, a deformable convolution is employed as the convolutional layer of the feature transition module. The feature transition module 840 is used to implement the above method steps.
FIG. 13 shows an image mask generation apparatus according to still another embodiment of the present invention, the apparatus including:
an image data receiving module 900, configured to receive image data to be processed, that is, to implement step S300;
an image data processing module 910, configured to input image data to be processed into a data processing procedure including M processing stages for processing, where M is an integer greater than or equal to 2, that is, to implement step S310;
a data obtaining module 920, configured to obtain data output by N processing stages of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M, that is, to implement step S320;
a fusion module 930, configured to fuse data output by the N processing stages, that is, to implement step S330;
a mask generation module 940, configured to pool the fused data to obtain an image mask, that is, to implement step S340.
FIG. 14 shows an image mask generation apparatus according to still another embodiment of the present invention, the apparatus including:
an image data obtaining module 1000, configured to obtain image data including a mask feature, that is, to implement step S400, where the mask feature is a mask feature of an image including an object to be identified;
an information obtaining module 1010, configured to obtain center point information of an article to be identified and an identification frame parameter on the image data, that is, to implement step S410;
a mask feature acquisition module 1020, configured to extract the mask features of the article to be identified from the image data based on the center point information and the identification frame parameters, that is, to implement step S420;
a mask generating module 1030, configured to generate an image mask of the article based on the mask feature of the article to be identified, which is extracted from the image data, that is, to implement step S430.
Fig. 15 shows an image recognition apparatus according to still another embodiment of the present invention, the apparatus including:
an image data acquisition module 1100, configured to acquire image data including an item group to be identified, that is, to implement step S500;
an image identification module 1110, configured to identify each item in the group of items to be identified based on the image data, that is, to implement step S510;
an identification frame generating module 1120, configured to generate an identification frame on the image data for marking each identified item, i.e. for implementing step S520;
wherein the article group comprises at least two articles; and,
the center of each identification frame is located within the image of the article marked by that identification frame; and,
the plurality of identification frames generated on the image data at least partially overlap.
In the device embodiments shown in figs. 10 to 15, only the main functions of the modules are described. All the functions of each module correspond to the steps of the method embodiments, and the working principle of each module may likewise be found in the description of the corresponding steps, which is not repeated here. Furthermore, although the above embodiments fix a correspondence between module functions and method steps, those skilled in the art will understand that the functions of the modules are not limited to that correspondence: a given functional module may also implement other method steps or parts of them. For example, the above embodiment describes the mask generation module 1030 as implementing the method of step S430, yet the mask generation module 1030 may, as the actual situation requires, also be used to implement the method of step S400, S410 or S420, or a part thereof.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above embodiments. It should be noted that the computer program stored in the computer-readable storage medium of the embodiments of the present application may be executed by a processor of an electronic device, and the computer-readable storage medium may be a storage medium built in the electronic device or a storage medium that can be plugged into the electronic device in an attachable and detachable manner.
Fig. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device may be a control system/electronic system configured in an automobile, a mobile terminal (e.g., a smart mobile phone, etc.), a personal computer (PC, e.g., a desktop computer or a notebook computer, etc.), a tablet computer, a server, and the like, and a specific implementation of the electronic device is not limited by the specific embodiment of the present invention.
As shown in fig. 16, the electronic device may include: a processor (processor)1202, a communication Interface 1204, a memory 1206, and a communication bus 1208.
Wherein:
the processor 1202, communication interface 1204, and memory 1206 communicate with one another via a communication bus 1208.
A communication interface 1204 for communicating with network elements of other devices, such as clients or other servers.
The processor 1202 is configured to execute the program 1210, and may specifically perform the relevant steps in the foregoing method embodiments.
In particular, program 1210 may include program code comprising computer operating instructions.
The processor 1202 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The electronic device comprises one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 1206 is used for storing programs 1210. The memory 1206 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 1210 may be downloaded and installed from a network through the communication interface 1204, and/or installed from a removable medium. The program, when executed by the processor 1202, may cause the processor 1202 to perform the operations of the above-described method embodiments.
Broadly, the inventive content of the present application comprises the following. An image recognition method, comprising:
acquiring image data containing an article to be identified;
processing the image data to identify the object to be identified in the image data, and acquiring key point information and identification frame parameters of the object to be identified;
for the identified item, generating an identification box on the image data based on the keypoint information and the identification frame parameters; and
for the identified item, generating a mask of the identified item based on the keypoint information and the identification frame parameters (a minimal sketch of the box-assembly step follows the optional features below).
Optionally, the operation of generating the identification frame and the operation of generating the mask of the identified article are performed in parallel.
Optionally, the identification frame parameter includes a width of the identification frame and a height of the identification frame.
Optionally, the key point comprises a central point of the identified item.
Optionally, the processing the image data includes inputting the image data into a backbone network for processing.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, the data output by the plurality of processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
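By way of illustration, the following minimal sketch in PyTorch-style Python shows how an identification box might be assembled from a predicted center point and the identification frame parameters (width and height); the function name, tensor layout and example values are assumptions for exposition, not the patented implementation. Because the mask branch consumes the same key point information and frame parameters, it can run in parallel with this box-assembly step.

import torch

def boxes_from_keypoints(centers: torch.Tensor, sizes: torch.Tensor) -> torch.Tensor:
    # centers: (K, 2) tensor of (cx, cy) center points, one per identified item.
    # sizes:   (K, 2) tensor of (w, h) identification frame parameters.
    # Returns a (K, 4) tensor of boxes as (x1, y1, x2, y2).
    half = sizes / 2.0
    top_left = centers - half
    bottom_right = centers + half
    return torch.cat([top_left, bottom_right], dim=1)

# Example: two items whose centers and frame parameters were predicted upstream.
centers = torch.tensor([[50.0, 40.0], [120.0, 80.0]])
sizes = torch.tensor([[30.0, 20.0], [60.0, 50.0]])
print(boxes_from_keypoints(centers, sizes))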
An image recognition apparatus comprising:
the image data acquisition module is used for acquiring image data containing the article to be identified;
the image data processing module is used for processing the image data to identify the object to be identified in the image data and acquiring key point information and identification frame parameters of the object to be identified;
an identification frame generation module, configured to generate, for the identified item, an identification frame on the image data based on the key point information and the identification frame parameters; and
a mask generation module, configured to generate a mask of the identified item based on the key point information and the identification frame parameters.
Optionally, the recognition frame generation module and the mask generation module run in parallel.
Optionally, the identification frame parameter includes a width of the identification frame and a height of the identification frame.
Optionally, the key point comprises a central point of the identified item.
Optionally, the image data processing module is configured to input the image data into a backbone network for processing.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, the data output by the plurality of processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
A method of processing image data, comprising:
receiving image data to be processed;
processing the image data to be processed using a first data processing procedure;
processing the image data processed by the first data processing process using a second data processing process;
wherein, in the first data processing process, some of the processing stages include processing that increases the resolution of the image data and some include processing that decreases it; in the second data processing process, every processing stage includes processing that increases the resolution of the image data and none includes processing that decreases it.
Optionally, the processing stage comprises one or more processing branches.
Optionally, the data output by the plurality of processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the plurality of processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
Optionally, the reducing resolution processing comprises reducing resolution using a 1x1 convolution down-sampling process; and/or the increasing resolution processing comprises increasing resolution using 1x1 convolution upsampling.
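A minimal sketch of the two resolution operations above, under the assumption that "1x1 convolution down-sampling" denotes a stride-2 1x1 convolution and that up-sampling pairs a 1x1 convolution with nearest-neighbor interpolation; the channel counts, scale factor and class names are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Downsample1x1(nn.Module):
    # Resolution-decreasing processing: a 1x1 convolution with stride 2.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)

    def forward(self, x):
        return self.conv(x)

class Upsample1x1(nn.Module):
    # Resolution-increasing processing: a 1x1 convolution followed by interpolation.
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.scale = scale

    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=self.scale, mode="nearest")

x = torch.randn(1, 32, 64, 64)
low = Downsample1x1(32, 64)(x)   # a first-process stage may lower resolution: (1, 64, 32, 32)
high = Upsample1x1(64, 32)(low)  # a second-process stage only raises it:      (1, 32, 64, 64)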
An apparatus for processing image data, comprising:
the image data receiving module is used for receiving image data to be processed;
the first data processing module is used for processing the image data to be processed by using a first data processing process;
a second data processing module for processing the image data processed by the first data processing process using a second data processing process;
wherein, in the first data processing process, some of the processing stages include processing that increases the resolution of the image data and some include processing that decreases it; in the second data processing process, every processing stage includes processing that increases the resolution of the image data and none includes processing that decreases it.
Optionally, the processing stage comprises one or more processing branches.
Optionally, the data output by the plurality of processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the plurality of processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
Optionally, the reducing resolution processing comprises reducing resolution using a 1x1 convolution down-sampling process; and/or the increasing resolution processing comprises increasing resolution using a 1x1 convolution upsampling process.
A method of processing image data, comprising:
receiving image data to be processed;
processing the image data to be processed using a first data processing procedure;
processing the image data processed by the first data processing process using a second data processing process;
wherein the data processing process comprises one or more processing stages, each processing stage comprising one or more processing branches;
the method further comprises a residual processing procedure, wherein one or more processing branches in the first data processing procedure are connected with one or more processing branches in the second data processing procedure through the residual processing procedure;
and before one or more processing branches in the second processing process output data to the next processing stage, processing the data to be output through a feature transition operation.
Optionally, the feature transition operation includes performing deformable convolution processing on the data to be output (a sketch follows the optional features below).
Optionally, the residual processing procedure includes connecting a processing branch of a first processing stage in the first data processing procedure with a processing branch of a last processing stage in the second data processing procedure through the residual processing procedure.
Optionally, data output by the multiple processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the plurality of processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
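The following sketch shows one plausible reading of the residual processing and the feature transition operation, implemented with torchvision's DeformConv2d and with sampling offsets predicted by an ordinary convolution; this offset-prediction pattern, the channel count and the kernel size are assumptions for illustration, not the patented implementation.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureTransition(nn.Module):
    # Feature transition sketch: a deformable convolution applied to a branch's
    # output before that output is passed to the next processing stage.
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # An ordinary convolution predicts the 2D sampling offsets (x and y per tap).
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        return self.deform(x, self.offset_pred(x))

# Residual processing sketch: a first-process branch output is added to the
# corresponding second-process branch output before the feature transition.
first_branch = torch.randn(1, 64, 32, 32)
second_branch = torch.randn(1, 64, 32, 32)
out = FeatureTransition(64)(first_branch + second_branch)  # (1, 64, 32, 32)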
An apparatus for processing image data, comprising:
the image data receiving module is used for receiving image data to be processed;
the first data processing module is used for processing the image data to be processed by using a first data processing process;
a second data processing module which processes the image data processed by the first data processing process using a second data processing process;
wherein the data processing process comprises one or more processing stages, each processing stage comprising one or more processing branches;
the processing apparatus further comprises:
a residual processing module for connecting one or more processing branches in the first data processing process with one or more processing branches in the second data processing process through a residual processing process;
and the feature transition module is used for performing feature transition processing on the data to be output before one or more processing branches in the second processing process output data to the next processing stage.
Optionally, the feature transition module is further configured to perform deformable convolution processing on the data to be output.
Optionally, the residual error processing module is further configured to connect the processing branch of the first processing stage in the first data processing process with the processing branch of the last processing stage in the second data processing process through the residual error processing process.
Optionally, data output by the multiple processing branches are fused.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the plurality of processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
An image mask generation method, comprising:
receiving image data to be processed;
inputting image data to be processed into a data processing process comprising M processing stages for processing, wherein M is an integer greater than or equal to 2;
acquiring data output by N processing stages in M processing stages, wherein N is an integer greater than or equal to 2 and less than or equal to M;
fusing data output by the N processing stages;
and performing pooling processing on the fused data to obtain an image mask.
Optionally, inputting the image data to be processed into the data processing process including M processing stages specifically means inputting the image data to be processed into a backbone network for processing, where the backbone network includes a first processing procedure and a second processing procedure, the processing procedures including the M processing stages.
Optionally, the N processing stages are processing stages in a second data processing process.
Optionally, the fusing includes a channel-level addition operation and/or a channel-level superposition operation (a fusion sketch follows these optional features).
Optionally, performing the pooling processing to obtain the image mask includes calculating the image mask based on the center point information of the object to be identified and the identification frame parameters.
Optionally, the identification frame parameters include width information of the identification frame and height information of the identification frame.
Optionally, the center point information includes a ground-truth (Ground Truth) center point.
Optionally, the processing stage includes one or more processing branches, and the inputting the image data to be processed into the backbone network for processing further includes fusing data output by the processing branches.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
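A minimal sketch of the fusion step, resizing the N stage outputs to a common resolution and then combining them. "Channel-level superposition" is interpreted here as channel-wise concatenation, which is an assumption, and the function name and interpolation mode are likewise illustrative.

import torch
import torch.nn.functional as F

def fuse_stage_outputs(features, mode: str = "add"):
    # features: list of (B, C, H, W) maps output by N processing stages; their
    # resolutions may differ, so each is interpolated to the first map's size.
    # mode "add" performs channel-level addition (channel counts must match);
    # mode "cat" performs channel-level concatenation.
    target = features[0].shape[-2:]
    resized = [f if f.shape[-2:] == target else
               F.interpolate(f, size=target, mode="nearest")
               for f in features]
    if mode == "add":
        return torch.stack(resized, dim=0).sum(dim=0)
    return torch.cat(resized, dim=1)

f1 = torch.randn(1, 32, 64, 64)
f2 = torch.randn(1, 32, 32, 32)
fused = fuse_stage_outputs([f1, f2], mode="add")  # (1, 32, 64, 64)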
An image mask generation apparatus comprising:
the image data receiving module is used for receiving image data to be processed;
the image data processing module is used for inputting image data to be processed into a data processing process comprising M processing stages for processing, wherein M is an integer greater than or equal to 2;
the data acquisition module is used for acquiring data output by N processing stages in the M processing stages, wherein N is an integer which is more than or equal to 2 and less than or equal to M;
the fusion module is used for fusing the data output by the N processing stages;
and the mask generation module is used for performing pooling processing on the fused data to obtain an image mask.
Optionally, the image data processing module is specifically configured to input image data to be processed into a backbone network for processing, where the backbone network includes a first processing procedure and a second processing procedure, and the processing procedure includes M processing stages.
Optionally, the N processing stages are processing stages in a second data processing process.
Optionally, the fusing includes a channel-level addition operation and/or a channel-level superposition operation.
Optionally, the mask generating module is further configured to calculate an image mask based on the center point information of the object to be identified and the identification frame parameter.
Optionally, the identification frame parameter includes width information of the identification frame and height information of the identification frame.
Optionally, the center point information includes a ground-truth (Ground Truth) center point.
Optionally, the processing stage includes one or more processing branches, and the inputting the image data to be processed into the backbone network for processing further includes fusing data output by the processing branches.
Optionally, the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
An image mask generation method, comprising:
acquiring image data containing mask features, wherein the mask features are mask features of an image comprising an article to be identified;
acquiring central point information and identification frame parameters of the object to be identified on the image data;
extracting mask features of the article to be identified from the image data based on the central point information and the identification frame parameters;
and generating an image mask of the article based on the mask features of the article to be identified, which are extracted from the image data (a pooling sketch follows the optional features below).
Optionally, the identification frame parameter includes width information of the identification frame and height information of the identification frame.
Optionally, the center point information includes a ground-truth (Ground Truth) center point.
Optionally, the image data containing the mask features includes feature image data which is obtained from a backbone network and fused.
Optionally, the fusing includes a channel-level addition operation and/or a channel-level superposition operation.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
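One plausible realization of the mask-feature extraction step: torchvision's roi_align pools the fused feature map over a window centered on each item's center point and sized by the identification frame's width and height. The function name, output size and coordinate conventions are assumptions for illustration.

import torch
from torchvision.ops import roi_align

def extract_mask_features(feature_map, centers, sizes, out_size=28):
    # feature_map: (1, C, H, W) fused feature image data.
    # centers:     (K, 2) center points (cx, cy) in feature-map coordinates.
    # sizes:       (K, 2) identification frame parameters (w, h).
    # Returns a (K, C, out_size, out_size) tensor of per-item mask features.
    half = sizes / 2.0
    boxes = torch.cat([centers - half, centers + half], dim=1)  # (K, 4)
    batch_idx = torch.zeros(len(boxes), 1)                      # all boxes from image 0
    rois = torch.cat([batch_idx, boxes], dim=1)                 # (K, 5)
    return roi_align(feature_map, rois, output_size=out_size)

feats = torch.randn(1, 64, 100, 100)
centers = torch.tensor([[50.0, 40.0]])
sizes = torch.tensor([[30.0, 20.0]])
mask_feats = extract_mask_features(feats, centers, sizes)  # (1, 64, 28, 28)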
An image mask generation apparatus comprising:
the image data acquisition module is used for acquiring image data containing mask features, wherein the mask features are mask features of an image comprising an object to be identified;
the information acquisition module is used for acquiring the center point information of the article to be identified and the identification frame parameters on the image data;
the mask feature acquisition module is used for extracting the mask features of the to-be-identified object from the image data based on the central point information and the identification frame parameters;
and the mask generating module is used for generating an image mask of the object based on the mask features of the object to be identified, which are extracted from the image data.
Optionally, the identification frame parameter includes width information of the identification frame and height information of the identification frame.
Optionally, the center point information includes a ground-truth (Ground Truth) center point.
Optionally, the image data containing the mask features includes feature image data which is obtained from a backbone network and fused.
Optionally, the fusing includes a channel-level addition operation and/or a channel-level superposition operation.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure and the first processing stage of the second processing procedure have the same number of processing branches.
An image recognition method, comprising:
acquiring image data containing an article group to be identified;
identifying each item in the group of items to be identified based on the image data;
for each item identified, generating an identification box on the image data to mark the item;
wherein the group of items comprises at least two items; and
the center of each identification frame is positioned within the image of the item marked by that identification frame; and
the plurality of recognition boxes generated on the image data at least partially overlap.
Optionally, the degree of overlap of the identification frames is expressed as an overlap degree, the overlap degree of two identification frames being equal to the area of the region where the two frames intersect divided by the area of the region formed by their union (a computation sketch follows these optional features).
Optionally, the at least partial overlap comprises an overlap of greater than 60%.
Optionally, the generating the identification frame includes generating the identification frame based on the key point information of the item and the identification frame parameter.
Optionally, the identification frame parameter includes length information and width information of the identification frame.
Optionally, the key point of the item comprises a centre point of the item.
Optionally, the identifying each item in the group of items to be identified based on the image data includes inputting the image data into a backbone network for processing to identify each item in the group of items.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
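A minimal sketch of the overlap-degree (intersection over union) computation defined above; the box layout and example values are illustrative.

def overlap_degree(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) tuples; the overlap degree is the intersection
    # area divided by the union area (IoU).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two heavily overlapping frames; an overlap degree above 0.6 satisfies the
# "at least partial overlap" criterion stated above.
print(overlap_degree((0, 0, 10, 10), (1, 1, 11, 11)))  # ~0.68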
An image recognition apparatus comprising:
the image data acquisition module is used for acquiring image data containing the object group to be identified;
an image identification module for identifying each item in the group of items to be identified based on the image data;
an identification frame generation module for generating an identification frame on the image data for each item identified to mark the item;
wherein the group of items comprises at least two items; and
the center of each identification frame is positioned within the image of the item marked by that identification frame; and
the plurality of recognition boxes generated on the image data at least partially overlap.
Optionally, the degree of overlap of the identification frames is expressed as an overlap degree, the overlap degree of two identification frames being equal to the area of the region where the two frames intersect divided by the area of the region formed by their union.
Optionally, the at least partial overlap comprises an overlap of greater than 60%.
Optionally, the identification frame generation module is further configured to generate an identification frame based on the key point information of the article and the identification frame parameter.
Optionally, the identification frame parameter includes length information and width information of the identification frame.
Optionally, the key point of the item comprises a centre point of the item.
Optionally, the image recognition module is further configured to input the image data into the backbone network for processing to recognize each article in the article group.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, each data processing procedure including one or more processing stages, and each processing stage including one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
In the description herein, reference to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example" or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processing module-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable gate array (FPGA) or other Programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be understood that portions of the embodiments of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations of the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (20)

1. An image mask generation method, comprising:
acquiring image data containing mask features, wherein the mask features are mask features of an image comprising an article to be identified;
acquiring central point information and identification frame parameters of the object to be identified on the image data;
extracting mask features of the article to be identified from the image data based on the central point information and the identification frame parameters;
and generating an image mask of the article based on the mask features of the article to be identified, which are extracted from the image data.
2. The image mask generating method according to claim 1, wherein: the identification frame parameter includes width information of the identification frame and height information of the identification frame.
3. The image mask generating method according to claim 1, wherein: the center point comprises a true center point.
4. The image mask generation method according to any one of claims 1 to 3, wherein: the image data containing the mask features comprises feature image data which is obtained from a backbone network and fused.
5. The image mask generating method according to claim 4, wherein: the fusion includes channel-level addition operations and/or channel-level superposition operations.
6. The image mask generating method according to claim 4, wherein: the backbone network includes a first data processing process and a second data processing process, each data processing process including one or more processing stages, and each processing stage including one or more processing branches.
7. The image mask generating method according to claim 6, wherein: in the first processing process, a later processing stage includes more processing branches than an earlier processing stage.
8. The image mask generating method according to claim 6, wherein: in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
9. The image mask generating method according to claim 6, wherein: the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
10. An image mask generating apparatus, comprising:
the image data acquisition module is used for acquiring image data containing mask features, wherein the mask features are mask features of an image comprising an object to be identified;
the information acquisition module is used for acquiring the center point information of the article to be identified and the identification frame parameters on the image data;
the mask feature acquisition module is used for extracting the mask features of the to-be-identified object from the image data based on the central point information and the identification frame parameters;
and the mask generating module is used for generating an image mask of the object based on the mask features of the object to be identified, which are extracted from the image data.
11. The image mask generating apparatus according to claim 10, wherein: the identification frame parameter includes width information of the identification frame and height information of the identification frame.
12. The image mask generating apparatus according to claim 10, wherein: the center point comprises a true center point.
13. The image mask generating apparatus according to any one of claims 10 to 12, wherein: the image data containing the mask features comprises feature image data which is obtained from a backbone network and fused.
14. The image mask generating apparatus according to claim 13, wherein: the fusion includes channel-level addition operations and/or channel-level superposition operations.
15. The image mask generating apparatus according to claim 13, wherein: the backbone network includes a first data processing process and a second data processing process, each data processing process including one or more processing stages, and each processing stage including one or more processing branches.
16. The image mask generating apparatus according to claim 15, wherein: in the first processing process, a later processing stage includes more processing branches than an earlier processing stage.
17. The image mask generating apparatus according to claim 15, wherein: in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
18. The image mask generating apparatus according to claim 15, wherein: the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
19. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image mask generation method of any one of claims 1 to 9 when executing the computer program.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the image mask generating method according to any one of claims 1 to 9.
CN202110685248.6A 2021-06-21 2021-06-21 Image mask generation method and device, electronic equipment and storage medium Pending CN113378948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685248.6A CN113378948A (en) 2021-06-21 2021-06-21 Image mask generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113378948A true CN113378948A (en) 2021-09-10

Family

ID=77578079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685248.6A Pending CN113378948A (en) 2021-06-21 2021-06-21 Image mask generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113378948A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647588A (en) * 2018-04-24 2018-10-12 广州绿怡信息科技有限公司 Goods categories recognition methods, device, computer equipment and storage medium
WO2021057848A1 (en) * 2019-09-29 2021-04-01 Oppo广东移动通信有限公司 Network training method, image processing method, network, terminal device and medium
CN112288074A (en) * 2020-08-07 2021-01-29 京东安联财产保险有限公司 Image recognition network generation method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bai Baolin, "Vehicle Recognition and Detection Based on Improved Mask R-CNN", China Master's Theses Full-text Database, Information Science and Technology, No. 8, pp. 2-12 *

Similar Documents

Publication Publication Date Title
JP6902611B2 (en) Object detection methods, neural network training methods, equipment and electronics
CN109816012B (en) Multi-scale target detection method fusing context information
CN112396115B (en) Attention mechanism-based target detection method and device and computer equipment
CN109376681B (en) Multi-person posture estimation method and system
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
JP6435740B2 (en) Data processing system, data processing method, and data processing program
KR20160131848A (en) Recognition apparatus and method
CN110490203B (en) Image segmentation method and device, electronic equipment and computer readable storage medium
CN111985374B (en) Face positioning method and device, electronic equipment and storage medium
CN116645592B (en) Crack detection method based on image processing and storage medium
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN113744142B (en) Image restoration method, electronic device and storage medium
CN113361442B (en) Image recognition method, device, electronic equipment and storage medium
WO2013089261A1 (en) Image processing system, and image processing method
CN111489293A (en) Super-resolution reconstruction method and device for image
CN111914894A (en) Feature extraction method and device, electronic equipment and computer-readable storage medium
CN113378948A (en) Image mask generation method and device, electronic equipment and storage medium
CN113378742A (en) Image recognition method and device, electronic equipment and storage medium
CN113420641A (en) Image data processing method, image data processing device, electronic equipment and storage medium
CN113344094A (en) Image mask generation method and device, electronic equipment and storage medium
CN113420770A (en) Image data processing method, image data processing device, electronic equipment and storage medium
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN114549825A (en) Target detection method and device, electronic equipment and storage medium
CN112528899B (en) Image salient object detection method and system based on implicit depth information recovery
CN114359709A (en) Target detection method and device for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination