WO2022267387A1 - Image recognition method, apparatus, electronic device, and storage medium - Google Patents

Image recognition method, apparatus, electronic device, and storage medium

Info

Publication number
WO2022267387A1
Authority
WO
WIPO (PCT)
Prior art keywords: processing, image data, item, data, recognition
Application number
PCT/CN2021/138580
Other languages: English (en), French (fr)
Inventors: 崔致豪, 王正, 耿嘉, 丁有爽, 邵天兰
Original Assignee: 梅卡曼德(北京)机器人科技有限公司
Application filed by 梅卡曼德(北京)机器人科技有限公司
Publication of WO2022267387A1



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Definitions

  • The present application relates to the technical field of image processing and, more specifically, to an image recognition method, an apparatus, an electronic device, and a storage medium.
  • Image recognition technology has been widely used in commercial fields.
  • Existing image recognition methods either use sliding windows to select and assemble score-map outputs into segmented instance fragments, or directly predict bounding boxes without proposals based on detectors. These methods rely heavily on predefined anchors, and hyperparameter tuning (such as anchor aspect ratio and anchor stride) is crucial for different datasets and box scales. Some image recognition methods instead use key point detection to obtain the four extreme points of the object and generate a mask, or re-segment the instance in a polar-coordinate representation, predicting the object's centroid and then the recognition frame from the distances between the centroid and dense contour points.
  • Traditional methods also include a category-level step of eliminating redundant detection frames to avoid multiple recognition frames appearing on the same item.
  • Current mainstream image recognition algorithms usually use a deep backbone network to process image data.
  • Commonly used backbone networks trade enormous parameter counts for accuracy, which slows model inference and severely limits model performance on low-memory devices.
  • The present invention is proposed in order to overcome, or at least partly solve, the above problems. Firstly, the present invention can execute the processing of generating the recognition frame and the processing of generating the mask in parallel, based on the key point information of the item and the recognition frame parameters; used in an industrial scene, it recognizes accurately, generates no redundant recognition frames, and produces recognition frames and masks at the same time, which makes it highly practical. Secondly, the backbone network proposed by the present invention processes the input image data at multiple resolutions and multiple feature dimensions across the stages of two processing processes, and performs only upsampling in the second process, thereby ensuring the high resolution of the output feature image data; this backbone network improves inference speed while maintaining accuracy.
  • Thirdly, in addition to the main data processing flow, the backbone network of the present invention includes feature transition processing and residual processing, which ensure a smooth feature transition at high-level features, avoid the gradient vanishing of deep networks, and improve the accuracy of the backbone network's inference. Fourthly, the mask generation process of the present invention obtains feature image data of multiple high-level feature dimensions and extracts mask features from each feature dimension through pooling, ensuring the integrity of the generated image mask and preventing incomplete masks.
  • Fifthly, the present invention pools the multi-feature-dimension image data based on the item key point information and recognition frame parameters extracted by the backbone network, so that the mask features belonging to an identified item can be picked out from the complete image and the item's mask extracted accurately. Finally, building on this general-purpose image recognition method, the present invention proposes an image recognition method especially suitable for recognizing multiple obliquely juxtaposed items, which improves the accuracy of multi-item recognition without missed detections.
  • To this end, the present application provides an image recognition method, an apparatus, an electronic device, and a storage medium.
  • The image recognition method includes: acquiring image data containing an item to be identified; processing the image data to identify the item to be identified in the image data, and obtaining key point information and recognition frame parameters of the item to be identified; for the identified item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters;
  • and generating a mask of the identified item based on the key point information and the recognition frame parameters.
  • Generating the recognition frame and generating the mask of the identified item are performed in parallel.
  • The recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
  • The key point includes a center point of the identified item.
  • the processing the image data includes inputting the image data into a backbone network for processing.
  • the backbone network includes a first data processing process and a second data processing process
  • the data processing process includes one or more processing stages
  • the processing stage includes one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
  • In the first data processing process, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second data processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • An image data acquisition module configured to acquire image data containing items to be identified
  • An image data processing module configured to process the image data to identify the item to be identified in the image data, and obtain key point information and identification frame parameters of the item to be identified;
  • a recognition frame generation module configured to, for the identified item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters;
  • a mask generation module configured to, for the identified item, generate a mask of the identified item based on the key point information and the recognition frame parameters.
  • the recognition frame generation module and the mask generation module run in parallel.
  • the identification frame parameters include the width of the identification frame and the height of the identification frame.
  • the key point includes a center point of the identified item.
  • the image data processing module is used to input the image data into the backbone network for processing.
  • the backbone network includes a first data processing process and a second data processing process
  • the data processing process includes one or more processing stages
  • the processing stage includes one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the plurality of processing branches has multiple resolutions and/or multiple feature dimensions.
  • In the first data processing process, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second data processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • The electronic device of the embodiments of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the image recognition method of any of the above embodiments.
  • the computer-readable storage medium in the embodiments of the present application stores a computer program thereon, and when the computer program is executed by a processor, the image recognition method in any of the above-mentioned embodiments is implemented.
  • Fig. 1 is a schematic flowchart of an image recognition method in some embodiments of the present application.
  • Fig. 2 is a schematic flowchart of an image data processing method in some embodiments of the present application.
  • Fig. 3 is a schematic structural diagram of a backbone network in some embodiments of the present application.
  • Fig. 4 is a schematic flowchart of a mask generation method of the mask branch in some embodiments of the present application.
  • Fig. 5 is a schematic flowchart of a mask generation method in some embodiments of the present application.
  • Fig. 6 is a schematic flowchart of an image recognition method for multiple obliquely juxtaposed items in some embodiments of the present application.
  • Fig. 7 is a schematic comparison of image recognition results of some embodiments of the present application with those of the prior art.
  • Fig. 8 is a schematic comparison of another group of image recognition results of some embodiments of the present application with those of the prior art.
  • Fig. 9 is a schematic diagram of the area where two recognition frames intersect and the area formed by their union.
  • Fig. 10 is a schematic structural diagram of an image recognition device in some embodiments of the present application.
  • Fig. 11 is a schematic structural diagram of an image data processing device in some embodiments of the present application.
  • Fig. 12 is a schematic structural diagram of an image data processing device including a feature transition module and a residual connection module in some embodiments of the present application.
  • Fig. 13 is a schematic structural diagram of a mask generation device for the mask branch in some embodiments of the present application.
  • Fig. 14 is a schematic structural diagram of a mask generation device in some embodiments of the present application.
  • Fig. 15 is a schematic structural diagram of an image recognition device for multiple obliquely juxtaposed items in some embodiments of the present application.
  • Fig. 16 is a schematic structural diagram of an electronic device in some embodiments of the present application.
  • Fig. 1 shows a schematic flow chart of an image recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
  • Step S100: acquire image data containing an item to be identified;
  • Step S110: process the image data to identify the item to be identified in the image data, and acquire key point information and recognition frame parameters of the item to be identified;
  • Step S120: for the identified item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters;
  • Step S130: for the identified item, generate a mask of the identified item based on the key point information and the recognition frame parameters.
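  • As a minimal sketch of steps S100-S130 (Python/PyTorch; the function and variable names are illustrative assumptions, not the patent's actual interfaces), the recognition frame and the mask can be decoded independently from the shared key point and frame-parameter outputs, which is what allows steps S120 and S130 to run in parallel:

    import torch

    def decode_frame(center, frame_params):
        # turn a predicted center (cx, cy) and (w, h) into corner coordinates
        cx, cy = center
        w, h = frame_params
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

    def decode_mask(mask_features, frame):
        # crop the fused mask feature map to the frame region and threshold it
        x0, y0, x1, y1 = (int(round(v)) for v in frame)
        region = mask_features[..., y0:y1, x0:x1]
        return region.sigmoid() > 0.5  # binary mask for the item

    # The two decodings share inputs but no intermediate results, so they can
    # be scheduled in parallel (e.g. on two CUDA streams); order is irrelevant.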
  • the object to be recognized in the present invention can be any object placed in any way.
  • The method of this patent gives a particularly noticeable improvement in detecting side-by-side inclined objects in dense scenes.
  • the image data in the present invention can be taken on-site, or can be pre-saved and marked manually.
  • Compared with traditional methods, the image recognition method proposed in this embodiment does not use predefined anchor frames or the complex anchor-related parameters and computations; instead, it obtains the key point information and recognition frame parameters of the item to be recognized and efficiently and accurately generates a recognition frame to mark the item. The method generates a single recognition frame for a single item, produces no redundant recognition frames, and needs no redundant-frame suppression, so it can be applied to all industrial scenarios, including multiple side-by-side inclined items or occluded items, without missed detections. Moreover, the method executes the recognition frame generation operation and the mask generation operation in parallel, making it more practical in industrial scenarios.
  • the key point may be the midpoint of the item
  • the identification frame parameters may include the width and length of the identification frame.
  • the image data is input into the backbone network for processing to identify items in the image data, and obtain key point information and identification frame parameters of the items to be identified.
  • the backbone network is used to process the input data.
  • A backbone network suitable for the task objective can be selected for data processing; for example, some backbone networks are suitable for recognizing graphics, some for recognizing human faces, and some for recognizing text.
  • The focus of this embodiment is performing the item recognition operation and the item mask generation operation in parallel based on key point information and recognition frame parameters; the backbone network is used to identify items in the image data and obtain the key point information and recognition frame parameters of the items to be identified. Any backbone network that realizes these functions can be used in the image recognition method of this embodiment, so this embodiment does not limit the choice of backbone network.
  • Nevertheless, the present invention proposes a novel backbone network for image recognition that can significantly increase the processing speed of the network while maintaining accuracy.
  • This novel backbone network is one of the key points of the present invention, and it can be used in any image recognition method.
  • The image recognition method of the present invention preferably uses this novel backbone network to process the input image data.
  • Fig. 2 shows a schematic flow diagram of image data processing using a novel backbone network according to an embodiment of the present invention. As shown in Figure 2, the method includes:
  • Step S200: receive image data to be processed;
  • Step S210: process the image data to be processed using a first data processing process;
  • Step S220: process the image data processed by the first data processing process using a second data processing process.
  • FIG. 3 schematically shows the structure of the novel backbone network of the present invention.
  • the network consists of two main parts: a first data processing process and a second data processing process.
  • Each data processing process can include one or more processing stages, and each processing stage can include one or more parallel processing branches (in Fig. 3 the processing branches are shown as "blocks"; the "blocks" and "convolution blocks" mentioned below all refer to processing branches).
  • the first data processing process may include three stages, a first stage, a second stage and a third stage.
  • The first stage includes one processing branch, which performs multiple equal-resolution convolutions.
  • Each convolution uses a 3x3 kernel with a stride of 1, and a 1x1 convolution layer is then used for downsampling.
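  • A plausible reading of this first-stage branch in PyTorch, assuming the "1x1 convolution layer for downsampling" is a stride-2 1x1 convolution (channel counts are illustrative):

    import torch.nn as nn

    stage1_branch = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),   # equal-resolution conv
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),  # equal-resolution conv
        nn.ReLU(inplace=True),
    )
    downsample = nn.Conv2d(32, 64, kernel_size=1, stride=2)     # 1x1 downsampling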
  • The second stage includes two processing branches: one receives the feature image data output by the first-stage branch and repeats the first-stage processing, while the other convolves the downsampled feature image data output by the first stage.
  • The third stage includes three processing branches, which both repeat the operations of the previous stage's branches and convolve the downsampled/upsampled feature image data output by the two second-stage branches. In addition, the data input to the third stage has already been fused across multiple resolutions and multiple feature dimensions.
  • the second data processing process may also include three stages: the fourth stage, the fifth stage and the sixth stage.
  • Similar to the third stage, the fourth stage includes three processing branches, which both repeat the operations of the previous stage's branches and convolve the downsampled/upsampled feature image data output by the three third-stage branches.
  • The data input to the fourth stage has likewise been fused across multiple resolutions and multiple feature dimensions.
  • the image data is subjected to convolution processing at each stage, and the more convolution processing is performed, the higher the level of features contained in the output feature image data.
  • the feature image data output from the fourth stage contains quite high-level features.
  • Starting from the fourth stage, our backbone network adds a feature transition module wherever a branch outputs data to a processing branch with more feature dimensions.
  • The output of the current processing branch is adjusted by the feature transition module; that is, an additional feature transformation module is appended after the main convolution block, and this module doubles the preceding feature dimension.
  • Deformable convolution is adopted as the convolutional layer of the feature transition block.
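  • A sketch of such a feature transition block using torchvision's deformable convolution; the offset-predicting convolution is an assumption (the text only states that deformable convolution is used and that the feature dimension is doubled):

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class FeatureTransition(nn.Module):
        def __init__(self, channels, k=3):
            super().__init__()
            # 2 offsets (dx, dy) per kernel sampling position
            self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
            # deformable convolution doubling the feature dimension
            self.dconv = DeformConv2d(channels, 2 * channels, kernel_size=k, padding=k // 2)

        def forward(self, x):
            return self.dconv(x, self.offset(x))  # channels -> 2 * channels

    print(FeatureTransition(64)(torch.randn(1, 64, 56, 56)).shape)  # (1, 128, 56, 56)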
  • The method of the present invention generates recognition frames based on item key point information, so the resolution of the image data cannot be too low, especially for two objects that are close to each other. Therefore, operations that would reduce the resolution of the image data are no longer performed during the second processing process.
  • In the first data processing process, the processing stages include processing that reduces the image resolution, while in the second data processing process, every processing stage includes only processing that increases the image resolution to amplify lower-level features, and no processing that reduces it.
  • Upsampling is preferably used to increase the image resolution and downsampling to reduce it; therefore, after the fourth and fifth stages, the feature image data is only upsampled.
  • The processing branches of the processing stages in the first data processing process of the new backbone network gradually increase, forming a "decreasing triangle" structure; the processing branches of the processing stages in the second data processing process gradually decrease, forming an "ascending triangle" structure, and the new backbone network performs multi-feature-dimension, multi-resolution cross-fusion on the data output by the multiple processing branches.
  • This architecture builds high-to-low and low-to-high convolutions in parallel, repeatedly fusing features of different dimensions while maintaining high resolution.
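  • The triangle structure can be sketched as follows (layer counts, channel widths, and the element-wise-addition fusion are illustrative simplifications of Fig. 3): the first pass adds lower-resolution, wider branches, and the second pass folds them back with upsampling only, so the output stays at the highest resolution:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(cin, cout, stride=1):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                             nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    class TwoPassSketch(nn.Module):
        def __init__(self):
            super().__init__()
            # first pass: each stage adds a half-resolution, wider branch
            self.b32 = conv_block(3, 32)               # full resolution, 32 dims
            self.b64 = conv_block(32, 64, stride=2)    # H/2, 64 dims
            self.b128 = conv_block(64, 128, stride=2)  # H/4, 128 dims
            # 1x1 convolutions align feature dims before upsample-and-fuse
            self.r128 = nn.Conv2d(128, 64, 1)
            self.r64 = nn.Conv2d(64, 32, 1)

        def forward(self, x):
            f32 = self.b32(x)
            f64 = self.b64(f32)
            f128 = self.b128(f64)
            # second pass: upsampling only, so resolution never drops
            f64 = f64 + F.interpolate(self.r128(f128), scale_factor=2)
            f32 = f32 + F.interpolate(self.r64(f64), scale_factor=2)
            return f32

    print(TwoPassSketch()(torch.randn(1, 3, 64, 64)).shape)  # (1, 32, 64, 64)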
  • The following takes input image data with a resolution of 512x512 as an example to explain how the backbone network of the present invention processes image data.
  • In step S210, the 512x512 image data is input into the processing branch of the first processing stage of the first processing process; after processing by this branch, feature image data of 32 feature dimensions at a resolution of 128x128 is obtained.
  • The feature image data output by the first-stage branch is input into the upper processing branch of the second processing stage, and the downsampled feature image data is input into the middle processing branch.
  • the "upper” and “middle” here mean that in Figure 3, the processing branch is at the upper or middle position, and it does not mean that the upper processing branch executes data processing before the middle processing branch. In general, multiple processing branches of each processing stage are executed in parallel, without sequence.
  • The image data processed by the upper processing branch has a resolution of 128x128, the image data processed by the middle processing branch has a resolution of 64x64, and the image data processed by the lower processing branch has a resolution of 32x32.
  • After processing by the upper branch of the second stage, feature image data of 32 feature dimensions is obtained; after processing by the middle branch, feature image data of 64 feature dimensions is obtained.
  • Then, in the first aspect, the feature image data output by the upper processing branch is fused with the middle branch's output after 1x1 convolution and upsampling, and the result is input to the upper branch of the third processing stage; in the second aspect, the upper branch's output after 1x1 convolution and downsampling is fused with the middle branch's output, and the result is input to the middle processing branch of the third stage; in the third aspect, the upper branch's output after 1x1 convolution and downsampling is fused with the middle branch's output after 1x1 convolution and downsampling, and the result is input to the lower branch of the third processing stage.
  • After processing by the third stage, the upper branch yields feature image data of 32 feature dimensions, the middle branch 64 feature dimensions, and the lower branch 128 feature dimensions.
  • Next, in the first aspect, the upper branch's output is fused with the middle branch's output after 1x1 convolution and upsampling and with the lower branch's output after 1x1 convolution and upsampling, and the result is input to the upper branch of the fourth processing stage; in the second aspect, the upper branch's output after 1x1 convolution and downsampling is fused with the middle branch's output and with the lower branch's output after 1x1 convolution and upsampling, and the result is input to the middle processing branch of the fourth stage; in the third aspect, the upper branch's output after 1x1 convolution and downsampling is fused with the middle branch's output after 1x1 convolution and downsampling and with the lower branch's output, and the result is input to the lower branch of the fourth processing stage.
  • After processing by the upper branch of the fourth processing stage, feature image data of 32 feature dimensions is obtained, which the feature transition module then converts into 64 feature dimensions; after processing by the middle branch, feature image data of 64 feature dimensions is obtained, which the feature transition module converts into 128 feature dimensions; after processing by the lower branch, feature image data of 128 feature dimensions is obtained.
  • In the first aspect, the output of the upper branch's feature transition module is fused with the middle branch's output after 1x1 convolution and upsampling and with the lower branch's output after 1x1 convolution and upsampling, and the result is input to the upper branch of the fifth processing stage; in the second aspect, the output of the middle branch's feature transition module is fused with the lower branch's output after 1x1 convolution and upsampling, and the result is input to the middle processing branch of the fifth stage.
  • After processing by the upper branch of the fifth processing stage, feature image data of 64 feature dimensions is obtained, which the feature transition module then converts into 128 feature dimensions; after processing by the middle branch, feature image data of 128 feature dimensions is obtained.
  • The output of the upper branch's feature transition module is fused with the middle branch's output after 1x1 convolution and upsampling, and the result is input to the upper branch of the sixth processing stage. After processing by the upper branch of the sixth stage, feature image data of 256 feature dimensions is obtained.
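  • Each "1x1 convolution plus up/downsampling, then fuse" step above follows one pattern: project the source branch to the target branch's feature dimension with a 1x1 convolution, resample to the target resolution, and combine. A sketch, assuming element-wise addition as the fusion (the ad-hoc convolution is for illustration only; a real network would register it as a module with trained weights):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def align(x, out_channels, out_hw):
        # 1x1 convolution to the target feature dimension, then resample to
        # the target resolution (up- or downsampling as needed)
        x = nn.Conv2d(x.shape[1], out_channels, kernel_size=1)(x)
        return F.interpolate(x, size=out_hw)

    # second-stage outputs from the 512x512 example: upper branch 32 dims at
    # 128x128, middle branch 64 dims at 64x64; fuse both into the third
    # stage's middle branch (64 dims at 64x64)
    upper = torch.randn(1, 32, 128, 128)
    middle = torch.randn(1, 64, 64, 64)
    fused_mid = align(upper, 64, (64, 64)) + middle
    print(fused_mid.shape)  # torch.Size([1, 64, 64, 64])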
  • The novel backbone network of the present invention may include many processing stages; the more processing stages, the "deeper" the network, and the more likely information is to be lost while the image data is processed.
  • To counter this, the novel backbone network of the present invention can also include residual connection modules, through which one or more processing branches in the first data processing process input residual-processed data to one or more processing branches in the second data processing process.
  • One residual connection module connects the processing branch of the first processing stage with the processing branch of the sixth processing stage,
  • and another residual connection module connects the upper processing branch of the second processing stage with the upper processing branch of the fifth processing stage, breaking the information blockage between low-level and high-level features and avoiding loss of information during processing.
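  • A minimal sketch of such a long-range residual connection, assuming a 1x1 projection to match feature dimensions and element-wise addition (the exact residual operation is not spelled out in the text):

    import torch.nn as nn

    class ResidualConnection(nn.Module):
        def __init__(self, cin, cout):
            super().__init__()
            self.proj = nn.Conv2d(cin, cout, kernel_size=1)  # align feature dims

        def forward(self, early_feat, late_feat):
            # early low-level features bypass the intermediate stages and are
            # added into the late stage, limiting information loss
            return late_feat + self.proj(early_feat)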
  • In step S120, based on the recognition results of the backbone network and the data obtained during its processing, the center point information of the item to be identified is extracted, the length and width of the recognition frame used to mark the item are computed, and recognition frames are generated on the image data to mark the recognized items.
  • The image recognition method of the present invention consists of two parts: generating the recognition frame and predicting the item mask. Both parts require feature image data obtained while the backbone network processes the data. Therefore, in addition to the main process of generating recognition frames with the backbone network, the present invention simultaneously executes the processing of the mask branch. In one embodiment, the recognition frame generation operation and the mask generation operation share feature weights.
  • The mask branch makes full use of the feature image data generated while the backbone network processes the image data; fusing feature image data of multiple feature dimensions to extract the item's mask is one of the key points of the present invention.
  • Fig. 4 shows a mask generation method used by the mask generation branch in the image recognition method according to an embodiment of the present invention. As shown in Figure 4, the method includes:
  • Step S300: receive image data to be processed;
  • Step S310: input the image data to be processed into a data processing process including M processing stages, where M is an integer greater than or equal to 2;
  • Step S320: acquire the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
  • Step S330: fuse the data output by the N processing stages;
  • Step S340: perform pooling on the fused data to obtain an image mask.
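  • A sketch of steps S320-S340 (shapes illustrative; the resize-to-common-resolution step is an assumption, while the concatenate-then-add fusion rule follows the embodiment described below):

    import torch
    import torch.nn.functional as F

    def mask_branch(stage_outputs, pool_size=(56, 56)):
        target_hw = stage_outputs[0].shape[-2:]
        resized = [F.interpolate(f, size=target_hw) for f in stage_outputs]
        fused = torch.cat(resized[:2], dim=1)  # channel-level superposition
        fused = fused + resized[2]             # channel-level addition
        return F.adaptive_avg_pool2d(fused, pool_size)

    # e.g. stage 4: 128 dims, stage 5: 128 dims, stage 6: 256 dims
    outs = [torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64),
            torch.randn(1, 256, 32, 32)]
    print(mask_branch(outs).shape)  # torch.Size([1, 256, 56, 56])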
  • the present invention needs to input the image data into the backbone network for processing, and the backbone network can preferably use the novel backbone network of the present invention.
  • the backbone network includes multiple processing stages. For the specific processing process, please refer to the relevant embodiments of the aforementioned backbone network, which will not be repeated here.
  • this embodiment acquires multiple feature image data generated during the data processing of the backbone network.
  • the backbone network shown in FIG. 3 includes 6 data processing stages
  • the mask branch receives the feature image data generated in three stages, specifically the feature image data generated in the 4th, 5th and 6th stages.
  • In step S330, the feature image data fusion performed by the mask branch is shown in the lower part of Fig. 3. The fourth processing stage inputs feature image data of 128 feature dimensions to the mask branch, and the fifth processing stage inputs another 128; a channel-level superposition is used between the fourth and fifth stages, yielding feature image data of 256 feature dimensions.
  • The sixth processing stage inputs feature image data of 256 feature dimensions to the mask branch, and a channel-level addition is used between the fifth and sixth stages, yielding new feature image data of 256 feature dimensions. That is, channel-level superposition is performed between data below the backbone network's maximum feature dimension, and channel-level addition between data at that maximum feature dimension.
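  • In tensor terms, and assuming the stage outputs have already been brought to a common resolution, the rule reads:

    import torch

    f4 = torch.randn(1, 128, 64, 64)  # stage 4, below the max dim of 256
    f5 = torch.randn(1, 128, 64, 64)  # stage 5, below the max dim of 256
    f6 = torch.randn(1, 256, 64, 64)  # stage 6, at the max dim of 256

    fused = torch.cat([f4, f5], dim=1)  # superposition: 128 + 128 -> 256 dims
    fused = fused + f6                  # addition at the max feature dimension
    print(fused.shape)                  # torch.Size([1, 256, 64, 64])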
  • the present invention uses pooling to process the fused feature image data to obtain the mask of the item.
  • There are many ways to obtain an item's mask by pooling.
  • The focus of this embodiment is to extract and fuse data from multiple data processing stages and then pool the resulting feature image data containing mask features to obtain the image mask; it does not depend on a specific pooling method, and any pooling method suitable for generating a mask can be used in this embodiment.
  • Fig. 5 shows a method for generating an image mask according to an embodiment of the present invention. As shown in Figure 5, the method includes:
  • Step S400: acquire image data including mask features, where the mask features include the mask features of the image of the item to be identified;
  • Step S410: acquire the center point information and recognition frame information of the item to be identified on the image data;
  • Step S420: extract the mask features of the item to be identified from the image data based on the center point information and the recognition frame information;
  • Step S430: generate an image mask of the item to be identified based on the mask features extracted from the image data.
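  • A sketch of steps S420-S430 (the mean-over-channels reduction stands in for the pooling of the embodiment and is an assumption; coordinates and shapes are illustrative):

    import torch

    def extract_item_mask(mask_features, center, frame_wh, threshold=0.5):
        cx, cy = center
        w, h = frame_wh
        x0, y0 = int(cx - w / 2), int(cy - h / 2)
        x1, y1 = int(cx + w / 2), int(cy + h / 2)
        region = mask_features[:, :, y0:y1, x0:x1]  # keep only the item's area
        score = region.mean(dim=1, keepdim=True).sigmoid()  # collapse 256 dims
        return score > threshold  # binary mask of the identified item

    feats = torch.randn(1, 256, 128, 128)  # fused 256-dim mask features
    mask = extract_item_mask(feats, center=(64, 64), frame_wh=(40, 24))
    print(mask.shape)  # torch.Size([1, 1, 24, 40])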
  • the image data processed by the backbone network is the image data including the object to generate the mask, other objects and background images.
  • The method of this embodiment extracts the item's mask features from as many high-level feature dimensions as possible; therefore, preferably, as shown in Fig. 3, the present invention extracts feature image data of 256 feature dimensions, which contains the mask features of the item to be recognized.
  • During the backbone network's processing, the center point information of the item to be identified is extracted and the length and width of the recognition frame used to mark the item are computed; this information can be used to generate the recognition frame or to generate the mask of the item to be identified. The backbone network's data processing is not repeated here.
  • the center point may be a ground truth center point, referred to as the GT center point for short.
  • The obtained image data is the complete image data including the item to be identified and contains feature image data of 256 feature dimensions. The center point information and the length and width of the recognition frame are used to locate the item to be identified and to extract its mask, so that various operations can later be performed based on the mask.
  • Since the image recognition method of the present invention is particularly suitable for recognizing multiple obliquely placed or occluded items in industrial scenes, how to use the method in such scenes is also one of the key points of the present invention.
  • Fig. 6 shows a method for identifying and marking image data including multiple items according to an embodiment of the present invention. As shown in Figure 6, the method includes:
  • Step S500: acquire image data including the group of items to be identified;
  • Step S510: identify each item in the group of items to be identified based on the image data;
  • Step S520: for each identified item, generate a recognition frame on the image data to mark the item.
  • Figs. 7 and 8 exemplarily show some image data, in which there are a plurality of obliquely arranged items to be operated, and these items constitute an item group;
  • each item needs to be identified from the image data.
  • In step S510, the present invention identifies all items through the aforementioned image recognition method, which requires inputting the image data into the backbone network for processing; the backbone network's data processing is not described again here.
  • In step S520, in an industrial scene all items in the item group in the image data are operation targets and may need to be grasped or painted, so it is usually necessary to identify every item in the group, without omissions.
  • Figure 7(a) and Figure 8(a) are the results identified using the existing identification method
  • Figure 7(b) and Figure 8(b) are the results identified using the identification method of the present invention.
  • For each identified item, the present invention generates a recognition frame to mark the item according to the item's center point information and the length and width of the recognition frame, so the center of each generated recognition frame is located on the identified item.
  • The degree of overlap of two recognition frames is defined as the area of the region where the two frames intersect divided by the area of the region formed by their union.
  • the black part in Fig. 9(a) is the area where the two recognition frames intersect
  • the black part in Fig. 9(b) is the area formed by the merger of the two recognition frames.
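  • In code, with frames given as (x0, y0, x1, y1) corner tuples, the overlap degree defined above is:

    def overlap_degree(frame_a, frame_b):
        # intersection area (the black region of Fig. 9(a))
        ix0, iy0 = max(frame_a[0], frame_b[0]), max(frame_a[1], frame_b[1])
        ix1, iy1 = min(frame_a[2], frame_b[2]), min(frame_a[3], frame_b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        # union area (the black region of Fig. 9(b))
        area_a = (frame_a[2] - frame_a[0]) * (frame_a[3] - frame_a[1])
        area_b = (frame_b[2] - frame_b[0]) * (frame_b[3] - frame_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    print(overlap_degree((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143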
  • In summary, the present invention can execute the processing of generating the recognition frame and the processing of generating the mask in parallel based on the item's key point information and the recognition frame parameters, generates no redundant recognition frames when used in an industrial scene, and is highly practical;
  • the backbone network proposed by the present invention processes the input image data in multiple stages of two processing processes and performs only upsampling in the second process, thus ensuring the high resolution of the output feature image data;
  • the backbone network of the present invention also includes feature transition processing and residual processing, which ensure a smooth feature transition at high-level features and avoid the gradient vanishing of deep networks;
  • the mask generation process of the present invention obtains feature image data of multiple high-level feature dimensions and extracts mask features from each feature dimension through pooling, so that image masks are generated completely, without omissions;
  • and the present invention pools the multi-feature-dimension image data based on the item key point information and recognition frame parameters extracted by the backbone network, so that the mask of each identified item is extracted accurately.
  • Fig. 10 shows an image recognition device according to yet another embodiment of the present invention, the device includes:
  • An image data acquisition module 600 configured to acquire image data containing items to be identified, that is, to implement step S100;
  • An image data processing module 610 configured to process the image data to identify the item to be identified in the image data, and obtain key point information and identification frame parameters of the item to be identified, that is, to implement step S110;
  • a recognition frame generation module 620 configured to, for the identified item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters, that is, to implement step S120;
  • a mask generation module 630 configured to, for the identified item, generate a mask of the identified item based on the key point information and the recognition frame parameters, that is, to implement step S130.
  • Fig. 11 shows a device for processing image data according to yet another embodiment of the present invention, the device comprising:
  • An image data receiving module 700 configured to receive image data to be processed, that is, to implement step S200;
  • a first data processing module 710 configured to use a first data processing process to process the image data to be processed, that is, to implement step S210;
  • the second data processing module 720 is configured to use the second data processing process to process the image data processed by the first data processing process, that is, to implement step S220;
  • the data processing process includes one or more processing stages.
  • some of the processing stages include processing that increases the resolution of the image data,
  • and some of the processing stages include processing that reduces the resolution of the image data;
  • in the second data processing process, any processing stage includes processing that increases the resolution of the image data and does not include processing that reduces it.
  • Fig. 12 shows a device for processing image data according to yet another embodiment of the present invention, the device comprising:
  • the image data receiving module 800 receives the image data to be processed, that is, to implement step S200;
  • the first data processing module 810 uses a first data processing process to process the image data to be processed, that is, to implement step S210;
  • the second data processing module 820 uses the second data processing process to process the image data processed by the first data processing process, that is, to implement step S220;
  • the data processing process includes one or more processing stages, and each processing stage includes one or more processing branches;
  • the processing device also includes:
  • a residual processing module 830 configured to connect one or more processing branches in the first data processing process with one or more processing branches in the second data processing process through the residual processing process;
  • the feature transition module 840 is configured to perform feature transition processing on the data to be output before one or more processing branches in the second processing process output data to the next processing stage.
  • As described above, the novel backbone network of the present invention can also include residual connection modules, through which one or more processing branches in the first data processing process input residual-processed data to one or more processing branches in the second data processing process.
  • In Fig. 3, the residual connection modules are shown above the network: one residual connection module connects the processing branch of the first processing stage with the processing branch of the sixth processing stage, and another connects the upper processing branch of the second processing stage with the upper processing branch of the fifth processing stage, breaking the information blockage between low-level and high-level features and avoiding loss of information during processing.
  • the residual processing module 830 is used to implement the above method steps.
  • As for the feature transition module 840: since the image data undergoes convolution at every stage, the more convolutions are performed, the higher-level the features contained in the output feature image data.
  • By the time the feature image data enters the second data processing process it has gone through three stages of processing, so the feature image data output by the fourth stage contains quite high-level features. To improve the accuracy of image recognition, the feature dimension should be increased smoothly, so that the processed feature image data loses as little information as possible.
  • Our backbone network therefore adds a feature transition module from the fourth stage onward, wherever a branch outputs data to a processing branch with more feature dimensions.
  • The output of the current processing branch is adjusted by the feature transition module; that is, an additional feature transformation module is appended after the main convolution block, and this module doubles the preceding feature dimension.
  • Deformable convolution is adopted as the convolutional layer of the feature transition block.
  • the feature transition module 840 is used to implement the above method steps.
  • Fig. 13 shows an image mask generation device according to yet another embodiment of the present invention, the device includes:
  • An image data receiving module 900 configured to receive image data to be processed, that is, to implement step S300;
  • the image data processing module 910 is used to input the image data to be processed into a data processing process comprising M processing stages for processing, wherein M is an integer greater than or equal to 2, which is used to implement step S310;
  • a data acquisition module 920 configured to acquire data output by N processing stages in the M processing stages, wherein N is an integer greater than or equal to 2 and less than or equal to M, that is, to implement step S320;
  • a fusion module 930 configured to fuse the data output by the N processing stages, that is, to implement step S330;
  • the mask generation module 940 is configured to perform pooling processing on the fused data to obtain an image mask, that is, to implement step S340.
  • Fig. 14 shows an image mask generation device according to yet another embodiment of the present invention, the device includes:
  • An image data acquisition module 1000 configured to acquire image data including mask features, that is, to implement step S400, wherein the mask features include mask features of an image of an object to be identified;
  • An information acquisition module 1010 configured to acquire center point information and identification frame parameters of the item to be identified on the image data, that is, to implement step S410;
  • a mask feature acquisition module 1020 configured to extract the mask feature of the item to be identified from the image data based on the center point information and the identification frame information, that is, to implement step S420;
  • the mask generation module 1030 is configured to generate an image mask of the item based on the mask features of the item to be identified extracted from the image data, that is, to implement step S430.
  • Fig. 15 shows an image recognition device according to another embodiment of the present invention, the device includes:
  • An image data acquisition module 1100 configured to acquire image data including the item group to be identified, that is, to implement step S500;
  • An image recognition module 1110 configured to identify each item in the group of items to be identified based on the image data, that is, to implement step S510;
  • a recognition frame generating module 1120 configured to generate a recognition frame on the image data to mark the item for each identified item, that is, to implement step S520;
  • the group of items includes at least two items;
  • each said identification box is located in the image of the item marked by the identification box.
  • a plurality of the identification frames generated on the image data at least partially overlap.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method in any one of the above-mentioned implementation modes is implemented.
  • The computer program stored in the computer-readable storage medium of the embodiments of the present application can be executed by the processor of an electronic device.
  • The computer-readable storage medium can be a storage medium built into the electronic device, or a storage medium that can be plugged into and unplugged from the electronic device. Therefore, the computer-readable storage medium in the embodiments of the present application has high flexibility and reliability.
  • Fig. 16 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • The electronic device can be a control system/electronic system configured in a car, a mobile terminal (for example, a smartphone), a personal computer (PC, such as a desktop or notebook computer), a tablet computer, a server, and so on; the specific embodiments of the present invention do not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 1202, a communication interface (Communications Interface) 1204, a memory (memory) 1206, and a communication bus 1208.
  • the processor 1202 , the communication interface 1204 , and the memory 1206 communicate with each other through the communication bus 1208 .
  • the communication interface 1204 is used to communicate with network elements of other devices such as clients or other servers.
  • the processor 1202 is configured to execute the program 1210, specifically, may execute relevant steps in the foregoing method embodiments.
  • the program 1210 may include program codes including computer operation instructions.
  • the processor 1202 may be a central processing unit CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or may be of different types, such as one or more CPUs and one or more ASICs.
  • the memory 1206 is used to store the program 1210 .
  • the memory 1206 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
  • Program 1210 may be downloaded and installed from a network via communication interface 1204, and/or installed from removable media.
  • the processor 1202 may be made to perform various operations in the foregoing method embodiments.
  • An image recognition method, comprising: acquiring image data containing an item to be identified; processing the image data to identify the item to be identified in the image data, and obtaining key point information and recognition frame parameters of the item to be identified;
  • for the identified item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters;
  • and generating a mask of the identified item based on the key point information and the recognition frame parameters.
  • the operation of generating the identification frame and the operation of generating the mask of the identified item are performed in parallel.
  • the identification frame parameters include the width of the identification frame and the height of the identification frame.
  • the key point includes a center point of the identified item.
  • the processing the image data includes inputting the image data into a backbone network for processing.
  • the backbone network includes a first data processing process and a second data processing process
  • the data processing process includes one or more processing stages
  • the processing stages include one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • in the first processing process, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • An image recognition device comprising:
  • An image data acquisition module configured to acquire image data containing items to be identified
  • An image data processing module configured to process the image data to identify the item to be identified in the image data, and obtain key point information and identification frame parameters of the item to be identified;
  • a recognition frame generation module for the identified item, generates a recognition frame on the image data based on the key point information and the recognition frame parameters;
  • the mask generating module generates a mask of the identified item based on the key point information and the identification frame parameters for the identified item.
  • the recognition frame generation module and the mask generation module run in parallel.
  • the identification frame parameters include the width of the identification frame and the height of the identification frame.
  • the key point includes a center point of the identified item.
  • the image data processing module is used to input the image data into the backbone network for processing.
  • the backbone network includes a first data processing process and a second data processing process
  • the data processing process includes one or more processing stages
  • the processing stages include one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • in the first processing process, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • A method for processing image data, comprising: receiving image data to be processed; processing the image data to be processed using a first data processing process; and processing the image data processed by the first data processing process using a second data processing process.
  • the data processing process includes one or more processing stages.
  • some of the processing stages include processing that increases the resolution of the image data,
  • and some of the processing stages include processing that reduces the resolution of the image data;
  • in the second data processing process, any processing stage includes processing that increases the resolution of the image data and does not include processing that reduces it.
  • the processing stage includes one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • the multiple processing branches belong to the same processing stage.
  • in the first processing process, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • the processing of reducing the resolution includes using 1x1 convolution and downsampling to reduce the resolution; and/or the processing of increasing the resolution includes using 1x1 convolution and upsampling to increase the resolution.
  • a device for processing image data comprising:
  • An image data receiving module configured to receive image data to be processed
  • a first data processing module configured to use a first data processing process to process the image data to be processed
  • the second data processing module is used to use the second data processing process to process the image data processed by the first data processing process
  • the data processing process includes one or more processing stages.
  • some of the processing stages include processing that increases the resolution of the image data,
  • and some of the processing stages include processing that reduces the resolution of the image data;
  • in the second data processing process, any processing stage includes processing that increases the resolution of the image data and does not include processing that reduces it.
  • the processing stage includes one or more processing branches.
  • the data output by the multiple processing branches are fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • the multiple processing branches belong to the same processing stage.
  • in the first processing process, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • the processing of reducing the resolution includes reducing the resolution by using 1x1 convolution down-sampling processing; and/or the processing of increasing the resolution includes increasing the resolution by using 1x1 convolution up-sampling processing.
  • A method for processing image data, comprising: receiving image data to be processed; processing the image data to be processed using a first data processing process; and processing the image data processed by the first data processing process using a second data processing process.
  • the data processing process includes one or more processing stages, and each processing stage includes one or more processing branches;
  • the method also includes a residual processing process, through which one or more processing branches in the first data processing process are connected to one or more processing branches in the second data processing process;
  • before one or more processing branches in the second processing process output data to the next processing stage, the data to be output is processed through a feature transition operation.
  • the feature transition operation includes performing deformable convolution processing on the data to be output.
  • the residual processing process includes connecting the processing branch of the first processing stage in the first data processing process with the processing branch of the last processing stage in the second data processing process through the residual processing process.
  • the data output by multiple processing branches are fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • the multiple processing branches belong to the same processing stage.
  • in the first processing process, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing process, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • a device for processing image data, comprising:
  • an image data receiving module that receives image data to be processed;
  • a first data processing module that processes the image data to be processed using a first data processing procedure;
  • a second data processing module that processes, using a second data processing procedure, the image data processed by the first data processing procedure;
  • the data processing procedure includes one or more processing stages, and each processing stage includes one or more processing branches;
  • the processing device further includes:
  • a residual processing module configured to connect one or more processing branches in the first data processing procedure with one or more processing branches in the second data processing procedure through a residual processing procedure;
  • a feature transition module configured to perform feature transition processing on the data to be output before one or more processing branches in the second processing procedure output data to the next processing stage.
  • the feature transition module is further configured to perform deformable convolution processing on the data to be output.
  • the residual processing module is further configured to connect the processing branch of the first processing stage in the first data processing procedure with the processing branch of the last processing stage in the second data processing procedure.
  • the data output by multiple processing branches is fused.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • the multiple processing branches belong to the same processing stage.
  • in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • a method for generating an image mask, comprising:
  • receiving image data to be processed; inputting the image data into a data processing procedure containing M processing stages, where M is an integer greater than or equal to 2; and acquiring the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
  • the data output by the N processing stages is fused, and the fused data is pooled to obtain an image mask.
  • inputting the image data to be processed into a data processing procedure containing M processing stages specifically means inputting the image data to be processed into a backbone network for processing, where the backbone network includes a first processing procedure and a second processing procedure, and the processing procedures include the M processing stages.
  • the N processing stages are processing stages in the second data processing procedure.
  • the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
  • the pooling performed to obtain the image mask includes computing the image mask based on the center point information of the item to be recognized and the recognition frame parameters.
  • the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
  • the center point information includes a Ground Truth center point.
  • the processing stage includes one or more processing branches, and inputting the image data to be processed into the backbone network for processing further includes fusing the data output by the multiple processing branches.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • an image mask generation device, comprising:
  • an image data receiving module configured to receive image data to be processed;
  • an image data processing module configured to input the image data to be processed into a data processing procedure containing M processing stages, where M is an integer greater than or equal to 2;
  • a data acquisition module configured to acquire the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
  • a fusion module configured to fuse the data output by the N processing stages;
  • a mask generation module configured to pool the fused data to obtain an image mask.
  • the image data processing module is specifically configured to input the image data to be processed into a backbone network for processing, where the backbone network includes a first processing procedure and a second processing procedure, and the processing procedures include the M processing stages.
  • the N processing stages are processing stages in the second data processing procedure.
  • the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
  • the mask generation module is further configured to compute an image mask based on the center point information of the item to be recognized and the recognition frame parameters.
  • the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
  • the center point information includes a Ground Truth center point.
  • the processing stage includes one or more processing branches, and inputting the image data to be processed into the backbone network for processing further includes fusing the data output by the multiple processing branches.
  • the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  • a method for generating an image mask, comprising:
  • acquiring image data containing mask features, where the mask features are mask features of an image including an item to be recognized;
  • acquiring center point information and recognition frame parameters of the item to be recognized on the image data, extracting the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information, and generating an image mask of the item based on the mask features extracted from the image data.
  • the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
  • the center point information includes a Ground Truth center point.
  • the image data containing mask features includes feature image data acquired from the backbone network and obtained after fusion.
  • the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
  • the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  • in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • an image mask generation device, comprising:
  • an image data acquisition module configured to acquire image data containing mask features, where the mask features are mask features of an image including an item to be recognized;
  • an information acquisition module configured to acquire center point information and recognition frame parameters of the item to be recognized on the image data;
  • a mask feature acquisition module configured to extract the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information;
  • a mask generation module configured to generate an image mask of the item based on the mask features of the item to be recognized extracted from the image data.
  • the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
  • the center point information includes a Ground Truth center point.
  • the image data containing mask features includes feature image data acquired from the backbone network and obtained after fusion.
  • the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
  • the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  • in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
  • in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  • the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
  • a method for image recognition, comprising:
  • acquiring image data containing a group of items to be recognized, recognizing each item in the group based on the image data, and, for each recognized item, generating a recognition frame on the image data to mark the item;
  • the group of items includes at least two items;
  • the center of each recognition frame is located within the image of the item marked by that recognition frame.
  • multiple recognition frames generated on the image data at least partially overlap.
  • the at least partial overlap includes an overlap degree greater than 60%.
  • generating the recognition frame includes generating the recognition frame based on the key point information of the item and the recognition frame parameters.
  • the recognition frame parameters include length information and width information of the recognition frame.
  • the key point of the item includes the center point of the item.
  • recognizing each item in the group of items to be recognized based on the image data includes inputting the image data into a backbone network for processing to recognize each item in the group.
  • the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  • in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  • an image recognition device, comprising:
  • an image data acquisition module configured to acquire image data containing a group of items to be recognized;
  • an image recognition module configured to recognize each item in the group of items to be recognized based on the image data;
  • a recognition frame generation module configured to, for each recognized item, generate a recognition frame on the image data to mark the item;
  • the group of items includes at least two items;
  • the center of each recognition frame is located within the image of the item marked by that recognition frame.
  • multiple recognition frames generated on the image data at least partially overlap.
  • the at least partial overlap includes an overlap degree greater than 60%.
  • the recognition frame generation module is further configured to generate a recognition frame based on the key point information of the item and the recognition frame parameters.
  • the recognition frame parameters include length information and width information of the recognition frame.
  • the key point of the item includes the center point of the item.
  • the image recognition module is further configured to input the image data into the backbone network for processing to recognize each item in the group.
  • the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  • in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  • a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
  • more specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically by, for example, optically scanning the paper or other medium and then editing, interpreting or otherwise suitably processing it if necessary, and then stored in a computer memory.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like.
  • each part of the embodiments of the present application may be realized by hardware, software, firmware or a combination thereof.
  • multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, implementation may use any one, or a combination, of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and so on.
  • each functional unit in each embodiment of the present application may be integrated into one processing module, each unit may exist separately physically, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules; if an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disc, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An image recognition method and device, an electronic device, and a storage medium. The image recognition method includes: acquiring image data containing an item to be recognized; processing the image data to recognize the item to be recognized in the image data, and acquiring key point information and recognition frame parameters of the item to be recognized; for the recognized item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters; and, for the recognized item, generating a mask of the recognized item based on the key point information and the recognition frame parameters. The processing of generating the recognition frame and the processing of generating the mask can be executed in parallel based on the key point information and the recognition frame parameters of the item. When used in industrial scenes, recognition is accurate, no redundant recognition frames are generated, and the recognition frame and the mask are generated at the same time, which makes the method highly practical.

Description

Image recognition method and device, electronic device, and storage medium
Priority Claim
This application claims priority to the Chinese invention patent application No. CN202110686482.0, filed on June 21, 2021 and entitled "Image recognition method and device, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, and more particularly to an image recognition method and device, an electronic device, and a storage medium.
Background
Image recognition technology is already widely used in commercial applications. Commonly used image recognition methods either generate segmented instance segments by using sliding windows to select and assemble the outputs of score maps, or directly predict proposal-free bounding boxes based on detectors. These methods rely heavily on predefined anchors, and hyperparameter tuning (such as anchor ratio and anchor stride) is critical for different datasets and box scales. Other image recognition methods adopt the idea of key point detection to obtain the four extreme points of an object and generate a mask, or re-segment instances with a polar-coordinate representation, predict the centroid of the target, and then predict the recognition frame based on the distances between the centroid and dense contour points. In addition, after the recognition frames are obtained, traditional methods further include a step of eliminating redundant detection frames at the category level to avoid multiple recognition frames appearing on the same item. Moreover, current mainstream image recognition algorithms usually use a deep backbone network to process image data. Commonly used backbone networks employ huge numbers of parameters for the sake of accuracy, which leads to low inference speed and severely limits deployment on low-memory devices; other backbone networks focus on improving inference speed but sacrifice accuracy.
These traditional methods either require complex parameter tuning or are suitable only for generating recognition frames or only for generating masks. Moreover, the redundant-frame detection techniques used in traditional methods cause serious missed detections in some special industrial scenes, for example, multiple items placed side by side at an angle, or items that are covered or occluded. In industrial scenes, however, such as robotic item picking, missing an item is intolerable, and both the mask and the recognition frame are necessary information for subsequent processing. Existing image recognition technology still needs to overcome many problems when applied to industrial scenes.
Summary of the Invention
In view of the above problems, the present invention is proposed to overcome them or at least partially solve them. Specifically, first, the present invention can execute the processing of generating a recognition frame and the processing of generating a mask in parallel based on the key point information of an item and the recognition frame parameters; when used in industrial scenes, recognition is accurate, no redundant recognition frames are generated, and the recognition frame and the mask are generated at the same time, so the method is highly practical. Second, the backbone network proposed by the present invention can process input image data at multiple resolutions and multiple feature dimensions in the multiple stages of two processing procedures, and performs only upsampling in the second procedure, thereby preserving the high resolution of the output feature image data; the backbone network improves inference speed while maintaining accuracy. Third, in addition to the data processing flow, the backbone network of the present invention also includes feature transition processing and residual processing, which ensure a smooth feature transition at high feature levels, avoid gradient loss in deep networks, and improve the inference accuracy of the backbone network. Fourth, the mask generation process of the present invention acquires feature image data of multiple high-level feature dimensions and extracts mask features from each feature dimension by pooling, which guarantees the completeness of the generated image mask and prevents incomplete masks. Fifth, the present invention pools image data of multiple feature dimensions based on the item key point information and the recognition frame parameters extracted by the backbone network, so that the mask features belonging to the recognized item can be identified from the complete image and the mask of the recognized item can be extracted accurately. Sixth, on the basis of the general image recognition method, the present invention proposes an image recognition method particularly suitable for recognizing multiple tilted, side-by-side items, which improves the accuracy of multi-item recognition without missed detections.
All of the solutions disclosed in the claims and the specification of the present application have one or more of the above innovations and can accordingly solve one or more of the above technical problems. Specifically, the present application provides an image recognition method and device, an electronic device, and a storage medium.
The image recognition method of the embodiments of the present application includes:
acquiring image data containing an item to be recognized;
processing the image data to recognize the item to be recognized in the image data, and acquiring key point information and recognition frame parameters of the item to be recognized;
for the recognized item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters; and
for the recognized item, generating a mask of the recognized item based on the key point information and the recognition frame parameters.
In some embodiments, the operation of generating the recognition frame and the operation of generating the mask of the recognized item are executed in parallel.
In some embodiments, the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
In some embodiments, the key point includes the center point of the item.
In some embodiments, processing the image data includes inputting the image data into a backbone network for processing.
In some embodiments, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
In some embodiments, the data output by the multiple processing branches is fused.
In some embodiments, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
In some embodiments, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
The image recognition device of the embodiments of the present application includes:
an image data acquisition module configured to acquire image data containing an item to be recognized;
an image data processing module configured to process the image data to recognize the item to be recognized in the image data, and to acquire key point information and recognition frame parameters of the item to be recognized;
a recognition frame generation module configured to, for the recognized item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters; and
a mask generation module configured to, for the recognized item, generate a mask of the recognized item based on the key point information and the recognition frame parameters.
In some embodiments, the recognition frame generation module and the mask generation module run in parallel.
In some embodiments, the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
In some embodiments, the key point includes the center point of the item.
In some embodiments, the image data processing module is configured to input the image data into a backbone network for processing.
In some embodiments, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
In some embodiments, the data output by the multiple processing branches is fused.
In some embodiments, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
In some embodiments, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
The electronic device of the embodiments of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the image recognition method of any of the above embodiments is implemented.
The computer-readable storage medium of the embodiments of the present application has a computer program stored thereon; when the computer program is executed by a processor, the image recognition method of any of the above embodiments is implemented.
Additional aspects and advantages of the present application will be given in part in the following description, will in part become apparent from the following description, or will be learned through practice of the present application.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of an image recognition method according to some embodiments of the present application;
Fig. 2 is a schematic flowchart of an image data processing method according to some embodiments of the present application;
Fig. 3 is a schematic structural diagram of a backbone network according to some embodiments of the present application;
Fig. 4 is a schematic flowchart of a mask generation method of a mask branch according to some embodiments of the present application;
Fig. 5 is a schematic flowchart of a mask generation method according to some embodiments of the present application;
Fig. 6 is a schematic flowchart of an image recognition method for multiple tilted, side-by-side items according to some embodiments of the present application;
Fig. 7 is a schematic diagram of image recognition results of some embodiments of the present application and of the prior art;
Fig. 8 is another set of schematic diagrams of image recognition results of some embodiments of the present application and of the prior art;
Fig. 9 is a schematic diagram of the region where recognition frames intersect and the region formed by merging recognition frames;
Fig. 10 is a schematic structural diagram of an image recognition device according to some embodiments of the present application;
Fig. 11 is a schematic structural diagram of an image data processing device according to some embodiments of the present application;
Fig. 12 is a schematic structural diagram of an image data processing device including a feature transition module and a residual connection module according to some embodiments of the present application;
Fig. 13 is a schematic structural diagram of a mask generation device of a mask branch according to some embodiments of the present application;
Fig. 14 is a schematic structural diagram of a mask generation device according to some embodiments of the present application;
Fig. 15 is a schematic structural diagram of an image recognition device for multiple tilted, side-by-side items according to some embodiments of the present application;
Fig. 16 is a schematic structural diagram of an electronic device according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic flowchart of an image recognition method according to one embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S100, acquiring image data containing an item to be recognized;
Step S110, processing the image data to recognize the item to be recognized in the image data, and acquiring key point information and recognition frame parameters of the item to be recognized;
Step S120, for the recognized item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters; and
Step S130, for the recognized item, generating a mask of the recognized item based on the key point information and the recognition frame parameters.
Regarding step S100, the object to be recognized in the present invention can be any object placed in any manner. Compared with other existing methods, the present method achieves a markedly better detection rate for side-by-side tilted objects in dense scenes. The image data in the present invention may be captured on site, or may be pre-saved, manually annotated data.
Compared with traditional methods, the image recognition method proposed in this embodiment does not use predefined anchor boxes or the complex anchor-related parameters and computations. Instead, it efficiently and accurately generates a recognition frame for marking an item by acquiring the key point information and the recognition frame parameters of the item to be recognized. Moreover, the method of this embodiment generates a single recognition frame for a single item, so no redundant recognition frames are produced and redundant-frame detection techniques are unnecessary. It is therefore applicable to all industrial scenes, including scenes with multiple side-by-side tilted items or covered and occluded items, without missed detections. Furthermore, the method executes the recognition frame generation operation and the mask generation operation in parallel, which makes it highly practical in industrial scenes. In a preferred embodiment, the key point may be the center point of the item, and the recognition frame parameters may include the width and the length of the recognition frame.
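As an illustrative aside (not part of the claimed subject matter), the sketch below shows how a recognition frame can be derived directly from a center point plus the frame's width and height; the function name and the corner-coordinate layout are assumptions made for this example only.

```python
def frame_from_center(cx: float, cy: float, w: float, h: float):
    """Derive the corner coordinates of a recognition frame from its
    center point (cx, cy) and its width w and height h."""
    x_min = cx - w / 2.0
    y_min = cy - h / 2.0
    x_max = cx + w / 2.0
    y_max = cy + h / 2.0
    return x_min, y_min, x_max, y_max

# Example: an item whose center is at (120, 80) with a 64x32 frame.
print(frame_from_center(120, 80, 64, 32))  # (88.0, 64.0, 152.0, 96.0)
```

Because the frame follows deterministically from the key point and the frame parameters, no anchor boxes and no redundant-frame suppression step are needed.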
Regarding step S110, the image data is input into a backbone network for processing, so as to recognize the items in the image data and acquire the key point information and the recognition frame parameters of the items to be recognized. A backbone network processes input data; for different task objectives, a backbone network suited to that objective can be selected. For example, some backbone networks are suited to recognizing figures, some to recognizing faces, and others to recognizing text. As described above, the focus of this embodiment is performing the item recognition operation and the item mask generation operation in parallel based on the key point information and the recognition frame parameters; the backbone network is used to recognize the items in the image data and to acquire that information. Any backbone network capable of realizing these functions can be used in the image recognition method of this embodiment, and this embodiment does not limit the choice of backbone network.
Among the backbone networks currently in common use for image recognition, some use huge numbers of parameters to achieve high performance, which severely limits deployment on low-memory devices; conversely, some focus on improving inference speed but sacrifice accuracy. For this reason, we propose a novel backbone network for image recognition that significantly improves processing speed while maintaining accuracy. This novel backbone network is one of the key points of the present invention and can be used in any image recognition method. The image recognition method of the present invention preferably uses this novel backbone network to process the input image data.
Fig. 2 shows a schematic flowchart of image data processing using the novel backbone network according to one embodiment of the present invention. As shown in Fig. 2, the method includes:
Step S200, receiving image data to be processed;
Step S210, processing the image data to be processed using a first data processing procedure;
Step S220, processing the image data processed by the first data processing procedure using a second data processing procedure.
To facilitate explanation, Fig. 3 schematically shows the structure of the novel backbone network of the present invention. As shown in Fig. 3, the network contains two main parts: the first data processing procedure and the second data processing procedure. Depending on the needs of the actual application scene, each data processing procedure may include one or more processing stages, and each processing stage may include one or more parallel processing branches (in Fig. 3, a processing branch is shown as a "block"; the "blocks" and "convolution blocks" mentioned below all refer to processing branches). As an example, the first data processing procedure may include three stages: the first, the second and the third stage. The first stage includes one processing branch, which contains several equal-resolution convolutions with a 3x3 kernel and stride 1, after which a 1x1 convolution layer performs downsampling. The second stage includes two processing branches: one receives the feature image data output by the branch of the first stage and repeats the processing of the first stage, while the other convolves the downsampled feature image data output by the first stage. The third stage includes three processing branches, which both repeat the operations of the branches of the previous stage and convolve the downsampled/upsampled feature image data output by the two branches of the second stage. In addition, the data input to the third stage undergoes multi-resolution and multi-feature-dimension fusion.
After the third stage comes the second data processing procedure, which may also include three stages: the fourth, the fifth and the sixth stage. The fourth stage includes three processing branches; similar to the third stage, these branches both repeat the operations of the branches of the previous stage and convolve the downsampled/upsampled feature image data of the three branches of the third stage. Likewise, the data input to the fourth stage undergoes multi-resolution and multi-feature-dimension fusion. The image data is convolved at every stage, and the more convolutions it passes through, the higher the level of the features contained in the output feature image data. By the time the feature image data enters the second data processing procedure, it has already passed through three stages of processing, so the data output from the fourth stage onward contains rather high-level features. To improve recognition accuracy, we want to increase the feature dimension smoothly so that the processed feature image data loses as little information as possible. Our backbone network therefore adds a feature transition module from the fourth stage onward: before the current processing branch outputs data to a branch with more feature dimensions, the output of the current branch is first adjusted by the feature transition module. In other words, an extra feature transition module is appended after the main convolution block, and it doubles the previous feature dimension. In one embodiment, to enhance feature diversity, deformable convolution is used as the convolution layer of the feature transition block.
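A minimal sketch of such a feature transition block is given below, assuming PyTorch and torchvision's DeformConv2d; the offset-predicting convolution and the exact layer arrangement are illustrative assumptions, not the patented design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureTransition(nn.Module):
    """Sketch of a feature transition block: doubles the number of
    feature dimensions with a deformable 3x3 convolution."""
    def __init__(self, in_channels: int):
        super().__init__()
        # A plain conv predicts the sampling offsets (2 per kernel tap).
        self.offset = nn.Conv2d(in_channels, 2 * 3 * 3, kernel_size=3, padding=1)
        # The deformable conv doubles the feature dimension, e.g. 32 -> 64.
        self.deform = DeformConv2d(in_channels, 2 * in_channels,
                                   kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))

x = torch.randn(1, 32, 128, 128)       # e.g. stage-4 upper-branch features
print(FeatureTransition(32)(x).shape)  # torch.Size([1, 64, 128, 128])
```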
The method of the present invention generates the recognition frame based on the key point information of the item, so the resolution of the image data cannot be too low, especially for two objects close to each other. Therefore, the second processing procedure no longer performs any operation that would lower the resolution of the image data. In other words, in the first data processing procedure, some processing stages include processing that increases the resolution of the image data and some include processing that reduces it; in the second data processing procedure, every processing stage includes only processing that increases the resolution, in order to amplify lower-level features, and includes no processing that reduces it. The present invention preferably uses upsampling to increase the resolution of image data and downsampling to reduce it. Therefore, after the fourth and fifth stages, the feature image data is only upsampled; the fifth and sixth stages include only operations that repeat the branches of the previous stage and operations that convolve the feature image data output by upsampled branches, and include no operation that convolves feature image data output by downsampled branches. Viewed as a whole, the processing branches of each stage of the first data processing procedure gradually increase, forming a "descending triangle" structure, while those of the second data processing procedure gradually decrease, forming an "ascending triangle" structure. The novel backbone network cross-fuses the data output by the multiple processing branches across multiple feature dimensions and resolutions. Such an architecture builds high-to-low and low-to-high convolutions in parallel, maintaining high resolution throughout the whole process while performing multiple fusions with features of different dimensions.
Below, taking an input image with a resolution of 512x512 as an example, we explain how the backbone network of the present invention processes image data.
Regarding step S210, the 512x512 image data is input into the processing branch of the first stage of the first procedure; after processing by this branch, feature image data with 32 feature dimensions and a resolution of 128x128 is obtained. On the one hand, the feature image data output by this first branch is input into the upper branch of the second stage; on the other hand, it is downsampled by a 1x1 convolution and output to the middle branch of the second stage. Note that "upper" and "middle" here refer to the positions of the branches in Fig. 3 (nearer the top or the middle); they do not mean that the upper branch processes data before the middle branch. In fact, the multiple branches of each stage execute in parallel, with no ordering between them. Throughout the backbone network, image data processed by the upper branches has a resolution of 128x128, that processed by the middle branches has a resolution of 64x64, and that processed by the lower branches has a resolution of 32x32.
After processing by the upper branch of the second stage, feature image data with 32 feature dimensions is obtained, and after the middle branch, feature image data with 64 feature dimensions. First, the output of the upper branch is fused with the output of the middle branch after 1x1-convolution upsampling and input into the upper branch of the third stage; second, the output of the upper branch after 1x1-convolution downsampling is fused with the output of the middle branch and input into the middle branch of the third stage; third, the output of the upper branch after 1x1-convolution downsampling is fused with the output of the middle branch after 1x1-convolution downsampling and input into the lower branch of the third stage.
After processing by the upper branch of the third stage, feature image data with 32 feature dimensions is obtained; after the middle branch, 64 feature dimensions; and after the lower branch, 128 feature dimensions. First, the output of the upper branch is fused with the upsampled (1x1 convolution) outputs of the middle and lower branches and input into the upper branch of the fourth stage; second, the downsampled (1x1 convolution) output of the upper branch is fused with the output of the middle branch and the upsampled (1x1 convolution) output of the lower branch and input into the middle branch of the fourth stage; third, the downsampled (1x1 convolution) output of the upper branch is fused with the downsampled (1x1 convolution) output of the middle branch and the output of the lower branch and input into the lower branch of the fourth stage.
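The wiring just described repeats one pattern: project each source branch with a 1x1 convolution, resample it to the target branch's resolution, and fuse. A minimal sketch under assumed choices (bilinear resampling, element-wise addition as the fusion) follows; it illustrates the idea rather than reproducing the exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_and_fuse(target, sources, convs):
    """Fuse feature maps of different resolutions / feature dimensions:
    each source is passed through its 1x1 convolution, resized to the
    target's spatial size, and fused into the target by element-wise sum."""
    out = target
    for src, conv in zip(sources, convs):
        resized = F.interpolate(conv(src), size=target.shape[-2:],
                                mode="bilinear", align_corners=False)
        out = out + resized
    return out

# Stage-3 outputs: upper 32x128x128, middle 64x64x64, lower 128x32x32.
up  = torch.randn(1, 32, 128, 128)
mid = torch.randn(1, 64, 64, 64)
low = torch.randn(1, 128, 32, 32)
# Input to the upper branch of stage 4: upsample the middle and lower maps
# to 128x128 and project them to 32 feature dimensions with 1x1 convolutions.
convs = [nn.Conv2d(64, 32, 1), nn.Conv2d(128, 32, 1)]
print(match_and_fuse(up, [mid, low], convs).shape)  # torch.Size([1, 32, 128, 128])
```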
Regarding step S220, after processing by the upper branch of the fourth stage, feature image data with 32 feature dimensions is obtained, which becomes 64 feature dimensions after the feature transition module; the middle branch yields 64 feature dimensions, which becomes 128 after the feature transition module; and the lower branch yields 128 feature dimensions. First, the output of the feature transition module after the upper branch is fused with the upsampled (1x1 convolution) outputs of the middle and lower branches and input into the upper branch of the fifth stage; second, the output of the feature transition module after the middle branch is fused with the upsampled (1x1 convolution) output of the lower branch and input into the middle branch of the fifth stage.
After processing by the upper branch of the fifth stage, feature image data with 64 feature dimensions is obtained, which becomes 128 feature dimensions after the feature transition module; the middle branch yields 128 feature dimensions. The output of the feature transition module after the upper branch is fused with the upsampled (1x1 convolution) output of the middle branch and input into the upper branch of the sixth stage. After processing by the upper branch of the sixth stage, feature image data with 256 feature dimensions is obtained.
Since the backbone network of the present invention may include many processing stages, the more stages there are, the "deeper" the network is and the more likely it is to lose information during image data processing. In one embodiment, to prevent gradient loss in the deep network and strengthen head and tail features, the novel backbone network may further include residual connection modules, through which one or more processing branches in the first data processing procedure feed residual-processed data into one or more processing branches in the second data processing procedure. As shown in Fig. 3, as an example, two residual connection modules are drawn above the network: one connects the processing branch of the first stage with the processing branch of the sixth stage, and the other connects the upper branch of the second stage with the upper branch of the fifth stage. This breaks through the information blockage between low-level and high-level features and avoids information loss during processing.
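A minimal sketch of one such cross-procedure residual connection is shown below; the 1x1 projection used to match feature dimensions is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class ResidualBridge(nn.Module):
    """Sketch of a residual connection from an early branch (first
    procedure) to a late branch (second procedure): project the early
    features to the late branch's feature dimension and add them."""
    def __init__(self, early_channels: int, late_channels: int):
        super().__init__()
        self.project = nn.Conv2d(early_channels, late_channels, kernel_size=1)

    def forward(self, early: torch.Tensor, late: torch.Tensor) -> torch.Tensor:
        return late + self.project(early)

stage1 = torch.randn(1, 32, 128, 128)   # first-stage branch output
stage6 = torch.randn(1, 256, 128, 128)  # sixth-stage branch output
print(ResidualBridge(32, 256)(stage1, stage6).shape)  # [1, 256, 128, 128]
```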
Regarding step S120, based on the recognition results of the backbone network and the data obtained during its processing, the center point information of the item to be recognized is extracted, the length and width information of the recognition frame used to mark the item is computed, and a recognition frame is generated on the image data to mark the recognized item.
Regarding step S130, the image recognition method of the present invention consists of two parts: generating the recognition frame and predicting the item mask. Both parts use the feature image data obtained while the backbone network processes the data. Therefore, in addition to the main processing of generating the recognition frame with the backbone network, the present invention also runs the mask branch synchronously; in one embodiment, the recognition frame generation operation and the mask generation operation share feature weights. The mask branch makes full use of the feature image data produced while the backbone network processes the image data and fuses feature image data of multiple feature dimensions to extract the item mask, which is one of the key points of the present invention.
Fig. 4 shows the mask generation method used by the mask branch in the image recognition method according to one embodiment of the present invention. As shown in Fig. 4, the method includes:
Step S300, receiving image data to be processed;
Step S310, inputting the image data to be processed into a data processing procedure containing M processing stages, where M is an integer greater than or equal to 2;
Step S320, acquiring the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
Step S330, fusing the data output by the N processing stages;
Step S340, pooling the fused data to obtain an image mask.
Regarding step S310, as described above, the present invention inputs the image data into a backbone network for processing; this may preferably be the novel backbone network of the present invention. The backbone network includes multiple processing stages; for the specific processing, see the related backbone network embodiments above, which are not repeated here.
Regarding step S320, in order to extract the item mask accurately and with high quality, this embodiment acquires multiple feature image data generated during backbone network processing. For example, the backbone network shown in Fig. 3 includes six data processing stages, and the mask branch receives the feature image data produced by three of them, specifically the fourth, fifth and sixth stages.
Regarding step S330, an example of feature image data fusion in the mask branch is shown at the bottom of Fig. 3. As can be seen, the fourth stage inputs feature image data with 128 feature dimensions into the mask branch, and the fifth stage likewise inputs feature image data with 128 feature dimensions; a channel-level concatenation is used between the fourth and fifth stages, yielding feature image data with 256 feature dimensions. The sixth stage inputs feature image data with 256 feature dimensions into the mask branch, and a channel-level addition is used between the fifth and sixth stages, yielding new feature image data with 256 feature dimensions. That is, data below the maximum feature dimension of the backbone network is concatenated at the channel level, while data equal to the maximum feature dimension is added at the channel level.
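Read literally, the fusion rule is: concatenate channels while below the backbone's maximum feature dimension (128 + 128 gives 256), then add element-wise once that maximum is reached (256 + 256 stays 256). A sketch, assuming the three stage outputs share one spatial size:

```python
import torch

MAX_DIMS = 256  # maximum feature dimension of the backbone

def fuse_mask_features(f4, f5, f6):
    """Fuse the stage-4/5/6 feature maps for the mask branch:
    channel-level concatenation below MAX_DIMS, then channel-level
    addition once MAX_DIMS is reached."""
    fused = torch.cat([f4, f5], dim=1)  # 128 + 128 -> 256 feature dims
    assert fused.shape[1] == MAX_DIMS == f6.shape[1]
    return fused + f6                   # 256 + 256 -> 256 feature dims

f4 = torch.randn(1, 128, 128, 128)
f5 = torch.randn(1, 128, 128, 128)
f6 = torch.randn(1, 256, 128, 128)
print(fuse_mask_features(f4, f5, f6).shape)  # torch.Size([1, 256, 128, 128])
```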
Regarding step S340, the present invention processes the fused feature image data by pooling to obtain the item mask. Various pooling-based methods of obtaining item masks exist in the prior art. The focus of this embodiment is that, after data is extracted from multiple processing stages and fused, the new feature image data containing mask features is pooled to obtain the image mask; the focus is not any specific pooling method, and any suitable mask-producing pooling method can be used in this embodiment.
Although any pooling method can be used, in order to improve the accuracy of mask generation, the present invention has developed a center-point-based mask generation method that is particularly suitable for center-point-based image recognition; this is also one of the key points of the present invention. Fig. 5 shows an image mask generation method according to one embodiment of the present invention. As shown in Fig. 5, the method includes:
Step S400, acquiring image data containing mask features, where the mask features are mask features of an image including an item to be recognized;
Step S410, acquiring center point information and recognition frame information of the item to be recognized on the image data;
Step S420, extracting the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information;
Step S430, generating an image mask of the item to be recognized based on the mask features of the item to be recognized extracted from the image data.
Regarding step S400, the image data processed by the backbone network contains the item whose mask is to be generated as well as other items, background, and so on. The method of this embodiment extracts the item's mask features from as many high-level feature dimensions as possible; therefore, preferably, as shown in Fig. 3, the present invention extracts image data with 256 feature dimensions from multiple processing stages of the second data processing procedure, and this feature image data contains the mask features of the item to be recognized.
Regarding step S410, based on the data obtained during backbone network processing, the center point information of the item to be recognized is extracted and the length and width information of the recognition frame used to mark the item is computed. This information can be used both to generate the recognition frame and to generate the mask of the item; the backbone network's data processing is not repeated here. In a preferred embodiment, the center point may be the Ground Truth center point, abbreviated as the GT center point.
Regarding step S420, as described above, the obtained image data is complete image data including the item to be recognized, with 256 feature dimensions. In this step, on the one hand, the position of the item must be found according to the center point information and the length and width information of the recognition frame; on the other hand, the item mask features must be found in the feature image data. The next step then generates the mask of the item from the acquired mask features, making subsequent mask-based operations convenient.
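One plausible reading of this step, sketched below under stated assumptions (channel-mean pooling with a threshold; an axis-aligned frame), crops the fused feature map to the frame around the GT center point and pools the 256 feature dimensions there into a binary item mask; it is an illustration, not the patented pooling.

```python
import torch

def extract_item_mask(features, cx, cy, w, h, thresh=0.5):
    """Pool the mask features inside the recognition frame centered at
    (cx, cy), with width w and height h, into a binary item mask.
    `features` is a (C, H, W) fused feature map."""
    _, H, W = features.shape
    x0, x1 = max(0, int(cx - w / 2)), min(W, int(cx + w / 2))
    y0, y1 = max(0, int(cy - h / 2)), min(H, int(cy + h / 2))
    mask = torch.zeros(H, W)
    region = features[:, y0:y1, x0:x1]
    # Pool across the 256 feature dimensions, then binarize.
    mask[y0:y1, x0:x1] = (region.sigmoid().mean(dim=0) > thresh).float()
    return mask

feats = torch.randn(256, 128, 128)
print(extract_item_mask(feats, cx=64, cy=64, w=40, h=24).shape)  # (128, 128)
```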
The image recognition method of the present invention is particularly suitable for industrial scenes such as recognizing multiple tilted or occluded items; how to use the method in such scenes is also one of the key points of the present invention.
Fig. 6 shows a method for recognizing and marking image data that includes multiple items according to one embodiment of the present invention. As shown in Fig. 6, the method includes:
Step S500, acquiring image data containing a group of items to be recognized;
Step S510, recognizing each item in the group of items to be recognized based on the image data;
Step S520, for each recognized item, generating a recognition frame on the image data to mark the item.
Regarding step S500, Figs. 7 and 8 show example image data that includes multiple tilted, side-by-side items to be operated on; these items constitute an item group.
Regarding step S510, each item must be recognized from the image data. The present invention recognizes all items using the aforementioned image recognition method, which requires inputting the image data into the backbone network for processing; the backbone network's data processing method is not repeated here.
Regarding step S520, in industrial scenes, all of the items in the item group in the image data are operation targets: they may need to be grasped or painted, so all items in the group usually need to be recognized without omission. Figs. 7(a) and 8(a) show results recognized by existing recognition methods, and Figs. 7(b) and 8(b) show results recognized by the method of the present invention. For each recognized item, the present invention generates a recognition frame to mark the item according to the item's center point information and the length and width information of the recognition frame, so the center of each generated frame lies on the recognized item. Furthermore, in the prior art, when the overlap between two recognition frames exceeds a certain threshold, one of the frames is usually judged to be redundant and deleted from the output image, leading to the results shown in Figs. 7(a) and 8(a), that is, many missed detections. The recognition method of the present invention instead produces the results shown in Figs. 7(b) and 8(b). In other words, the method can tolerate partial or even full overlap of the recognition frames; even when the overlap degree reaches 60% or more, the present invention does not delete the recognition frames and correctly recognizes every item without missed detections. Here, the overlap degree of two recognition frames = the area of the region where the two frames intersect / the area of the region formed by merging the two frames. As shown in Fig. 9, the black part in Fig. 9(a) is the area of the intersection region, and the black part in Fig. 9(b) is the area of the merged region.
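The overlap degree defined here is the standard intersection over union (IoU). A direct implementation of the formula for axis-aligned frames follows; frames drawn around tilted items would need polygon intersection instead.

```python
def overlap_degree(a, b):
    """Overlap degree of two recognition frames given as
    (x_min, y_min, x_max, y_max): intersection area / merged (union) area."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two heavily overlapping frames; the method keeps both even above 60%.
print(overlap_degree((0, 0, 10, 10), (1, 1, 11, 11)))  # ~0.68
```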
In addition, it should be noted that although each embodiment of the present invention has a specific combination of features, further combinations and cross-combinations of these features between embodiments are also feasible.
According to the above embodiments: first, the present invention can execute the processing of generating a recognition frame and the processing of generating a mask in parallel based on the item's key point information and the recognition frame parameters; when used in industrial scenes, no redundant recognition frames are generated, and the method is highly practical. Second, the backbone network proposed by the present invention can process input image data in multiple stages of two processing procedures and performs only upsampling in the second procedure, thereby guaranteeing the high resolution of the output feature image data. Third, besides the data processing flow, the backbone network also includes feature transition processing and residual processing, which ensure a smooth feature transition at high feature levels and avoid gradient loss in deep networks. Fourth, the mask generation process acquires feature image data of multiple high-level feature dimensions and extracts mask features from each feature dimension by pooling, so the image mask is generated completely, without omission. Fifth, the present invention pools image data of multiple feature dimensions based on the item key point information and the recognition frame parameters extracted by the backbone network, so the mask of the recognized item can be extracted accurately from the complete image. Sixth, based on the general image recognition method, the present invention proposes a method dedicated to recognizing multiple tilted, side-by-side items, which improves recognition accuracy without missed detections.
Fig. 10 shows an image recognition device according to a further embodiment of the present invention. The device includes:
an image data acquisition module 600, configured to acquire image data containing an item to be recognized, i.e., to implement step S100;
an image data processing module 610, configured to process the image data to recognize the item to be recognized in the image data and acquire key point information and recognition frame parameters of the item, i.e., to implement step S110;
a recognition frame generation module 620, configured to, for the recognized item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters, i.e., to implement step S120; and
a mask generation module 630, configured to, for the recognized item, generate a mask of the recognized item based on the key point information and the recognition frame parameters, i.e., to implement step S130. Fig. 11 shows an image data processing device according to a further embodiment of the present invention. The device includes:
an image data receiving module 700, configured to receive image data to be processed, i.e., to implement step S200;
a first data processing module 710, configured to process the image data to be processed using the first data processing procedure, i.e., to implement step S210;
a second data processing module 720, configured to process, using the second data processing procedure, the image data processed by the first data processing procedure, i.e., to implement step S220;
wherein the data processing procedure includes one or more processing stages; in the first data processing procedure, some processing stages include processing that increases the resolution of the image data and some include processing that reduces it; in the second data processing procedure, every processing stage includes processing that increases the resolution of the image data and includes no processing that reduces it.
Fig. 12 shows an image data processing device according to a further embodiment of the present invention. The device includes:
an image data receiving module 800, which receives image data to be processed, i.e., implements step S200;
a first data processing module 810, which processes the image data to be processed using the first data processing procedure, i.e., implements step S210;
a second data processing module 820, which processes, using the second data processing procedure, the image data processed by the first data processing procedure, i.e., implements step S220;
wherein the data processing procedure includes one or more processing stages, and each processing stage includes one or more processing branches;
the processing device further includes:
a residual processing module 830, configured to connect one or more processing branches in the first data processing procedure with one or more processing branches in the second data processing procedure through a residual processing procedure; and
a feature transition module 840, configured to perform feature transition processing on the data to be output before one or more processing branches in the second processing procedure output data to the next processing stage.
Regarding the residual processing module 830: since the backbone network of the present invention may include many processing stages, the more stages there are, the "deeper" the network is and the more likely it is to lose information during image data processing. In one embodiment, to prevent gradient loss in the deep network and strengthen head and tail features, the novel backbone network may further include residual connection modules, through which one or more processing branches in the first data processing procedure feed residual-processed data into one or more processing branches in the second data processing procedure. As shown in Fig. 3, as an example, two residual connection modules are drawn above the network: one connects the processing branch of the first stage with the processing branch of the sixth stage, and the other connects the upper branch of the second stage with the upper branch of the fifth stage, thereby breaking through the information blockage between low-level and high-level features and avoiding information loss during processing. The residual processing module 830 implements these method steps.
Regarding the feature transition module 840: since the image data is convolved at every stage, the more convolutions it passes through, the higher the level of the features contained in the output feature image data. By the time the feature image data enters the second data processing procedure, it has already passed through three stages of processing, so the data output from the fourth stage onward contains rather high-level features. To improve recognition accuracy, we want to increase the feature dimension smoothly so that the processed feature image data loses as little information as possible. The backbone network therefore adds a feature transition module from the fourth stage onward: before the current processing branch outputs data to a branch with more feature dimensions, the output of the current branch is first adjusted by the feature transition module. In other words, an extra feature transition module is appended after the main convolution block, doubling the previous feature dimension. In one embodiment, to enhance feature diversity, deformable convolution is used as the convolution layer of the feature transition block. The feature transition module 840 implements these method steps.
Fig. 13 shows an image mask generation device according to a further embodiment of the present invention. The device includes:
an image data receiving module 900, configured to receive image data to be processed, i.e., to implement step S300;
an image data processing module 910, configured to input the image data to be processed into a data processing procedure containing M processing stages, where M is an integer greater than or equal to 2, i.e., to implement step S310;
a data acquisition module 920, configured to acquire the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M, i.e., to implement step S320;
a fusion module 930, configured to fuse the data output by the N processing stages, i.e., to implement step S330; and
a mask generation module 940, configured to pool the fused data to obtain an image mask, i.e., to implement step S340. Fig. 14 shows an image mask generation device according to a further embodiment of the present invention. The device includes:
an image data acquisition module 1000, configured to acquire image data containing mask features, i.e., to implement step S400, where the mask features are mask features of an image including an item to be recognized;
an information acquisition module 1010, configured to acquire center point information and recognition frame parameters of the item to be recognized on the image data, i.e., to implement step S410;
a mask feature acquisition module 1020, configured to extract the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information, i.e., to implement step S420; and
a mask generation module 1030, configured to generate an image mask of the item based on the mask features of the item to be recognized extracted from the image data, i.e., to implement step S430.
Fig. 15 shows an image recognition device according to a further embodiment of the present invention. The device includes:
an image data acquisition module 1100, configured to acquire image data containing a group of items to be recognized, i.e., to implement step S500;
an image recognition module 1110, configured to recognize each item in the group of items to be recognized based on the image data, i.e., to implement step S510; and
a recognition frame generation module 1120, configured to, for each recognized item, generate a recognition frame on the image data to mark the item, i.e., to implement step S520;
wherein the item group includes at least two items; and
the center of each recognition frame is located within the image of the item marked by that recognition frame; and
multiple recognition frames generated on the image data at least partially overlap.
In the device embodiments shown in Figs. 10-15 above, only the main functions of the modules are described; the full functions of each module correspond to the respective steps of the method embodiments, and the working principle of each module can likewise be found in the descriptions of the corresponding method steps, which are not repeated here. In addition, although the above embodiments define a correspondence between module functions and method steps, those skilled in the art will understand that the functions of the functional modules are not limited to that correspondence: a particular functional module can also implement other method steps or parts of method steps. For example, the above embodiment describes the mask generation module 1030 as implementing the method of step S430, but according to actual needs, the mask generation module 1030 may also be used to implement the method, or part of the method, of step S400, S410 or S420.
The present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method of any of the above embodiments is implemented. It should be noted that the computer program stored in the computer-readable storage medium of the embodiments of the present application may be executed by the processor of an electronic device; furthermore, the computer-readable storage medium may be a storage medium built into the electronic device or a storage medium that can be plugged into the electronic device. Therefore, the computer-readable storage medium of the embodiments of the present application has high flexibility and reliability.
Fig. 16 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may be a control system/electronic system configured in a vehicle, a mobile terminal (for example, a smart mobile phone), a personal computer (PC, for example, a desktop or notebook computer), a tablet computer, a server, and the like; the specific embodiments of the present invention do not limit the specific implementation of the electronic device.
As shown in Fig. 16, the electronic device may include: a processor 1202, a communications interface 1204, a memory 1206, and a communication bus 1208.
Wherein:
the processor 1202, the communications interface 1204 and the memory 1206 communicate with one another through the communication bus 1208.
The communications interface 1204 is used to communicate with network elements of other devices, such as clients or other servers.
The processor 1202 is used to execute the program 1210 and may specifically perform the relevant steps of the above method embodiments.
Specifically, the program 1210 may include program code, and the program code includes computer operation instructions.
The processor 1202 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 1206 is used to store the program 1210. The memory 1206 may contain high-speed RAM memory and may also include non-volatile memory, for example at least one disk memory.
The program 1210 may be downloaded and installed from a network through the communications interface 1204, and/or installed from a removable medium. When executed by the processor 1202, the program may cause the processor 1202 to perform the operations of the above method embodiments.
In summary, the content of the present invention includes: an image recognition method, comprising:
acquiring image data containing an item to be recognized;
processing the image data to recognize the item to be recognized in the image data, and acquiring key point information and recognition frame parameters of the item to be recognized;
for the recognized item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters; and
for the recognized item, generating a mask of the recognized item based on the key point information and the recognition frame parameters.
Optionally, the operation of generating the recognition frame and the operation of generating the mask of the recognized item are executed in parallel.
Optionally, the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
Optionally, the key point includes the center point of the item.
Optionally, processing the image data includes inputting the image data into a backbone network for processing.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, the data output by the multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
An image recognition device, comprising:
an image data acquisition module configured to acquire image data containing an item to be recognized;
an image data processing module configured to process the image data to recognize the item to be recognized in the image data, and to acquire key point information and recognition frame parameters of the item to be recognized;
a recognition frame generation module configured to, for the recognized item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters; and
a mask generation module configured to, for the recognized item, generate a mask of the recognized item based on the key point information and the recognition frame parameters.
Optionally, the recognition frame generation module and the mask generation module run in parallel.
Optionally, the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
Optionally, the key point includes the center point of the item.
Optionally, the image data processing module is configured to input the image data into a backbone network for processing.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, the data output by the multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
A method for processing image data, comprising:
receiving image data to be processed;
processing the image data to be processed using a first data processing procedure;
processing the image data processed by the first data processing procedure using a second data processing procedure;
wherein the data processing procedure includes one or more processing stages; in the first data processing procedure, some processing stages include processing that increases the resolution of the image data and some include processing that reduces it; in the second data processing procedure, any processing stage includes processing that increases the resolution of the image data and does not include processing that reduces it.
Optionally, the processing stage includes one or more processing branches.
Optionally, the data output by the multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the multiple processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
Optionally, the processing of reducing the resolution includes reducing the resolution by 1x1 convolution downsampling; and/or the processing of increasing the resolution includes increasing the resolution by 1x1 convolution upsampling.
A device for processing image data, comprising:
an image data receiving module configured to receive image data to be processed;
a first data processing module configured to process the image data to be processed using a first data processing procedure;
a second data processing module configured to process, using a second data processing procedure, the image data processed by the first data processing procedure;
wherein the data processing procedure includes one or more processing stages; in the first data processing procedure, some processing stages include processing that increases the resolution of the image data and some include processing that reduces it; in the second data processing procedure, any processing stage includes processing that increases the resolution of the image data and does not include processing that reduces it.
Optionally, the processing stage includes one or more processing branches.
Optionally, the data output by the multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the multiple processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
Optionally, the processing of reducing the resolution includes reducing the resolution by 1x1 convolution downsampling; and/or the processing of increasing the resolution includes increasing the resolution by 1x1 convolution upsampling.
A method for processing image data, comprising:
receiving image data to be processed;
processing the image data to be processed using a first data processing procedure;
processing the image data processed by the first data processing procedure using a second data processing procedure;
wherein the data processing procedure includes one or more processing stages, and each processing stage includes one or more processing branches;
the method further includes a residual processing procedure, through which one or more processing branches in the first data processing procedure are connected to one or more processing branches in the second data processing procedure;
before one or more processing branches in the second processing procedure output data to the next processing stage, the data to be output is processed through a feature transition operation.
Optionally, the feature transition operation includes performing deformable convolution processing on the data to be output.
Optionally, the residual processing procedure includes connecting the processing branch of the first processing stage in the first data processing procedure with the processing branch of the last processing stage in the second data processing procedure.
Optionally, the data output by multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the multiple processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
A device for processing image data, comprising:
an image data receiving module that receives image data to be processed;
a first data processing module that processes the image data to be processed using a first data processing procedure;
a second data processing module that processes, using a second data processing procedure, the image data processed by the first data processing procedure;
wherein the data processing procedure includes one or more processing stages, and each processing stage includes one or more processing branches;
the processing device further includes:
a residual processing module configured to connect one or more processing branches in the first data processing procedure with one or more processing branches in the second data processing procedure through a residual processing procedure; and
a feature transition module configured to perform feature transition processing on the data to be output before one or more processing branches in the second processing procedure output data to the next processing stage.
Optionally, the feature transition module is further configured to perform deformable convolution processing on the data to be output.
Optionally, the residual processing module is further configured to connect the processing branch of the first processing stage in the first data processing procedure with the processing branch of the last processing stage in the second data processing procedure.
Optionally, the data output by multiple processing branches is fused.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
Optionally, the multiple processing branches belong to the same processing stage.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
An image mask generation method, comprising:
receiving image data to be processed;
inputting the image data to be processed into a data processing procedure containing M processing stages for processing, where M is an integer greater than or equal to 2;
acquiring the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
fusing the data output by the N processing stages;
pooling the fused data to obtain an image mask.
Optionally, inputting the image data to be processed into a data processing procedure containing M processing stages specifically means inputting the image data to be processed into a backbone network for processing, the backbone network including a first processing procedure and a second processing procedure, the processing procedures including the M processing stages.
Optionally, the N processing stages are processing stages in the second data processing procedure.
Optionally, the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
Optionally, performing the pooling to obtain the image mask includes computing the image mask based on the center point information of the item to be recognized and the recognition frame parameters.
Optionally, the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
Optionally, the center point information includes a Ground Truth center point.
Optionally, the processing stage includes one or more processing branches, and inputting the image data to be processed into the backbone network for processing further includes fusing the data output by the multiple processing branches.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
An image mask generation device, comprising:
an image data receiving module configured to receive image data to be processed;
an image data processing module configured to input the image data to be processed into a data processing procedure containing M processing stages for processing, where M is an integer greater than or equal to 2;
a data acquisition module configured to acquire the data output by N of the M processing stages, where N is an integer greater than or equal to 2 and less than or equal to M;
a fusion module configured to fuse the data output by the N processing stages; and
a mask generation module configured to pool the fused data to obtain an image mask.
Optionally, the image data processing module is specifically configured to input the image data to be processed into a backbone network for processing, the backbone network including a first processing procedure and a second processing procedure, the processing procedures including the M processing stages.
Optionally, the N processing stages are processing stages in the second data processing procedure.
Optionally, the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
Optionally, the mask generation module is further configured to compute an image mask based on the center point information of the item to be recognized and the recognition frame parameters.
Optionally, the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
Optionally, the center point information includes a Ground Truth center point.
Optionally, the processing stage includes one or more processing branches, and inputting the image data to be processed into the backbone network for processing further includes fusing the data output by the multiple processing branches.
Optionally, the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
An image mask generation method, comprising:
acquiring image data containing mask features, where the mask features are mask features of an image including an item to be recognized;
acquiring center point information and recognition frame parameters of the item to be recognized on the image data;
extracting the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information;
generating an image mask of the item based on the mask features of the item to be recognized extracted from the image data.
Optionally, the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
Optionally, the center point information includes a Ground Truth center point.
Optionally, the image data containing mask features includes feature image data acquired from the backbone network and obtained after fusion.
Optionally, the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
An image mask generation device, comprising:
an image data acquisition module configured to acquire image data containing mask features, where the mask features are mask features of an image including an item to be recognized;
an information acquisition module configured to acquire center point information and recognition frame parameters of the item to be recognized on the image data;
a mask feature acquisition module configured to extract the mask features of the item to be recognized from the image data based on the center point information and the recognition frame information; and
a mask generation module configured to generate an image mask of the item based on the mask features of the item to be recognized extracted from the image data.
Optionally, the recognition frame parameters include width information of the recognition frame and height information of the recognition frame.
Optionally, the center point information includes a Ground Truth center point.
Optionally, the image data containing mask features includes feature image data acquired from the backbone network and obtained after fusion.
Optionally, the fusion includes a channel-level addition operation and/or a channel-level concatenation operation.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage.
Optionally, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
Optionally, the last processing stage of the first processing procedure has the same number of processing branches as the first processing stage of the second processing procedure.
An image recognition method, comprising:
acquiring image data containing a group of items to be recognized;
recognizing each item in the group of items to be recognized based on the image data;
for each recognized item, generating a recognition frame on the image data to mark the item;
wherein the item group includes at least two items; and
the center of each recognition frame is located within the image of the item marked by that recognition frame; and
multiple recognition frames generated on the image data at least partially overlap.
Optionally, the degree to which the recognition frames overlap is expressed by an overlap degree, where the overlap degree of two recognition frames = the area of the region where the two frames intersect / the area of the region formed by merging the two frames.
Optionally, the at least partial overlap includes an overlap degree greater than 60%.
Optionally, generating the recognition frame includes generating the recognition frame based on the key point information of the item and the recognition frame parameters.
Optionally, the recognition frame parameters include length information and width information of the recognition frame.
Optionally, the key point of the item includes the center point of the item.
Optionally, recognizing each item in the group of items to be recognized based on the image data includes inputting the image data into a backbone network for processing to recognize each item in the group.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
An image recognition device, comprising:
an image data acquisition module configured to acquire image data containing a group of items to be recognized;
an image recognition module configured to recognize each item in the group of items to be recognized based on the image data;
a recognition frame generation module configured to, for each recognized item, generate a recognition frame on the image data to mark the item;
wherein the item group includes at least two items; and
the center of each recognition frame is located within the image of the item marked by that recognition frame; and
multiple recognition frames generated on the image data at least partially overlap.
Optionally, the degree to which the recognition frames overlap is expressed by an overlap degree, where the overlap degree of two recognition frames = the area of the region where the two frames intersect / the area of the region formed by merging the two frames.
Optionally, the at least partial overlap includes an overlap degree greater than 60%.
Optionally, the recognition frame generation module is further configured to generate a recognition frame based on the key point information of the item and the recognition frame parameters.
Optionally, the recognition frame parameters include length information and width information of the recognition frame.
Optionally, the key point of the item includes the center point of the item.
Optionally, the image recognition module is further configured to input the image data into the backbone network for processing to recognize each item in the group.
Optionally, the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
Optionally, in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, illustrative uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific feature, structure, material or characteristic described may be combined in a suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processing module, or another system that can fetch and execute instructions from an instruction execution system, apparatus or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection with one or more wires (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically by, for example, optically scanning the paper or other medium and then editing, interpreting or otherwise suitably processing it if necessary, and then stored in a computer memory.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like.
It should be understood that the parts of the embodiments of the present application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following techniques known in the art, or a combination of them, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of, or a combination of, the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disc, and the like.
Although the embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and shall not be construed as limiting the present application; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present application.

Claims (20)

  1. An image recognition method, characterized by comprising:
    acquiring image data containing an item to be recognized;
    processing the image data to recognize the item to be recognized in the image data, and acquiring key point information and recognition frame parameters of the item to be recognized;
    for the recognized item, generating a recognition frame on the image data based on the key point information and the recognition frame parameters; and
    for the recognized item, generating a mask of the recognized item based on the key point information and the recognition frame parameters.
  2. The image recognition method according to claim 1, characterized in that: the operation of generating the recognition frame and the operation of generating the mask of the recognized item are executed in parallel.
  3. The image recognition method according to claim 1, characterized in that: the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
  4. The image recognition method according to claim 1, characterized in that: the key point includes the center point of the item.
  5. The image recognition method according to any one of claims 1-4, characterized in that: processing the image data includes inputting the image data into a backbone network for processing.
  6. The image recognition method according to claim 5, characterized in that: the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  7. The image recognition method according to claim 6, characterized by further comprising: fusing the data output by the multiple processing branches.
  8. The image recognition method according to claim 6, characterized in that: the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  9. The image recognition method according to claim 6, characterized in that: in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  10. An image recognition device, characterized by comprising:
    an image data acquisition module configured to acquire image data containing an item to be recognized;
    an image data processing module configured to process the image data to recognize the item to be recognized in the image data, and to acquire key point information and recognition frame parameters of the item to be recognized;
    a recognition frame generation module configured to, for the recognized item, generate a recognition frame on the image data based on the key point information and the recognition frame parameters; and
    a mask generation module configured to, for the recognized item, generate a mask of the recognized item based on the key point information and the recognition frame parameters.
  11. The image recognition device according to claim 10, characterized in that: the recognition frame generation module and the mask generation module run in parallel.
  12. The image recognition device according to claim 10, characterized in that: the recognition frame parameters include the width of the recognition frame and the height of the recognition frame.
  13. The image recognition device according to claim 10, characterized in that: the key point includes the center point of the item.
  14. The image recognition device according to any one of claims 10-13, characterized in that: the image data processing module is configured to input the image data into a backbone network for processing.
  15. The image recognition device according to claim 14, characterized in that: the backbone network includes a first data processing procedure and a second data processing procedure, the data processing procedure includes one or more processing stages, and the processing stage includes one or more processing branches.
  16. The image recognition device according to claim 15, characterized by further comprising: fusing the data output by the multiple processing branches.
  17. The image recognition device according to claim 15, characterized in that: the data output by the multiple processing branches has multiple resolutions and/or multiple feature dimensions.
  18. The image recognition device according to claim 15, characterized in that: in the first processing procedure, a later processing stage includes more processing branches than an earlier processing stage; and/or, in the second processing procedure, a later processing stage includes fewer processing branches than an earlier processing stage.
  19. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the image recognition method according to any one of claims 1 to 9.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that: the computer program, when executed by a processor, implements the image recognition method according to any one of claims 1 to 9.
PCT/CN2021/138580 2021-06-21 2021-12-15 Image recognition method and device, electronic device, and storage medium WO2022267387A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110686482.0 2021-06-21
CN202110686482.0A CN113361442B (zh) 2021-06-21 2021-06-21 Image recognition method and device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022267387A1 true WO2022267387A1 (zh) 2022-12-29

Family

ID=77535336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138580 WO2022267387A1 (zh) 2021-06-21 2021-12-15 Image recognition method and device, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113361442B (zh)
WO (1) WO2022267387A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361442B (zh) * 2021-06-21 2024-03-29 梅卡曼德(北京)机器人科技有限公司 图像识别方法、装置、电子设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563502A (zh) * 2020-05-09 2020-08-21 Tencent Technology (Shenzhen) Co., Ltd. Image text recognition method and device, electronic device, and computer storage medium
CN111611994A (zh) * 2019-02-26 2020-09-01 Beijing Didi Infinity Technology and Development Co., Ltd. Image extraction method and device, electronic device, and storage medium
CN111860027A (zh) * 2020-06-11 2020-10-30 Beike Technology Co., Ltd. Two-dimensional code recognition method and device
CN112200044A (zh) * 2020-09-30 2021-01-08 Beijing NavInfo Technology Co., Ltd. Abnormal behavior detection method and device, and electronic device
WO2021012570A1 (zh) * 2019-07-22 2021-01-28 Shenzhen OneConnect Smart Technology Co., Ltd. Data entry method, apparatus, device, and storage medium
CN113361442A (zh) * 2021-06-21 2021-09-07 Mech-Mind (Beijing) Robotics Technologies Co., Ltd. Image recognition method and device, electronic device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647588A (zh) * 2018-04-24 2018-10-12 Guangzhou Lvyi Information Technology Co., Ltd. Item category recognition method and device, computer equipment, and storage medium
CN109871909B (zh) * 2019-04-16 2021-10-01 BOE Technology Group Co., Ltd. Image recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611994A (zh) * 2019-02-26 2020-09-01 Beijing Didi Infinity Technology and Development Co., Ltd. Image extraction method and device, electronic device, and storage medium
WO2021012570A1 (zh) * 2019-07-22 2021-01-28 Shenzhen OneConnect Smart Technology Co., Ltd. Data entry method, apparatus, device, and storage medium
CN111563502A (zh) * 2020-05-09 2020-08-21 Tencent Technology (Shenzhen) Co., Ltd. Image text recognition method and device, electronic device, and computer storage medium
CN111860027A (zh) * 2020-06-11 2020-10-30 Beike Technology Co., Ltd. Two-dimensional code recognition method and device
CN112200044A (zh) * 2020-09-30 2021-01-08 Beijing NavInfo Technology Co., Ltd. Abnormal behavior detection method and device, and electronic device
CN113361442A (zh) 2021-06-21 2021-09-07 Mech-Mind (Beijing) Robotics Technologies Co., Ltd. Image recognition method and device, electronic device, and storage medium

Also Published As

Publication number Publication date
CN113361442A (zh) 2021-09-07
CN113361442B (zh) 2024-03-29

Similar Documents

Publication Publication Date Title
US11321593B2 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN110717851A (zh) 图像处理方法及装置、神经网络的训练方法、存储介质
CN111583097A (zh) 图像处理方法、装置、电子设备及计算机可读存储介质
Qin et al. Bylabel: A boundary based semi-automatic image annotation tool
US11755889B2 (en) Method, system and apparatus for pattern recognition
CN109003297B (zh) 一种单目深度估计方法、装置、终端和存储介质
JP2004054956A (ja) 顔/類似顔映像で学習されたパターン分類器を利用した顔検出方法及びシステム
CN110246148B (zh) 多模态的深度信息融合和注意力学习的显著性检测方法
WO2022236824A1 (zh) 目标检测网络构建优化方法、装置、设备、介质及产品
CN116645592B (zh) 一种基于图像处理的裂缝检测方法和存储介质
CN111652181B (zh) 目标跟踪方法、装置及电子设备
WO2018219227A1 (zh) 结构光解码的方法和设备
WO2022267387A1 (zh) 图像识别方法、装置、电子设备和存储介质
CN116645598A (zh) 一种基于通道注意力特征融合的遥感图像语义分割方法
KR102413000B1 (ko) 이미지 라벨링 방법, 장치, 전자 기기, 저장 매체 및 컴퓨터 프로그램
CN114202648A (zh) 文本图像矫正方法、训练方法、装置、电子设备以及介质
CN113269280A (zh) 文本检测方法、装置、电子设备及计算机可读存储介质
CN111914894A (zh) 特征提取方法、装置、电子设备及计算机可读存储介质
KR20230083212A (ko) 객체 자세 추정 장치 및 방법
CN113610856B (zh) 训练图像分割模型和图像分割的方法和装置
CN114549825A (zh) 目标检测方法、装置、电子设备与存储介质
JP2018205858A (ja) 学習装置、認識装置、学習方法及びプログラム
JP7238510B2 (ja) 情報処理装置、情報処理方法及びプログラム
CN114332211A (zh) 一种基于边缘重建和密集融合网络的零件位姿计算方法
CN113420770B (zh) 图像数据处理方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946861

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE