CN113468908B - Target identification method and device

Target identification method and device

Info

Publication number
CN113468908B
Authority
CN
China
Prior art keywords
target
identified
input
target object
network
Prior art date
Legal status
Active
Application number
CN202010235904.8A
Other languages
Chinese (zh)
Other versions
CN113468908A (en)
Inventor
蔚勇
Current Assignee
Navinfo Co Ltd
Original Assignee
Navinfo Co Ltd
Priority date
Filing date
Publication date
Application filed by Navinfo Co Ltd filed Critical Navinfo Co Ltd
Priority to CN202010235904.8A priority Critical patent/CN113468908B/en
Publication of CN113468908A publication Critical patent/CN113468908A/en
Application granted granted Critical
Publication of CN113468908B publication Critical patent/CN113468908B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target identification method and device. The method includes: acquiring an image to be identified or a video to be identified; and identifying the image or video through a preset target detection network model to obtain an identification result. The network structure of the target detection network model comprises a backbone network, a guided-anchoring region proposal network GA-RPN, RoI Align, and BBox Head, wherein the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head, thereby improving the accuracy and recall of target image identification.

Description

Target identification method and device
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a target recognition method and apparatus.
Background
Target detection has long been a hotspot in the field of image processing, and detecting small targets with high robustness, high accuracy, and high real-time performance is of particular application value. Take the identification of signboards in a road as an example: on the one hand, detection and recognition of small targets must be highly reliable, with especially high detection accuracy and recall for small targets at longer distances; on the other hand, the image processing time should be as short as possible to guarantee real-time performance.
In the prior art, deep-learning-based target detection methods are generally adopted. For small-target detection with a deep neural network, a Cascade_RCNN-based network structure is typically used as the target detection model. Fig. 1 is a schematic diagram of a prior-art Cascade_RCNN-based network structure. As shown in Fig. 1, the structure comprises a backbone network (backbone), an anchor generating network (Region Proposals Networks Head, RPN), RoI Align, and BBox Head. The image to be identified or the video to be identified is input to the backbone to obtain a feature map; the RPN then generates candidate boxes (proposals); RoI Align computes the feature-map region corresponding to each proposal according to the proposals and the feature map; and finally BBox Head performs classification and localization on these feature-map regions to obtain the category and coordinates of the target, thereby detecting the target.
However, when the existing Cascade_RCNN target detection model is used to detect small targets, its detection accuracy and recall are low.
Disclosure of Invention
The application provides a target identification method and device for identifying a target object while improving the accuracy and recall of target image identification.
In a first aspect, an embodiment of the present application provides a target recognition method, including:
acquiring an image to be identified or a video to be identified;
identifying the image to be identified or the video to be identified through a preset target detection network model to obtain an identification result, wherein the network structure of the target detection network model comprises a backbone network, a guided-anchoring region proposal network (Guided Anchoring Region Proposals Networks Head, GA-RPN), region-of-interest alignment (RoI Align), and BBox Head, wherein the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head.
In the embodiment of the application, the target object in the image to be identified or the video to be identified is identified through the preset target detection network model. Compared with the prior-art Cascade_RCNN network structure, the RPN is replaced with a GA-RPN, which generates anchors according to distance information from the target center point, so that more anchors fall in the target area to cover small targets, improving the accuracy and recall of target image identification.
In one possible implementation, the recognition result includes coordinates and a category of the target object, and after obtaining the recognition result, the method further includes:
Determining the position of the target object in the real world scene according to the coordinates of the target object;
The location and class of the target object are stored.
In the embodiment of the application, the position of the target object in the real-world scene is determined according to the coordinates of the target object, and the position and category of the target object are stored, so that the position and category of the target object are determined and preserved.
In one possible embodiment, before storing the location and the category of the target object, the method further includes:
judging whether the position and/or category of the target object has changed;
if the position and/or category of the target object has changed, pushing a prompt message to the user, where the prompt message is used to prompt the user that the position and/or category of the target object has changed.
In one possible implementation manner, before acquiring the image to be identified or the video to be identified, the method further includes:
acquiring a training data sample;
Constructing a network structure of a target detection network model;
training the training data sample by utilizing the network structure of the target detection network model to generate a trained target detection network model.
In the embodiment of the application, the network structure of the target detection network model is trained on the training data samples, thereby generating the trained target detection network model.
In one possible embodiment, the training data samples include small target data samples, medium target data samples, and large target data samples, and the proportion of the small target data samples in the training data samples is greater than a preset proportion.
In one possible implementation, the network structure of the target detection network model further includes:
a feature pyramid network (Feature Pyramid Networks, FPN), the input of the FPN being connected to the output of the backbone network, the output of the FPN being connected to the input of the GA-RPN and the input of RoI Align respectively, the FPN being configured to output feature maps at multiple scales.
In the embodiment of the application, adding the FPN enriches the feature maps at different scales, effectively reduces missed detections of the target object, and improves the matching between the anchors and the actual target objects, thereby further improving the accuracy of identifying the target object.
In one possible implementation, the backbone network is any one of a ResNet-series network structure and a ResNext-series network structure.
In one possible implementation, before building the network structure of the target detection network model, the method further includes:
Training the backbone network to obtain a trained backbone network;
The backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
In the embodiment of the application, optimizing the convolution kernels of the first stage not only speeds up inference but also provides a larger receptive field than the prior art, so a better feature map can be obtained; optimizing the downsampling of the M-th stage reduces the feature-map information that would otherwise be ignored while keeping the output shape unchanged, improving the integrity and accuracy of the information in the feature map.
In one possible implementation, the network structure of the target detection network model further includes a Mask Head, where the input of the Mask Head is connected to the output of RoI Align, and the Mask Head is used to perform mask prediction on the feature map output by RoI Align.
The following describes the apparatus, electronic device, computer-readable storage medium, and computer program product provided by the embodiments of the present application; for their content and effects, refer to the target recognition method provided by the embodiments of the present application, which is not repeated here.
In a second aspect, an embodiment of the present application provides an object recognition apparatus, including:
the first acquisition module is used for acquiring an image to be identified or a video to be identified;
the identification module is used for identifying the image to be identified or the video to be identified through a preset target detection network model to obtain an identification result, wherein the network structure of the target detection network model comprises a backbone network, a guided-anchoring region proposal network GA-RPN, RoI Align, and BBox Head; the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
The determining module is used for determining the position of the target object in the real world scene according to the coordinates of the target object;
and the storage module is used for storing the position and the category of the target object.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
The judging module is used for judging whether the position and/or the category of the target object are changed or not;
And the prompting module is used for pushing a prompting message to the user if the position and/or the category of the target object are changed, wherein the prompting message is used for prompting the user that the position and/or the category of the target object are changed.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
The second acquisition module is used for acquiring training data samples;
the construction module is used for constructing a network structure of the target detection network model;
The first training module is used for training the training data sample by utilizing the network structure of the target detection network model and generating a trained target detection network model.
In one possible embodiment, the training data samples include small target data samples, medium target data samples, and large target data samples, and the proportion of the small target data samples in the training data samples is greater than a preset proportion.
In one possible implementation, the network structure of the target detection network model further includes:
a feature pyramid network FPN, the input of the FPN being connected to the output of the backbone network, the output of the FPN being connected to the input of the GA-RPN and the input of RoI Align respectively, the FPN being used to output feature maps at multiple scales.
In one possible implementation, the backbone network is any one of a ResNet-series network structure and a ResNext-series network structure.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
The second training module is used for training the backbone network to obtain a trained backbone network;
The backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
In a possible implementation manner, the network structure of the object detection network model further includes a Mask Head, where the input of the Mask Head is connected to the output of RoI Align, and the Mask Head is used for performing mask prediction on the feature map output by RoI Align.
In a third aspect, embodiments of the present application provide a graphics processor (Graphics Processing Unit, GPU) for performing a method as provided by the first aspect or an implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a vehicle machine, configured to perform a method as provided in the first aspect or an implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as provided by the first aspect or an implementation of the first aspect.
In a sixth aspect, embodiments of the present application provide a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as provided by the first aspect or an implementation of the first aspect.
In a seventh aspect, embodiments of the present application provide a computer program product, comprising: executable instructions for implementing a method as provided by the first aspect or an implementation of the first aspect.
According to the target identification method and device provided by the application, an image to be identified or a video to be identified is acquired and then identified through a preset target detection network model to obtain an identification result, wherein the network structure of the target detection network model comprises a backbone network, a guided-anchoring region proposal network GA-RPN, RoI Align, and BBox Head; the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head. In the embodiment of the application, the target object in the image or video to be identified is identified through the preset target detection network model. Compared with the prior-art Cascade_RCNN network structure, the RPN is replaced with a GA-RPN that generates anchors according to distance information from the target center point, so that more anchors fall in the target area to cover small targets, improving the accuracy and recall of target image identification.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a prior-art Cascade_RCNN-based network structure;
FIG. 2 is an exemplary application scenario diagram provided by an embodiment of the present application;
FIG. 3 is a flowchart of a target recognition method according to a first embodiment of the present application;
FIG. 4 is a schematic diagram of a network structure of an object detection network model according to an embodiment of the present application;
fig. 5 is a schematic diagram of a backbone network according to a first embodiment of the present application;
FIG. 6 is an effect diagram of a prior-art Cascade_RCNN-based network structure;
FIG. 7 is an effect diagram of the target detection network model according to the first embodiment of the present application;
Fig. 8 is a schematic diagram of a network structure of an object detection network model according to a second embodiment of the present application;
Fig. 9 is a flow chart of a target recognition method according to a third embodiment of the present application;
Fig. 10 is a schematic structural diagram of a target recognition device according to a fourth embodiment of the present application;
fig. 11 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the continuous development of electronic technology, machine vision is being applied ever more widely in the field of automatic driving. Detection and recognition of signboards in the road is an important component of unmanned automobiles and advanced driving assistance systems (Advanced Driving Assistance System, ADAS). The prior art generally adopts the Cascade_RCNN-based network structure shown in Fig. 1; however, when this existing Cascade_RCNN target detection model is used to detect small targets, its detection accuracy and recall are low. To solve this technical problem, embodiments of the present application provide a target identification method and apparatus.
In the following, an exemplary application scenario of an embodiment of the present application is described.
The target recognition method provided by the embodiment of the present application may be executed by the target recognition device provided by the embodiment of the present application, which may be part or all of a terminal device. Fig. 2 is an exemplary application scenario diagram provided by an embodiment of the present application. As shown in Fig. 2, the target recognition method may be applied to the unmanned automobile 21, for example implemented by a GPU, an ADAS in the unmanned automobile, or the like. The unmanned automobile 21 may further include a camera for obtaining the image to be recognized and/or the video to be recognized; the embodiment of the present application does not limit the type of the camera. The camera captures an image or video of the road surface, in which a target object is identified; the target object may be, for example, the road camera 22 or the signboard 23, and the embodiment of the present application is not limited in this respect.
The target identification method and device provided by the embodiment of the application are based on the following idea: improve the network structure of the target detection network model so that it can accurately identify small targets, and then input the image or video of the road surface into the model to identify the target objects in it, thereby improving the accuracy of identifying the target objects.
Fig. 3 is a flowchart of a target identification method provided by an embodiment of the present application. The method may be performed by a target identification apparatus, which may be implemented by software and/or hardware, for example a GPU, a vehicle machine, an ADAS, or an unmanned vehicle. As shown in Fig. 3, the target identification method provided by the embodiment of the present application may include:
step S101: and acquiring an image to be identified or a video to be identified.
The image to be identified or the video to be identified may be obtained by a camera of an unmanned automobile; the embodiment of the application does not limit the type of the camera, which may be, for example, a monocular camera. Alternatively, the image or video may be acquired by a high-precision map acquisition vehicle; the embodiment of the application is not limited in this regard. The image or video to be identified may contain road-surface information. After the camera acquires the image or video, it may be input to the GPU or ADAS of the unmanned automobile for subsequent processing.
Step S102: identifying the target object in the image to be identified or the video to be identified through a preset target detection network model to obtain an identification result.
The image to be identified or the video to be identified is input into the preset target detection network model, and the target object in it is identified to obtain an identification result, which may be the coordinates and/or the category of the target object.
In a possible implementation manner, the method for identifying a target provided by the embodiment of the present application may further include, before acquiring an image to be identified or a video to be identified:
Acquiring a training data sample; constructing a network structure of a target detection network model; training the training data sample by utilizing the network structure of the target detection network model to generate a trained target detection network model.
The training data samples can be pre-labeled images, for example from the COCO data set, images acquired by the user through a camera, and the like; the training data samples are then trained with the constructed network structure of the target detection network model, and the trained target detection network model is generated after training. The embodiments of the present application are not limited in this regard.
In one possible embodiment, the training data samples include a small target data sample, a medium target data sample, and a large target data sample, the small target data sample being greater than a preset duty cycle in proportion to the training data sample.
The specific sizes of small, medium, and large targets may differ across data sets and across images or videos of different resolutions. For example, a small target may be a target below 32×32 pixels, or below 20×20 pixels; a medium target may be a target above 32×32 pixels and below 96×96 pixels; and a large target may be a target above 96×96 pixels. These are merely examples. The specific value of the preset proportion is likewise not limited by the embodiment of the application and can be set according to user requirements, for example requiring the small target data samples to make up more than 30% of the training data samples.
By increasing the proportion of small targets in the training data samples, the network structure of the target detection network model can learn small-target features that were not present before. For example, if the existing training data samples contain no small targets below 30 pixels, such targets cannot be detected; after small targets below 30 pixels are added, the network structure of the target detection network model can learn them, and small targets below 30 pixels can be detected.
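As an illustration only, the Python sketch below buckets ground-truth boxes by the pixel thresholds mentioned above and checks the small-target proportion of a data set; the thresholds (32, 96) and the 30% floor are example values taken from this description, not values fixed by the application:

```python
# Illustrative sketch: size bucketing and small-target proportion check.

def size_category(width_px: int, height_px: int) -> str:
    """Classify a ground-truth box as small / medium / large by area."""
    area = width_px * height_px
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"

def small_target_ratio(boxes) -> float:
    """Fraction of (width, height) boxes falling in the small bucket."""
    if not boxes:
        return 0.0
    small = sum(1 for w, h in boxes if size_category(w, h) == "small")
    return small / len(boxes)

# Keep adding or resampling small-target images until the floor is met.
boxes = [(20, 18), (50, 60), (120, 100), (15, 25)]
assert small_target_ratio(boxes) > 0.3  # preset proportion, e.g. 30%
```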
Fig. 4 is a schematic diagram of the network structure of the target detection network model provided by an embodiment of the present application. As shown in Fig. 4, the network structure comprises a backbone network, a GA-RPN, RoI Align, and BBox Head, where the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head.
The input of the backbone network is the image to be recognized or a video frame of the video to be recognized; the backbone extracts feature information to obtain a feature map, which is then fed to the GA-RPN and to RoI Align respectively. The GA-RPN generates target candidate boxes (proposals), which are input to RoI Align; RoI Align determines the feature-map region corresponding to each proposal according to the proposals and the feature map; finally, BBox Head performs classification and localization on the feature-map regions from RoI Align to obtain the category and coordinates of the target object.
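For clarity, the data flow of Fig. 4 can be summarized with the following minimal Python sketch; every name here (backbone, ga_rpn, roi_align, bbox_head) is a placeholder standing in for the components described above, not a reference to a concrete library API:

```python
# Minimal data-flow sketch of the network structure in Fig. 4.
def detect(image, backbone, ga_rpn, roi_align, bbox_head):
    # Backbone: extract a feature map from the image or video frame.
    feature_map = backbone(image)
    # GA-RPN: generate candidate boxes (proposals), with anchors
    # concentrated near predicted object centers.
    proposals = ga_rpn(feature_map)
    # RoI Align: crop the feature-map region for each proposal.
    roi_features = [roi_align(feature_map, p) for p in proposals]
    # BBox Head: classify and localize each region -> (category, coordinates).
    return [bbox_head(f) for f in roi_features]
```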
The embodiment of the application does not limit the specific network structure of the backbone network. In one possible implementation, the backbone network may be any network structure of the ResNet series and ResNext series, for example ResNet50, ResNet101, ResNext50, or ResNext101. It may also be any of the Inception-series networks, for example Inception v1, Inception v2/Inception v3, or Inception v4; the embodiment of the present application is not limited thereto.
In one possible implementation, to improve the integrity and accuracy of the information in the feature map, the embodiment of the present application optimizes the backbone network. Taking any one of the ResNet-series and ResNext-series network structures as an example, before constructing the network structure of the target detection network model, the target identification method provided by the embodiment of the application further includes:
training the backbone network to obtain a trained backbone network; the backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
Fig. 5 is a schematic diagram of a backbone network provided in an embodiment of the present application. As shown in Fig. 5, the backbone network comprises five stages: a first stage, a second stage, a third stage, a fourth stage, and a fifth stage. The first stage includes N 3*3 convolution kernels and a max pooling module, where N may be set according to user requirements, for example N=3. In the prior art, the first stage of the backbone is one 7*7 convolution kernel plus max pooling; three 3x3 convolution kernels have an effect equivalent to one 7x7 convolution kernel, and since the computational cost of a convolution kernel scales with the square of its width or height, the time overhead of a 7x7 kernel is about 5.4 times that of a 3x3 kernel. The optimized first stage therefore not only speeds up inference but also keeps a large receptive field, so the network obtains a better feature map.
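The 5.4 figure can be checked from the quadratic scaling of convolution cost with kernel size, per output position and channel pair:

```latex
\frac{\mathrm{cost}(7\times 7)}{\mathrm{cost}(3\times 3)}
  = \frac{7^{2}}{3^{2}} = \frac{49}{9} \approx 5.4,
\qquad
3 \cdot 3^{2} = 27 \;<\; 49 = 7^{2}
```

So three stacked 3x3 kernels reach the same 7x7 receptive field at roughly 27/49, about 55%, of the cost of a single 7x7 kernel.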
The second, third, fourth, and fifth stages of the backbone network each comprise a downsampling module and one or more residual modules; the number of residual modules in each stage is not limited by the embodiment of the present application, and Fig. 5 shows only the fourth stage, comprising a downsampling module and one residual module, as an example; the other stages are not described again. The downsampling module is divided into two branches. The first branch sequentially comprises a 1*1 convolution kernel, a 3*3 convolution kernel with step length S=2, and a 1*1 convolution kernel; compared with the prior art, in which the first convolution kernel is a 1*1 kernel with step length 2, this prevents the convolution of the first branch from ignoring three quarters of the information of the input image. The second branch sequentially comprises an average pooling module and a 1*1 convolution kernel, with an overall step length of 2, so that the convolution of the second branch likewise avoids ignoring three quarters of the information of the input image.
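A minimal PyTorch-style sketch of this two-branch downsampling block follows. The channel widths are illustrative assumptions, and combining the two branches by residual addition is likewise an assumption in keeping with ResNet-style design; the application does not fix these details:

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Two-branch downsampling sketch (overall stride 2 on both branches)."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        # First branch: 1x1 (stride 1) -> 3x3 (stride 2) -> 1x1.
        # Putting the stride on the 3x3 kernel rather than the leading 1x1
        # avoids discarding three quarters of the input positions.
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=1, stride=1),
            nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(c_mid, c_out, kernel_size=1, stride=1),
        )
        # Second branch: average pooling carries the stride, then a 1x1
        # projection, so the shortcut aggregates instead of skipping pixels.
        self.branch2 = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(c_in, c_out, kernel_size=1, stride=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch1(x) + self.branch2(x)

# Example: halve a 64x64 feature map.
y = Downsample(256, 64, 512)(torch.rand(1, 256, 64, 64))  # -> (1, 512, 32, 32)
```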
In summary, in the embodiment of the present application, optimizing the convolution kernels of the first stage not only speeds up inference but also provides a larger receptive field than the prior art, so a better feature map can be obtained; optimizing the downsampling of the M-th stage reduces the feature-map information that would otherwise be ignored while keeping the output shape unchanged, improving the integrity and accuracy of the information in the feature map.
The embodiment of the application does not limit the network structure of the GA-RPN. The GA-RPN divides the whole feature map into an object center region, a peripheral region, and an ignored region according to the position of the center of the ground-truth box: the area within a preset distance of the ground-truth center is the object center region, anchors are sampled in the object center region, and the proposals sampled there are taken as positive samples when training the network structure of the target detection network model. This greatly reduces the number of anchors and concentrates them in the target center regions, increasing the number of anchors on small targets and raising the proportion of small targets among the samples, so the network may learn small-target features it had not seen before. For example, if there are no targets below 30 pixels in the data set, it is impossible to detect such targets; but after targets below 30 pixels are added, the network learns their characteristics, and targets at this pixel scale can be detected.
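The following NumPy sketch illustrates one way such a three-region partition could be computed; the radii, the Chebyshev distance, and the label encoding are illustrative assumptions, not the GA-RPN's prescribed implementation:

```python
import numpy as np

def region_labels(h, w, centers, r_center=2, r_ignore=4):
    """Label feature-map cells: 1 = object center region,
    -1 = peripheral (ignored) region, 0 = outside region."""
    labels = np.zeros((h, w), dtype=np.int8)
    ys, xs = np.mgrid[0:h, 0:w]
    for cy, cx in centers:
        dist = np.maximum(np.abs(ys - cy), np.abs(xs - cx))
        labels[(dist <= r_ignore) & (labels == 0)] = -1  # peripheral band
        labels[dist <= r_center] = 1                     # object center
    return labels

# Anchors are sampled only where labels == 1, concentrating them on
# (small) object centers; cells labeled -1 contribute no training signal.
print(region_labels(8, 8, centers=[(3, 3)]))
```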
Fig. 6 is an effect diagram of the prior-art Cascade_RCNN-based network structure, and Fig. 7 is an effect diagram of the target detection network model of the first embodiment of the present application. It can be seen that, on the same image to be identified, the target detection network model provided by the present application recognizes more signboards.
To further improve the accuracy of identifying the target object, in a possible implementation manner, Fig. 8 is a schematic diagram of the network structure of the target detection network model according to the second embodiment of the present application. As shown in Fig. 8, the network structure of the target detection network model may further include:
a feature pyramid network FPN, the input of which is connected to the output of the backbone network and the output of which is connected to the input of the GA-RPN and the input of RoI Align respectively; the FPN is used to output feature maps at multiple scales.
The embodiment of the application does not limit the specific network structure of the FPN. Adding the FPN enriches the feature maps at different scales, effectively reduces missed detections of the target object, and improves the matching between the anchors and the actual target objects, thereby improving the accuracy of identifying the target object.
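As a sketch of how an FPN slots between the backbone and the GA-RPN / RoI Align inputs, the snippet below uses torchvision's FeaturePyramidNetwork as a stand-in; the channel counts and feature-map sizes are illustrative assumptions:

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Channels of backbone stages 2-5 (illustrative ResNet-like widths).
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048],
                            out_channels=256)

feats = OrderedDict([
    ("c2", torch.rand(1, 256, 64, 64)),
    ("c3", torch.rand(1, 512, 32, 32)),
    ("c4", torch.rand(1, 1024, 16, 16)),
    ("c5", torch.rand(1, 2048, 8, 8)),
])
pyramid = fpn(feats)  # feature maps at multiple scales, all 256 channels
for name, p in pyramid.items():
    print(name, tuple(p.shape))
```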
As shown in Fig. 8, in a possible implementation manner, the network structure of the target detection network model provided by the embodiment of the present application may further include a Mask Head, whose input is connected to the output of RoI Align; the Mask Head is used to perform mask prediction on the feature map output by RoI Align. Through the segmentation result of the target object, the Mask Head can constrain the target candidate boxes, and thereby the recognition result in BBox Head.
In the embodiment of the application, the target object in the image to be identified or the video to be identified is identified through the preset target detection network model. Compared with the prior-art Cascade_RCNN network structure, the RPN is replaced with a GA-RPN, which generates anchors according to distance information from the target center point, so that more anchors fall in the target area to cover small targets, improving the accuracy and recall of target image identification.
In a possible implementation manner, the recognition result includes the coordinates and category of the target object. Fig. 9 is a flowchart of a target recognition method provided by the third embodiment of the present application; as shown in Fig. 9, after step S102 of the foregoing embodiment, the target recognition method provided by the embodiment of the present application further includes:
step S201: the position of the target object in the real world scene is determined from the coordinates of the target object.
After the coordinates of the target object are determined, the position of the target object in the real-world scene may be determined from them. The embodiment of the application does not limit the specific implementation. In one possible implementation, the road section of the current road can be determined from the image to be identified, and the position of the target object in the current road determined from the coordinates of the target object, so as to determine its position in the real-world scene. In another possible implementation, point cloud data of the current road can be obtained through a laser radar, the three-dimensional coordinates of the current road determined from the point cloud data, and the position of the target object in those three-dimensional coordinates determined from the coordinates of the target object, so as to determine its position in the real-world scene. These are merely examples, and the embodiment of the present application is not limited thereto.
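As an illustration of the second option, the sketch below projects a lidar point cloud into the image and reads off the 3-D point nearest to the detected box center. The calibration matrices K (camera intrinsics) and T_cam_from_lidar (extrinsics) are assumed inputs; this is a simplified sketch, not a prescribed implementation:

```python
import numpy as np

def locate_target(box_center_uv, points_xyz, K, T_cam_from_lidar):
    """Return the 3-D lidar point whose image projection is closest
    to the detected target's box center (pixel coordinates u, v)."""
    # Lidar points -> camera frame (homogeneous coordinates).
    pts_h = np.c_[points_xyz, np.ones(len(points_xyz))]
    pts_cam = (T_cam_from_lidar @ pts_h.T)[:3].T
    pts_cam = pts_cam[pts_cam[:, 2] > 0]   # keep points in front of camera
    # Pinhole projection to pixel coordinates.
    uv = K @ pts_cam.T
    uv = (uv[:2] / uv[2]).T
    i = np.argmin(np.linalg.norm(uv - np.asarray(box_center_uv), axis=1))
    return pts_cam[i]                      # 3-D position, camera frame
```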
Step S202: the location and class of the target object are stored.
After the position and category of the target object are obtained, they are stored for subsequent processing, for example for producing a high-precision map. The embodiments of the present application are not limited in this regard.
In one possible embodiment, before storing the location and the category of the target object, the method further includes:
judging whether the position and/or category of the target object has changed; if it has, pushing a prompt message to the user, where the prompt message is used to prompt the user of the change.
The position and category of the target object may already exist in a database, but road conditions may change, and with them the category and position of the target object. If it is determined that the position and/or category of the target object has changed, a prompt message may be pushed to the user to indicate the change. The specific form of the prompt message pushed to the user is not limited by the embodiment of the application.
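A minimal sketch of this check follows; the storage layout (a dict keyed by a target identifier) and the push_prompt callback are assumptions for illustration:

```python
def check_and_store(db, target_id, position, category, push_prompt):
    """Store a target's position/category, prompting the user on change."""
    old = db.get(target_id)
    if old and (old["position"] != position or old["category"] != category):
        push_prompt(f"Target {target_id}: position and/or category changed")
    db[target_id] = {"position": position, "category": category}
```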
In the embodiment of the application, the position of the target object in the real-world scene is determined according to the coordinates of the target object, and the position and category of the target object are stored, so that the position and category of the target object are determined and preserved.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 10 is a schematic structural diagram of an object recognition device according to a fourth embodiment of the present application, where the device may be implemented in software and/or hardware, and as shown in fig. 10, the object recognition device according to the embodiment of the present application may include:
the first obtaining module 51 is configured to obtain an image to be identified or a video to be identified.
The recognition module 52 is configured to recognize the image to be recognized or the video to be recognized through a preset target detection network model to obtain a recognition result, where the network structure of the target detection network model includes a backbone network, a guided-anchoring region proposal network GA-RPN, RoI Align, and BBox Head; the output of the backbone network is connected to the input of the GA-RPN and the input of RoI Align respectively, the output of the GA-RPN is connected to the input of RoI Align, and the output of RoI Align is connected to the input of BBox Head.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
a determining module 53 for determining a position of the target object in the real world scene based on the coordinates of the target object.
The storage module 54 is configured to store the location and the category of the target object.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
A determining module 55, configured to determine whether the location and/or the category of the target object change.
The prompting module 56 is configured to push a prompting message to the user if the location and/or the category of the target object change, where the prompting message is used to prompt the user that the location and/or the category of the target object change.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
A second acquisition module 57 is configured to acquire training data samples.
A construction module 58 for constructing a network structure of the object detection network model.
The first training module 59 is configured to train the training data sample using the network structure of the target detection network model, and generate a trained target detection network model.
In one possible embodiment, the training data samples include small target data samples, medium target data samples, and large target data samples, and the proportion of the small target data samples in the training data samples is greater than a preset proportion.
In one possible implementation, the network structure of the target detection network model further includes:
a feature pyramid network FPN, the input of the FPN being connected to the output of the backbone network, the output of the FPN being connected to the input of the GA-RPN and the input of RoI Align respectively, the FPN being used to output feature maps at multiple scales.
In one possible implementation, the backbone network is any one of a ResNet-series network structure and a ResNext-series network structure.
In a possible implementation manner, the object identifying device provided by the embodiment of the present application further includes:
A second training module 60, configured to train the backbone network to obtain a trained backbone network;
The backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
In a possible implementation manner, the network structure of the object detection network model further includes a Mask Head, where the input of the Mask Head is connected to the output of RoI Align, and the Mask Head is used for performing mask prediction on the feature map output by RoI Align.
The embodiment of the apparatus provided in the present application is merely illustrative, and the module division in fig. 10 is merely a logic function division, and there may be other division manners in practical implementation. For example, multiple modules may be combined or may be integrated into another system. The coupling of the individual modules to each other may be achieved by means of interfaces which are typically electrical communication interfaces, but it is not excluded that they may be mechanical interfaces or other forms of interfaces. Thus, the modules illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed in different locations on the same or different devices.
The embodiment of the application provides a GPU for executing the target identification method provided by the embodiment of the application.
The embodiment of the application provides a vehicle machine for executing the target identification method provided by the embodiment of the application.
Fig. 11 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present application, as shown in fig. 11, where the apparatus includes:
a processor 71, a memory 72, a transceiver 73, and a computer program; the transceiver 73 enables data transmission with other devices, and the computer program is stored in the memory 72 and configured to be executed by the processor 71, the computer program comprising instructions for performing the above method; for its content and effects, refer to the method embodiments.
In addition, the embodiment of the application further provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment executes the various possible methods.
Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a user device. The processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (12)

1. A method of target identification, comprising:
acquiring an image to be identified or a video to be identified;
Identifying the image to be identified or the video to be identified through a preset target detection network model to obtain an identification result, wherein the network structure of the target detection network model comprises a backbone network, a guide anchor generation network GA-RPN, a region-of-interest alignment RoI Align and a detection frame head BBoxHead, wherein the output of the backbone network is respectively connected with the input of the GA-RPN and the input of the RoI Align, the output of the GA-RPN is connected with the input of the RoI Align, and the output of the RoI Align is connected with the input of the BBoxHead;
The backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
2. The method of claim 1, wherein the recognition result includes coordinates and a category of the target object, and further comprising, after the obtaining the recognition result:
determining the position of the target object in the real world scene according to the coordinates of the target object;
And storing the position and the category of the target object.
3. The method of claim 2, further comprising, prior to said storing the location and class of the target object:
judging whether the position and/or the category of the target object are changed or not;
And if the position and/or the category of the target object are changed, pushing a prompt message to the user, wherein the prompt message is used for prompting the user that the position and/or the category of the target object are changed.
4. A method according to any one of claims 1-3, further comprising, prior to acquiring the image or video to be identified:
acquiring a training data sample;
Constructing a network structure of the target detection network model;
And training the training data sample by utilizing the network structure of the target detection network model to generate a trained target detection network model.
5. The method of claim 4, wherein
The training data samples comprise a small target data sample, a medium target data sample and a large target data sample, and the proportion of the small target data sample to the training data sample is larger than a preset proportion.
6. The method of claim 5, wherein the network structure of the target detection network model further comprises:
a feature pyramid network FPN, wherein the input of the FPN is connected with the output of the backbone network, the output of the FPN is respectively connected with the input of the GA-RPN and the input of the RoI Align, and the FPN is used for outputting feature maps at multiple scales.
7. The method of claim 6, wherein the backbone network is any one of a ResNet-series network structure and a ResNext-series network structure.
8. The method of claim 7, further comprising, prior to said constructing the network structure of the object detection network model:
and training the backbone network to obtain a trained backbone network.
9. The method of claim 8, wherein
The network structure of the target detection network model further comprises a Mask Head, wherein the input of the Mask Head is connected with the output of the RoI Align, and the Mask Head is used for carrying out Mask prediction on the feature map output by the RoI Align.
10. An object recognition apparatus, comprising:
the first acquisition module is used for acquiring an image to be identified or a video to be identified;
The identification module is used for identifying the image to be identified or the video to be identified through a preset target detection network model to obtain an identification result, wherein the network structure of the target detection network model comprises a backbone network, a guide anchor generation network GA-RPN, RoI Align, and BBoxHead; the output of the backbone network is respectively connected with the input of the GA-RPN and the input of the RoI Align, the output of the GA-RPN is connected with the input of the RoI Align, and the output of the RoI Align is connected with the input of the BBoxHead; the backbone network comprises five stages; the first stage includes N 3*3 convolution kernels and a max pooling module, where N is an integer greater than 1; the M-th stage comprises a downsampling module and a plurality of residual modules, the downsampling module is divided into two branches, the step length of the first convolution kernel of the first branch is 1, and the second branch sequentially comprises an average pooling module and a convolution kernel, where M=2, 3, 4, 5.
11. The apparatus as recited in claim 10, further comprising:
a determining module, configured to determine a position of the target object in a real world scene according to coordinates of the target object;
and the storage module is used for storing the position and the category of the target object.
12. The apparatus as recited in claim 11, further comprising:
the judging module is used for judging whether the position and/or the category of the target object are changed or not;
And the prompting module is used for pushing a prompting message to a user if the position and/or the category of the target object are changed, wherein the prompting message is used for prompting the user that the position and/or the category of the target object are changed.
CN202010235904.8A 2020-03-30 2020-03-30 Target identification method and device Active CN113468908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010235904.8A CN113468908B (en) 2020-03-30 2020-03-30 Target identification method and device

Publications (2)

Publication Number Publication Date
CN113468908A CN113468908A (en) 2021-10-01
CN113468908B true CN113468908B (en) 2024-05-10

Family

ID=77864828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010235904.8A Active CN113468908B (en) 2020-03-30 2020-03-30 Target identification method and device

Country Status (1)

Country Link
CN (1) CN113468908B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776665B2 (en) * 2018-04-26 2020-09-15 Qualcomm Incorporated Systems and methods for object detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110599497A (en) * 2019-07-31 2019-12-20 中国地质大学(武汉) Drivable region segmentation method based on deep neural network
CN110503097A (en) * 2019-08-27 2019-11-26 腾讯科技(深圳)有限公司 Training method, device and the storage medium of image processing model
CN110516671A (en) * 2019-08-27 2019-11-29 腾讯科技(深圳)有限公司 Training method, image detecting method and the device of neural network model

Also Published As

Publication number Publication date
CN113468908A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN109960742B (en) Local information searching method and device
CN111488770A (en) Traffic sign recognition method, and training method and device of neural network model
CN112541448B (en) Pedestrian re-identification method and device, electronic equipment and storage medium
CN110298281B (en) Video structuring method and device, electronic equipment and storage medium
CN106250555B (en) Vehicle retrieval method and device based on big data
CN114049356B (en) Method, device and system for detecting structure apparent crack
CN112200884B (en) Lane line generation method and device
CN114445430A (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN109829421B (en) Method and device for vehicle detection and computer readable storage medium
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN114168768A (en) Image retrieval method and related equipment
CN109523570B (en) Motion parameter calculation method and device
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN113223037A (en) Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data
KR20220073444A (en) Method and apparatus for tracking object and terminal for performing the method
CN113468908B (en) Target identification method and device
CN111144361A (en) Road lane detection method based on binaryzation CGAN network
CN111476190A (en) Target detection method, apparatus and storage medium for unmanned driving
CN114429631B (en) Three-dimensional object detection method, device, equipment and storage medium
CN116229448A (en) Three-dimensional target detection method, device, equipment and readable storage medium
KR101391667B1 (en) A model learning and recognition method for object category recognition robust to scale changes
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN113344121B (en) Method for training a sign classification model and sign classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant