CN109255352A

CN109255352A - Object detection method, apparatus and system

Info

Publication number: CN109255352A
Application number: CN201811049034.4A
Authority: CN
Inventors: 秦政; 黎泽明; 俞刚
Original assignee: Beijing Megvii Technology Co Ltd
Current assignee: Beijing Megvii Technology Co Ltd
Priority date: 2018-09-07
Filing date: 2018-09-07
Publication date: 2019-01-22
Anticipated expiration: 2038-09-07
Also published as: CN109255352B

Abstract

The present invention provides a kind of object detection methods, apparatus and system, are related to field of artificial intelligence, this method comprises: obtaining target image to be detected；Feature extraction is carried out to target image, generates fisrt feature figure；Wherein, fisrt feature figure includes the characteristic information of different scale；Region candidate identification is carried out to fisrt feature figure, obtains the candidate region information of target image；According to candidate region information and fisrt feature figure, testing result is generated；The testing result includes target category and/or target position in target image.The present invention can effectively promote detection effect.

Description

Object detection method, apparatus and system

Technical field

The present invention relates to field of artificial intelligence, more particularly, to a kind of object detection method, apparatus and system.

Background technique

Target detection (Object Detection) is the very important task of one kind in computer vision, is such as people The basis of many complicated visual tasks such as face detection, target following, example segmentation.Existing object detection method is mostly based on volume Product neural fusion, can detecte out the object category for including in image, can also orient target object in the picture Position is widely used to the fields such as security system, traffic system.It is understood that object detection results pair Each application is of great significance, and the detection effect of existing object detection method is bad.

Summary of the invention

In view of this, can preferably be mentioned the purpose of the present invention is to provide a kind of object detection method, apparatus and system Rise detection effect.

To achieve the goals above, technical solution used in the embodiment of the present invention is as follows:

In a first aspect, the embodiment of the invention provides a kind of object detection methods, comprising: obtain target figure to be detected Picture；Feature extraction is carried out to the target image, generates fisrt feature figure；Wherein, the fisrt feature figure includes different rulers The characteristic information of degree；Region candidate identification is carried out to the fisrt feature figure, obtains the candidate region information of the target image； According to multiple candidate regions and the fisrt feature figure, testing result is generated；The testing result includes the target figure Target category and/or target position as in.

Further, the step of acquisition target image to be detected, comprising: obtain initial pictures to be detected；To institute It states initial pictures to be pre-processed, obtains target image；Wherein, the pretreatment includes whitening operation.

Further, described the step of feature extraction is carried out to the target image, generates fisrt feature figure, comprising: by institute It states target image and is input to base neural network；Multistage feature is carried out to the target image by the base neural network It extracts, obtains the characteristic information of different scale；Wherein, the scale for the characteristic information that each stage extracts is different；By multiple fingers Determine stage corresponding feature fusion and forms fisrt feature figure.

Further, the described the step of corresponding feature fusion of multiple specified phases is formed into fisrt feature figure, comprising: Obtain the fisrt feature information that the penultimate stage of the base neural network extracts；Obtain the base neural network The second feature information that the last stage extracts；Global pool operation is carried out to the second feature information, obtains third spy Reference breath；Enhance network by context to believe the fisrt feature information, the second feature information and the third feature Breath fusion forms fisrt feature figure.

Further, the context enhancing network includes the first parallel convolutional layer, the second convolutional layer and third convolutional layer； Wherein, the output end of second convolutional layer is also connected with up-sampling operation layer, and the output end of the third convolutional layer is also connected with There is broadcast operation layer；The output end and the broadcast operation layer of the output end of first convolutional layer, the up-sampling operation layer Output end be connected with add operation layer jointly.

Further, described that network is enhanced for the fisrt feature information, the second feature information and institute by context It states third feature information and merges the step of forming fisrt feature figure, comprising: by the fisrt feature information input to described first Convolutional layer, by the second feature information input to second convolutional layer, and by the third feature information input to institute State third convolutional layer；Convolution operation is carried out to the fisrt feature information by first convolutional layer, obtains that there is specified ruler The fisrt feature information of degree；By second convolutional layer and the up-sampling operation layer successively to the second feature information into Row convolution operation and up-sampling operation, obtain the second feature information with the specified scale；Pass through the third convolutional layer Convolution operation and broadcast operation successively are carried out to the third feature information with the broadcast operation layer, obtain having described specified The third feature information of scale；By the add operation layer to the specified scale fisrt feature information, have institute It states the second feature information of specified scale and the third feature information with the specified scale sums up, form first Characteristic pattern.

Further, the base neural network is lightweight feature extraction network.

Further, described that candidate region identification is carried out to the fisrt feature figure, obtain the candidate regions of the target image The step of domain information, comprising: the fisrt feature figure is input to region candidate and generates network；It is generated by the region candidate Network carries out feature extraction to the fisrt feature figure, obtains intermediate features figure, and carry out candidate regions to the intermediate features figure Domain identification, obtains the candidate region information of the target image.

Further, it includes sequentially connected channel convolutional layer and Volume Four lamination that the region candidate, which generates network,.

Further, described according to the candidate region information and the fisrt feature figure, the step of generating testing result, packet It includes: the fisrt feature figure and the intermediate features figure is input to spatial attention network；Pass through the spatial attention The fisrt feature figure and the intermediate features figure are merged and to form second feature figure by network；Wherein, the second feature figure Foreground features are better than background characteristics；According to the candidate region information and the second feature figure, testing result is generated.

Further, the spatial attention network includes sequentially connected 5th convolutional layer and activation primitive layer；It is described to swash The output end of function layer living is connected with multiplying layer.

Further, batch normalization layer is also connected between the 5th convolutional layer and the activation primitive layer.

Further, described to be merged the fisrt feature figure and the intermediate features figure by the spatial attention network The step of forming second feature figure, comprising: the intermediate features figure is input to the 5th convolutional layer, passes through described volume five Lamination, described batch of normalization layer and the activation primitive layer are successively handled the intermediate features figure, obtain the activation The intermediate features figure after processing of function layer output；Wherein, the foreground features of the intermediate features figure after processing It is better than background characteristics；The intermediate features figure by the fisrt feature figure and after processing is input to the multiplying layer； Multiplying is carried out by the intermediate features figure of the multiplying layer to the fisrt feature figure and after processing, is generated Second feature figure.

Further, described according to the candidate region information and the second feature figure, the step of generating testing result, packet It includes: the candidate region information and the second feature figure is input to candidate region feature extraction layer；Pass through the candidate regions Characteristic of field extract layer is based on the candidate region information, and the region of each candidate region is extracted on the second feature figure Feature；Provincial characteristics based on each candidate region carries out target detection, generates testing result.

Further, the provincial characteristics based on each candidate region carries out target detection, generates testing result Step, comprising: classification processing is carried out to the provincial characteristics of each candidate region by the classification sub-network, determine described in Target category in target image；And/or the provincial characteristics of each candidate region is returned by returning sub-network Processing, obtains the target position in the target image.

Further, the classification sub-network and the recurrence sub-network are a full articulamentum.

Second aspect, the embodiment of the present invention also provide a kind of object detecting device, comprising: image collection module, for obtaining Take target image to be detected；Fisrt feature figure generation module generates first for carrying out feature extraction to the target image Characteristic pattern；Wherein, the fisrt feature figure includes the characteristic information of different scale；Candidate identification module, for described the One characteristic pattern carries out region candidate identification, obtains the candidate region information of the target image；Detection module, for according to Candidate region information and the fisrt feature figure generate testing result；The testing result includes the mesh in the target image Mark classification and/or target position.

The third aspect, the embodiment of the invention provides a kind of object detection system, the system comprises: image collector It sets, processor and storage device；Described image acquisition device, for acquiring target image；Meter is stored on the storage device Calculation machine program, the computer program execute the described in any item methods of above-mentioned first aspect when being run by the processor.

Fourth aspect, the embodiment of the invention provides a kind of computer readable storage medium, the computer-readable storage Computer program is stored on medium, the computer program is executed when being run by processor described in above-mentioned any one of first aspect Method the step of.

The embodiment of the invention provides a kind of object detection method, apparatus and system, can to the target image of acquisition into Row feature extraction, generation include the fisrt feature figure of the characteristic information of different scale；Then region is carried out to fisrt feature figure Candidate's identification obtains candidate region information, and then can generate testing result according to candidate region information and fisrt feature figure. Aforesaid way provided in this embodiment can carry out target detection using the characteristic information of different scale, and detection is effectively promoted Effect.

Other features and advantages of the present invention will illustrate in the following description, alternatively, Partial Feature and advantage can be with Deduce from specification or unambiguously determine, or by implement the disclosure above-mentioned technology it can be learnt that.

To enable the above objects, features and advantages of the present invention to be clearer and more comprehensible, preferred embodiment is cited below particularly, and cooperate Appended attached drawing, is described in detail below.

Detailed description of the invention

It, below will be to specific in order to illustrate more clearly of the specific embodiment of the invention or technical solution in the prior art Embodiment or attached drawing needed to be used in the description of the prior art be briefly described, it should be apparent that, it is described below Attached drawing is some embodiments of the present invention, for those of ordinary skill in the art, before not making the creative labor It puts, is also possible to obtain other drawings based on these drawings.

Fig. 1 shows the structural schematic diagram of a kind of electronic equipment provided by the embodiment of the present invention；

Fig. 2 shows a kind of target detection flow charts provided by the embodiment of the present invention；

Fig. 3 shows a kind of structural schematic diagram of context enhancing network provided by the embodiment of the present invention；

Fig. 4 shows a kind of structural schematic diagram of spatial attention network provided by the embodiment of the present invention；

Fig. 5 shows a kind of structural schematic diagram of target detection model provided by the embodiment of the present invention；

Fig. 6 shows a kind of structural block diagram of object detection system provided by the embodiment of the present invention；

Fig. 7 shows a kind of fisrt feature figure generation schematic diagram provided by the embodiment of the present invention；

Fig. 8 shows a kind of second feature figure generation schematic diagram provided by the embodiment of the present invention；

Fig. 9 shows a kind of structural block diagram of object detecting device provided by the embodiment of the present invention.

Specific embodiment

In order to make the object, technical scheme and advantages of the embodiment of the invention clearer, below in conjunction with attached drawing to the present invention Technical solution be clearly and completely described, it is clear that described embodiments are some of the embodiments of the present invention, rather than Whole embodiments.Based on the embodiments of the present invention, those of ordinary skill in the art are not making creative work premise Under every other embodiment obtained, shall fall within the protection scope of the present invention.

In view of target detection in the prior art is ineffective, to improve this problem, provided in an embodiment of the present invention one Kind object detection method, apparatus and system, which can be used corresponding software or hardware realization, below to the embodiment of the present invention It describes in detail.

Embodiment one:

Firstly, describing the example of object detection method for realizing the embodiment of the present invention, apparatus and system referring to Fig.1 Electronic equipment 100.

The structural schematic diagram of a kind of electronic equipment as shown in Figure 1, electronic equipment 100 include one or more processors 102, one or more storage devices 104, input unit 106, output device 108 and image collecting device 110, these components It is interconnected by bindiny mechanism's (not shown) of bus system 112 and/or other forms.It should be noted that electronic equipment shown in FIG. 1 100 component and structure be it is illustrative, and not restrictive, as needed, the electronic equipment also can have other Component and structure.

The processor 102 can use digital signal processor (DSP), field programmable gate array (FPGA), can compile At least one of journey logic array (PLA) example, in hardware realizes that the processor 102 can be central processing unit (CPU) or one or more of the processing unit of other forms with data-handling capacity and/or instruction execution capability Combination, and can control other components in the electronic equipment 100 to execute desired function.

The storage device 104 may include one or more computer program products, and the computer program product can To include various forms of computer readable storage mediums, such as volatile memory and/or nonvolatile memory.It is described easy The property lost memory for example may include random access memory (RAM) and/or cache memory (cache) etc..It is described non- Volatile memory for example may include read-only memory (ROM), hard disk, flash memory etc..In the computer readable storage medium On can store one or more computer program instructions, processor 102 can run described program instruction, to realize hereafter institute The client functionality (realized by processor) in the embodiment of the present invention stated and/or other desired functions.In the meter Can also store various application programs and various data in calculation machine readable storage medium storing program for executing, for example, the application program use and/or The various data etc. generated.

The input unit 106 can be the device that user is used to input instruction, and may include keyboard, mouse, wheat One or more of gram wind and touch screen etc..

The output device 108 can export various information (for example, image or sound) to external (for example, user), and It and may include one or more of display, loudspeaker etc..

Described image acquisition device 110 can shoot the desired image of user (such as photo, video etc.), and will be clapped The image taken the photograph is stored in the storage device 104 for the use of other components.

Illustratively, for realizing object detection method according to an embodiment of the present invention, the exemplary electron of apparatus and system Equipment may be implemented as the intelligent terminals such as smart phone, tablet computer, computer.

Embodiment two:

A kind of target detection flow chart shown in Figure 2, this method can be held by the electronic equipment that previous embodiment provides Row, this method specifically comprise the following steps:

Step S202 obtains target image to be detected.It wherein, include target pair to be detected in the target image As.Wherein, the classification of target object can sets itself according to actual needs, such as setting target object can for people, cat, Vehicle etc..

In one embodiment, the picture frame that the image capture devices such as camera can be directly acquired is as mesh Logo image.In another embodiment, initial pictures to be detected can be obtained by camera first, then to initial graph As being pre-processed, target image is obtained；That is, being pre-processed to the picture frame that camera directly acquires, after pretreatment Image as target image.Wherein, pretreatment may include the image processing operations such as whitening operation.Wherein, whitening operation It can be described as averaging operation, primary operational process is that each channel of initial pictures is subtracted to corresponding preset in the channel to be averaged Value, then again divided by the corresponding variance in the channel, to obtain pretreated image.By being located in advance to initial pictures Reason, obtains satisfactory image, can effectively accelerate to detect speed.Such as, target image is carried out using neural network When detection, the convergence rate of neural network is helped speed up through handling obtained target image.

Step S204 carries out feature extraction to target image, generates fisrt feature figure；Wherein, fisrt feature figure includes The characteristic information of different scale.

In the specific implementation, fisrt feature map generalization mode can be with are as follows: target image is input to base neural network； Multistage feature extraction is carried out to target image by base neural network, obtains the characteristic information of different scale；Wherein, each The scale for the characteristic information that stage extracts is different；It is special that the corresponding feature fusion of multiple specified phases is finally formed first Sign figure.

Above-mentioned base neural network is the backbone network that can extract characteristics of image, and main function is to extract image spy Sign, and generate characteristic pattern.In order to promote detection efficiency, it can choose the lightweights such as Xception or ShuffleNet Feature extraction network is as the base neural network in the present embodiment.Wherein, lightweight feature extraction network is characterized in that net Network structure is simple, memory requirements is low, operand is smaller and detection efficiency is higher.

Step S206 carries out region candidate identification to fisrt feature figure, obtains the candidate region information of target image.Its In, candidate region information may include location information and confidence level of multiple candidate regions etc..Candidate region namely target image Middle may include the region of target object.

Such as, fisrt feature figure is input to region candidate and generates network (Region Proposal Network, RPN)； Network is generated by region candidate, feature extraction is carried out to fisrt feature figure, obtain intermediate features figure, and to the intermediate features Figure carries out candidate region identification, obtains the candidate region information of target image.Step S206 can be shown that the embodiment of the present invention provides Object detection method specifically use the basic principle of two-stage detection algorithm.Wherein, two-stage detection algorithm is substantially former Reason are as follows: the candidate region (and can be described as candidate frame) first in forecast image, being then based on candidate region prediction includes target The target area (and can be described as detection block) of object.For the stage detection algorithm for directly predicting target area, two The detection accuracy of stage detection algorithm is more preferable.The networks mould such as Faster R-CNN, R-FCN can be used in two-stage detection algorithm Type is realized, wherein it is that network is commonly used used by various two-stage detection models that region candidate, which generates network, multiple for generating Candidate region.

Step S208 generates testing result according to candidate region information and fisrt feature figure；Testing result includes target figure Target category and/or target position as in.

Above-mentioned object detection method provided in an embodiment of the present invention can carry out feature extraction to the target image of acquisition, Generation includes the fisrt feature figure of the characteristic information of different scale；Then region candidate identification is carried out to fisrt feature figure, obtained To candidate region information, and then testing result can be generated according to candidate region information and fisrt feature figure.The present embodiment provides Aforesaid way can using the characteristic information of different scale carry out target detection, detection effect be effectively promoted.

When carrying out feature extraction to target image using base neural network, it is to be understood that the spy of neural network Sign extraction process generally includes multiple stages, and each stage can (characteristic information can to the characteristic information that the last stage obtains To be embodied in the form of characteristic pattern) feature is further extracted, obtain the stage corresponding characteristic information, the spy that different phase obtains The scale of reference breath is different.In order to incorporate the semantic information and contextual information of different scale during target detection, The present embodiment chooses several stages as specified phases, by the corresponding spy of specified phases from multiple stages of base neural network Reference breath fusion forms fisrt feature figure.

In one embodiment, specified phases can based on neural network most latter two stage, such as, if base Plinth neural network altogether there are four the stage, then selects phase III and fourth stage as specified rank in characteristic extraction procedure Section.When the corresponding feature fusion of multiple specified phases is formed fisrt feature figure, it is referred to following steps realization:

Step 1, the fisrt feature information that the penultimate stage of base neural network extracts is obtained；

Step 2, the second feature information that the last stage of base neural network extracts is obtained；

Step 3, global pool operation is carried out to second feature information, obtains third feature information；

Step 4, enhance network by context to merge fisrt feature information, second feature information and third feature information Form fisrt feature figure.

By the above-mentioned means, being capable of effectively multi-scale expression target image.It is understood that extracting characteristics of image When, if taking the feature detection mode of fixed size, the testing result for being biased to the scale will be obtained, and the other scales of missing inspection Feature.Based on this, the embodiment of the present invention carries out multistage feature extraction to target image by base neural network, can will scheme As being detected and being matched on multiple scales, so that the characteristic information that the fisrt feature figure made is included is more accurate.

In one embodiment, a kind of structural schematic diagram of context shown in Figure 3 enhancing network, context increase Strong network may include the first parallel convolutional layer, the second convolutional layer and third convolutional layer；Wherein, the output end of the second convolutional layer It is also connected with up-sampling operation layer, the output end of third convolutional layer is also connected with broadcast operation layer.The output end of first convolutional layer, The output end of the output end and broadcast operation layer that up-sample operation layer is connected with add operation layer jointly.

The parameter of first convolutional layer, the second convolutional layer and third convolutional layer can be identical or different, and such as, all selection includes There is the convolutional layer that 245 sizes are the convolution kernel of 1*1 to realize, so that the characteristic information received to be passed through to the convolution kernel pressure of 1*1 It is condensed to 245 channels.

Fisrt feature information, second feature information and third feature information are merged and to be formed enhancing network by context When fisrt feature figure, it is referred to following steps realization:

(1) by fisrt feature information input to the first convolutional layer, by second feature information input to the second convolutional layer, and By third feature information input to third convolutional layer；That is, the characteristic information that different phase is obtained be separately input to it is corresponding In convolutional layer；

(2) convolution operation is carried out to fisrt feature information by the first convolutional layer, it is special obtains first with specified scale Reference breath；Such as, fisrt feature Informational Expression is the characteristic pattern that size is 20*20, passes through the 1*1 convolution operation of the first convolutional layer Afterwards, the characteristic pattern that specified size is 20*20 is obtained.

(3) convolution operation and up-sampling are successively carried out to second feature information by the second convolutional layer and up-sampling operation layer Operation obtains the second feature information with specified scale；Such as, second feature Informational Expression is the feature that size is 10*10 Figure is obtained after the 1*1 convolution operation of the second convolutional layer and twice of up-sampling (2*Upsample) of up-sampling operation layer The characteristic pattern that specified size is 20*20.

(4) convolution operation and setting-up exercises to music are successively carried out to third feature information by third convolutional layer and broadcast operation layer Make, obtains the third feature information with specified scale；Such as, third feature Informational Expression is the characteristic pattern of 1*1, passes through third After the 1*1 convolution operation of convolutional layer and the broadcast operation of broadcast operation layer, the characteristic pattern that specified size is 20*20 is obtained.

(5) by add operation layer to fisrt feature information, the second feature with specified scale with specified scale Information and third feature information with specified scale sum up, and form fisrt feature figure.

Same scale (size) is converted to by the characteristic information for obtaining different phase, so that different phase is obtained Characteristic information sums up, and obtains finally comprising there are many characteristic patterns of semantic information and contextual information.

In order to further enhance target detection speed, network is generated compared to conventional region candidate, the present embodiment provides A kind of region candidate that structure is simplified generates network, the region candidate generate network include sequentially connected channel convolutional layer and Volume Four lamination.Such as, the channel convolution that it is 5*5 including 1 size which, which can be, which can be with The convolution kernel for being 1*1 including 256 sizes, and can be described as the 1*1 Standard convolution in 256 channels.It is generated by this region candidate Network can easily identify that the candidate region in characteristic pattern such as can generate up to 200 candidate regions.

In order to further promote target detection precision, according to candidate region information and fisrt feature figure, detection is generated As a result a kind of embodiment, which may is that, generates network in the mistake for generating candidate region information for fisrt feature figure and region candidate Intermediate features figure obtained in journey is input to spatial attention network；By spatial attention network by fisrt feature figure and centre Characteristic pattern merges to form second feature figure；Wherein, the foreground features of second feature figure are better than background characteristics, can be regarded as again The characteristic value (abbreviation foreground features value) of the foreground area of two characteristic patterns is higher than characteristic value (the abbreviation background characteristics of background area Value).According to candidate region information and second feature figure, testing result is generated.Exist it is understood that region candidate generates network It includes foreground information and background information that it is also potential, which to generate obtained intermediate features figure during the information of candidate region, In, foreground information can be understood as the information of target object region again, and background information can be regarded as not comprising target pair The information in the region of elephant；Spatial attention network is based on intermediate features figure, can be to the feature of the foreground area in fisrt feature figure Carry out enhancing processing (such as, increasing foreground features value), the feature of background area carries out weakening process (such as, reduction background spy Value indicative), to obtain the second feature figure that foreground features are better than background characteristics.For ease of understanding, as follows in this simple examples: false If the foreground features value in fisrt feature figure is 0.5, background characteristics value is 0.4, and the discrimination of foreground area and background area is not Greatly；But after spatial attention network processes, foreground features value can be promoted to 0.6, background characteristics value is reduced to 0.1, so that foreground features are significantly stronger than background characteristics, the discrimination of foreground area and background area is increased, and effectively highlight Foreground area out.It should be noted that the foreground features value in fisrt feature figure has been illustrated the above is only exemplary illustration Slightly better than background characteristics value but when numerical value is not much different, spatial attention network can still enhance further foreground features value, into one Step weakens background characteristics value, to increase the discrimination between foreground area and background area.Certain fisrt feature figure also will appear Foreground features value be lower than background characteristics value the case where, at this time spatial attention network can greatly be promoted foreground features value with And background characteristics value is reduced, so that foreground features value be made to be higher than background characteristics value, details are not described herein.Space is used by this Attention network enhances foreground features, weakens the mode of background characteristics, facilitates the feature for enhancing candidate region, namely after being allowed to The feature for the candidate region directly extracted on second feature figure more highlights, so that testing result is more accurate.

In a kind of embodiment, spatial attention network includes sequentially connected 5th convolutional layer and activation primitive layer；Swash The output end of function layer living is connected with multiplying layer.Such as, the 5th convolutional layer may include the volume that 245 sizes are 1*1 Product core, the activation primitive layer can be realized using Sigmoid activation primitive.In another embodiment, the 5th convolutional layer and Batch normalization layer (BatchNorm) is also connected between activation primitive layer.

A kind of structural schematic diagram of spatial attention network as shown in connection with fig. 4, is further described and passes through spatial attention Network generation second feature figure specific implementation: intermediate features figure is input to the 5th convolutional layer, by the 5th convolutional layer, Batch normalization layer and activation primitive layer are successively handled intermediate characteristic pattern, obtain the output of activation primitive layer after processing Intermediate features figure；Wherein, the foreground features of intermediate features figure after processing are better than background characteristics；By fisrt feature figure and through locating Intermediate features figure after reason is input to multiplying layer；Enhancing by multiplying layer to fisrt feature figure and each candidate region Feature carries out multiplying, generates second feature figure.Intermediate features figure after processing can embody the feature power in each region Weight, above-mentioned spatial attention, which generates network, to be weighted (re- to the feature in fisrt feature figure based on feature weight Weight), the foreground features in second feature figure obtained after weighting are better than background characteristics, help to highlight target location Domain promotes the accuracy of object detection results.

After generating second feature figure, candidate region information and second feature figure can be input to candidate region feature Extract layer；It is based on candidate region information by candidate region feature extraction layer, each candidate region is extracted on second feature figure Provincial characteristics；The provincial characteristics for being then based on each candidate region carries out target detection, generates testing result.Candidate region mentions It takes layer when extracting provincial characteristics, one of such as following operation can be executed to candidate region；RoI pooling (Region of Interest pooling, area-of-interest pond) operation, PSRoI pooling (Position Sensitive Region of Interest pooling, the area-of-interest pond of position sensing) operation, RoI align (Region of Interest align, area-of-interest alignment) operation or PSRoI align (Position Sensitive Region of Interest align, the area-of-interest alignment of position sensing) operation etc..

When extracting the provincial characteristics of each candidate region by candidate region feature extraction layer, then each candidate can be based on The provincial characteristics in region carries out target detection, specifically, can be special by region of the classification sub-network to each candidate region Sign carries out classification processing, determines the target category in target image；And/or by returning sub-network to each candidate region Provincial characteristics carries out recurrence processing, obtains the target position in target image.

In order to further promote target detection efficiency, shorten detection time, the classification sub-network that the present embodiment uses It can be a full articulamentum with sub-network is returned.Wherein, the port number as the full articulamentum of classification sub-network can be Classification number；Port number as the full articulamentum for returning sub-network can be 4 channels.In addition to this, classification sub-network and recurrence It can be connected to a full articulamentum again before sub-network, by the full articulamentum first to the provincial characteristics of each candidate region into one Step extracts feature, can preferably be divided for the provincial characteristics further extracted to classify sub-network with sub-network is returned Class and frame return.In one embodiment, classification sub-network and the full articulamentum before returning sub-network can have 1024 Channel.

For ease of understanding, a kind of structural schematic diagram of target detection model shown in fig. 5, the target detection mould be may refer to Type can be used for realizing above-mentioned object detection method, specifically illustrate base neural network, context enhancing network, space transforms Power network, region candidate generate network, candidate region feature extraction layer, classification sub-network and the connection relationship for returning sub-network, Details are not described herein for the specific effect of each network.It should be noted that target detection model shown in fig. 5 is only a kind of example, In practical applications, adaptability doses other network structures or deletes portion in target detection model that can be shown in Fig. 5 Subnetwork structure.

In conclusion the above-mentioned object detection method provided through this embodiment, can be had using context enhancing network Effect combines the semantic information and contextual information of different scale, so that characteristic pattern includes the characteristic information there are many scale；Using Spatial attention network can the feature to candidate region carry out enhancing processing, so as to preferably based on the time with Enhanced feature Favored area carries out target detection, obtains more accurate testing result.Moreover, the base neural network that the present embodiment uses is light Quantative feature extracts network, context enhancing network, spatial attention network, classification sub-network and the recurrence that the present embodiment proposes Sub-network structure is simplified, and operand is smaller, effectively improves detection efficiency.In conclusion the above-mentioned target that the present embodiment proposes Detection method can effectively promote target detection precision and target detection speed.

Embodiment three:

A kind of specific example using preceding aim detection method is present embodiments provided, is specifically illustrated a kind of based on deep The object detection system (and can be described as target detection model) of neural network is spent, which is mainly to current light Magnitude two-stage algorithm of target detection (such as, Light-Head R-CNN) is improved, efficient, high-precision to realize Target detection.

Generally speaking, object detection system provided in an embodiment of the present invention mainly includes following three modules: image is located in advance Manage module, region candidate (Region Proposal) extraction module and region candidate identification module.Wherein, image preprocessing mould Block is responsible for pre-processing the image (that is, aforementioned original image) of input, and region candidate extraction module mainly uses convolution Neural network generates potential target area (that is, aforementioned candidates region), and region candidate identification module mainly uses nerve The region candidate that network extracts region candidate extraction module identifies, obtains testing result to the end.In practical applications, Object detection system can also be not provided with image pre-processing module, but original image is directly input to region candidate and extracts mould Block.The main function of image pre-processing module is to accelerate target detection speed.

Specifically, present embodiments providing a kind of structural block diagram of object detection system as shown in FIG. 6, Fig. 6 Fig. 5 A kind of specific implementation, more visualize and embody for Fig. 5, below in conjunction with Fig. 6 to the present embodiment provides Lightweight two-stage detection method be further described below:

Step 1: image procossing

Whitening operation is carried out to image to be detected, obtains the target image that can be input to neural network, and by target figure As being scaled 320*320 pixel size.Specifically, input (Input) namely target image shown in fig. 6, size 320* 320*3。

Step 2: candidate region is extracted

Above-mentioned target image is input to base neural network (and can be described as backbone network, backbone), passes through basis The feature of neural network extraction target area.In order to improve detection efficiency, the basic network Xception of lightweight can be used It is realized with ShuffleNet.

In order to enhance the character representation ability of object detection system, object detection system shown in fig. 6 is also illustrated Hereafter enhance network (and can be described as context enhancing module, Context Enhancement Module, CEM), to merge The semantic information and contextual information of different scale.The context as shown in Figure 3 provided in conjunction with last embodiment enhances network Structural schematic diagram, a kind of fisrt feature figure shown in Figure 7 generates schematic diagram, and it makes use of the in base neural network the 3rd The characteristic pattern C that a stage generates₄(scale 20*20), the characteristic pattern C that the 4th stage generates₅(scale 10*10), Yi Jite Sign figure C₅Through the characteristic pattern C after global pool (global avg pooling)_glb(scale 1*1), characteristic pattern C₄Through upper and lower The first convolutional layer in text enhancing network carries out 1*1 convolution and 245 channels of boil down to, obtains the C that scale is 20*20_4—lat； Characteristic pattern C₅1*1 convolution, and 245 channels of boil down to are carried out through the second convolutional layer in context enhancing network, are passed through again later It crosses up-sampling operation layer and carries out twice of up-sampling operation (2*Upsample), obtain the C that scale is 20*20_5—lat, characteristic pattern C_glb 1*1 convolution, and 245 channels of boil down to are carried out through the third convolutional layer in context enhancing network, later using setting-up exercises to music Make layer and carry out broadcast operation (Broadcast), obtains the C that scale is 20*20_glb—lat, final C_4—lat、C_5—latAnd C_glb—latIt is logical The addition of add operation layer is crossed, fisrt feature figure CEM_fm (size 20*20*245) is obtained.

Later, fisrt feature figure CEM_fm is input to region candidate and generates network RPN, it is potential to be generated by RPN Target frame (bounding box), potential target frame are aforementioned candidates region.In order to promote computational efficiency, region candidate is raw It only include the 1x1 Standard convolution of the channel 5x5 convolution (a depthwise convolution) and 256 channels at network. In the specific implementation, network is generated by region candidate, every picture can produce up to 200 candidate regions.Specifically, Illustrate that region candidate generates network and is based on fisrt feature figure CEM_fm generation intermediate features figure RPN_fm (size 20* in Fig. 6 20*256), it then also illustrates and Rols (Region of Interest, area-of-interest) is generated by RPN_fm, also to obtain the final product To candidate region information.

Step 3: identification candidate region

In order to further enhance the character representation ability of object detection system, object detection system shown in fig. 6 is also illustrated Spatial attention network (and can be described as space transforms power module, Spatial Attention Module, SAM) is gone out, has been used to The feature in fisrt feature figure generated to context enhancing network is weighted (re-weight).Specifically, in conjunction with Fig. 4 Shown in a kind of structural schematic diagram of spatial attention network, the present embodiment illustrates a kind of second feature as shown in Figure 8 again Figure generates schematic diagram, illustrates that the intermediate features figure RPN_fm of RPN output successively passes through 1*1 convolutional layer, BatchNorm normalizing layer With Sigmoid active coating, element multiplication is carried out with fisrt feature figure CEM_fm, obtains second feature figure SAM_fm.Such as Fig. 6 institute Show, the size of the resulting second feature figure SAM_fm of spatial attention network is 20*20*245.

Later, second feature figure SAM_fm progress such as RoI pooling operation, PSRoI pooling can be operated, RoI align operation or PSRoI align operation etc., to extract provincial characteristics.As shown in fig. 6, based on Rols to second feature Figure SAM_fm executes PSROI Align operation (for sake of simplicity, not illustrating the candidate regions for executing PSROI Align operation in Fig. 6 Characteristic of field extract layer), the provincial characteristics Rol_fm (size 7*7*5) of each candidate region is obtained, and utilize R-CNN sub-network pair Each candidate region is identified that identification includes that classification (classification) and frame return (bounding box Regression) two tasks, finally obtain classification results and regression result.In practical applications, R-CNN sub-network can be first First include the full articulamentum (FC, fully-connected layer) in one layer of 1024 channel, then includes connecing to connect entirely at this parallel Two full articulamentums after layer are connect, one for classifying, port number is identical as classification number, another is returned for frame, Namely the coordinate of target frame is calculated, port number is 4 channels.For sake of simplicity, Fig. 6 be symbol illustrate one 1024 it is logical The full articulamentum FC in road can be used for carrying out feature to the provincial characteristics of candidate region extracting again, the feature that then will be extracted again Classification is carried out respectively and frame returns, and obtains classification results and regression result.In order to verify light weight provided in an embodiment of the present invention The performance of grade two-stage object detection method, by object detection method provided in an embodiment of the present invention on MS COCO data set It is compared with existing lightweight object detection method, the results are shown in Table 1.

Table 1

AP (Average Precision) in table 1 is represented by the average detected precision of each object detection method, MFLOPs is represented by calculation amount when each object detection method obtains testing result.From table 1 it follows that the present invention is implemented The object detection method (Mobile Light-Head R-CNN shown in last three row) that example provides can use less than half of Calculation amount realizes identical even preferably Detection accuracy；And under close calculation amount, target inspection provided in an embodiment of the present invention Obvious better Detection accuracy may be implemented in survey method.That is, object detection method provided in an embodiment of the present invention is effectively Improve target detection speed and target detection precision.

Example IV:

For object detection method provided in embodiment two, the embodiment of the invention provides a kind of target detection dresses It sets, a kind of structural block diagram of object detecting device shown in Figure 9, comprising:

Image collection module 902, for obtaining target image to be detected；

Fisrt feature figure generation module 904 generates fisrt feature figure for carrying out feature extraction to target image；Wherein, Fisrt feature figure includes the characteristic information of different scale；

Candidate identification module 906 obtains the candidate regions of target image for carrying out region candidate identification to fisrt feature figure Domain information；

Detection module 908, for generating testing result according to candidate region information and fisrt feature figure；Testing result packet Containing in target image target category and/or target position.

Above-mentioned object detecting device provided in an embodiment of the present invention can carry out feature extraction to the target image of acquisition, Generation includes the fisrt feature figure of the characteristic information of different scale；Then region candidate identification is carried out to fisrt feature figure, obtained To candidate region information, and then testing result can be generated according to candidate region information and fisrt feature figure.The present embodiment provides Aforesaid way can using the characteristic information of different scale carry out target detection, detection effect be effectively promoted.

In one embodiment, above-mentioned image collection module 902 is used for: obtaining initial pictures to be detected；To initial Image is pre-processed, and target image is obtained；Wherein, pretreatment includes whitening operation.

In one embodiment, above-mentioned fisrt feature figure generation module 904 is used for: target image is input to basic mind Through network；Multistage feature extraction is carried out to target image by base neural network, obtains the characteristic information of different scale；Its In, the scale for the characteristic information that each stage extracts is different；The corresponding feature fusion of multiple specified phases is formed the One characteristic pattern.

In one embodiment, above-mentioned fisrt feature figure generation module 904 is further used for: obtaining base neural network Penultimate stage extract fisrt feature information；Obtain the second spy that the last stage of base neural network extracts Reference breath；Global pool operation is carried out to second feature information, obtains third feature information；Enhance network for the by context One characteristic information, second feature information and third feature information merge to form fisrt feature figure.

In one embodiment, above-mentioned context enhancing network include the first parallel convolutional layer, the second convolutional layer and Third convolutional layer；Wherein, the output end of the second convolutional layer is also connected with up-sampling operation layer, and the output end of third convolutional layer also connects It is connected to broadcast operation layer；The output end of the output end of first convolutional layer, the output end for up-sampling operation layer and broadcast operation layer is total It is same to be connected with add operation layer.

In one embodiment, above-mentioned fisrt feature figure generation module 904 is further used for: fisrt feature information is defeated Enter to the first convolutional layer, by second feature information input to the second convolutional layer, and third feature information input to third is rolled up Lamination；Convolution operation is carried out to fisrt feature information by the first convolutional layer, obtains the fisrt feature information with specified scale； By the second convolutional layer and up-sampling operation layer successively carries out convolution operation to second feature information and up-sampling operates, and is had There is the second feature information of specified scale；Convolution successively is carried out to third feature information by third convolutional layer and broadcast operation layer Operation and broadcast operation obtain the third feature information with specified scale；By add operation layer to specified scale Fisrt feature information, the second feature information with specified scale and the third feature information with specified scale sum up place Reason forms fisrt feature figure.

In one embodiment, above-mentioned base neural network is lightweight feature extraction network.

In one embodiment, above-mentioned candidate identification module 906 is used for: it is raw that fisrt feature figure is input to region candidate At network；Network is generated by region candidate, feature extraction is carried out to fisrt feature figure, obtain intermediate features figure, and in described Between characteristic pattern carry out candidate region identification, obtain the candidate region information of target image.

In one embodiment, it includes sequentially connected channel convolutional layer and Volume Four that above-mentioned zone candidate, which generates network, Lamination.

In one embodiment, above-mentioned detection module 908 is for fisrt feature figure and intermediate features figure to be input to Spatial attention network；It merges fisrt feature figure and intermediate features figure to form second feature figure by spatial attention network； Wherein, the foreground features of second feature figure are better than background characteristics；According to candidate region information and second feature figure, detection knot is generated Fruit.

In one embodiment, above-mentioned spatial attention network includes sequentially connected 5th convolutional layer and activation primitive Layer；The output end of activation primitive layer is connected with multiplying layer.

In one embodiment, batch normalization layer is also connected between above-mentioned 5th convolutional layer and activation primitive layer.

In one embodiment, above-mentioned detection module 908 is further used for: intermediate features figure is input to the 5th convolution Layer successively handles intermediate characteristic pattern by the 5th convolutional layer, batch normalization layer and activation primitive layer, obtains activation primitive The intermediate features figure after processing of layer output；Wherein, the foreground features of intermediate features figure after processing are better than background characteristics； Intermediate features figure by fisrt feature figure and after processing is input to multiplying layer；By multiplying layer to fisrt feature figure Intermediate features figure after processing carries out multiplying, generates second feature figure.

In one embodiment, above-mentioned detection module 908 is further used for: by candidate region information and second feature figure It is input to candidate region feature extraction layer；It is based on candidate region information by candidate region feature extraction layer, in second feature figure The upper provincial characteristics for extracting each candidate region；Provincial characteristics based on each candidate region carries out target detection, generates detection As a result.

In one embodiment, above-mentioned detection module 908 is further used for: by classification sub-network to each candidate regions The provincial characteristics in domain carries out classification processing, determines the target category in target image；And/or by returning sub-network to each The provincial characteristics of candidate region carries out recurrence processing, obtains the target position in target image.

In one embodiment, above-mentioned classification sub-network and recurrence sub-network are a full articulamentum.

The technical effect of device provided by the present embodiment, realization principle and generation is identical with previous embodiment, for letter It describes, Installation practice part does not refer to place, can refer to corresponding contents in preceding method embodiment.

In addition, present embodiments provide a kind of object detection system, the system include: image collecting device, processor and Storage device；Image collecting device, for acquiring image to be detected；Computer program, computer journey are stored on storage device Sequence executes preceding aim detection method when being run by processor.

It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description Specific work process, can be with reference to the corresponding process in previous embodiment, and details are not described herein.

Further, a kind of computer readable storage medium is present embodiments provided, is deposited on the computer readable storage medium The step of containing computer program, method provided by above-described embodiment two executed when computer program is run by processor.

The computer program product of object detection method, apparatus and system provided by the embodiment of the present invention, including storage The computer readable storage medium of program code, the instruction that said program code includes can be used for executing previous methods embodiment Described in method, specific implementation can be found in embodiment of the method, details are not described herein.

It, can be with if the function is realized in the form of SFU software functional unit and when sold or used as an independent product It is stored in a computer readable storage medium.Based on this understanding, technical solution of the present invention is substantially in other words The part of the part that contributes to existing technology or the technical solution can be embodied in the form of software products, the meter Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be a People's computer, server or network equipment etc.) it performs all or part of the steps of the method described in the various embodiments of the present invention. And storage medium above-mentioned includes: that USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic or disk.

In the description of the present invention, it should be noted that term " center ", "upper", "lower", "left", "right", "vertical", The orientation or positional relationship of the instructions such as "horizontal", "inner", "outside" be based on the orientation or positional relationship shown in the drawings, merely to Convenient for description the present invention and simplify description, rather than the device or element of indication or suggestion meaning must have a particular orientation, It is constructed and operated in a specific orientation, therefore is not considered as limiting the invention.In addition, term " first ", " second ", " third " is used for descriptive purposes only and cannot be understood as indicating or suggesting relative importance.

Finally, it should be noted that embodiment described above, only a specific embodiment of the invention, to illustrate the present invention Technical solution, rather than its limitations, scope of protection of the present invention is not limited thereto, although with reference to the foregoing embodiments to this hair It is bright to be described in detail, those skilled in the art should understand that: anyone skilled in the art In the technical scope disclosed by the present invention, it can still modify to technical solution documented by previous embodiment or can be light It is readily conceivable that variation or equivalent replacement of some of the technical features；And these modifications, variation or replacement, do not make The essence of corresponding technical solution is detached from the spirit and scope of technical solution of the embodiment of the present invention, should all cover in protection of the invention Within the scope of.Therefore, protection scope of the present invention should be based on the protection scope of the described claims.

Claims

1. a kind of object detection method characterized by comprising

Obtain target image to be detected；

Feature extraction is carried out to the target image, generates fisrt feature figure；Wherein, the fisrt feature figure includes different rulers The characteristic information of degree；

Region candidate identification is carried out to the fisrt feature figure, obtains the candidate region information of the target image；

According to the candidate region information and the fisrt feature figure, testing result is generated；The testing result includes the mesh Target category and/or target position in logo image.

2. the method according to claim 1, wherein the step of acquisition target image to be detected, comprising:

Obtain initial pictures to be detected；

The initial pictures are pre-processed, target image is obtained；Wherein, the pretreatment includes whitening operation.

3. being generated the method according to claim 1, wherein described carry out feature extraction to the target image The step of fisrt feature figure, comprising:

The target image is input to base neural network；

Multistage feature extraction is carried out to the target image by the base neural network, obtains the feature letter of different scale Breath；Wherein, the scale for the characteristic information that each stage extracts is different；

The corresponding feature fusion of multiple specified phases is formed into fisrt feature figure.

4. according to the method described in claim 3, it is characterized in that, described by the corresponding feature fusion of multiple specified phases The step of forming fisrt feature figure, comprising:

Obtain the fisrt feature information that the penultimate stage of the base neural network extracts；

Obtain the second feature information that the last stage of the base neural network extracts；

Global pool operation is carried out to the second feature information, obtains third feature information；

Enhance network by context to melt the fisrt feature information, the second feature information and the third feature information Conjunction forms fisrt feature figure.

5. according to the method described in claim 4, it is characterized in that, context enhancing network includes the first parallel convolution Layer, the second convolutional layer and third convolutional layer；Wherein, the output end of second convolutional layer is also connected with up-sampling operation layer, institute The output end for stating third convolutional layer is also connected with broadcast operation layer；

The output end of the output end of first convolutional layer, the output end of the up-sampling operation layer and the broadcast operation layer is total It is same to be connected with add operation layer.

6. according to the method described in claim 5, it is characterized in that, described enhance network for the fisrt feature by context Information, the second feature information and the third feature information merge the step of forming fisrt feature figure, comprising:

By the fisrt feature information input to first convolutional layer, by the second feature information input to the volume Two Lamination, and by the third feature information input to the third convolutional layer；

Convolution operation is carried out to the fisrt feature information by first convolutional layer, it is special to obtain first with specified scale Reference breath；

By second convolutional layer and the up-sampling operation layer successively to the second feature information carry out convolution operation and Up-sampling operation, obtains the second feature information with the specified scale；

Convolution operation and wide successively is carried out to the third feature information by the third convolutional layer and the broadcast operation layer Operation is broadcast, the third feature information with the specified scale is obtained；

By the add operation layer to the fisrt feature information with the specified scale, second with the specified scale Characteristic information and third feature information with the specified scale sum up, and form fisrt feature figure.

7. according to the described in any item methods of claim 3 to 6, which is characterized in that the base neural network is that lightweight is special Sign extracts network.

8. the method according to claim 1, wherein described carry out candidate region knowledge to the fisrt feature figure Not, the step of obtaining the candidate region information of the target image, comprising:

The fisrt feature figure is input to region candidate and generates network；

Network is generated by the region candidate, feature extraction is carried out to the fisrt feature figure, obtain intermediate features figure, and right The intermediate features figure carries out candidate region identification, obtains the candidate region information of the target image.

9. according to the method described in claim 8, it is characterized in that, it includes sequentially connected logical that the region candidate, which generates network, Road convolutional layer and Volume Four lamination.

10. according to the method described in claim 8, it is characterized in that, described according to the candidate region information and described first The step of characteristic pattern, generation testing result, comprising:

The fisrt feature figure and the intermediate features figure are input to spatial attention network；

The fisrt feature figure and the intermediate features figure are merged and to form second feature figure by the spatial attention network； Wherein, the foreground features of the second feature figure are better than background characteristics；

According to the candidate region information and the second feature figure, testing result is generated.

11. according to the method described in claim 10, it is characterized in that, the spatial attention network includes sequentially connected Five convolutional layers and activation primitive layer；The output end of the activation primitive layer is connected with multiplying layer.

12. according to the method for claim 11, which is characterized in that between the 5th convolutional layer and the activation primitive layer It is also connected with batch normalization layer.

13. according to the method for claim 12, which is characterized in that it is described by the spatial attention network by described the One characteristic pattern and the intermediate features figure merge the step of forming second feature figure, comprising:

The intermediate features figure is input to the 5th convolutional layer, by the 5th convolutional layer, described batch of normalization layer and The activation primitive layer is successively handled the intermediate features figure, obtain activation primitive layer output after processing The intermediate features figure；Wherein, the foreground features of the intermediate features figure after processing are better than background characteristics；

The intermediate features figure by the fisrt feature figure and after processing is input to the multiplying layer；

Multiplying is carried out by the intermediate features figure of the multiplying layer to the fisrt feature figure and after processing, Generate second feature figure.

14. according to the method described in claim 10, it is characterized in that, described according to the candidate region information and described second The step of characteristic pattern, generation testing result, comprising:

The candidate region information and the second feature figure are input to candidate region feature extraction layer；

It is based on the candidate region information by the candidate region feature extraction layer, is extracted on the second feature figure each The provincial characteristics of candidate region；

Provincial characteristics based on each candidate region carries out target detection, generates testing result.

15. according to the method for claim 14, which is characterized in that the provincial characteristics based on each candidate region The step of carrying out target detection, generating testing result, comprising:

Classification processing is carried out by provincial characteristics of the classification sub-network to each candidate region, is determined in the target image Target category；And/or

Recurrence processing is carried out to the provincial characteristics of each candidate region by returning sub-network, is obtained in the target image Target position.

16. according to the method for claim 15, which is characterized in that the classification sub-network and the recurrence sub-network are One full articulamentum.

17. a kind of object detecting device characterized by comprising

Image collection module, for obtaining target image to be detected；

Fisrt feature figure generation module generates fisrt feature figure for carrying out feature extraction to the target image；Wherein, institute State the characteristic information that fisrt feature figure includes different scale；

Candidate identification module obtains the candidate of the target image for carrying out region candidate identification to the fisrt feature figure Area information；

Detection module, for generating testing result according to multiple candidate regions and the fisrt feature figure；The detection knot Fruit includes target category and/or target position in the target image.

18. a kind of object detection system, which is characterized in that the system comprises: image collecting device, processor and storage dress It sets；

Described image acquisition device, for acquiring target image；

Computer program is stored on the storage device, the computer program is executed when being run by the processor as weighed Benefit requires 1 to 16 described in any item methods.

19. a kind of computer readable storage medium, computer program, feature are stored on the computer readable storage medium The step of being, the described in any item methods of the claims 1 to 16 executed when the computer program is run by processor.