CN114359808A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN114359808A
CN114359808A
Authority
CN
China
Prior art keywords
target
area
image frame
estimated
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210015365.6A
Other languages
Chinese (zh)
Inventor
张军伟
李�诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210015365.6A priority Critical patent/CN114359808A/en
Publication of CN114359808A publication Critical patent/CN114359808A/en
Withdrawn legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a target detection method and apparatus, an electronic device, and a storage medium, the method including: acquiring a current image frame to be detected; determining an estimated area of a target object in the current image frame according to a target area where the target object is located in a previous image frame, wherein the acquisition time of the previous image frame is before the current image frame; and carrying out target detection on the pre-estimated area to obtain a target area of the target object in the current image frame. The embodiment of the disclosure can improve the operation efficiency and accuracy of target detection.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
Object detection is a computer technology, related to computer vision and image processing, for detecting certain types of semantic objects (such as people, buildings, or cars) in images and videos. However, the target detection technology in the related art generally needs to run on electronic devices with strong computing capability; when it is applied to electronic devices with weak computing capability, its operation efficiency and detection accuracy drop noticeably.
Disclosure of Invention
The present disclosure provides a technical solution for target detection.
According to an aspect of the present disclosure, there is provided an object detection method including: acquiring a current image frame to be detected; determining an estimated area of a target object in the current image frame according to a target area where the target object is located in a previous image frame, wherein the acquisition time of the previous image frame is before the current image frame; and carrying out target detection on the pre-estimated area to obtain a target area of the target object in the current image frame.
In a possible implementation manner, the determining an estimated region of a target object in the current image frame according to a target region where the target object is located in a previous image frame includes: determining estimated motion information corresponding to the current image frame according to a historical position offset and a historical size offset of a target area in the previous image frame, wherein the historical position offset represents a position offset of the target object between two target areas in two adjacent previous image frames, and the historical size offset represents a size offset of the target object between two target areas in two adjacent previous image frames; and determining the estimated area according to the estimated motion information and the target area in the previous image frame.
In one possible implementation, the predicted motion information includes: the estimated position offset and the estimated size offset corresponding to the current image frame; wherein the determining the estimated region according to the estimated motion information and the target region in the previous image frame comprises: translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region; and utilizing the estimated size offset to zoom the area size of the target area in the previous image frame to obtain the estimated area size of the estimated area.
In a possible implementation manner, the scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region includes: zooming the region size of the target region in the previous image frame by using the estimated size offset to obtain a middle region size; and increasing the size of the middle area according to the specified multiple to obtain the estimated area size.
In a possible implementation manner, the determining, according to a historical position offset and a historical size offset of a target region in the previous image frame, estimated motion information corresponding to the current image frame includes: calculating a weighted average value between the historical position offset of the target area in the previous image frame and the estimated position offset corresponding to the previous image frame to obtain the estimated position offset corresponding to the current image frame; and calculating a weighted average value between the historical size offset of the target area in the previous image frame and the estimated size offset corresponding to the previous image frame to obtain the estimated size offset corresponding to the current image frame.
In a possible implementation manner, the performing target detection on the estimated region to obtain a target region of the target object in the current image frame includes: and performing target detection on the estimated area by using a target detection network to obtain a target area of the target object in the current image frame, wherein the target detection network is obtained by training an initial target detection network through a sample image and annotation information, and the annotation information represents an annotation area of the target object in the sample image.
In one possible implementation, the labeling area has a specified aspect ratio, the initial target detection network indicates a detection area in which the target object is located in the sample image by using an anchor frame, the anchor frame has an initial aspect ratio, and the training of the initial target detection network by the sample image and the labeling information includes: carrying out target detection on the sample image by using the initial target detection network to obtain a detection area output by the initial target detection network; and adjusting network parameters of the initial target detection network and adjusting an initial length-width ratio of an anchor frame adopted by the initial target detection network according to the difference between the detection area and the labeling area to obtain the target detection network, wherein the length-width ratio of the anchor frame adopted by the target detection network is the specified length-width ratio.
In a possible implementation manner, the adjusting an initial aspect ratio of an anchor frame adopted by the initial target detection network according to a difference between the detection area and the labeled area includes: matching each detection area with each labeling area according to the difference between each detection area and each labeling area to obtain a labeling area matched with each detection area; and aiming at any detection area, adjusting the initial length-width ratio of the anchor frame corresponding to the detection area to the specified length-width ratio of the labeling area matched with the detection area.
In one possible implementation, the method further includes: pruning the target detection network to obtain a pruned target detection network; the performing target detection on the pre-estimated region to obtain a target region of the target object in the current image frame includes: and carrying out target detection on the pre-estimated area by using the pruned target detection network to obtain a target area of the target object in the current image frame.
According to an aspect of the present disclosure, there is provided an object detection apparatus including: the acquisition module is used for acquiring a current image frame to be detected; the estimation module is used for determining an estimation area of a target object in the current image frame according to a target area where the target object is located in a previous image frame, wherein the acquisition time of the previous image frame is before the current image frame; and the detection module is used for carrying out target detection on the pre-estimated area to obtain a target area of the target object in the current image frame.
In a possible implementation manner, the estimation module includes: the motion information estimation submodule is used for determining estimated motion information corresponding to the current image frame according to historical position offset and historical size offset of a target area in the previous image frame, wherein the historical position offset represents position offset of the target object between two target areas in two adjacent previous image frames, and the historical size offset represents size offset of the target object between two target areas in two adjacent previous image frames; and the estimation region determining submodule is used for determining the estimation region according to the estimation motion information and the target region in the previous image frame.
In one possible implementation, the predicted motion information includes: the estimated position offset and the estimated size offset corresponding to the current image frame; wherein the determining the estimated region according to the estimated motion information and the target region in the previous image frame comprises: translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region; and utilizing the estimated size offset to zoom the area size of the target area in the previous image frame to obtain the estimated area size of the estimated area.
In a possible implementation manner, the scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region includes: zooming the region size of the target region in the previous image frame by using the estimated size offset to obtain a middle region size; and increasing the size of the middle area according to the specified multiple to obtain the estimated area size.
In a possible implementation manner, the determining, according to a historical position offset and a historical size offset of a target region in the previous image frame, estimated motion information corresponding to the current image frame includes: calculating a weighted average value between the historical position offset of the target area in the previous image frame and the estimated position offset corresponding to the previous image frame to obtain the estimated position offset corresponding to the current image frame; and calculating a weighted average value between the historical size offset of the target area in the previous image frame and the estimated size offset corresponding to the previous image frame to obtain the estimated size offset corresponding to the current image frame.
In one possible implementation manner, the detection module includes: the first detection submodule is used for performing target detection on the estimated area by using a target detection network to obtain a target area of the target object in the current image frame, wherein the target detection network is obtained by training an initial target detection network through a sample image and annotation information, and the annotation information represents an annotation area where the target object is located in the sample image.
In one possible implementation, the labeling area has a specified aspect ratio, the initial target detection network indicates a detection area in which the target object is located in the sample image by using an anchor frame, the anchor frame has an initial aspect ratio, and the training of the initial target detection network by the sample image and the labeling information includes: carrying out target detection on the sample image by using the initial target detection network to obtain a detection area output by the initial target detection network; and adjusting network parameters of the initial target detection network and adjusting an initial length-width ratio of an anchor frame adopted by the initial target detection network according to the difference between the detection area and the labeling area to obtain the target detection network, wherein the length-width ratio of the anchor frame adopted by the target detection network is the specified length-width ratio.
In a possible implementation manner, the adjusting an initial aspect ratio of an anchor frame adopted by the initial target detection network according to a difference between the detection area and the labeled area includes: matching each detection area with each labeling area according to the difference between each detection area and each labeling area to obtain a labeling area matched with each detection area; and aiming at any detection area, adjusting the initial length-width ratio of the anchor frame corresponding to the detection area to the specified length-width ratio of the labeling area matched with the detection area.
In one possible implementation, the apparatus further includes: the pruning module is used for carrying out pruning processing on the target detection network to obtain a pruned target detection network; wherein, the detection module includes: and the second detection submodule is used for carrying out target detection on the pre-estimated area by using the pruned target detection network to obtain a target area of the target object in the current image frame.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, the estimated area of the target object in the current image frame is determined according to the target area where the target object is located in the previous image frame, which is equivalent to roughly estimating the local area where the target object is likely located in the current image frame; target detection is then performed on the estimated area to obtain the target area of the target object in the current image frame, which is equivalent to reducing the detection range of the target detection and thus helps to improve the operation efficiency and accuracy of target detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a target detection method according to an embodiment of the present disclosure.
FIG. 2 shows a schematic diagram of an object detection framework in accordance with an embodiment of the present disclosure.
FIG. 3 is a schematic diagram illustrating a target anchor frame generation flow according to an embodiment of the present disclosure.
Fig. 4 shows a flow diagram of a target detection method according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of an object detection apparatus according to an embodiment of the present disclosure.
Fig. 6 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure.
Fig. 7 illustrates a block diagram of another electronic device 1900 in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
If the technical solution of the present application relates to personal information, a product applying the technical solution of the present application clearly informs the user of the personal information processing rules and obtains the individual's separate consent before processing the personal information. If the technical solution of the present application relates to sensitive personal information, a product applying the technical solution of the present application obtains the individual's separate consent before processing the sensitive personal information and also satisfies the requirement of "express consent". For example, at a personal information collection device such as a camera, a clear and prominent notice is set up to inform that the device is within a personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, this is regarded as consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, with the personal information processing rules announced by means of obvious signs or information, personal authorization is obtained through a pop-up window or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.
Fig. 1 shows a flowchart of an object detection method according to an embodiment of the present disclosure, which may be performed by an electronic device such as a terminal device or a server, where the terminal device may be an intelligent robot (e.g., an intelligent education robot, an intelligent logistics robot, etc.), a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc., and the method may be implemented by a processor calling a computer readable instruction stored in a memory, or may be performed by the server. As shown in fig. 1, the target detection method includes:
in step S11, a current image frame to be detected is acquired.
In a possible implementation manner, the current image frame may be an image frame acquired by an image acquisition device, an image frame extracted from a local storage, or an image frame transmitted by another electronic device, which is not limited by the embodiment of the present disclosure. It should be understood that the image frame may be continuously sampled frame data in the video data, may also be frame data sampled at intervals in the video data, and may also be image data obtained by shooting alone, which is not limited by the embodiment of the present disclosure.
The image capturing device may be disposed in the electronic device as a component, or may be connected to the electronic device through a wired connection (e.g., a USB connection) or a wireless connection (e.g., a WiFi connection) to transmit the captured image frames to the electronic device. The image capturing device may include various cameras, for example, a monocular camera, a binocular camera, a wide-angle camera, a telephoto camera, and the like, which are not limited to the embodiments of the present disclosure.
In step S12, an estimated region of the target object in the current image frame is determined according to the target region where the target object is located in the previous image frame, which is prior to the current image frame at the acquisition time.
The target object may include at least one of a person, a vehicle, an object, a plant, an animal, and the like; the previous image frame may be understood as an image frame in which a target region has been detected before the current image frame. The estimation area of the target object in the current image frame is determined according to the target area where the target object is located in the previous image frame, and the estimation area where the target object is approximately located in the current image frame is predicted by using the determined target area in the previous image frame, so that the detection range of target detection can be reduced, and the target detection efficiency and the accuracy are improved.
In a possible implementation manner, a motion estimation algorithm known in the art, such as an optical flow method, a feature matching algorithm, a global search algorithm, and the like, may be used to determine an estimated area of the target object in the current image frame according to the target area where the target object is located in the previous image frame, that is, to predict the estimated area where the target object is located in the current image frame according to the motion information of the target object in the previous image frame.
In step S13, target detection is performed on the estimated area to obtain a target area of the target object in the current image frame.
In a possible implementation manner, a target detection network may be used to perform target detection on the estimated area, so as to obtain a target area of the target object in the current image frame, wherein the target detection network may be used to detect a target object in an image. It should be understood that the embodiments of the present disclosure do not limit the network structure, network type, training mode, and the like of the object detection network; for example, the object detection network may employ RCNN (Region-based Convolutional Neural Network), Fast RCNN, Faster RCNN, and the like.
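To make the narrowed detection range concrete, the following is a minimal sketch, in Python, of restricting detection to the estimated region and mapping the result back to full-frame coordinates. The function name, the (x, y, w, h) box convention, and the detect_fn callable are assumptions for illustration; detect_fn stands in for any detection network and is assumed to return boxes relative to the crop it receives.

```python
def detect_in_estimated_region(frame, est_region, detect_fn):
    # est_region: estimated region as (x, y, w, h) in full-frame pixel coordinates (assumed convention)
    ex, ey, ew, eh = (int(v) for v in est_region)
    crop = frame[ey:ey + eh, ex:ex + ew]      # restrict detection to the estimated region
    boxes = detect_fn(crop)                   # boxes returned in crop coordinates
    # translate the detected boxes back into full-frame coordinates
    return [(bx + ex, by + ey, bw, bh) for (bx, by, bw, bh) in boxes]
```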
FIG. 2 shows a schematic diagram of an object detection framework in accordance with an embodiment of the present disclosure. The target detection framework shown in fig. 2 may be a network framework corresponding to Faster RCNN, and the network framework includes a plurality of modules, such as a feature extraction module, a candidate anchor frame generation module, a candidate anchor frame classification module, and a target anchor frame fine-tuning module. The feature extraction module is used for extracting a shared feature map of the input image; the candidate anchor frame generation module is used for generating a large number of candidate anchor frames based on the shared feature map; the candidate anchor frame classification module is used for classifying the candidate anchor frames to obtain target anchor frames containing target objects; and the target anchor frame fine-tuning module is used for fine-tuning the size and/or the position of the target anchor frame. In order to better deploy the target detection network on electronic equipment with limited computing capability, the original model framework can be decoupled into independent modules, and each independent module can be optimized separately.
The feature extraction module may adopt a lightweight MobileNet (a lightweight deep neural network framework); the candidate anchor frame generation module may adopt an RPN (Region Proposal Network), which sets a large number of candidate anchor frames at the scale of the original estimated region. The candidate anchor frame classification module is then used to judge which candidate anchor frames contain a target object and which do not. The input feature map of the RPN is the shared feature map output by the feature extraction module in fig. 2, which may be the shared feature map extracted from the estimated region.
Fig. 3 is a schematic diagram illustrating a target anchor frame generation flow according to an embodiment of the present disclosure. As shown in fig. 3, the candidate frame generation flow of the RPN may be briefly described as follows: first, one 3 × 3 convolution is performed on the shared feature map of the estimated area to obtain a 256 × 16 × 16 feature map, which can also be regarded as 16 × 16 feature vectors of 256 dimensions each; then two 1 × 1 convolutions are performed to obtain a (2 × 9) × 16 × 16 feature map and a (4 × 9) × 16 × 16 feature map, that is, 9 × 16 × 16 results, each of which contains 2 scores and 4 coordinates. The 2 scores are the foreground (target object) score and the background score; because the RPN generates a large number of candidate anchor frames and does not need to judge the category, it only uses these two scores to distinguish whether a candidate anchor frame contains a target object. The 4 coordinates refer to the offsets of each result with respect to the pixel coordinates in the estimated area. Finally, the candidate anchor frame classification module classifies the large number of candidate anchor frames to obtain target anchor frames containing the target object.
It should be understood that one shared feature map corresponds to H × W (e.g., 16 × 16 above) feature points, any feature point has a mapping relationship with a pixel point in an original image, and since the size of the pre-estimated region is different from that of the shared feature map, a feature point on the shared feature map corresponding to a pixel point in the pre-estimated region can be understood as a frame, so that the upper left corner of the frame or the center of the frame can be used as an anchor point, and then K candidate anchor frames with specified aspect ratios are generated, that is, K candidate anchor frames generated by one anchor point; in other words, each feature point has K candidate anchor frames corresponding to the prediction region, so that there are H × W × K candidate anchor frames in the prediction region, and the output result of the RPN network is actually used to determine whether the candidate anchor frames are target objects or not, or to distinguish whether the candidate anchor frames are target objects or backgrounds.
The length-to-width ratios of the K anchor frames are predetermined. For example, 9 combinations are provided and K is equal to 9, so that there are H × W × 9 results, such as the 18 scores and 36 coordinates above; the anchor frame areas are 128 × 128, 256 × 256 and 512 × 512, the length-to-width ratios are 1:1, 1:2 and 2:1, and combining each area with each ratio yields the 9 anchor frames.
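A minimal sketch of this anchor construction is given below, assuming the example values above (areas 128 × 128, 256 × 256, 512 × 512; ratios 1:1, 1:2, 2:1); the function names and the (x, y, length, width) box convention are illustrative assumptions, not part of the patent.

```python
import math

AREAS = [128 * 128, 256 * 256, 512 * 512]   # example anchor areas from the text
RATIOS = [(1, 1), (1, 2), (2, 1)]           # example length : width ratios

def anchor_shapes(areas=AREAS, ratios=RATIOS):
    """Combine each area with each ratio -> K = 9 (length, width) anchor shapes."""
    shapes = []
    for area in areas:
        for (rl, rw) in ratios:
            width = math.sqrt(area * rw / rl)   # keep the area fixed, impose the ratio
            length = area / width
            shapes.append((round(length), round(width)))
    return shapes

def anchors_for_point(cx, cy, shapes):
    """Candidate anchor frames centred on one anchor point of the shared feature map."""
    return [(cx - l / 2, cy - w / 2, l, w) for (l, w) in shapes]
```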
It should be appreciated that a current image frame may include a plurality of estimated regions, each of which may generate a number of candidate anchor frames, and the subset of candidate anchor frames that contain the target object are valid target anchor frames. The candidate anchor frame classification module may adopt classifiers such as a linear classifier, a neural network classifier, or a support vector machine to classify each candidate anchor frame into one of two categories, target object or background, and may thus be used to screen valid candidate anchor frames as target anchor frames.
In the embodiment of the disclosure, the estimated area of the target object in the current image frame is determined according to the target area where the target object is located in the previous image frame, which is equivalent to roughly estimating the local area where the target object is likely located in the current image frame; target detection is then performed on the estimated area to obtain the target area of the target object in the current image frame, which is equivalent to reducing the detection range of the target detection and thus helps to improve the operation efficiency and accuracy of target detection.
In one possible implementation manner, in step S12, determining an estimated region of the target object in the current image frame according to the target region where the target object is located in the previous image frame includes:
step S121: and determining the estimated motion information corresponding to the current image frame according to the historical position offset and the historical size offset of the target area in the previous image frame.
In one possible implementation, the target region may include a region position and a region size of the target region, where the region position may include a center point or coordinates of any vertex of the target region, and the region size may include a length and a width of the target region. It should be understood that the region location and the region size of the target region together indicate the target region.
Wherein the historical position offset represents the position offset of the target object between two target areas in two adjacent previous image frames. For example, if the area position of the target area in a previous image frame a is (x_a, y_a) and the area position of the target area in the adjacent previous image frame b is (x_b, y_b), the historical position offset may include a horizontal-axis offset Dx and a vertical-axis offset Dy, where Dx may be the difference (x_b - x_a) or the ratio (x_b / x_a) between the two abscissas, and Dy may be the difference (y_b - y_a) or the ratio (y_b / y_a) between the two ordinates.
Wherein the historical size offset represents the size offset of the target object between two target areas in two adjacent previous image frames. For example, if the target region in the previous image frame a has length w_a and width h_a, and the target region in the adjacent previous image frame b has length w_b and width h_b, the historical size offset may include a length offset Dw and a width offset Dh, where Dw may be the difference (w_b - w_a) or the ratio (w_b / w_a) between the two lengths, and Dh may be the difference (h_b - h_a) or the ratio (h_b / h_a) between the two widths.
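As a sketch of these definitions, assuming boxes are given as (x, y, w, h) and using the difference form of the offsets (the ratio form described above would divide instead of subtract):

```python
def historical_offsets(box_prev, box_curr):
    """Offsets of the target area between two adjacent previous image frames a and b."""
    xa, ya, wa, ha = box_prev                 # target region in image frame a
    xb, yb, wb, hb = box_curr                 # target region in the adjacent image frame b
    dx, dy = xb - xa, yb - ya                 # historical position offset (Dx, Dy)
    dw, dh = wb - wa, hb - ha                 # historical size offset (Dw, Dh)
    return (dx, dy), (dw, dh)
```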
In a possible implementation manner, determining the estimated motion information corresponding to the current image frame according to the historical position offset and the historical size offset of the target region in the previous image frame includes: calculating a weighted average value between the historical position offset of the target area in the previous image frame and the estimated position offset corresponding to the previous image frame to obtain the estimated position offset corresponding to the current image frame; and calculating a weighted average value between the historical size offset of the target area in the previous image frame and the estimated size offset corresponding to the previous image frame to obtain the estimated size offset corresponding to the current image frame. By this method, the estimated motion information corresponding to the current image frame can be obtained conveniently and accurately.
In this implementation, the previous image frame may be the image frame immediately preceding the current image frame; that is, the estimated position offset of the current image frame is determined according to a weighted average value between the historical position offset corresponding to the previous image frame and the estimated position offset corresponding to the previous image frame, and the estimated size offset of the current image frame is determined according to a weighted average value between the historical size offset corresponding to the previous image frame and the estimated size offset corresponding to the previous image frame. This approach can be expressed as the following equation:
F_n = β · D_(n-1) + (1 - β) · F_(n-1), n ≥ 3

where n denotes the nth image frame, F_n may represent the estimated motion information corresponding to the current image frame, F_(n-1) may represent the estimated motion information corresponding to the previous image frame, and D_(n-1) represents the historical size offset and the historical position offset corresponding to the previous image frame; β represents a weighting value, which may be determined according to historical experience or experimental tests, and the embodiment of the present disclosure is not limited thereto. In one possible implementation, D_(n-1), F_(n-1), F_n, etc. may be expressed in the form of a matrix or a vector, which is not limited by the disclosed embodiments.
It can be understood that when the current image frame is the 1st or 2nd image frame, there is no corresponding historical position offset and historical size offset, that is, there is no corresponding estimated motion information, and the whole image frame can be directly used as the estimated area where the target is located. When the current image frame is the nth image frame (n ≥ 3), the estimated motion information corresponding to the nth image frame may include: the weighted average value between the historical position offset corresponding to the (n-1)th image frame (namely the historical position offset between the (n-1)th image frame and the (n-2)th image frame) and the estimated position offset corresponding to the (n-1)th image frame, and the weighted average value between the historical size offset corresponding to the (n-1)th image frame (namely the historical size offset between the (n-1)th image frame and the (n-2)th image frame) and the estimated size offset corresponding to the (n-1)th image frame. When n is 3, the estimated motion information corresponding to the 2nd image frame is 0, so the estimated motion information corresponding to the 3rd image frame actually reduces to the weighted historical position offset and historical size offset corresponding to the 2nd image frame (namely, the offsets between the 1st and 2nd image frames).
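A minimal sketch of this recursion, assuming the offsets are packed into 4-vectors (Dx, Dy, Dw, Dh) and using an arbitrary illustrative weighting value β = 0.5:

```python
def estimate_motion(history, beta=0.5):
    """Apply F_n = beta * D_(n-1) + (1 - beta) * F_(n-1) over the offset history.

    history: [D_2, D_3, ...] where D_k = (Dx, Dy, Dw, Dh) between frames k-1 and k.
    Returns F_n, the estimated motion information for the frame after the last D.
    """
    f = [0.0, 0.0, 0.0, 0.0]                  # F_2 = 0: frames 1 and 2 have no estimate
    for d in history:
        f = [beta * dv + (1.0 - beta) * fv for dv, fv in zip(d, f)]
    return f
```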
In a possible implementation manner, determining the estimated motion information corresponding to the current image frame according to the historical position offset and the historical size offset of the target region in the previous image frame may further include: calculating a weighted average value of the plurality of historical position offsets to obtain an estimated position offset corresponding to the current image frame; and calculating the weighted average value of the plurality of historical size offsets to obtain the estimated size offset corresponding to the current image frame. By the method, the estimated motion information corresponding to the current image frame can be updated in time.
In this manner, the historical position offsets and historical size offsets corresponding to at least two previous image frames may be used to determine the estimated motion information. For example, if the current image frame is the 3rd image frame of the video data, the estimated motion information of the 3rd image frame may be determined according to the historical position offset and the historical size offset between the 1st image frame and the 2nd image frame; if the current image frame is the 5th image frame of the video data, the estimated motion information of the 5th image frame may be determined according to the historical position offsets and the historical size offsets between the 1st image frame and the 4th image frame, or according to the historical position offsets and the historical size offsets between the 2nd image frame and the 4th image frame, and so on.
Step S122: and determining an estimated region according to the estimated motion information and the target region in the previous image frame.
As described above, the predicted motion information includes: the estimated position offset and the estimated size offset corresponding to the current image frame; in one possible implementation, determining the predicted region according to the predicted motion information and the target region in the previous image frame includes: translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region; and zooming the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region. By the method, the predicted area of the target object in the current image frame can be effectively predicted.
As described above, the estimated position offset is determined based on historical position offsets, which may include the horizontal-axis offset Dx and the vertical-axis offset Dy, where Dx may be the difference (x_b - x_a) or the ratio (x_b / x_a) between the two abscissas, and Dy may be the difference (y_b - y_a) or the ratio (y_b / y_a) between the two ordinates. Based on this, in a possible implementation manner, translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region may include: in the case that the historical position offset is determined according to the ratio between two coordinates (namely the ratio between two abscissas and the ratio between two ordinates), multiplying the estimated position offset by the region position of the target region in the previous image frame to obtain the estimated region position; or, in the case that the historical position offset is determined according to the difference between two coordinates (namely the difference between two abscissas and the difference between two ordinates), adding the estimated position offset to the region position of the target region in the previous image frame to obtain the estimated region position.
As described above, the estimated size offset is determined based on the historical size offset, which may include the length offset Dw and the width offset Dh, where Dw may be the difference (w_b - w_a) or the ratio (w_b / w_a) between the two lengths, and Dh may be the difference (h_b - h_a) or the ratio (h_b / h_a) between the two widths. Based on this, in a possible implementation manner, scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region may include: in the case that the historical size offset is determined according to the ratio between two sizes (namely the ratio between two lengths and the ratio between two widths), multiplying the estimated size offset by the region size of the target region in the previous image frame to obtain the estimated region size; or, in the case that the historical size offset is determined according to the difference between two sizes (namely the difference between two lengths and the difference between two widths), adding the estimated size offset to the region size of the target region in the previous image frame to obtain the estimated region size.
Considering that the estimated region obtained by scaling the region size of the target region in the previous image frame may not completely include the target object, in order to make the estimated region include the complete target object as much as possible, in a possible implementation manner, scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region includes: scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain an intermediate region size; and increasing the intermediate region size by a specified multiple to obtain the estimated region size. In this way, the estimated region can contain as complete a target object as possible.
For scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the intermediate region size, reference may be made to the foregoing manner of obtaining the estimated region size in the embodiment of the present disclosure, which is not repeated here. It will be appreciated that the specific value of the specified multiple may be determined based on historical experience; for example, it may be set to 2, in which case increasing the intermediate region size by the specified multiple can be understood as doubling the intermediate region size to obtain the estimated region size.
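Putting the translation, scaling, and specified-multiple enlargement together, the following is a sketch under the assumptions that the offsets are in difference form, boxes are (x, y, w, h) with a top-left corner, and the enlarged region is kept centred on the translated region (the centring is an illustrative choice, not stated in the text):

```python
def estimated_region(prev_box, est_offsets, multiple=2.0):
    """Translate, scale, then enlarge the previous target region by a specified multiple."""
    x, y, w, h = prev_box
    (dx, dy), (dw, dh) = est_offsets
    # translate the region position and scale the region size -> intermediate size
    nx, ny = x + dx, y + dy
    iw, ih = w + dw, h + dh
    # increase the intermediate size by the specified multiple (e.g. 2)
    ew, eh = iw * multiple, ih * multiple
    # keep the enlarged region centred on the translated region (illustrative choice)
    return (nx - (ew - iw) / 2, ny - (eh - ih) / 2, ew, eh)
```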
In the embodiment of the disclosure, the local region where the target object is located in the current image frame at a large probability can be roughly estimated, so that the detection range of the target detection is narrowed, and the improvement of the operation efficiency and accuracy of the target detection is facilitated.
As mentioned above, the object detection may be performed on the estimated area by using an object detection network, and in a possible implementation manner, in step S13, performing object detection on the estimated area to obtain a target area of the target object in the current image frame includes: and performing target detection on the estimated area by using a target detection network to obtain a target area of the target object in the current image frame, wherein the target detection network is obtained by training an initial target detection network through the sample image and the annotation information, and the annotation information represents the annotation area of the target object in the sample image. By the method, the target area where the target object is located can be accurately and efficiently detected.
It should be understood that the network type of the initial target detection network is the same as the target detection network, and the network parameters of the initial target detection network may be different from the target detection network. The initial target detection network may be understood as an untrained detection network, and the anchor boxes employed in the initial target detection network may be of an initial aspect ratio, such as 1:1, 1:2, 2:1, above. However, since the target detection network detects a specific kind of target object, for example, a human body, the initial aspect ratio of the anchor frame in the initial target detection network may not be suitable for different kinds of target objects, for example, the region where the human body is located is more suitable for using anchor frames such as 1:9, 1:6, etc. Therefore, a specified length-width ratio matched with the target object can be preset, and the target object in the sample image is labeled based on the specified length-width ratio to obtain a labeled area.
In a possible implementation manner, the labeling area has a specified aspect ratio, the initial target detection network uses an anchor frame to indicate a detection area where the target object is located in the sample image, and the anchor frame has an initial aspect ratio, where training the initial target detection network through the sample image and the labeling information includes: carrying out target detection on the sample image by using an initial target detection network to obtain a detection area output by the initial target detection network; and adjusting network parameters of the initial target detection network and adjusting the initial length-width ratio of an anchor frame adopted by the initial target detection network according to the difference between the detection area and the labeling area to obtain the target detection network, wherein the length-width ratio of the anchor frame adopted by the target detection network is the specified length-width ratio. By the method, the target detection network capable of detecting the target object can be trained efficiently, and the anchor frame is adjusted to the specified length-width ratio, so that the anchor frame adopted by the target detection network can indicate the area where the target object is located more accurately, or the anchor frame can be more adaptive to the target object.
In a possible implementation manner, the Intersection over Union (IOU) between the detection region and the labeling region may be adopted to characterize the difference between the detection region and the labeling region; of course, other implementations may also be adopted, such as the degree of coincidence or the relative position between the detection region and the labeling region, and the embodiments of the present disclosure are not limited thereto.
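For reference, a standard IOU computation over (x, y, w, h) boxes, which is one way the difference between a detection area and a labeling area can be characterised:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```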
Adjusting the network parameters of the initial target detection network according to the difference between the detection area and the labeled area may include: determining a network loss according to the difference between the detection area and the labeled area; and adjusting the network parameters of the initial target detection network according to the network loss until the network loss meets a preset condition. The preset condition may include the loss converging, the loss reaching 0, etc., and the embodiment of the present disclosure is not limited thereto. For example, the network loss between the detection region and the labeling region may be determined by a loss function such as a mean square error loss function, an intersection-over-union loss function, or a mean absolute error loss function, which is not limited in this embodiment of the present disclosure.
It should be understood that a plurality of target objects may be included in the sample image, then the annotation region may include a plurality of target objects, and the detection region output by the initial target detection network may include a plurality of target objects, and in a possible implementation, the initial aspect ratio of the anchor frame adopted by the initial target detection network is adjusted according to the difference between the detection region and the annotation region, including: matching each detection area with each labeling area according to the difference between each detection area and each labeling area to obtain a labeling area matched with each detection area; and aiming at any detection area, adjusting the initial length-width ratio of the anchor frame corresponding to the detection area to the specified length-width ratio of the labeling area matched with the detection area. In this way, the initial aspect ratio of the anchor box in the initial target detection network may be adaptively adjusted to a specified aspect ratio suitable for the target object.
Matching the detection areas and the labeling areas according to the differences between them to obtain the labeling area matched with each detection area can be understood as finding, for each detection area, the matched labeling area according to the intersection-over-union ratio between the detection areas and the labeling areas; for example, the detection area and the labeling area with the largest intersection-over-union ratio may be determined as a matched pair.
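A sketch of this matching and aspect-ratio adjustment is given below; the bookkeeping that ties each detection area to an anchor index, and the representation of the specified length-width ratio, are assumptions for illustration. It reuses the iou helper sketched above.

```python
def adjust_anchor_ratios(detections, labels, anchor_ratios):
    """Match each detection to the label with the largest IOU and adopt its ratio.

    detections: list of (det_box, anchor_index)
    labels: list of (label_box, specified_ratio)
    anchor_ratios: dict or list mapping anchor_index -> current length-width ratio
    """
    for det_box, anchor_idx in detections:
        best = max(range(len(labels)), key=lambda j: iou(det_box, labels[j][0]))
        # adjust the anchor's initial ratio to the matched labeling area's specified ratio
        anchor_ratios[anchor_idx] = labels[best][1]
    return anchor_ratios
```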
In a possible implementation manner, the target detection network may further detect multiple target objects, and therefore, the specified aspect ratios of the respective labeled regions may be clustered according to labeled regions labeled in the sample images by different types of target objects, where the number of clusters is the number of types of the target objects, or a cluster center is a type of the target object, for example, two types of target objects, namely, a human body and a vehicle, the types of clusters may be a human body type and a vehicle type, the specified aspect ratios of the human body type clusters may include 1:9, 1:6, and the like, and the specified aspect ratios of the vehicle type clusters may include 3:2, 1:1, and the like. Based on this, in a possible implementation manner, the type of the target object in each detection area can be determined according to the intersection ratio between each detection area and the labeling area; and adjusting the initial aspect ratio of the anchor frame corresponding to the detection area according to the specified aspect ratio corresponding to the type of the target object in the detection area. By the method, the adjusted anchor frame can be adapted to various target objects, namely the anchor frame can more accurately indicate the areas where the various target objects are located.
In order to better deploy the target detection network in the electronic device with weak computing capability and improve the computing efficiency of the target detection network in the electronic device with weak computing capability, in a possible implementation manner, the method further includes: and pruning the target detection network to obtain the pruned target detection network. By the method, the target detection network with smaller parameter quantity (namely network volume) can be obtained, so that the operation efficiency of the target detection network in the electronic equipment with weaker operation capability is improved.
It should be understood that, network pruning techniques known in the art may be used to implement pruning processing on the target detection network to obtain a pruned target detection network, and the embodiment of the present disclosure is not limited thereto.
For example, a framework model composed of a fully-connected network and a dropout model may be adopted to prune the target detection network. The framework model takes as input the L1 norm, the L2 norm and the batch normalization parameters of the network parameters in each channel dimension of each layer of the target detection network, and outputs the importance degree corresponding to each channel dimension. The dropout model introduces a certain randomness, which is beneficial to improving the generalization capability of the pruned target detection network.
In order to reduce the training complexity, the framework model may be superimposed during the iterative training of the initial target detection network; it should be understood that the network loss in the iterative training process may then also include a model loss corresponding to the framework model. Therefore, after the training of the initial target detection network is completed, the importance degree corresponding to each channel dimension in each network layer is also obtained; further, the network parameters in the channel dimensions whose importance degree is lower than a certain threshold may be removed from the target detection network, so as to obtain the pruned target detection network. The pruned target detection network may be retrained so that it can achieve more accurate target detection.
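As a simplified illustration of removing channel dimensions whose importance falls below a threshold, the sketch below scores each output channel by the L1 norm of its weights (one of the framework model's inputs mentioned above) rather than by the trained framework model itself, and only selects the channels to keep; rewiring the following layers is omitted.

```python
import numpy as np

def channels_to_keep(conv_weight, threshold):
    """Score each output channel by the L1 norm of its weights and keep those above threshold.

    conv_weight: array of shape (out_channels, in_channels, k, k).
    """
    importance = np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)
    return np.where(importance >= threshold)[0]

def prune_layer(conv_weight, threshold):
    keep = channels_to_keep(conv_weight, threshold)
    # drop the low-importance output channels; the next layer's inputs must be rewired accordingly
    return conv_weight[keep], keep
```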
In one possible implementation manner, in step S13, performing target detection on the estimated region to obtain a target region of the target object in the current image frame includes: and performing target detection on the pre-estimated area by using the pruned target detection network to obtain a target area of the target object in the current image frame. By the method, the detection efficiency of the target object in the target area can be improved.
It should be understood that, the target detection performed on the estimated area by using the pruned target detection network may refer to the above implementation manner of performing target detection on the estimated area by using the target detection network, which is not described herein again.
According to the embodiment of the disclosure, the estimated region of the target object in the current image frame is roughly estimated based on the estimated motion information, and the target detection is performed by adopting a light-weight target detection network, so that the efficiency and the accuracy of the target detection can be improved, and a target detection task can be realized in electronic equipment with weak computing capability, and the method can be particularly suitable for artificial intelligence teaching in the field of intelligent education robots.
According to the embodiment of the disclosure, the network pruning is performed on the target detection network in the training process, so that the detection efficiency and the accuracy of the target detection network are improved, the initial length-width ratio of the anchor frame is adjusted by using the detection result, the self-adaptive adjustment of the initial length-width ratio of the anchor frame is realized, and the anchor frame in the target detection network is more suitable for indicating the target area where the target object is located.
Fig. 4 shows a flowchart of an object detection method according to an embodiment of the present disclosure, which may be applied to an intelligent robot, the object detection method including:
in step S21, acquiring a current image frame currently acquired by the intelligent robot; the intelligent robot can be provided with an image acquisition component (such as a camera) which is used for acquiring image frames;
in step S22, obtaining an estimated region of the target object in the current image frame, where the estimated region of the target object in the current image frame is determined according to the target region where the target object is located indicated by the target anchor frame in the previous image frame; it should be understood that if the current image frame is the first image frame, the estimated region may be a global region of the current image frame;
in step S23, feature extraction is performed on the estimated area by using a feature extraction module of a target detection network deployed in the intelligent robot, so as to obtain a feature map;
in step S24, processing the feature map by using the candidate anchor frame generation module of the target detection network to obtain a large number of candidate anchor frames;
in step S25, classifying a large number of candidate anchor frames by using the candidate anchor frame classification module of the target detection network to obtain a target anchor frame including a target object;
in step S26, an estimated region of the target object in the image frame acquired after the current image frame is determined using the target region where the target object is located indicated by the target anchor frame in the current image frame.
It should be understood that steps S22 to S26 may be performed in a loop for a plurality of times, and the number of loops may depend on the total number of image frames to be detected. Step S23 to step S25 may refer to the implementation of the target detection network under the target detection framework shown in fig. 2, and step S26 may refer to the implementation of determining the estimated area of the target object in the current image frame according to the target area where the target object is located in the previous image frame, which is not described herein again.
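For illustration only, the loop over steps S21 to S26 can be sketched in Python as follows; here `detector` is assumed to be the pruned target detection network wrapped as a callable that returns a target box `(x, y, w, h)` inside the cropped estimated region (or None when no target is found), and the enlargement factor of 2 is an illustrative choice rather than a value specified by the disclosure.

```python
from typing import Callable, Iterable, Optional, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixel coordinates

def detect_stream(frames: Iterable[np.ndarray],
                  detector: Callable[[np.ndarray], Optional[Box]]):
    """Frame-by-frame detection: the previous target region seeds the next estimated region."""
    prev_box: Optional[Box] = None
    for frame in frames:                               # step S21: current image frame
        h, w = frame.shape[:2]
        if prev_box is None:                           # first frame: global region (step S22)
            ex, ey, ew, eh = 0, 0, w, h
        else:
            px, py, pw, ph = prev_box
            ew, eh = min(w, 2 * pw), min(h, 2 * ph)    # enlarge by a specified multiple
            ex = int(np.clip(px + pw / 2 - ew / 2, 0, w - ew))
            ey = int(np.clip(py + ph / 2 - eh / 2, 0, h - eh))
        crop = frame[ey:ey + eh, ex:ex + ew]
        box = detector(crop)                           # steps S23-S25 inside the network
        if box is not None:
            bx, by, bw, bh = box
            prev_box = (ex + bx, ey + by, bw, bh)      # step S26: map back to frame coordinates
        yield prev_box
```

The essential point is that, apart from the first frame, the network only processes the cropped estimated region rather than the full frame, which is what keeps the per-frame cost low on the robot.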
In a possible implementation manner, the target detection network is obtained by training an initial target detection network with sample images and annotation information. During the training of the initial target detection network, the length-width ratio of the anchor frames adopted by the target detection network can be adjusted according to the annotation information of the sample images, so that the anchor frames can accurately indicate the area where the target object is located, that is, the anchor frames are adapted to the target object. The trained target detection network can also be pruned to reduce its size. For the length-width ratio of the anchor frame and the pruning processing, reference may be made to the related descriptions of the embodiments of the present disclosure above, which are not repeated here.
In a possible implementation manner, after the intelligent robot detects the target object, it may further perform processing such as classification and recognition on the target object. For example, object classification may be performed on the object detected in the current image frame so that objects of the target category can be captured automatically, or gesture recognition may be performed on a hand detected in the current image frame to realize gesture control of the intelligent robot. The embodiments of the present disclosure do not limit the processing performed after the target object is detected.
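Purely as a hypothetical illustration of this downstream dispatch, the detected target region could be cropped and handed to a task-specific model; `classify_object` and `recognize_gesture` below are assumed helper callables, not components defined by the present disclosure.

```python
import numpy as np

def handle_detection(frame: np.ndarray, box, mode: str,
                     classify_object, recognize_gesture):
    """Route the detected target region to a downstream task (illustrative only)."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]                   # crop the target region from the frame
    if mode == "capture":
        return ("capture", classify_object(crop))    # object classification for automatic capture
    if mode == "gesture":
        return ("gesture", recognize_gesture(crop))  # gesture recognition for robot control
    return ("none", None)
```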
In the embodiment of the disclosure, the target detection is performed on the pre-estimated region of the target object in the current image frame by adopting the lightweight target detection network, so that the target detection task can be realized in the intelligent robot with weaker computing capability, and meanwhile, the efficiency and the accuracy of the target detection are improved.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the underlying principles and logic; due to space limitations, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the methods of the above specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a target detection apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the target detection methods provided by the present disclosure; for the corresponding technical solutions and descriptions, reference may be made to the corresponding descriptions in the method section, which are not repeated here.
Fig. 5 shows a block diagram of an object detection apparatus according to an embodiment of the present disclosure, which, as shown in fig. 5, includes:
an obtaining module 101, configured to obtain a current image frame to be detected;
the estimation module 102 is configured to determine an estimation area of a target object in a current image frame according to a target area where the target object is located in a previous image frame, where an acquisition time of the previous image frame is before the current image frame;
the detection module 103 is configured to perform target detection on the estimated region to obtain a target region of the target object in the current image frame.
In a possible implementation manner, the estimation module 102 includes: the motion information estimation submodule is used for determining estimated motion information corresponding to the current image frame according to historical position offset and historical size offset of a target area in the previous image frame, wherein the historical position offset represents position offset of the target object between two target areas in two adjacent previous image frames, and the historical size offset represents size offset of the target object between two target areas in two adjacent previous image frames; and the estimation region determining submodule is used for determining the estimation region according to the estimation motion information and the target region in the previous image frame.
In one possible implementation, the predicted motion information includes: the estimated position offset and the estimated size offset corresponding to the current image frame; wherein the determining the estimated region according to the estimated motion information and the target region in the previous image frame comprises: translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region; and utilizing the estimated size offset to zoom the area size of the target area in the previous image frame to obtain the estimated area size of the estimated area.
In a possible implementation manner, the scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region includes: zooming the region size of the target region in the previous image frame by using the estimated size offset to obtain a middle region size; and increasing the size of the middle area according to the specified multiple to obtain the estimated area size.
In a possible implementation manner, the determining, according to a historical position offset and a historical size offset of a target region in the previous image frame, estimated motion information corresponding to the current image frame includes: calculating a weighted average value between the historical position offset of the target area in the previous image frame and the estimated position offset corresponding to the previous image frame to obtain the estimated position offset corresponding to the current image frame; and calculating a weighted average value between the historical size offset of the target area in the previous image frame and the estimated size offset corresponding to the previous image frame to obtain the estimated size offset corresponding to the current image frame.
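As a minimal numeric sketch of this estimation, assuming the position offset is a translation of the region centre and the size offset is a scale ratio, the weighted averaging and the subsequent translation, scaling, and enlargement could look as follows; the weighting coefficient `alpha` and the enlargement factor `enlarge` are illustrative assumptions, since the disclosure leaves their concrete values open.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Region:
    cx: float   # centre x of the region
    cy: float   # centre y of the region
    w: float    # region width
    h: float    # region height

def estimate_region(prev: Region, prev2: Region,
                    prev_est_dpos: Tuple[float, float] = (0.0, 0.0),
                    prev_est_dsize: Tuple[float, float] = (1.0, 1.0),
                    alpha: float = 0.5, enlarge: float = 1.5):
    """prev, prev2: target regions in the two adjacent previous image frames;
    prev_est_*: estimated offsets computed for the previous frame."""
    # Historical offsets between the two adjacent previous frames.
    hist_dpos = (prev.cx - prev2.cx, prev.cy - prev2.cy)
    hist_dsize = (prev.w / prev2.w, prev.h / prev2.h)

    # Weighted average of the historical offset and the previous estimated offset.
    est_dpos = tuple(alpha * hv + (1 - alpha) * pv
                     for hv, pv in zip(hist_dpos, prev_est_dpos))
    est_dsize = tuple(alpha * hv + (1 - alpha) * pv
                      for hv, pv in zip(hist_dsize, prev_est_dsize))

    # Translate the previous target region, scale it, then enlarge by a specified multiple.
    est_region = Region(cx=prev.cx + est_dpos[0],
                        cy=prev.cy + est_dpos[1],
                        w=prev.w * est_dsize[0] * enlarge,
                        h=prev.h * est_dsize[1] * enlarge)
    return est_region, est_dpos, est_dsize
```

Feeding the previous frame's estimated offsets back into the weighted average smooths the motion estimate over successive frames, in line with the weighted-average formulation described above.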
In a possible implementation manner, the detection module 103 includes: the first detection submodule is used for performing target detection on the estimated area by using a target detection network to obtain a target area of the target object in the current image frame, wherein the target detection network is obtained by training an initial target detection network through a sample image and annotation information, and the annotation information represents an annotation area where the target object is located in the sample image.
In one possible implementation, the labeling area has a specified aspect ratio, the initial target detection network indicates a detection area in which the target object is located in the sample image by using an anchor frame, the anchor frame has an initial aspect ratio, and the training of the initial target detection network by the sample image and the labeling information includes: carrying out target detection on the sample image by using the initial target detection network to obtain a detection area output by the initial target detection network; and adjusting network parameters of the initial target detection network and adjusting an initial length-width ratio of an anchor frame adopted by the initial target detection network according to the difference between the detection area and the labeling area to obtain the target detection network, wherein the length-width ratio of the anchor frame adopted by the target detection network is the specified length-width ratio.
In a possible implementation manner, the adjusting an initial aspect ratio of an anchor frame adopted by the initial target detection network according to a difference between the detection area and the labeled area includes: matching each detection area with each labeling area according to the difference between each detection area and each labeling area to obtain a labeling area matched with each detection area; and aiming at any detection area, adjusting the initial length-width ratio of the anchor frame corresponding to the detection area to the specified length-width ratio of the labeling area matched with the detection area.
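For illustration, the matching and adjustment could be sketched as below; IoU is used as the measure of difference between a detection area and a labeling area, and the pairing of each detection with the index of the anchor frame that produced it is an assumed bookkeeping detail, neither of which is mandated by the disclosure.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def adjust_anchor_ratios(detections: Sequence[Tuple[Box, int]],
                         labels: Sequence[Box],
                         anchor_ratios: List[float]) -> List[float]:
    """Set each detection's originating anchor ratio to that of the best-matching labeled region."""
    new_ratios = list(anchor_ratios)
    for box, anchor_idx in detections:
        best = max(labels, key=lambda lab: iou(box, lab))   # labeled region matched to this detection
        new_ratios[anchor_idx] = best[2] / best[3]          # specified width / height of the label
    return new_ratios
```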
In one possible implementation, the apparatus further includes: the pruning module is used for carrying out pruning processing on the target detection network to obtain a pruned target detection network; wherein, the detection module 103 includes: and the second detection submodule is used for carrying out target detection on the pre-estimated area by using the pruned target detection network to obtain a target area of the target object in the current image frame.
In the embodiments of the present disclosure, the estimated area of the target object in the current image frame is determined according to the target area where the target object is located in the previous image frame, which is equivalent to roughly estimating the local area where the target object is likely to be located in the current image frame. Target detection is then performed on the estimated area to obtain the target area of the target object in the current image frame, which is equivalent to narrowing the detection range of the target detection.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the descriptions of the above method embodiments, which, for brevity, are not repeated here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 6 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 6, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (Wi-Fi), a second generation mobile communication technology (2G), a third generation mobile communication technology (3G), a fourth generation mobile communication technology (4G), a long term evolution of universal mobile communication technology (LTE), a fifth generation mobile communication technology (5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 7 illustrates a block diagram of another electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 7, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical user interface operating system (Mac OS X™), the multi-user multi-process operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA) can execute the computer-readable program instructions by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK) or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A method of object detection, comprising:
acquiring a current image frame to be detected;
determining an estimated area of a target object in the current image frame according to a target area where the target object is located in a previous image frame, wherein the acquisition time of the previous image frame is before the current image frame;
and carrying out target detection on the pre-estimated area to obtain a target area of the target object in the current image frame.
2. The method of claim 1, wherein the determining the predicted area of the target object in the current image frame according to the target area of the target object in the previous image frame comprises:
determining estimated motion information corresponding to the current image frame according to a historical position offset and a historical size offset of a target area in the previous image frame, wherein the historical position offset represents a position offset of the target object between two target areas in two adjacent previous image frames, and the historical size offset represents a size offset of the target object between two target areas in two adjacent previous image frames;
and determining the estimated area according to the estimated motion information and the target area in the previous image frame.
3. The method of claim 2, wherein the predicted motion information comprises: the estimated position offset and the estimated size offset corresponding to the current image frame;
wherein the determining the estimated region according to the estimated motion information and the target region in the previous image frame comprises:
translating the region position of the target region in the previous image frame by using the estimated position offset to obtain the estimated region position of the estimated region;
and utilizing the estimated size offset to zoom the area size of the target area in the previous image frame to obtain the estimated area size of the estimated area.
4. The method of claim 3, wherein the scaling the region size of the target region in the previous image frame by using the estimated size offset to obtain the estimated region size of the estimated region comprises:
zooming the region size of the target region in the previous image frame by using the estimated size offset to obtain a middle region size;
and increasing the size of the middle area according to the specified multiple to obtain the estimated area size.
5. The method according to any one of claims 2-4, wherein the determining the estimated motion information corresponding to the current image frame according to the historical position offset and the historical size offset of the target area in the previous image frame comprises:
calculating a weighted average value between the historical position offset of the target area in the previous image frame and the estimated position offset corresponding to the previous image frame to obtain the estimated position offset corresponding to the current image frame;
and calculating a weighted average value between the historical size offset of the target area in the previous image frame and the estimated size offset corresponding to the previous image frame to obtain the estimated size offset corresponding to the current image frame.
6. The method according to claim 1, wherein the performing target detection on the estimated region to obtain a target region of the target object in the current image frame comprises:
and performing target detection on the estimated area by using a target detection network to obtain a target area of the target object in the current image frame, wherein the target detection network is obtained by training an initial target detection network through a sample image and annotation information, and the annotation information represents an annotation area of the target object in the sample image.
7. The method of claim 6, wherein the labeled region has a specified aspect ratio, wherein the initial target detection network indicates the detection region of the sample image where the target object is located with an anchor box having an initial aspect ratio, and wherein training the initial target detection network with the sample image and the labeling information comprises:
carrying out target detection on the sample image by using the initial target detection network to obtain a detection area output by the initial target detection network;
and adjusting network parameters of the initial target detection network and adjusting an initial length-width ratio of an anchor frame adopted by the initial target detection network according to the difference between the detection area and the labeling area to obtain the target detection network, wherein the length-width ratio of the anchor frame adopted by the target detection network is the specified length-width ratio.
8. The method of claim 7, wherein the detection area comprises a plurality of detection areas, wherein the labeled area comprises a plurality of labeled areas, and wherein adjusting the initial aspect ratio of the anchor frame used by the initial target detection network according to the difference between the detection area and the labeled area comprises:
matching each detection area with each labeling area according to the difference between each detection area and each labeling area to obtain a labeling area matched with each detection area;
and aiming at any detection area, adjusting the initial length-width ratio of the anchor frame corresponding to the detection area to the specified length-width ratio of the labeling area matched with the detection area.
9. The method according to any one of claims 6-8, further comprising: pruning the target detection network to obtain a pruned target detection network;
the performing target detection on the pre-estimated region to obtain a target region of the target object in the current image frame includes:
and carrying out target detection on the pre-estimated area by using the pruned target detection network to obtain a target area of the target object in the current image frame.
10. An object detection device, comprising:
the acquisition module is used for acquiring a current image frame to be detected;
the estimation module is used for determining an estimation area of a target object in the current image frame according to a target area where the target object is located in a previous image frame, wherein the acquisition time of the previous image frame is before the current image frame;
and the detection module is used for carrying out target detection on the pre-estimated area to obtain a target area of the target object in the current image frame.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 9.
12. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 9.
CN202210015365.6A 2022-01-07 2022-01-07 Target detection method and device, electronic equipment and storage medium Withdrawn CN114359808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210015365.6A CN114359808A (en) 2022-01-07 2022-01-07 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210015365.6A CN114359808A (en) 2022-01-07 2022-01-07 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114359808A true CN114359808A (en) 2022-04-15

Family

ID=81107065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210015365.6A Withdrawn CN114359808A (en) 2022-01-07 2022-01-07 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114359808A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115719468A (en) * 2023-01-10 2023-02-28 清华大学 Image processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220415