CN112906495A - Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN112906495A
Authority
CN
China
Prior art keywords
target
detection
video frame
determining
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110114475.3A
Other languages
Chinese (zh)
Other versions
CN112906495B (en)
Inventor
蒋海滨
张祥攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Anngic Technology Co ltd
Original Assignee
Shenzhen Anngic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Anngic Technology Co ltd filed Critical Shenzhen Anngic Technology Co ltd
Priority to CN202110114475.3A priority Critical patent/CN112906495B/en
Publication of CN112906495A publication Critical patent/CN112906495A/en
Application granted granted Critical
Publication of CN112906495B publication Critical patent/CN112906495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target detection method, a target detection device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring video data collected by a camera; extracting a first video frame and a second video frame from the video data; determining a region image containing a distant target from the second video frame; performing target detection on the first video frame by using a target detection model to obtain a first target set, and performing target detection on the region image by using the target detection model to obtain a second target set; and determining a final detection result according to the first target set and the second target set. In this implementation, the target detection result of the first video frame (i.e. the complete image) and the target detection result of the region image are considered together; because the region image requires a smaller scaling factor than the complete video frame does, the situation in which a distant target must be detected from a heavily compressed complete image is effectively avoided, and the accuracy of target detection is improved.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing and image recognition, and in particular, to a target detection method, an apparatus, an electronic device, and a storage medium.
Background
An Advanced Driver Assistance System (ADAS) is an active safety technology that performs intelligent image analysis using Artificial Intelligence (AI) algorithms; ADAS is widely used in application fields such as assisted driving and unmanned vehicles.
In the development of ADAS, a target detection algorithm is usually used to sense the targets around the vehicle. Because the computing power of the vehicle-mounted chip is limited and the resolution of the complete image originally captured by the camera is very high, the complete image is usually scaled (e.g., compressed) to the resolution of the target detection model's input image before detection, so as to speed up target detection. After such compression, a distant target is hard to characterize: it occupies only a few pixels in the compressed image that is fed to the target detection model, and is therefore difficult to detect. As a result, the accuracy of current target detection algorithms on video images is low.
Disclosure of Invention
An object of the embodiments of the present application is to provide a target detection method, an apparatus, an electronic device, and a storage medium, which are used to solve the problem of low accuracy in target detection of a video image.
The embodiment of the application provides a target detection method, which comprises the following steps: acquiring video data collected by a camera; extracting a first video frame and a second video frame from the video data; determining a region image containing a distant target from the second video frame, wherein the size of the second video frame is larger than that of the region image; performing target detection on the first video frame by using a target detection model to obtain a first target set, and performing target detection on the region image by using the target detection model to obtain a second target set; and determining a final detection result according to the first target set and the second target set. In this implementation, the target detection result of the first video frame (i.e. the complete image) and the target detection result of the region image are considered together; because the region image requires a smaller scaling factor than the complete video frame does, the situation in which a distant target must be detected only from the scaled complete image is effectively avoided, and the accuracy of target detection is effectively improved.
Optionally, in this embodiment of the present application, determining an area image including a distant object from a second video frame includes: determining the central point of the area image by taking the abscissa of the central point of the second video frame as the abscissa of the central point of the area image and taking the ordinate of the vanishing point of the second video frame as the ordinate of the central point of the area image; acquiring the transverse width and the longitudinal width of the area image; and determining the region image according to the central point of the region image and the transverse width and the longitudinal width of the region image. In the implementation process, the area image is determined through the vanishing point and the central point of the second video frame, so that the effective detection distance of the target detection is increased, and the error probability of detecting the target in the area image is effectively reduced.
Optionally, in this embodiment of the present application, obtaining the lateral width and the longitudinal width of the region image includes: obtaining a zoom factor; multiplying the scaling multiple by the transverse width of the input image of the target detection model to obtain the transverse width of the area image; and multiplying the scaling multiple by the longitudinal width of the input image of the target detection model to obtain the longitudinal width of the area image. In the implementation process, the transverse width and the longitudinal width of the area image are dynamically calculated, so that the area image is prevented from being determined by using the unchanged transverse width and longitudinal width, the proportion of the target in the area image is increased, and the error probability of detecting the target in the area image is effectively reduced.
Optionally, in this embodiment of the present application, determining a final detection result according to the first target set and the second target set includes: judging whether the first target set and the second target set contain the same target or not; if the first target set and the second target set contain the same target, determining a final detection result according to the position relation between the target and the regional image; if the first target set comprises a target and the second target set does not comprise the target, determining a detection frame of the target in the first target set as a final detection result; and if the first target set does not contain a target and the second target set contains the target, determining a detection frame of the target in the second target set as a final detection result. In the implementation process, by comprehensively considering the first target set of the first video frame (i.e. the complete image) and the second target set of the region image, the zoom factor of the region image is smaller than that of the complete image of the video frame, so that the situation that a distant target is detected only from the compressed complete image is effectively avoided, and the accuracy of target detection is effectively improved.
Optionally, in this embodiment of the present application, determining a final detection result according to a position relationship between the target and the area image includes: judging whether the target is in a preset area, wherein the ratio of the size of the preset area to the size of the area image is equal to the preset ratio, and the center point of the preset area and the center point of the area image are the same; if so, determining a detection frame of the target in the second target set as a final detection result; and if not, determining a detection frame of the target in the first target set as a final detection result. In the implementation process, the preset area is set, and the final detection result is determined according to whether the target is included in the preset area, so that a boundary is effectively set between the complete image detection result of the video frame and the detection result of the area image, and the flexibility of determining the final detection result is improved.
Optionally, in this embodiment of the present application, determining whether the first target set and the second target set contain the same target includes: matching detection frames corresponding to targets with the same category in the first target set and the second target set; and if the matching is successful, determining that the first target set and the second target set contain the same target, otherwise, determining that the first target set and the second target set do not contain the same target.
Optionally, in this embodiment of the present application, matching detection frames corresponding to targets in the same category in the first target set and the second target set includes: calculating the intersection area of the targets with the same category between the detection frames of the first target set and the detection frames of the second target set, and screening out the smaller detection frames of the targets in the detection frames of the first target set and the detection frames of the second target set; judging whether the ratio of the intersection area to the smaller detection frame is larger than a preset ratio threshold value or not; if so, determining that the matching is successful, otherwise, determining that the matching is unsuccessful.
Optionally, in this embodiment of the present application, the first video frame and the second video frame are the same video frame in the video data, or two consecutive and adjacent different video frames in the video data, or two different video frames separated by a preset number of frames in the video data. In this implementation, target detection is performed on the complete image corresponding to the first video frame and on the region image corresponding to the (consecutive or spaced) second video frame, which avoids performing target detection on both the complete image and the region image of every video frame in the video data, and effectively saves the computation required to process the video data.
An embodiment of the present application further provides a target detection apparatus, including: the video data acquisition module is used for acquiring video data acquired by the camera; the video frame extraction module is used for extracting a first video frame and a second video frame from video data; the area image determining module is used for determining an area image containing a far target from a second video frame, and the size of the second video frame is larger than that of the area image; the target set obtaining module is used for carrying out target detection on the first video frame by using a target detection model to obtain a first target set, and carrying out target detection on the regional image by using a target detection model to obtain a second target set; and the final result determining module is used for determining a final detection result according to the first target set and the second target set.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a target detection method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an extraction process of a video frame provided by an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating determination and extraction of a region image provided by an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a calculation of a ratio of an intersection area to a smaller detection box according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before introducing the target detection method provided in the embodiment of the present application, some concepts related in the embodiment of the present application are introduced:
Target detection, also called target extraction, is an image understanding algorithm based on target geometry and statistical features; it combines the positioning and the identification of a target into one step. For example: based on a computer vision algorithm, a target of interest in the image is detected, i.e. the position of the target is marked with a rectangular frame and the category of the target is identified.
The target detection model, also referred to simply as the target detection network model, refers to a neural network model obtained by training a target detection network with training data; by the number of detection stages, target detection models can be roughly divided into single-stage detection models and two-stage detection models.
Hough transform: a feature extraction algorithm used to identify shapes in an image. Given the kind of shape to be identified, the algorithm votes in a parameter space, and the shape of an object is determined by local maxima in an accumulator space. Specific examples include detecting straight lines, circles, parabolas, ellipses and other curves in an image that can be described by some functional relation. The basic principle of the Hough transform is to map curves (including straight lines) from the image space into a parameter space, determine the curves' description parameters by detecting extreme points in the parameter space, and thereby extract regular curves from the image.
A server refers to a device that provides computing services over a network, for example: x86 servers and non-x86 servers; non-x86 servers include mainframes, minicomputers, and UNIX servers.
It should be noted that the target detection method provided in the embodiments of the present application may be executed by an electronic device, where the electronic device refers to the server described above or a device terminal capable of executing a computer program, the device terminal including, for example: smartphones, Personal Computers (PCs), tablet computers, Personal Digital Assistants (PDAs), Mobile Internet Devices (MIDs), vehicle-mounted monitoring devices, in-vehicle driving devices, and the like.
Before introducing the target detection method provided in the embodiment of the present application, the application scenarios to which it applies are introduced. These include, but are not limited to: the fields of autonomous driving and security monitoring, among others. In the field of autonomous driving, the target detection method can be used to enhance the Advanced Driver Assistance System (ADAS) functions of an unmanned aerial vehicle or an unmanned vehicle; in the field of security monitoring, the target detection method can be used to improve the accuracy of target detection on video captured by a monitoring system or a security system.
Please refer to fig. 1 for a schematic flow chart of the target detection method provided in the embodiment of the present application. The main idea of the target detection method is to extract a first video frame and a second video frame from the video data, determine a region image containing a distant target from the second video frame, and then determine the final detection result of the target detection model according to the first target set detected from the first video frame and the second target set detected from the region image. That is, the target detection result of the first video frame (i.e. the complete image) and the target detection result of the region image are considered together; because the region image requires a smaller scaling factor than the complete video frame, the situation in which a distant target must be detected only from the scaled complete image is effectively avoided, and the accuracy of target detection is effectively improved. The target detection method may include:
step S110: and acquiring video data collected by the camera.
The video data in step S110 may be obtained in several ways. In the first way, a high-definition camera aimed at the front of the vehicle collects video data and sends it to the vehicle's master control electronic device, so that the master control electronic device obtains the video data collected by the high-definition camera. In the second way, a target object is filmed with acquisition equipment such as a video camera, a video recorder, or a color camera on an unmanned aerial vehicle to obtain video data; the acquisition equipment then sends the video data to the electronic device, which receives it. In the third way, a monitoring camera in the security field collects video data and sends it to the electronic device of the monitoring system, so that the electronic device obtains the video data collected by the monitoring camera.
After step S110, step S120 is performed: a first video frame and a second video frame are extracted from the video data.
The video frame here refers to a video frame extracted from the video data, or an image obtained from such a frame, specifically: a complete video frame is extracted from the video data and, for example, the border containing time information is removed, or the video frame undergoes some preprocessing. Preprocessing here includes, but is not limited to: background removal, rotation correction, histogram equalization, image graying, binarization, image cropping, image scaling, and/or noise removal, among other operations.
It should be understood that the above-mentioned video frame may also be the video frame itself, that is, a complete image extracted from the video data. The first video frame and the second video frame may be two consecutive and adjacent different video frames, or two video frames separated by less than a preset number of frames or by less than a preset duration. Of course, in practice the first video frame and the second video frame may also be the same video frame, i.e. the target objects in a single video frame image are detected by the target detection method.
Please refer to fig. 2, which is a schematic diagram illustrating a video frame extraction process according to an embodiment of the present application; there are many ways to extract the video frame in step S120, including but not limited to the following:
In the first extraction method, two consecutive video frames are extracted from the video data, for example: one video frame is extracted from the video data together with the video frame immediately following it in time order, e.g. the video frames with frame identifiers 2 and 3 in fig. 2. Consecutive video frames in the video data may be denoted Frame1, Frame2, ..., FrameN. The complete video frame is also referred to as a large-frame image (denoted Large), and the region image extracted from the complete video frame is also referred to as a small-frame image (denoted Small); in a specific implementation, large-frame and small-frame images may be detected alternately, and the alternately detected image sequence can be written as Frame1:Large, Frame2:Small, Frame3:Large, ..., FrameN:Small.
In the second extraction method, two spaced video frames are extracted from the video data, for example: two video frames separated by less than a preset number of frames, or by less than a preset duration, are extracted from the video data. The preset number may be set according to the specific situation; for example, the video frames with frame identifiers 6 and 16 are extracted from fig. 2, and the number of frames between them is 9. In a specific implementation, the first video frame and the second video frame may also be separated by a preset duration, for example: an interval of 300 milliseconds, 1 second, or 3 seconds between the first video frame and the second video frame.
In the third extraction method, the same video frame is extracted from the video data, for example: in fig. 2, the video frame with frame identifier 9 is extracted first and recorded as the first video frame, and the same frame is then copied and recorded as the second video frame; alternatively, the video frame with frame identifier 9 is extracted twice, i.e. both the extracted first video frame and the extracted second video frame are the video frame with frame identifier 9.
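The three extraction methods can be sketched in a few lines of Python. This is an illustrative sketch only (the patent specifies no code): it assumes OpenCV for video I/O, and the function name frame_pairs and its interval parameter are hypothetical.

```python
import cv2

def frame_pairs(video_path, interval=1):
    """Yield (first_video_frame, second_video_frame) pairs.

    interval == 0 : the same frame plays both roles (third method)
    interval == 1 : two consecutive, adjacent frames (first method)
    interval  > 1 : two frames `interval` frames apart (second method)
    """
    cap = cv2.VideoCapture(video_path)
    recent = []  # sliding window of the most recent frames
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if interval == 0:
            yield frame, frame
            continue
        recent.append(frame)
        if len(recent) > interval:
            yield recent[0], recent[-1]
            recent.pop(0)
    cap.release()
```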
After step S120, step S130 is performed: an area image containing a distant object is determined from the second video frame.
The regional image includes a distant target, a distance between the distant target and a camera for acquiring the target may be within a preset range, where the preset range refers to a range from a minimum distance to a maximum distance, which needs to be detected, and the preset range may be set according to a specific situation, specifically, for example: the preset range is set to 50 to 75 meters, 80 to 150 meters, or 155 to 200 meters, and so on. It will be appreciated that the first video frame and the second video frame are both complete video frames, and the region image is extracted (e.g., truncated) from the second video frame, so that the size of the second video frame is larger than the size of the region image.
Please refer to fig. 3, a schematic diagram of determining and extracting the region image provided in the embodiment of the present application. The vanishing point in fig. 3 is the point at which the road vanishes; a tank truck is driving near that point, and the region image may be the image of the region containing the tank truck (see the larger rectangle indicated by the equal-length dashed lines in fig. 3). The distant target can therefore be understood as the tank truck in fig. 3, and the whole complete image of fig. 3 can be understood as the above-mentioned second video frame. In the lower area of fig. 3 there is an off-road vehicle, which can be understood as a near target rather than a distant target; the off-road vehicle can be detected as a target from the complete image. Step S130 above has many embodiments, including but not limited to the following:
in a first embodiment, determining an area image containing a distant object from a second video frame according to an abscissa of a center point of a complete image (e.g., a first video frame or a second video frame) and an ordinate of a vanishing point, the embodiment may include:
step S131: and determining the central point of the area image by taking the abscissa of the central point of the second video frame as the abscissa of the central point of the area image and taking the ordinate of the vanishing point of the second video frame as the ordinate of the central point of the area image.
The embodiment of step S131 is, for example: assume that the second video frame has four vertices; the coordinates of the top-left vertex and the bottom-right vertex of the second video frame are summed and averaged to obtain the coordinates of the center point of the second video frame (equivalently, the bottom-left and top-right vertices can be used). Then, a vanishing point detection algorithm is used to detect the vanishing point of the second video frame; the abscissa of the center point of the second video frame is taken as the abscissa of the center point of the region image, and the ordinate of the vanishing point of the second video frame is taken as the ordinate of the center point of the region image, thereby determining the center point of the region image. The vanishing point here may be the vanishing point of the road, and usable vanishing point detection algorithms include, but are not limited to: lane line intersection detection, the Hough Transform, and the Cascaded Hough Transform (CHT), among others.
Step S132: the lateral width and the longitudinal width of the region image are obtained.
The embodiment of step S132 includes two ways. In the first way, the lateral width and the longitudinal width of a target object (e.g. a pedestrian or a vehicle) that typically needs to be detected in the second video frame (i.e. the image at its original resolution) can be obtained by recording and measuring actual data; these original-resolution lateral and longitudinal widths are denoted (w_orig, h_orig). The minimum size of a target object that the target detection model needs to detect in its input image is denoted (w, h), where w is the lateral width and h the longitudinal width of that minimum detectable target. The scaling coefficient can then be calculated by the formula

scale = min(w_orig / w, h_orig / h)

where scale is the scaling coefficient, w_orig and h_orig are the lateral and longitudinal widths in the original-resolution image, and w and h are the lateral and longitudinal widths of the target object that the model needs to detect in the input image; taking the minimum of the two ratios ensures that, after the region image is resized to the model's input resolution (i.e. divided by scale), the target still occupies at least (w, h) pixels. After the scaling coefficient is obtained, the scaling coefficient is multiplied by the lateral width of the target detection model's input image to obtain the lateral width of the region image, and multiplied by the longitudinal width of the input image to obtain the longitudinal width of the region image. In the second way, the scaling coefficient can be estimated from working experience, or adjusted from a historically set scaling coefficient; once the (adjustable) scaling coefficient is obtained, it is likewise multiplied by the lateral and longitudinal widths of the target detection model's input image to obtain the lateral and longitudinal widths of the region image.
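A minimal sketch of the first way of step S132, using the min() form of the formula above; the function names are illustrative, not from the patent:

```python
def scale_factor(w_orig, h_orig, w_min, h_min):
    # Largest crop factor that still leaves a target of original size
    # (w_orig, h_orig) at least (w_min, h_min) pixels once the region
    # image is resized down to the model's input resolution.
    return min(w_orig / w_min, h_orig / h_min)

def region_widths(scale, input_w, input_h):
    # Region image lateral/longitudinal width = scale x input width/height
    # of the target detection model (step S132).
    return scale * input_w, scale * input_h
```

For example, if a distant vehicle measures 60 × 45 pixels in the original frame (w_orig = 60, h_orig = 45) and the model needs at least 20 × 15 pixels (w = 20, h = 15), then scale = min(3, 3) = 3, so the region crop is three times the model's input size in each dimension.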
Step S133: and determining the region image according to the central point of the region image and the transverse width and the longitudinal width of the region image.
The embodiment of step S133 is, for example: since the region image can be regarded as a rectangle, once the length and width of the rectangle are determined and its center point is determined, the rectangular region image within the second video frame is determined; accordingly, the region image can be determined (e.g. cropped) from the second video frame based on the center point of the region image and its lateral and longitudinal widths.
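Steps S131 to S133 of this first embodiment can be combined into one minimal sketch. It assumes the vanishing point's ordinate has already been detected, and the clamping to the frame border is an added detail the text does not spell out:

```python
def crop_region(frame, vanishing_y, region_w, region_h):
    """Center abscissa from the frame center, center ordinate from the
    vanishing point (step S131), then crop a region_w x region_h
    rectangle (step S133). `frame` is an H x W x C image array (e.g. as
    read by OpenCV). Returns the crop and its top-left offset so that
    detections can be mapped back to full-frame coordinates."""
    frame_h, frame_w = frame.shape[:2]
    cx = frame_w / 2.0        # abscissa of the second video frame's center
    cy = float(vanishing_y)   # ordinate of the detected vanishing point
    x0 = int(max(0, cx - region_w / 2))
    y0 = int(max(0, cy - region_h / 2))
    x1 = int(min(frame_w, cx + region_w / 2))
    y1 = int(min(frame_h, cy + region_h / 2))
    return frame[y0:y1, x0:x1], (x0, y0)
```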
In a second embodiment, the area image is determined from the second video frame only according to the vanishing point of the complete image (e.g. the first video frame or the second video frame), for example: determining a central point of the area image from the road vanishing point of the second video frame, and calculating the horizontal width and the longitudinal width of the area image from the complete image according to a preset proportion, wherein the preset proportion can be set according to a specific scene, for example, set to 0.5, 0.7, 0.9 and the like; and finally, cutting out the area image according to the central point of the area image and the transverse width and the longitudinal width of the area image. It is to be understood that the second embodiment is similar to the first embodiment except that the abscissa of the center point of the region image is the abscissa of the vanishing point, and the lateral width and the longitudinal width of the region image are calculated differently.
In a third embodiment, determining an area image containing a distant object from a second video frame according to a vertical coordinate of a center point and a horizontal coordinate of a vanishing point of a complete image (e.g. a first video frame or a second video frame), the method may include: and determining the center point of the area image by taking the abscissa of the vanishing point of the second video frame as the abscissa of the center point of the area image and taking the ordinate of the center point of the second video frame as the ordinate of the center point of the area image, obtaining the transverse width and the longitudinal width of the area image, and finally determining the area image according to the center point of the area image and the transverse width and the longitudinal width of the area image.
After step S130, step S140 is performed: and performing target detection on the first video frame by using a target detection model to obtain a first target set, and performing target detection on the regional image by using the target detection model to obtain a second target set.
Optionally, before target detection is performed on the complete image (e.g. the first video frame) and the region image with the target detection model, the complete image and the region image may be preprocessed and normalized, and target detection then performed on the processed images. For example: the complete image and the region image are scaled to a fixed size, which may be 256 × 128, and the scaled images are then normalized, e.g. each channel's pixel values are normalized from the range 0-255 to the range 0-1.
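A sketch of this preprocessing, assuming OpenCV and the example values above (256 × 128 is interpreted here as width × height, an assumption):

```python
import cv2
import numpy as np

def preprocess(image, size=(256, 128)):
    # Scale to the fixed input size, then normalize each channel's pixel
    # values from [0, 255] to [0, 1], as described above.
    resized = cv2.resize(image, size)  # OpenCV's dsize is (width, height)
    return resized.astype(np.float32) / 255.0
```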
The embodiment of step S140 is, for example: target detection is performed on the first video frame using the target detection model to obtain the first target set, and target detection is performed on the region image using the target detection model to obtain the second target set. The target detection model here may be a single-stage detection model or a two-stage detection model; usable single-stage detection models include, but are not limited to, the Feature Fusion Single Shot multibox Detector (FSSD) and the YOLO network models, and usable two-stage detection models include the RCNN, Fast RCNN, and Faster RCNN series of network models.
After step S140, step S150 is performed: and determining a final detection result according to the first target set and the second target set.
The target set refers to the set obtained by detecting target objects in an image, and the attributes of each element in the set include, but are not limited to: the category of the target detection box (which may be denoted cls_type, the specific category detected on the road, such as car, person, driver, etc.), the abscissa (denoted x) of the center of the target detection box in the overall view (i.e. the complete image or the region image), the ordinate (denoted y) of the center of the target detection box in the overall view, the confidence of the target detection box (denoted probability), the width (denoted w) of the target detection box, and the height (denoted h) of the target detection box, and the like.
The above-mentioned embodiment of determining the final detection result according to the first target set and the second target set in step S150 may include:
step S151: and judging whether the first target set and the second target set contain the same target or not.
The embodiment of the step S151 includes: matching detection frames corresponding to targets with the same category in the first target set and the second target set, wherein the category is the category of the target detection frame; and if the matching is successful, determining that the first target set and the second target set contain the same target, otherwise, determining that the first target set and the second target set do not contain the same target.
Please refer to fig. 4, a schematic diagram of calculating the ratio of the intersection area to the smaller detection box provided in the embodiment of the present application. Matching the detection boxes corresponding to targets of the same category in the first target set and the second target set includes: calculating the intersection area between the detection box of a target in the first target set and the detection box of the same-category target in the second target set, and identifying the smaller of the two detection boxes; judging whether the ratio of the intersection area to the area of the smaller detection box is greater than a preset ratio threshold; if the ratio is greater than the preset ratio threshold, the match is determined to be successful, and if it is less than or equal to the threshold, the match is determined to be unsuccessful. The ratio can be expressed by the formula r_inter / min(r1, r2), where r_inter is the intersection area between the two same-category detection boxes, r1 is the area of the target's detection box in the first target set, r2 is the area of the target's detection box in the second target set, and min(r1, r2) is the smaller of the two.
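The matching criterion can be sketched as follows; boxes use the center-based (x, y, w, h) attributes listed for target-set elements, and the 0.5 threshold is an illustrative value, not one given by the patent:

```python
def boxes_match(box1, box2, ratio_threshold=0.5):
    """Match two same-category detection boxes by the criterion above:
    intersection area over the smaller box's area must exceed the
    threshold. Boxes are (x, y, w, h) with (x, y) the box center."""
    def corners(box):
        x, y, w, h = box
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2
    l1, t1, r1, b1 = corners(box1)
    l2, t2, r2, b2 = corners(box2)
    inter_w = max(0.0, min(r1, r2) - max(l1, l2))
    inter_h = max(0.0, min(b1, b2) - max(t1, t2))
    r_inter = inter_w * inter_h                          # intersection area
    smaller = min(box1[2] * box1[3], box2[2] * box2[3])  # smaller box area
    return smaller > 0 and r_inter / smaller > ratio_threshold
```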
Step S152: and if the first target set and the second target set contain the same target, determining a final detection result according to the position relation between the target and the region image.
The embodiment of step S152 includes: judging whether the target is in a preset region (see the small rectangle indicated by the long-and-short dashed line inside the region image in fig. 3), where the ratio of the size of the preset region to the size of the region image equals a preset ratio and the center point of the preset region coincides with the center point of the region image; for example, the ratio of the size of the preset region to the size of the region image is 9:10, which can be adjusted according to the specific situation. If the target is in the preset region, the target's detection box in the second target set is determined as the final detection result, i.e. the result of target detection on the region image is taken as the final result; if the target is not in the preset region, the target's detection box in the first target set is determined as the final detection result, i.e. the result of target detection on the complete image (e.g. the first video frame) is taken as the final result. In this implementation, by setting the preset region and determining the final detection result according to whether the target falls inside it, a boundary is effectively drawn between the detection result of the complete video frame and the detection result of the region image, which improves the flexibility of determining the final detection result.
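A sketch of the preset-region test of step S152, interpreting the preset ratio as a per-dimension scale (an assumption; the text only says the two sizes have a preset ratio), with 0.9 as an illustrative value:

```python
def in_preset_region(cx, cy, region_w, region_h, preset_ratio=0.9):
    # The preset region shares the region image's center point and is
    # preset_ratio times its size in each dimension. (cx, cy) is the
    # target's center expressed in the region image's own coordinates.
    half_w = region_w * preset_ratio / 2
    half_h = region_h * preset_ratio / 2
    return (abs(cx - region_w / 2) <= half_w and
            abs(cy - region_h / 2) <= half_h)
```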
Step S153: and if the first target set comprises a target and the second target set does not comprise the target, determining a detection frame of the target in the first target set as a final detection result.
Step S154: and if the first target set does not contain a target and the second target set contains the target, determining a detection frame of the target in the second target set as a final detection result.
The embodiments of the above steps S153 to S154 are, for example: if the first target set comprises a target (i.e. a detection frame with a confidence coefficient greater than a preset threshold exists in the first target set), and the second target set does not comprise the target (i.e. a detection frame with a confidence coefficient greater than a preset threshold does not exist in the second target set), determining the detection frame of the target in the first target set as a final detection result, i.e. determining the detection frame with a confidence coefficient greater than the preset threshold in the first target set as a final detection result; if the first target set does not contain a target (i.e. the first target set does not have a detection frame with a confidence coefficient greater than a preset threshold), and the second target set contains the target (i.e. the second target set has a detection frame with a confidence coefficient greater than the preset threshold), determining the detection frame of the target in the second target set as a final detection result, i.e. determining the detection frame with a confidence coefficient greater than the preset threshold in the second target set as a final detection result; the preset threshold value here may be set according to specific situations, for example, the preset threshold value is set to 0.7 or 0.8, and so on.
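Putting steps S151 to S154 together, a minimal fusion sketch: the dict layout with "cls_type" and "box" keys is assumed, match and prefer_region stand for the two predicates sketched above, and second-set boxes are assumed to be already mapped back to full-frame coordinates using the crop offset:

```python
def merge_detections(first_set, second_set, match, prefer_region):
    """Fuse the two target sets per steps S151-S154: a target found in
    only one set keeps its detection box from that set; for a matched
    pair, the region-image box wins when the target lies inside the
    preset region, otherwise the full-image box wins."""
    final, used = [], set()
    for a in first_set:
        partner = next(
            (b for b in second_set
             if id(b) not in used
             and b["cls_type"] == a["cls_type"]
             and match(a["box"], b["box"])),
            None)
        if partner is None:
            final.append(a)  # target only in the first set (step S153)
        else:
            used.add(id(partner))
            final.append(partner if prefer_region(partner) else a)
    # Targets only in the second set (step S154).
    final.extend(b for b in second_set if id(b) not in used)
    return final
```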
In this implementation, the first video frame and the second video frame (i.e. complete images) are first extracted from the video data, the region image containing the distant target is then determined from the second video frame, and the final detection result of the target detection model is determined from the first target set detected on the first video frame and the second target set detected on the region image. It can be understood that, when the target detection model processes the complete image and the region image, the model requires a fixed input size, so the resolution of the complete image usually has to be reduced to that of the model's input image, making the reduced complete image the same size as the input image; the resolution of the region image likewise has to be scaled (reduced or possibly enlarged) to that of the input image, making the scaled region image the same size as the input image. That is, the target detection result of the first video frame (i.e. the complete image) and the target detection result of the region image are considered together; because the region image requires a smaller scaling factor than the complete video frame, the situation in which a distant target must be detected only from the scaled complete image is effectively avoided, and the accuracy of target detection is effectively improved.
Please refer to fig. 5, which illustrates a schematic structural diagram of a target detection apparatus according to an embodiment of the present application. The embodiment of the present application provides an object detection apparatus 200, including:
and a video data obtaining module 210, configured to obtain video data collected by the camera.
The video frame extracting module 220 is configured to extract a first video frame and a second video frame from the video data.
The region image determining module 230 is configured to determine a region image containing a distant object from a second video frame, where the size of the second video frame is larger than the size of the region image.
And an object set obtaining module 240, configured to perform object detection on the first video frame by using an object detection model to obtain a first object set, and perform object detection on the area image by using an object detection model to obtain a second object set.
And a final result determining module 250, configured to determine a final detection result according to the first target set and the second target set.
Optionally, in an embodiment of the present application, the region image determining module includes:
and the central coordinate determination module is used for determining the central point of the area image by taking the abscissa of the central point of the second video frame as the abscissa of the central point of the area image and taking the ordinate of the vanishing point of the second video frame as the ordinate of the central point of the area image.
And the region width calculation module is used for obtaining the transverse width and the longitudinal width of the region image.
And the central region determining module is used for determining the region image according to the central point of the region image and the transverse width and the longitudinal width of the region image.
Optionally, in an embodiment of the present application, the region width calculating module includes:
and the zooming times obtaining module is used for obtaining zooming times.
And the transverse width obtaining module is used for multiplying the scaling multiple times the transverse width of the input image of the target detection model to obtain the transverse width of the area image.
And the longitudinal width obtaining module is used for multiplying the scaling multiple times the longitudinal width of the input image of the target detection model to obtain the longitudinal width of the area image.
Optionally, in an embodiment of the present application, the final result determining module includes:
and the same target judging module is used for judging whether the first target set and the second target set contain the same target or not.
And the detection result determining module is used for determining a final detection result according to the position relation between the target and the region image if the first target set and the second target set contain the same target.
And the first result determining module is used for determining a detection frame of the target in the first target set as a final detection result if the first target set comprises the target and the second target set does not comprise the target.
And the second result determining module is used for determining a detection frame of the target in the second target set as a final detection result if the first target set does not contain the target and the second target set contains the target.
Optionally, in an embodiment of the present application, the detection result determining module includes:
and the target area judging module is used for judging whether the target is in a preset area or not, the ratio value of the size of the preset area to the size of the area image is equal to the preset ratio value, and the center point of the preset area and the center point of the area image are the same.
And the target area affirming module is used for determining a detection frame of the target in the second target set as a final detection result if the target is in the preset area.
And the target area negation module is used for determining a detection frame of the target in the first target set as a final detection result if the target is not in the preset area.
Optionally, in this embodiment of the present application, the same target determining module includes:
and the detection frame matching module is used for matching the detection frames corresponding to the targets with the same category in the first target set and the second target set.
And the same target determining module is used for determining that the first target set and the second target set contain the same target if the matching is successful, and determining that the first target set and the second target set do not contain the same target if the matching is failed.
Optionally, in an embodiment of the present application, the detection frame matching module includes:
and the area calculation screening module is used for calculating the intersection area of the targets with the same category between the detection frames of the first target set and the detection frames of the second target set, and screening out the smaller detection frames of the targets in the detection frames of the first target set and the detection frames of the second target set.
And the preset proportion judging module is used for judging whether the proportion of the intersection area to the smaller detection frame is greater than a preset proportion threshold value or not.
And the matching result determining module is used for determining that the matching is successful if the ratio of the intersection area to the smaller detection frame is greater than a preset ratio threshold, and determining that the matching is unsuccessful if the ratio of the intersection area to the smaller detection frame is not greater than the preset ratio threshold.
It should be understood that the apparatus corresponds to the above-mentioned embodiment of the target detection method and can perform the steps of that embodiment; the specific functions of the apparatus can be found in the description above, and the detailed description is omitted here where appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in memory in the form of software or firmware, or solidified in the operating system (OS) of the device.
An electronic device provided in an embodiment of the present application includes: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as above.
The embodiment of the application also provides a storage medium, wherein the storage medium is stored with a computer program, and the computer program is executed by a processor to execute the method.
The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (12)

1. A method of object detection, comprising:
acquiring video data collected by a camera;
extracting a first video frame and a second video frame from the video data;
determining a region image containing a far object from the second video frame, wherein the size of the second video frame is larger than that of the region image;
performing target detection on the first video frame by using a target detection model to obtain a first target set, and performing target detection on the area image by using the target detection model to obtain a second target set;
and determining a final detection result according to the first target set and the second target set.
2. The method of claim 1, wherein the first video frame and the second video frame are two consecutive and adjacent different video frames in the video data, or two video frames in the video data with an interval smaller than a preset number, or a same video frame in the video data.
3. The method of claim 1, wherein said determining from said second video frame a region image containing a distant object comprises:
determining the central point of the area image by taking the abscissa of the central point of the second video frame as the abscissa of the central point of the area image and taking the ordinate of the vanishing point of the second video frame as the ordinate of the central point of the area image;
obtaining the transverse width and the longitudinal width of the area image;
and determining the area image according to the central point of the area image and the transverse width and the longitudinal width of the area image.
4. The method of claim 3, wherein the obtaining the transverse width and the longitudinal width of the region image comprises:
obtaining a scaling factor;
multiplying the scaling factor by the transverse width of an input image of the target detection model to obtain the transverse width of the region image;
and multiplying the scaling factor by the longitudinal width of the input image of the target detection model to obtain the longitudinal width of the region image.
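As a minimal sketch of claims 3-4, assuming OpenCV-style numpy frames and that the vanishing point's ordinate vanish_y is already known (for example, from camera calibration); clamping the crop to the frame borders is an added assumption, not claim text:

```python
import numpy as np

def crop_region(frame: np.ndarray, vanish_y: int,
                model_w: int, model_h: int, scale: float):
    """Sketch of claims 3-4: center the region image at (frame-center x,
    vanishing-point y); its size is the model input size times a scaling
    factor."""
    frame_h, frame_w = frame.shape[:2]
    cx = frame_w // 2              # abscissa of the frame's center point
    cy = vanish_y                  # ordinate of the vanishing point
    crop_w = int(scale * model_w)  # transverse width of the region image
    crop_h = int(scale * model_h)  # longitudinal width of the region image
    x1 = max(0, cx - crop_w // 2)  # clamping to frame borders: an assumption
    y1 = max(0, cy - crop_h // 2)
    x2 = min(frame_w, x1 + crop_w)
    y2 = min(frame_h, y1 + crop_h)
    return frame[y1:y2, x1:x2], (x1, y1)
```

With a scaling factor close to 1, the crop is fed to the detector almost unscaled, while the full frame must be compressed to the model's input size; this is what lets distant targets survive in the second target set.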
5. The method of claim 1, wherein the determining a final detection result according to the first target set and the second target set comprises:
judging whether the first target set and the second target set contain the same target;
if the first target set and the second target set contain the same target, determining the final detection result according to the positional relationship between the target and the region image;
if the first target set contains a target and the second target set does not contain the target, determining the detection frame of the target in the first target set as the final detection result;
and if the first target set does not contain a target and the second target set contains the target, determining the detection frame of the target in the second target set as the final detection result.
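The merging rule of claim 5 can be sketched as below, with two hypothetical callables: is_same_target implementing the matching of claims 7-8, and choose_between implementing the position-based rule of claim 6.

```python
def merge_detections(first_set, second_set, is_same_target, choose_between):
    """Sketch of claim 5: keep targets found in only one set as-is; for
    targets found in both sets, defer to the rule of claim 6."""
    final, matched = [], set()
    for det1 in first_set:
        j = next((k for k, det2 in enumerate(second_set)
                  if k not in matched and is_same_target(det1, det2)), None)
        if j is None:
            final.append(det1)  # target appears only in the first target set
        else:
            matched.add(j)
            final.append(choose_between(det1, second_set[j]))
    # Targets that appear only in the second target set (typically distant ones).
    final.extend(det2 for k, det2 in enumerate(second_set) if k not in matched)
    return final
```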
6. The method of claim 5, wherein the determining the final detection result according to the positional relationship between the target and the region image comprises:
judging whether the target is located in a preset area, wherein the ratio of the size of the preset area to the size of the region image is equal to a preset ratio, and the preset area and the region image share the same center point;
if so, determining the detection frame of the target in the second target set as the final detection result;
and if not, determining the detection frame of the target in the first target set as the final detection result.
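A sketch of claim 6 follows, assuming full-frame coordinates for both detection frames and for region_box (the region image's position within the frame); the preset_ratio value and the "frame fully inside the preset area" test are illustrative assumptions.

```python
def choose_between(det1, det2, region_box, preset_ratio=0.8):
    """Sketch of claim 6: if the matched target lies in a preset area that
    shares the region image's center point and is preset_ratio times its
    size, keep the crop's detection frame (det2); otherwise keep the full
    frame's (det1)."""
    rx1, ry1, rx2, ry2 = region_box
    cx, cy = (rx1 + rx2) / 2.0, (ry1 + ry2) / 2.0  # shared center point
    half_w = (rx2 - rx1) * preset_ratio / 2.0
    half_h = (ry2 - ry1) * preset_ratio / 2.0
    x1, y1, x2, y2 = det2[1]                       # det = (category, box)
    inside = (cx - half_w <= x1 and cy - half_h <= y1 and
              x2 <= cx + half_w and y2 <= cy + half_h)
    return det2 if inside else det1
```

Presumably the full-frame detection frame is preferred near the crop's border because the region image may truncate a target there. When used with the merge_detections sketch above, region_box would be bound in advance, e.g. functools.partial(choose_between, region_box=box).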
7. The method of claim 5, wherein the judging whether the first target set and the second target set contain the same target comprises:
matching detection frames corresponding to targets of the same category in the first target set and the second target set;
and if the matching is successful, determining that the first target set and the second target set contain the same target; otherwise, determining that they do not contain the same target.
8. The method of claim 7, wherein the matching detection frames corresponding to targets of the same category in the first target set and the second target set comprises:
for targets of the same category, calculating the intersection area between the detection frame in the first target set and the detection frame in the second target set, and selecting the smaller of the two detection frames;
judging whether the ratio of the intersection area to the area of the smaller detection frame is larger than a preset ratio threshold;
and if so, determining that the matching is successful; otherwise, determining that the matching is unsuccessful.
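The test of claim 8 is an intersection-over-minimum comparison; a sketch with an assumed threshold value follows. Per claim 7 it would run only on same-category pairs, e.g. is_same_target = lambda d1, d2: d1[0] == d2[0] and boxes_match(d1[1], d2[1]).

```python
def boxes_match(box_a, box_b, ratio_threshold=0.5):
    """Sketch of claim 8: two detection frames match when the ratio of
    their intersection area to the smaller frame's area exceeds a preset
    threshold. ratio_threshold=0.5 is an assumed value."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    smaller = min((ax2 - ax1) * (ay2 - ay1), (bx2 - bx1) * (by2 - by1))
    return smaller > 0 and intersection / smaller > ratio_threshold
```

Dividing by the smaller frame rather than by the union (as IoU would) keeps the test tolerant when a small frame is nested inside a larger one, which is the typical cross-scale situation when pairing full-frame and region-image detections.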
9. The method according to any one of claims 1-8, wherein the first video frame and the second video frame are the same video frame in the video data, two adjacent video frames in the video data, or two video frames in the video data separated by a preset number of frames.
10. An object detection device, comprising:
the video data acquisition module is used for acquiring video data acquired by the camera;
the video frame extraction module is used for extracting a first video frame and a second video frame from the video data;
the region image determining module is used for determining a region image containing a distant target from the second video frame, wherein the size of the second video frame is larger than the size of the region image;
the target set obtaining module is used for performing target detection on the first video frame by using a target detection model to obtain a first target set, and performing target detection on the region image by using the target detection model to obtain a second target set;
and the detection result determining module is used for determining a final detection result according to the first target set and the second target set.
11. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 9.
12. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the method according to any one of claims 1 to 9.
CN202110114475.3A 2021-01-27 2021-01-27 Target detection method and device, electronic equipment and storage medium Active CN112906495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110114475.3A CN112906495B (en) 2021-01-27 2021-01-27 Target detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112906495A (en) 2021-06-04
CN112906495B (en) 2024-04-30

Family

ID=76119304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110114475.3A Active CN112906495B (en) 2021-01-27 2021-01-27 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112906495B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117993019A (en) * 2024-02-29 2024-05-07 阿里云计算有限公司 Traffic data desensitizing method, equipment, storage medium and computer program product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807385A (en) * 2019-10-24 2020-02-18 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and storage medium
WO2020103647A1 (en) * 2018-11-19 2020-05-28 腾讯科技(深圳)有限公司 Object key point positioning method and apparatus, image processing method and apparatus, and storage medium
WO2020164282A1 (en) * 2019-02-14 2020-08-20 平安科技(深圳)有限公司 Yolo-based image target recognition method and apparatus, electronic device, and storage medium
CN112001375A (en) * 2020-10-29 2020-11-27 成都睿沿科技有限公司 Flame detection method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DING Jinzhu; TAN Guoxin: "Moving Object Detection Technology Based on Video Sequences", Journal of Hunan Institute of Engineering (Natural Science Edition), no. 03 *
LIU Jun; HOU Shihao; ZHANG Kai; ZHANG Rui; HU Chaochao: "Real-Time Vehicle Detection and Tracking Based on an Enhanced Tiny YOLOV3 Algorithm", Transactions of the Chinese Society of Agricultural Engineering, no. 08 *

Also Published As

Publication number Publication date
CN112906495B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN109035299B (en) Target tracking method and device, computer equipment and storage medium
Marzougui et al. A lane tracking method based on progressive probabilistic Hough transform
CN109325964B (en) Face tracking method and device and terminal
US10803357B2 (en) Computer-readable recording medium, training method, and object detection device
CN108229297B (en) Face recognition method and device, electronic equipment and computer storage medium
US20060067562A1 (en) Detection of moving objects in a video
CN112926531B (en) Feature information extraction method, model training method, device and electronic equipment
CN109492642B (en) License plate recognition method, license plate recognition device, computer equipment and storage medium
CN111507324B (en) Card frame recognition method, device, equipment and computer storage medium
CN109255802B (en) Pedestrian tracking method, device, computer equipment and storage medium
CN111275040B (en) Positioning method and device, electronic equipment and computer readable storage medium
CN111079613B (en) Gesture recognition method and device, electronic equipment and storage medium
CN111241928B (en) Face recognition base optimization method, system, equipment and readable storage medium
CN111767915A (en) License plate detection method, device, equipment and storage medium
US11250269B2 (en) Recognition method and apparatus for false detection of an abandoned object and image processing device
CN111898610A (en) Card unfilled corner detection method and device, computer equipment and storage medium
CN114387296A (en) Target track tracking method and device, computer equipment and storage medium
CN112906495B (en) Target detection method and device, electronic equipment and storage medium
EP3044734B1 (en) Isotropic feature matching
CN115083008A (en) Moving object detection method, device, equipment and storage medium
CN110503059B (en) Face recognition method and system
CN109785343B (en) Definition-based face matting picture optimization method and device
CN116091781B (en) Data processing method and device for image recognition
US20230069608A1 (en) Object Tracking Apparatus and Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant