CN117765439A - Target object detection method, vehicle control method, device, chip and equipment - Google Patents

Target object detection method, vehicle control method, device, chip and equipment

Info

Publication number
CN117765439A
CN117765439A (application number CN202311776134.8A)
Authority
CN
China
Prior art keywords
target
image frame
image
sub
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311776134.8A
Other languages
Chinese (zh)
Inventor
邢军华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunlun Core Beijing Technology Co ltd
Original Assignee
Kunlun Core Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunlun Core Beijing Technology Co ltd filed Critical Kunlun Core Beijing Technology Co ltd
Priority to CN202311776134.8A priority Critical patent/CN117765439A/en
Publication of CN117765439A publication Critical patent/CN117765439A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure provides a target object detection method, a vehicle control method, an apparatus, a chip, a device, a storage medium, and a program product, and relates to the field of image processing, in particular to the fields of artificial intelligence, deep learning, image classification, autonomous driving, and chips. The specific implementation scheme is as follows: extracting a plurality of image frames from a video stream to be processed; determining regions of interest of a plurality of target image frames according to the image difference between two adjacent image frames among the plurality of image frames, where a target image frame is the later of the two adjacent image frames and the region of interest is the region in which a target object is located in the target image frame; cropping the plurality of target image frames according to their sizes to obtain a plurality of sub-images of each target image frame; and detecting related information of the target object in the video stream to be processed using target sub-images among the plurality of sub-images of each target image frame, where a target sub-image is a sub-image having a region of interest among the plurality of sub-images.

Description

Target object detection method, vehicle control method, device, chip and equipment
Technical Field
The present disclosure relates to the field of image processing, and in particular to the technical fields of artificial intelligence, deep learning, image classification, autonomous driving, and chips.
Background
With the rapid development of artificial intelligence technology, high-resolution images are widely used in everyday scenarios. For example, cameras now capture 4K and 8K high-resolution footage in order to record more scene detail.
A high-resolution image renders the bright and dark details of a scene clearly and thus presents the image content more vividly. However, because of its large data volume, processing a high-resolution image consumes excessive computing resources.
Disclosure of Invention
The present disclosure provides a target object detection method, a vehicle control method, an apparatus, a chip, a device, a storage medium, and a program product.
According to an aspect of the present disclosure, there is provided a target object detection method including: extracting a plurality of image frames from a video stream to be processed; determining regions of interest of a plurality of target image frames according to the image difference between two adjacent image frames among the plurality of image frames, where a target image frame is the later of the two adjacent image frames and the region of interest is the region in which a target object is located in the target image frame; cropping the plurality of target image frames according to their sizes to obtain a plurality of sub-images of each of the plurality of target image frames; and detecting related information of the target object in the video stream to be processed using target sub-images among the plurality of sub-images of each target image frame, where a target sub-image is a sub-image having a region of interest among the plurality of sub-images.
According to another aspect of the present disclosure, there is provided a vehicle control method including: acquiring related information of a target vehicle in a video stream to be processed; generating a trajectory of the target vehicle according to the related information; and controlling the travel of the target vehicle based on the trajectory, where the related information of the target vehicle is obtained by the target object detection method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a target object detection apparatus including: an extraction module for extracting a plurality of image frames from a video stream to be processed; a determining module for determining regions of interest of a plurality of target image frames according to the image difference between two adjacent image frames among the plurality of image frames, where a target image frame is the later of the two adjacent image frames and the region of interest is the region in which a target object is located in the target image frame; a cropping module for cropping the plurality of target image frames according to their sizes to obtain a plurality of sub-images of each of the plurality of target image frames; and a detection module for detecting related information of the target object in the video stream to be processed using target sub-images among the plurality of sub-images of each target image frame, where a target sub-image is a sub-image having a region of interest among the plurality of sub-images.
According to another aspect of the present disclosure, there is provided a vehicle control device including: an acquisition module for acquiring related information of a target vehicle in a video stream to be processed; a generation module for generating a trajectory of the target vehicle according to the related information; and a control module for controlling the travel of the target vehicle based on the trajectory, where the related information of the target vehicle is obtained by the target object detection apparatus provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a chip including the target object detection apparatus or the control apparatus of the vehicle provided by the present disclosure.
Another aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the target object detection method and/or the vehicle control method provided by the present disclosure.
According to another aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the target object detection method and/or the control method of the vehicle provided by the present disclosure.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the target object detection method and/or the control method of a vehicle provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a target object detection method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of determining a region of interest according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of determining a region of interest according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of cropping a target image frame according to an embodiment of the disclosure;
FIG. 5 is a flow chart diagram of a method of controlling a vehicle according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a target object detection apparatus according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of a control device of a vehicle according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a chip according to an embodiment of the disclosure; and
fig. 9 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including but not limited to users' personal information) all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
The target object detection method provided by the present disclosure will be described in detail below with reference to fig. 1.
Fig. 1 is a flow chart of a target object detection method according to an embodiment of the present disclosure.
As shown in fig. 1, the target object detection method 100 of this embodiment may include operations S110 to S140.
In operation S110, a plurality of image frames are extracted from a video stream to be processed.
In the embodiment of the present disclosure, the video stream to be processed may be video captured by a fixed camera, and an image frame is a still picture of one frame in the video stream to be processed. Each of the plurality of image frames covers the same positional range. For example, the camera is a high-resolution camera and the image frames are high-resolution images.
In the embodiment of the present disclosure, a plurality of image frames are sequentially extracted during the playing process of a video stream to be processed. The plurality of image frames may be all image frames included in the video stream to be processed, or may be image frames corresponding to a specific frame in the video stream to be processed. The extraction order of the plurality of image frames is the same as the play order of the plurality of image frames in the video stream to be processed.
For example, during playback of the video stream to be processed, a plurality of image frames are extracted at preset time intervals, so that adjacent extracted frames are separated by a preset number of frames. As another example, a specific image frame may be extracted at a specific moment, for example a moment at which the picture of the video stream to be processed changes.
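As an illustration of this extraction step, the following Python sketch pulls every N-th frame of a video stream with OpenCV, in play order. It is a minimal sketch under assumed inputs: the stream path and the frame interval are illustrative values, not ones fixed by the disclosure.

```python
# Minimal sketch of interval-based frame extraction (assumed parameters).
import cv2

def extract_frames(stream_path, frame_interval=5):
    """Yield every `frame_interval`-th frame of the stream, in play order."""
    cap = cv2.VideoCapture(stream_path)
    index = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:  # end of stream
                break
            if index % frame_interval == 0:
                yield frame
            index += 1
    finally:
        cap.release()
```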
In operation S120, a region of interest of a plurality of target image frames is determined according to an image difference between two adjacent image frames of the plurality of image frames.
In the embodiment of the disclosure, the target image frame is a subsequent image frame of two adjacent image frames, and the region of interest is a region in which the target object is located in the target image frame. For example, two adjacent image frames may be two adjacent image frames in the video stream to be processed, or two image frames adjacent in the extraction order.
In embodiments of the present disclosure, a change in pictures in a video stream to be processed over a period of time may be determined based on an image difference between two adjacent image frames. For example, with reference to a preceding image frame of two adjacent image frames, a picture change of the following image frame relative to the preceding image frame is determined.
In the embodiment of the present disclosure, the target object may be the part of the picture that changes. For example, the video stream to be processed may capture the driving course of a vehicle, and between two adjacent image frames the position or appearance of the vehicle in the picture may change. Taking the moving vehicle as the target object, its range of movement in the video picture is determined from the image difference between the two adjacent image frames; this range of movement may serve as the region of interest.
For example, a first position of the target object in the preceding image frame and a second position in the following image frame are determined, the first position is mapped into the following image frame, and the region covering the mapped first position and the second position is taken as the region of interest of the following image frame.
In operation S130, the plurality of target image frames are respectively cut according to the sizes of the plurality of target image frames, to obtain a plurality of sub-images of each of the plurality of target image frames.
In the embodiment of the present disclosure, the target image frame is cropped into a plurality of sub-images of the same size based on the size of the target image frame. The ratio of the area of the target image frame to that of a sub-image may be about 50:1 to 200:1.
For example, the target image frame may be a 4K (4096×2160 pixels) high-resolution image and may be cropped into multiple sub-images of 416×416 pixels; 4096×2160 against 416×416 gives an area ratio of roughly 51:1.
In operation S140, related information of a target object in a video stream to be processed is detected using a target sub-image among a plurality of sub-images of each target image frame, the target sub-image being a sub-image having a region of interest among the plurality of sub-images.
In the embodiment of the disclosure, the target object detection is performed by using the target sub-image of each target image frame, so that the retrieved data volume can be reduced, and the calculation resources can be saved.
For example, a target sub-image is screened from a plurality of sub-images of each target image frame; and detecting the region of interest included in the target sub-image to obtain the related information of the target object in the video stream to be processed.
In the embodiment of the present disclosure, the region other than the region of interest in the target image frame may be a background region. In cropping the large-size target image frame into the small-size sub-image, the region of interest may also be cropped into a plurality of sub-regions. For example, the plurality of sub-images may include a first sub-image and a second sub-image. The first sub-image may be a sub-image comprising a sub-region of interest and the second sub-image may be a sub-image comprising only a background region.
The first sub-images are selected from the plurality of sub-images as the target sub-images of the target image frame; there may be more than one target sub-image. Taking the target sub-images as the processing objects, the sub-regions of interest they contain are detected to obtain the related information of the target object.
By determining the region of interest in the target image frame, eliminating sub-images irrelevant to the target object and detecting only the region of interest, detection of the whole high-resolution image can be avoided, so that the data volume in the detection process is reduced, and the computing resources are saved.
In embodiments of the present disclosure, a deep learning model may be utilized for target object detection. For example, an image frame is input to a deep learning model, and a target object in the image frame is detected.
Because the data volume of a high-resolution image is very large, a large-scale deep learning model must be built to detect target objects in the whole image frame, which incurs a large resource overhead during training and inference. In addition, such a model downsamples the whole image frame multiple times, which makes it difficult to extract the features of small target objects in the image and thus causes small target objects to be missed.
In the embodiment of the disclosure, the whole target image frame is cropped into a plurality of sub-images, and a smaller-scale deep learning model is used to detect the sub-images, which reduces the scale of the deep learning model and thereby the resource overhead of training and inference.
For example, the present disclosure may train the YOLOv7 model on a graphics processing unit (GPU), employing strategies such as dynamic label assignment, re-parameterized convolution, auxiliary detection heads, model scaling, implicit knowledge learning, and efficient aggregation networks.
For example, a high resolution image frame is cropped into multiple sub-images, and a generic object detection framework may be employed to train the YOLOv7 detection model based on the multiple sub-images.
During inference, a target image frame whose region of interest has been determined is cropped into a plurality of sub-images, the sub-images are input into the trained detection model, and under the constraint of the region of interest the detection model performs target detection only on the target sub-images among them, yielding a plurality of detection results.
The multiple detection results may then be merged, and the merged results filtered using a non-maximum suppression (NMS) algorithm to obtain the detection result of the target object in the image frame.
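As a sketch of this merge-and-filter step, the code below shifts each sub-image's detections back into full-frame coordinates, concatenates them, and filters duplicates with torchvision's NMS; the per-tile input layout is an assumption, since the disclosure does not fix the detector's output format.

```python
# Sketch: merge per-sub-image detections and filter duplicates with NMS.
import torch
from torchvision.ops import nms

def merge_and_filter(tile_results, iou_threshold=0.5):
    """tile_results: list of (offset_x, offset_y, boxes, scores), where
    `boxes` is an [N, 4] tensor of (x1, y1, x2, y2) in tile coordinates."""
    all_boxes, all_scores = [], []
    for ox, oy, boxes, scores in tile_results:
        offset = torch.tensor([ox, oy, ox, oy], dtype=boxes.dtype)
        all_boxes.append(boxes + offset)  # map tile boxes to frame coordinates
        all_scores.append(scores)
    if not all_boxes:
        return torch.empty(0, 4), torch.empty(0)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_threshold)  # drop duplicate detections
    return boxes[keep], scores[keep]
```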
In addition, the method and the device define the region of interest in the target image frame, so that the deep learning model can only detect the region of interest in the sub-image, the data volume of model detection can be further reduced, the calculation speed is improved, and the calculation resources are saved.
In some embodiments, a data stream processing pipeline is established by a plurality of components to pipeline a plurality of image frames of a video stream to be processed. For example, a preprocessing component, a cropping component and an inference component are employed to form a data stream processing pipeline that processes a plurality of image frames in a pipelined fashion. For example, the component may be a software module that is executed by the processor to pipeline the plurality of image frames.
For example, the plurality of image frames includes I image frames, I being a positive integer greater than 1.
In the data stream processing pipeline, the region of interest of the (i+1)-th image frame, i=1, …, I-1, is determined from the image difference between the i-th image frame and the (i+1)-th image frame. The (i+1)-th image frame is cropped according to its size to obtain a plurality of (i+1)-th sub-images of the (i+1)-th image frame. The related information of the target object in the (i+1)-th image frame is then detected using the (i+1)-th target sub-images among the (i+1)-th sub-images.
In the embodiment of the present disclosure, once cropping of the (i+1)-th image frame starts, the region of interest of the (i+2)-th image frame is determined from the image difference between the (i+1)-th image frame and the (i+2)-th image frame. Once detection of the related information of the target object in the (i+1)-th image frame starts, the (i+2)-th image frame is cropped according to its size to obtain a plurality of (i+2)-th sub-images.
For example, by running the preprocessing component, the region of interest of the 2nd image frame is determined from the image difference between the 1st image frame and the 2nd image frame. The 2nd image frame is cropped by running the cropping component to obtain a plurality of 2nd sub-images of the 2nd image frame. By running the inference component, the plurality of 2nd sub-images are input into the detection model, and the related information of the target object in the 2nd image frame is detected using the 2nd target sub-images among them.
While the 2nd image frame is cropped by running the cropping component, the region of interest of the 3rd image frame is determined by running the preprocessing component. While the 2nd image frame undergoes target object detection by running the inference component, the 3rd image frame is cropped by running the cropping component.
In the embodiment of the disclosure, the region of interest of the 2nd image frame is determined from the image difference between the 1st image frame and the 2nd image frame, and the region of interest of the 1st image frame can be determined accordingly. Likewise, the region of interest of the I-th image frame is determined from the image difference between the (I-1)-th image frame and the I-th image frame.
In the embodiment of the disclosure, using a data stream processing pipeline to detect the regions of interest of high-resolution image frames improves both the detection speed and the detection rate of the target object. The data stream processing pipeline may be a group of modular components whose common functions are factored out so that the components follow a unified implementation standard. The individual parameters of each component task are configured through configuration files, enabling functions such as pulling the data stream, preprocessing, and model inference. By connecting the components into a data stream processing pipeline, tasks such as speech recognition, machine translation, image classification, target detection, target recognition, and target path tracking can be completed efficiently.
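One way such a pipeline could be wired is sketched below with standard-library queues and threads; the three worker functions are placeholders standing in for the preprocessing, cropping, and inference components, whose real logic is described elsewhere in this text, and the queue sizes are illustrative assumptions.

```python
# Sketch of a three-stage data stream processing pipeline (assumed interfaces).
import queue
import threading

def run_stage(worker, inbox, outbox):
    """Generic stage: apply `worker` to each item from `inbox` and forward
    the result; a None item is a shutdown signal passed downstream."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(worker(item))

# Placeholder workers for the preprocessing, cropping and inference components.
def preprocess(frame_pair):   # would compute the region of interest
    prev_frame, cur_frame = frame_pair
    return cur_frame, "roi"

def crop(frame_and_roi):      # would slide-window the frame into sub-images
    frame, roi = frame_and_roi
    return roi, [frame]

def infer(roi_and_tiles):     # would run the detection model on target tiles
    roi, tiles = roi_and_tiles
    return ["detection"]

queues = [queue.Queue(maxsize=4) for _ in range(4)]
for worker, q_in, q_out in zip((preprocess, crop, infer), queues, queues[1:]):
    threading.Thread(target=run_stage, args=(worker, q_in, q_out),
                     daemon=True).start()
# While frame i+1 is being cropped, frame i+2 can already be preprocessed,
# which is the stage overlap described above.
```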
The process of determining a region of interest in target object detection provided by the present disclosure will be described in detail below with reference to FIG. 2.
Fig. 2 is a schematic diagram of determining a region of interest according to an embodiment of the present disclosure.
As shown in fig. 2, in an embodiment 200, two adjacent image frames are a reference image frame 210 and a target image frame 220, respectively. The target image frame 220 is a subsequent image frame to the reference image frame 210. For example, the target image frame 220 is an image frame extracted at a time subsequent to the extraction time of the reference image frame 210.
In the embodiment of the present disclosure, the reference image frame 210 and the target image frame 220 are determined from among a plurality of image frames, reference distribution information of a plurality of reference objects in the reference image frame 210 is determined, and target distribution information of a plurality of objects to be determined in the target image frame 220 is determined. A target object is determined from a plurality of objects to be determined according to a difference between the reference distribution information and the target distribution information, and the region of interest 230 of the target image frame 220 is determined according to the distribution information of the target object.
In the disclosed embodiment, a plurality of objects included in the reference image frame 210 and the target image frame 220 are acquired. For example, the object may be a plurality of physical objects included in the image frame.
As shown in fig. 2, in the reference image frame 210, the reference objects include a pedestrian overpass 211, a pedestrian 212, and vehicles 213, 214. In the target image frame 220, the object to be determined includes a pedestrian overpass 221, a pedestrian 222, and vehicles 223, 224.
In the embodiment of the present disclosure, the distribution information may be position information of a plurality of objects in a corresponding image frame. For example, the distribution information may be coordinate information of a plurality of objects. Coordinate information of a center point of the object is taken as coordinate information of the object. For example, the reference distribution information includes coordinate information of the pedestrian overpass 211, the pedestrian 212, and the vehicles 213, 214 in the reference image frame 210. The target distribution information includes coordinate information of pedestrian overpass 221, pedestrian 222, and vehicles 223, 224 in target image frame 220.
In the embodiment of the present disclosure, the coordinate information of a plurality of reference objects in the reference image frame 210 is taken as a reference value, and the movement information of a plurality of objects to be determined corresponding to the plurality of reference objects is determined, so that the object whose position moves in the objects to be determined is determined as a target object. For example, the same object in the reference image frame 210 and the target image frame 220 has a correspondence relationship. Pedestrian overpass 211 corresponds to pedestrian overpass 221, pedestrian 212 corresponds to pedestrian 222, vehicle 213 corresponds to vehicle 223, and vehicle 214 corresponds to vehicle 224.
The coordinate information of objects having a correspondence relationship is compared, and the target object is determined from the objects to be determined.
In the embodiment of the disclosure, the coordinate information of the pedestrian overpass 211 is compared with that of the pedestrian overpass 221, the coordinate information of the pedestrian 212 with that of the pedestrian 222, the coordinate information of the vehicle 213 with that of the vehicle 223, and the coordinate information of the vehicle 214 with that of the vehicle 224. It is determined that the position of the pedestrian 222 has changed relative to the pedestrian 212, the position of the vehicle 223 has changed relative to the vehicle 213, and the position of the vehicle 224 has changed relative to the vehicle 214, while the position of the pedestrian overpass 221 is unchanged relative to the pedestrian overpass 211. Based on this comparison, the pedestrian 222 and the vehicles 223, 224, whose positions have changed, are determined to be target objects.
In the disclosed embodiment, the reference image frame 210 may also be overlapped with the target image frame 220 to obtain an overlapped image, and the objects in the overlapped image are compared with the objects in the target image frame 220 to determine the target object. For example, an object that completely coincides between the reference image frame 210 and the target image frame 220 may be regarded as a stationary object, while an object that does not completely coincide may be regarded as a moving object, and the moving object is determined to be a target object.
In the disclosed embodiment, the region of interest of the target image frame 220 is determined from the distribution information of the target objects in the target image frame 220. For example, the position information of the target objects in the target image frame 220 is determined, and a region of interest is generated from this position information so that it covers all of the target objects.
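A minimal sketch of this step follows, assuming each object is represented by an identity-keyed (center_x, center_y, width, height) tuple and that objects have already been matched across the two frames; the movement tolerance is an illustrative assumption.

```python
# Sketch: region of interest covering every object whose center moved.
def roi_from_object_motion(ref_objs, tgt_objs, tol=2.0):
    """ref_objs/tgt_objs: dicts mapping object id -> (cx, cy, w, h).
    Returns one (x1, y1, x2, y2) box covering all moved objects, or None."""
    moved = []
    for obj_id, (cx, cy, w, h) in tgt_objs.items():
        ref = ref_objs.get(obj_id)
        if ref is None or abs(cx - ref[0]) > tol or abs(cy - ref[1]) > tol:
            moved.append((cx, cy, w, h))  # moved, or newly appeared
    if not moved:
        return None
    return (min(cx - w / 2 for cx, cy, w, h in moved),
            min(cy - h / 2 for cx, cy, w, h in moved),
            max(cx + w / 2 for cx, cy, w, h in moved),
            max(cy + h / 2 for cx, cy, w, h in moved))
```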
In some embodiments, boundary information of the region of interest 230 of the target image frame 220 is acquired, and the bounding rectangle of the region of interest 230 is determined from the boundary information; the bounding rectangle is used to annotate the range of the region of interest 230 in the target image frame.
For example, the region of interest of the target image frame 220 may be a single region covering all of the target objects. For this region of interest, the coordinate information of its boundary is determined, and the bounding rectangle of the region of interest is determined from the boundary coordinates. For example, the bounding rectangle of the region of interest may be the bounding rectangle of the plurality of target objects.
For example, the position information of the bounding rectangle may be represented by the coordinates of its upper-left vertex together with its width and height.
According to the embodiment of the disclosure, the range of the region of interest is annotated by the bounding rectangle, and during inference only the region within the bounding rectangle of a target sub-image is detected, which reduces the amount of data to be processed.
The process of determining a region of interest in target object detection provided by the present disclosure will be described in detail below with reference to FIG. 3.
Fig. 3 is a schematic diagram of determining a region of interest according to another embodiment of the present disclosure.
As shown in fig. 3, in an embodiment 300, two adjacent image frames are a reference image frame 310 and a target image frame 320, respectively. The target image frame 320 is a subsequent image frame to the reference image frame 310. For example, the target image frame 320 is an image frame extracted at a time subsequent to the extraction time of the reference image frame 310.
In the embodiment of the present disclosure, the reference image frame 310 and the target image frame 320 are determined from a plurality of image frames, pixel values of a plurality of reference pixels of the reference image frame 310 and pixel values of a plurality of target pixels of the target image frame 320 are acquired, and a region of interest of the target image frame 320 is determined according to differences between the pixel values of the plurality of reference pixels and the pixel values of the plurality of target pixels.
For example, all pixels included in the reference image frame 310 are reference pixels and all pixels included in the target image frame 320 are target pixels.
The difference between the pixel values of the plurality of reference pixels and the pixel values of the plurality of target pixels may be a pixel value difference between the reference pixel and the corresponding target pixel.
For example, the reference pixel and the target pixel located at the same position in the reference image frame and the target image frame, respectively, have a correspondence relationship. The pixels of interest among the plurality of target pixels are acquired according to the pixel value differences between the target pixels and their corresponding reference pixels, a pixel of interest being a target pixel whose pixel value difference from its corresponding reference pixel is not zero. The region of interest is then determined from the distribution region of the pixels of interest.
As shown in fig. 3, the reference pixels located in the first row and first column of the reference image frame 310 correspond to the target pixels located in the first row and first column of the target image frame 320. The region of interest is determined by determining the portion of the target image frame 320 that varies relative to the reference image frame 310 based on the pixel differences between each reference pixel in the reference image frame 310 and the corresponding target pixel in the target image frame 320.
For example, the pixel values of all reference pixels in the reference image frame 310 and the pixel values of all target pixels in the target image frame 320 are acquired. Taking the pixel value of each reference pixel as the reference value, the target pixels whose pixel values have changed are determined: the pixel value difference between each target pixel and its corresponding reference pixel is calculated, and when the difference is not zero, the pixel value of that pixel is considered to have changed, so the pixel is determined to be a pixel of interest.
For example, the pixel values of the pixel region 321 of the target image frame 320 do not change relative to the pixel values of the pixel region 311 of the reference image frame 310, whereas the pixel values of the pixel regions 322, 323, and 324 of the target image frame 320 change relative to the corresponding pixel regions of the reference image frame 310.
Correspondingly, the pixel values of the pixel regions 312, 313, and 314 of the reference image frame 310 change relative to the corresponding pixel regions of the target image frame 320.
In this case, the pixel regions 322, 323, 324 of the target image frame 320 and the pixel regions corresponding to the pixel regions 312, 313, 314 may be determined as target pixel regions, and the region of interest may be determined from these target pixel regions.
For example, the regions of interest 331, 332, 333, 334 may be determined from the bounding rectangles of the pixel regions 322, 323, 324 of the target image frame 320 and of the pixel regions 312, 313, 314, respectively.
In the embodiment of the present disclosure, the region of interest of the target image frame 320 may consist of a plurality of regions that cover the pixel regions whose pixel values change in the target image frame 320. For each region of interest, the coordinate information of its boundary is determined, and the bounding rectangle of the region of interest is determined from the boundary coordinates. For example, the bounding rectangle of a region of interest may be the bounding rectangle of a group of target pixel regions.
For example, the position information of the bounding rectangle may be represented by the coordinates of its upper-left vertex together with its width and height.
According to the embodiment of the disclosure, the range of the region of interest is annotated by the bounding rectangle, and during inference only the region within the bounding rectangle of a target sub-image is detected, which reduces the amount of data to be processed.
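A sketch of this pixel-difference path using OpenCV follows, returning one bounding rectangle per changed region; the grayscale conversion and the noise threshold are illustrative choices, since the disclosure compares pixel values directly and treats any non-zero difference as a pixel of interest.

```python
# Sketch: regions of interest from the pixel difference of two aligned frames.
import cv2

def rois_from_pixel_diff(ref_frame, tgt_frame, noise_thresh=0):
    """Return bounding rectangles (x, y, w, h) of the changed pixel regions."""
    ref_gray = cv2.cvtColor(ref_frame, cv2.COLOR_BGR2GRAY)
    tgt_gray = cv2.cvtColor(tgt_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(ref_gray, tgt_gray)
    # Pixels of interest: difference above the (zero by default) threshold.
    _, mask = cv2.threshold(diff, noise_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```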
The process of cropping the target image frame provided by the present disclosure will be described in detail below in conjunction with fig. 4.
Fig. 4 is a schematic diagram of cropping a target image frame according to an embodiment of the present disclosure.
As shown in fig. 4, a target image frame 400 includes a target object 410. For example, cropping the target image frame 400 according to its size yields a plurality of sub-images of the target image frame 400.
According to the embodiment of the present disclosure, for each target image frame, a cropping sliding window 420 is set according to the size of the target image frame, the window having a length of m and a width of n, where m and n are positive integers. The cropping sliding window 420 is slid across the target image frame in the horizontal direction with a first step a and in the vertical direction with a second step b, yielding a plurality of sub-images, where the first step is smaller than m and the second step is smaller than n.
For example, for a 4K or 8K high-resolution target image frame 400, the size of the cropping sliding window 420 is 416×416 pixels.
The cropping sliding window 420 is slid starting from the lower-left vertex of the target image frame 400. After each slide of the cropping sliding window 420, the area it covers is a cropped sub-image.
As shown in fig. 4, the region covered by the cropping sliding window 420 is one sub-image (sub-image G1). The window slides horizontally by the first step a to the position of the dashed box 430, and the region covered by the dashed box 430 is another sub-image (sub-image G2). Since the first step a is smaller than the window length m, the sub-images G1 and G2 overlap.
The window then slides vertically by the second step b to the position of the dashed box 440, and the region covered by the dashed box 440 is a further sub-image (sub-image G3). Since the second step b is smaller than the window width n, the sub-images G1 and G3 overlap.
In the embodiment of the disclosure, to avoid a target object at a cropping edge being split into two parts during cropping and therefore not being accurately recognized, a distance smaller than the side length of the cropping sliding window is used as the step, so that adjacent sub-images share an overlapping area. The same target object can then be identified through the overlapping parts of the sub-images, improving the accuracy of target object detection.
For example, the area ratio between the overlapping region and the target image frame 400 may be 15%. For example, the cropping sliding window 420 may be 10×10 in size, with a step of about 7.5.
Through the embodiment of the disclosure, setting the overlapping area allows each target object to be detected completely. For the duplicate detections of the same target object produced in the overlapping area, the detection results may be filtered with the NMS algorithm to obtain the final detection result.
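A sketch of the overlapping sliding-window crop described above follows, assuming the frame is a NumPy array; how the right and bottom remainders are handled (here, a final window clamped to the frame border) is an assumption the disclosure does not spell out.

```python
# Sketch: overlapping sliding-window cropping with steps smaller than the window.
import numpy as np

def slide_crop(frame, m=416, n=416, step_a=312, step_b=312):
    """Crop `frame` into m-by-n tiles with horizontal step a < m and vertical
    step b < n, so adjacent tiles overlap; returns (x, y, tile) triples."""
    height, width = frame.shape[:2]
    xs = list(range(0, max(width - m, 0) + 1, step_a))
    ys = list(range(0, max(height - n, 0) + 1, step_b))
    if xs and xs[-1] + m < width:   # clamp a last window to the right border
        xs.append(width - m)
    if ys and ys[-1] + n < height:  # clamp a last window to the bottom border
        ys.append(height - n)
    return [(x, y, frame[y:y + n, x:x + m]) for y in ys for x in xs]

# e.g. a 4K frame yields overlapping 416x416 tiles:
tiles = slide_crop(np.zeros((2160, 4096, 3), dtype=np.uint8))
```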
Based on the target object detection method provided by the present disclosure, the present disclosure further provides a control method of the vehicle, which will be described in detail below with reference to fig. 5.
Fig. 5 is a flow chart of a control method of a vehicle according to an embodiment of the present disclosure.
As shown in fig. 5, the control method 500 of the vehicle of the embodiment may include operations S510 to S530.
In operation S510, information about a target vehicle in a video stream to be processed is acquired.
According to an embodiment of the present disclosure, the related information of the target vehicle in operation S510 is obtained using the target object detection method provided by the present disclosure.
In operation S520, a trajectory of the target vehicle is generated according to the related information.
In operation S530, the travel of the target vehicle is controlled based on the trajectory.
In the embodiment of the present disclosure, the related information may be the position information of the target vehicle in each image frame of the video stream. From the position information of the target vehicle in each image frame, the trajectory of the target vehicle over a period of time can be generated, so that the manner of controlling the autonomous driving of the target vehicle can be determined according to its trajectory.
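A minimal sketch of turning per-frame related information into a trajectory follows, with a constant-velocity guess at the next position as one example of how a controller might consume the track; the (timestamp, x, y) representation and the prediction step are illustrative assumptions, not the disclosure's control logic.

```python
# Sketch: trajectory from per-frame positions plus a constant-velocity prediction.
def build_trajectory(detections):
    """detections: (timestamp, x, y) positions of the target vehicle, one per
    processed frame; timestamps are assumed strictly increasing."""
    track = sorted(detections)
    prediction = None
    if len(track) >= 2:
        (t0, x0, y0), (t1, x1, y1) = track[-2], track[-1]
        dt = t1 - t0
        # Extrapolate one step ahead at the last observed velocity.
        prediction = (t1 + dt, x1 + (x1 - x0), y1 + (y1 - y0))
    return track, prediction
```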
By the above vehicle control method, the time needed to determine the position information of the target vehicle is effectively reduced and the computing resources required to determine it are saved, so that the method is better suited to resource-constrained hardware terminals and the control efficiency is improved. Implementing the vehicle control method provided by the present disclosure on a resource-constrained hardware terminal thus reduces resource overhead.
Based on the target object detection method provided by the present disclosure, the present disclosure further provides a target object detection device, which will be described in detail below with reference to fig. 6.
Fig. 6 is a block diagram of a target object detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the target object detection apparatus 600 of this embodiment may include an extraction module 610, a determination module 620, a clipping module 630, and a detection module 640.
The extracting module 610 is configured to extract a plurality of image frames from a video stream to be processed. In an embodiment, the extracting module 610 may be configured to perform the operation S110 described above, which is not described herein.
The determining module 620 is configured to determine, according to an image difference between two adjacent image frames in the plurality of image frames, a region of interest of the plurality of target image frames, where the target image frame is a subsequent image frame in the two adjacent image frames, and the region of interest is a region in which the target object in the target image frame is located. In an embodiment, the determining module 620 is configured to perform the operation S120 described above, which is not described herein.
The cropping module 630 is configured to crop the plurality of target image frames according to the sizes of the plurality of target image frames, respectively, to obtain a plurality of sub-images of each of the plurality of target image frames. In an embodiment, the clipping module 630 may be used to perform the operation S130 described above, which is not described herein.
The detection module 640 is configured to detect information about a target object of the video stream to be processed using a target sub-image of a plurality of sub-images of each target image frame, where the target sub-image is a sub-image of the plurality of sub-images having a region of interest. In an embodiment, the detection module 640 may be configured to perform the operation S140 described above, which is not described herein.
The determining module 620 is further configured to determine a reference image frame and a target image frame from a plurality of image frames, where the reference image frame and the target image frame are two adjacent image frames, and the target image frame is a subsequent image frame to the reference image frame; determining reference distribution information of a plurality of reference objects in a reference image frame, and determining target distribution information of a plurality of objects to be determined in a target image frame; determining a target object from a plurality of objects to be determined according to a difference between the reference distribution information and the target distribution information; and determining the region of interest of the target image frame according to the distribution information of the target object.
The determining module 620 is further configured to: determining a reference image frame and a target image frame from a plurality of image frames, the reference image frame and the target image frame being two adjacent image frames, the target image frame being a subsequent image frame to the reference image frame; acquiring pixel values of a plurality of reference pixels of a reference image frame and pixel values of a plurality of target pixels of a target image frame; and determining a region of interest of the target image frame based on differences between pixel values of the plurality of reference pixels and pixel values of the plurality of target pixels.
The determining module 620 is further configured to: acquiring the pixel of interest in the plurality of target pixels according to pixel value differences between the plurality of target pixels and the corresponding plurality of reference pixels respectively, wherein the pixel value differences between the pixel of interest and the corresponding reference pixel are not zero, and the reference pixels and the target pixels respectively positioned at the same positions in the reference image frame and the target image frame have corresponding relations; and determining the region of interest according to the distribution region of the pixels of interest.
The determining module 620 is further configured to: acquiring boundary information of a region of interest; and determining an external rectangle of the region of interest according to the boundary information, wherein the external rectangle is used for annotating the range of the region of interest of the target image frame.
According to an embodiment of the present disclosure, the cropping module 630 is further configured to set a cropping sliding window according to the sizes of the plurality of target image frames, the window having a length of m and a width of n, where m and n are positive integers; and, for each target image frame, to slide the cropping sliding window across the target image frame in the horizontal direction with a first step and in the vertical direction with a second step to obtain a plurality of sub-images, where the first step is smaller than m and the second step is smaller than n.
The detection module 640 is further configured to screen a target sub-image from a plurality of sub-images of each target image frame according to an embodiment of the present disclosure; and detecting the region of interest included in the target sub-image to obtain the related information of the target object in the video stream to be processed.
According to an embodiment of the present disclosure, the plurality of image frames includes I image frames, I being a positive integer greater than 1. The determining module 620 is further configured to determine the region of interest of the (i+1)-th image frame, i=1, …, I-1, according to the image difference between the i-th image frame and the (i+1)-th image frame. The cropping module 630 is further configured to crop the (i+1)-th image frame according to its size to obtain a plurality of (i+1)-th sub-images of the (i+1)-th image frame. When the cropping module 630 starts performing the above operation, that is, when cropping of the (i+1)-th image frame starts, the determining module 620 is further configured to determine the region of interest of the (i+2)-th image frame according to the image difference between the (i+1)-th image frame and the (i+2)-th image frame.
According to an embodiment of the present disclosure, the detection module 640 is further configured to detect the related information of the target object in the (i+1)-th image frame by using the (i+1)-th target sub-images among the (i+1)-th sub-images. When the detection module 640 starts performing the above operation, that is, when detection of the related information of the target object in the (i+1)-th image frame starts, the cropping module 630 is further configured to crop the (i+2)-th image frame according to its size to obtain a plurality of (i+2)-th sub-images.
Based on the control method of the vehicle provided by the present disclosure, the present disclosure further provides a control device of the vehicle, which will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of a control device of a vehicle according to an embodiment of the present disclosure.
As shown in fig. 7, the control device 700 of the vehicle of this embodiment may include an acquisition module 710, a generation module 720, and a control module 730.
The acquiring module 710 is configured to acquire information about a target vehicle in a video stream to be processed. In an embodiment, the obtaining module 710 may be used to perform the operation S510 described above, which is not described herein.
The related information of the target vehicle is obtained by adopting the target object detection device provided by the disclosure.
The generating module 720 is configured to generate a track of the target vehicle according to the related information. In an embodiment, the generating module 720 may be configured to perform the operation S520 described above, which is not described herein.
The control module 730 is used to control the travel of the target vehicle based on the trajectory. In an embodiment, the control module 730 may be configured to perform the operation S530 described above, which is not described herein.
Based on the target object detection device and the control device of the vehicle provided by the present disclosure, the present disclosure further provides a chip, which will be described in detail below with reference to fig. 8.
Fig. 8 is a block diagram of a chip according to an embodiment of the disclosure.
As shown in fig. 8, the chip 800 of this embodiment may include a device 810.
In the disclosed embodiment, the device 810 may be a target object detection device or a control device of a vehicle.
For example, the device 810 may be the target object detection device 600 provided above or the control device 700 of the vehicle.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of users' personal information all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement the methods of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the target object detection method and the control method of the vehicle. For example, in some embodiments, the target object detection method and the control method of the vehicle may be implemented as computer software programs, which are tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described target object detection method and control method of the vehicle may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the target object detection method and the control method of the vehicle by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A target object detection method, comprising:
extracting a plurality of image frames from a video stream to be processed;
determining a region of interest for each of a plurality of target image frames according to an image difference between two adjacent image frames among the plurality of image frames, wherein the target image frame is the subsequent image frame of the two adjacent image frames, and the region of interest is a region in which a target object in the target image frame is located;
cropping the plurality of target image frames according to the sizes of the plurality of target image frames, respectively, to obtain a plurality of sub-images of each target image frame among the plurality of target image frames; and
detecting related information of the target object in the video stream to be processed by using a target sub-image among the plurality of sub-images of each target image frame, wherein the target sub-image is a sub-image, among the plurality of sub-images, in which a region of interest is located.
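By way of editorial illustration only, and not as part of the claims, the four steps of claim 1 might be composed as follows in Python. The names find_roi, crop_subimages, and detect are hypothetical placeholders (possible forms of the first two are sketched after claims 4 and 6 below), and the overlap test is one assumed reading of "a sub-image in which a region of interest is located":

    # Hypothetical composition of the claim-1 pipeline; helper names are assumptions.
    def detect_in_stream(frames, find_roi, crop_subimages, detect):
        """frames: image frames extracted from the video stream to be processed."""
        results = []
        for prev_frame, frame in zip(frames, frames[1:]):
            roi = find_roi(prev_frame, frame)            # ROI of the subsequent frame
            if roi is None:                              # no inter-frame difference
                continue
            for sub_img, box in crop_subimages(frame):   # sub-images of the target frame
                if overlaps(box, roi):                   # keep only target sub-images
                    results.append(detect(sub_img))      # detect on the kept sub-images
        return results

    def overlaps(box, roi):
        """True if an (x1, y1, x2, y2) sub-image box intersects the region of interest."""
        return not (box[2] < roi[0] or roi[2] < box[0]
                    or box[3] < roi[1] or roi[3] < box[1])

Restricting detection to the overlapping sub-images is what saves the computation that full-frame processing of a high-resolution image would otherwise require.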
2. The method of claim 1, wherein the determining a region of interest for each of a plurality of target image frames according to an image difference between two adjacent image frames among the plurality of image frames comprises:
determining a reference image frame and a target image frame from the plurality of image frames, the reference image frame and the target image frame being the two adjacent image frames, the target image frame being a subsequent image frame to the reference image frame;
determining reference distribution information of a plurality of reference objects in the reference image frame, and determining target distribution information of a plurality of objects to be determined in the target image frame;
determining the target object from the plurality of objects to be determined according to the difference between the reference distribution information and the target distribution information; and
determining the region of interest of the target image frame according to the distribution information of the target object.
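A minimal sketch of the claim-2 logic, assuming the distribution information takes the form of (x1, y1, x2, y2) bounding boxes and that an intersection-over-union matching rule (an editorial assumption, not claim language) decides whether an object to be determined already appears in the reference frame:

    # Sketch of claim 2: an object in the target frame whose box matches no box in
    # the reference frame is taken to have moved, i.e., to be a target object.
    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def target_objects(reference_boxes, candidate_boxes, match_thresh=0.7):
        """Boxes of objects to be determined that differ from every reference object."""
        return [c for c in candidate_boxes
                if all(iou(c, r) < match_thresh for r in reference_boxes)]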
3. The method of claim 1, wherein the determining a region of interest for each of a plurality of target image frames according to an image difference between two adjacent image frames among the plurality of image frames comprises:
determining a reference image frame and a target image frame from the plurality of image frames, the reference image frame and the target image frame being the two adjacent image frames, the target image frame being a subsequent image frame to the reference image frame;
acquiring pixel values of a plurality of reference pixels of the reference image frame and pixel values of a plurality of target pixels of the target image frame; and
determining a region of interest of the target image frame according to differences between the pixel values of the plurality of reference pixels and the pixel values of the plurality of target pixels.
4. The method of claim 3, wherein the determining a region of interest of the target image frame according to differences between the pixel values of the plurality of reference pixels and the pixel values of the plurality of target pixels comprises:
acquiring pixels of interest among the plurality of target pixels according to pixel value differences between the plurality of target pixels and the corresponding plurality of reference pixels, wherein the pixel value difference between a pixel of interest and its corresponding reference pixel is not zero, and a reference pixel and a target pixel located at the same position in the reference image frame and the target image frame, respectively, have a correspondence; and
determining the region of interest according to the distribution region of the pixels of interest.
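One possible concretization of claims 3-4, and a possible form of the find_roi placeholder used after claim 1, in Python with OpenCV; the grayscale conversion and the NumPy bounding computation are editorial assumptions. Pixels whose values differ between the two adjacent frames are the pixels of interest, and their distribution region yields the region of interest:

    # Sketch of claims 3-4: nonzero per-pixel differences mark pixels of interest.
    import cv2
    import numpy as np

    def find_roi(reference_frame, target_frame):
        """Returns (x1, y1, x2, y2) covering all pixels of interest, or None."""
        ref = cv2.cvtColor(reference_frame, cv2.COLOR_BGR2GRAY)
        tgt = cv2.cvtColor(target_frame, cv2.COLOR_BGR2GRAY)
        mask = cv2.absdiff(ref, tgt) != 0      # pixel value difference is not zero
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None                        # frames identical: no target object
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())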
5. The method of any one of claims 2-4, wherein the determining a region of interest for each of a plurality of target image frames according to an image difference between two adjacent image frames among the plurality of image frames further comprises:
acquiring boundary information of the region of interest; and
and determining an external rectangle of the region of interest according to the boundary information, wherein the external rectangle is used for annotating the range of the region of interest of the target image frame.
6. The method of claim 1, wherein the cropping the plurality of target image frames according to the sizes of the plurality of target image frames, respectively, to obtain a plurality of sub-images of each target image frame among the plurality of target image frames comprises:
setting a cropping sliding window according to the sizes of the plurality of target image frames, wherein the length of the cropping sliding window is m and the width of the cropping sliding window is n, m and n being positive integers; and
for each target image frame, sliding the cropping sliding window across the target image frame in the horizontal direction with a first step size, and sliding the cropping sliding window across the target image frame in the vertical direction with a second step size, to obtain the plurality of sub-images;
wherein the first step size is smaller than m, and the second step size is smaller than n.
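A sketch of the claim-6 cropping with illustrative window and step values (640 and 512 are editorial assumptions; the only constraints the claim imposes are step_x < m and step_y < n). Because each step is smaller than the window dimension, adjacent sub-images overlap, so an object cut by one window boundary still appears whole in a neighboring sub-image:

    # Sketch of claim 6: slide an m-by-n window with steps smaller than m and n.
    def crop_subimages(frame, m=640, n=640, step_x=512, step_y=512):
        """Yields (sub_image, (x1, y1, x2, y2)); trailing-edge remainders are
        ignored here for brevity."""
        h, w = frame.shape[:2]
        for y in range(0, max(h - n, 0) + 1, step_y):
            for x in range(0, max(w - m, 0) + 1, step_x):
                yield frame[y:y + n, x:x + m], (x, y, x + m, y + n)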
7. The method of claim 1, wherein the detecting related information of the target object in the video stream to be processed by using a target sub-image among the plurality of sub-images of each target image frame comprises:
screening the target sub-image from a plurality of sub-images of each target image frame; and
and detecting the region of interest included in the target sub-image to obtain the related information of the target object in the video stream to be processed.
8. The method of claim 1, wherein the plurality of image frames comprises I image frames, I being a positive integer greater than 1, the method further comprising:
determining a region of interest of an (i+1)-th image frame, for i = 1, …, I-1, according to an image difference between the i-th image frame and the (i+1)-th image frame;
cropping the (i+1)-th image frame according to a size of the (i+1)-th image frame to obtain a plurality of (i+1)-th sub-images of the (i+1)-th image frame; and
in the case that it is determined to start cropping the (i+1)-th image frame, determining a region of interest of the (i+2)-th image frame according to an image difference between the (i+1)-th image frame and the (i+2)-th image frame.
9. The method of claim 8, further comprising:
detecting related information of the target object in the (i+1)-th image frame by using an (i+1)-th target sub-image among the plurality of (i+1)-th sub-images; and
in the case that it is determined to start detecting the related information of the target object in the (i+1)-th image frame, cropping the (i+2)-th image frame according to a size of the (i+2)-th image frame to obtain a plurality of (i+2)-th sub-images.
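Claims 8-9 describe a pipelined schedule: ROI determination for frame i+2 may begin once cropping of frame i+1 has begun, and cropping of frame i+2 may begin once detection on frame i+1 has begun. A thread-and-queue sketch of that overlap in Python follows; the concurrency mechanism is an editorial assumption, as the claims do not prescribe one:

    # Sketch of claims 8-9: three stages overlapped via size-1 queues, so each
    # stage can start on frame i+2 as soon as the next stage has taken frame i+1.
    import queue
    import threading

    def pipeline(frames, find_roi, crop_subimages, detect_subimages):
        rois, crops, out = queue.Queue(1), queue.Queue(1), []

        def stage_roi():                       # ROI of each (i+1)-th frame
            for prev, cur in zip(frames, frames[1:]):
                rois.put((cur, find_roi(prev, cur)))
            rois.put(None)                     # end-of-stream sentinel

        def stage_crop():                      # cropping into (i+1)-th sub-images
            while (item := rois.get()) is not None:
                frame, roi = item
                crops.put((list(crop_subimages(frame)), roi))
            crops.put(None)

        def stage_detect():                    # detection on the (i+1)-th sub-images
            while (item := crops.get()) is not None:
                subs, roi = item
                out.append(detect_subimages(subs, roi))

        threads = [threading.Thread(target=f)
                   for f in (stage_roi, stage_crop, stage_detect)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return out

The size-1 queues enforce exactly the claimed ordering: a stage cannot hand off frame i+2 until the downstream stage has accepted, i.e., started on, frame i+1.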
10. A control method of a vehicle, comprising:
acquiring related information of a target vehicle in a video stream to be processed;
generating a trajectory of the target vehicle according to the related information; and
controlling travel of the target vehicle based on the trajectory;
wherein the related information of the target vehicle is obtained by the method according to any one of claims 1 to 9.
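For claim 10, the control path is, in outline, detections → trajectory → control. The sketch below assumes per-frame (timestamp, x, y) detections and a hypothetical controller.follow interface; neither is prescribed by the claims:

    # Sketch of claim 10: build a time-ordered trajectory and hand it to a
    # (hypothetical) controller that governs the vehicle's travel.
    def control_from_detections(detections, controller):
        """detections: iterable of (timestamp, x, y) for the target vehicle."""
        trajectory = sorted(detections)    # order the related information by time
        controller.follow(trajectory)      # control travel based on the trajectory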
11. A target object detection apparatus comprising:
an extraction module, configured to extract a plurality of image frames from a video stream to be processed;
a determining module, configured to determine a region of interest for each of a plurality of target image frames according to an image difference between two adjacent image frames among the plurality of image frames, wherein the target image frame is the subsequent image frame of the two adjacent image frames, and the region of interest is a region in which a target object in the target image frame is located;
a cropping module, configured to crop the plurality of target image frames according to the sizes of the plurality of target image frames, respectively, to obtain a plurality of sub-images of each target image frame among the plurality of target image frames; and
a detection module, configured to detect related information of the target object in the video stream to be processed by using a target sub-image among the plurality of sub-images of each target image frame, wherein the target sub-image is a sub-image, among the plurality of sub-images, in which a region of interest is located.
12. The apparatus of claim 11, wherein the determining module is further configured to:
determine a reference image frame and a target image frame from the plurality of image frames, the reference image frame and the target image frame being the two adjacent image frames, and the target image frame being the subsequent image frame to the reference image frame;
determine reference distribution information of a plurality of reference objects in the reference image frame, and determine target distribution information of a plurality of objects to be determined in the target image frame;
determine the target object from the plurality of objects to be determined according to a difference between the reference distribution information and the target distribution information; and
determine the region of interest of the target image frame according to the distribution information of the target object.
13. The apparatus of claim 11, wherein the determining module is further configured to:
determine a reference image frame and a target image frame from the plurality of image frames, the reference image frame and the target image frame being the two adjacent image frames, and the target image frame being the subsequent image frame to the reference image frame;
acquire pixel values of a plurality of reference pixels of the reference image frame and pixel values of a plurality of target pixels of the target image frame; and
determine a region of interest of the target image frame according to differences between the pixel values of the plurality of reference pixels and the pixel values of the plurality of target pixels.
14. The apparatus of claim 13, wherein the determining module is further configured to:
acquire pixels of interest among the plurality of target pixels according to pixel value differences between the plurality of target pixels and the corresponding plurality of reference pixels, wherein the pixel value difference between a pixel of interest and its corresponding reference pixel is not zero, and a reference pixel and a target pixel located at the same position in the reference image frame and the target image frame, respectively, have a correspondence; and
determine the region of interest according to the distribution region of the pixels of interest.
15. The apparatus of claim 11, wherein the cropping module is further configured to:
set a cropping sliding window according to the sizes of the plurality of target image frames, wherein the length of the cropping sliding window is m and the width of the cropping sliding window is n, m and n being positive integers; and
for each target image frame, slide the cropping sliding window across the target image frame in the horizontal direction with a first step size, and slide the cropping sliding window across the target image frame in the vertical direction with a second step size, to obtain the plurality of sub-images;
wherein the first step size is smaller than m, and the second step size is smaller than n.
16. A control device of a vehicle, comprising:
an acquisition module, configured to acquire related information of a target vehicle in a video stream to be processed;
a generation module, configured to generate a trajectory of the target vehicle according to the related information; and
a control module, configured to control travel of the target vehicle based on the trajectory;
wherein the related information of the target vehicle is obtained using the apparatus according to any one of claims 11 to 15.
17. A chip, comprising:
the device of any one of claims 11-16.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
19. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
20. A computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-10.
CN202311776134.8A 2023-12-21 2023-12-21 Target object detection method, vehicle control method, device, chip and equipment Pending CN117765439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311776134.8A CN117765439A (en) 2023-12-21 2023-12-21 Target object detection method, vehicle control method, device, chip and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311776134.8A CN117765439A (en) 2023-12-21 2023-12-21 Target object detection method, vehicle control method, device, chip and equipment

Publications (1)

Publication Number Publication Date
CN117765439A 2024-03-26

Family

ID=90317738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311776134.8A Pending CN117765439A (en) 2023-12-21 2023-12-21 Target object detection method, vehicle control method, device, chip and equipment

Country Status (1)

Country Link
CN (1) CN117765439A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination