CN116434156A - Target detection method, storage medium, road side equipment and automatic driving system

Target detection method, storage medium, road side equipment and automatic driving system

Info

Publication number
CN116434156A
Authority
CN
China
Prior art keywords
image
detection
image frame
tracking method
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310331676.8A
Other languages
Chinese (zh)
Inventor
黄德璐
曹书浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Continental Software System Development Center Chongqing Co ltd
Original Assignee
Continental Software System Development Center Chongqing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Continental Software System Development Center Chongqing Co ltd filed Critical Continental Software System Development Center Chongqing Co ltd
Priority to CN202310331676.8A priority Critical patent/CN116434156A/en
Publication of CN116434156A publication Critical patent/CN116434156A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a target detection and tracking method based on a deep learning network, comprising: acquiring at least one image sequence from an image sensor; performing a target detection operation on a first image frame in the at least one image sequence using a pre-trained deep learning network to obtain first contour information of at least one target object in the first image frame, the first contour information comprising first key point coordinates; inputting the first key point coordinates and a second image frame into an optical flow detection algorithm to determine an optical flow vector for each key point between the two image frames; and determining second contour information of the at least one target object on the second image frame based on the first key point coordinates of the at least one target object on the first image frame and the optical flow vector. The invention also relates to a computer-readable storage medium, a roadside device comprising the computer-readable storage medium, and an automatic driving system.

Description

Target detection method, storage medium, road side equipment and automatic driving system
Technical Field
The present invention relates to the field of detection and tracking of image targets, and more particularly, to a target detection and tracking method based on a deep learning network, a computer-readable storage medium, a roadside apparatus including the computer-readable storage medium, and an automatic driving system.
Background
With the development of vehicle autonomous driving technology and vehicle-road cooperation technology, detecting road traffic conditions with roadside devices is becoming increasingly important. In vehicle-road cooperation, information about vehicles, pedestrians, and road conditions is detected by roadside sensors (such as cameras and radars) and sent to moving vehicles, giving each vehicle an accurate picture of the surrounding traffic and thereby ensuring safe and efficient driving.
The key task of the roadside device is to detect the category, position, size, and other information of traffic participants from the image data captured by its cameras using a target detection and tracking algorithm; this plays a key role in how accurately the roadside device perceives the surrounding traffic.
Most current roadside devices adopt image target detection and tracking algorithms based on deep learning networks. In existing tracking algorithms, every image frame must be fed into a convolutional neural network for detection, so the detection time is generally long, the camera's detection results are delayed, and the detection efficiency is low, whereas automatic driving systems and vehicle-road cooperative systems place high real-time demands on road-condition feedback. In addition, existing algorithms track poorly under traffic conditions with many vehicles, road congestion, and occlusion. Accordingly, there is a need for more efficient and accurate target detection and tracking algorithms.
Disclosure of Invention
In order to solve the problems that traditional tracking algorithms have long delay times and track poorly when roads carry many vehicles or targets are occluded, the invention provides a target detection and tracking method combining a deep learning network with an optical flow detection algorithm, comprising the following steps:
acquiring at least one image sequence from an image sensor, wherein the at least one image sequence is composed of a plurality of image frames arranged in chronological order;
performing a target detection operation on a first image frame in the at least one image sequence by utilizing a pre-trained deep learning network so as to obtain first contour information of at least one target object in the first image frame, wherein the first contour information comprises first key point coordinates of the at least one target object on the first image frame;
inputting the first keypoint coordinates of the at least one target object and a second image frame immediately following the first image frame into an optical flow detection algorithm to determine an optical flow vector for each keypoint of the at least one target object between the two image frames; and
determining second contour information of the at least one target object on the second image frame based on the first keypoint coordinates of the at least one target object on the first image frame and the optical flow vector.
According to an alternative embodiment, the first contour information comprises a first circumscribed rectangular box of the at least one target object on the first image frame, and the first keypoint coordinates comprise corner coordinates and center point coordinates of the first circumscribed rectangular box at different scales.
According to an alternative embodiment, the second contour information includes second keypoint coordinates of the at least one target object on the second image frame and a second circumscribed rectangular frame, wherein the second circumscribed rectangular frame is calculated from the second keypoint coordinates.
According to an alternative embodiment, the target detection and tracking method further comprises:
determining calibration parameters of the image sensor and a frame rate of the at least one image sequence;
determining a movement speed of the at least one target object in a world coordinate system based on the calibration parameters of the image sensor, the frame rate of the at least one image sequence, and the optical flow vector.
According to an alternative embodiment, the target detection and tracking method further comprises:
performing the target detection operation of the deep learning network every predetermined number of image frames, and matching the detection result of the deep learning network with the detection result of the optical flow detection algorithm to determine whether a new target object has appeared.
According to an alternative embodiment, the deep learning network comprises:
a first part for extracting, from an image frame, raw feature maps at multiple different levels relating to the at least one target object;
a second part for fusing the multiple levels of raw feature maps to generate a multi-layer feature map to be further used for detection; and
a third part for generating and outputting contour information of the at least one target object based on the multi-layer feature map.
According to an alternative embodiment, the deep learning network is a convolutional neural network in which the first part is the backbone, the second part is the neck, and the third part is the detection head.
According to a second aspect of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a processor, implement the steps of the target detection and tracking method described above.
According to a third aspect of the present invention, there is also provided a roadside apparatus comprising:
a roadside end camera for acquiring at least one image sequence of the roadside end;
a computer readable storage medium as described above; and
a processor configured to execute the program instructions in the computer-readable storage medium based on at least one sequence of images acquired by the roadside end camera.
According to a fourth aspect of the present invention, there is also provided an autopilot system comprising:
a vehicle-mounted camera for acquiring at least one image sequence around the vehicle;
a road environment monitoring unit configured to determine road conditions around a vehicle using the target detection and tracking method as described above based on at least one image sequence acquired by the in-vehicle camera; and
and a vehicle control unit configured to control the vehicle to perform a corresponding automatic driving operation according to the road condition determined by the road environment monitoring unit.
In the deep learning network-based target detection and tracking method of the invention, the neural network does not need to run on every image frame, which markedly reduces the required computing resources and improves the overall detection efficiency of the algorithm. In addition, compared with traditional methods, the method tracks better under traffic conditions with many vehicles, road congestion, and occlusion, reducing the error rate of target detection and tracking on complex roads.
Drawings
Other features and advantages of the method of the present invention will be apparent from the accompanying drawings, which are incorporated herein, and from the detailed description of the invention, which together serve to explain certain principles of the invention.
Fig. 1 shows a flow chart of a training process of a deep learning network used in the target detection and tracking method according to the present invention.
FIG. 2 illustrates a flowchart of an online detection and tracking process of image targets using a trained deep learning network model, according to an exemplary embodiment of the invention.
Detailed Description
The deep learning network-based object detection and tracking method according to the present invention will be described below by way of example with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention to those skilled in the art. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. Rather, the invention can be considered to be implemented with any combination of the following features and elements, whether or not they relate to different embodiments. Thus, the various aspects, features, embodiments and advantages described below are for illustration only and should not be considered elements or limitations of the claims.
In existing deep learning-based image target detection and tracking algorithms, a convolutional neural network is first trained on training data with a gradient descent algorithm to obtain a trained neural network model. In practical applications, a single frame captured by the camera is fed into the model, which classifies and regresses the foreground objects in the image and outputs the category, position, and size of each detected object. The next frame is then classified and regressed in the same way to obtain its corresponding foreground objects. Finally, the targets in the two frames are matched by the Hungarian matching algorithm and, combined with Kalman filtering, traffic participants such as vehicles and pedestrians on the road can be tracked effectively and their movement speeds obtained.
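As a rough sketch of the association step in this conventional pipeline, the Hungarian matching over an IoU cost could look as follows (the Kalman filtering step is omitted, and the function names and threshold value are illustrative rather than taken from the patent):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(prev_boxes, curr_boxes, min_iou=0.3):
    """Hungarian matching on an IoU cost matrix; returns (prev, curr) index pairs."""
    if not len(prev_boxes) or not len(curr_boxes):
        return []
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
```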
In the traditional tracking method, the background of the image sequence captured by the roadside camera does not change much; feeding every image frame into the convolutional neural network for detection therefore amounts to repeatedly re-detecting the same uninformative background, which wastes computing resources and increases latency. In addition, existing algorithms track poorly under traffic conditions with many vehicles, road congestion, and occlusion.
To solve the problems of long delay and poor tracking when roads are crowded or targets are occluded, the invention provides a target detection and tracking method combining a deep learning network with an optical flow detection algorithm. The technical concept is based on the following considerations. The detection result output by a deep learning-based target detection algorithm is a circumscribed rectangular frame around each target object in the image, from which the target's key point coordinates can be obtained easily and accurately; the deep learning network and the optical flow algorithm can therefore be combined to realize target detection and tracking. Moreover, since the roadside camera captures a temporally continuous image sequence with a small interval between consecutive frames, neither the background nor the position of a target object (such as a vehicle) changes much from frame to frame; given the detection result of the previous frame and the target's optical flow information in the next frame, the target's position and movement speed in the next frame can be predicted accurately.
Fig. 1 shows a flow chart of the training process of the deep learning network used in the target detection and tracking method according to the present invention. This training process is described in detail below with reference to Fig. 1.
First, the image data to be used for training is created; this may be original pictures taken by an image sensor (e.g., a monocular camera) that contain at least one target object (e.g., a vehicle).
In addition, data enhancement processing may be performed on the created image data to enlarge the training data set. The specific augmentation methods can be chosen according to the actual training data; for example, image flipping, image brightness and contrast adjustment, mosaic augmentation, random cropping, and random scaling may be selected.
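As a brief illustration, such an augmentation pipeline could be assembled with torchvision (the parameter values are examples; mosaic augmentation is not built into torchvision and is omitted, and for detection training the bounding-box labels would have to be transformed consistently, which these image-only transforms do not handle):

```python
from torchvision import transforms

# Illustrative augmentation pipeline: horizontal flip, brightness/contrast
# jitter, and random crop-and-rescale applied to training images.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomResizedCrop(size=(640, 640), scale=(0.6, 1.0)),
    transforms.ToTensor(),
])
```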
As used herein, a deep learning network may be, for example, a convolutional neural network (CNN), which can generally be divided into three parts: a backbone, a neck, and a detection head.
The backbone is the core of the overall deep learning network. It is made up of Cn convolutional stages, each containing one or more convolutional layers, and extracts image features from the input image data; the extracted features are raw feature maps at a plurality of different levels related to the at least one target object. For example, the backbone can generate an Ln-level raw feature map, which is then processed by the neck of the network before being input into the detection head.
The neck is the network connecting the backbone and the detection head. It is composed of convolutional, upsampling, and downsampling layers and fuses the multiple levels of raw features extracted by the backbone to generate a new multi-layer feature map to be further used for detection. This part improves the classification and regression accuracy of the deep learning network for objects of different sizes.
The detection head is composed of convolutional or fully connected layers and outputs the categories of objects and regresses their contour information. For example, the detection head can generate, on the multi-layer feature map, category information, category confidence, and corner coordinates of the at least one target object; the circumscribed rectangular frame of the corresponding object can then be determined from the corner coordinates.
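The following PyTorch sketch illustrates this three-part layout with a deliberately tiny configuration (the channel sizes, class count, and layer choices are assumptions for illustration, not the network trained in the patent):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TinyDetector(nn.Module):
    """Toy detector showing the backbone/neck/head split."""
    def __init__(self, num_classes=4):
        super().__init__()
        # Backbone: convolution stages producing features at two levels.
        self.stage1 = nn.Sequential(conv_block(3, 32, 2), conv_block(32, 64, 2))
        self.stage2 = conv_block(64, 128, 2)
        # Neck: upsample the deep map and fuse it with the shallower map.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = conv_block(128 + 64, 128)
        # Head: per-cell class scores, confidence, and box corner offsets.
        self.head = nn.Conv2d(128, num_classes + 1 + 4, 1)

    def forward(self, x):
        f1 = self.stage1(x)    # shallow, higher-resolution features
        f2 = self.stage2(f1)   # deep, lower-resolution features
        fused = self.fuse(torch.cat([self.up(f2), f1], dim=1))
        return self.head(fused)  # (B, num_classes + 1 + 4, H/4, W/4)
```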
After the convolutional neural network is determined, pictures acquired by the roadside end cameras are input into the network, which is trained with an optimization strategy such as the stochastic gradient descent (SGD) algorithm to obtain a trained network model.
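Schematically, and only as a sketch, the optimization step might then look like this (the dummy tensors and the mean-squared-error loss stand in for a real annotated dataset and detection loss, which the patent does not specify):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = TinyDetector()  # the toy network sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy data standing in for roadside images and dense target maps;
# the 9-channel, 1/4-resolution targets match the toy head's output.
images = torch.randn(8, 3, 256, 256)
targets = torch.randn(8, 9, 64, 64)
loader = DataLoader(TensorDataset(images, targets), batch_size=4)

for batch_images, batch_targets in loader:
    preds = model(batch_images)
    loss = torch.nn.functional.mse_loss(preds, batch_targets)  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```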
The trained network model is then deployed to an edge computing unit (ECU) at the roadside end to perform online detection and tracking of image targets.
FIG. 2 illustrates a flowchart of an online detection and tracking process of image targets using a trained deep learning network model, according to an exemplary embodiment of the invention. The operation flow of the target detection and tracking method according to the present invention is described in detail below with reference to FIG. 2.
First, at least one image sequence is acquired from an image sensor (e.g., a monocular camera); the sequence is made up of a plurality of image frames arranged in chronological order. For example, the image sequence includes two temporally adjacent frames: a first image frame (also referred to as "image frame A") followed by a second image frame (also referred to as "image frame B").
Then, a target detection operation is performed on the first image frame using the pre-trained deep learning network to obtain first contour information of at least one target object in that frame; the first contour information may include, for example, a circumscribed rectangular frame of the target object in the first image frame.
After the circumscribed rectangular frame is obtained, the coordinates of its center point and the corner coordinates of the frame at different scales can further be obtained as the key point coordinates for optical flow detection. For example, the 4 corner coordinates of the frame shrunk to 2/3 of its size and the 4 corner coordinates of the frame shrunk to 1/3 of its size may be acquired, so that each target object has 9 key point coordinates; the actual number of key points can be increased or decreased as needed.
It can be seen that the first contour information obtained with the deep learning network further includes the first key point coordinates of the target object on the first image frame, in this embodiment 9 key point coordinates in total (8 corner points and 1 center point).
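A small sketch of this key point construction, following the box format (x1, y1, x2, y2) and the 2/3 and 1/3 scale factors of the embodiment above (the helper name is illustrative):

```python
import numpy as np

def box_keypoints(box, scales=(2/3, 1/3)):
    """Center point plus the corners of the box shrunk about its center
    at each scale: 9 key points for the default two scales."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    pts = [(cx, cy)]
    for s in scales:
        hw, hh = s * w / 2.0, s * h / 2.0
        pts += [(cx - hw, cy - hh), (cx + hw, cy - hh),
                (cx - hw, cy + hh), (cx + hw, cy + hh)]
    return np.float32(pts)
```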
The first key point coordinates of the target object, together with the second image frame (image frame B) immediately following the first, are then input into an optical flow detection algorithm (e.g., a sparse optical flow algorithm) to determine an optical flow vector for each key point of the target object between the two image frames, including, for example, the magnitude and direction of the optical flow.
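With OpenCV, this step could be sketched using pyramidal Lucas-Kanade, one common sparse optical flow method (the window size and pyramid depth are illustrative choices, not values from the patent):

```python
import cv2
import numpy as np

def track_keypoints(frame_a_gray, frame_b_gray, pts_a):
    """Propagate key points from frame A to frame B with pyramidal
    Lucas-Kanade; returns new points, flow vectors, and a validity mask."""
    pts_a = pts_a.reshape(-1, 1, 2).astype(np.float32)
    pts_b, status, _err = cv2.calcOpticalFlowPyrLK(
        frame_a_gray, frame_b_gray, pts_a, None,
        winSize=(21, 21), maxLevel=3)
    pts_b = pts_b.reshape(-1, 2)
    flow = pts_b - pts_a.reshape(-1, 2)  # per-key-point optical flow vector
    return pts_b, flow, status.ravel().astype(bool)
```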
Based on the first key point coordinates of the target object on the first image frame and their optical flow vectors, the second contour information of the target object on the second image frame can be determined. The second contour information likewise includes the second key point coordinates of the target object on the second image frame, namely the 8 inferred corner coordinates and the inferred center point coordinate. A center coordinate is obtained by averaging the 8 corner coordinates, and the final center of the target object on image frame B is the average of this coordinate and the propagated center point. In addition, the average length and width of the target's rectangular frame can be calculated from the 8 inferred corner coordinates. Once the center coordinates, length, and width are known, the detection frame of the target object on image frame B obtained from the optical flow result, i.e., the second circumscribed rectangular frame, can be determined. In other words, the second circumscribed rectangular frame on the second image frame is derived from the second key point coordinates.
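A sketch of this reconstruction, assuming the 9 propagated points arrive in the order produced by box_keypoints above (center first, then the 2/3-scale and 1/3-scale corners); dividing each corner span by its scale factor undoes the shrinking applied when the key points were built:

```python
import numpy as np

def box_from_keypoints(pts, scales=(2/3, 1/3)):
    """Rebuild the circumscribed rectangle from 9 propagated key points."""
    center = (pts[0] + pts[1:].mean(axis=0)) / 2.0  # average the two center estimates
    widths, heights = [], []
    for i, s in enumerate(scales):
        corners = pts[1 + 4 * i: 5 + 4 * i]
        widths.append((corners[:, 0].max() - corners[:, 0].min()) / s)
        heights.append((corners[:, 1].max() - corners[:, 1].min()) / s)
    w, h = float(np.mean(widths)), float(np.mean(heights))
    cx, cy = center
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```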
Optionally, the target detection and tracking method according to the invention may further determine the calibration parameters of the image sensor and the frame rate of the acquired image sequence. From the calibration parameters, the frame rate, and the optical flow vector of at least one key point (in particular the center point, and in particular the magnitude of its optical flow), the movement speed of the target object in the world coordinate system can be determined; this requires the projection between the image coordinate system and the world coordinate system.
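One way to realize this projection, sketched under the assumptions that the road surface is planar and that the calibration is available as an image-to-ground homography H in metres (the patent does not specify how the calibration is parameterized):

```python
import cv2
import numpy as np

def ground_speed(center_a, center_b, H, fps):
    """Speed of a target in the world (ground-plane) coordinate system,
    from its image-plane center in two consecutive frames."""
    pts = np.float32([center_a, center_b]).reshape(-1, 1, 2)
    ground = cv2.perspectiveTransform(pts, H).reshape(-1, 2)  # metres on the road plane
    displacement = np.linalg.norm(ground[1] - ground[0])
    return displacement * fps  # metres per second for consecutive frames
```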
In addition, since the optical flow detection algorithm cannot detect new objects, target detection with the convolutional neural network must be performed every predetermined number of image frames to find them. In this step, the Hungarian algorithm can be used, for example, to match the detection results of the convolutional neural network against those obtained by the optical flow detection algorithm; any unmatched network detection is a new target.
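Reusing the match_detections helper sketched earlier, new-target discovery could then be as simple as the following (the IoU threshold is illustrative):

```python
def find_new_targets(flow_boxes, cnn_boxes, min_iou=0.3):
    """CNN detections that match no optical-flow-tracked box are new targets."""
    matches = match_detections(flow_boxes, cnn_boxes, min_iou)
    matched_cnn = {c for _, c in matches}
    return [box for i, box in enumerate(cnn_boxes) if i not in matched_cnn]
```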
In the deep learning network-based target detection and tracking method of the invention, the neural network does not need to run on every image frame, which markedly reduces the required computing resources and improves the overall detection efficiency of the algorithm. In addition, compared with traditional methods, the method tracks better under traffic conditions with many vehicles, road congestion, and occlusion, reducing the error rate of target detection and tracking on complex roads.
In exemplary embodiments of the present application, a computer-readable storage medium is also provided, on which a computer program is stored; the program comprises executable program instructions which, when executed by, for example, a processor, may implement the steps of the target detection and tracking method described in any embodiment herein. In some possible implementations, aspects of the present application may also be implemented as a program product comprising program code which, when run on a terminal device, causes the terminal device to carry out the exemplary steps of the target detection and tracking method of the invention.
The program product for implementing the above-described method according to the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In an exemplary embodiment of the present application, there is also provided a roadside apparatus, which may include a roadside end camera for acquiring at least one image sequence of a roadside end; a computer readable storage medium as described above; and a processor configured to execute respective program instructions in the computer-readable storage medium based on images acquired by the roadside end camera.
In an exemplary embodiment of the present application, there is also provided an automatic driving system including: a vehicle-mounted camera; a road environment monitoring unit configured to determine road conditions around a vehicle using the target detection and tracking method described herein based on images acquired by the onboard camera; and a vehicle control unit configured to control the vehicle to perform a corresponding automatic driving operation according to the road condition determined by the road environment monitoring unit.
Those skilled in the art will appreciate that the example embodiments herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium or on a network, comprising several instructions to cause a computing device (which may be a personal computer, a server, or a network device, etc.) to perform a method for detecting an object in a road using a road side device according to embodiments of the present application.
While the invention has been described in terms of preferred embodiments, it is not limited thereto. Any changes and modifications that a person skilled in the art may make without departing from the spirit and scope of the present invention shall fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A target detection and tracking method based on a deep learning network, characterized in that the method comprises the following steps:
acquiring at least one image sequence from an image sensor, wherein the at least one image sequence is composed of a plurality of image frames which are sequentially arranged according to time;
performing a target detection operation on a first image frame in the at least one image sequence by utilizing a pre-trained deep learning network so as to obtain first contour information of at least one target object in the first image frame, wherein the first contour information comprises first key point coordinates of the at least one target object on the first image frame;
inputting the first keypoint coordinates of the at least one target object and a second image frame immediately following the first image frame into an optical flow detection algorithm to determine an optical flow vector for each keypoint of the at least one target object between the two image frames; and
determining second contour information of the at least one target object on the second image frame based on the first keypoint coordinates of the at least one target object on the first image frame and the optical flow vector.
2. The target detection and tracking method of claim 1, wherein the first contour information comprises a first circumscribed rectangular box of the at least one target object on the first image frame, and the first keypoint coordinates comprise corner coordinates and center point coordinates of the first circumscribed rectangular box at different scales.
3. The target detection and tracking method according to claim 1 or 2, wherein the second contour information comprises second keypoint coordinates of the at least one target object on the second image frame and a second circumscribed rectangular box, wherein the second circumscribed rectangular box is derived from the second keypoint coordinates.
4. The target detection and tracking method according to claim 1 or 2, characterized in that the target detection and tracking method further comprises:
determining calibration parameters of the image sensor and a frame rate of the at least one image sequence;
determining a movement speed of the at least one target object in a world coordinate system based on the calibration parameters of the image sensor, the frame rate of the at least one image sequence, and the optical flow vector.
5. The target detection and tracking method according to claim 1 or 2, characterized in that the target detection and tracking method further comprises:
performing the target detection operation of the deep learning network every predetermined number of image frames, and matching the detection result of the deep learning network with the detection result of the optical flow detection algorithm to determine whether a new target object has appeared.
6. The target detection and tracking method according to claim 1 or 2, wherein the deep learning network comprises:
a first part for extracting, from an image frame, raw feature maps at multiple different levels relating to the at least one target object;
a second part for fusing the multiple levels of raw feature maps to generate a multi-layer feature map to be further used for detection; and
a third part for generating and outputting contour information of the at least one target object based on the multi-layer feature map.
7. The method of claim 6, wherein the deep learning network is a convolutional neural network, and the first part is a backbone, the second part is a neck, and the third part is a detection head.
8. A computer readable storage medium on which a computer program is stored, the computer program comprising program instructions, characterized in that the program instructions, when executed by a processor, implement the respective steps of the object detection and tracking method according to any one of claims 1 to 7.
9. A roadside apparatus, comprising:
a roadside end camera for acquiring at least one image sequence of the roadside end;
the computer-readable storage medium of claim 8; and
a processor configured to execute the program instructions in the computer-readable storage medium based on at least one sequence of images acquired by the roadside end camera.
10. An autopilot system, characterized in that the autopilot system comprises:
a vehicle-mounted camera for acquiring at least one image sequence around the vehicle;
a road environment monitoring unit configured to determine road conditions around a vehicle using the target detection and tracking method according to any one of claims 1 to 7 based on at least one image sequence acquired by the onboard camera; and
and a vehicle control unit configured to control the vehicle to perform a corresponding automatic driving operation according to the road condition determined by the road environment monitoring unit.
CN202310331676.8A 2023-03-30 2023-03-30 Target detection method, storage medium, road side equipment and automatic driving system Pending CN116434156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331676.8A CN116434156A (en) 2023-03-30 2023-03-30 Target detection method, storage medium, road side equipment and automatic driving system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310331676.8A CN116434156A (en) 2023-03-30 2023-03-30 Target detection method, storage medium, road side equipment and automatic driving system

Publications (1)

Publication Number Publication Date
CN116434156A true CN116434156A (en) 2023-07-14

Family

ID=87093769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331676.8A Pending CN116434156A (en) 2023-03-30 2023-03-30 Target detection method, storage medium, road side equipment and automatic driving system

Country Status (1)

Country Link
CN (1) CN116434156A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117962930A (en) * 2024-04-01 2024-05-03 北京易控智驾科技有限公司 Unmanned vehicle control method and device, unmanned vehicle and computer readable storage medium
CN117962930B (en) * 2024-04-01 2024-07-05 北京易控智驾科技有限公司 Unmanned vehicle control method and device, unmanned vehicle and computer readable storage medium


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination