WO2024055966A1 - A multi-camera target detection method and apparatus - Google Patents

A multi-camera target detection method and apparatus

Info

Publication number
WO2024055966A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
target detection
image frame
feature
information
Prior art date
Application number
PCT/CN2023/118350
Other languages
English (en)
French (fr)
Inventor
吴昊
Original Assignee
上海高德威智能交通系统有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海高德威智能交通系统有限公司
Publication of WO2024055966A1 publication Critical patent/WO2024055966A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30241 - Trajectory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Definitions

  • the present application relates to the field of image-based target detection, and in particular, to a multi-camera target detection method.
  • Multi-camera target detection and target position analysis performs target detection and target position analysis separately on the image information from each camera, generates a single-camera target position sequence, and then projects the target position sequence of each camera into a view of the same perspective, for example a Bird's Eye View (BEV); the projected target position sequences are finally merged into a global target position sequence.
  • the above multi-camera target detection and target position analysis method performs target detection and target position analysis step by step, which makes the system more complex and occupies communication resources; moreover, if there is a problem with the target position sequence of a single camera, it is difficult to screen and choose among the single-camera target position sequences when the target position sequences of multiple cameras are fused under the same perspective, making it difficult to obtain a correct global target position sequence.
  • This application provides a multi-camera target detection method to obtain an accurate global target position sequence.
  • this application provides a multi-camera target detection method, which includes:
  • each channel of video stream image data at least includes image data of overlapping areas
  • target detection and target location analysis are performed.
  • the feature point information extracted from each image frame of each channel of video stream image data is converted into a visual map, including:
  • the projection matrix is used to characterize the mapping relationship between pixel points in the camera image and spatial points in the visual map;
  • the step of performing target detection and target location analysis based on the fusion information includes:
  • target detection is performed to obtain target detection results under the same viewing angle corresponding to all image frames in each channel of image frames, and these target detection results are determined as the target detection results of the image frame group of each channel of image frames,
  • where the target detection results of each image frame group are: the target detection results under the same viewing angle corresponding to all image frames of each channel of image frames at different times.
  • obtaining target position sequence data from the target detection results of each image frame group includes:
  • the target position sequence data in the world coordinate system is obtained.
  • the target detection based on the projected feature point information of all image frames in each channel of image frames includes:
  • the pre-initialized target features are searched to obtain target reference position information
  • the features of the corresponding target are obtained, giving each channel's features of the target.
  • the target detection results include: global position information, target size and confidence under the same perspective.
  • the pre-initialized target features are searched based on the projected feature point information to obtain target reference position information, including:
  • the back-projecting of the target reference position information into each channel of image frames, using the projection matrix of the camera from which each image frame originates, includes:
  • back-projecting the reference position information of each target into the feature map corresponding to each image frame, to determine the position information of each target's reference position in the feature map.
  • the obtaining of the features of the corresponding target, giving each channel's features of the target, includes:
  • the characteristics corresponding to each target are obtained.
  • the fusing of each channel's features of the target to obtain the fusion feature of the target includes:
  • for each target, performing feature fusion of that target separately, to obtain the fusion feature of each target.
  • the search for the fusion feature based on the projected feature point information includes:
  • the fused features of each target and the projected feature point information are input into the machine learning model to obtain the target detection result.
  • the feature fusion of each target is performed separately, including:
  • the features of the target in each feature map are fused to obtain the first fusion feature
  • the method further includes:
  • the target detection results of the current image frame group are filtered to obtain effective target detection results.
  • the effective target detection result is added to the initialized target feature set of the next image frame group.
  • marking each target detection result in the intersection of the target detection results of the current image frame group and the target detection results of the historical image frame group includes:
  • the target position sequence identifier of the valid target detection result added in the previous image frame group is used;
  • the method of obtaining target position sequence data in the world coordinate system from the marked target detection results of each image frame group includes:
  • Target detection results with the same target position sequence identifier among the target detection results marked in each image frame group are determined as target position analysis data of the target detection result;
  • the visual map is a bird's-eye view map, and the same perspective is a bird's-eye view.
  • embodiments of the present application also provide a multi-camera target detection device, which includes:
  • the first acquisition module is used to acquire at least two channels of video stream image data from different cameras, where each channel of video stream image data at least includes image data of the overlapping area,
  • the second acquisition module is used to acquire the visual map information of the spatial location corresponding to the video stream image data
  • the target detection and position analysis module converts the feature point information extracted from each image frame of each channel of video stream image data into a visual map, so as to fuse the feature point information of each channel's image frames into the same perspective and obtain fusion information under the same perspective, where the image frames corresponding to the same collection time in each channel of image frames are simultaneous,
  • target detection and target location analysis are performed.
  • embodiments of the present application further provide a computer-readable storage medium,
  • in which a computer program is stored,
  • and when the computer program is executed by a processor, the method steps of multi-camera target detection described in any one of the first aspects are implemented.
  • embodiments of the present application further provide an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus;
  • Memory used to store computer programs
  • the processor is configured to implement the method steps of multi-camera target detection described in any one of the above first aspects when executing a program stored in the memory.
  • embodiments of the present application further provide a computer program product containing instructions, which when the computer program product is run on a computer, causes the computer to perform any of the method steps of multi-camera target detection described in the first aspect.
  • the multi-camera target detection method projects video stream image features from different cameras into the same perspective and performs target detection and target location analysis based on the fused information under the same perspective. In this way, information is fused at the source,
  • which helps improve the accuracy of the information sources used for target detection and target position analysis and improves the intelligence of multi-camera fusion; moreover, there is no need to first generate a target position sequence corresponding to a single camera and then fuse the target position sequences. This solves the problem that it is difficult to screen and choose among target position sequences corresponding to multiple single cameras when fusing them, avoids the computing power consumption caused by fusing target position sequences, and improves the accuracy and reliability of target detection and target position analysis.
  • Figure 1 is a schematic flow chart of a multi-camera target detection method according to an embodiment of the present application.
  • Figure 2 is a schematic flowchart of a multi-camera target detection method in a specific scenario according to an embodiment of the present application.
  • Figure 3 is a schematic diagram of four cameras collecting one frame of video stream images from four directions of an intersection.
  • Figure 4 is a schematic diagram of a high-precision map from a bird's-eye view at a traffic intersection.
  • Figure 5 is a schematic diagram of the target detection process.
  • Figure 6 is a schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • FIG. 7 is another schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the embodiment of the present application projects the feature points corresponding to the image frames of the same time in the image data of each video stream to the same perspective visual map, so as to integrate the feature point information of the image frames in the video stream images of each channel into the same perspective.
  • Target detection and target location analysis are performed based on fused information from the same perspective.
  • Figure 1 is a schematic flow chart of a multi-camera target detection method according to an embodiment of the present application. The method includes:
  • Step 101 obtain at least two video stream image data
  • the image data of each video stream at least includes image data of the same scene collected from different shooting angles. That is to say, the image data of each video stream is image data of the same scene collected by different cameras from different shooting angles. Usually, the image data of each video stream is obtained by collecting images of the same scene from cameras installed at different locations, so that image data from different shooting angles can be obtained.
  • from a data perspective, the same scene means that there is at least intersecting data between the channels of video stream image data, that is, the same scene causes the shooting scenes corresponding to the channels of video stream image data to have an overlapping area; from a spatial-location perspective, the same scene refers to the set of targets located within the same spatial location range,
  • where the spatial location range can be set as needed. In other words, there is an overlapping area between the channels of video stream images. That is to say, from a spatial-location perspective, the same scene refers to the set of locations within the same spatial location range,
  • where the spatial location range can be set as needed. In other words, there is an overlapping area between the shooting scenes corresponding to the channels of video stream images.
  • the acquisition method may be real-time video stream image data obtained from the camera, or non-real-time video stream image data obtained from the storage terminal, which is not limited in this application.
  • Step 102 Obtain the visual map information of the location of the overlapping area in the video stream image data
  • the location of the overlapping area can be obtained through the position information of the camera.
  • the corresponding bird's-eye view visual map information is obtained from the map library.
  • the visual map can be a bird's-eye view visual map.
  • the bird's-eye view can be understood as a top-down view,
  • so the bird's-eye view visual map information is equivalent to top-down visual map information.
  • the map information includes global position information in the world coordinate system.
  • the overlapping area is the overlapping area between the shooting ranges of multiple cameras.
  • the overlapping area of the shooting ranges of the cameras is determined based on the geographical location information, installation angle, and shooting range of each camera, and the bird's-eye view visual map information corresponding to the overlapping area is obtained from the map library.
  • the bird's-eye view visual map information corresponding to the preset range is obtained from the map library, and the preset range at least includes the above-mentioned overlapping area.
  • based on the geographical location information where each camera is installed, the position of each camera in the complete bird's-eye view visual map is determined,
  • and based on the installation angle and shooting range of each camera, the position of the overlapping area of the cameras' shooting ranges in the complete bird's-eye view visual map is determined.
  • Step 103 Convert the feature point information extracted from each image frame of each channel of video stream image data into the same perspective visual map, so as to fuse the feature point information of each channel image frame into the same perspective to obtain the same perspective fused information,
  • each image frame in each channel of image frames is simultaneous.
  • the image frames of each channel correspond to the same time.
  • each channel's image frame has the same timestamp information; in practical applications, however, the image frames of the channels do not need to be strictly at the same moment, as long as the time differences between the image frames of the channels are within a set time threshold. In that case they correspond to the same time and are also simultaneous. If the channels' image frames are not simultaneous, synchronization processing can be performed.
  • each of the above-mentioned image frames in each channel is a group of image frames with the same acquisition time in each channel of image frames, and the above-mentioned time difference is the difference between the acquisition times of the group of image frames.
  • the projected feature point information includes global position information, that is, the coordinates of the projected feature points in the world coordinate system.
  • the projection feature point information of all image frames corresponding to the same acquisition time in each image frame is determined as the fusion information.
  • the fusion information represents the feature information of the scene at the same time under the same viewing angle; that is to say, it represents the feature information corresponding to all simultaneous image frames under the same perspective.
  • each channel's image frames whose time differences are within the set time threshold form a set, i.e., the image frames corresponding to the same acquisition time in the channels are simultaneous image frames; this set is referred to in this application as an image frame group, and the fusion information can be understood as the fusion information of the image frame group.
  • For example, suppose there are 3 channels of image frames, and among the 1st to 3rd channels the image frames collected at 8:00 are image frame 1, image frame 2, and image frame 3, while the image frames collected at 8:01 are image frame 4, image frame 5, and image frame 6. Then image frame 1 to image frame 3 can be regarded as one image frame group, and image frame 4 to image frame 6 as another image frame group.
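As a concrete illustration of this grouping, the following minimal Python sketch (the Frame structure, channel count, and 50 ms threshold are hypothetical, not taken from the patent) collects one frame per channel into an image frame group whenever their acquisition times fall within a set threshold:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    channel: int        # which camera/channel the frame comes from
    timestamp: float    # acquisition time in seconds
    data: object = None # image payload (omitted here)

def group_frames(frames: List[Frame], num_channels: int, threshold: float = 0.05):
    """Group one frame per channel into an 'image frame group' when their
    acquisition times differ by no more than `threshold` seconds."""
    frames = sorted(frames, key=lambda f: f.timestamp)
    groups, current = [], {}
    for f in frames:
        if current and (f.timestamp - min(x.timestamp for x in current.values())) > threshold:
            groups.append(list(current.values()))   # flush an incomplete group
            current = {}
        current[f.channel] = f                      # keep the latest frame per channel
        if len(current) == num_channels:            # one frame from every channel: a complete group
            groups.append(list(current.values()))
            current = {}
    return groups

# Example: 3 channels, frames collected around two acquisition times
frames = [Frame(0, 0.00), Frame(1, 0.01), Frame(2, 0.02),
          Frame(0, 60.0), Frame(1, 60.01), Frame(2, 60.02)]
print([[(f.channel, f.timestamp) for f in g] for g in group_frames(frames, 3)])
```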
  • the projection matrix is used to characterize the mapping relationship between the pixel points in the camera image and the spatial points in the visual map,
  • and different cameras correspond to different projection matrices; that is to say, the projection matrix characterizes the mapping relationship between the image coordinate system of the camera image and the map coordinate system of the visual map.
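One common way to realize such a pixel-to-map mapping is a planar homography between image coordinates and ground-plane (BEV) coordinates. The sketch below assumes this homography form purely for illustration; H_cam is a hypothetical per-camera 3x3 matrix obtained in advance:

```python
import numpy as np

def project_to_bev(pixels: np.ndarray, H_cam: np.ndarray) -> np.ndarray:
    """Map Nx2 pixel coordinates to Nx2 BEV map coordinates with a 3x3 homography."""
    pts = np.hstack([pixels, np.ones((len(pixels), 1))])   # homogeneous coordinates
    mapped = (H_cam @ pts.T).T
    return mapped[:, :2] / mapped[:, 2:3]                  # divide by the scale factor

# Example with an arbitrary (made-up) homography
H_cam = np.array([[0.05, 0.0, -10.0],
                  [0.0, 0.08, -25.0],
                  [0.0, 0.0005, 1.0]])
pixels = np.array([[640.0, 360.0], [100.0, 500.0]])
print(project_to_bev(pixels, H_cam))   # global (x, y) positions on the BEV map
```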
  • Step 104 Perform target detection and target location analysis based on the fusion information.
  • target detection is performed based on the projected feature point information of all image frames in each channel of image frames.
  • the pre-initialized target features are searched to obtain target reference position information; that is to say, based on the projected feature point information, the pre-initialized target features are parsed in the visual map of the same perspective,
  • to obtain the target reference position information in the visual map.
  • the pre-initialized target features can be predetermined features corresponding to various types of targets to be detected. In this way, based on analysis based on the pre-initialized target features, targets of these categories included in each image frame can be identified.
  • the target reference position information may be the coordinates of the target in the visual map. It can be seen from the above description that parsing the pre-initialized target features means searching for the pre-initialized target features based on the projected feature point information.
  • the corresponding target detection results under the same viewing angle are determined as the target detection results of the image frame group of each channel of image frames, where the image frame group is composed of simultaneous image frames from the channels; the target detection results include: global position information under the same perspective, target size, and confidence, and may also include a target identifier and/or target category.
  • the features of the target in each image frame included in the image frame group are fused to obtain the fusion features corresponding to the image frame group.
  • the fusion features corresponding to the image frame group are analyzed to obtain the target detection results at the acquisition time corresponding to the image frame group.
  • the fusion feature can be a 128-dimensional vector.
  • the fusion feature is input into the multi-layer perceptron, and the multi-layer perceptron can analyze whether there is a target at the corresponding position of the fusion feature in the visual map. If there is a target, the multi-layer perceptron can also parse out specific information about the target such as target category and target size.
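A multi-layer perceptron of the kind described here might be organized as in the PyTorch sketch below; the 128-dimensional input, the hidden width, and the output layout (confidence, class scores, BEV position and size) are illustrative assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """MLP that maps a fused target feature to detection outputs."""
    def __init__(self, feat_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.confidence = nn.Linear(256, 1)     # is there a target at this position?
        self.cls = nn.Linear(256, num_classes)  # pedestrian / motor vehicle / non-motor vehicle
        self.box = nn.Linear(256, 4)            # BEV position (x, y) and size (w, l)

    def forward(self, fused_feat: torch.Tensor):
        h = self.mlp(fused_feat)
        return {
            "confidence": torch.sigmoid(self.confidence(h)),
            "class_logits": self.cls(h),
            "box": self.box(h),
        }

head = DetectionHead()
out = head(torch.randn(5, 128))   # 5 fused target features
print(out["confidence"].shape, out["box"].shape)
```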
  • the above detection method of searching for fused features can realize correlation detection from feature information of multiple image frames, which is beneficial to improving the accuracy of target detection.
  • target position sequence data is obtained from the target detection results of each image frame group, where the target detection results of each image frame group are the target detection results corresponding to the image frame groups at different times, that is, including the target detection results of the historical image frame groups.
  • the target position sequence data is a sequence of the target's positions at different times in the visual map arranged in chronological order.
  • the historical image frame group is the image frame group whose acquisition time is before the current time.
  • target detection can be performed based on the fusion information of this image frame group.
  • each target detection result in the intersection of the target detection results of the current image frame group under the same perspective and the target detection results of the historical image frame groups under the same perspective is marked; for example, the target detection results in the intersection inherit their existing target position sequence identifiers. That is to say, the intersection of the target identifiers included in the target detection results corresponding to the current image frame group and the target identifiers included in the target detection results corresponding to the historical image frame groups is determined, and each target detection result to which a target identifier in the intersection belongs is marked.
  • the current image frame group is the image frame group collected at the current time.
  • the target detection results of the current image frame group under the same perspective that are not in the intersection are assigned new target position sequence identifiers; the target detection results with the same target position sequence identifier are determined as the target position sequence data of that target. That is to say, among the target identifiers included in the target detection results of the current image frame group, the target detection results belonging to target identifiers that are not in the above intersection are assigned new target position sequence identifiers; the global position information included in the target detection results with the same target position sequence identifier is sorted according to the corresponding collection time to obtain the target position sequence.
  • target detection and target position analysis are integrated into one, eliminating the need to perform detection first and then perform target position analysis, making the overall detection and position analysis more concise.
  • the multi-camera target detection method in the embodiments of the present application provides an end-to-end, multi-view, integrated detection and position analysis method by fusing the feature point information of each channel's image frames into the same viewing angle, without the need to first obtain
  • the target position sequence corresponding to each single camera and then fuse the target position sequences corresponding to the single cameras. This avoids the problem that the correct global target position sequence is hard to obtain when a single camera's target position sequence has problems, and helps improve the reliability and accuracy of multi-camera target detection and target location analysis.
  • the following takes multi-camera target detection and target location analysis applied to a traffic intersection as an example. It should be understood that the present application is not limited to multi-camera target detection and target location analysis at traffic intersections; it can be applied to any multi-camera target detection and target location analysis application, for example, detection and position analysis of surrounding targets using multiple cameras installed on a vehicle body.
  • Figure 2 is a schematic flow chart of a multi-camera target detection method in a specific scenario according to an embodiment of the present application. The method includes:
  • Step 201 Obtain video stream images from multiple cameras and visual map information corresponding to the spatial location of the video stream images.
  • the visual map information corresponding to the spatial position of the video stream image is the visual map information corresponding to the scene corresponding to the video stream image.
  • obtaining video stream images from multiple cameras may be to obtain video stream images from multiple cameras installed in the same scene and with overlapping shooting ranges. For example, four cameras are installed at a traffic intersection to collect images separately. The video stream images in the four directions of the intersection with overlapping areas are obtained to obtain four video stream images, as shown in Figure 3.
  • Figure 3 is a schematic diagram of four cameras separately collecting one frame of video stream images in the four directions of the intersection.
  • obtaining the visual map information of the spatial position corresponding to the video stream image may be to obtain the map information at the traffic intersection in order to obtain the global position information of the space corresponding to the video stream image.
  • the position information in the map information may be described by global coordinate information in the world coordinate system.
  • the map information may be a high-precision map from a bird's-eye view, as shown in Figure 4.
  • Figure 4 is a schematic diagram of a high-precision map from a bird's-eye view at a traffic intersection. It should be understood that the map information may also be a general map, that is, a non-high-precision map, for example, an ordinary navigation electronic map.
  • High-precision map is a thematic map relative to ordinary navigation electronic map, also known as high-resolution map.
  • the absolute position accuracy is close to the meter level, and the relative position accuracy is at the centimeter level; the data organization method is to use different layers to describe water systems, railways, blocks, buildings, traffic markings and other information, and then overlay the layers to express it.
  • Step 202 obtain the projection matrix between the camera from which each channel of video stream images comes and the visual map,
  • the mapping relationship can be described by a projection matrix.
  • multiple landmark pixel points are selected, or
  • multiple landmark spatial points are selected, and the projection matrix is calculated using the pixel coordinates and the corresponding map information; the intrinsic and extrinsic calibration parameters of the camera can be obtained in advance.
  • the above-mentioned landmark pixel points and spatial points are points whose positions are easy to determine accurately; for example, they can be a point on a landmark building, a point on a road sign, a corner point of a zebra crossing, and so on,
  • where a spatial point is the point in the world coordinate system corresponding to a pixel point.
  • for example, the pixels corresponding to traffic marking points are selected in the camera image, and the map information corresponding to those traffic marking points is determined in the high-precision map, so as to compute the projection matrix between the camera image and the map.
  • the specific calculation of the projection matrix can be carried out by substituting the coordinates of multiple pairs of corresponding points from the camera image and the high-precision map into constructed linear equations and solving them with the least squares method, for example using the direct linear transformation (DLT) algorithm, the P3P algorithm, the EPnP (Efficient PnP) algorithm, or the Bundle Adjustment (BA) algorithm.
  • the projection matrix can usually be predetermined and stored offline, or it can be determined in real time.
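For instance, with four or more pixel/map correspondences the projection matrix can be estimated by least squares in a DLT-style formulation, as in the simplified sketch below (an illustrative assumption; in practice a library routine such as OpenCV's findHomography or solvePnP could be used instead):

```python
import numpy as np

def estimate_homography(img_pts: np.ndarray, map_pts: np.ndarray) -> np.ndarray:
    """DLT-style least-squares estimate of the 3x3 homography mapping
    image pixels (u, v) to map coordinates (x, y). Needs >= 4 correspondences."""
    A = []
    for (u, v), (x, y) in zip(img_pts, map_pts):
        A.append([u, v, 1, 0, 0, 0, -x * u, -x * v, -x])
        A.append([0, 0, 0, u, v, 1, -y * u, -y * v, -y])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)          # null-space vector minimizes the residual
    return H / H[2, 2]

# Example: landmark pixels (e.g. zebra-crossing corners) and their map coordinates
img_pts = np.array([[100, 200], [400, 210], [420, 500], [90, 520]], dtype=float)
map_pts = np.array([[0.0, 0.0], [5.0, 0.0], [5.0, 8.0], [0.0, 8.0]])
H = estimate_homography(img_pts, map_pts)
print(np.round(H, 4))
```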
  • Step 203 For each channel of video stream images, extract the features of the current frame respectively, to obtain each channel's current feature map and/or each channel's current feature information,
  • for example, a convolutional neural network (CNN) may be used for the feature extraction;
  • the feature map is feature information, that is to say, each channel's current feature map mentioned above is each channel's current feature information mentioned above.
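For example, a shared CNN backbone can be run over the current frame of each channel to produce the per-channel feature maps. The torchvision ResNet-18 backbone below is an illustrative assumption, not the network specified by the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Keep everything up to the last convolutional stage as a feature extractor.
backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-2]).eval()

frames = [torch.randn(1, 3, 480, 800) for _ in range(4)]   # current frame of 4 channels
with torch.no_grad():
    feature_maps = [backbone(f) for f in frames]           # one feature map per channel
print(feature_maps[0].shape)   # e.g. torch.Size([1, 512, 15, 25])
```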
  • Step 204 For each current feature map, perform the following processing:
  • the initialized target features can be obtained through one of the following two implementation methods:
  • the initialized target features can be obtained through the following step 2041.
  • Step 2041 use the projection matrix of the camera from which the video stream images come to project the feature points in that channel's current feature map onto the BEV visual map, obtain the position information of the projected feature points, and initialize the target feature set in the BEV visual map;
  • the global position information of the projected feature point in the world coordinate system can be obtained;
  • each target feature is a vector of a set length. For example, target features are set for each category of target to be detected, taking pedestrians, motor vehicles, and non-motor vehicles as targets respectively, and the target features corresponding to all targets are used as a set of target detection vectors to be initialized, obtaining the initialized target feature vectors,
  • where the length of the above target features can be preset, for example to 256.
  • the target detection vector may be a 3D target detection vector, including 3D information, or a 2D target detection vector, including 2D information.
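The initialized target detection vectors can be viewed as a set of query embeddings, one group per target category, as is common in set-prediction detectors. The sketch below assumes that formulation purely for illustration; the counts and the 256-dimensional length follow the example above:

```python
import torch
import torch.nn as nn

class TargetQueries(nn.Module):
    """A fixed-size set of target detection vectors (length 256 each),
    covering pedestrians, motor vehicles, and non-motor vehicles."""
    def __init__(self, num_queries_per_class: int = 100, feat_len: int = 256, num_classes: int = 3):
        super().__init__()
        self.queries = nn.Embedding(num_queries_per_class * num_classes, feat_len)

    def forward(self) -> torch.Tensor:
        return self.queries.weight   # shape: (num_queries, feat_len)

queries = TargetQueries()()
print(queries.shape)   # torch.Size([300, 256])
```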
  • the target feature set can be initialized directly in the BEV visual map.
  • if the step of obtaining the initialized target features is implemented based on the above-mentioned step 2041, then the step of obtaining the reference position information of each target is implemented based on implementation method 1 below, that is, based on the following step 2042; if the target features are initialized directly in the BEV visual map as in the other implementation method mentioned above, then the step of obtaining the reference position information of each target is implemented based on implementation method 2 below.
  • Step 2042 Based on the initialized target characteristics, obtain the reference position information of each target in the BEV visual map.
  • a target search is performed based on the feature points projected onto the BEV visual map, and the reference position information of each target detected in the BEV visual map is obtained. That is to say, in implementation method 1, based on the feature points projected onto the BEV visual map, the initialized target features are parsed in the visual map of the same perspective to obtain the position information of each target in the BEV visual map, that is, the reference position information. From the above description, it can be seen that performing target parsing on the initialized target features in the visual map of the same perspective means performing a target search based on the feature points projected onto the BEV visual map.
  • the machine learning model can be a multi-layer perceptron.
  • the multi-layer perceptron is used to search the target detection vectors against the feature points projected onto the BEV visual map, so as to parse out the reference position information of each target in the BEV visual map. For example, the reference position information of different pedestrians and different vehicles in the BEV visual map is parsed out.
  • the initialized target detection vector and projected feature point position information are input into the machine learning model.
  • the machine learning model can parse the projected feature points based on the initialized target detection vectors to obtain targets of the categories identified by the initialized target detection vectors, which can include people, motor vehicles, non-motor vehicles, and so on.
  • the reference position information of the identified target in the BEV visual map can be determined. It can be seen from the above description that the projection feature points are analyzed based on the initialized target detection vector, that is, the target detection vector is searched for the projected feature points projected by the BEV visual map.
  • target analysis can be performed on the initialized target features in the same perspective visual map to obtain the reference position information of the target in the BEV visual map.
  • the machine learning model can be a multi-layer perceptron.
  • the multi-layer perceptron is used to parse the reference position information of each target in the BEV visual map. For example, the reference location information of different pedestrians and vehicles in the BEV visual map is parsed.
  • the initialized target detection vector is input into the machine learning model.
  • the machine learning model can parse and obtain the target of the category identified by the initialized target detection vector and the reference position information of the target in the BEV visual map based on the initialized target detection vector.
  • Step 2043 Use the camera's projection matrix to back-project the reference position information of each target into that channel's current feature map, to determine the feature position of each target's reference position in that channel's current feature map, and obtain the corresponding features based on the feature positions.
  • according to the camera's projection matrix and the reference position information, the feature position information in the current feature map corresponding to the reference position information is obtained,
  • and the corresponding features are determined based on that feature position information. That is to say, in the current feature map, features are obtained at the positions indicated by the feature position information.
  • the reference position information of target 1 corresponds to feature positions 1, 2, 3, and 4 in the 4-channel current feature map, and the corresponding features 1, 2, 3, and 4 are obtained from the 4 feature positions.
  • the position information in the current frame corresponding to the reference position information is obtained according to the camera's projection matrix and the reference position information, and the corresponding features are determined based on that position information. That is to say, in the current frame, feature extraction is performed at the positions indicated by that position information to obtain the corresponding features.
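The back-projection and feature lookup might be sketched as follows: a BEV reference position is mapped into a camera's feature-map coordinates with the inverse of that camera's homography, and a feature vector is sampled there by bilinear interpolation. The inverse-homography form and the feature-map stride are assumptions for illustration:

```python
import numpy as np
import torch
import torch.nn.functional as F

def sample_feature(feature_map: torch.Tensor, bev_xy: np.ndarray,
                   H_cam: np.ndarray, stride: int = 32) -> torch.Tensor:
    """Back-project a BEV reference position into one camera's feature map and
    bilinearly sample the feature vector at that location."""
    # BEV map point -> image pixel via the inverse homography
    p = np.linalg.inv(H_cam) @ np.array([bev_xy[0], bev_xy[1], 1.0])
    u, v = p[0] / p[2], p[1] / p[2]
    _, c, h, w = feature_map.shape
    # pixel -> normalized grid coordinates in [-1, 1] of the downsampled feature map
    gx = 2.0 * (u / stride) / (w - 1) - 1.0
    gy = 2.0 * (v / stride) / (h - 1) - 1.0
    grid = torch.tensor([[[[gx, gy]]]], dtype=feature_map.dtype)
    return F.grid_sample(feature_map, grid, align_corners=True).view(c)

H_cam = np.array([[0.05, 0.0, -10.0], [0.0, 0.08, -25.0], [0.0, 0.0005, 1.0]])
feat = torch.randn(1, 512, 15, 25)
print(sample_feature(feat, np.array([15.0, 8.0]), H_cam).shape)   # torch.Size([512])
```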
  • Step 2044 Perform feature fusion on each target to obtain the fusion features of each target;
  • the features of the same target in the current feature map of each channel are fused to obtain the first fusion feature.
  • the first features 1, 2, 3, and 4 of target 1 are fused to obtain the target 1's first fusion feature. That is to say, features corresponding to the same target in each image frame included in the image frame group are fused to obtain the first fusion feature corresponding to the target.
  • the same target captured by different cameras has different positions in the current frames, and the targets corresponding to the same pixel position in the current frames are different. Based on this, the features of the other targets in each feature map, excluding this target, can also be fused to obtain a second fusion feature. For example, for target 1, the features of the other targets, that is, the features of target 2,
  • the features of target 3, and so on, are fused, so that redundant target information can be removed and the desired target information can be enhanced.
  • the first fusion feature and the second fusion feature are fused to obtain the fusion feature of the target.
  • the above-mentioned fusion may include at least one of adding and concatenating feature vectors, where the addition may be a weighted-average addition.
  • the features of target 1 in feature map 1-feature map 4 can be fused, that is, feature 1-feature 4 can be fused to obtain the first fusion feature of target 1.
  • the features of target 2 and target 3 in feature maps 1 to 4 are fused, that is, feature 5 to feature 12 are fused to obtain the second fusion feature of target 1.
  • the first fusion feature of target 1 is fused with the second fusion feature to obtain the fusion feature of target 1.
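A minimal sketch of the two-stage fusion described above (own-target features combined into a first fusion feature, other-target features into a second, then both fused) could look like this; the equal-weight averaging and the concatenate-then-project step are illustrative choices:

```python
import torch
import torch.nn as nn

def fuse_target_features(own_feats: torch.Tensor, other_feats: torch.Tensor,
                         proj: nn.Linear) -> torch.Tensor:
    """own_feats:   (num_views, C)  features of this target in each feature map
       other_feats: (num_other, C)  features of the other targets across all feature maps"""
    first_fusion = own_feats.mean(dim=0)        # weighted-average addition (equal weights)
    second_fusion = other_feats.mean(dim=0)     # context from the other targets
    combined = torch.cat([first_fusion, second_fusion], dim=-1)   # concatenate the two
    return proj(combined)                       # project back to the working feature length

C = 256
proj = nn.Linear(2 * C, C)
own = torch.randn(4, C)       # target 1 seen in 4 camera feature maps
others = torch.randn(8, C)    # features of target 2 and target 3 across the 4 maps
print(fuse_target_features(own, others, proj).shape)   # torch.Size([256])
```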
  • Step 205 Based on the projected feature point information, search the fusion features to obtain the target detection results from the same perspective of the current frame group.
  • the fusion features and projected feature point information of all targets are analyzed through the machine learning model to obtain the target detection results from the BEV perspective, that is, the target detection results in the BEV visual map of the current frame,
  • the target detection results include three-dimensional position information, three-dimensional size information, and confidence.
  • the target detection results include two-dimensional position information, two-dimensional size information, and confidence.
  • the target category can be pedestrians, bicycles, or motor vehicles.
  • Step 206 Filter the target detection results according to the confidence threshold and retain valid target detection results.
  • target detection results whose confidence is less than the confidence threshold are eliminated to obtain effective target detection results.
  • the confidence levels corresponding to target detection result 1-target detection result 5 are 65%, 70%, 95%, 97% and 96% respectively. If the confidence threshold is 90%, then target detection result 1 and target detection result 2 can be eliminated.
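The filtering step itself can be as simple as the sketch below, which reproduces the example above with a 90% threshold:

```python
def filter_detections(detections, conf_threshold=0.90):
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d["confidence"] >= conf_threshold]

detections = [
    {"id": 1, "confidence": 0.65}, {"id": 2, "confidence": 0.70},
    {"id": 3, "confidence": 0.95}, {"id": 4, "confidence": 0.97},
    {"id": 5, "confidence": 0.96},
]
print(filter_detections(detections))   # results 1 and 2 are eliminated
```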
  • Step 207 Add the retained valid target detection results to the initialized target feature set used for target search in the next frame group,
  • the initialized target feature set of the next frame group includes an initialized target detection vector used for target search.
  • the number of target detection vectors for the next frame group is: the n initialized target detection vectors plus the m historical valid target detection results, where the current frame group is a historical frame group relative to the next frame group, and the m valid target detection results are the historical valid target detection results.
  • Step 208 Determine whether the valid target detection result of the current frame group is related to the valid target detection result of the previous frame group.
  • if the valid target detection result comes from an initialized target, it means that the target corresponding to the valid target detection result is newly detected, and the valid target detection result is given a new target position sequence identifier (ID); here, the target corresponding to the initialized target detection vectors of the first image frame group is the initial initialized target. If the valid target detection result comes from an initialized target, it means that the target was not detected in the image frame groups before the current frame group, so the target corresponding to the valid target detection result is a newly detected target.
  • if the valid target detection result comes from a valid target detection result added in the previous frame group, it means that the target was detected both in the previous frame group (a historical frame group relative to the current frame group) and in the current frame group, so the target position sequence identifier of the valid target detection result remains unchanged and the target position sequence identifier from the previous frame group is used.
  • the current frame group, the next frame group, and the previous frame group should each be understood as a collection of simultaneous image frames from the channels, rather than a single frame of one channel. That is to say, the current frame group, the next frame group, and the previous frame group are image frame groups that each contain simultaneous image frames from every channel.
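The identifier bookkeeping across frame groups described in steps 207 and 208 can be sketched as follows; the dictionary layout and the "source" field are hypothetical illustrations of how a detection might record whether it came from an initialized target vector or was carried over from the previous frame group:

```python
import itertools

_new_id = itertools.count(1)

def assign_sequence_ids(valid_detections):
    """Give a new target position sequence ID to detections that come from
    initialized targets; reuse the previous ID for carried-over detections."""
    for det in valid_detections:
        if det.get("source") == "initialized":
            det["seq_id"] = next(_new_id)        # newly detected target
        else:
            det["seq_id"] = det["prev_seq_id"]   # same target as in the previous frame group
    return valid_detections

current_group = [
    {"source": "initialized", "bev_xy": (12.0, 4.5)},
    {"source": "previous", "prev_seq_id": 7, "bev_xy": (3.1, 9.8)},
]
print(assign_sequence_ids(current_group))
```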
  • Step 209 Determine whether the images in the video stream have been processed; if yes, execute step 210; if not, extract the next frame of images in each channel and return to step 203 until the images in the video stream are processed.
  • the electronic device can determine whether the next frame of image can be extracted, so as to determine whether the images in the video streams have been processed. If the next frame of image can be extracted, it means the images in the video streams have not all been processed; then it is necessary to extract the next frame of image for each channel of video stream and return to the step of extracting, for each channel of video stream images, the features of the current frame to obtain each channel's current feature map and/or current feature information, until the images in the video streams have been processed.
  • if the next frame of image cannot be extracted, it means the images in the video streams have been processed, and the electronic device can perform step 210.
  • Step 210 Output the valid target detection results of the same target position sequence ID, and obtain the position sequence of the valid target detection results in the BEV visual map, thereby obtaining the target position sequence.
  • for example, if the target position sequence ID is sequence 1,
  • the valid target detection results are result 1 to result 10,
  • and the corresponding collection times are 11:01 to 11:10 respectively, then sorting result 1 to result 10 by collection time yields the position sequence for sequence 1.
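Finally, collecting all valid detections that share a target position sequence ID and ordering them by collection time yields the position sequence, roughly as in this sketch (the detection dictionaries are hypothetical):

```python
from collections import defaultdict

def build_position_sequences(all_valid_detections):
    """Group valid detections by target position sequence ID and order each
    group's BEV positions by collection time."""
    sequences = defaultdict(list)
    for det in all_valid_detections:
        sequences[det["seq_id"]].append((det["time"], det["bev_xy"]))
    return {sid: [xy for _, xy in sorted(track)] for sid, track in sequences.items()}

detections = [
    {"seq_id": 1, "time": "11:02", "bev_xy": (2.0, 5.0)},
    {"seq_id": 1, "time": "11:01", "bev_xy": (1.0, 5.0)},
    {"seq_id": 2, "time": "11:01", "bev_xy": (8.0, 3.0)},
]
print(build_position_sequences(detections))   # sequence 1 ordered 11:01 -> 11:02
```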
  • the process of obtaining target detection results through the machine learning model in the above-mentioned steps 2042 and 205 can be understood as a process of querying or searching for a set target in the visual map from the BEV perspective, that is, a target query process to perform target detection.
  • FIG. 5 is a schematic diagram of the target detection process.
  • boxes with different grayscales represent different targets; C_li represents the reference position information in the visual map,
  • and C_lmi represents the reference position information back-projected to the position information in each channel's current image frame.
  • target 1 integrates other target information and the information of the target itself in the current image frames of each channel.
  • cameras 1 to 3 are used for shooting, and the first channel image data, the second channel image data, and the third channel image data are obtained.
  • the feature maps corresponding to the current frame image are respectively obtained from the three channels of image data to obtain the first feature map 521, the second feature map 522 and the third feature map 523.
  • the initialized target detection vector includes a first vector 501 for detecting pedestrians, a second vector 502 for detecting motor vehicles, and a third vector 503 for detecting non-motor vehicles.
  • a target search is performed on the first vector 501, the second vector 502, and the third vector 503.
  • three targets are detected, namely target 1 to target 3 respectively.
  • the reference position information corresponding to target 1 to target 3 can be determined, which are first position information 504, second position information 505, and third position information 506 respectively.
  • the first position information 504, the second position information 505 and the third position information 506 are back-projected to the first feature map 521, the second feature map 522 and the third feature map 523 respectively.
  • C_li represents the reference position information in the visual map,
  • C_lmi represents the back-projection of the reference position information to the position information in each current image frame.
  • feature fusion is performed on target 1 to target 3 respectively to obtain the fused features of target 1 to target 3.
  • the features of target 1 in the first feature map 521 to the third feature map 523 are fused to obtain the first fusion feature corresponding to target 1.
  • the features of target 2 and target 3 in the first feature map 521 to the third feature map 523 are fused to obtain the second fusion feature corresponding to target 1.
  • the first fusion feature and the second fusion feature corresponding to target 1 are fused to obtain the fusion feature of target 1.
  • the fusion features corresponding to target 1 to target 3 can be obtained, that is, the first feature 507, the second feature 508, and the third feature 509.
  • the above fusion features are searched to obtain target detection results corresponding to target 1 to target 3 respectively, that is, the first result 510, the second result 511, and the third result 512.
  • the target detection results include confidence.
  • the target detection results are filtered according to whether the confidence included in the target detection results corresponding to target 1 to target 3 is greater than the confidence threshold.
  • the second result 511 and the third result 512 may be determined to be valid.
  • the second result 511 and the third result 512 are added to the initialized target detection vector set of the next frame group. Next, the next frame of image data is extracted from each of the 1st to 3rd channels of image data, and the process returns to the above step of obtaining the feature maps corresponding to the current frame images from the 3 channels of image data, until the images in the video streams have been processed.
  • the position sequence data of target 1 to target M in the BEV visual map can be obtained, that is, Pred 1 to Pred M .
  • FIG. 6 is a schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the device includes,
  • the first acquisition module is used to acquire at least two channels of video stream image data, wherein the image data of each channel of video stream at least includes image data of the overlapping area,
  • the second acquisition module is used to acquire the visual map information of the location of the overlapping area in the video stream image data
  • the target detection and position analysis module is used to convert the feature point information extracted from each image frame of each video stream image data into a visual map, so as to integrate the feature point information of each image frame into the same perspective, Obtain fusion information from the same perspective, where each image frame in each channel of image frames is simultaneous,
  • target detection and target location analysis are performed.
  • the target detection and position analysis module is configured as:
  • the projection matrix is used to characterize the mapping relationship between pixel points in the camera image and spatial points in the visual map;
  • target detection is performed to obtain target detection results under the same viewing angle corresponding to all image frames in each channel of image frames, and these target detection results are determined as the target detection results of the image frame group of each channel of image frames,
  • where the target detection results of each image frame group are: the target detection results under the same viewing angle corresponding to all image frames of each channel of image frames at different times.
  • Figure 7 is another schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the device includes a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the steps of the multi-camera target detection method according to an embodiment of the present application.
  • the memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located far away from the aforementioned processor.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it can also be a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the above-mentioned multi-camera target detection device may be an electronic device, and the electronic device may further include a communication bus and/or a communication interface.
  • the processor, communication interface, and memory complete communication with each other through the communication bus.
  • the communication bus mentioned in the above-mentioned electronic equipment can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc.
  • the communication bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface is used for communication between the above-mentioned electronic devices and other devices.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • a computer program is stored in the storage medium.
  • the computer program is executed by a processor, the steps of the multi-camera target detection method described in the embodiments of the present application are implemented.
  • Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the steps of the multi-camera target detection method described in the embodiments of the present application.
  • since these embodiments are substantially similar to the method embodiments, the description is relatively simple; for relevant details, please refer to the description of the method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

A multi-camera target detection method, the method comprising: acquiring at least two channels of video stream image data from different cameras, where each channel of video stream image data at least includes image data of an overlapping area; acquiring visual map information of the location of the overlapping area; converting the feature point information extracted from each image frame of each channel of video stream image data into the visual map, so as to fuse the feature point information of each channel's image frames into the same perspective and obtain fusion information under the same perspective; and performing target detection and target position analysis based on the fusion information. There is no need to first generate a target position sequence corresponding to a single camera and then fuse the target position sequences, which solves the problem that it is difficult to screen and choose among target position sequences corresponding to multiple single cameras when fusing them, avoids the computing power consumption caused by fusing target position sequences, and improves the accuracy and reliability of target detection and target position analysis.

Description

A multi-camera target detection method and apparatus
This application claims priority to the Chinese patent application No. 202211108975.7, entitled "A multi-camera target detection and tracking method and apparatus", filed with the Chinese Patent Office on September 13, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image-based target detection, and in particular, to a multi-camera target detection method.
Background
With the continuous progress of society, target detection and target position analysis are being applied more and more widely. Application scenarios such as smart cities, smart intersections, and autonomous driving usually require fusing information from multiple channels of images from multiple cameras and using the fused images for target detection and target position analysis.
Most current target detection and target position analysis methods accurately locate the target position in each of the consecutive frames of a video stream based on target detection results. Multi-camera target detection and target position analysis performs target detection and target position analysis separately on the image information from each camera to generate a single-camera target position sequence, then projects the target position sequence of each camera into a view of the same perspective, for example a Bird's Eye View (BEV), and finally fuses the projected target position sequences into a global target position sequence.
In the above multi-camera target detection and target position analysis method, performing target detection and target position analysis step by step makes the system relatively complex and occupies communication resources; moreover, if there is a problem with the target position sequence of a single camera, it is difficult to screen and choose among the single-camera target position sequences when the target position sequences of multiple cameras are fused under the same perspective, making it difficult to obtain a correct global target position sequence.
Summary
The present application provides a multi-camera target detection method to obtain an accurate global target position sequence.
In a first aspect, the present application provides a multi-camera target detection method, which includes:
acquiring at least two channels of video stream image data from different cameras, where each channel of video stream image data at least includes image data of an overlapping area,
acquiring visual map information of the spatial location corresponding to the video stream image data,
converting the feature point information extracted from each image frame of each channel of video stream image data into the visual map, so as to fuse the feature point information of each channel's image frames into the same perspective and obtain fusion information under the same perspective, where the image frames corresponding to the same acquisition time in each channel of image frames are simultaneous,
and performing target detection and target position analysis based on the fusion information.
Preferably, the converting of the feature point information extracted from each image frame of each channel of video stream image data into the visual map includes:
for each image frame in each channel of image frames:
performing feature extraction on the image frame to obtain feature point information and/or a feature map of the image frame,
projecting the feature points of the image frame into the visual map using the projection matrix of the camera from which the image frame originates, to obtain projected feature point information of the image frame under the same perspective,
determining the projected feature point information of all image frames in each channel of image frames as the fusion information,
where
the projection matrix is used to characterize the mapping relationship between pixel points in the camera image and spatial points in the visual map;
the performing of target detection and target position analysis based on the fusion information includes:
performing target detection based on the projected feature point information of all image frames in each channel of image frames, obtaining target detection results under the same perspective corresponding to all image frames in each channel of image frames, and determining these target detection results as the target detection results of the image frame group of each channel of image frames,
obtaining target position sequence data from the target detection results of the image frame groups,
where the target detection results of the image frame groups are: the target detection results under the same perspective corresponding to all image frames of each channel of image frames at different times.
Preferably, the obtaining of target position sequence data from the target detection results of the image frame groups includes:
marking each target detection result in the intersection of the target detection results of the current image frame group and the target detection results of historical image frame groups,
obtaining target position sequence data in the world coordinate system from the marked target detection results of the image frame groups.
Preferably, the performing of target detection based on the projected feature point information of all image frames in each channel of image frames includes:
searching pre-initialized target features based on the projected feature point information to obtain target reference position information,
back-projecting the target reference position information into each channel of image frames using the projection matrix of the camera from which each channel of image frames originates, to determine position information of the target reference position in the image frames,
obtaining the features of the corresponding target according to the position information in the image frames, to obtain each channel's features of the target,
fusing each channel's features of the target to obtain a fusion feature of the target,
searching the fusion feature based on the projected feature point information to obtain the target detection result;
where the target detection result includes: global position information under the same perspective, target size, and confidence.
Preferably, the searching of pre-initialized target features based on the projected feature point information to obtain target reference position information includes:
inputting the projected feature point information and pre-initialized target detection vectors into a machine learning model to obtain reference position information of each target, where the target detection vectors include target feature vectors of two or more targets,
the back-projecting of the target reference position information into each channel of image frames using the projection matrix of the camera from which each channel of image frames originates includes:
back-projecting the reference position information of each target into the feature map corresponding to each channel of image frames using the projection matrix of the camera from which each channel of video stream images originates, to determine position information of each target's reference position in the feature maps,
the obtaining of the features of the corresponding target according to the position information in the image frames, to obtain each channel's features of the target, includes:
obtaining the feature corresponding to each target according to the position information of each target in the feature maps.
Preferably, the fusing of each channel's features of the target to obtain a fusion feature of the target includes:
performing feature fusion separately for each target to obtain a fusion feature of each target,
the searching of the fusion feature based on the projected feature point information includes:
inputting the fusion feature of each target and the projected feature point information into a machine learning model to obtain the target detection result.
Preferably, the performing of feature fusion separately for each target includes:
for each target:
based on each channel's feature map, fusing the features of the target in the feature maps to obtain a first fusion feature,
based on each channel's feature map, fusing the features of the other targets except this target in the feature maps to obtain a second fusion feature,
fusing the first fusion feature and the second fusion feature to obtain the fusion feature of the target;
the method further includes:
filtering the target detection results of the current image frame group according to a set confidence threshold to obtain valid target detection results,
adding the valid target detection results to the initialized target feature set of the next image frame group.
Preferably, the marking of each target detection result in the intersection of the target detection results of the current image frame group and the target detection results of historical image frame groups includes:
if a valid target detection result of the current image frame group comes from an initialized target feature, assigning a new target position sequence identifier to the valid target detection result;
if a valid target detection result of the current image frame group comes from a valid target detection result added by the previous image frame group, keeping the target position sequence identifier of the valid target detection result added by the previous image frame group;
the obtaining of target position sequence data in the world coordinate system from the marked target detection results of the image frame groups includes:
determining, among the marked target detection results of the image frame groups, the target detection results having the same target position sequence identifier as the target position analysis data of that target detection result;
the visual map is a bird's-eye view map, and the same perspective is the bird's-eye view.
第二方面,本申请实施例还提供一种多相机目标检测装置,该装置包括:
第一获取模块,用于获取至少两路以上来自不同相机的视频流图像数据,其中,每路 视频流图像数据至少包括重叠区域的图像数据,
第二获取模块,用于获取所述视频流图像数据对应空间所在位置的视觉地图信息,
目标检测及位置分析模块，用于将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至视觉地图中，以将每路图像帧特征点信息融合至同一视角下，得到同一视角下的融合信息，其中，每路图像帧中同一采集时间对应的每一图像帧具有同时性，
基于所述融合信息,进行目标检测和目标位置分析。
第三方面,本申请实施例再提供一种计算机可读存储介质,所述存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现上述第一方面任一所述多相机目标检测的方法步骤。
第四方面,本申请实施例再提供一种电子设备,包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过通信总线完成相互间的通信;
存储器,用于存放计算机程序;
处理器,用于执行存储器上所存放的程序时,实现上述第一方面任一所述多相机目标检测的方法步骤。
第五方面,本申请实施例再提供一种包含指令的计算机程序产品,当所述计算机程序产品在计算机上运行时,使得计算机执行上述第一方面任一所述多相机目标检测的方法步骤。
本申请实施例提供的多相机目标检测方法,将来自不同相机的视频流图像特征投影至同一视角下,基于同一视角下的融合信息来进行目标检测和目标位置分析,这样,从源头上来进行信息的融合,有利于提高用于目标检测及目标位置分析的信息源的准确性,提高了多相机融合的智能性,并且,无需先产生单相机对应的目标位置序列再进行目标位置序列的融合,解决了多个单相机对应的目标位置序列进行融合时难以甄别和取舍的问题,既避免目标位置序列融合所带来的算力消耗,又提高了目标检测和目标位置分析的准确性和可靠性。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。
图1为本申请实施例多相机目标检测方法的一种流程示意图。
图2为本申请实施例具体场景下的多相机目标检测方法的一种流程示意图。
图3为4个相机分别采集路口4个方向的一帧视频流图像的一种示意图。
图4为交通路口处的鸟瞰视角的高精度地图的一种示意图。
图5为目标检测过程的一种示意图。
图6为本申请实施例多相机目标检测装置的一种示意图。
图7为本申请实施例多相机目标检测装置的另一种示意图。
具体实施方式
为使本申请的目的、技术方案及优点更加清楚明白，以下参照附图并举实施例，对本申请进一步详细说明。显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员所获得的所有其他实施例，都属于本申请保护的范围。
本申请实施例将各路视频流图像数据中对应于同一时间的图像帧的特征点投影至同一视角视觉地图,以将各路视频流图像中该图像帧的特征点信息融合至同一视角下,基于同一视角下的融合信息进行目标检测和目标位置分析。
参见图1所示,图1为本申请实施例多相机目标检测方法的一种流程示意图。该方法包括:
步骤101,获取至少两路视频流图像数据,
其中,每路视频流图像数据至少包括从不同拍摄角度所采集的同一场景的图像数据,也就是说,各路视频流图像数据为不同相机从不同拍摄角度所采集的同一场景的图像数据。通常,每路视频流图像数据由安装于不同位置的相机对同一场景进行图像采集而得到,从而可得到从不同拍摄角度的图像数据。
从数据的角度而言，同一场景系指每路视频流图像数据之间至少存在交集数据，即同一场景使得每路视频流图像数据对应的拍摄场景存在重叠区域；从空间位置角度而言，同一场景系指位于同一空间位置范围内的各个位置的集合，空间位置范围可根据需要设定，换言之，各路视频流图像对应的拍摄场景之间存在重叠区域。
获取的途径可以是从相机获取的实时视频流图像数据,也可以是从存储端获取的非实时视频流图像数据,本申请对此不做限定。
步骤102,获取所述视频流图像数据中重叠区域所在位置的视觉地图信息,
作为一种示例,重叠区域所在位置可通过相机的位置信息来获取,根据相机安装的地理位置信息,从地图库中获取对应的鸟瞰视角视觉地图信息。
视觉地图可以是鸟瞰视角视觉地图,鸟瞰视角可理解为一种俯视视角,鸟瞰视角视觉地图信息相当于俯视视觉地图信息,该地图信息包括有世界坐标系下的全局位置信息。
其中，重叠区域是多个相机拍摄范围之间重叠的区域。在一种实施方式中，根据各相机安装的地理位置信息、架设角度以及拍摄范围，确定各相机的拍摄范围的重叠区域，从地图库中获取该重叠区域对应的鸟瞰视角视觉地图信息。在另一种实施方式中，从地图库中获取预设范围对应的鸟瞰视角视觉地图信息，该预设范围至少包括上述重叠区域。根据各相机安装的地理位置信息，确定各相机在完整的鸟瞰视角视觉地图中的位置，根据各相机的架设角度以及拍摄范围，确定各相机拍摄范围的重叠区域在完整的鸟瞰视角视觉地图中的位置。
步骤103,将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至同一视角视觉地图中,以将每路图像帧特征点信息融合至同一视角下,得到同一视角下的融合信息,
其中,每路图像帧中的每一图像帧具有同时性。例如,每路的图像帧对应相同时间,所应理解的是,绝对意义上的相同时间的图像帧有利于提高目标检测及目标位置分析的精度。又例如,每路图像帧具有相同的时间戳信息,但实际应用中,每路图像帧中的每一图像帧并不需要严苛地在同一时刻,只要每路图像帧中的每一图像帧之间的时间差在设定的时间阈值内即可,这种情况便相当于对应相同的时间,也具有同时性。若每路图像帧不具有同时性,可进行同步处理。其中,上述每路图像帧中每一图像帧为各路图像帧中采集时间相同的一组图像帧,上述时间差为该一组图像帧的采集时间之间的差值。
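为便于理解上述按时间阈值判定同时性的处理，以下给出一个将各路图像帧划分为图像帧组的示意性Python代码片段；其中的数据结构、字段名与对齐策略均为便于说明而作的假设，并非本申请方案的限定实现：
```python
from typing import Dict, List

def group_frames_by_time(frames_per_stream: Dict[str, List[dict]],
                         time_threshold: float) -> List[Dict[str, dict]]:
    """将各路图像帧按采集时间分组：同一组内任意两帧的时间差不超过阈值（示意性实现）。

    frames_per_stream: {相机标识: [{'timestamp': 采集时间(秒), 'image': 图像数据}, ...]}
    假设各路帧已按时间戳升序排列。
    """
    streams = list(frames_per_stream.keys())
    indices = {s: 0 for s in streams}
    groups = []
    while all(indices[s] < len(frames_per_stream[s]) for s in streams):
        candidate = {s: frames_per_stream[s][indices[s]] for s in streams}
        times = [f['timestamp'] for f in candidate.values()]
        if max(times) - min(times) <= time_threshold:
            groups.append(candidate)          # 满足同时性，构成一个图像帧组
            for s in streams:
                indices[s] += 1
        else:
            # 丢弃时间最早的一帧，向后对齐，相当于同步处理
            earliest = min(streams, key=lambda s: candidate[s]['timestamp'])
            indices[earliest] += 1
    return groups
```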
在该步骤中,对于每路图像帧中的每一单帧的图像帧:
分别对该图像帧进行特征提取,得到该图像帧的特征点信息和/或特征图,
利用该图像帧所来源相机的投影矩阵,将该图像帧的特征点投影至视觉地图中,得到该图像帧的同一视角下的投影特征点信息,该投影特征点信息包括全局位置信息,即投影特征点在世界坐标系下的坐标。
将每路图像帧中同一采集时间对应的所有图像帧的所述投影特征点信息,确定为所述融合信息,这样,该融合信息表征了场景在同一视角下同一时间的特征信息,也就是说,表征了同一视角下具有同时性的所有路图像帧对应的特征信息。
鉴于每路图像帧系时间差在设定的时间阈值内的图像帧集合,即,每路图像帧中同一采集时间对应的图像帧是具有同时性的各路图像帧,该集合在本申请中称为图像帧组,融合信息可理解为图像帧组的融合信息;例如,共有3路图像帧,第1路图像帧~第3路图像帧中采集时间为8:00的图像帧分别为图像帧1、图像帧2和图像帧3,采集时间为8:01的图像帧分别为图像帧4、图像帧5和图像帧6。由于图像帧1~图像帧3的采集时间相同,图像帧4~图像帧6的采集时间相同,因此可以将图像帧1~图像帧3作为一个图像帧组,将图像帧4~图像帧6作为另一个图像帧组。
其中,投影矩阵用于表征相机图像中的像素点与视觉地图中的空间点之间的映射关系,不同相机对应有不同的投影矩阵;也就是说,投影矩阵用于表征相机图像的图像坐标系与视觉地图的地图坐标系之间的映射关系。
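作为理解上述投影过程的一个示意性片段，下述Python/NumPy代码将图像特征点的像素坐标映射至视觉地图坐标；此处假设投影矩阵为3×3的单应矩阵（即假设特征点近似位于地面平面上），这仅是一种简化假设，投影矩阵本身的一种求解方式见后文实施例中的步骤202：
```python
import numpy as np

def project_points_to_map(pixel_points: np.ndarray, H: np.ndarray) -> np.ndarray:
    """将图像特征点投影至视觉地图（鸟瞰视角）坐标系（示意性实现）。

    pixel_points: (N, 2) 像素坐标 (u, v)
    H: (3, 3) 投影矩阵（像素坐标到地图坐标的单应矩阵，示意性假设）
    返回: (N, 2) 地图坐标 (x, y)
    """
    n = pixel_points.shape[0]
    homo = np.hstack([pixel_points, np.ones((n, 1))])   # 转为齐次坐标 (u, v, 1)
    mapped = (H @ homo.T).T                              # 矩阵乘法完成投影
    return mapped[:, :2] / mapped[:, 2:3]                # 除以齐次分量得到地图坐标
```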
步骤104,基于所述融合信息,进行目标检测和目标位置分析。
在该步骤中,基于所述每路图像帧中所有图像帧的所述投影特征点信息,进行目标检测。
作为一种示例,基于所述投影特征点信息,对预先初始化的目标特征进行搜索,得到目标参考位置信息,也就是说,基于投影特征点信息,在同一视角视觉地图中,对预先初始化的目标特征进行解析,得到目标在视觉地图中的目标参考位置信息。其中,预先初始化的目标特征可以为预先确定的所要检测的各类目标对应的特征,这样,基于预先初始化的目标特征进行解析,可以识别出各路图像帧中包括的这些类别的目标。目标参考位置信息可以为目标在视觉地图中的坐标。从上述描述可见,对预先初始化的目标特征进行解析,也就是基于投影特征点信息,对预先初始化的目标特征进行搜索。
利用每路图像帧所来源相机的投影矩阵,将目标参考位置信息分别反投影至每路图像帧中,以确定目标参考位置在图像帧中的位置信息,根据图像帧的位置信息,获取对应目标的特征,得到目标的每路特征,融合目标的每路特征,得到目标的融合特征,基于所述投影特征点信息,对所述融合特征进行搜索,得到所述每路图像帧中所有图像帧对应的同一视角下的目标检测结果,将该目标检测结果确定为所述每路图像帧的图像帧组的目标检测结果,其中,图像帧组由具有同时性的每路图像帧组成;目标检测结果包括:同一视角下的全局位置信息、目标尺寸以及置信度,还可以包括目标标识和/或目标类别。
也就是说,在确定目标参考位置在图像帧中的位置信息后,针对每路图像帧,基于该路图像帧中位置信息所指示的位置,对该路图像帧进行特征提取,得到该路图像帧中目标的特征。针对各个图像帧组,对目标在该图像帧组所包括的各图像帧中的特征进行融合,得到该图像帧组对应的融合特征。针对每个图像帧组,基于投影特征点信息,对该图像帧组对应的融合特征进行解析,得到该图像帧组对应的采集时间下的目标检测结果。
例如,融合特征可以为一个128维向量,将该融合特征输入多层感知机,多层感知机便可以解析出该融合特征在视觉地图中对应的位置是否存在目标。如果存在目标,多层感知机还可以解析出目标类别以及目标尺寸等目标的具体信息。
上述对融合特征来进行搜索的检测方式,能够实现从多路图像帧特征信息的关联性检测,有利于提高目标检测的准确性。
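为便于理解“对融合特征进行搜索/解析”的一种可能形式，以下给出一个多层感知机检测头的示意性PyTorch片段；其中特征维度128、类别数3等均为示意性假设，且为简洁起见未将投影特征点信息一并作为输入，实际实现不限于此：
```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """对目标融合特征进行解析的多层感知机检测头（示意性实现）。"""

    def __init__(self, feat_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.conf_head = nn.Linear(256, 1)            # 置信度
        self.cls_head = nn.Linear(256, num_classes)   # 目标类别（如行人/机动车/非机动车）
        self.size_head = nn.Linear(256, 3)            # 目标尺寸（3D情形下的长宽高）
        self.pos_head = nn.Linear(256, 2)             # 鸟瞰视角下的全局位置偏移

    def forward(self, fused_feat: torch.Tensor) -> dict:
        h = self.mlp(fused_feat)
        return {
            'confidence': torch.sigmoid(self.conf_head(h)),
            'class_logits': self.cls_head(h),
            'size': self.size_head(h),
            'position_offset': self.pos_head(h),
        }
```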
从各个图像帧组的目标检测结果中,获取目标位置序列数据,其中,各个图像帧组的目标检测结果为:各个不同时间下的所述图像帧组所对应的目标检测结果,即,历史图像帧组的目标检测结果。目标位置序列数据为目标在视觉地图中不同时间下所处位置按照时间顺序排列得到的序列。历史图像帧组为采集时间在当前时间之前的图像帧组。
作为一种示例,若融合信息为一个图像帧组的融合信息,则可以基于本图像帧组的融合信息进行目标检测。
以基于多个图像帧组的融合信息进行目标检测和目标位置分析为例,对当前图像帧组同一视角下的目标检测结果与历史图像帧组同一视角下的目标检测结果的交集中的各个目标检测结果,进行标记,例如,将交集中的目标检测结果继承既有目标位置序列标识。也就是说,确定当前图像帧组对应的目标检测结果所包括的目标标识,与历史图像帧组对应的目标检测结果所包括的目标标识的交集,将交集中的目标标识所属的各个目标检测结果,进行标记。其中,当前图像帧组为当前时间采集的图像帧组。
将不在交集中的当前图像帧组同一视角下的目标检测结果赋予新的目标位置序列标识;将具有相同目标位置序列标识的目标检测结果,确定为该目标检测结果的目标位置序列数据。也就是说,对当前图像帧组的目标检测结果所包括的目标标识中,不在上述交集中的目标标识所属的目标检测结果赋予新的目标位置序列标识;将具有相同目标位置序列标识的目标检测结果所包括的全局位置信息,按照对应的采集时间进行排序,得到目标位置序列。
由此,目标检测和目标位置分析融合为一体,无需先进行检测再进行目标位置分析,使得检测及位置分析整体更加简洁。
本申请实施例的多相机目标检测方法,通过将每路图像帧特征点信息融合至同一视角下,提供了一种端到端的、多视角的、检测及位置分析一体化的方法,无需获取单相机对应的目标位置序列后再将各单相机对应的目标位置序列进行融合,避免了单相机中的目标位置序列存在问题所导致难以获得正确的全局目标位置序列的问题,有利于提高多相机目标检测及目标位置分析的可靠性和准确性。
为便于理解本申请实施例,以下以应用于交通路口的多相机目标检测及目标位置分析为例来说明,所应理解的是,本申请不限于交通路口的多相机目标检测及目标位置分析,任何应用的多相机目标检测及目标位置分析均可适用,例如,安装于车辆本体的多路相机对周边目标检测及目标位置分析等应用。
参见图2所示,图2为本申请实施例具体场景下的多相机目标检测方法的一种流程示意图。该方法包括:
步骤201,获取来自多相机的视频流图像、以及视频流图像对应空间位置的视觉地图信息,
其中,视频流图像对应空间位置的视觉地图信息为,视频流图像对应的场景所对应的视觉地图信息。
作为一种示例,获取来自多相机的视频流图像可以是获取来自架设于同一场景且拍摄范围存在重叠部分的多个相机的视频流图像,例如,在某交通路口处安装有4个相机分别采集路口4个方向且具有重叠区域的视频流图像,得到4路视频流图像,参见图3所示,图3为4个相机分别采集路口4个方向的一帧视频流图像的一种示意图。
作为一种示例,获取视频流图像对应空间位置的视觉地图信息可以是,获取该交通路口处的地图信息,以便得到视频流图像对应空间的全局位置信息,地图信息中位置信息可采用世界坐标系下的全局坐标信息来描述。
为便于呈现和展示,地图信息可以为鸟瞰视角的高精度地图,参见图4所示,图4为交通路口处的鸟瞰视角的高精度地图的一种示意图。所应理解的是,地图信息也可以是一般的地图,即非高精度地图,例如,普通导航电子地图。
高精度地图是和普通导航电子地图相对而言的一种专题地图,也称为高分辨率地图。其绝对位置精度接近米级,相对位置精度在厘米级别;数据组织方式是通过不同的图层去描述水系、铁路、街区、建筑物、交通标记线等信息,然后将图层叠加来进行表达。
步骤202,获取每路视频流图像所来源的相机与视觉地图信息对应的投影矩阵,
鉴于一帧相机图像中的像素点与空间点之间满足相机模型的映射关系，该映射关系可通过投影矩阵来描述，这样，在相机图像中，选取多个具有标志性的像素点，或者，在地图信息中，选取多个具有标志性的空间点，利用像素坐标和对应的地图信息，便可计算出投影矩阵，相机的内外标定参数可预先获得。其中，上述具有标志性的像素点和空间点即为容易准确确定其位置的点，例如，可以是标志性建筑物的某一点、道路指示牌上的某一点、某一条斑马线的角点等。其中，空间点为与像素点对应的世界坐标系下的点。
作为一种示例，在相机图像中选取交通标记点对应的像素点，在高精度地图中可确定交通标记点对应的地图信息，利用像素坐标和地图坐标，计算该图像所来源的相机与高精度地图的投影矩阵，其中，投影矩阵的具体计算方式可以为：将相机图像和高精度地图多组对应点的坐标代入所构建的线性方程组，并使用最小二乘法求解获得，例如，采用直接线性变换(DLT)算法、P3P算法、EPnP(Efficient PnP)算法、光束平差(Bundle Adjustment，BA)算法。
投影矩阵通常可以以离线方式预先确定并存储,也可以实时确定。
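下述Python代码给出利用多组对应点求解投影矩阵的一个示意性片段，此处以像素平面到地图平面的单应矩阵为例，并借助OpenCV的findHomography接口（其在最小二乘意义下拟合单应矩阵）；像素坐标与地图坐标均为虚构的示意数据，接口选择亦仅为一种可选实现：
```python
import cv2
import numpy as np

# 图像中选取的标志性像素点（例如斑马线角点、交通标志点），示意性数据
pixel_points = np.array([[352.0, 618.0], [1040.0, 602.0],
                         [880.0, 455.0], [210.0, 470.0]], dtype=np.float32)

# 高精度地图中对应的地图平面坐标（世界坐标系下的 x, y），示意性数据
map_points = np.array([[12.5, 3.2], [18.7, 3.0],
                       [18.1, 9.4], [11.9, 9.6]], dtype=np.float32)

# 以对应点拟合像素坐标到地图坐标的单应矩阵
H, _ = cv2.findHomography(pixel_points, map_points, method=0)
print(H)  # 3x3 投影矩阵，可离线计算并存储
```
实际工程中通常会选取更多对应点，并可结合RANSAC等稳健估计方式抑制误匹配；若需完整的3D位姿，可改用PnP类求解方式。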
步骤203,对于每路视频流图像,分别提取当前帧中特征,得到各路当前特征图和/或各路当前特征信息,
在该步骤中,可以采用CNN(Convolutional Neural Networks,卷积神经网络)来提取每路当前帧中的特征,按照特征的像素位置信息对特征数据进行组织,可得到对应的当前特征图,其中,当前帧为各路视频流同一时间的单帧图像。特征图即特征信息,也就是说,上述各路当前特征图即上述各路当前特征信息。
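作为特征提取的一个示意性实现（PyTorch/torchvision），下述代码以去掉分类层的ResNet-18作为骨干网络提取某一路当前帧的特征图；骨干网络类型、输入分辨率等均为示意性假设，并非对CNN结构的限定：
```python
import torch
import torchvision

# 以去掉分类头的ResNet-18作为特征提取骨干网络（示意性选择）
backbone = torchvision.models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

frame = torch.randn(1, 3, 720, 1280)          # 某一路当前帧图像（示意数据）
with torch.no_grad():
    feature_map = feature_extractor(frame)     # 输出当前特征图，形状约为 (1, 512, 23, 40)
print(feature_map.shape)
```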
步骤204,对于每路当前特征图,进行如下处理:
获得初始化的目标特征，获得各目标的参考位置信息，并执行步骤2043以及步骤2044；
其中,可以通过以下两种实现方式中一种,获得初始化的目标特征:
一种实现方式中,可以通过下述步骤2041获得初始化的目标特征。
步骤2041,利用该视频流图像所来源相机的投影矩阵,将该路当前特征图中特征点投影至BEV视觉地图中,得到投影特征点位置信息,并在BEV视觉地图中初始化目标特征集合;
在该步骤中,根据特征点的像素坐标向量和相机投影矩阵的内积,可得到投影特征点在世界坐标系下的全局位置信息;
作为一种示例,初始化一组用于进行目标搜索的目标特征,每个目标特征是一个设定长度的向量。例如,分别以行人、机动车、非机动车为目标设置每个目标对应的目标特征,将所有目标对应的目标特征作为一组目标检测向量,进行初始化,得到初始化目标特征向量。上述目标特征的长度可以预先设置为256等。
目标检测向量可以是3D目标检测向量,包括有3D信息,也可以是2D目标检测向量,包括有2D信息。
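初始化目标特征（目标检测向量）的一个最简示意如下（PyTorch），其中向量长度256对应上文的设定长度示例，目标检测向量数量100为示意性假设；该组向量可作为可学习参数参与训练：
```python
import torch
import torch.nn as nn

NUM_QUERIES = 100     # 预设的目标检测向量数量（示意值）
FEAT_DIM = 256        # 目标特征向量长度（与上文设定长度示例对应）

# 一组可学习的初始化目标特征向量
target_queries = nn.Embedding(NUM_QUERIES, FEAT_DIM)
init_vectors = target_queries.weight          # 形状 (100, 256)
print(init_vectors.shape)
```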
另一种实现方式中,可以直接在BEV视觉地图中初始化目标特征集合。
上述获得各目标的参考位置信息的步骤,可以通过以下两种方式中的一种实现:
具体的,如果获得初始化的目标特征的步骤是基于上述步骤2041实现的,那么获得各目标的参考位置信息的步骤基于下述实现方式一实现,也就是基于下述步骤2042实现;如果目标特征是基于上述另一种实现方式中直接在BEV视觉地图中初始化的方式获得的,那么获得各目标的参考位置信息的步骤基于下述实现方式二实现。
步骤2042,基于初始化的目标特征,得到BEV视觉地图中各目标的参考位置信息。
在该步骤中,基于BEV视觉地图所投影的投影特征点,进行目标搜索,得到BEV视觉地图中所检测到的各目标的参考位置信息。也就是说,在实现方式一中,基于BEV视觉地图所投影的投影特征点,在同一视角视觉地图中对初始化目标特征进行目标解析,得到各目标在BEV视觉地图中的位置信息,即参考位置信息。从上述描述可见,在同一视角视觉地图中对初始化目标特征进行目标解析,也就是基于BEV视觉地图所投影的投影特征点,进行目标搜索。
将初始化目标检测向量、投影特征点位置信息输入至机器学习模型中,例如,机器学习模型可以为多层感知机,利用多层感知机,对BEV视觉地图所投影的投影特征点进行目标检测向量的搜索,解析出BEV视觉地图中各个目标的参考位置信息。例如,解析出BEV视觉地图中不同行人、不同车辆的参考位置信息。
具体来说,将初始化目标检测向量、投影特征点位置信息输入至机器学习模型中,机器学习模型可以基于初始化目标检测向量对投影特征点进行解析,得到初始化目标检测向量所标识的类别的目标,可以包括人、机动车、非机动车等。进而,根据投影特征点在BEV视觉地图中对应的位置信息,可以确定识别得到目标在BEV视觉地图中的参考位置信息。 从上述描述可见,基于初始化目标检测向量对投影特征点进行解析,也就是对BEV视觉地图所投影的投影特征点进行目标检测向量的搜索。
在实现方式二中,可以在同一视角视觉地图中对初始化目标特征进行目标解析,得到目标在BEV视觉地图中的参考位置信息。
将初始化目标检测向量输入至机器学习模型中,例如,机器学习模型可以为多层感知机,利用多层感知机,解析出BEV视觉地图中各个目标的参考位置信息。例如,解析出BEV视觉地图中不同行人、不同车辆的参考位置信息。
具体来说,将初始化目标检测向量输入至机器学习模型中,机器学习模型可以基于初始化目标检测向量,解析得到初始化目标检测向量所标识的类别的目标以及目标在BEV视觉地图中的参考位置信息。
步骤2043,利用相机的投影矩阵,将各目标的参考位置信息反投影至该路当前特征图中,以确定各目标的参考位置在该路当前特征图中的特征位置,并基于特征位置获取对应的特征,
在该步骤中,对于每个目标的参考位置,根据相机的投影矩阵和该参考位置信息,得到该参考位置信息所对应的当前特征图中的特征位置信息,在当前特征图中,根据该特征位置信息,确定对应的特征。也就是说,在当前特征图中,从特征位置信息所指示的位置获取得到特征。
例如,目标1的参考位置信息在4路当前特征图中分别对应为特征位置1、2、3、4,从4个特征位置分别获取对应的特征1、2、3、4。
作为另一种实施方式,对于每个目标的参考位置,根据相机的投影矩阵和该参考位置信息,得到该参考位置信息所对应的当前帧中的位置信息,根据该位置信息,确定对应的特征。也就是说,在当前帧中,对特征位置信息所指示的位置进行特征提取,得到对应的特征。
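将参考位置反投影至当前特征图并在对应位置获取特征的过程，可用下述示意性PyTorch片段表达；此处假设反投影得到的坐标已换算至特征图分辨率，并采用双线性采样获取特征，具体实现方式不限于此：
```python
import torch
import torch.nn.functional as F

def sample_target_features(feature_map: torch.Tensor,
                           positions: torch.Tensor) -> torch.Tensor:
    """在特征图中按位置采样目标特征（双线性插值，示意性实现）。

    feature_map: (1, C, H, W) 某一路当前特征图
    positions:   (M, 2) 各目标参考位置反投影到该特征图上的坐标 (x, y)
    返回:        (M, C) 每个目标在该路特征图中的特征
    """
    _, _, h, w = feature_map.shape
    # 将坐标归一化到 [-1, 1]，以满足 grid_sample 的输入要求
    norm_x = positions[:, 0] / (w - 1) * 2 - 1
    norm_y = positions[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([norm_x, norm_y], dim=-1).view(1, -1, 1, 2)
    sampled = F.grid_sample(feature_map, grid, align_corners=True)  # (1, C, M, 1)
    return sampled.squeeze(-1).squeeze(0).transpose(0, 1)           # (M, C)
```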
步骤2044,对每个目标分别进行特征融合,得到该每个目标的融合特征;
在该步骤中,对同一目标的特征,融合每路当前特征图中同一目标的特征,得到第一融合特征,例如,将目标1的第一特征1、2、3、4进行融合,得到目标1的第一融合特征。也就是说,对图像帧组所包括的各图像帧中,同一目标对应的特征进行融合,得到该目标对应的第一融合特征。
由于相机拍摄角度、拍摄位置的不同,不同相机所采集的同一目标在当前帧中的位置不同,当前帧中相同像素位置对应的目标不同。基于此,还可以对每路特征图中除该目标之外的其他目标的特征进行融合,得到第二融合特征,例如,将除目标1之外的其他目标的特征进行融合,即将目标2的特征、目标3的特征…进行融合,这样,既可以去除冗余目标信息,又有利于增强期望目标信息。
将第一融合特征和第二融合特征进行融合,得到该目标的融合特征。
上述融合可以包括对特征向量进行相加、拼接至少之一的操作,其中,相加可以是加权平均相加。
例如，对于4路视频流图像，分别提取当前帧1-当前帧4的特征，得到特征图1-特征图4。如果拍摄场景中的目标共有3个，那么这3个目标在特征图1-特征图4中的特征可以如下表所示：
目标 | 特征图1 | 特征图2 | 特征图3 | 特征图4
目标1 | 特征1 | 特征2 | 特征3 | 特征4
目标2 | 特征5 | 特征6 | 特征7 | 特征8
目标3 | 特征9 | 特征10 | 特征11 | 特征12
针对目标1，可以将特征图1-特征图4中目标1的特征进行融合，即融合特征1-特征4，得到目标1的第一融合特征。将特征图1-特征图4中目标2和目标3的特征进行融合，即融合特征5-特征12，得到目标1的第二融合特征。将目标1的第一融合特征与第二融合特征进行融合，得到目标1的融合特征。
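上述目标1的特征融合过程可用如下示意性PyTorch片段表达；此处融合方式取逐元素平均与相加（即加权平均的等权特例），特征维度与数值均为假设，拼接等其他融合方式同理：
```python
import torch

def fuse_target_feature(own_feats: list, other_feats: list) -> torch.Tensor:
    """融合某一目标的各路特征（第一融合特征）与其他目标的特征（第二融合特征），示意性实现。

    own_feats:   该目标在各路特征图中的特征列表，例如目标1的特征1-特征4
    other_feats: 其他目标在各路特征图中的特征列表，例如特征5-特征12
    """
    first_fused = torch.stack(own_feats).mean(dim=0)     # 第一融合特征
    second_fused = torch.stack(other_feats).mean(dim=0)  # 第二融合特征
    return first_fused + second_fused                    # 该目标的融合特征

# 用法示意：4路特征图、3个目标、特征维度128（均为假设值）
feats = [torch.randn(128) for _ in range(12)]            # 特征1-特征12
target1_fused = fuse_target_feature(feats[0:4], feats[4:12])
```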
步骤205,基于投影特征点信息,对所述融合特征进行搜索,得到当前帧组同一视角下的目标检测结果。
作为一种示例,将所有目标的融合特征、投影特征点信息通过机器学习模型进行解析,得到BEV视角的目标检测结果,即,当前帧BEV视觉地图中的目标检测结果,
当融合特征为3D特征时,目标检测结果包括三维位置信息、三维尺寸信息、置信度,当融合特征为2D特征时,目标检测结果包括二维位置信息、二维尺寸信息、置信度,还可以包括目标标识和/或目标类别。例如,目标类别可以为行人、自行车或机动车等。
步骤206,根据置信度阈值,对目标检测结果进行过滤,保留有效的目标检测结果,
在该步骤中,作为一种示例,将置信度小于置信度阈值的目标检测结果予以剔除,得到有效目标检测结果,
例如,目标检测结果1-目标检测结果5对应的置信度分别为65%、70%、95%、97%以及96%。如果置信度阈值为90%,那么可以将目标检测结果1和目标检测结果2剔除。
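按置信度阈值过滤目标检测结果的处理可示意如下（Python），其中的数据结构与数值对应上文示例，仅作说明：
```python
def filter_by_confidence(detections: list, conf_threshold: float = 0.9) -> list:
    """按置信度阈值过滤目标检测结果，保留有效目标检测结果（示意性实现）。"""
    return [d for d in detections if d['confidence'] >= conf_threshold]

# 用法示意：对应上文目标检测结果1-目标检测结果5
detections = [{'id': 1, 'confidence': 0.65}, {'id': 2, 'confidence': 0.70},
              {'id': 3, 'confidence': 0.95}, {'id': 4, 'confidence': 0.97},
              {'id': 5, 'confidence': 0.96}]
valid = filter_by_confidence(detections, 0.9)   # 剔除目标检测结果1和目标检测结果2
```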
步骤207,将所保留的有效目标检测结果加入至下一帧组的用于目标搜索的初始化目标特征集合中,
其中,下一帧组的初始化目标特征集合包括有用于进行目标搜索的初始化目标检测向量。
例如,当前帧组有m个有效目标检测结果,则下一帧组的目标检测向量的数目为:初始化的n个目标检测向量+历史的m个有效目标检测结果,其中,当前帧组相对于下一帧组而言便为历史帧组,m个有效目标检测结果即为历史有效目标检测结果。
步骤208,判断当前帧组的有效目标检测结果是否与上一帧组的有效目标检测结果存在关联,
如果有效目标检测结果来自于初始化目标，说明该有效目标检测结果对应的目标是新检测到的，则赋予该有效目标检测结果新的目标位置序列标识(ID)；其中，初始化目标为第一个图像帧组对应的初始化目标检测向量所对应的目标，也就是最初的初始化目标，如果有效目标检测结果来自于初始化目标，说明在当前帧组之前的图像帧组中未检测到该目标，那么该有效目标检测结果对应的目标也就是新检测到的。
如果有效目标检测结果来自于上一帧组所加入的有效目标检测结果,说明该有效目标检测结果在上一帧组(相对当前帧组而言为历史帧组)、当前帧组都被检测到,则保持该有效目标检测结果的目标位置序列标识不变,沿用该有效目标检测结果的上一帧组的目标位置序列标识。
上述步骤207、208无严格的先后次序,可以并行执行,所应理解的是,由于目标检测是在同一视角下来进行的,故而,步骤207、208中所述当前帧组、下一帧组、上一帧组应理解为具有同时性的每路图像帧的集合,而不是某一路单帧的图像帧。也就是说,当前帧组、下一帧组、上一帧组为分别包含具有同时性的每路图像帧的图像帧组集合。
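步骤207、208的处理逻辑可用如下示意性Python片段概括，其中的数据结构与字段名均为假设，仅用于说明“来自初始化目标则赋予新的目标位置序列ID、来自上一帧组有效结果则沿用其ID，并将有效结果加入下一帧组的初始化目标特征集合”这一流程：
```python
import itertools

_new_id = itertools.count(1)   # 目标位置序列ID生成器（示意）

def update_tracks(valid_detections: list, init_queries: list) -> list:
    """为当前帧组的有效目标检测结果赋予/沿用目标位置序列ID，并构造下一帧组的目标特征集合。

    valid_detections: 当前帧组的有效目标检测结果，每项含 'source'（其来源的目标特征，示意性字段）
    init_queries:     初始化目标检测向量集合
    """
    next_query_set = list(init_queries)            # 下一帧组仍包含初始化目标检测向量
    for det in valid_detections:
        if det['source'].get('track_id') is None:  # 来自初始化目标：新检测到的目标
            det['track_id'] = next(_new_id)
        else:                                      # 来自上一帧组加入的有效检测结果：沿用ID
            det['track_id'] = det['source']['track_id']
        next_query_set.append(det)                 # 有效结果加入下一帧组的目标特征集合
    return next_query_set
```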
步骤209,判断视频流中的图像是否处理完毕;如果是,执行步骤210;如果否,提取每路下一帧图像,返回步骤203,直至视频流中的图像处理完毕。
电子设备可以判断能否提取到下一帧图像,以确定视频流中的图像是否处理完毕。如果能提取到下一帧图像,说明视频流中的图像未处理完毕,那么需要针对每路视频流提取下一帧图像,并返回执行对于每路视频流图像,分别提取当前帧中特征,得到各路当前特征图和/或各路当前特征信息的步骤,直到视频流中的图像处理完毕。
如果未提取到下一帧图像,说明视频流中的图像处理完毕。那么电子设备可以执行步骤210。
步骤210,输出同一目标位置序列ID的有效目标检测结果,得到BEV视觉地图中该有效目标检测结果的位置序列,从而得到目标位置序列。
例如,目标位置序列ID为序列1的有效目标检测结果为结果1~结果10,对应的采集时间分别为11:01~11:10。将结果1~结果10所包含的全局位置信息按照采集时间排列,得到序列1所对应目标的目标位置序列。
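步骤210中按目标位置序列ID聚合并按采集时间排序得到目标位置序列的过程，可示意如下（Python），字段名为便于说明而作的假设：
```python
from collections import defaultdict

def build_position_sequences(all_valid_detections: list) -> dict:
    """按目标位置序列ID聚合各帧组的有效检测结果，并按采集时间排序，得到目标位置序列（示意性实现）。"""
    sequences = defaultdict(list)
    for det in all_valid_detections:
        sequences[det['track_id']].append(det)
    return {
        tid: [d['global_position'] for d in sorted(dets, key=lambda d: d['timestamp'])]
        for tid, dets in sequences.items()
    }
```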
在上述步骤2042、步骤205通过机器学习模型获取目标检测结果的过程中,可以理解为是在BEV视角的视觉地图中查询或搜索设定目标的过程,即目标查询过程,以进行目标检测。
为便于理解上述过程中的步骤2042~207的处理过程，参见图5所示，图5为目标检测过程的一种示意图。图中，不同灰度的方框表示不同的目标，cli表示视觉地图中的参考位置信息，clmi表示参考位置信息反投影至各路当前图像帧中的位置信息，在虚框中，表示目标1融合了其他目标信息以及各路当前图像帧中该目标本身的信息。
具体来说,采用相机1~相机3进行拍摄,得到第1路图像数据、第2路图像数据以及第3路图像数据。分别从3路图像数据中获取当前帧图像对应的特征图,得到第一特征图521、第二特征图522以及第三特征图523。
初始化目标检测向量包括用于检测行人的第一向量501、用于检测机动车的第二向量502以及用于检测非机动车的第三向量503。在同一视角视觉地图中,对第一向量501、第二向量502以及第三向量503进行目标搜索,例如检测得到3个目标,分别为目标1~目标3。基于视觉地图,可以确定目标1~目标3对应的参考位置信息,分别为第一位置信息504、第二位置信息505以及第三位置信息506。
利用相机1-相机3的投影矩阵，将第一位置信息504、第二位置信息505以及第三位置信息506，分别反投影至第一特征图521、第二特征图522以及第三特征图523中。这样，便可以确定出目标1~目标3在各特征图中的特征位置。进而在各特征图中，根据特征位置，获取目标1~目标3对应的特征。
对目标1~目标3分别进行特征融合,得到目标1~目标3的融合特征。下面以目标1为例,对特征融合的过程进行说明:在虚框中,表示目标1融合了其他目标信息以及各路当前图像帧中该目标本身的信息。具体来说,将目标1在第一特征图521~第三特征图523中的特征进行融合,得到目标1对应的第一融合特征。将目标2和目标3在第一特征图521~第三特征图523中的特征进行融合,得到目标1对应的第二融合特征。接下来,将目标1对应的第一融合特征和第二融合特征进行融合,便可以得到目标1的融合特征。
这样,可以得到目标1~目标3分别对应的融合特征,即第一特征507、第二特征508以及第三特征509。在同一视角视觉地图中,对上述融合特征进行搜索,得到目标1~目标3分别对应的目标检测结果,即第一结果510、第二结果511以及第三结果512。其中,目标检测结果包括置信度。根据目标1~目标3分别对应的目标检测结果所包括的置信度是否大于置信度阈值,对目标检测结果进行过滤。
如果第一结果510所包括的置信度小于置信度阈值,第二结果511和第三结果512所包括的置信度均大于置信度阈值,那么可以将第二结果511和第三结果512确定为有效目标检测结果。并且,将第二结果511和第三结果512加入下一帧组的初始化目标检测向量集合中。接下来,分别提取第1路图像数据~第3路图像数据的下一帧图像,并返回执行上述分别从3路图像数据中获取当前帧图像对应的特征图的步骤,直到视频流中的图像处理完毕。
在第1路图像数据~第3路图像数据中的图像均处理完成后，如果目标1~目标M对应的检测结果为有效目标检测结果，那么可以得到BEV视觉地图中目标1~目标M的位置序列数据，即Pred1~PredM。
参见图6所示,图6为本申请实施例多相机目标检测装置的一种示意图。该装置包括,
第一获取模块,用于获取至少两路视频流图像数据,其中,每路视频流图像数据至少包括重叠区域的图像数据,
第二获取模块,用于获取所述视频流图像数据中重叠区域所在位置的视觉地图信息,
目标检测及位置分析模块,用于将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至视觉地图中,以将每路图像帧特征点信息融合至同一视角下,得到同一视角下的融合信息,其中,每路图像帧中的每一图像帧具有同时性,
基于所述融合信息,进行目标检测和目标位置分析。
其中,
目标检测及位置分析模块被配置为:
对于每路图像帧中的每一图像帧:
分别对该图像帧进行特征提取,得到该图像帧的特征点信息和/或特征图,
利用该图像帧所来源相机的投影矩阵,将该图像帧的特征点投影至视觉地图中,得到该图像帧的同一视角下的投影特征点信息,
将所述每路图像帧中所有图像帧的所述投影特征点信息,确定为所述融合信息,
其中,
投影矩阵用于表征相机图像中的像素点与视觉地图中的空间点之间的映射关系；
基于所述每路图像帧中所有图像帧的所述投影特征点信息,进行目标检测,得到所述每路图像帧中所有图像帧对应的同一视角下的目标检测结果,将该目标检测结果确定为所述每路图像帧的图像帧组的目标检测结果,
从各个图像帧组的目标检测结果中,获取目标位置序列数据,
其中,各个图像帧组的目标检测结果为:各个不同时间下的所述每路图像帧所有图像帧所对应的同一视角下的目标检测结果。
参见图7所示,图7为本申请实施例多相机目标检测装置的另一种示意图。该装置包括,存储器和处理器,所述存储器存储有计算机程序,所述处理器被配置执行所述计算机程序实现本申请实施例所述多相机目标检测方法的步骤。
存储器可以包括随机存取存储器(Random Access Memory，RAM)，也可以包括非易失性存储器(Non-Volatile Memory，NVM)，例如至少一个磁盘存储器。可选的，存储器还可以是至少一个位于远离前述处理器的存储装置。
上述的处理器可以是通用处理器,包括中央处理器(Central Processing Unit,CPU)、网络处理器(Network Processor,NP)等;还可以是数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。
上述多相机目标检测装置可以为电子设备,并且该电子设备还可以包括通信总线和/或通信接口,处理器、通信接口、存储器通过通信总线完成相互间的通信。
上述电子设备提到的通信总线可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
通信接口用于上述电子设备与其他设备之间的通信。
本申请实施例还提供了一种计算机可读存储介质,所述存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现本申请实施例所述多相机目标检测方法的步骤。
本申请实施例还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行本申请实施例所述多相机目标检测方法的步骤。
对于装置/网络侧设备/存储介质实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请保护的范围之内。

Claims (12)

  1. 一种多相机目标检测方法,其特征在于,该方法包括:
    获取至少两路来自不同相机的视频流图像数据,其中,每路视频流图像数据至少包括重叠区域的图像数据,
    获取所述视频流图像数据对应空间所在位置的视觉地图信息,
    将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至视觉地图中,以将每路图像帧特征点信息融合至同一视角下,得到同一视角下的融合信息,其中,每路图像帧中同一采集时间对应的每一图像帧具有同时性,
    基于所述融合信息,进行目标检测和目标位置分析。
  2. 如权利要求1所述的多相机目标检测方法,其特征在于,所述将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至视觉地图中,包括:
    对于每路图像帧中的每一图像帧:
    分别对该图像帧进行特征提取,得到该图像帧的特征点信息和/或特征图,
    利用该图像帧所来源相机的投影矩阵,将该图像帧的特征点投影至视觉地图中,得到该图像帧的同一视角下的投影特征点信息,
    将所述每路图像帧中所有图像帧的所述投影特征点信息,确定为所述融合信息,
    其中，投影矩阵用于表征相机图像中的像素点与视觉地图中的空间点之间的映射关系；
    所述基于所述融合信息,进行目标检测和目标位置分析,包括:
    基于所述每路图像帧中所有图像帧的所述投影特征点信息,进行目标检测,得到所述每路图像帧中所有图像帧对应的同一视角下的目标检测结果,将该目标检测结果确定为所述每路图像帧的图像帧组的目标检测结果,
    从各个图像帧组的目标检测结果中,获取目标位置序列数据,
    其中,各个图像帧组的目标检测结果为:各个不同时间下的所述每路图像帧所有图像帧所对应的同一视角下的目标检测结果。
  3. 如权利要求2所述的多相机目标检测方法,其特征在于,所述从各个图像帧组的目标检测结果中,获取目标位置序列数据,包括:
    对当前图像帧组的目标检测结果与历史图像帧组的目标检测结果的交集中的各个目标检测结果,进行标记,
    从各图像帧组的所标记的目标检测结果中,获取世界坐标系下的目标位置序列数据。
  4. 如权利要求2或3所述的多相机目标检测方法,其特征在于,所述基于所述每路图像帧中所有图像帧的所述投影特征点信息,进行目标检测,包括:
    基于所述投影特征点信息，对预先初始化的目标特征进行搜索，得到目标参考位置信息，
    利用每路图像帧所来源相机的投影矩阵,将目标参考位置信息分别反投影至每路图像帧中,以确定目标参考位置在图像帧中的位置信息,
    根据图像帧的位置信息,获取对应目标的特征,得到目标的每路特征,
    融合目标的每路特征,得到目标的融合特征,
    基于所述投影特征点信息,对所述融合特征进行搜索,得到所述目标检测结果;
    其中,目标检测结果包括:同一视角下的全局位置信息、目标尺寸以及置信度。
  5. 如权利要求4所述的多相机目标检测方法,其特征在于,所述基于所述投影特征点信息,对预先初始化的目标特征进行搜索,得到目标参考位置信息,包括:
    将所述投影特征点信息、和预先初始化的目标检测向量输入至机器学习模型,得到各目标的参考位置信息,其中,目标检测向量包括两个以上目标的目标特征向量,
    所述利用每路图像帧所来源相机的投影矩阵,将目标参考位置信息分别反投影至每路图像帧中,包括:
    利用每路视频流图像所来源相机的投影矩阵,将各目标的参考位置信息分别反投影至每路图像帧对应的特征图中,以确定各目标的参考位置在特征图中的位置信息,
    所述根据图像帧的位置信息,获取对应目标的特征,得到目标的每路特征,包括:
    根据各目标特征图中的位置信息,获取各目标对应的特征。
  6. 如权利要求4所述的多相机目标检测方法,其特征在于,所述融合目标的每路特征,得到目标的融合特征,包括:
    对每个目标,分别进行该目标的特征融合,得到该每个目标的融合特征,
    所述基于所述投影特征点信息,对所述融合特征进行搜索,包括:
    将每个目标的融合特征、和所述投影特征点信息输入至机器学习模型,得到所述目标检测结果。
  7. 如权利要求6所述的多相机目标检测方法,其特征在于,所述对每个目标,分别进行该目标的特征融合,包括:
    对于每个目标:
    基于每路特征图,融合各特征图中该目标的特征,得到第一融合特征,
    基于每路特征图,融合各特征图中除同一目标之外的其他目标的特征,得到第二融合特征,
    将第一融合特征和第二融合特征进行融合,得到该目标的融合特征;
    该方法进一步包括:
    按照设定的置信度阈值,对当前图像帧组的目标检测结果进行过滤,得到有效目标检测结果,
    将所述有效目标检测结果增加至下一图像帧组的初始化目标特征集合中。
  8. 如权利要求7所述的多相机目标检测方法,其特征在于,所述对当前图像帧组的目标检测结果与历史图像帧组的目标检测结果的交集中的各个目标检测结果,进行标记,包括:
    如果当前图像帧组的有效目标检测结果来自于初始化目标特征,则赋予该有效目标检测结果新的目标位置序列标识;
    如果当前图像帧组的有效目标检测结果来自于上一图像帧组所加入的有效目标检测结果，则沿用上一图像帧组所加入的有效目标检测结果的目标位置序列标识；
    所述从各图像帧组的所标记的目标检测结果中,获取世界坐标系下的目标位置序列数据,包括:
    将从各图像帧组所标记的目标检测结果中具有相同目标位置序列标识的目标检测结果,确定为该目标检测结果的目标位置序列数据;
    所述视觉地图为鸟瞰视角地图,所述同一视角为鸟瞰视角。
  9. 一种多相机目标检测装置,其特征在于,该装置包括:
    第一获取模块,用于获取至少两路以上来自不同相机的视频流图像数据,其中,每路视频流图像数据至少包括重叠区域的图像数据,
    第二获取模块,用于获取所述视频流图像数据对应空间所在位置的视觉地图信息,
    目标检测及位置分析模块,将从每路视频流图像数据的每路图像帧中所提取的特征点信息转换至视觉地图中,以将每路图像帧特征点信息融合至同一视角下,得到同一视角下的融合信息,其中,每路图像帧中同一采集时间对应的每一图像帧具有同时性,
    基于所述融合信息,进行目标检测和目标位置分析。
  10. 一种计算机可读存储介质,其特征在于,所述存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至8任一所述多相机目标检测方法的步骤。
  11. 一种电子设备,其特征在于,包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过通信总线完成相互间的通信;
    存储器,用于存放计算机程序;
    处理器,用于执行存储器上所存放的程序时,实现权利要求1至8任一所述多相机目标检测方法步骤。
  12. 一种包含指令的计算机程序产品,其特征在于,所述计算机程序产品被计算机执行时实现权利要求1至8任一所述多相机目标检测方法步骤。
PCT/CN2023/118350 2022-09-13 2023-09-12 一种多相机目标检测方法、装置 WO2024055966A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211108975.7 2022-09-13
CN202211108975.7A CN115457084A (zh) 2022-09-13 2022-09-13 一种多相机目标检测跟踪方法、装置

Publications (1)

Publication Number Publication Date
WO2024055966A1 true WO2024055966A1 (zh) 2024-03-21

Family

ID=84303349

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118350 WO2024055966A1 (zh) 2022-09-13 2023-09-12 一种多相机目标检测方法、装置

Country Status (2)

Country Link
CN (1) CN115457084A (zh)
WO (1) WO2024055966A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457084A (zh) * 2022-09-13 2022-12-09 上海高德威智能交通系统有限公司 一种多相机目标检测跟踪方法、装置


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220020158A1 (en) * 2020-07-15 2022-01-20 Jingdong Digits Technology Holding Co., Ltd. System and method for 3d object detection and tracking with monocular surveillance cameras
CN113411543A (zh) * 2021-03-19 2021-09-17 贵州北斗空间信息技术有限公司 一种多路监控视频融合显示方法及系统
CN113673425A (zh) * 2021-08-19 2021-11-19 清华大学 一种基于Transformer的多视角目标检测方法及系统
CN114913506A (zh) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 一种基于多视角融合的3d目标检测方法及装置
CN115457084A (zh) * 2022-09-13 2022-12-09 上海高德威智能交通系统有限公司 一种多相机目标检测跟踪方法、装置

Also Published As

Publication number Publication date
CN115457084A (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
Yang et al. Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios
CN112667837A (zh) 图像数据自动标注方法及装置
WO2024055966A1 (zh) 一种多相机目标检测方法、装置
CN113096003B (zh) 针对多视频帧的标注方法、装置、设备和存储介质
CN113447923A (zh) 目标检测方法、装置、系统、电子设备及存储介质
CN111383204A (zh) 视频图像融合方法、融合装置、全景监控系统及存储介质
CN115164918B (zh) 语义点云地图构建方法、装置及电子设备
US20220044558A1 (en) Method and device for generating a digital representation of traffic on a road
CN115376109B (zh) 障碍物检测方法、障碍物检测装置以及存储介质
US20130135446A1 (en) Street view creating system and method thereof
CN116883610A (zh) 基于车辆识别与轨迹映射的数字孪生路口构建方法和系统
WO2020049089A1 (en) Methods and systems for determining the position of a vehicle
CN114898314A (zh) 驾驶场景的目标检测方法、装置、设备及存储介质
CN115797408A (zh) 融合多视角图像和三维点云的目标跟踪方法及装置
CN114969221A (zh) 一种更新地图的方法及相关设备
CN115004273A (zh) 交通道路的数字化重建方法、装置和系统
CN115344655A (zh) 地物要素的变化发现方法、装置及存储介质
KR20130137076A (ko) 실시간 관심 지역을 나타내는 3차원 지도를 제공하는 장치 및 방법
Rizzoli et al. Syndrone-multi-modal uav dataset for urban scenarios
CN114066974A (zh) 一种目标轨迹的生成方法、装置、电子设备及介质
Zhanabatyrova et al. Automatic map update using dashcam videos
CN116843754A (zh) 一种基于多特征融合的视觉定位方法及系统
CN115249345A (zh) 一种基于倾斜摄影三维实景地图的交通拥堵检测方法
Luo et al. Complete trajectory extraction for moving targets in traffic scenes that considers multi-level semantic features
CN110930507A (zh) 一种基于三维地理信息的大场景跨境头目标追踪方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864698

Country of ref document: EP

Kind code of ref document: A1