WO2024055966A1 - Multi-camera target detection method and apparatus - Google Patents

Multi-camera target detection method and apparatus

Info

Publication number
WO2024055966A1
WO2024055966A1 (PCT/CN2023/118350; CN2023118350W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
target detection
image frame
feature
information
Prior art date
Application number
PCT/CN2023/118350
Other languages
English (en)
Chinese (zh)
Inventor
吴昊
Original Assignee
上海高德威智能交通系统有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海高德威智能交通系统有限公司 filed Critical 上海高德威智能交通系统有限公司
Publication of WO2024055966A1 publication Critical patent/WO2024055966A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Definitions

  • the present application relates to the field of image-based target detection, and in particular, to a multi-camera target detection method.
  • In conventional multi-camera target detection and target position analysis, target detection and target position analysis are performed on the image information from each camera to generate a per-camera target position sequence; the target position sequences of the individual cameras are then projected into the same perspective view, for example a Bird Eye View (BEV), and the projected target position sequences are finally merged into a global target position sequence.
  • The above multi-camera target detection and target position analysis method performs target detection and target position analysis step by step, which makes the system more complex and occupies communication resources. Moreover, if there is a problem with the target position sequence of a single camera, it is difficult to screen and choose among the cameras' target position sequences when they are fused under the same perspective, which makes it hard to obtain a correct global target position sequence.
  • This application provides a multi-camera target detection method to obtain an accurate global target position sequence.
  • In a first aspect, this application provides a multi-camera target detection method. The method includes: acquiring at least two channels of video stream image data from different cameras, where each channel of video stream image data at least includes image data of an overlapping area; acquiring visual map information of the spatial location corresponding to the video stream image data; converting the feature point information extracted from each image frame of each channel of video stream image data into the visual map, so as to fuse the feature point information of the image frames of each channel into the same perspective and obtain fusion information under that perspective; and performing target detection and target position analysis based on the fusion information.
  • In one implementation, converting the feature point information extracted from each image frame of each channel of video stream image data into the visual map includes: projecting the extracted feature points into the visual map using the projection matrix of the camera from which each image frame comes, where the projection matrix characterizes the mapping relationship between pixel points in the camera image and spatial points in the visual map.
  • Performing target detection and target position analysis based on the fusion information includes: performing target detection based on the projected feature point information of all image frames in each channel to obtain target detection results under the same viewing angle corresponding to all of those image frames, and determining these results as the target detection results of the image frame group formed by the image frames of each channel; and obtaining target position sequence data from the target detection results of each image frame group, where the target detection results of each image frame group are the target detection results under the same viewing angle corresponding to all image frames of each channel at different times.
  • Obtaining target position sequence data from the target detection results of each image frame group includes: obtaining the target position sequence data in the world coordinate system.
  • Performing target detection based on the projected feature point information of all image frames in each channel includes: searching the pre-initialized target features based on the projected feature point information to obtain target reference position information; obtaining, for each channel, the features of the corresponding target based on the reference position information, thereby obtaining the per-channel features of each target; fusing the per-channel features of each target to obtain the fusion features of the target; and searching the fusion features based on the projected feature point information to obtain the target detection results.
  • the target detection results include: global position information, target size and confidence under the same perspective.
  • Searching the pre-initialized target features based on the projected feature point information to obtain the target reference position information includes: inputting the pre-initialized target features and the projected feature point information into a machine learning model to obtain the reference position information of each target in the visual map.
  • Back-projecting the target reference position information into each channel of image frames using the projection matrix of the camera from which each image frame comes includes: back-projecting the reference position information of each target into the feature map corresponding to each image frame, so as to determine the position information of each target's reference position in that feature map.
  • Obtaining the features of the corresponding target for each channel, thereby obtaining the per-channel features of each target, includes: obtaining, at the determined positions in each feature map, the features corresponding to each target.
  • Fusing the per-channel features of each target to obtain the fusion features of the target includes: performing feature fusion separately for each target to obtain the fusion features of that target.
  • Searching the fusion features based on the projected feature point information includes: inputting the fused features of each target and the projected feature point information into the machine learning model to obtain the target detection results.
  • Performing feature fusion separately for each target includes: fusing the features of the target in each feature map to obtain a first fusion feature; fusing the features of the other targets in each feature map to obtain a second fusion feature; and fusing the first fusion feature and the second fusion feature to obtain the fusion feature of the target.
  • the method further includes:
  • the target detection results of the current image frame group are filtered to obtain effective target detection results.
  • the effective target detection result is added to the initialized target feature set of the next image frame group.
  • Marking each target detection result in the intersection of the target detection results of the current image frame group and the target detection results of the historical image frame group includes: reusing the target position sequence identifier of the valid target detection result added for the previous image frame group.
  • Obtaining target position sequence data in the world coordinate system from the marked target detection results of each image frame group includes: determining the target detection results that share the same target position sequence identifier among the marked target detection results of each image frame group as the target position sequence data of that target.
  • the visual map is a bird's-eye view map, and the same perspective is a bird's-eye view.
  • embodiments of the present application also provide a multi-camera target detection device, which includes:
  • The first acquisition module is used to acquire at least two channels of video stream image data from different cameras, where each channel of video stream image data at least includes image data of the overlapping area; the second acquisition module is used to acquire the visual map information of the spatial location corresponding to the video stream image data; and the target detection and position analysis module is used to convert the feature point information extracted from each image frame of each channel of video stream image data into the visual map, so as to fuse the feature point information of the image frames of each channel into the same perspective and obtain fusion information under that perspective, where the image frames of each channel corresponding to the same collection time are simultaneous, and to perform target detection and target position analysis based on the fusion information.
  • Embodiments of the present application further provide a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the method steps of the multi-camera target detection described in any one of the first aspects are implemented.
  • embodiments of the present application further provide an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus;
  • The memory is used to store a computer program;
  • the processor is configured to implement the method steps of multi-camera target detection described in any one of the above first aspects when executing a program stored in the memory.
  • embodiments of the present application further provide a computer program product containing instructions, which when the computer program product is run on a computer, causes the computer to perform any of the method steps of multi-camera target detection described in the first aspect.
  • The multi-camera target detection method projects video stream image features from different cameras into the same perspective and performs target detection and target position analysis based on the fused information under that perspective. In this way, information is fused at the source, which helps improve the accuracy of the information used for target detection and target position analysis and the intelligence of the multi-camera fusion. There is no need to first generate a target position sequence for each single camera and then fuse those sequences, which solves the difficulty of screening and choosing among the target position sequences of multiple single cameras during fusion; it not only avoids the computing-power consumption caused by fusing target position sequences, but also improves the accuracy and reliability of target detection and target position analysis.
  • Figure 1 is a schematic flow chart of a multi-camera target detection method according to an embodiment of the present application.
  • Figure 2 is a schematic flowchart of a multi-camera target detection method in a specific scenario according to an embodiment of the present application.
  • Figure 3 is a schematic diagram of four cameras collecting one frame of video stream images from four directions of an intersection.
  • Figure 4 is a schematic diagram of a high-precision map from a bird's-eye view at a traffic intersection.
  • Figure 5 is a schematic diagram of the target detection process.
  • Figure 6 is a schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • FIG. 7 is another schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the embodiment of the present application projects the feature points corresponding to the image frames of the same time in the image data of each video stream to the same perspective visual map, so as to integrate the feature point information of the image frames in the video stream images of each channel into the same perspective.
  • Target detection and target location analysis are performed based on fused information from the same perspective.
  • Figure 1 is a schematic flow chart of a multi-camera target detection method according to an embodiment of the present application. The method includes:
  • Step 101 Obtain at least two channels of video stream image data.
  • the image data of each video stream at least includes image data of the same scene collected from different shooting angles. That is to say, the image data of each video stream is image data of the same scene collected by different cameras from different shooting angles. Usually, the image data of each video stream is obtained by collecting images of the same scene from cameras installed at different locations, so that image data from different shooting angles can be obtained.
  • Here, the same scene means that the image data of the video streams at least intersect, that is, the shooting scenes corresponding to the image data of the video streams have an overlapping area. From the perspective of spatial location, the same scene refers to the collection of targets located within the same spatial location range, and the spatial location range can be set as needed. In other words, there is an overlapping area between the shooting scenes corresponding to the video stream images.
  • The video stream image data may be acquired in real time from the cameras, or obtained as non-real-time video stream image data from a storage terminal; this application does not limit the acquisition method.
  • Step 102 Obtain the visual map information of the location of the overlapping area in the video stream image data
  • the location of the overlapping area can be obtained through the position information of the camera.
  • the corresponding bird's-eye view visual map information is obtained from the map library.
  • The visual map can be a bird's-eye view visual map; the bird's-eye view can be understood as a top-down view of the scene, and in this case the visual map information is the bird's-eye view visual map information.
  • the map information includes global position information in the world coordinate system.
  • the overlapping area is the overlapping area between the shooting ranges of multiple cameras.
  • Specifically, the overlapping area of the cameras' shooting ranges is determined based on the geographical location information, installation angle, and shooting range of each camera, and the bird's-eye view visual map information corresponding to the overlapping area is obtained from the map library.
  • the bird's-eye view visual map information corresponding to the preset range is obtained from the map library, and the preset range at least includes the above-mentioned overlapping area.
  • Based on the geographical location information of each camera installation, the position of each camera in the complete bird's-eye view visual map is determined; based on the installation angle and shooting range of each camera, the location of the overlapping area of the cameras' shooting ranges in the complete bird's-eye view visual map is determined.
  • Step 103 Convert the feature point information extracted from each image frame of each channel of video stream image data into the same perspective visual map, so as to fuse the feature point information of each channel image frame into the same perspective to obtain the same perspective fused information,
  • Here, the image frames of the channels are simultaneous, that is, they correspond to the same time. Ideally each image frame carries the same timestamp information, but in practice the frames do not need to be captured at strictly the same instant: as long as the time difference between the image frames of the channels is within a set time threshold, they are treated as corresponding to the same time, that is, as simultaneous. If the image frames of the channels are not simultaneous, synchronization processing can be performed. The image frames of the channels mentioned above form a group of image frames with the same acquisition time, and the time difference above is the difference between the acquisition times of the frames in this group.
  • The projected feature point information includes global position information, that is, the coordinates of the projected feature points in the world coordinate system.
  • The projected feature point information of all image frames corresponding to the same acquisition time across the channels is determined as the fusion information. The fusion information represents the feature information of the scene under the same viewing angle at the same time, that is, the feature information corresponding to all simultaneous image frames under the same perspective.
  • Each group of image frames is a set of image frames whose time differences are within the set time threshold, that is, the image frames of the channels corresponding to the same acquisition time are simultaneous image frames; this set is called an image frame group in this application, and the fusion information can be understood as the fusion information of the image frame group. For example, suppose there are 3 channels of image frames: the frames collected at 8:00 in channels 1 to 3 are image frame 1, image frame 2, and image frame 3 respectively, and the frames collected at 8:01 are image frame 4, image frame 5, and image frame 6 respectively. Then image frame 1 to image frame 3 can be regarded as one image frame group, and image frame 4 to image frame 6 can be regarded as another image frame group.
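  • For illustration only (this sketch is not part of the original disclosure), the grouping of near-simultaneous frames described above can be written as follows; the Frame structure and the 50 ms skew threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    channel: int      # which camera/channel the frame comes from
    timestamp: float  # acquisition time in seconds
    data: object      # image payload (placeholder)

def group_frames(frames_per_channel: List[List[Frame]], max_skew: float = 0.05) -> List[List[Frame]]:
    """Group one frame per channel into an image frame group when their
    acquisition times differ by no more than max_skew seconds."""
    groups = []
    # assumes each channel's list is sorted by timestamp and has equal length
    for candidates in zip(*frames_per_channel):
        times = [f.timestamp for f in candidates]
        if max(times) - min(times) <= max_skew:
            groups.append(list(candidates))  # close enough in time: one frame group
        # otherwise the frames would first need synchronization
    return groups
```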
  • The projection matrix is used to characterize the mapping relationship between the pixel points in the camera image and the spatial points in the visual map. Different cameras correspond to different projection matrices; in other words, each projection matrix characterizes the mapping relationship between the image coordinate system of that camera's images and the map coordinate system of the visual map.
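  • Purely as a sketch of the projection just described (the 3x3 ground-plane matrix H and the example coordinates are invented, not taken from the application), pixel points can be mapped into the visual map with homogeneous coordinates:

```python
import numpy as np

def project_to_map(pixels: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Map Nx2 pixel coordinates into visual-map (e.g. BEV ground-plane)
    coordinates with a 3x3 projection matrix H, via homogeneous coordinates."""
    homog = np.hstack([pixels, np.ones((pixels.shape[0], 1))])  # (N, 3)
    mapped = (H @ homog.T).T                                    # (N, 3)
    return mapped[:, :2] / mapped[:, 2:3]                       # divide out the scale

# toy matrix: identity mapping shifted by 10 map units in x and y
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 10.0],
              [0.0, 0.0, 1.0]])
print(project_to_map(np.array([[100.0, 50.0]]), H))  # -> [[110. 60.]]
```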
  • Step 104 Perform target detection and target location analysis based on the fusion information.
  • target detection is performed based on the projected feature point information of all image frames in each channel of image frames.
  • First, the pre-initialized target features are searched to obtain the target reference position information. That is, based on the projected feature point information, the pre-initialized target features are analyzed in the same-perspective visual map to obtain the reference position information of each target in the visual map.
  • The pre-initialized target features can be predetermined features corresponding to the various categories of targets to be detected; by analyzing them, the targets of these categories appearing in each image frame can be identified.
  • the target reference position information may be the coordinates of the target in the visual map. It can be seen from the above description that parsing the pre-initialized target features means searching for the pre-initialized target features based on the projected feature point information.
  • The corresponding target detection results under the same viewing angle are determined as the target detection results of the image frame group formed by the image frames of each channel, where the image frame group is composed of simultaneous image frames from each channel. The target detection results include global position information, target size, and confidence under the same perspective, and may also include a target identifier and/or target category.
  • the features of the target in each image frame included in the image frame group are fused to obtain the fusion features corresponding to the image frame group.
  • the fusion features corresponding to the image frame group are analyzed to obtain the target detection results at the acquisition time corresponding to the image frame group.
  • the fusion feature can be a 128-dimensional vector.
  • the fusion feature is input into the multi-layer perceptron, and the multi-layer perceptron can analyze whether there is a target at the corresponding position of the fusion feature in the visual map. If there is a target, the multi-layer perceptron can also parse out specific information about the target such as target category and target size.
  • the above detection method of searching for fused features can realize correlation detection from feature information of multiple image frames, which is beneficial to improving the accuracy of target detection.
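  • As an illustrative sketch only: the application states that a multi-layer perceptron parses each fusion feature, but the layer sizes (128-dimensional fused feature, 64 hidden units) and the output layout (confidence, three class scores, a coarse 2D size) below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head(fused_feature: np.ndarray, weights) -> dict:
    """Toy multi-layer perceptron head: maps one fused target feature to an
    objectness confidence, class scores and a coarse target size."""
    W1, b1, W2, b2 = weights
    hidden = np.maximum(0.0, fused_feature @ W1 + b1)  # ReLU layer
    out = hidden @ W2 + b2                             # linear output layer
    confidence = 1.0 / (1.0 + np.exp(-out[0]))         # sigmoid objectness
    return {"confidence": float(confidence),
            "class_scores": out[1:4],                  # e.g. pedestrian / motor / non-motor vehicle
            "size": out[4:6]}                          # e.g. width, length in map units

# assumed dimensions: 128-d fused feature -> 64 hidden units -> 6 outputs
weights = (rng.normal(size=(128, 64)) * 0.1, np.zeros(64),
           rng.normal(size=(64, 6)) * 0.1, np.zeros(6))
print(mlp_head(rng.normal(size=128), weights)["confidence"])
```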
  • Target position sequence data is then obtained from the target detection results of each image frame group, where the target detection results of each image frame group are the target detection results corresponding to the image frame groups at different times, that is, the detection results of the historical image frame groups.
  • the target position sequence data is a sequence of the target's positions at different times in the visual map arranged in chronological order.
  • the historical image frame group is the image frame group whose acquisition time is before the current time.
  • target detection can be performed based on the fusion information of this image frame group.
  • Each target detection result in the intersection of the target detection results of the current image frame group under the same perspective and the target detection results of the historical image frame groups under the same perspective is marked; for example, the target detection results in the intersection inherit the existing target position sequence identifier. In other words, the intersection of the target identifiers included in the target detection results of the current image frame group and those of the historical image frame groups is determined, and each target detection result whose target identifier falls in this intersection is marked.
  • the current image frame group is the image frame group collected at the current time.
  • The target detection results of the current image frame group under the same perspective that are not in the intersection are assigned a new target position sequence identifier, and the target detection results with the same target position sequence identifier are determined as the target position sequence data of that target. That is, among the target identifiers included in the target detection results of the current image frame group, the detection results whose identifiers are not in the above intersection are given new target position sequence identifiers, and the global position information included in the detection results sharing the same target position sequence identifier is sorted by the corresponding collection time to obtain the target position sequence.
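  • The identifier bookkeeping just described can be sketched schematically as follows; the dictionary layout and the "seq-" naming are assumptions for illustration only.

```python
def mark_results(current_ids, history_ids, next_seq):
    """Assign a target position sequence identifier to each target of the current
    frame group: targets in the intersection with the historical results inherit
    their existing sequence ID, the others receive a fresh one."""
    assigned = {}
    for target_id in current_ids:
        if target_id in history_ids:            # in the intersection: inherit the ID
            assigned[target_id] = history_ids[target_id]
        else:                                   # newly detected target: new sequence ID
            assigned[target_id] = f"seq-{next_seq}"
            next_seq += 1
    return assigned, next_seq

history = {"car_3": "seq-1"}
assigned, next_seq = mark_results(["car_3", "ped_7"], history, next_seq=2)
print(assigned)  # {'car_3': 'seq-1', 'ped_7': 'seq-2'}
```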
  • target detection and target position analysis are integrated into one, eliminating the need to perform detection first and then perform target position analysis, making the overall detection and position analysis more concise.
  • By fusing the feature point information of each channel's image frames into the same viewing angle, the multi-camera target detection method in the embodiments of the present application provides an end-to-end, multi-view, integrated detection and position analysis method. There is no need to obtain the target position sequence corresponding to each single camera and then fuse the per-camera target position sequences, which avoids the difficulty of obtaining a correct global target position sequence when the target position sequence of a single camera is problematic, and helps improve the reliability and accuracy of multi-camera target detection and target position analysis.
  • The following takes multi-camera target detection and target position analysis at a traffic intersection as an example. It should be understood that the present application is not limited to traffic intersections and can be applied to any multi-camera target detection and target position analysis scenario, for example detecting and analyzing the positions of surrounding targets with multiple cameras installed on a vehicle body.
  • Figure 2 is a schematic flow chart of a multi-camera target detection method in a specific scenario according to an embodiment of the present application. The method includes:
  • Step 201 Obtain video stream images from multiple cameras and visual map information corresponding to the spatial location of the video stream images.
  • the visual map information corresponding to the spatial position of the video stream image is the visual map information corresponding to the scene corresponding to the video stream image.
  • Obtaining video stream images from multiple cameras may mean obtaining video stream images from multiple cameras that are installed in the same scene and have overlapping shooting ranges. For example, four cameras installed at a traffic intersection separately collect video stream images of the four directions of the intersection, with overlapping areas between them, yielding four channels of video stream images, as shown in Figure 3.
  • Figure 3 is a schematic diagram of four cameras separately collecting one frame of video stream images in the four directions of the intersection.
  • Obtaining the visual map information of the spatial position corresponding to the video stream images may mean obtaining the map information of the traffic intersection, so as to obtain the global position information of the space corresponding to the video stream images. The position information in the map information may be described by global coordinate information in the world coordinate system.
  • the map information may be a high-precision map from a bird's-eye view, as shown in Figure 4.
  • Figure 4 is a schematic diagram of a high-precision map from a bird's-eye view at a traffic intersection. It should be understood that the map information may also be a general map, that is, a non-high-precision map, for example, an ordinary navigation electronic map.
  • A high-precision map, also known as a high-resolution map, is a thematic map relative to an ordinary navigation electronic map.
  • the absolute position accuracy is close to the meter level, and the relative position accuracy is at the centimeter level; the data organization method is to use different layers to describe water systems, railways, blocks, buildings, traffic markings and other information, and then overlay the layers to express it.
  • Step 202 Obtain the projection matrix between the visual map and the camera from which each video stream image comes; the mapping relationship between the camera image and the visual map can be described by a projection matrix.
  • Specifically, multiple iconic pixel points in the camera image or multiple iconic spatial points can be selected, and the pixel coordinates together with the corresponding map information are used to calculate the projection matrix; the internal and external calibration parameters of the camera can be obtained in advance. The iconic pixel points and spatial points mentioned above are points whose positions are easy to determine accurately, for example a certain point on a landmark building, a certain point on a road sign, or a corner point of a zebra crossing; a spatial point is the point in the world coordinate system corresponding to a pixel point. For example, the pixels corresponding to traffic mark points can be selected in the camera image, and the map information corresponding to those traffic mark points can be determined in the high-precision map.
  • The projection matrix can be calculated by substituting multiple pairs of corresponding points from the camera image and the high-precision map into the constructed linear equations and solving them, for example with the least-squares method; algorithms such as direct linear transformation (DLT), the P3P algorithm, the EPnP (Efficient PnP) algorithm, or Bundle Adjustment (BA) can be used.
  • the projection matrix can usually be predetermined and stored offline, or it can be determined in real time.
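  • As a sketch of the DLT / least-squares option mentioned above (the point correspondences are invented, and the application equally allows P3P, EPnP, or Bundle Adjustment), the ground-plane projection matrix can be estimated from at least four image-to-map point pairs:

```python
import numpy as np

def estimate_homography_dlt(img_pts: np.ndarray, map_pts: np.ndarray) -> np.ndarray:
    """Estimate a 3x3 projection (homography) matrix from >= 4 pairs of
    image points and map/ground-plane points: build the linear system A h = 0
    from the correspondences and take its least-squares solution via SVD."""
    rows = []
    for (u, v), (x, y) in zip(img_pts, map_pts):
        rows.append([u, v, 1, 0, 0, 0, -x * u, -x * v, -x])
        rows.append([0, 0, 0, u, v, 1, -y * u, -y * v, -y])
    A = np.asarray(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)      # right singular vector of the smallest singular value
    return H / H[2, 2]            # normalize so that H[2, 2] == 1

# invented correspondences: zebra-crossing corners in the image vs. metric map coordinates
img_pts = np.array([[100, 200], [400, 200], [420, 380], [80, 380]], dtype=float)
map_pts = np.array([[0, 0], [10, 0], [10, 8], [0, 8]], dtype=float)
H = estimate_homography_dlt(img_pts, map_pts)
```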
  • Step 203 For each channel of video stream images, extract the features of the current frame separately to obtain the current feature map and/or current feature information of each channel. For example, the features can be extracted with a convolutional neural network (CNN). Here the feature map is the feature information; that is, the current feature map of each channel is the current feature information of that channel.
  • Step 204 For each current feature map, perform the following processing:
  • the initialized target features can be obtained through one of the following two implementation methods:
  • the initialized target features can be obtained through the following step 2041.
  • Step 2041 Use the projection matrix of the camera from which the video stream image comes to project the feature points in that channel's current feature map into the BEV visual map, obtain the position information of the projected feature points, and initialize the target feature set in the BEV visual map.
  • the global position information of the projected feature point in the world coordinate system can be obtained;
  • Each target feature is a vector of a set length. For example, target features can be set for pedestrians, motor vehicles, and non-motor vehicles respectively, and the target features corresponding to all targets are used as a set of target detection vectors to be initialized, yielding the initialized target feature vectors.
  • the length of the above target features can be preset to 256 and so on.
  • the target detection vector may be a 3D target detection vector, including 3D information, or a 2D target detection vector, including 2D information.
  • the target feature set can be initialized directly in the BEV visual map.
  • If the initialized target features are obtained through step 2041 above, the reference position information of each target is obtained through implementation 1 below, that is, through step 2042; if the target features are initialized directly in the BEV visual map as in the other implementation above, the reference position information of each target is obtained through implementation 2 below.
  • Step 2042 Based on the initialized target characteristics, obtain the reference position information of each target in the BEV visual map.
  • In implementation 1, a target search is performed based on the feature points projected into the BEV visual map, and the reference position information of each target detected in the BEV visual map is obtained. That is, based on the projected feature points, the initialized target features are analyzed in the same-perspective visual map to obtain the position of each target in the BEV visual map, namely its reference position information; analyzing the initialized target features in the same-perspective visual map is equivalent to performing a target search based on the projected feature points.
  • The machine learning model can be a multi-layer perceptron, which is used to search the feature points projected into the BEV visual map with the target detection vectors, so as to parse out the reference position information of each target in the BEV visual map, for example the reference positions of different pedestrians and different vehicles.
  • Specifically, the initialized target detection vectors and the position information of the projected feature points are input into the machine learning model. The model analyzes the projected feature points according to the initialized target detection vectors to obtain the targets of the categories identified by those vectors, which can include people, motor vehicles, non-motor vehicles, and so on, and determines the reference position information of each identified target in the BEV visual map. As described above, analyzing the projected feature points according to the initialized target detection vectors is equivalent to searching the projected feature points with the target detection vectors.
  • target analysis can be performed on the initialized target features in the same perspective visual map to obtain the reference position information of the target in the BEV visual map.
  • the machine learning model can be a multi-layer perceptron.
  • the multi-layer perceptron is used to parse the reference position information of each target in the BEV visual map. For example, the reference location information of different pedestrians and vehicles in the BEV visual map is parsed.
  • the initialized target detection vector is input into the machine learning model.
  • the machine learning model can parse and obtain the target of the category identified by the initialized target detection vector and the reference position information of the target in the BEV visual map based on the initialized target detection vector.
  • Step 2043 Use the camera's projection matrix to back-project the reference position information of each target into that channel's current feature map, determine the feature position of each target's reference position in the current feature map, and obtain the corresponding features based on the feature positions. Specifically, the feature position information in the current feature map corresponding to the reference position information is obtained, and the corresponding features are determined from that feature position information; that is, in the current feature map, the features are taken from the positions indicated by the feature position information.
  • the reference position information of target 1 corresponds to feature positions 1, 2, 3, and 4 in the 4-channel current feature map, and the corresponding features 1, 2, 3, and 4 are obtained from the 4 feature positions.
  • Alternatively, the position information in the current frame corresponding to the reference position information is obtained according to the camera's projection matrix and the reference position information, and the corresponding features are determined from that position information; that is, feature extraction is performed in the current frame at the positions indicated by the position information to obtain the corresponding features.
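  • A rough sketch of this back-projection and feature sampling, assuming a ground-plane homography per camera, a fixed feature-map stride of 8, and nearest-cell sampling (none of these details are fixed by the application):

```python
import numpy as np

def backproject_and_sample(ref_xy, H_cam_to_map, feature_map, stride=8):
    """Back-project one reference position (x, y) from the visual map into a
    camera view with the inverse projection matrix, then read the feature
    vector of the nearest feature-map cell (stride = image pixels per cell)."""
    p = np.linalg.inv(H_cam_to_map) @ np.array([ref_xy[0], ref_xy[1], 1.0])
    u, v = p[0] / p[2], p[1] / p[2]          # pixel coordinates in that camera's image
    col = int(np.clip(round(u / stride), 0, feature_map.shape[1] - 1))
    row = int(np.clip(round(v / stride), 0, feature_map.shape[0] - 1))
    return feature_map[row, col]             # C-dimensional feature at the target's position

# per-channel features of one target: sample every camera's feature map in turn, e.g.
# feats = [backproject_and_sample(ref, H_k, fmap_k) for H_k, fmap_k in zip(H_list, fmaps)]
```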
  • Step 2044 Perform feature fusion on each target to obtain the fusion features of each target;
  • the features of the same target in the current feature map of each channel are fused to obtain the first fusion feature.
  • the first features 1, 2, 3, and 4 of target 1 are fused to obtain the target 1's first fusion feature. That is to say, features corresponding to the same target in each image frame included in the image frame group are fused to obtain the first fusion feature corresponding to the target.
  • Since the same target captured by different cameras appears at different positions in the current frames, and the targets corresponding to the same pixel position differ between frames, the features of the other targets in each feature map, i.e. of all targets except this one, can also be fused to obtain a second fusion feature. For example, for target 1, the features of the other targets, that is, the features of target 2, target 3, and so on, are fused; in this way redundant target information can be suppressed and the desired target information enhanced.
  • the first fusion feature and the second fusion feature are fused to obtain the fusion feature of the target.
  • the above-mentioned fusion may include at least one operation of adding and splicing feature vectors, wherein the addition may be a weighted average addition.
  • the features of target 1 in feature map 1-feature map 4 can be fused, that is, feature 1-feature 4 can be fused to obtain the first fusion feature of target 1.
  • the features of target 2 and target 3 in feature maps 1 to 4 are fused, that is, feature 5 to feature 12 are fused to obtain the second fusion feature of target 1.
  • the first fusion feature of target 1 is fused with the second fusion feature to obtain the fusion feature of target 1.
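  • A minimal sketch of the fusion just described, assuming 64-dimensional per-view features, simple averaging for the (weighted) addition and concatenation for the splicing; the four cameras and three targets mirror the example above, but the dimensions are invented.

```python
import numpy as np

def fuse_target_features(own_feats, other_feats):
    """Fuse one target's features: average the target's own feature vectors
    over the camera views (first fusion feature), average the other targets'
    vectors (second fusion feature), then splice the two by concatenation."""
    first = np.mean(np.stack(own_feats), axis=0)     # same target across the views
    second = np.mean(np.stack(other_feats), axis=0)  # the remaining targets
    return np.concatenate([first, second])           # e.g. 64 + 64 -> 128-d fused feature

own = [np.random.rand(64) for _ in range(4)]      # target 1 in feature maps 1-4
others = [np.random.rand(64) for _ in range(8)]   # targets 2 and 3 in feature maps 1-4
print(fuse_target_features(own, others).shape)    # (128,)
```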
  • Step 205 Based on the projected feature point information, search the fusion features to obtain the target detection results from the same perspective of the current frame group.
  • Specifically, the fusion features of all targets and the projected feature point information are analyzed by the machine learning model to obtain the target detection results under the BEV perspective, that is, the target detection results in the BEV visual map for the current frame group. The target detection results include three-dimensional position information, three-dimensional size information, and confidence, or alternatively two-dimensional position information, two-dimensional size information, and confidence; they may also include the target category.
  • the target category can be pedestrians, bicycles, or motor vehicles.
  • Step 206 Filter the target detection results according to the confidence threshold and retain valid target detection results.
  • target detection results whose confidence is less than the confidence threshold are eliminated to obtain effective target detection results.
  • the confidence levels corresponding to target detection result 1-target detection result 5 are 65%, 70%, 95%, 97% and 96% respectively. If the confidence threshold is 90%, then target detection result 1 and target detection result 2 can be eliminated.
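  • The confidence filtering of step 206 amounts to a simple threshold test; below is a sketch using the numbers from the example above (the dictionary layout is assumed for illustration).

```python
def filter_by_confidence(results, threshold=0.9):
    """Keep only the detections whose confidence reaches the threshold."""
    return [r for r in results if r["confidence"] >= threshold]

detections = [{"id": i, "confidence": c}
              for i, c in enumerate([0.65, 0.70, 0.95, 0.97, 0.96], start=1)]
print(filter_by_confidence(detections))  # results 1 and 2 are eliminated
```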
  • Step 207 Add the retained valid target detection results to the initialized target feature set used for target search in the next frame group, where the initialized target feature set of the next frame group includes the initialized target detection vectors used for target search.
  • In this case, the number of target detection vectors for the next frame group is the n initialized target detection vectors plus the m historical valid target detection results, where the current frame group is a historical frame group relative to the next frame group and its m valid target detection results are the historical valid detection results.
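  • Carrying the valid detections over as extra target detection vectors for the next frame group can be sketched as below; the vector length 256 follows the example given earlier, while the counts n = 3 and m = 2 are assumptions for illustration.

```python
import numpy as np

def build_next_queries(init_queries, valid_results):
    """Target detection vectors for the next frame group: the n pre-initialized
    vectors plus the m valid detections kept from the current frame group."""
    if len(valid_results) == 0:
        return init_queries
    return np.concatenate([init_queries, valid_results], axis=0)  # shape (n + m, d)

init_queries = np.zeros((3, 256))        # n = 3 category queries of the set length 256
carried_over = np.random.rand(2, 256)    # m = 2 valid detections from the current group
print(build_next_queries(init_queries, carried_over).shape)  # (5, 256)
```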
  • Step 208 Determine whether the valid target detection result of the current frame group is related to the valid target detection result of the previous frame group.
  • If the valid target detection result comes from an initialized target, the target corresponding to that result is newly detected, and the result is given a new target position sequence identifier (ID). Here, an initialized target is a target corresponding to an initialization target detection vector; for the first image frame group, these are the initial initialized targets. A valid target detection result coming from an initialized target means that the target was not detected in the image frame groups before the current frame group, so the corresponding target is a newly detected target.
  • If the valid target detection result comes from a valid target detection result added for the previous frame group, the target was detected both in the previous frame group (a historical frame group relative to the current one) and in the current frame group; in that case the target position sequence identifier of the valid target detection result remains unchanged, and the identifier it had in the previous frame group is used.
  • It should be noted that the current frame group, the next frame group, and the previous frame group should each be understood as a set of simultaneous image frames from all channels, rather than a single frame from one channel.
  • Step 209 Determine whether the images in the video stream have been processed; if yes, execute step 210; if not, extract the next frame of images in each channel and return to step 203 until the images in the video stream are processed.
  • Specifically, the electronic device can determine whether a next frame can be extracted in order to decide whether the images in the video streams have been fully processed. If a next frame can be extracted, the video streams have not been fully processed; in that case the next frame is extracted for each channel and the process returns to the step of extracting features from the current frame of each channel to obtain the current feature maps and/or current feature information, until the images in the video streams are processed. If no next frame can be extracted, the video streams have been processed and the electronic device can perform step 210.
  • Step 210 Output the valid target detection results of the same target position sequence ID, and obtain the position sequence of the valid target detection results in the BEV visual map, thereby obtaining the target position sequence.
  • For example, if the target position sequence ID is sequence 1, the valid target detection results are result 1 to result 10, and the corresponding collection times are 11:01 to 11:10 respectively, then outputting result 1 to result 10 in chronological order gives the position sequence of that target in the BEV visual map.
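  • To illustrate step 210 only (the record layout is assumed), the position sequences can be assembled by grouping the marked results by sequence ID and sorting each group by collection time:

```python
from collections import defaultdict

def build_position_sequences(marked_results):
    """Group marked detections by target position sequence ID and sort each
    group by collection time, yielding one position sequence per target."""
    buckets = defaultdict(list)
    for det in marked_results:
        buckets[det["sequence_id"]].append((det["time"], det["position"]))
    return {sid: [pos for _, pos in sorted(items)] for sid, items in buckets.items()}

dets = [{"sequence_id": "seq-1", "time": "11:02", "position": (2.0, 5.0)},
        {"sequence_id": "seq-1", "time": "11:01", "position": (1.0, 5.0)},
        {"sequence_id": "seq-2", "time": "11:01", "position": (9.0, 0.0)}]
print(build_position_sequences(dets))
# {'seq-1': [(1.0, 5.0), (2.0, 5.0)], 'seq-2': [(9.0, 0.0)]}
```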
  • the process of obtaining target detection results through the machine learning model in the above-mentioned steps 2042 and 205 can be understood as a process of querying or searching for a set target in the visual map from the BEV perspective, that is, a target query process to perform target detection.
  • FIG. 5 is a schematic diagram of the target detection process.
  • In Figure 5, boxes with different grayscales represent different targets; C_li represents the reference position information in the visual map, and C_lmi represents the back-projection of that reference position information into the position information in the current image frame of each channel. Target 1 integrates the information of the other targets as well as its own information in the current image frames of each channel.
  • cameras 1 to 3 are used for shooting, and the first channel image data, the second channel image data, and the third channel image data are obtained.
  • the feature maps corresponding to the current frame image are respectively obtained from the three channels of image data to obtain the first feature map 521, the second feature map 522 and the third feature map 523.
  • the initialized target detection vector includes a first vector 501 for detecting pedestrians, a second vector 502 for detecting motor vehicles, and a third vector 503 for detecting non-motor vehicles.
  • a target search is performed on the first vector 501, the second vector 502, and the third vector 503.
  • three targets are detected, namely target 1 to target 3 respectively.
  • the reference position information corresponding to target 1 to target 3 can be determined, which are first position information 504, second position information 505, and third position information 506 respectively.
  • the first position information 504, the second position information 505 and the third position information 506 are back-projected to the first feature map 521, the second feature map 522 and the third feature map 523 respectively.
  • Here, C_li represents the reference position information in the visual map, and C_lmi represents the back-projection of the reference position information into the position information of each current image frame.
  • Feature fusion is then performed for target 1 to target 3 respectively to obtain the fused features of target 1 to target 3.
  • the features of target 1 in the first feature map 521 to the third feature map 523 are fused to obtain the first fusion feature corresponding to target 1.
  • the features of target 2 and target 3 in the first feature map 521 to the third feature map 523 are fused to obtain the second fusion feature corresponding to target 1.
  • the first fusion feature and the second fusion feature corresponding to target 1 are fused to obtain the fusion feature of target 1.
  • the fusion features corresponding to target 1 to target 3 can be obtained, that is, the first feature 507, the second feature 508, and the third feature 509.
  • the above fusion features are searched to obtain target detection results corresponding to target 1 to target 3 respectively, that is, the first result 510, the second result 511, and the third result 512.
  • the target detection results include confidence.
  • the target detection results are filtered according to whether the confidence included in the target detection results corresponding to target 1 to target 3 is greater than the confidence threshold.
  • the second result 511 and the third result 512 may be determined to be valid.
  • The second result 511 and the third result 512 are added to the initialization target detection vector set of the next frame group. Next, the next frame of image data is extracted from each of the 1st to 3rd channels of image data, and the process returns to the above step of obtaining the feature maps corresponding to the current frame images from the 3 channels of image data, until the images in the video streams have been processed.
  • In this way, the position sequence data of target 1 to target M in the BEV visual map can be obtained, that is, Pred_1 to Pred_M.
  • FIG. 6 is a schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the device includes,
  • the first acquisition module is used to acquire at least two channels of video stream image data, wherein the image data of each channel of video stream at least includes image data of the overlapping area,
  • the second acquisition module is used to acquire the visual map information of the location of the overlapping area in the video stream image data
  • The target detection and position analysis module is used to convert the feature point information extracted from each image frame of each channel of video stream image data into the visual map, so as to fuse the feature point information of the image frames of each channel into the same perspective and obtain fusion information under that perspective, where the image frames of each channel are simultaneous, and to perform target detection and target position analysis based on the fusion information.
  • Specifically, the target detection and position analysis module is configured to: project the feature points of each image frame using the projection matrix of the camera from which the image frame comes, where the projection matrix characterizes the mapping relationship between pixel points in the camera image and spatial points in the visual map; perform target detection based on the projected feature point information of all image frames in each channel to obtain target detection results under the same viewing angle corresponding to those image frames, and determine these results as the target detection results of the image frame group of each channel; and obtain target position sequence data from the target detection results of each image frame group, where the target detection results of each image frame group are the target detection results under the same viewing angle corresponding to all image frames of each channel at different times.
  • Figure 7 is another schematic diagram of a multi-camera target detection device according to an embodiment of the present application.
  • the device includes a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the steps of the multi-camera target detection method according to an embodiment of the present application.
  • The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk memory; optionally, the memory may also be at least one storage device located far away from the aforementioned processor.
  • The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the above-mentioned multi-camera target detection device may be an electronic device, and the electronic device may further include a communication bus and/or a communication interface.
  • the processor, communication interface, and memory complete communication with each other through the communication bus.
  • The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The communication bus may be divided into an address bus, a data bus, a control bus, and so on; for ease of presentation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
  • the communication interface is used for communication between the above-mentioned electronic devices and other devices.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • a computer program is stored in the storage medium.
  • the computer program is executed by a processor, the steps of the multi-camera target detection method described in the embodiments of the present application are implemented.
  • Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the steps of the multi-camera target detection method described in the embodiments of the present application.
  • Since the device and other embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, please refer to the corresponding description of the method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

Disclosed is a multi-camera target detection method. The method comprises: acquiring at least two channels of video stream image data from different cameras, each channel of video stream image data comprising at least image data of an overlapping region; acquiring visual map information of the position in which the overlapping region is located; converting feature point information, which is extracted from each image frame of each channel of video stream image data, into a visual map so as to fuse the feature point information of the channels of image frames under the same viewing angle and obtain fused information under the same viewing angle; and performing target detection and target position analysis on the basis of the fused information. It is not necessary to first generate target position sequences corresponding to individual cameras and then fuse the target position sequences, which solves the problem of it being difficult to discriminate and select when the target position sequences corresponding to a plurality of individual cameras are fused, so that the computing-power consumption caused by fusing the target position sequences is avoided, and the accuracy and reliability of target detection and target position analysis are also improved.
PCT/CN2023/118350 2022-09-13 2023-09-12 Procédé et appareil de détection de cible à caméras multiples WO2024055966A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211108975.7 2022-09-13
CN202211108975.7A CN115457084A (zh) 2022-09-13 2022-09-13 一种多相机目标检测跟踪方法、装置

Publications (1)

Publication Number Publication Date
WO2024055966A1 true WO2024055966A1 (fr) 2024-03-21

Family

ID=84303349

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118350 WO2024055966A1 (fr) 2022-09-13 2023-09-12 Procédé et appareil de détection de cible à caméras multiples

Country Status (2)

Country Link
CN (1) CN115457084A (fr)
WO (1) WO2024055966A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279857A (zh) * 2024-05-31 2024-07-02 苏州元脑智能科技有限公司 目标检测方法、电子设备、产品及介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457084A (zh) * 2022-09-13 2022-12-09 上海高德威智能交通系统有限公司 一种多相机目标检测跟踪方法、装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113411543A (zh) * 2021-03-19 2021-09-17 贵州北斗空间信息技术有限公司 一种多路监控视频融合显示方法及系统
CN113673425A (zh) * 2021-08-19 2021-11-19 清华大学 一种基于Transformer的多视角目标检测方法及系统
US20220020158A1 (en) * 2020-07-15 2022-01-20 Jingdong Digits Technology Holding Co., Ltd. System and method for 3d object detection and tracking with monocular surveillance cameras
CN114913506A (zh) * 2022-05-18 2022-08-16 北京地平线机器人技术研发有限公司 一种基于多视角融合的3d目标检测方法及装置
CN115457084A (zh) * 2022-09-13 2022-12-09 上海高德威智能交通系统有限公司 一种多相机目标检测跟踪方法、装置


Also Published As

Publication number Publication date
CN115457084A (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
CN112894832B (zh) 三维建模方法、装置、电子设备和存储介质
WO2024055966A1 (fr) Procédé et appareil de détection de cible à caméras multiples
CN112667837A (zh) 图像数据自动标注方法及装置
CN111830953A (zh) 车辆自定位方法、装置及系统
CN113447923A (zh) 目标检测方法、装置、系统、电子设备及存储介质
CN113096003B (zh) 针对多视频帧的标注方法、装置、设备和存储介质
CN111383204A (zh) 视频图像融合方法、融合装置、全景监控系统及存储介质
CN115164918B (zh) 语义点云地图构建方法、装置及电子设备
US20220044558A1 (en) Method and device for generating a digital representation of traffic on a road
CN115376109B (zh) 障碍物检测方法、障碍物检测装置以及存储介质
US20130135446A1 (en) Street view creating system and method thereof
CN115004273A (zh) 交通道路的数字化重建方法、装置和系统
CN116883610A (zh) 基于车辆识别与轨迹映射的数字孪生路口构建方法和系统
CN114898314A (zh) 驾驶场景的目标检测方法、装置、设备及存储介质
CN115797408A (zh) 融合多视角图像和三维点云的目标跟踪方法及装置
WO2020049089A1 (fr) Procédés et systèmes de détermination de la position d'un véhicule
CN114969221A (zh) 一种更新地图的方法及相关设备
CN115344655A (zh) 地物要素的变化发现方法、装置及存储介质
Rizzoli et al. Syndrone-multi-modal uav dataset for urban scenarios
KR20130137076A (ko) 실시간 관심 지역을 나타내는 3차원 지도를 제공하는 장치 및 방법
Zhanabatyrova et al. Automatic map update using dashcam videos
CN114066974A (zh) 一种目标轨迹的生成方法、装置、电子设备及介质
CN117789160A (zh) 一种基于聚类优化的多模态融合目标检测方法及系统
CN116843754A (zh) 一种基于多特征融合的视觉定位方法及系统
Luo et al. Complete trajectory extraction for moving targets in traffic scenes that considers multi-level semantic features

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864698

Country of ref document: EP

Kind code of ref document: A1