CN118314497A - Multi-view target detection and model training method, system, equipment and medium - Google Patents
- Publication number: CN118314497A (application CN202410417868.5A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention provides a multi-view target detection method, a model training method, and a corresponding system, device, and medium. The detection method comprises the following steps: synchronously acquiring multiple paths of video pictures, and acquiring an internal reference matrix and an external reference matrix of the corresponding camera; inputting the multiple paths of video pictures, the internal reference matrixes and the external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image features, detecting targets in each path of video pictures, and fusing the image features, the internal reference matrixes and the external reference matrixes to generate a parallax image; inputting the parallax image, the multipath video images, the internal reference matrixes and the external reference matrixes into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by using the internal reference matrixes and the external reference matrixes, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image. Data from different perspectives are effectively integrated, so that target detection and identification are more accurate and reliable.
Description
Technical Field
The invention relates to the field of target detection, and in particular to a multi-view target detection method, a model training method, and a corresponding system, device, and medium.
Background
At present, multi-view target detection methods have gradually attracted attention. These methods provide richer spatial information by fusing views from a plurality of cameras, thereby remarkably improving target detection precision. Such multi-view and multi-sensor fusion methods improve the accuracy of target detection, but they often require significant computational resources when processing large-scale data and complex models. This is particularly true in resource-constrained scenarios, such as mobile devices, real-time systems, or large-scale camera networks, where computing resources and storage capacity are limited. In addition, the prior art often lacks a mechanism for efficiently coordinating and fusing information among multiple cameras; particularly as the number of cameras increases, the impact on system performance and resource requirements becomes more significant. Accordingly, there is a need to provide a multi-view target detection method, a model training method, and a corresponding system, device, and medium.
Disclosure of Invention
The invention provides a multi-view target detection method. The method solves the problem that data from different visual angles cannot be effectively utilized due to the lack of an effective association mechanism when multiple cameras coordinate in the prior art.
The invention provides a multi-view target detection method, which comprises the following steps: synchronously acquiring multiple paths of video pictures, and acquiring an internal reference matrix and an external reference matrix of a corresponding camera; inputting multiple paths of video pictures and corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image characteristics of each path of video pictures, detecting targets in each path of video pictures, and fusing the extracted image characteristics of each path of video pictures and the corresponding internal reference matrixes and external reference matrixes to generate a parallax image; inputting the parallax image, the multipath video images, the corresponding internal reference matrix and the corresponding external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by using the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image.
In an embodiment of the present invention, the inputting the multiple paths of video frames and the corresponding internal reference matrix and external reference matrix into the parallax estimation network of the multi-view target detection model extracts the image characteristics of each path of video frames, detects the target in each path of video frames, and fuses the extracted image characteristics and the corresponding internal reference matrix and external reference matrix to generate the parallax map, including: inputting multiple paths of video pictures and corresponding internal reference matrixes and external reference matrixes into the parallax estimation network, and extracting image features of each path of video pictures to obtain the image features of each path of video pictures; detecting targets in each path of video pictures based on the extracted image characteristics of each path of video pictures; for each target: performing feature matching on each extracted image feature to obtain image features of the same target under the view angles of all cameras; calculating the parallax value of the same target according to the image characteristics of the same target under the visual angles of all the cameras, and the internal reference matrix and the external reference matrix of the corresponding cameras; generating a disparity map according to the calculated disparity value of each target.
In an embodiment of the present invention, the calculating the parallax value of the same target according to the image features of the same target under different camera angles and the internal reference matrix and the external reference matrix of the corresponding camera includes: identifying and recording image coordinates of each path of image features of the same target under the corresponding camera view angles; converting the image coordinates of each path of image characteristics into three-dimensional coordinates in the corresponding camera according to the internal reference matrix of the camera; converting three-dimensional coordinates of each path of image characteristics into world coordinates in a world coordinate system according to an external parameter matrix of the camera; and calculating the difference value of the world coordinates corresponding to each path of image features, and taking the calculated difference value as the parallax value of the same target.
In an embodiment of the present invention, the inputting the parallax map, the multi-path video frame, the corresponding internal reference matrix, and the external reference matrix to the target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video frame by using the internal reference matrix and the external reference matrix, obtaining a two-dimensional labeling result of each target from the corrected video frame, and generating three-dimensional coordinates of each target based on the corrected video frame and the parallax map includes: inputting the parallax map, the multi-path video pictures and the corresponding internal reference matrix and external reference matrix into the target tracking network, correcting the distortion of each path of video pictures based on the internal reference matrix and the external reference matrix, and aligning each path of video pictures to a uniform preset reference viewing angle; generating a corresponding two-dimensional labeling result for each target in the corrected video picture; the two-dimensional labeling result comprises attribute and size information of the corresponding target; and generating three-dimensional coordinates of each target based on the two-dimensional labeling result and the parallax map of each target.
In an embodiment of the present invention, the video frame space correction process includes: based on an image distortion correction algorithm, adjusting the position of each pixel point in the corresponding video picture according to the distortion coefficient in the internal reference matrix; performing perspective transformation on the adjusted video picture based on the corresponding internal reference matrix; and performing position correction on the video picture after perspective transformation by using the corresponding external parameter matrix.
In an embodiment of the present invention, after obtaining a two-dimensional labeling result of each target in each path of video frames based on the corrected video frames and the parallax map and generating three-dimensional coordinates of each target, the method further includes: and displaying the two-dimensional labeling result of each target in the corresponding video picture.
In an embodiment of the present invention, there is further provided a training method of a multi-view target detection model, including: acquiring a plurality of paths of synchronously acquired video picture sets, and a two-dimensional labeling result set and a three-dimensional coordinate set of each path of video picture set corresponding to the video picture sets; inputting a plurality of paths of video picture sets, corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image feature sets of each path of video picture set, detecting targets in each path of video picture, and fusing the extracted image feature sets, the internal reference matrixes and the external reference matrixes to generate a parallax image set; inputting the parallax image set, the multipath video image set, the internal reference matrix and the external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video image set by using the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video image, and generating a predicted three-dimensional coordinate set of each target based on the corrected video image set and the parallax image set; and updating parameters of the multi-view target detection model based on the difference degree of the predicted two-dimensional labeling result set and the difference degree of the predicted three-dimensional coordinate set and the three-dimensional coordinate set to obtain a trained multi-view target detection model.
In an embodiment of the present invention, there is also provided a system for multi-view object detection, the system including: the data acquisition module is used for synchronously acquiring multiple paths of video pictures and acquiring an internal reference matrix and an external reference matrix of the corresponding camera; the parallax image generation module is used for inputting multiple paths of video images and corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of the multi-view target detection model, extracting image characteristics of each path of video images, detecting targets in each path of video images, and fusing each path of extracted image characteristics with the corresponding internal reference matrixes and external reference matrixes to generate a parallax image; the target detection module is used for inputting the parallax image, the multipath video images, the corresponding internal reference matrix and the corresponding external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by utilizing the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image.
In an embodiment of the present invention, there is also provided an electronic device including: one or more processors; and a storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the steps of the multi-view object detection method or the multi-view object detection model training method of any of the above.
In an embodiment of the present invention, there is also provided a computer readable storage medium having a computer program stored thereon, which when executed by a processor of a computer, causes the computer to perform the steps of the multi-view object detection method or the multi-view object detection model training method described in any one of the above.
The invention provides a multi-view target detection method. By synchronously collecting multiple paths of video pictures and combining the internal reference matrix and the external reference matrix of each camera, the system can obtain more comprehensive spatial information, thereby effectively integrating data from different visual angles and ensuring more accurate and reliable target detection and identification. The parallax estimation network is utilized to extract image features and generate a parallax image, so that the position difference of the target under different camera angles can be accurately calculated, and the position of the target in the three-dimensional space can be accurately positioned. The disparity map and the multipath video data are input into the target tracking network and are subjected to spatial correction, so that the change of the target in different time and space can be effectively tracked. In addition, the three-dimensional coordinates of the target are generated for subsequent analysis. By the multi-view target detection and tracking method, more comprehensive and accurate space analysis and target processing can be realized, and powerful technical support is provided for various application scenes.
Drawings
Fig. 1 is a schematic flow chart of a multi-view target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing the detection result according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a multi-view object detection model according to an embodiment of the invention;
FIG. 4 is a block diagram of a system for multi-view object detection according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In the following description, numerous details are set forth in order to provide a more thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the embodiments of the present invention.
The invention provides a multi-view target detection method, which can obtain more comprehensive spatial information by synchronously collecting multiple paths of video pictures and combining internal reference matrixes and external reference matrixes of respective cameras, so that data from different view angles can be effectively integrated, and target detection and recognition are more accurate and reliable. The parallax estimation network is utilized to extract image features and generate a parallax image, so that the position difference of the target under different camera angles can be accurately calculated, and the position of the target in the three-dimensional space can be accurately positioned. The disparity map and the multipath video data are input into the target tracking network and are subjected to spatial correction, so that the change of the target in different time and space can be effectively tracked. In addition, the three-dimensional coordinates of the target are generated for subsequent analysis. By the multi-view target detection and tracking method, more comprehensive and accurate space analysis and target processing can be realized, and powerful technical support is provided for various application scenes. The algorithm combines the existing 2-dimensional target detection model architecture, is based on a binocular shooting measurement theory, performs spatial association of multiple cameras through a target spatial relationship matrix, and applies the spatial relationship matrix to a loss function, so that the model is guaranteed to have the capability of outputting spatial 3-dimensional information under the condition of respectively outputting single-frame image detection targets.
Referring to fig. 1, the multi-view object detection includes the following steps:
s1, synchronously acquiring multiple paths of video pictures, and acquiring an internal reference matrix and an external reference matrix of a corresponding camera.
And synchronously acquiring experimental videos recorded by cameras at different positions, extracting continuous video pictures from the experimental videos, and simultaneously, extracting an internal reference matrix and an external reference matrix of each camera. The camera's internal reference matrix includes parameters such as focal length, optical center position and distortion coefficient of the camera for characterizing attribute information of the camera, and the camera's external reference matrix includes position (i.e. translation vector) and orientation (i.e. rotation matrix) of the camera relative to a common world coordinate system. The characteristics and imaging properties of the cameras can be known through the internal reference matrix, and the spatial position relationship between the cameras can be determined through the external reference matrix, so that the video picture information can be analyzed more accurately later. It should be noted that the reference matrix and the reference matrix of the camera may be obtained in advance based on a calibration process in advance.
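As a minimal illustration (not part of the patent text), the internal and external reference matrices described above can be represented and used as follows. All numerical values, the camera setup, and the function name are hypothetical placeholders; in practice the parameters come from an offline calibration procedure.

```python
import numpy as np

# Hypothetical calibration of one camera.
K = np.array([[1000.0,    0.0, 960.0],   # fx,  0, cx  (internal reference matrix)
              [   0.0, 1000.0, 540.0],   #  0, fy, cy
              [   0.0,    0.0,   1.0]])
dist = np.array([-0.10, 0.02, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3 distortion coefficients

R = np.eye(3)                         # rotation relative to the common world coordinate system
t = np.array([[0.0], [0.0], [1.5]])   # translation vector (external reference)

def project_world_point(Xw: np.ndarray) -> np.ndarray:
    """Project a 3-D world point (shape (3,)) to pixel coordinates using K, R, t
    (lens distortion ignored for brevity)."""
    Xc = R @ Xw.reshape(3, 1) + t      # world -> camera coordinates
    uvw = K @ Xc                       # camera -> homogeneous image coordinates
    return (uvw[:2] / uvw[2]).ravel()  # normalize to pixel (u, v)

print(project_world_point(np.array([0.2, -0.1, 3.0])))
```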
S2, inputting multiple paths of video pictures and corresponding internal and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image characteristics of each path of video pictures, detecting targets in each path of video pictures, and generating a parallax image by fusing the extracted image characteristics of each path of video pictures and the corresponding internal and external reference matrixes.
It should be noted that the multi-view object detection model in the present application is applicable to various models capable of object detection, including, but not limited to, YOLO-series detection models. Multiple paths of video pictures and the corresponding internal reference matrixes and external reference matrixes are input into the parallax estimation network, and key image features such as edges, corner points and textures are extracted from each path of video pictures so as to realize target detection more accurately. A target in the corresponding video picture is detected by using the extracted image features, and a parallax image is calculated by combining the image feature data from all video sources with the corresponding internal reference matrix and external reference matrix. The disparity map is generated according to the relative position change of the same target in different video sources; it reflects the depth change of the same scene or object in space as observed under different camera angles, and is determined by the spatial information of the internal and external reference matrices and the image characteristics of the multipath video images.
In an embodiment of the present invention, the inputting the multiple paths of video frames and the corresponding internal reference matrix and external reference matrix into the parallax estimation network of the multi-view target detection model extracts the image characteristics of each path of video frames, detects the target in each path of video frames, and fuses the extracted image characteristics and the corresponding internal reference matrix and external reference matrix to generate the parallax map, including:
Inputting multiple paths of video pictures and corresponding internal reference matrixes and external reference matrixes into the parallax estimation network, and extracting image features of each path of video pictures to obtain the image features of each path of video pictures;
detecting targets in each path of video pictures based on the extracted image characteristics of each path of video pictures;
for each target:
Performing feature matching on each extracted image feature to obtain image features of the same target under the view angles of all cameras; and
Calculating the parallax value of the same target according to the image characteristics of the same target under the visual angles of all paths of cameras and the internal reference matrix and the external reference matrix of the corresponding cameras;
generating a disparity map according to the calculated disparity value of each target.
Inputting the multiple paths of video pictures, the corresponding internal reference matrix and the external reference matrix into a parallax estimation network, extracting image characteristics in each path of video pictures, and detecting targets in the corresponding video pictures through the image characteristics. For each detected target: and comparing and matching the image characteristics of the target in different video pictures, thereby determining the corresponding relation between the same characteristics of the target under each view angle. And calculating the parallax value of the target according to the matched characteristic corresponding relation, the internal reference matrix and the external reference matrix of the cameras, wherein the parallax value is used for representing the position change of the same target in video pictures shot by different cameras. And generating a disparity map according to the calculated disparity values of the targets, wherein the disparity map provides a visual representation of depth information for each target in each video picture for subsequent determination of the three-dimensional coordinates of the target.
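The feature matching step above can be illustrated with a conventional descriptor-matching sketch. This is not the patent's parallax estimation network; it is a hedged OpenCV-based stand-in showing how correspondences of the same target between two camera views could be obtained. The function name and parameters are illustrative.

```python
import cv2

def match_target_features(img1, img2, max_matches=50):
    """Sketch of cross-view feature matching: detect ORB keypoints in two views of the
    same target region and match their descriptors with brute-force Hamming matching."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:max_matches]
    # Matched image coordinates (view-1 pixel, view-2 pixel) for each correspondence.
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches]
```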
In an embodiment of the present invention, the calculating the parallax value of the same target according to the image features of the same target under different camera angles, and the internal reference matrix and the external reference matrix of the corresponding camera includes:
Identifying and recording image coordinates of each path of image features of the same target under the corresponding camera view angles;
converting the image coordinates of each path of image characteristics into three-dimensional coordinates in the corresponding camera according to the internal reference matrix of the camera;
Converting three-dimensional coordinates of each path of image characteristics into world coordinates in a world coordinate system according to an external parameter matrix of the camera;
And calculating the difference value of the world coordinates corresponding to each path of image features, and taking the calculated difference value as the parallax value of the same target.
The same target is identified in video pictures captured by cameras at different positions, and the position of the image feature of the target under each camera view angle, namely the image coordinates, are recorded. The image coordinates refer to a position representation of a pixel point on a two-dimensional image plane, and are usually composed of an abscissa x and an ordinate y. After the image coordinates of each image feature are obtained, the image coordinates of the identified corresponding image feature are converted from a two-dimensional image space into a coordinate system of a three-dimensional space of the camera by utilizing an internal reference matrix of the camera and comprehensively considering factors such as focal length and optical center position of the camera, so that the three-dimensional coordinates of the image feature in the corresponding camera are obtained. And then uniformly converting three-dimensional coordinates of the same image feature in all cameras into a world coordinate system by utilizing an external parameter matrix of the cameras, and generating the world coordinate of the image feature so that the cameras can represent the position of the same target under a uniform frame no matter the position and the orientation of the cameras. In the world coordinate system, the coordinate difference value of the same object in different world coordinate systems is calculated and used as the parallax value of the object. Since parallax refers to the difference between the coordinates of images of the same object at different viewing angles. The disparity values are calculated in the world coordinate system, which takes into account the actual position differences of the object in the real world, not just the visual differences in the image plane. Such a calculation can more truly reflect the relative position and depth variations of objects in three-dimensional space.
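A simplified numpy sketch of the coordinate conversions described above follows. It assumes the depth of the matched feature in each camera is available (for example, from the disparity/depth estimate), and all calibration values and pixel coordinates are illustrative placeholders, not data from the patent.

```python
import numpy as np

def pixel_to_world(uv, depth, K, R, t):
    """Back-project an image coordinate (u, v) with known depth to world coordinates:
    image -> camera 3-D via the internal reference matrix K, then camera -> world via
    the external parameters (assuming the camera model Xc = R @ Xw + t)."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    Xc = np.array([x, y, depth])
    return R.T @ (Xc - t)              # world coordinates of the feature

# Illustrative calibration of two cameras (placeholder values).
K = np.array([[1000., 0., 960.], [0., 1000., 540.], [0., 0., 1.]])
R1, t1 = np.eye(3), np.array([0., 0., 0.])
R2, t2 = np.eye(3), np.array([0.3, 0., 0.])   # second camera shifted along x

w1 = pixel_to_world((980., 545.), 3.0, K, R1, t1)
w2 = pixel_to_world((880., 545.), 3.0, K, R2, t2)
parallax = w1 - w2   # world-coordinate difference taken as the parallax value, as described above
```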
S3, inputting the parallax image, the multipath video images, the corresponding internal reference matrix and the corresponding external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by using the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image.
The parallax map and the video pictures from different viewing angles are input to the target tracking network, and each video picture is accompanied by its corresponding internal reference matrix and external reference matrix. In the target tracking network, geometric distortion correction processing is performed on the corresponding video picture according to the internal reference matrix so that the feature positions in the video picture are correct, and the same target presented in different video pictures is aligned to a world coordinate system according to the external reference matrix, so that the content of the video pictures can be analyzed accurately. One or more preset targets are detected in the corrected video, and a corresponding two-dimensional labeling result is generated for each detected target. The position of the target in three-dimensional space can then be determined by analyzing the two-dimensional labeling result of the target in the corrected video picture together with the parallax map, and the three-dimensional coordinates of the target are generated.
In an embodiment of the present invention, the inputting the parallax map, the multi-path video frame, the corresponding internal reference matrix, and the external reference matrix to the target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video frame by using the internal reference matrix and the external reference matrix, obtaining a two-dimensional labeling result of each target from the corrected video frame, and generating three-dimensional coordinates of each target based on the corrected video frame and the parallax map includes:
Inputting the parallax map, the multi-path video pictures and the corresponding internal reference matrix and external reference matrix into the target tracking network, correcting the distortion of each path of video pictures based on the internal reference matrix and the external reference matrix, and aligning each path of video pictures to a uniform preset reference viewing angle;
Generating a corresponding two-dimensional labeling result for each target in the corrected video picture; the two-dimensional labeling result comprises attribute and size information of the corresponding target;
and generating three-dimensional coordinates of each target based on the two-dimensional labeling result and the parallax map of each target.
And inputting the parallax image and the multipath video images into a target tracking network, and correcting distortion of the corresponding video images by utilizing an internal reference matrix of each video image, so as to correct geometric distortion brought by a camera. And unifying all corrected video pictures to the same visual angle by using the external parameter matrix. And generating a two-dimensional labeling result of each preset target in each corrected video picture, wherein the two-dimensional labeling result comprises the position, the attribute and the size information of the target in the video picture. And combining the two-dimensional labeling result and the parallax map, and calculating the three-dimensional coordinates of the corresponding target.
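For the last step, a conventional rectified-stereo triangulation sketch is given below as an illustration of recovering a 3-D point from a 2-D box and a disparity value. This is an assumption-laden stand-in (rectified views, horizontal baseline, pinhole model), not the patent's exact formulation, and every numerical value and parameter name is a placeholder.

```python
def bbox_center_to_3d(bbox, disparity, fx, fy, cx, cy, baseline):
    """Recover the 3-D point of a 2-D box center from its disparity, a horizontal
    baseline, and the intrinsic parameters (standard rectified-stereo relation)."""
    x, y, w, h = bbox                  # two-dimensional labeling result: [x, y, w, h]
    u, v = x + w / 2.0, y + h / 2.0    # box center in the corrected (rectified) picture
    Z = fx * baseline / disparity      # depth from disparity
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return X, Y, Z

print(bbox_center_to_3d((400, 300, 80, 60), disparity=25.0,
                        fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, baseline=0.3))
```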
In one embodiment of the present invention, the video picture spatial correction process includes:
Based on an image distortion correction algorithm, adjusting the position of each pixel point in the corresponding video picture according to the distortion coefficient in the internal reference matrix;
performing perspective transformation on the adjusted video picture based on the corresponding internal reference matrix;
and performing position correction on the video picture after perspective transformation by using the corresponding external parameter matrix.
Distortion coefficients are read from the internal reference matrix using an image distortion correction algorithm; these coefficients describe the degree of distortion of the image, where distortion includes, but is not limited to, radial distortion and tangential distortion, and the corresponding distortion coefficients can be read according to the type of distortion that needs to be corrected. The correct position of each pixel point in the video picture is calculated according to a preset mathematical equation from the position of the pixel point in the image and the distortion coefficients. The positions of the pixels are remapped over the whole video picture through the image distortion correction algorithm, so that the distortion influence is reduced or eliminated and the picture is closer to a real-world observation. The internal reference matrix is then used to perform a perspective transformation on each pixel point after distortion correction; through the perspective transformation, objects in the image visually exhibit the correct depth and perspective ratio. Finally, the corresponding external parameter matrix is applied to each video picture after perspective transformation, converting each point in the image from the camera coordinate system to the world coordinate system, thereby realizing the position correction of the video picture.
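The correction pipeline above can be sketched with OpenCV as follows. This is an illustrative approximation under stated assumptions (an approximately planar scene such as a table top, so a plane-induced homography aligns the views), not the patent's exact algorithm; the function and parameter names are hypothetical.

```python
import cv2
import numpy as np

def correct_frame(frame, K, dist, R, t, ref_K, ref_R, ref_t):
    """Illustrative spatial-correction sketch: (1) undistort with the internal reference
    matrix and distortion coefficients; (2) warp to a common reference view using the
    plane-induced homography H = K_ref[r1' r2' t'] (K[r1 r2 t])^-1 for the plane z = 0."""
    undistorted = cv2.undistort(frame, K, dist)
    P = K @ np.column_stack((R[:, 0], R[:, 1], t))
    P_ref = ref_K @ np.column_stack((ref_R[:, 0], ref_R[:, 1], ref_t))
    H = P_ref @ np.linalg.inv(P)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(undistorted, H, (w, h))
```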
In an embodiment of the present invention, after obtaining a two-dimensional labeling result of each target in each path of video frames based on the corrected video frames and the parallax map and generating three-dimensional coordinates of each target, the method further includes: and displaying the two-dimensional labeling result of each target in the corresponding video picture.
After the calculation of the disparity map and the generation of three-dimensional coordinates based thereon are completed, the two-dimensional position of each object in the video picture has been determined. The two-dimensional labeling results include position information (such as coordinates of a bounding box) and possibly other attribute information of the targets, and the two-dimensional labeling results (such as bounding boxes, labels, etc.) of each target are superimposed in the original video frame or the corrected video frame by drawing the bounding box, adding label descriptions or other visual cues, etc., so that an observer can intuitively recognize and locate each target in the video, as shown in fig. 2. In addition, three-dimensional position data may also be presented alongside the video or through an interactive interface.
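Overlaying two-dimensional labeling results on a video picture can be done with standard OpenCV drawing calls, as in the sketch below. The detection dictionary format ("bbox", "label", "score") is an illustrative assumption, not a format defined by the patent.

```python
import cv2

def draw_labels(frame, detections):
    """Draw each two-dimensional labeling result (bounding box plus label text) on a frame.
    Each detection is assumed to look like {"bbox": (x, y, w, h), "label": "obj1", "score": 0.93}."""
    for det in detections:
        x, y, w, h = map(int, det["bbox"])
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        text = f'{det["label"]} {det.get("score", 0):.2f}'
        cv2.putText(frame, text, (x, max(0, y - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1, cv2.LINE_AA)
    return frame
```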
In another embodiment of the present invention, there is also provided a training method of a multi-view target detection model, the training method including:
acquiring a plurality of paths of synchronously acquired video picture sets, and a two-dimensional labeling result set and a three-dimensional coordinate set of each path of video picture set corresponding to the video picture sets;
Inputting a plurality of paths of video picture sets, corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image feature sets of each path of video picture set, detecting targets in each path of video picture, and fusing the extracted image feature sets, the internal reference matrixes and the external reference matrixes to generate a parallax image set;
inputting the parallax image set, the multipath video image set, the internal reference matrix and the external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video image set by using the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video image, and generating a predicted three-dimensional coordinate set of each target based on the corrected video image set and the parallax image set;
And updating parameters of the multi-view target detection model based on the difference degree of the predicted two-dimensional labeling result set and the difference degree of the predicted three-dimensional coordinate set and the three-dimensional coordinate set to obtain a trained multi-view target detection model.
Referring to fig. 3, multiple paths of synchronously collected video data are collected, each path of video data corresponds to a set of two-dimensional labeling results and three-dimensional coordinate data, wherein the two-dimensional labeling results set comprises the position and attribute information of each target in a video frame; the three-dimensional coordinate set represents the position of each object in space. The multi-path video data and the corresponding internal reference matrix and external reference matrix are input into a parallax estimation network (namely L1 layer), the image characteristics of each path of video are extracted, and the characteristics are matched in a plurality of view angles to detect a target. And (5) merging the image characteristics and the inner and outer parameter matrix information under different camera angles, and calculating and generating a parallax image set. The disparity atlas and the inner and outer parametric matrices required for correction are input to the object tracking network (i.e., L2 layer), spatial correction is performed, and geometric and positional corrections are made to the multi-path video to ensure that the same object viewed from different perspectives is properly aligned in the image. Extracting a two-dimensional labeling result of each target in the corrected video picture, calculating a predicted three-dimensional coordinate by combining a parallax image set, comparing the predicted two-dimensional labeling and three-dimensional coordinate result with real labeling and coordinate data, calculating the difference degree, updating parameters of the multi-view target detection model according to the difference degree, and performing model training and optimization, so that a trained multi-view target detection model is obtained.
Specifically, in the invention, data annotation is described by taking an experimental video as an example. In a laboratory environment, the cameras of the examination room are stationary and the user operates on a specific laboratory table. Under this arrangement, static objects such as the desktop and the acquisition equipment become reference points for determining the spatial relationship among the multiple cameras, which helps simplify the labeling process. In the initial stage, these fixed objects are labeled once, and the labeled information can then be copied directly, avoiding repeated work. Dynamic objects used in the experiments, such as equipment, need to be labeled under the view angles of a plurality of cameras. Assuming two cameras are used, the annotation information for each object will include the coordinates and size under the two perspectives, as well as the coordinates of the circumscribed cube of the object in three-dimensional space. For example, the object obj1 is labeled [x1, y1, w1, h1] under the first camera and [x2, y2, w2, h2] under the second camera, where x1, y1 are the starting coordinates of the object frame under the first camera, w1, h1 are the width and height of that object frame, x2, y2 are the starting coordinates of the object frame under the second camera, and w2, h2 are the width and height of that object frame. The coordinates of the circumscribed cube in three-dimensional space are represented as [px1, py1, pz1, px2, py2, pz2, px3, py3, pz3], where px1, py1, pz1 are the coordinates of the first point of the cube, px2, py2, pz2 are the coordinates of the second point of the cube, and px3, py3, pz3 are the coordinates of the third point of the cube. It should be noted that the three points of the cube are not coplanar. Further, in terms of the labeled data, each pair of images needs to contain at least three fixed boundary matching points, and the labeling information of these points is used in the subsequent supervised learning of the internal and external parameters. In order to accurately reflect the experimental environment, the height of the experimental table top is set to zero during labeling and corresponds to the z-axis; the length of the table top is defined as the x-axis, with the center of the long side of the table top taken as the origin; and the width of the table top corresponds to the y-axis, also with its origin at the center of the long side. Such a coordinate system setting facilitates accurate mapping of the experimental environment and the spatial locations of the equipment, enabling efficient spatial analysis and tracking in a multi-camera system.
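The annotation record described above can be represented by a small data structure such as the sketch below. The class and field names, as well as all numerical values, are illustrative assumptions added for clarity.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectAnnotation:
    """One annotated object: a 2-D box per camera view plus three non-coplanar corner
    points of its circumscribed cube (field names and values are illustrative)."""
    name: str
    box_cam1: Tuple[float, float, float, float]     # [x1, y1, w1, h1] under the first camera
    box_cam2: Tuple[float, float, float, float]     # [x2, y2, w2, h2] under the second camera
    cube_corners: List[Tuple[float, float, float]]  # [(px1,py1,pz1), (px2,py2,pz2), (px3,py3,pz3)]

obj1 = ObjectAnnotation(
    name="obj1",
    box_cam1=(120.0, 80.0, 60.0, 40.0),
    box_cam2=(340.0, 95.0, 55.0, 42.0),
    cube_corners=[(0.10, 0.05, 0.00), (0.25, 0.20, 0.12), (0.10, 0.20, 0.00)],
)
```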
The labeled video images are input into the multi-view target detection model. The input dimension of the model is [b, c, h, w], where b represents the batch size, c represents the number of channels, which is 6 — the image of the top view and the image of the other view each occupy 3 channels — and h and w represent the height and width of the image, respectively. In the present invention, it is assumed that all cameras used are of the same model, so their internal reference matrices are assumed to be identical and can be initialized using predetermined parameters. The external reference matrix is initialized according to the specific camera layout; for example, an initial value of the external reference matrix may be set assuming a top view and a front view.
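A minimal PyTorch sketch of assembling the [b, 6, h, w] input described above is shown below; the tensor sizes and function name are illustrative assumptions.

```python
import torch

def build_model_input(top_view: torch.Tensor, other_view: torch.Tensor) -> torch.Tensor:
    """Concatenate two synchronized 3-channel views along the channel axis,
    giving the [b, c, h, w] input with c = 6 described above."""
    assert top_view.shape == other_view.shape and top_view.shape[1] == 3
    return torch.cat([top_view, other_view], dim=1)   # [b, 6, h, w]

batch = build_model_input(torch.rand(2, 3, 384, 640), torch.rand(2, 3, 384, 640))
print(batch.shape)   # torch.Size([2, 6, 384, 640])
```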
In combination with the existing two-dimensional target detection loss function, a three-dimensional space output and a space transformation matrix constraint are added, and the total difference degree (loss) L of the multi-view target detection model is shown in formula (1):
L = L1 + L2 + L3        (1)
Where L1 is the target loss of the first camera, L2 is the target loss of the second camera, and L3 is the spatial coordinate loss of the target. Specifically, the calculation of L1 is shown in formula (2):
L1 = Lpos1 + a1·Lobj1 + a2·Lcls1        (2)
Where Lpos1 is the position loss, Lobj1 is the target confidence loss, Lcls1 is the category loss, and a1 and a2 are preset weight coefficients. The position loss Lpos is used for measuring the difference between the position (coordinates) of the predicted bounding box and the real bounding box, and is calculated as shown in formula (3):

Lpos = 1 − IoU + p²/C² + α·γ        (3)

Where IoU is the intersection-over-union ratio of the prediction frame and the real frame, p is the center-point distance between the prediction frame and the real frame, C is the diagonal length of the minimum closure area capable of covering the prediction frame and the real frame at the same time, γ is the parameter measuring the consistency of the aspect ratio (length-width ratio), and α is the weight coefficient.
The target confidence loss Lobj1 is used to measure the difference between the predicted target confidence and the true target confidence. The target confidence represents the probability of whether a bounding box contains a target. For a bounding box in which a target actually exists, the target confidence should be 1; for a bounding box containing background or no target, the target confidence should be 0. Specifically, the calculation of the target confidence loss Lobj1 is as shown in formula (4):
Lobj = −∑i [ yi·log(pi) + (1 − yi)·log(1 − pi) ]        (4)
Where yi is the true target confidence, which is 1 for a bounding box that actually contains a target and 0 otherwise, and pi is the predicted target confidence.
The category loss Lcls1 is used to measure the difference between the predicted class probability and the true class. For each bounding box, the model predicts a class probability distribution. Specifically, the category loss is calculated as shown in formula (5):
Lcls = −∑i yi·log(pi)        (5)
Where yi is the true class probability distribution, typically a one-hot encoded vector, and pi is the predicted class probability distribution.
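A minimal sketch of how the per-camera loss of formula (2) could be assembled is given below. It follows the binary cross-entropy of formula (4) and the cross-entropy of formula (5); the tensor shapes, default weights, and the precomputed position loss are illustrative assumptions rather than the patent's training code.

```python
import torch
import torch.nn.functional as F

def camera_branch_loss(pred_obj, true_obj, pred_cls, true_cls, l_pos,
                       a1: float = 1.0, a2: float = 1.0) -> torch.Tensor:
    """L1 = Lpos1 + a1*Lobj1 + a2*Lcls1, with Lobj as binary cross-entropy (formula (4))
    and Lcls as cross-entropy against a one-hot distribution (formula (5));
    l_pos (the box/position loss of formula (3)) is passed in precomputed."""
    l_obj = F.binary_cross_entropy(pred_obj, true_obj)                               # formula (4)
    l_cls = -(true_cls * torch.log(pred_cls.clamp_min(1e-8))).sum(dim=-1).mean()     # formula (5)
    return l_pos + a1 * l_obj + a2 * l_cls

loss = camera_branch_loss(pred_obj=torch.tensor([0.9, 0.2]),
                          true_obj=torch.tensor([1.0, 0.0]),
                          pred_cls=torch.tensor([[0.7, 0.2, 0.1]]),
                          true_cls=torch.tensor([[1.0, 0.0, 0.0]]),
                          l_pos=torch.tensor(0.3))
```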
It can be understood that the target loss of the second camera, L2 = Lpos2 + a1·Lobj2 + a2·Lcls2, is calculated in the same way as that of the first camera, so the position loss Lpos2, the target confidence loss Lobj2, and the category loss Lcls2 are not described again here. The spatial coordinate loss L3 of the target is obtained by combining the sampling regression loss and the spatial overlap ratio loss, as shown in formula (6):
L3 = LHuber + (1 − Lcubic)        (6)
Where LHuber is the sampling regression loss (Huber loss) and Lcubic is the spatial overlap ratio loss. The sampling regression loss LHuber is shown in formula (7):

LHuber = 0.5·(y − ŷ)²             if |y − ŷ| ≤ δ
LHuber = δ·(|y − ŷ| − 0.5·δ)      otherwise        (7)

Where y is the true value, ŷ is the predicted value, and δ is the harmonic parameter used to balance the loss: the loss behaves like the mean square error (MSE) when the predicted value is close to the true value, and provides a linear loss increase when the predicted value differs significantly from the true value, enhancing the robustness of the model. Further, the spatial overlap ratio loss Lcubic is defined as the overlap ratio Lcubic = Vinter / Vunion, where Vinter is the intersection volume of the predicted and real bounding cubes and Vunion is the union volume of the predicted and real bounding cubes. Specifically, Vinter and Vunion are computed in formula (8) from the corner coordinates of the predicted and real circumscribed cubes: the coordinates of one corner point of the predicted cube and of the corresponding corner point of the real cube, the coordinates of their respective diagonally opposite corner points, and the coordinates of a further corner point of each cube, where the three corner points of each cube are not coplanar. Furthermore, in the invention, the difference degree between the calculated parallax map and the real parallax map can be added to the total difference degree, so that the finally obtained parallax map is more accurate.
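The sketch below illustrates formula (6): a Huber regression term plus one minus a spatial overlap ratio. For tractability the overlap is computed for axis-aligned cubes defined by two opposite corners — a simplifying assumption, since the patent describes cubes by three non-coplanar corner points — and all values and function names are illustrative.

```python
import numpy as np

def huber_loss(y_true: np.ndarray, y_pred: np.ndarray, delta: float = 1.0) -> float:
    """Huber regression loss of formula (7): quadratic for small errors,
    linear for large errors, balanced by the harmonic parameter delta."""
    err = np.abs(y_true - y_pred)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quad, lin)))

def cubic_iou(box_a, box_b) -> float:
    """Axis-aligned approximation of Lcubic = Vinter / Vunion.
    Each box is (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    inter_dims = np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0, None)
    v_inter = inter_dims.prod()
    v_a = (a[3:] - a[:3]).prod()
    v_b = (b[3:] - b[:3]).prod()
    return float(v_inter / (v_a + v_b - v_inter + 1e-9))

# L3 = LHuber + (1 - Lcubic), as in formula (6)
l3 = huber_loss(np.array([0.1, 0.2, 0.0]), np.array([0.12, 0.25, -0.02])) \
     + (1.0 - cubic_iou((0, 0, 0, 1, 1, 1), (0.1, 0.1, 0.0, 1.1, 1.0, 0.9)))
print(l3)
```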
Referring to fig. 4, the system 100 for multi-view object detection includes: a data acquisition module 110, a disparity map generation module 120, and a target detection module 130. The data acquisition module 110 is configured to synchronously acquire multiple paths of video frames, and acquire an internal reference matrix and an external reference matrix of a corresponding camera; the disparity map generating module 120 is configured to input multiple paths of video frames and corresponding internal and external reference matrices into a disparity estimation network of a multi-view target detection model, extract image features of each path of video frames, detect targets in each path of video frames, and fuse each path of extracted image features and corresponding internal and external reference matrices to generate a disparity map; the object detection module 130 is configured to input the disparity map, the multiple video frames, and the corresponding internal reference matrix and external reference matrix to the object tracking network of the multi-view object detection model, perform spatial correction of the corresponding video frames by using the internal reference matrix and the external reference matrix, obtain a two-dimensional labeling result of each object from the corrected video frames, and generate three-dimensional coordinates of each object based on the corrected video frames and the disparity map.
Specific limitations regarding the system for multi-view object detection may be found in the above description of the multi-view object detection method, and will not be repeated here. The various modules in the system for multi-view object detection described above may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware within, or independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke the operations corresponding to the above modules.
It should be noted that, in order to highlight the innovative part of the present invention, no module that is not very close to solving the technical problem presented by the present invention is introduced in the present embodiment, but it does not indicate that other modules are not present in the present embodiment.
Referring to fig. 5, the electronic device 1 may include a memory 12, a processor 13, and a bus, and may further include a computer program stored in the memory 12 and executable on the processor 13, such as a program for multi-view object detection.
The memory 12 includes at least one type of readable storage medium including flash memory, a removable hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD) or the like, which are provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only for storing application software installed in the electronic apparatus 1 and various types of data, such as a code for multi-view object detection, etc., but also for temporarily storing data that has been output or is to be output.
The processor 13 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects the respective components of the entire electronic device 1 using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., programs for multi-view object detection, etc.) stored in the memory 12, and calling data stored in the memory 12.
The processor 13 executes the operating system of the electronic device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps in the multi-view object detection method described above.
Illustratively, the computer program may be split into one or more modules that are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more modules may be a series of instruction segments of a computer program capable of performing a specific function for describing the execution of the computer program in the electronic device 1. For example, the computer program may be divided into a data acquisition module 110, a disparity map generation module 120, and a target detection module 130.
The integrated units implemented in the form of software functional modules may be stored in a computer readable storage medium, which may be non-volatile or volatile. The software functional module is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or a processor (processor) to perform part of the functions of the multi-view object detection method according to the embodiments of the present application.
In summary, the training method, system, device and medium for multi-view target detection and model disclosed by the invention adopt a two-dimensional detection network model and integrate a binocular shooting measurement technology to realize the output of the three-dimensional coordinates of the target from multi-camera data, and are convenient to implement and maintain due to the simplicity of the model structure. In addition, in the invention, images of multiple cameras are synchronously input and used for detection, so that the identification and association of the same target under different visual angles are realized. The method enhances the utilization of space information, and compared with a single camera system, the method remarkably improves the detection precision and reliability. Compared with the traditional method, the multi-camera target detection scheme avoids dependence on a large amount of point cloud data and multi-scale scaling, so that the model is lighter. This feature makes it useful and efficient in industrial applications, especially where computing resources and processing speeds are critical. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles of the present invention and its effectiveness, and are not intended to limit the invention. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein are intended to be covered by the claims of the present invention.
Claims (10)
1. A multi-view target detection method, the method comprising:
Synchronously acquiring multiple paths of video pictures, and acquiring an internal reference matrix and an external reference matrix of a corresponding camera;
Inputting multiple paths of video pictures and corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image characteristics of each path of video pictures, detecting targets in each path of video pictures, and fusing the extracted image characteristics of each path of video pictures and the corresponding internal reference matrixes and external reference matrixes to generate a parallax image;
inputting the parallax image, the multipath video images, the corresponding internal reference matrix and the corresponding external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by using the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image.
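As an aid to reading claim 1, the following is a minimal sketch of the two-stage data flow it recites; the Python function names and signatures are illustrative assumptions, not the claimed implementation.

```python
# Hypothetical sketch of the claim-1 data flow; disparity_net and tracking_net
# stand in for the parallax estimation network and the target tracking network.
def detect_multi_view(frames, intrinsics, extrinsics, disparity_net, tracking_net):
    """frames: synchronously captured images, one per camera path.
    intrinsics / extrinsics: the per-camera internal and external reference matrices."""
    # Stage 1: extract per-view features, detect targets, and fuse the features
    # with the camera matrices into a disparity map.
    features, detections, disparity_map = disparity_net(frames, intrinsics, extrinsics)

    # Stage 2: spatially correct each view with its camera matrices, produce a
    # two-dimensional labeling result per target, and combine the corrected
    # views with the disparity map into three-dimensional coordinates.
    labels_2d, coords_3d = tracking_net(disparity_map, frames, intrinsics, extrinsics)
    return labels_2d, coords_3d
```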
2. The multi-view object detection method according to claim 1, wherein the inputting the multi-path video frames and the corresponding internal and external reference matrices into the parallax estimation network of the multi-view object detection model, extracting the image features of each path of video frames, detecting the object in each path of video frames, and fusing the extracted image features and the corresponding internal and external reference matrices to generate the parallax map comprises:
Inputting multiple paths of video pictures and corresponding internal reference matrixes and external reference matrixes into the parallax estimation network, and extracting image features of each path of video pictures to obtain the image features of each path of video pictures;
detecting targets in each path of video pictures based on the extracted image characteristics of each path of video pictures;
for each target:
Performing feature matching on each extracted image feature to obtain image features of the same target under the view angles of all cameras; and
Calculating the parallax value of the same target according to the image characteristics of the same target under the visual angles of all paths of cameras and the internal reference matrix and the external reference matrix of the corresponding cameras;
generating a disparity map according to the calculated disparity value of each target.
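The cross-view feature matching step of claim 2 can be pictured with a simple nearest-neighbour association of per-target feature vectors; cosine similarity and the greedy assignment below are assumptions made purely for illustration.

```python
import numpy as np

def match_target_features(feats_a, feats_b):
    """feats_a, feats_b: (N, D) and (M, D) per-target feature vectors from two camera views.
    Returns, for each target in view A, the index of the best-matching target in view B."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    similarity = a @ b.T              # pairwise cosine similarity
    return similarity.argmax(axis=1)  # greedy nearest-neighbour assignment
```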
3. The multi-view target detection method according to claim 2, wherein the calculating the parallax value of the same target according to the image features of the same target under the visual angles of all paths of cameras and the internal reference matrix and the external reference matrix of the corresponding cameras comprises:
Identifying and recording image coordinates of each path of image features of the same target under the corresponding camera view angles;
converting the image coordinates of each path of image characteristics into three-dimensional coordinates in the corresponding camera according to the internal reference matrix of the camera;
Converting three-dimensional coordinates of each path of image characteristics into world coordinates in a world coordinate system according to an external parameter matrix of the camera;
And calculating the difference value of the world coordinates corresponding to each path of image features, and taking the calculated difference value as the parallax value of the same target.
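The coordinate conversions recited in claim 3 may be sketched as follows; back-projecting each image point at unit depth is an assumption made here for illustration, since the claim does not fix the depth used before the camera-to-world transformation.

```python
import numpy as np

def disparity_from_two_views(uv_a, uv_b, K_a, K_b, T_a, T_b):
    """uv_*: pixel coordinates (u, v) of the same target in two camera views.
    K_*:  3x3 internal reference (intrinsic) matrices.
    T_*:  4x4 camera-to-world external reference (extrinsic) matrices."""
    def to_world(uv, K, T):
        pix = np.array([uv[0], uv[1], 1.0])
        cam = np.linalg.inv(K) @ pix      # image coordinates -> camera coordinates (unit depth)
        cam_h = np.append(cam, 1.0)       # homogeneous camera point
        return (T @ cam_h)[:3]            # camera coordinates -> world coordinates
    # Difference of the per-view world coordinates, taken as the parallax value.
    return to_world(uv_a, K_a, T_a) - to_world(uv_b, K_b, T_b)
```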
4. The multi-view object detection method according to claim 1, wherein the inputting the disparity map, the multi-path video picture, and the corresponding internal reference matrix and external reference matrix into the object tracking network of the multi-view object detection model, performing spatial correction of the corresponding video picture using the internal reference matrix and the external reference matrix, obtaining a two-dimensional labeling result of each object from the corrected video picture, and generating three-dimensional coordinates of each object based on the corrected video picture and the disparity map, comprises:
Inputting the parallax map, the multi-path video pictures and the corresponding internal reference matrix and external reference matrix into the target tracking network, correcting the distortion of each path of video pictures based on the internal reference matrix and the external reference matrix, and aligning each path of video pictures to a uniform preset reference viewing angle;
Generating a corresponding two-dimensional labeling result for each target in the corrected video picture; the two-dimensional labeling result comprises attribute and size information of the corresponding target;
and generating three-dimensional coordinates of each target based on the two-dimensional labeling result and the parallax map of each target.
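For the last step of claim 4, a standard rectified-stereo relation between disparity and depth is shown below as an illustrative stand-in; the claim itself does not spell out this exact formula.

```python
import numpy as np

def target_3d_from_disparity(u, v, disparity, fx, fy, cx, cy, baseline):
    """(u, v): centre of the target's two-dimensional labeling box, in pixels.
    disparity: value read from the disparity map at (u, v), in pixels.
    fx, fy, cx, cy: intrinsics of the reference camera; baseline: camera spacing."""
    z = fx * baseline / disparity   # depth from disparity
    x = (u - cx) * z / fx           # back-project to camera X
    y = (v - cy) * z / fy           # back-project to camera Y
    return np.array([x, y, z])
```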
5. The multi-view object detection method according to claim 1, wherein the process of video picture spatial correction comprises:
Based on an image distortion correction algorithm, adjusting the position of each pixel point in the corresponding video picture according to the distortion coefficient in the internal reference matrix;
performing perspective transformation on the adjusted video picture based on the corresponding internal reference matrix;
and performing position correction on the video picture after perspective transformation by using the corresponding external parameter matrix.
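A minimal OpenCV sketch of the three correction steps of claim 5 follows, assuming that the distortion coefficients are supplied separately from the internal reference matrix and that the perspective and position corrections have been folded into a single precomputed homography.

```python
import cv2

def rectify_view(frame, K, dist_coeffs, H_to_reference):
    """frame: one video picture; K: 3x3 internal reference matrix;
    dist_coeffs: lens distortion coefficients; H_to_reference: 3x3 homography
    aligning this view to the unified preset reference viewing angle."""
    h, w = frame.shape[:2]
    undistorted = cv2.undistort(frame, K, dist_coeffs)                  # step 1: distortion correction
    aligned = cv2.warpPerspective(undistorted, H_to_reference, (w, h))  # steps 2-3: perspective and position correction
    return aligned
```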
6. The multi-view target detection method according to claim 1, wherein after the two-dimensional labeling result of each target is acquired from the corrected video pictures and the three-dimensional coordinates of each target are generated based on the corrected video pictures and the disparity map, the method further comprises: displaying the two-dimensional labeling result of each target in the corresponding video picture.
7. A method for training a multi-view target detection model, the method comprising:
acquiring a plurality of paths of synchronously collected video picture sets, together with the two-dimensional labeling result set and the three-dimensional coordinate set corresponding to each path of video picture set;
Inputting a plurality of paths of video picture sets, corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of a multi-view target detection model, extracting image feature sets of each path of video picture set, detecting targets in each path of video picture, and fusing the extracted image feature sets, the internal reference matrixes and the external reference matrixes to generate a parallax image set;
inputting the parallax image set, the multipath video picture sets, the internal reference matrix and the external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video picture sets by using the internal reference matrix and the external reference matrix, acquiring a predicted two-dimensional labeling result set of each target from the corrected video pictures, and generating a predicted three-dimensional coordinate set of each target based on the corrected video picture sets and the parallax image set;
and updating parameters of the multi-view target detection model based on the degree of difference between the predicted two-dimensional labeling result set and the two-dimensional labeling result set, and the degree of difference between the predicted three-dimensional coordinate set and the three-dimensional coordinate set, to obtain a trained multi-view target detection model.
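A hedged sketch of the parameter update described in claim 7 is given below; the choice of loss functions (smooth-L1 for the two-dimensional labeling results, mean squared error for the three-dimensional coordinates) and the use of PyTorch are assumptions made purely for illustration.

```python
import torch.nn.functional as F

def training_step(model, optimizer, frames, intrinsics, extrinsics, gt_labels_2d, gt_coords_3d):
    """One update of the multi-view target detection model from a synchronised batch."""
    pred_labels_2d, pred_coords_3d = model(frames, intrinsics, extrinsics)
    loss_2d = F.smooth_l1_loss(pred_labels_2d, gt_labels_2d)  # difference of 2D labeling results
    loss_3d = F.mse_loss(pred_coords_3d, gt_coords_3d)        # difference of 3D coordinates
    loss = loss_2d + loss_3d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```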
8. A system for multi-perspective object detection, the system comprising:
The data acquisition module is used for synchronously acquiring multiple paths of video pictures and acquiring an internal reference matrix and an external reference matrix of the corresponding camera;
The parallax image generation module is used for inputting multiple paths of video images and corresponding internal reference matrixes and external reference matrixes into a parallax estimation network of the multi-view target detection model, extracting image characteristics of each path of video images, detecting targets in each path of video images, and fusing each path of extracted image characteristics with the corresponding internal reference matrixes and external reference matrixes to generate a parallax image;
The target detection module is used for inputting the parallax image, the multipath video images, the corresponding internal reference matrix and the corresponding external reference matrix into a target tracking network of the multi-view target detection model, performing spatial correction of the corresponding video images by utilizing the internal reference matrix and the external reference matrix, acquiring a two-dimensional labeling result of each target from the corrected video images, and generating three-dimensional coordinates of each target based on the corrected video images and the parallax image.
9. An electronic device, comprising:
One or more processors;
Storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the steps of the multi-view target detection method of any one of claims 1 to 6 or of the multi-view target detection model training method of claim 7.
10. A computer-readable storage medium, characterized by: a computer program stored thereon, which, when executed by a processor of a computer, causes the computer to perform the steps of the multi-view target detection method according to any one of claims 1 to 6 or the multi-view target detection model training method according to claim 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410417868.5A CN118314497A (en) | 2024-04-09 | 2024-04-09 | Multi-view target detection and model training method, system, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410417868.5A CN118314497A (en) | 2024-04-09 | 2024-04-09 | Multi-view target detection and model training method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118314497A (en) | 2024-07-09
Family
ID=91731371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410417868.5A Pending CN118314497A (en) | 2024-04-09 | 2024-04-09 | Multi-view target detection and model training method, system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118314497A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||