CN117036448B - Scene construction method and system of multi-view camera - Google Patents
- Publication number
- CN117036448B CN117036448B CN202311300861.7A CN202311300861A CN117036448B CN 117036448 B CN117036448 B CN 117036448B CN 202311300861 A CN202311300861 A CN 202311300861A CN 117036448 B CN117036448 B CN 117036448B
- Authority
- CN
- China
- Prior art keywords
- event
- view camera
- human body
- dimensional
- constructing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Geometry (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Computer Graphics (AREA)
- Length Measuring Devices By Optical Means (AREA)
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of cameras, and particularly relates to a scene construction method and system of a multi-view camera. The method comprises the following steps: step 1: constructing a multi-information-source event sensor based on a multi-view camera array; step 2: constructing an event trigger source in a complex three-dimensional space area; step 3: constructing a trigger object based on three-dimensional human body posture key points; step 4: triggering and recording an event. According to the method, the multi-information-source event sensor is built from a multi-view camera array, achieving three-dimensional reconstruction of the human body from a plurality of view angles and thereby completing event sensing; meanwhile, a programmable three-dimensional space is adopted as the event trigger source, realizing event triggering for complex structures.
Description
Technical Field
The invention belongs to the technical field of cameras, and particularly relates to a scene construction method and system of a multi-view camera.
Background
Three-dimensional reconstruction (3D reconstruction) is the establishment of a mathematical model of a three-dimensional object suitable for computer representation and processing. It is the basis for processing, operating on and analyzing three-dimensional objects in a computer environment, and a key technology for creating virtual reality that expresses the objective world in a computer.
In computer vision, three-dimensional reconstruction refers to the process of recovering three-dimensional information from single-view or multi-view images. Because the information in a single view is incomplete, reconstruction from it must rely on empirical knowledge, whereas three-dimensional reconstruction from multiple views (analogous to human binocular vision) is comparatively easy.
In three-dimensional reconstruction, constructing an event trigger source for a complex three-dimensional space region is key; event triggering and recording are the marking and recording performed during video processing to accurately locate and describe scenes with activity or abnormalities.
In currently common uncoupled video scenes, events are typically defined in a single view only and described by a simple two-dimensional image region. This is prone to large numbers of missed and false detections and makes complex scenes difficult to handle; moreover, complex event definitions are often hard to express in a single view.
Disclosure of Invention
Therefore, a main object of the present invention is to provide a scene construction method and system for a multi-view camera. The method uses a multi-view camera array to construct a multi-information-source event sensor, realizing three-dimensional reconstruction of the human body from a plurality of view angles and completing event sensing, while adopting a programmable three-dimensional space as the event trigger source to realize event triggering for complex structures.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a scene construction method of a multi-view camera, the method comprising the steps of:
step 1: the construction of the multi-information source event sensor based on the multi-view camera array specifically comprises the following steps:
step 1.1: performing internal parameter calibration of the multi-view camera;
step 1.2: performing external parameter calibration of the multi-view camera and bundling adjustment of parameters of the multi-view camera;
step 1.3: the multi-view camera acquires multiple views, performs three-dimensional reconstruction and tracking of human bodies based on the multiple views, and completes construction of the multi-information-source event sensor;
step 2: constructing an event trigger source in a complex three-dimensional space area;
step 3: constructing a trigger object based on three-dimensional human body posture key points;
step 4: the event triggering and recording method specifically comprises the following steps:
step 4.1: operating a multi-information source event sensor based on a multi-view camera array;
step 4.2: continuously detecting and reconstructing human body information;
step 4.3: checking the inclusion relation between the human body information and the set three-dimensional space region forming the event trigger source. If the inclusion condition is met and the event has not yet been triggered, detecting whether the human body posture matches the event-start posture key point signal registered with the event, and if so, triggering the event corresponding to the event trigger source and starting recording. If the event has already been triggered, detecting whether the posture matches the event-continuation posture key point signal, and if so, continuing the recording, otherwise ending the event recording. If the inclusion condition is not satisfied and the event has been triggered, the event recording is ended. (A minimal sketch of this loop follows.)
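The step-4 logic amounts to a small state machine per trigger source. The following is a minimal Python sketch of that loop under assumed helper predicates (`contains`, `matches`) and an assumed data layout; the patent does not specify an API, so all names here are illustrative.

```python
# Minimal sketch of the step-4 trigger-and-record loop. `contains` and
# `matches` are hypothetical predicates standing in for the inclusion
# check (step 4.3) and the pose key point signal checks (step 3).
from dataclasses import dataclass, field

@dataclass
class EventTriggerSource:
    region: object            # polyhedral 3D region from step 2
    start_signal: object      # event-start pose key point signal
    continue_signal: object   # event-continuation pose key point signal
    triggered: bool = False
    frames: list = field(default_factory=list)

def process_frame(source, body, frame, contains, matches):
    """body: reconstructed 3D human key points for the current frame."""
    inside = contains(source.region, body)
    if inside and not source.triggered:
        if matches(body, source.start_signal):   # start condition met
            source.triggered = True
            source.frames.append(frame)          # begin recording
    elif inside and source.triggered:
        if matches(body, source.continue_signal):
            source.frames.append(frame)          # keep recording
        else:
            source.triggered = False             # signal lost: end event
    elif source.triggered:                       # left the region
        source.triggered = False                 # end recording
```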
Further, the method for constructing the event trigger source in the complex three-dimensional space region in the step 2 includes: step 2.1: selecting a critical point of an event area in the multiple views; step 2.2: calculating a critical point under a camera coordinate system based on a triangulation method; step 2.3: and constructing an event three-dimensional space region which is wrapped by the polyhedron, and taking the event three-dimensional space region as an event trigger source of the region.
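Step 2.2 recovers each critical point in the camera coordinate system from its pixel locations in the multiple views. The patent does not name a specific triangulation algorithm; the standard linear (DLT) triangulation below, using numpy, is one common realization given the calibrated projection matrices from step 1.

```python
import numpy as np

def triangulate_point(projections, pixels):
    """Linear (DLT) triangulation of one critical point.

    projections: list of 3x4 camera projection matrices (steps 1.1-1.2)
    pixels: list of (u, v) observations of the point, one per view
    Returns the 3D point in the common camera coordinate system.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])   # u * (p3 . X) - p1 . X = 0
        rows.append(v * P[2] - P[1])   # v * (p3 . X) - p2 . X = 0
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                          # null-space direction
    return X[:3] / X[3]                 # dehomogenize
```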
Further, the method for constructing the event three-dimensional space region wrapped by the polyhedron in step 2.3 includes: based on the critical points under the camera coordinate system, constructing the boundary of a regular N-hedron and wrapping the event area inside that boundary. The value of N must satisfy a constraint condition (given as a formula in the original, not reproduced in this text source) in which n is the number of views of the multi-view camera and (x_n, y_n) are the coordinates of the critical point.
Further, the method for constructing the triggering object based on the three-dimensional human body posture key point in the step 3 includes: step 3.1: selecting an event trigger source; step 3.2: and defining human body posture key point signals of a plurality of events and human body posture key point signals of continuous events, registering the human body posture key point signals and the corresponding events in an event trigger source, and completing the construction of a trigger object.
Further, the method for performing internal reference calibration in step 1.1 includes: combining a plurality of groups of high-precision cubes of known sizes into irregular cubes of mutually different shapes; shooting the irregular cubes from a plurality of different angles with the multi-view camera to obtain a plurality of groups of shooting results; generating depth data for each group of shooting results and projecting the depth data back into three-dimensional space under the camera's local coordinate system through the inverse of the perspective projection, generating normal maps with the same number of frames, and segmenting the planes of the calibration object on the normal maps to obtain a plurality of groups of plane depth data; fusing the plurality of groups of plane depth data into fused depth data; projecting the fused depth data back into three-dimensional space under the camera's local coordinate system through the inverse of the perspective projection to obtain the three-dimensional point set corresponding to each plane; fitting each three-dimensional point set by least squares to obtain its corresponding plane; calculating the included angles and distances between the marked planes from the fitted planes; actually measuring the included angles and distances between the marked planes of the high-precision cubes, comparing the computed included angles and distances against the measured ones, constructing an optimization objective function that minimizes their difference, and optimizing the internal parameters of the multi-view camera with a nonlinear iterative optimization method so as to minimize that objective function, thereby completing the internal parameter calibration of the multi-view camera. (A sketch of the plane fitting and residual computation follows.)
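The least-squares plane fitting and the angle comparison at the heart of this calibration can be sketched as follows; the exact form of the optimized objective is the patent's, so the residual below only illustrates the angle term (distance terms would be accumulated analogously), and all function names are illustrative.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an (N, 3) point set.
    Returns (unit normal n, offset d) with n . x + d = 0."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)
    n = vt[-1]                      # direction of least variance
    return n, -float(n @ c)

def plane_angle(n1, n2):
    """Included angle between two fitted planes, in radians."""
    return float(np.arccos(np.clip(abs(n1 @ n2), 0.0, 1.0)))

def angle_residual(point_sets, measured_angles):
    """Sum of squared differences between fitted and measured plane
    angles -- one term of the calibration objective to be minimized
    over the camera intrinsics by a nonlinear iterative optimizer."""
    normals = [fit_plane(ps)[0] for ps in point_sets]
    return sum((plane_angle(normals[i], normals[j]) - a) ** 2
               for (i, j), a in measured_angles.items())
```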
Further, the data fusion of the multiple groups of plane depth data into fused depth data is computed by a formula (not reproduced in this text source) in which R is the fused depth data, m is the number of groups of plane depth data, o is the number of cubes contained in each set of high-precision cube combinations, r_i is the plane depth data, S is the surface area of the irregular cube, and M is the average number of faces of the cubes in each set of high-precision cube combinations.
Further, the method for performing external parameter calibration in step 1.2 comprises the following steps: acquiring the included angles of the multi-view camera in three regular directions; then obtaining the projections of the external parameters of the multi-view camera in the three regular directions under those included angles, yielding three external parameter projection sets, the three regular directions being a first, a second and a third regular direction. The fitting error for each of the three regular directions is calculated by a formula (not reproduced in this text source) in which l is the number of parameters in the external parameter projection set for each regular direction, w_1 is the error calculation function, y_l is the external parameter of the multi-view camera in a given regular direction, ŷ_l is the projection of that external parameter, and θ_k is the included angle of the multi-view camera in that regular direction. The regular direction with the minimum fitting error is taken as the standard projection direction, and the external parameter projection set obtained by projecting in that direction is taken as the external parameters.
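Since the text source does not reproduce the fitting-error formula, the sketch below assumes a simple squared-difference form for w_1 and shows only the selection logic: compute one error per regular direction and keep the projection set with the smallest error as the extrinsics.

```python
import numpy as np

def fitting_error(y, y_hat):
    """Assumed squared-difference stand-in for the error function w_1
    applied over the l parameters of one projection set."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.sum((y - y_hat) ** 2))

def select_extrinsics(per_direction):
    """per_direction: {direction_name: (extrinsics y, projection y_hat)}
    for the three regular directions. Returns the name of the standard
    projection direction and its projection set."""
    errors = {k: fitting_error(y, yh) for k, (y, yh) in per_direction.items()}
    best = min(errors, key=errors.get)
    return best, per_direction[best][1]
```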
Further, the method for performing bundle adjustment of the parameters of the multi-view camera in step 1.2 is a parallel bundle adjustment method.
Further, the method in step 1.3 by which the multi-view camera acquires multiple views, performs three-dimensional reconstruction and tracking of the human body based on those views, and completes the construction of the multi-information-source event sensor comprises: under each view, performing three-dimensional reconstruction and tracking of the human body to obtain a plurality of reconstruction and tracking results; for each result, constructing an information-source event sensor; and combining the individual information-source event sensors into the multi-information-source event sensor.
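One way to read this combination step is as simple composition: each per-view reconstruction-and-tracking result feeds its own event sensor, and the multi-information-source sensor pools their outputs. The class below is an illustrative sketch under that reading; the patent only states that the per-source sensors are "combined", so the `perceive` interface is an assumption.

```python
class MultiSourceEventSensor:
    """Illustrative composition of per-view event sensors; each sensor
    is assumed to expose perceive(frame) -> list of detections."""
    def __init__(self, sensors):
        self.sensors = sensors

    def perceive(self, frame):
        detections = []
        for sensor in self.sensors:
            detections.extend(sensor.perceive(frame))  # pool per-view results
        return detections
```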
A scene construction system of a multi-view camera, the system comprising: a multi-information-source event sensor construction unit, configured to construct the multi-information-source event sensor based on the multi-view camera array; a regional event trigger source construction unit, configured to construct the event trigger source of the complex three-dimensional space region; a trigger object construction unit, configured to construct the trigger object based on three-dimensional human body posture key points; and an event processing unit, configured to perform event triggering and recording.
The scene construction method and system of the multi-view camera have the following beneficial effects. By constructing the multi-information-source event sensor from a multi-view camera array, the invention can perceive complex scenes effectively and is highly robust to occlusion: the complementary information of multiple view angles enables three-dimensional reconstruction of the human body and thus completion of the event sensing task. Meanwhile, the invention adopts a programmable three-dimensional space area as the event trigger source, and by programming that area it constructs trigger sources with various event types and complex structures. In addition, three-dimensional human body posture key points serve as the event triggering objects: the corresponding events are triggered and recorded by detecting specific postures of those key points and the interaction between the key points and the three-dimensional space area.
Drawings
Fig. 1 is a schematic flow chart of a scene construction method of a multi-view camera according to an embodiment of the present invention.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings.
Example 1
A scene construction method of a multi-view camera, the method comprising the steps of:
step 1: the construction of the multi-information source event sensor based on the multi-view camera array specifically comprises the following steps:
step 1.1: performing internal parameter calibration of the multi-view camera;
step 1.2: performing external parameter calibration of the multi-view camera and bundling adjustment of parameters of the multi-view camera;
step 1.3: the multi-view camera acquires multiple views, performs three-dimensional reconstruction and tracking of human bodies based on the multiple views, and completes construction of the multi-information-source event sensor;
step 2: constructing an event trigger source in a complex three-dimensional space area;
step 3: constructing a trigger object based on three-dimensional human body posture key points;
step 4: the event triggering and recording method specifically comprises the following steps:
step 4.1: operating a multi-information source event sensor based on a multi-view camera array;
step 4.2: continuously detecting and reconstructing human body information;
step 4.3: checking the inclusion relation between the human body information and the set three-dimensional space region forming the event trigger source. If the inclusion condition is met and the event has not yet been triggered, detecting whether the human body posture matches the event-start posture key point signal registered with the event, and if so, triggering the event corresponding to the event trigger source and starting recording. If the event has already been triggered, detecting whether the posture matches the event-continuation posture key point signal, and if so, continuing the recording, otherwise ending the event recording. If the inclusion condition is not satisfied and the event has been triggered, the event recording is ended.
Specifically, in computer vision, three-dimensional reconstruction is the process of recovering three-dimensional information from single-view or multi-view images. Because the information in a single view is incomplete, reconstruction from it must rely on prior knowledge, whereas multi-view three-dimensional reconstruction can build the three-dimensional model from the two-dimensional images of more viewpoints. However, most current three-dimensional reconstruction algorithms do not exploit the two-dimensional information accurately and comprehensively, and their computation depends too heavily on information supplied by external equipment, such as the depth information provided by a depth camera, or on the segmentation of target and background, so the reconstructed results remain rough.
According to the invention, the multi-information-source event sensor constructed from a multi-view camera array can perceive complex scenes effectively. The result does not depend on segmenting target from background; instead, the corresponding event is triggered and recorded by detecting specific postures of the three-dimensional human posture key points and the interaction between the key points and the three-dimensional space region, so the reconstruction is more accurate and better suited to complex three-dimensional scenes.
Example 2
On the basis of the above embodiment, the method for constructing the complex three-dimensional space region event trigger source in the step 2 includes: step 2.1: selecting a critical point of an event area in the multiple views; step 2.2: calculating a critical point under a camera coordinate system based on a triangulation method; step 2.3: and constructing an event three-dimensional space region which is wrapped by the polyhedron, and taking the event three-dimensional space region as an event trigger source of the region.
Specifically, the event trigger source defines the area in which an event is triggered.
Example 3
On the basis of the above embodiment, the method for constructing the event three-dimensional space region enclosed by the polyhedron in step 2.3 includes: based on the critical points under the camera coordinate system, constructing the boundary of a regular N-hedron and wrapping the event area inside that boundary. The value of N must satisfy a constraint condition (given as a formula in the original, not reproduced in this text source) in which n is the number of views of the multi-view camera and (x_n, y_n) are the coordinates of the critical point.
Specifically, the larger the value of N for the regular N-hedron, in general, the more accurately the event area is delimited and the more accurate the subsequent scene construction result.
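At trigger time (step 4.3), the inclusion check between reconstructed key points and the regular-N-hedron region reduces, for a convex polyhedron, to a half-space test against its faces. A sketch under that convexity assumption follows; the `min_fraction` rule for deciding when a whole body counts as inside is an assumption, since the patent does not spell out the inclusion criterion.

```python
import numpy as np

def inside_convex_polyhedron(point, faces):
    """faces: list of (outward unit normal n, offset d) half-spaces;
    the region is {x : n . x + d <= 0 for every face}."""
    return all(n @ point + d <= 1e-9 for n, d in faces)

def body_in_region(keypoints, faces, min_fraction=1.0):
    """True if at least min_fraction of the 3D pose key points lie
    inside the event region (threshold rule is an assumption)."""
    hits = sum(inside_convex_polyhedron(p, faces) for p in keypoints)
    return hits >= min_fraction * len(keypoints)
```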
Example 4
On the basis of the above embodiment, the method for constructing the triggering object based on the three-dimensional human body posture key point in the step 3 includes: step 3.1: selecting an event trigger source; step 3.2: and defining human body posture key point signals of a plurality of events and human body posture key point signals of continuous events, registering the human body posture key point signals and the corresponding events in an event trigger source, and completing the construction of a trigger object.
In particular, in image processing a key point is essentially a feature: an abstract description of a fixed region or spatial physical relationship that characterizes a combination or context within a certain neighborhood. It is not merely a point or a location, but a combination of context and surrounding neighborhood. The goal of key point detection is to have a computer find the coordinates of these points in an image; it is a basic task in computer vision, and is crucial for high-level tasks such as identification and classification.
Specifically, internal parameter calibration algorithms in the prior art are often based on only a single parameter or parameter set and yield results of low accuracy. Since the scene construction of the invention is based on multi-view cameras, using a conventional internal reference calibration method in this setting would lower the accuracy further.
Human body posture key point detection (human keypoint detection), also called human body posture recognition, aims to accurately locate the positions of human body joints in images and is a front-end task for human action recognition, human behavior analysis and human-computer interaction. Unlike facial key point detection, the human trunk is more flexible and its changes are harder to predict, so coordinate-regression methods struggle to compete, and a heatmap-regression key point detection method is generally used. (A sketch of the usual heatmap decoding step follows.)
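For the heatmap-regression approach mentioned above, the detector outputs one heatmap per joint and the key point coordinates are read off by a per-channel argmax. This decoding step is generic to such detectors, not something specific to the patent:

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: (K, H, W) array, one channel per body key point.
    Returns a (K, 3) array of (x, y, confidence) per key point --
    the usual argmax decoding for heatmap-regression detectors."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)                 # peak per channel
    ys, xs = np.unravel_index(idx, (H, W))
    conf = flat[np.arange(K), idx]
    return np.stack([xs, ys, conf], axis=1).astype(float)
```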
Example 5
On the basis of the above embodiment, the method for performing internal reference calibration in step 1.1 includes: combining a plurality of groups of high-precision cubes of known sizes into irregular cubes of mutually different shapes; shooting the irregular cubes from a plurality of different angles with the multi-view camera to obtain a plurality of groups of shooting results; generating depth data for each group of shooting results and projecting the depth data back into three-dimensional space under the camera's local coordinate system through the inverse of the perspective projection, generating normal maps with the same number of frames, and segmenting the planes of the calibration object on the normal maps to obtain a plurality of groups of plane depth data; fusing the plurality of groups of plane depth data into fused depth data; projecting the fused depth data back into three-dimensional space under the camera's local coordinate system through the inverse of the perspective projection to obtain the three-dimensional point set corresponding to each plane; fitting each three-dimensional point set by least squares to obtain its corresponding plane; calculating the included angles and distances between the marked planes from the fitted planes; actually measuring the included angles and distances between the marked planes of the high-precision cubes, comparing the computed included angles and distances against the measured ones, constructing an optimization objective function that minimizes their difference, and optimizing the internal parameters of the multi-view camera with a nonlinear iterative optimization method so as to minimize that objective function, thereby completing the internal parameter calibration of the multi-view camera.
Key point detection methods can generally be divided into two types: one solves for coordinates by direct regression; the other models each key point as a heatmap and obtains its position by regressing the heatmap distribution through a per-pixel classification task. Both are means to the same end: finding the positions of the points in the image and the relations between them.
Example 6
On the basis of the above embodiment, the data fusion of the multiple groups of plane depth data into fused depth data is computed by a formula (not reproduced in this text source) in which R is the fused depth data, m is the number of groups of plane depth data, o is the number of cubes contained in each set of high-precision cube combinations, r_i is the plane depth data, S is the surface area of the irregular cube, and M is the average number of faces of the cubes in each set of high-precision cube combinations.
Specifically, fusing the multiple groups of plane depth data yields fused depth data that reflects, as a whole, the results of shooting with the multi-view camera from multiple angles, which makes the calibration result more accurate. (A placeholder fusion sketch follows.)
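Because the fusion formula itself is not reproduced in this text source, the sketch below substitutes a plain per-pixel average over the m groups as a placeholder; the patent's actual formula additionally involves the cube count o, the surface area S and the average face count, which are omitted here.

```python
import numpy as np

def fuse_plane_depth(groups):
    """Placeholder fusion: per-pixel mean of the m groups of plane
    depth data, ignoring invalid (zero) pixels. Stands in for the
    patent's formula, which this text source does not reproduce."""
    stack = np.stack(groups)                  # (m, H, W)
    valid = stack > 0
    total = np.where(valid, stack, 0.0).sum(axis=0)
    count = np.maximum(valid.sum(axis=0), 1)  # avoid divide-by-zero
    return total / count
```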
Example 7
Based on the above embodiment, the method for performing the external parameter calibration in step 1.2 is as follows: acquiring the included angles of the multi-view camera in three regular directions; then obtaining the projections of the external parameters of the multi-view camera in the three regular directions under those included angles, yielding three external parameter projection sets, the three regular directions being a first, a second and a third regular direction. The fitting error for each of the three regular directions is calculated by a formula (not reproduced in this text source) in which l is the number of parameters in the external parameter projection set for each regular direction, w_1 is the error calculation function, y_l is the external parameter of the multi-view camera in a given regular direction, ŷ_l is the projection of that external parameter, and θ_k is the included angle of the multi-view camera in that regular direction. The regular direction with the minimum fitting error is taken as the standard projection direction, and the external parameter projection set obtained by projecting in that direction is taken as the external parameters.
In particular, if a prior-art method were carried over directly to a multi-view camera for external parameter calibration, the result would easily be inaccurate. By projecting in several regular directions and calibrating based on the fitting error of each direction, the result is made more accurate.
Example 8
On the basis of the above embodiment, the method for performing bundle adjustment of the parameters of the multi-view camera in step 1.2 is a parallel bundle adjustment method.
Example 9
On the basis of the above embodiment, the method in step 1.3 by which the multi-view camera acquires multiple views, performs three-dimensional reconstruction and tracking of the human body based on those views, and completes the construction of the multi-information-source event sensor includes: under each view, performing three-dimensional reconstruction and tracking of the human body to obtain a plurality of reconstruction and tracking results; for each result, constructing an information-source event sensor; and combining the individual information-source event sensors into the multi-information-source event sensor.
Example 10
A scene construction system of a multi-view camera, the system comprising: a multi-information-source event sensor construction unit, configured to construct the multi-information-source event sensor based on the multi-view camera array; a regional event trigger source construction unit, configured to construct the event trigger source of the complex three-dimensional space region; a trigger object construction unit, configured to construct the trigger object based on three-dimensional human body posture key points; and an event processing unit, configured to perform event triggering and recording.
It should be noted that the system provided in the foregoing embodiment is illustrated only with the above division of functional units. In practical application, the functions may be allocated to different functional units; that is, the units or steps in the embodiment of the present invention may be further decomposed or combined. For example, the units of the above embodiment may be merged into one unit or further split into multiple sub-units, so long as all or part of the functions described above are accomplished. The names of the units and steps in the embodiment of the invention are only for distinguishing them and are not to be construed as undue limitations on the invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the storage device and the processing device described above and the related description may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Those of skill in the art will appreciate that the illustrative units and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that programs corresponding to the software units and method steps may be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, the components and steps of the various examples have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Those skilled in the art may implement the described functionality in different ways for each particular application, but such implementation choices are not intended to be limiting.
The terms "first," "another portion," and the like, are used for distinguishing between similar objects and not for describing a particular sequential or chronological order.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or unit/apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or unit/apparatus.
Thus far, the technical solution of the present invention has been described with reference to the preferred embodiments shown in the drawings, but those skilled in the art will readily understand that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions of the related technical features may be made without departing from the principles of the present invention, and such modifications and substitutions fall within the scope of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention.
Claims (9)
1. A scene construction method of a multi-view camera, the method comprising the steps of:
step 1: the construction of the multi-information source event sensor based on the multi-view camera array specifically comprises the following steps:
step 1.1: performing internal parameter calibration of the multi-view camera;
step 1.2: performing external parameter calibration of the multi-view camera and bundling adjustment of parameters of the multi-view camera;
step 1.3: the multi-view camera acquires multiple views, performs three-dimensional reconstruction and tracking of human bodies based on the multiple views, and completes construction of the multi-information-source event sensor;
step 2: constructing an event trigger source in a complex three-dimensional space area;
step 3: constructing a trigger object based on three-dimensional human body posture key points;
step 4: the event triggering and recording method specifically comprises the following steps:
step 4.1: operating a multi-information source event sensor based on a multi-view camera array;
step 4.2: continuously detecting and reconstructing human body information;
step 4.3: checking the inclusion relation between the human body information and the set three-dimensional space region forming the event trigger source. If the inclusion condition is met and the event has not yet been triggered, detecting whether the human body posture matches the event-start posture key point signal registered with the event, and if so, triggering the event corresponding to the event trigger source and starting recording. If the event has already been triggered, detecting whether the posture matches the event-continuation posture key point signal, and if so, continuing the recording, otherwise ending the event recording. If the inclusion condition is not satisfied and the event has been triggered, ending the event recording;
the method for performing internal reference calibration in the step 1.1 comprises the following steps: combining a plurality of groups of high-precision cubes with known sizes to form irregular cubes with mutually different shapes; shooting the irregular cube at a plurality of different angles by using a multi-view camera to obtain a plurality of groups of shooting results; generating depth data of a plurality of groups of shooting results through perspective projection inverse process, respectively, projecting the depth data back into a three-dimensional space under a local coordinate system of a camera, generating a normal map with the same frame number, and dividing planes of a calibration object on the normal map to obtain a plurality of groups of plane depth data; carrying out data fusion on a plurality of groups of plane depth data to obtain fusion depth data; the fusion depth data is projected back to a three-dimensional space under a local coordinate system of the camera through a perspective projection inverse process, and a three-dimensional point set corresponding to each plane is obtained; performing a least square fitting method on each obtained three-dimensional point set to obtain a plane corresponding to the three-dimensional point set; calculating the included angle and the distance between the marked planes based on the obtained planes; the method comprises the steps of actually measuring the angle and the distance between marked planes of a high-precision cube, comparing the obtained included angle with the distance, constructing an optimized objective function with the purpose of minimum difference, optimizing the internal parameters of the multi-view camera by using a nonlinear iterative optimization method through the optimized objective function, minimizing the objective function, and completing the internal parameter calibration of the multi-view camera.
2. The method according to claim 1, wherein the method for constructing the complex three-dimensional space region event trigger source in step 2 comprises: step 2.1: selecting a critical point of an event area in the multiple views; step 2.2: calculating a critical point under a camera coordinate system based on a triangulation method; step 2.3: and constructing an event three-dimensional space region which is wrapped by the polyhedron, and taking the event three-dimensional space region as an event trigger source of the region.
3. The method according to claim 2, wherein the method for constructing the event three-dimensional space region enclosed by the polyhedron in step 2.3 comprises: based on the critical points under the camera coordinate system, constructing the boundary of a regular N-hedron and wrapping the event area inside that boundary, the value of N being required to satisfy a constraint condition (given as a formula in the original, not reproduced in this text source) in which n is the number of views of the multi-view camera and (x_n, y_n) are the coordinates of the critical point.
4. The method according to claim 3, wherein the method for constructing the trigger object based on the three-dimensional human body posture key point in the step 3 comprises the following steps: step 3.1: selecting an event trigger source; step 3.2: and defining human body posture key point signals of a plurality of events and human body posture key point signals of continuous events, registering the human body posture key point signals and the corresponding events in an event trigger source, and completing the construction of a trigger object.
5. The method of claim 4, wherein the data fusion of the plurality of sets of planar depth data into fused depth data is computed by a formula (not reproduced in this text source) in which R is the fused depth data, m is the number of groups of plane depth data, o is the number of cubes contained in each set of high-precision cube combinations, r_i is the plane depth data, S is the surface area of the irregular cube, and h is the average number of faces of the cubes in each set of high-precision cube combinations.
6. The method according to claim 5, wherein the method for performing the external parameter calibration in step 1.2 comprises: acquiring the included angles of the multi-view camera in three regular directions; then obtaining the projections of the external parameters of the multi-view camera in the three regular directions under those included angles, yielding three external parameter projection sets, the three regular directions being a first, a second and a third regular direction; calculating the fitting error for each of the three regular directions by a formula (not reproduced in this text source) in which l is the number of parameters in the external parameter projection set for each regular direction, w_1 is the error calculation function, y_l is the external parameter of the multi-view camera in a given regular direction, ŷ_l is the projection of that external parameter, and θ_k is the included angle of the multi-view camera in that regular direction; and taking the regular direction with the minimum fitting error as the standard projection direction and the external parameter projection set obtained by projecting in that direction as the external parameters.
7. The method of claim 6, wherein the bundle adjustment of the multi-view camera parameters in step 1.2 is a parallel bundle adjustment method.
8. The method of claim 7, wherein the multi-view camera in step 1.3 acquires multiple views, and the method for performing three-dimensional reconstruction and tracking of the human body based on the multiple views to complete the construction of the multi-information source event sensor comprises: under each multi-view, carrying out three-dimensional reconstruction and tracking on the human body to obtain a plurality of three-dimensional reconstruction and tracking results of the human body; and under the three-dimensional reconstruction and tracking results of each human body, constructing information source event perceptrons, and combining each information source event perceptrons into a multi-information source event perceptrons.
9. A scene construction system for a multi-view camera for implementing the method of one of claims 1 to 8, characterized in that the system comprises: a multi-information-source event sensor construction unit, configured to construct the multi-information-source event sensor based on the multi-view camera array; a regional event trigger source construction unit, configured to construct the event trigger source of the complex three-dimensional space region; a trigger object construction unit, configured to construct the trigger object based on three-dimensional human body posture key points; and an event processing unit, configured to perform event triggering and recording.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311300861.7A CN117036448B (en) | 2023-10-10 | 2023-10-10 | Scene construction method and system of multi-view camera |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311300861.7A CN117036448B (en) | 2023-10-10 | 2023-10-10 | Scene construction method and system of multi-view camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117036448A CN117036448A (en) | 2023-11-10 |
CN117036448B (en) | 2024-04-02
Family
ID=88634106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311300861.7A Active CN117036448B (en) | 2023-10-10 | 2023-10-10 | Scene construction method and system of multi-view camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036448B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112470188A (en) * | 2018-05-25 | 2021-03-09 | 艾奎菲股份有限公司 | System and method for multi-camera placement |
CN113066168A (en) * | 2021-04-08 | 2021-07-02 | 云南大学 | Multi-view stereo network three-dimensional reconstruction method and system |
CN113870322A (en) * | 2021-08-23 | 2021-12-31 | 首都师范大学 | Event camera-based multi-target tracking method and device and computer equipment |
CN114359744A (en) * | 2021-12-07 | 2022-04-15 | 中山大学 | Depth estimation method based on fusion of laser radar and event camera |
WO2022194884A1 (en) * | 2021-03-17 | 2022-09-22 | Robovision | Improved vision-based measuring |
CN116188750A (en) * | 2023-02-06 | 2023-05-30 | 深圳纷来智能有限公司 | 3D human body joint movement sequence recording method |
CN116205991A (en) * | 2023-02-03 | 2023-06-02 | 深圳纷来智能有限公司 | Construction method of multi-information source event sensor based on multi-view camera array |
CN116258817A (en) * | 2023-02-16 | 2023-06-13 | 浙江大学 | Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8427656B2 (en) * | 2008-04-21 | 2013-04-23 | Max-Planck-Gesellschaft Zur Foderung Der Wissenschaften E.V. | Robust three-dimensional shape acquisition method and system |
- 2023-10-10: CN application CN202311300861.7A filed; patent CN117036448B (en), status Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112470188A (en) * | 2018-05-25 | 2021-03-09 | 艾奎菲股份有限公司 | System and method for multi-camera placement |
WO2022194884A1 (en) * | 2021-03-17 | 2022-09-22 | Robovision | Improved vision-based measuring |
CN113066168A (en) * | 2021-04-08 | 2021-07-02 | 云南大学 | Multi-view stereo network three-dimensional reconstruction method and system |
CN113870322A (en) * | 2021-08-23 | 2021-12-31 | 首都师范大学 | Event camera-based multi-target tracking method and device and computer equipment |
CN114359744A (en) * | 2021-12-07 | 2022-04-15 | 中山大学 | Depth estimation method based on fusion of laser radar and event camera |
CN116205991A (en) * | 2023-02-03 | 2023-06-02 | 深圳纷来智能有限公司 | Construction method of multi-information source event sensor based on multi-view camera array |
CN116188750A (en) * | 2023-02-06 | 2023-05-30 | 深圳纷来智能有限公司 | 3D human body joint movement sequence recording method |
CN116258817A (en) * | 2023-02-16 | 2023-06-13 | 浙江大学 | Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction |
Also Published As
Publication number | Publication date |
---|---|
CN117036448A (en) | 2023-11-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||