CN115344113A - Multi-view human motion capture method, device, system, medium and terminal - Google Patents


Info

Publication number
CN115344113A
Authority
CN
China
Prior art keywords
view
human body
human
motion capture
video signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110522428.2A
Other languages
Chinese (zh)
Inventor
梁瀚
黄程宇
张启煊
吴迪
许岚
虞晶怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ShanghaiTech University
Original Assignee
ShanghaiTech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ShanghaiTech University filed Critical ShanghaiTech University
Priority to CN202110522428.2A priority Critical patent/CN115344113A/en
Publication of CN115344113A publication Critical patent/CN115344113A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/04Synchronising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01Indexing scheme relating to G06F3/01
    • G06F2203/012Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment

Abstract

The invention provides a multi-view human motion capture method, device, system, medium and terminal, comprising the following steps: acquiring multi-view video signals of an object to be captured, together with audio signals corresponding to the video signals; eliminating the time differences among the multi-view video signals based on the audio signals to obtain multi-view synchronized video signals; extracting the corresponding multi-view 2D human body key points from the synchronized video signals; acquiring the association information of the 2D human body key points across views; and performing an optimization calculation based on the association information to obtain 3D human body posture information. The invention requires only ordinary RGB cameras for capture; it alleviates the self-occlusion problem and achieves higher capture precision; compared with motion capture using inertial sensors, it offers better real-time performance, a lower barrier to use, and a larger recognition range without limiting the number of people; and reducing wearable devices improves the user experience and allows greater freedom of movement.

Description

Multi-view human motion capturing method, device, system, medium and terminal
Technical Field
The invention relates to the technical field of human motion capture, in particular to a multi-view human motion capture method, device, system, medium and terminal.
Background
With the popularity of Virtual Reality (VR) and Augmented Reality (AR), the industry has a growing need for reliable 3D human motion capture. Markerless optical motion capture removes the need for invasive wearable motion sensors and markers, serving as a low-cost alternative to the widely used marker-based and sensor-based motion capture solutions.
Most existing markerless optical motion capture techniques adopt a single-view capture method, which suffers from self-occlusion and poor capture precision; moreover, the capture equipment must be fitted with a depth sensor, so the implementation cost is high.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a multi-view human motion capture method, apparatus, system, medium and terminal, which are used to solve the technical problem of insufficient human motion capture precision in the prior art.
To achieve the above and other related objects, a first aspect of the present invention provides a multi-view human motion capture method, comprising: acquiring multi-view video signals of an object to be captured and audio signals corresponding to the video signals; eliminating the time difference of the multi-view video signal based on the audio signal to obtain a multi-view synchronous video signal; extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signal; acquiring the associated information among the multi-view 2D human body key points; and performing optimization calculation based on the associated information to acquire the 3D human body posture information of the object to be captured.
In some embodiments of the first aspect of the present invention, the audio signal is a high-frequency audio signal, and the picture-time synchronization of the multi-view video signals comprises: determining a coarse error of each video signal based on its time stamp; performing a convolution calculation between the high-frequency audio signal corresponding to each video signal and an ideal high-frequency characteristic sound wave to determine the fine errors among the video signals; and combining the fine errors with the coarse errors to achieve frame-level picture-time synchronization of the multi-view video signals.
In some embodiments of the first aspect of the present invention, the 3D body pose information is obtained by: constructing a 3D human body posture estimation model and marking 3D human body key points on the model; predefining an energy function comprising a 2D key-point term, a temporal stability term, a pose prior term and a joint limit term, where the 2D key-point term relates to the distance between the 2D pixel coordinates of the 3D human body key points projected into each view and the corresponding 2D human body key points, the temporal stability term relates to the temporal continuity of the motion capture, the pose prior term relates to the plausibility of joint rotations, and the joint limit term relates to the joint rotation angles; and performing an optimization calculation on the energy function to obtain the 3D human body posture information.
In some embodiments of the first aspect of the present invention, the association information is obtained by: constructing a bottom-up 2D human body posture estimation model based on RGB data; using the model to extract the corresponding 2D human body key points and the connection scores between key points from the multi-view synchronized video signals; integrating the multi-view 2D human body key points and connection scores to build a weighted undirected graph whose edge weights correspond to the connection scores of 2D key-point pairs; and maximizing the weight of the spanning tree of the weighted undirected graph to obtain the association information.
In some embodiments of the first aspect of the present invention, the multi-view video signals and the corresponding audio signals are captured from multiple angles by multiple mobile devices and each independently streamed to a server, so that the 3D body posture information is captured in real time.
In some embodiments of the first aspect of the present invention, the method further comprises: capturing human body motion data, including facial motion data and limb motion data, based on the 3D human body posture information.
To achieve the above and other related objects, a second aspect of the present invention provides a multi-view human motion capture device, comprising: the signal acquisition module is used for acquiring multi-view video signals of an object to be captured and audio signals corresponding to the video signals; a signal synchronization module, configured to eliminate a time difference of the multiview video signal based on the audio signal to obtain a multiview synchronization video signal; the key point extraction module is used for extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signals; the associated information acquisition module is used for acquiring associated information among the multi-view 2D human body key points; and the human body posture information acquisition module is used for carrying out optimization calculation based on the associated information so as to acquire the 3D human body posture information of the object to be captured.
To achieve the above and other related objects, a third aspect of the present invention provides a multi-view human motion capture system, comprising: the system comprises a plurality of video signal acquisition devices, a plurality of video signal acquisition devices and a plurality of video signal acquisition devices, wherein the video signal acquisition devices are used for acquiring video signals of an object to be captured; the audio signal generating device is used for sending out a high-frequency characteristic sound wave signal; the multi-view human body motion capture device is used for acquiring multi-view video signals of an object to be captured and audio signals corresponding to the video signals; eliminating a time difference of the multi-view video signal based on the audio signal to obtain a multi-view synchronous video signal; extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signals; acquiring the associated information among the multi-view 2D human body key points; and performing optimization calculation based on the association information to obtain the 3D human body posture information of the object to be captured.
To achieve the above and other related objects, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the multi-view human motion capture method.
To achieve the above and other related objects, a fifth aspect of the present invention provides an electronic terminal, comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the multi-view human motion capture method.
As described above, the multi-view human motion capture method, device, system, medium and terminal provided by the present invention can capture using only ordinary RGB cameras, regardless of whether the mobile devices have depth sensors. Compared with single-view motion capture techniques, the self-occlusion problem is alleviated and the capture accuracy is higher. Compared with most other multi-view motion capture techniques, the real-time performance is better. Compared with inertial sensors (such as gyroscopes), the invention has a lower barrier to use, requiring only that the user have several mobile devices; it can achieve similar or even higher capture precision and a larger recognition range; reducing wearable equipment improves the user experience and allows greater freedom of movement; and the number of people is not limited, so no additional mobile devices are needed when more people are captured.
Drawings
Fig. 1 is a flowchart illustrating a multi-view human motion capture method according to an embodiment of the invention.
Fig. 2 is a flow chart illustrating another multi-view human motion capture method according to an embodiment of the invention.
Fig. 3 is a schematic structural diagram of a multi-view human motion capture device according to an embodiment of the invention.
FIG. 4 is a schematic diagram of a multi-view human motion capture system according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic terminal according to an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," and "above," may be used herein to facilitate describing one element or feature's relationship to another element or feature as illustrated in the figures.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," "retained," and the like are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including" specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions or operations is inherently mutually exclusive in some way.
The invention provides a multi-view human motion capture method, device, system, medium and terminal, which can solve technical problems in the prior art such as insufficient human motion capture precision, high equipment requirements, poor real-time performance, and strong sensitivity to the number of people to be captured.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are further described in detail by the following embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
As shown in fig. 1, the present embodiment provides a flow chart of a multi-view human motion capture method, which includes steps S11 to S15, and can be specifically expressed as follows:
Step S11: obtain multi-view video signals of an object to be captured and audio signals corresponding to the video signals, specifically by receiving multi-view audio and video signals transmitted from outside, or by directly collecting them with multiple mobile devices (such as mobile phones, tablet computers, etc.). Each mobile device is configured with a camera module comprising a camera device, a storage device and a processing device, where the camera device includes, but is not limited to: cameras, video cameras, camera modules integrating an optical system with a CCD chip, camera modules integrating an optical system with a CMOS chip, and the like.
In some examples, one of the mobile devices may be selected as the source of the audio signal, or a separate device may serve as the audio source, such as a speaker, a voice broadcast device or a music player. Preferably, the audio signal is a high-frequency audio signal, which has strong directivity and a short propagation distance; placed around the object to be captured, it can be received by the multiple mobile devices shooting video from different angles while avoiding unnecessary environmental interference, making it particularly suitable for this application.
In a preferred implementation of this embodiment, each mobile device records a view of the scene with RGB information, and each device independently streams its acquired signals to the server. That is, the RGB (red, green, blue) color space is used, in which each color is described by the three variables encoding its hue and intensity, to record and display the acquired color video.
Step S12: eliminate the time differences among the multi-view video signals based on the audio signals to obtain multi-view synchronized video signals. Specifically, the audio signals in the multi-view videos are synchronized first, and the frame synchronization of the multi-view videos is then achieved by means of the synchronized audio.
In a preferred implementation of this embodiment, the frame-time synchronization of the multi-view video signals proceeds as follows: determine a coarse error for each video signal based on its time stamp; perform a convolution calculation between the high-frequency audio signal corresponding to each video signal and the ideal high-frequency characteristic sound wave to determine the fine errors among the video signals; and combine the fine errors with the coarse errors to achieve frame-level picture-time synchronization, thereby obtaining the multi-view synchronized video signals. In functional analysis, convolution is an operator that produces a third function from two functions f and g, characterizing the integral of the product of f with a flipped and translated copy of g over their region of overlap.
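The fine-error step above can be sketched as locating the characteristic sound wave in each track by cross-correlation (convolution with the time-reversed template). This is a minimal numpy sketch under assumed parameters: the burst shape, sample rate, and function name are illustrative, and a real pipeline would first apply the coarse, timestamp-based alignment before this sub-second refinement.

```python
import numpy as np

def fine_offset_seconds(audio, template, sample_rate):
    """Locate a characteristic high-frequency burst in a recorded track.

    Cross-correlation peaks where the template best aligns with the
    recording; the peak index divided by the sample rate is the fine
    timing error of this track relative to the emission of the burst.
    """
    corr = np.correlate(audio, template, mode="valid")
    return int(np.argmax(corr)) / sample_rate

# Toy check: embed a 10 ms, 1 kHz burst at a known sample offset.
sr = 48000
t = np.arange(sr // 100) / sr
template = np.sin(2 * np.pi * 1000.0 * t)
audio = np.zeros(sr)                          # one second of silence
audio[12345:12345 + template.size] = template
offset = fine_offset_seconds(audio, template, sr)
print(offset)   # the known offset, 12345 / 48000 seconds
```

Subtracting the offsets recovered from two tracks gives the relative fine error used to align their frames.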
Step S13: extract the corresponding multi-view 2D human body key points from the multi-view synchronized video signals. Markerless optical motion capture benefits from the recent development and popularization of deep neural networks: as universal function approximators, they make 2D human key-point extraction from human RGB features possible. By collecting large-scale RGB data with manually annotated 2D human key-point positions, a deep neural network is trained to learn the mapping between RGB data and 2D human key points, so that 2D key-point information can be extracted directly from the RGB images collected by consumer-grade color cameras.
In a preferred implementation of this embodiment, a bottom-up 2D human body posture estimation model based on RGB data is constructed, and the corresponding 2D human body key points and the connection scores between key points are extracted from the multi-view synchronized video signals using this model.
Step S14: acquire the association information among the multi-view 2D human body key points. Analyzing 2D human key-point information across multiple views is the key to stable and robust 3D pose estimation. Because of its limited dimensionality, the 2D key-point information obtained from the neural network constrains the three-dimensional pose only weakly and ambiguously, and the complexity of non-rigid human motion, self-occlusion and the existence of multiple feasible solutions aggravate these problems. To address both issues, the 2D human key points from multiple views are integrated to obtain the association information between key points in different views, which greatly reduces self-occlusion, decreases the number of non-optimal 3D pose solutions, and enables stable, robust, real-time 3D pose estimation.
In a preferred implementation of this embodiment, the multi-view 2D human body key points and the connection scores between key points are integrated to build a weighted undirected graph whose edge weights correspond to the connection scores of 2D key-point pairs; the association information is then obtained by maximizing the weight of the graph's spanning tree.
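The spanning-tree step can be sketched with a Kruskal-style maximum spanning tree over the weighted undirected graph. This is an illustrative sketch: the node indices stand for key-point detections across views, and the pairwise connection scores are made-up values, not the output of any particular pose model.

```python
def max_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm run on descending weights: keep the edge set
    that maximizes total connection score while remaining acyclic.

    `edges` is a list of (score, u, v) tuples; nodes are 0..num_nodes-1.
    """
    parent = list(range(num_nodes))

    def find(x):                     # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, u, v in sorted(edges, reverse=True):   # best scores first
        ru, rv = find(u), find(v)
        if ru != rv:                 # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((score, u, v))
    return tree

# Toy example: 4 detections of the same joint seen from different views;
# edge weights are hypothetical pairwise connection scores.
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.2, 0, 2), (0.7, 2, 3), (0.1, 1, 3)]
tree = max_spanning_tree(4, edges)
print(len(tree), sum(s for s, _, _ in tree))
```

The kept edges define which 2D detections are associated as the same physical key point across views.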
In a preferred implementation of this embodiment, a kinematics-based parameterized skeleton model is the key to driving a virtual avatar. Preferably, a number of human skeletal joint points (typically 16 or 24) are regressed using learning-based human mesh models, and the full human skeleton model is then constructed as a kinematic tree. Compared with a manually designed skeleton, such a skeleton model retains real human-body prior information, making the motion capture results more accurate and interpretable. Based on the constructed skeleton model, the connection between the skeleton and the pose-estimation key points is established by attaching 3D markers at the corresponding joint points.
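Evaluating such a kinematic tree means composing each joint's local rotation down the chain of parents. This is a minimal forward-kinematics sketch; the 3-joint chain, bone lengths, and function name are illustrative assumptions, not the patent's skeleton model.

```python
import numpy as np

def forward_kinematics(parents, offsets, rotations):
    """Compute global 3-D joint positions from a kinematic tree.

    parents[j]   : index of joint j's parent (-1 for the root)
    offsets[j]   : bone offset of joint j in its parent's frame, shape (3,)
    rotations[j] : local 3x3 rotation of joint j
    """
    n = len(parents)
    glob_R = [None] * n
    pos = np.zeros((n, 3))
    for j in range(n):                       # parents precede children
        if parents[j] < 0:
            glob_R[j] = rotations[j]
            pos[j] = offsets[j]
        else:
            p = parents[j]
            glob_R[j] = glob_R[p] @ rotations[j]
            pos[j] = pos[p] + glob_R[p] @ offsets[j]
    return pos

# Toy 3-joint chain: root -> elbow -> wrist, unit bones along +x,
# with a 90-degree rotation about z applied at the root.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
I = np.eye(3)
pos = forward_kinematics(
    parents=[-1, 0, 1],
    offsets=np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    rotations=[Rz90, I, I],
)
print(pos[2])   # the whole arm swings to point along +y
```

Rotating only the root moves every descendant joint, which is exactly the prior structure a kinematic tree imposes on the optimization.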
Step S15: perform an optimization calculation based on the association information to obtain the 3D human body posture information of the object to be captured. Candidate optimization algorithms include gradient descent, Newton's method, simulated annealing, ant colony algorithms, genetic algorithms, and so on. This embodiment preferably uses the least-squares method, which finds the best function fit to the data by minimizing the sum of squared errors. The regression parameters of the nonlinear least-squares model are solved with the Gauss-Newton iteration method: a Taylor-series expansion approximates the nonlinear model, and the regression coefficients are then corrected over multiple iterations so that they continually approach the optimal coefficients, finally minimizing the residual sum of squares of the original model.
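The Gauss-Newton iteration just described can be sketched on a toy nonlinear least-squares problem. This is not the patent's pose energy; it is a small illustrative fit of y = a·exp(b·x) to noise-free samples, showing the repeated linearize-and-solve loop.

```python
import numpy as np

def gauss_newton(residual, jacobian, theta0, iters=20):
    """Minimize 0.5 * ||r(theta)||^2 by repeated linearization:
    theta <- theta - (J^T J)^{-1} J^T r  (the normal equations)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r = residual(theta)
        J = jacobian(theta)
        step = np.linalg.solve(J.T @ J, J.T @ r)
        theta = theta - step
    return theta

# Toy problem: fit y = a * exp(b * x) with true parameters (2.0, -1.5).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)

res = lambda th: th[0] * np.exp(th[1] * x) - y
jac = lambda th: np.stack([np.exp(th[1] * x),
                           th[0] * x * np.exp(th[1] * x)], axis=1)

theta = gauss_newton(res, jac, [1.5, -1.0])
print(theta)   # converges to the true parameters (2.0, -1.5)
```

In the patent's setting the residuals would instead come from the energy terms of E(θ), with θ the skeleton parameters.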
In a preferred implementation of this embodiment, the 3D human body posture information is obtained as follows: construct a 3D human body posture estimation model and mark 3D human body key points on it; predefine an energy function E(θ) comprising a 2D key-point term E_2D(θ), a temporal stability term E_temp(θ), a pose prior term E_prior(θ) and a joint limit term E_limit(θ). The 2D key-point term E_2D(θ) relates to the distance between the 2D pixel coordinates of the 3D key points projected into each view and the corresponding 2D key points; the temporal stability term E_temp(θ) relates to the temporal continuity of the motion capture; the pose prior term E_prior(θ) relates to the plausibility of joint rotations; and the joint limit term E_limit(θ) relates to the joint rotation angles. The energy function is then optimized to obtain the 3D human body posture information.
Specifically, the energy function E(θ) is expressed as follows:
E(θ) = λ_{2D} E_{2D}(θ) + λ_{temp} E_{temp}(θ) + λ_{prior} E_{prior}(θ) + λ_{limit} E_{limit}(θ);
E_{2D}(θ) = Σ_{v=1}^{N_v} Σ_{j=1}^{N_j} ‖Π_v(J_j(θ)) − p_{v,j}‖²;
E_{temp}(θ) = Σ_{j=1}^{N_j} ‖J_j(θ_t) − J_j(θ_{t−1})‖²;
E_{prior}(θ) = (θ − μ_θ)^T Σ_θ^{−1} (θ − μ_θ);
E_{limit}(θ) = Σ_i ( max(0, θ_i − θ_{upper,i})² + max(0, θ_{lower,i} − θ_i)² );
where J_j(θ) denotes the position of the j-th 3D marker computed by forward kinematics from the parametric skeleton model under the parameters θ; Π_v(·) denotes the projection function that projects a 3D marker onto the pixel plane of the v-th view; p_{v,j} denotes the pixel-plane coordinates of the j-th 2D key point extracted by the neural network model from the RGB image of the v-th view; N_v denotes the total number of views used; N_j denotes the total number of bound 3D markers; μ_θ denotes the pose mean; Σ_θ denotes the covariance matrix;
θ_{lower} and θ_{upper} denote the lower and upper bounds of the Euler angles of the rotational degrees of freedom; and λ_{2D}, λ_{temp}, λ_{prior} and λ_{limit} are the weight hyperparameters of the energy function E(θ), balancing the influence of the respective energy terms on the optimization result.
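The 2D key-point term E_2D(θ) can be sketched as a summed squared reprojection error under a standard pinhole camera model. The intrinsics, camera pose, and marker positions below are toy values for illustration, not calibration data from the patent.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection Π_v: map 3-D points X (N,3) to pixels (N,2)."""
    Xc = X @ R.T + t                  # world frame -> camera frame
    uv = Xc @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide

def e_2d(markers3d, keypoints2d, cameras):
    """E_2D: squared reprojection error summed over views and joints."""
    total = 0.0
    for (K, R, t), p in zip(cameras, keypoints2d):
        err = project(K, R, t, markers3d) - p
        total += float(np.sum(err ** 2))
    return total

# Toy setup: one camera at the origin looking along +z.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
markers = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
perfect = project(K, R, t, markers)       # pixels (50,50) and (75,50)
print(e_2d(markers, [perfect], [(K, R, t)]))         # exact detections
print(e_2d(markers, [perfect + 1.0], [(K, R, t)]))   # shifted detections
```

During optimization, the Gauss-Newton step perturbs θ so that the projected markers move toward the detected 2D key points, driving this term down.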
It should also be noted that the 2D key-point term E_2D(θ) drives the 3D markers bound to the skeleton so that their projected 2D pixel coordinates in each view stay as close as possible to the corresponding key points extracted by the 2D human posture estimation model; the temporal stability term E_temp(θ) preserves the temporal continuity of the motion capture as much as possible, mitigating jitter; and the pose prior term E_prior(θ) keeps the skeleton joint rotations as natural as possible, so that the simulated human posture looks realistic.
In a preferred implementation of this embodiment, a multivariate normal distribution is used as the pose prior, with its pose mean μ_θ and covariance matrix Σ_θ regressed from a large corpus of scanned body data. The Mahalanobis distance, which measures the distance of a sample point from a probability distribution, is preferably used to measure the likelihood of a given pose θ. The joint limit term plays a role similar to the pose prior, the difference being that it models the joint rotation limits explicitly: when a joint rotation exceeds its limits, a restoring force is generated to correct it.
In this preferred implementation, human body motion data such as facial motion data and limb motion data are captured based on the acquired 3D human body posture information of the object to be captured. Furthermore, the acquired motion capture data can be streamed over a network to various engines (e.g., Unity, Unreal, etc.) to drive and render character models in real time. In this embodiment, the facial expressions and body movements of multiple people can be captured in real time from the acquired 3D posture information and pushed to multiple engines to drive character models, yielding high real-time performance.
To further explain the method provided in this embodiment, fig. 2 shows a flow diagram of another multi-view human motion capture method, described from both the device side and the server side. On the device side, several mobile devices with cameras are selected: one serves as a high-frequency sound source emitting the characteristic sound wave, while the others record RGB/RGBD videos of the object to be captured from multiple views; each mobile device then streams its acquired audio and video signals to the server. On the server side, the characteristic audio is used to achieve frame synchronization across devices; a neural network extracts the 2D human key-point information, and associations are established among the multi-view, multi-frame 2D data; the 3D human body posture information is then optimized with the Gauss-Newton iteration method, and the motion capture data are transmitted through network streams to other engines (driving engines, rendering engines, etc.) to drive and render character models.
In some embodiments, the method may be applied to a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (micro controller unit) controller, among others. In some embodiments, the methods are also applicable to computers including components such as memory, memory controllers, one or more processing units (CPUs), peripheral interfaces, RF circuits, audio circuits, speakers, microphones, input/output (I/O) subsystems, display screens, other output or control devices, and external ports; the computer includes, but is not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, personal Digital Assistants (PDAs), and the like. In other embodiments, the method may also be applied to servers that may be arranged on one or more physical servers, or may be formed of a distributed or centralized cluster of servers, depending on various factors such as function, load, etc.
Example two
As shown in fig. 3, this embodiment provides a structural schematic diagram of a multi-view human motion capture device, which includes: a signal acquisition module 31, configured to acquire multi-view video signals of an object to be captured and the audio signals corresponding to the video signals; a signal synchronization module 32, configured to eliminate the time difference of the multi-view video signals based on the audio signals to obtain multi-view synchronous video signals; a key point extraction module 33, configured to extract corresponding multi-view 2D human body key points from the multi-view synchronous video signals; an associated information acquisition module 34, configured to acquire the associated information among the multi-view 2D human body key points; and a human body posture information acquisition module 35, configured to perform optimization calculation based on the associated information to acquire the 3D human body posture information of the object to be captured.
It should be noted that the modules provided in this embodiment correspond to the method and embodiments described above, so their details are not repeated here. It should also be noted that the division into modules above is merely a logical division; in an actual implementation the modules may be wholly or partially integrated into one physical entity, or may be physically separate. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, the key point extraction module 33 may be a separately established processing element, may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the functions of the key point extraction module 33. The other modules are implemented similarly. In addition, the modules may be wholly or partially integrated together, or implemented independently. The processing element referred to here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field programmable gate arrays (FPGAs). For another example, when one of the above modules is implemented in the form of a processing element scheduling program code, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
EXAMPLE III
As shown in fig. 4, the present embodiment provides a structural schematic diagram of a multi-view human motion capture system, which includes: an audio signal generating means 41 for emitting a high frequency characteristic sound wave signal; a plurality of video signal collecting means 42 for collecting video signals of an object to be captured; the multi-view human motion capture device 43 is configured to obtain multi-view video signals of an object to be captured and audio signals corresponding to the video signals; eliminating the time difference of the multi-view video signal based on the audio signal to obtain a multi-view synchronous video signal; extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signals; acquiring the association information among the multi-view 2D human body key points; and performing optimization calculation based on the associated information to acquire the 3D human body posture information of the object to be captured.
Example four
The present embodiment proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-view human motion capture method as described above.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps comprising the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
EXAMPLE five
As shown in fig. 5, this embodiment of the present invention provides a schematic structural diagram of an electronic terminal. The electronic terminal provided by this embodiment comprises: a processor 51, a memory 52, and a communicator 53. The memory 52 is connected to the processor 51 and the communicator 53 through a system bus to complete mutual communication; the memory 52 is used for storing a computer program, the communicator 53 is used for communicating with other devices, and the processor 51 is used for running the computer program so that the electronic terminal executes the steps of the above multi-view human motion capture method.
The system bus mentioned above may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean there is only one bus or one type of bus. The communication interface is used to realize communication between the database access apparatus and other devices (such as a client, a read-write library, and a read-only library). The memory may include a random access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the present invention provides a multi-view human motion capture method, device, system, medium, and terminal. Data acquisition is performed by multiple mobile devices, with no requirement that a mobile device have a depth sensor: the mobile devices acquire RGB audio/video data and independently transmit it to a server by stream pushing. The server then aligns the time axes of the frame-level multi-device acquisition data, determining the gross error from timestamps and the fine error from audio; extracts 2D human key point information from the synchronized multi-view RGB information; associates the multi-view, multi-sequence-frame data; and obtains the 3D human pose with a nonlinear least squares optimization algorithm, thereby capturing body motions, such as those of the face and hands, in real time. Compared with existing motion capture schemes, the method has the following beneficial effects: 1) capture requires only an ordinary RGB camera, with no requirement that the mobile device have a depth sensor; 2) compared with single-view motion capture technology, the self-occlusion problem is alleviated and the capture precision is higher; 3) the real-time performance is better; 4) the number of people is not limited, and no mobile devices need to be added when the number of captured people increases; 5) the data can be pushed to various engines (such as Unity, Unreal, etc.) to drive a character model in real time. Therefore, the present invention effectively overcomes various disadvantages of the prior art and has high industrial utility value.
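The stream-push of motion capture data to an engine can be sketched as below. This is a minimal, hypothetical illustration: the JSON message layout, field names, and loopback UDP endpoint are my own assumptions, not the patent's wire format, and a real Unity/Unreal integration would define its own protocol.

```python
# Hypothetical sketch: serialize one frame of 3D pose data and push it over
# a network socket to a consumer (a UDP receiver stands in for the engine).
import json
import socket

def pack_pose(frame_id: int, joints: dict) -> bytes:
    """Serialize one frame of 3D joint positions as a compact JSON message."""
    return json.dumps({"frame": frame_id, "joints": joints}).encode("utf-8")

# Loopback demo endpoint with an OS-assigned free port.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(2.0)
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

msg = pack_pose(42, {"neck": [0.0, 1.6, 0.0], "hip": [0.0, 1.0, 0.0]})
send.sendto(msg, recv.getsockname())

data, _ = recv.recvfrom(4096)
frame = json.loads(data)
print(frame["frame"], len(frame["joints"]))  # 42 2
send.close(); recv.close()
```

A per-frame datagram like this keeps latency low, which matters for the real-time driving of a character model; a production system would likely add sequence numbers and a binary encoding.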
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-view human motion capture method is characterized by comprising the following steps:
acquiring multi-view video signals of an object to be captured and audio signals corresponding to the video signals;
eliminating the time difference of the multi-view video signal based on the audio signal to obtain a multi-view synchronous video signal;
extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signals;
acquiring the associated information among the multi-view 2D human body key points;
and performing optimization calculation based on the associated information to acquire the 3D human body posture information of the object to be captured.
2. The multi-view human motion capture method of claim 1, wherein the audio signal is a high-frequency audio signal, and the picture time synchronization of the multi-view video signal comprises the following steps:
determining a gross error between the video signals based on the time stamps of the video signals;
performing a convolution calculation between the high-frequency audio signal corresponding to each video signal and the ideal high-frequency characteristic sound wave to determine the fine errors among the video signals;
and combining the fine errors and the gross errors to realize frame-level picture time synchronization of the multi-view video signals.
3. The multi-view human motion capture method of claim 1, wherein the 3D human pose information is obtained by:
constructing a 3D human body posture estimation model, and marking 3D human body key points on the model;
predefining an energy function comprising a 2D key point term, a temporal stability term, a posture prior term, and a joint limit term, wherein the 2D key point term relates to the distance between the 2D pixel coordinates obtained by projecting the 3D human body key points to each view angle and the corresponding 2D human body key points; the temporal stability term relates to the continuity of the motion capture in time; the posture prior term relates to the plausibility of the joint rotations; and the joint limit term relates to the joint rotation angles;
and performing optimization calculation on the energy function to obtain the 3D human body posture information.
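The optimization in this claim can be sketched as below. This is a heavily simplified, hypothetical example of my own construction: it keeps only the 2D keypoint (reprojection) term and the temporal stability term, parameterizes a single 3D keypoint directly rather than a full articulated body model with rotations, and omits the posture prior and joint limit terms; the cameras and all names are assumptions.

```python
# Hypothetical sketch of a Gauss-Newton refinement of a 3D keypoint against
# its 2D detections in multiple views, plus a temporal term toward the
# previous frame's estimate.
import numpy as np

def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P to 2D pixels."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def residuals(X, cams, obs2d, X_prev, w_t=0.5):
    r = [project(P, X) - z for P, z in zip(cams, obs2d)]  # 2D key point term
    r.append(w_t * (X - X_prev))                          # temporal stability term
    return np.concatenate(r)

def gauss_newton(X0, cams, obs2d, X_prev, iters=10, eps=1e-6):
    X = X0.astype(float)
    for _ in range(iters):
        r = residuals(X, cams, obs2d, X_prev)
        # Numerical Jacobian of the residual vector w.r.t. the 3 coordinates.
        J = np.zeros((r.size, 3))
        for k in range(3):
            dX = np.zeros(3); dX[k] = eps
            J[:, k] = (residuals(X + dX, cams, obs2d, X_prev) - r) / eps
        X = X - np.linalg.solve(J.T @ J, J.T @ r)         # normal equations
    return X

# Two synthetic views of a single 3D keypoint at (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([1.0, 2.0, 5.0])
obs = [project(P1, X_true), project(P2, X_true)]
X = gauss_newton(np.array([0.0, 0.0, 4.0]), [P1, P2], obs, X_prev=X_true)
print(np.round(X, 3))   # converges near [1. 2. 5.]
```

A full implementation would instead optimize joint rotations of a skeletal model and add the prior and limit terms as extra residual blocks, but the normal-equation update is the same.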
4. The multi-view human motion capture method of claim 1, wherein the manner of obtaining the associated information comprises:
constructing a bottom-up 2D human body posture estimation model based on RGB data;
extracting corresponding 2D human key points and connection scores between the key points from the multi-view synchronous video signal by using the 2D human posture estimation model;
integrating the multi-view 2D human body key points and the connection scores between the key points to establish a weighted undirected graph model in which each edge weight corresponds to the connection score of the adjacent pair of 2D human body key points;
and maximizing the weight of a spanning tree of the weighted undirected graph model to acquire the associated information.
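The association step of this claim can be illustrated as below. This is a hypothetical toy example: candidate detections across views form a weighted undirected graph whose edge weights are connection scores, and a maximum-weight spanning tree groups the detections belonging to one person. The scores and node names here are invented; a real system would derive scores from the 2D pose estimation model and multi-view consistency.

```python
# Hypothetical sketch: Kruskal's algorithm on descending weights yields a
# maximum-weight spanning tree of the detection graph.
def max_spanning_tree(nodes, edges):
    """Edges are (weight, u, v) tuples; returns (tree_edges, total_weight)."""
    parent = {n: n for n in nodes}

    def find(n):                       # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree, total = [], 0.0
    for w, u, v in sorted(edges, reverse=True):   # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                   # keep edge only if it joins components
            parent[ru] = rv
            tree.append((u, v))
            total += w
    return tree, total

# Detections "view:joint" with pairwise connection scores.
nodes = ["v1:neck", "v1:hip", "v2:neck", "v2:hip"]
edges = [
    (0.9, "v1:neck", "v2:neck"),   # same joint seen from two views
    (0.8, "v1:neck", "v1:hip"),    # limb connection within a view
    (0.7, "v2:neck", "v2:hip"),
    (0.2, "v1:hip", "v2:neck"),    # implausible pairing, low score
]
tree, total = max_spanning_tree(nodes, edges)
print(round(total, 1))   # 2.4: the three high-score edges span all detections
```

The low-score edge is rejected because it would close a cycle, which is exactly how spurious cross-view pairings are pruned while the consistent ones survive.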
5. The multi-view human motion capture method of claim 1, wherein the multi-view video signals and the audio signals corresponding to the video signals are captured from multiple angles by multiple mobile devices and are each independently transmitted to a server by stream pushing, so as to capture the 3D human body posture information in real time.
6. The multi-view human motion capture method of claim 1, further comprising: capturing human body motion data, including facial motion data and limb motion data, based on the 3D human body posture information.
7. A multi-view human motion capture device, comprising:
the signal acquisition module is used for acquiring multi-view video signals of an object to be captured and audio signals corresponding to the video signals;
a signal synchronization module, configured to eliminate a time difference of the multiview video signal based on the audio signal to obtain a multiview synchronization video signal;
the key point extraction module is used for extracting corresponding multi-view 2D human body key points from the multi-view synchronous video signals;
the associated information acquisition module is used for acquiring associated information among the multi-view 2D human body key points;
and the human body posture information acquisition module is used for performing optimization calculation based on the correlation information so as to acquire the 3D human body posture information of the object to be captured.
8. A multi-view human motion capture system, comprising:
the device comprises a plurality of video signal acquisition devices, a plurality of video signal acquisition devices and a plurality of image acquisition devices, wherein the video signal acquisition devices are used for acquiring video signals of an object to be captured;
the audio signal generating device is used for sending out a high-frequency characteristic sound wave signal;
the multi-view human motion capture device according to claim 7, configured to obtain multi-view video signals of an object to be captured and the audio signals corresponding to the video signals; eliminate the time difference of the multi-view video signals based on the audio signals to obtain multi-view synchronous video signals; extract corresponding multi-view 2D human body key points from the multi-view synchronous video signals; acquire the associated information among the multi-view 2D human body key points; and perform optimization calculation based on the associated information to acquire the 3D human body posture information of the object to be captured.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-view human motion capture method of any one of claims 1 to 6.
10. An electronic terminal, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored by the memory to enable the terminal to execute the multi-view human motion capture method according to any one of claims 1 to 6.
CN202110522428.2A 2021-05-13 2021-05-13 Multi-view human motion capture method, device, system, medium and terminal Pending CN115344113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110522428.2A CN115344113A (en) 2021-05-13 2021-05-13 Multi-view human motion capture method, device, system, medium and terminal


Publications (1)

Publication Number Publication Date
CN115344113A true CN115344113A (en) 2022-11-15

Family

ID=83946979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110522428.2A Pending CN115344113A (en) 2021-05-13 2021-05-13 Multi-view human motion capture method, device, system, medium and terminal

Country Status (1)

Country Link
CN (1) CN115344113A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117420917A (en) * 2023-12-19 2024-01-19 烟台大学 Virtual reality control method, system, equipment and medium based on hand skeleton
CN117420917B (en) * 2023-12-19 2024-03-08 烟台大学 Virtual reality control method, system, equipment and medium based on hand skeleton


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination