CN117769710A - Automatic data-driven human skeleton labeling


Info

Publication number
CN117769710A
Authority
CN
China
Prior art keywords
image
camera
keypoints
scene
coordinate
Prior art date
Legal status
Pending
Application number
CN202180100820.6A
Other languages
Chinese (zh)
Inventor
李众
郭玉亮
杜翔宇
全书学
徐毅
Current Assignee
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date
Filing date
Publication date
Application filed by Innopeak Technology Inc filed Critical Innopeak Technology Inc
Publication of CN117769710A

Classifications

    • G06N 3/045 — Combinations of networks (G: Physics; G06: Computing, calculating or counting; G06N: Computing arrangements based on specific computational models; G06N 3/00: Computing arrangements based on biological models; G06N 3/02: Neural networks; G06N 3/04: Architecture, e.g. interconnection topology)
    • G06N 3/09 — Supervised learning (G06N 3/08: Learning methods)
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet] (G06N 3/04: Architecture)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The computer system obtains a plurality of first images of a scene captured simultaneously by a plurality of first cameras. Each first image is captured by a respective first camera disposed at a different location in the scene. The computer system generates a plurality of two-dimensional (2D) feature maps from the plurality of first images, each first image corresponding to a respective subset of the 2D feature maps. The plurality of feature maps are projected into a plurality of aggregated volumes of the scene. The computer system generates a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network. The computer system automatically and without user intervention identifies locations of a plurality of keypoints in the scene from the plurality of 3D heatmaps. Each keypoint corresponds to a joint of a person in the scene.

Description

Automatic data-driven human skeleton labeling
Technical Field
The present application relates generally to data processing techniques, including, but not limited to, methods, systems, and non-transitory computer readable media for generating human joint and skeleton information from image data.
Background
Human body pose estimation requires large amounts of data in which human body keypoints are labeled in images. Such keypoint labels may be synthesized into images, created manually, or identified automatically. Automatically identified labels consume the least human and computing resources while maintaining reasonable accuracy. However, automatically identified labels often require the use of specific imaging devices. Such devices provide images of limited quality and are typically used with physical markers attached to the surface of the tracked object. Physical markers are inconvenient to use, can contaminate the data, and in some cases can even interfere with the object's movement. It would be highly advantageous to identify human body keypoints in images, especially images captured by conventional cameras, using a human body pose estimation mechanism that is more convenient than current practice.
Disclosure of Invention
Thus, there is a need for a convenient human body pose estimation mechanism for identifying human body keypoints in images, particularly images captured by conventional cameras (e.g., cell phone cameras or AR-glasses cameras). To this end, the present application aims to automatically label keypoints in images captured by a second camera (e.g., an RGB camera or a time-of-flight camera) by leveraging the labeling capability of first cameras. The first cameras and the second camera are synchronized in time and, more importantly, aligned in space to determine the physical correlation between the coordinate systems of the first cameras and the second camera. The physical correlation is optionally represented by a rotation and translation matrix. The first cameras are distributed in the scene and capture a plurality of first images simultaneously. Feature maps and aggregated volumes are derived from the plurality of first images of the scene and applied to create first keypoints in the scene. A subset of the plurality of first keypoints is converted into a plurality of corresponding second keypoints on a second image captured by the second camera according to the physical correlation between the two coordinate systems. Additional missing keypoints are then filled in on the second image from the plurality of second keypoints. In this way, the second keypoints and/or the additional missing keypoints are annotated on the second image automatically and without user intervention.
According to one aspect, a method of automatically annotating an image is performed at a computer system. The method includes obtaining a plurality of first images of a scene captured simultaneously by a plurality of first cameras. Each first image is captured by a respective first camera disposed at a different location in the scene. The method further includes generating a plurality of two-dimensional (2D) feature maps from the plurality of first images, where each first image corresponds to a respective subset of the 2D feature maps. The method further includes projecting the plurality of 2D feature maps into a plurality of aggregated volumes of the scene and generating a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network. The method further includes identifying, automatically and without user intervention, locations of a plurality of keypoints in the scene from the plurality of 3D heatmaps. Each of the keypoints corresponds to a joint of a person in the scene.
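As an illustration of the last step (reading keypoint locations out of the 3D heatmaps), one common approach is a soft-argmax over the voxel grid. The sketch below is not taken from the patent; it assumes each keypoint has one (D, H, W) heatmap defined over a known cuboid of the scene described by a hypothetical origin and voxel size.

```python
import numpy as np

def soft_argmax_3d(heatmap, volume_origin, voxel_size):
    """Return the expected 3D position (scene coordinates) of one keypoint
    given its 3D heatmap over a voxel grid with the given origin and spacing."""
    probs = np.exp(heatmap - heatmap.max())          # softmax over all voxels
    probs /= probs.sum()
    zs, ys, xs = np.meshgrid(*[np.arange(s) for s in heatmap.shape], indexing="ij")
    voxel = np.array([(probs * xs).sum(), (probs * ys).sum(), (probs * zs).sum()])
    return volume_origin + voxel * voxel_size        # voxel index -> metric position
```

Taking the expectation rather than a hard argmax keeps the operation differentiable, which is why soft-argmax is commonly used when a heatmap network is trained end to end.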
In some embodiments, the locations of the plurality of keypoints are identified in a first coordinate system of the scene. The method further includes: obtaining a second image of the scene captured by a second camera simultaneously with the plurality of first images, and determining a correlation between the first coordinate system of the scene and a second coordinate system of the second camera. The method further includes: converting the locations of the plurality of keypoints from the first coordinate system to the second coordinate system according to the correlation between the two coordinate systems, and automatically annotating the second image with the plurality of keypoints according to their converted locations in the second coordinate system.
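A minimal sketch of this conversion step, assuming the correlation is expressed as a rotation matrix R and translation vector t (as suggested above) and that the second camera's intrinsic matrix K is known; all names are illustrative rather than taken from the patent:

```python
import numpy as np

def annotate_second_image(keypoints_3d, R, t, K):
    """Map 3D keypoints (N x 3, first/scene coordinate system) into the second
    camera's coordinate system with (R, t), then project them onto the second
    image with the intrinsics K to obtain 2D annotation locations."""
    pts_cam = (R @ keypoints_3d.T).T + t        # N x 3 in second-camera coordinates
    pts_img = (K @ pts_cam.T).T                 # perspective projection (homogeneous)
    return pts_img[:, :2] / pts_img[:, 2:3]     # normalize by depth -> N x 2 pixels
```

The returned pixel coordinates can then be written out as the keypoint annotations of the second image.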
According to another aspect, some embodiments include a computer system comprising one or more processors and memory having instructions stored therein, the instructions being executable by the one or more processors to implement any of the methods described above. The computer system achieves keypoint labeling and annotation by performing data-driven volumetric triangulation of three-dimensional human poses, together with time alignment and coordinate system calibration of the two types of cameras.
According to yet another aspect, some embodiments include a non-transitory computer-readable storage medium having instructions stored therein that are executed by one or more processors to implement any of the methods described above.
Drawings
For a better understanding of the various embodiments described herein, reference is made to the following detailed description, taken in conjunction with the accompanying drawings, in which like reference numerals refer to like parts throughout the drawings.
FIG. 1 is an example data processing environment having one or more servers communicatively coupled with one or more client devices, according to some embodiments.
FIG. 2 is an example local imaging environment with multiple client devices, according to some embodiments.
FIG. 3 is an example flow diagram of a process of identifying and annotating keypoints according to some embodiments.
Fig. 4 is an example flow diagram of a process of synchronizing one or more first cameras with a second camera in a scene, in accordance with some embodiments.
Fig. 5 is an example flow chart of a process of recording calibration data from a first camera and a second camera, according to some embodiments.
Fig. 6A and 6B are two test images for spatially calibrating a plurality of first cameras and a second camera according to some embodiments, and Fig. 6C is a flowchart of a process for spatially calibrating the first cameras and the second camera according to some embodiments.
Fig. 7 is a flow chart of a process of annotating keypoints on a second image captured by a second camera based on a plurality of first images captured by a plurality of first cameras, according to some embodiments.
Fig. 8 is a flowchart of a process of identifying keypoints on a plurality of first images captured by a plurality of first cameras, for example using volumetric triangulation, according to some embodiments.
Fig. 9 is a flow chart of a method of automatically annotating an image in accordance with some embodiments.
FIG. 10 is a schematic block diagram illustrating a computer system in accordance with some embodiments.
Like reference numerals refer to like parts throughout the various views of the drawings.
Detailed Description
Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives may be used without departing from the scope of the claims, and that the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on a variety of electronic devices having digital video capabilities.
Various embodiments of the present application are directed to annotating keypoints in images captured by a second camera (e.g., an RGB camera or a depth camera) using the capability of first cameras to automatically identify and annotate keypoints. The first cameras and the second camera are automatically synchronized in time and automatically calibrated in space, and a physical correlation between the coordinate systems of the first cameras and the second camera is determined. In some embodiments, the first cameras are fixed in the scene, and a coordinate system associated with the scene is used by each first camera to identify keypoints in the images it captures. First keypoints are identified in the first images captured by the first cameras and converted into second keypoints on a second image captured by the second camera simultaneously with the first images, so that the second keypoints are annotated on the second image quickly and accurately. In some embodiments, the first keypoints are associated with physical markers attached to the object and can be readily detected from the first images captured by the first cameras. Conversely, in some embodiments, the first keypoints are not associated with any physical marker and are instead detected from the first images using an image processing algorithm.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. The one or more client devices 104 may be, for example, a desktop computer 104A, a tablet computer 104B, a cell phone 104C, an imaging device 104D, a head mounted display (also referred to as AR glasses) 104E, or a smart, multi-sensing, network-connected home device (such as a thermostat). Each client device 104 may collect data or user input, execute a user application, and display the output results on its user interface. The collected data or user input may be processed locally (e.g., for training and/or inference) on the client device 104 and/or remotely by the server 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104 and, in some embodiments, process data and user inputs received from the client devices 104 when the client devices 104 run user applications. In some embodiments, the data processing environment 100 also includes a storage device 106 for storing data related to the servers 102, the client devices 104, and the applications executing on the client devices 104. For example, the storage device 106 may store video content for training a machine learning model (e.g., a deep learning network) and/or user-acquired video content to which a trained machine learning model may be applied to determine one or more operations related to the video content.
The one or more servers 102 may enable real-time data communication with client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 may perform data processing tasks that the client devices 104 cannot, or prefer not to, complete locally. For example, the client devices 104 include a game console that executes an interactive online game application. The game console receives user instructions and sends them, together with user data, to the game server 102. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream for display to the game console and to other client devices in the same game session. As another example, the client devices 104 include a network monitoring camera 104D and a cell phone 104C. The network monitoring camera collects video data and streams the video data to the monitoring camera server 102 in real time. While the video data is optionally preprocessed on the monitoring camera 104D, the monitoring camera server 102 processes the video data to identify motion events or audio events in the video data and shares information of those events with the cell phone 104C, thereby enabling a user of the cell phone 104C to remotely monitor, in real time, events occurring in the vicinity of the network monitoring camera 104D.
The one or more servers 102, the one or more client devices 104, and the storage device 106 are communicatively coupled to one another through one or more communication networks 108, which are the media used to provide communication links between these devices and the computers connected together in the data processing environment 100. The one or more communication networks 108 may include connections such as wires, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination thereof. Optionally, the one or more communication networks 108 are implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, Universal Serial Bus (USB), FireWire, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi, Voice over IP (VoIP), WiMAX, or any other suitable communication protocol. Connections to the one or more communication networks 108 may be established directly (e.g., using a 3G/4G connection to a wireless carrier), through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent, dedicated whole-home control node), or through any combination thereof. Thus, the one or more communication networks 108 may represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) acquired by an application running on a client device 104, in order to identify information contained in the content data, match the content data to other data, classify the content data, or synthesize related content data. In these deep learning techniques, a data processing model is created based on one or more neural networks to process the content data. The data processing model is trained with training data before it is applied to process the content data. In some embodiments, both model training and data processing are implemented locally on each client device 104 (e.g., client device 104C). The client device 104C obtains training data from the one or more servers 102 or the storage device 106 and applies the training data to train the data processing model. After model training, the client device 104C obtains content data (e.g., captures video data via an internal camera) and processes the content data locally using the trained data processing model. Alternatively, in some embodiments, both model training and data processing are implemented remotely on a server 102 (e.g., server 102A) associated with a client device 104 (e.g., client device 104A). The server 102A obtains training data from itself, another server 102, or the storage device 106, and applies the training data to train the data processing model. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing model, receives the data processing results from the server 102A, and displays the results on a user interface (e.g., a user interface associated with the application). The client device 104A performs no data processing, or only trivial data processing, on the content data before sending it to the server 102A. Additionally, in some embodiments, data processing is performed locally on a client device 104 (e.g., client device 104B), while model training is performed remotely on a server 102 (e.g., server 102B) associated with the client device 104B. The server 102B obtains training data from itself, another server 102, or the storage device 106, and applies the training data to train the data processing model. The trained data processing model is optionally stored in the server 102B or the storage device 106. The client device 104B imports the trained data processing model from the server 102B or the storage device 106, processes the content data using the data processing model, and generates data processing results for local display on a user interface.
In various embodiments of the present application, a data processing model is used to identify keypoints on images captured by multiple cameras. Optionally, a data processing model is trained in the client device 104 or server 102 and used in the client device 104 or server 102 to infer keypoints. In some embodiments, the images labeled with keypoints may be used as training data to train a deep learning model for subsequent use. Training and reasoning of the deep learning model may also be performed in the client device 104 or the server 102.
Fig. 2 is an example local imaging environment 200 with multiple client devices 104, according to some embodiments. The multiple client devices 104 include a first imaging device 202, a second imaging device 204, and the server 102. The first imaging device 202 and the second imaging device 204 are disposed in a scene and are used to capture images of respective fields of view associated with the same scene. The first imaging device 202 is a first type of imaging device (e.g., an infrared camera) and the second imaging device 204 is a second type of imaging device (e.g., a visible light camera). The second type is different from the first type. In some embodiments, the first imaging device 202 and the second imaging device 204 are communicatively coupled to each other only directly through the local area network 110 (e.g., a Bluetooth communication link). Alternatively, in some embodiments, the first imaging device 202 and the second imaging device 204 may be communicatively coupled to each other via the remote area network 108. In some embodiments, the first imaging device 202 and the second imaging device 204 are communicatively coupled to the server 102 via at least one of a local area network and a remote area network. The server 102 is configured to process, in conjunction with the first imaging device 202 and the second imaging device 204, the images captured by the first imaging device 202 and the second imaging device 204, and/or to communicate images or related data between the first imaging device 202 and the second imaging device 204. For example, the server 102 is a local computer machine disposed in the scene and communicates with the first imaging device 202 and the second imaging device 204 via a local area network.
In some embodiments, the plurality of first imaging devices 202 capture a plurality of first images simultaneously (e.g., within a time window) in the scene, and the plurality of first images are used to map the scene into a 3D map of the scene. The 3D map created from the first images captured by the first cameras 202 corresponds to the first coordinate system, and the second image corresponds to the second coordinate system. The physical correlation between the first and second coordinate systems may be calibrated and used to convert a position in the first coordinate system associated with the first images into a position in the second coordinate system of the second image. In some embodiments, the scene and the first imaging devices 202 are fixed, and so is the first coordinate system. The first coordinate system is an absolute coordinate system relative to the scene. The second imaging device 204 moves in the scene, and the second coordinate system is a relative coordinate system with respect to the scene.
In some embodiments, the 3D map of the scene includes a plurality of first feature points distributed at different locations of the 3D map. Each first image captured by the corresponding first camera 202 includes a first subset of the plurality of first feature points. The second image captured by the second camera 204 includes one or more second feature points corresponding to a second subset of the first feature points. Each second feature point has different coordinate values in the first and second coordinate systems. The different coordinate values of the one or more second feature points may be used to determine the physical correlation between the first and second coordinate systems. In some embodiments, the one or more second feature points are defined at known locations, such as corners and intermediate points of a checkerboard. Further details regarding determining the physical correlation between the first and second coordinate systems are provided below with reference to Fig. 6.
In some cases, the first imaging device 202A has a first field of view 208A, and the second imaging device 204 has a second field of view 210 that overlaps a portion of the first field of view 208A. A human body 206 is located in the scene and appears simultaneously in the second field of view 210 of the second camera 204 and the first field of view 208A of the first camera 202A. The human body 206 is captured by both the second camera 204 and the first camera 202A and is therefore visible, from two different angles, in a first image captured by the first imaging device 202A and a second image captured by the second imaging device 204.
In some embodiments, the human body 206 carries a plurality of physical markers. Each physical marker is optionally attached to a joint of the human body 206 or to a body part having a known position relative to a joint of the human body 206. When the first imaging device 202A captures a first image including the human body 206, the plurality of physical markers are recorded in the first image. From the physical markers, a plurality of keypoints corresponding to the human body 206 (particularly the joints of the human body 206) are identified. In some embodiments, the first imaging device 202A and the physical markers are configured such that the markers are easily detected based on distinctive imaging characteristics. For example, the first imaging device 202A includes an infrared camera with an infrared emitter, and the physical markers have distinct infrared reflection characteristics. The physical markers appear differently (e.g., with a higher brightness level) in the first image, which is an infrared image, and can therefore be readily and accurately identified by an infrared image processing algorithm. In some embodiments, the first imaging device 202A is configured to locally identify the locations of the physical markers in the first image. In some embodiments, the first imaging device 202A is configured to provide the first image to the server 102 or the second imaging device 204, and the server 102 or the second imaging device 204 is configured to identify the locations of the physical markers in the first image.
In some embodiments, the location of the physical marker on the first image is converted to a location on the second image based on a physical correlation between coordinates of the first image and the second image. The transformed locations on the second image are used to identify keypoints of the human body in the second image. Alternatively, in some embodiments, the keypoints of the human body 206 are identified and tracked based on the locations of physical markers on the first image captured by the first imaging device 202A. The keypoints associated with the physical markers on the first image are converted to keypoints on the second image captured by the second camera 204 according to the physical correlation between the coordinates of the first image and the second image. The keypoints on the second image are connected together to generate a skeletal model of the human body 206 of the second image and thus correlate the keypoints on the second image with different body parts in the human body 206. In this way, these keypoints may be annotated on the second image in relation to a body part (e.g., joint) of the human body 206.
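The skeletal model described above can be represented, at its simplest, as a fixed list of joint pairs (bones) connecting the annotated keypoints. A minimal sketch with an illustrative, hypothetical joint ordering (the actual keypoint set and connectivity depend on the marker layout and are not specified here):

```python
# Hypothetical joint indices; the real keypoint set depends on the marker layout.
BONES = [
    (0, 1), (1, 2), (2, 3), (3, 4),    # head -> neck -> shoulder -> elbow -> wrist
    (1, 5), (5, 6), (6, 7),            # neck -> other shoulder -> elbow -> wrist
    (1, 8), (8, 9), (9, 10),           # neck -> hip -> knee -> ankle
]

def build_skeleton(keypoints_2d):
    """Connect 2D keypoints annotated on the second image into bone segments."""
    return [(keypoints_2d[a], keypoints_2d[b]) for a, b in BONES]
```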
Further, in some embodiments, no physical markers are attached to the human body 206, and the first image captured by the first imaging device 202A is processed directly to identify one or more first keypoints of the human body 206 in the first image. Optionally, a deep learning technique is applied to identify the one or more first keypoints in the first image. The one or more first keypoints in the first image are converted into second keypoints on the second image captured by the second camera 204 using the physical correlation between the first and second coordinate systems. The keypoints on the second image are connected together to generate a skeletal model of the human body 206 for the second image, thereby correlating the keypoints on the second image with different body parts of the human body 206. In some embodiments, the first imaging device 202A is configured to provide the first image to the server 102 or the second imaging device 204, and the server 102 or the second imaging device 204 is configured to identify the first keypoints in the first image, convert the identified first keypoints into second keypoints, and generate the skeletal model of the human body 206.
In some embodiments, the plurality of first imaging devices 202 are fixed at different locations in the scene and have different first fields of view 208 that optionally overlap one another, so that the 3D map of the scene reasonably covers the entire scene. As the human body 206 changes position in the scene, it remains in the second field of view 210 of the second imaging device 204 at all times, for example because the second imaging device 204 is adjusted so that the human body 206 stays partially or fully within the second field of view 210. In some cases, at a first moment in time, the human body 206 is located in the first field of view of the first imaging device 202A. At a second moment in time, the human body 206 is no longer in the first field of view of the first imaging device 202A but is in the first field of view of the first imaging device 202B. For the first moment, keypoints related to body parts of the human body 206 are identified from a first image taken by the first imaging device 202A and converted into the second coordinate system of a second image taken by the second imaging device simultaneously with the first image. For the second moment, keypoints related to body parts of the human body 206 are identified from a third image taken by the first imaging device 202B and converted into the second coordinate system of another second image taken by the second imaging device 204 simultaneously with the third image.
Thus, as the human body 206 moves in the scene, a first sequence of images taken sequentially by the plurality of first imaging devices 202 is used to identify keypoints related to body parts of the human body 206 in a second sequence of images taken by the second imaging device 204 simultaneously with the first sequence of images. For example, the first image sequence is captured sequentially by the first imaging devices 202A, 202B, 202C, and 202D and, depending on the speed of movement of the human body, includes a corresponding number of consecutive first images from each of the first imaging devices 202A-202D. It should be noted that, in some contexts, the plurality of first cameras 202 are collectively referred to as the first camera 202, which includes multiple individual first cameras (e.g., first camera 202A).
In some embodiments, at a particular moment in time, the human body 206 is present in the first fields of view of both first imaging devices 202A and 202B as well as in the field of view of the second imaging device 204. One of the two images taken by the two first imaging devices 202A and 202B is selected, according to device selection criteria, to identify keypoints associated with body parts of the human body 206 in the second image taken by the second imaging device 204. For example, according to the device selection criteria, the first imaging device 202 whose first field of view overlaps more with the second field of view is selected to determine the keypoints in the second image. As another example, the first imaging device 202 that is physically closer to the second imaging device 204 is selected to help identify the keypoints in the second image. In some embodiments, selecting either of the two first imaging devices 202A and 202B does not change the locations of the keypoints of the human body 206 within the first coordinate system associated with the 3D map of the scene. Alternatively, in some embodiments, both images captured by the two first imaging devices 202A and 202B are applied to help identify the keypoints related to body parts of the human body 206 in the second image captured by the second imaging device 204. In some embodiments, the two images have some common keypoints and some different keypoints and are complementary to each other for identifying the keypoints of the human body 206. The identified keypoints of the human body are the union of the common keypoints and the different keypoints of the two images. In some embodiments, both images have the same keypoints, which are identified as the keypoints of the human body 206. The first imaging devices 202A and 202B share the same first coordinate system (e.g., which is fixed relative to the scene), and keypoints identified in the first coordinate system are converted into the second coordinate system of the second image captured by the second imaging device 204.
FIG. 3 is an example flow diagram of a process 300 of identifying and annotating keypoints according to some embodiments. The process 300 is performed jointly by the plurality of first cameras 202 and the second camera 204. In some embodiments, the process 300 involves the server 102, which works with the first cameras 202 and the second camera 204 to process images and/or communicate images or related data between the first cameras 202 and the second camera 204. In one example, each first camera 202 is an infrared camera, and the first images captured by the first cameras 202 are infrared images. Alternatively, in another example, each first camera 202 is a high-end visible light camera system with built-in keypoint detection and annotation functionality. Such first cameras 202, however, are expensive and cannot be integrated into conventional consumer electronics devices. Conversely, in one example, the second camera 204 may be mounted on the cell phone 104C or the AR glasses 104E for capturing color images, monochrome images, or depth images. Thus, the second image captured by the second camera 204 may be annotated with keypoints by leveraging the keypoint detection and annotation functionality of the first cameras 202.
Specifically, the second image captured by the second camera 204 is associated with a first image that is captured by the first camera 202 and that records the keypoints attached to the human body 206. The first images captured by the first camera 202 are synchronized (302) with the second images captured by the second camera 204. Each image captured by the first camera 202 and the second camera 204 is associated with a timestamp that records the time the image was captured. The first images captured by the first camera 202 have a first frame rate (e.g., 240 frames per second (FPS)), and the second images captured by the second camera 204 have a second frame rate (e.g., 30 FPS). In some embodiments, each second image is associated with its temporally closest first image, which carries the keypoints for that second image, regardless of whether the closest first image was captured earlier or later than the second image. Alternatively, in some embodiments, each second image is associated with its temporally closest first image captured earlier than the second image, which carries the second keypoints of the second image. Alternatively, in some embodiments, each second image is associated with its temporally closest first image captured later than the second image, which carries the second keypoints of the second image. Alternatively, in some embodiments, each second image is associated with its two temporally closest first images, one captured earlier and the other captured later than the second image.
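One straightforward way to realize the "temporally closest" association described above is a nearest-timestamp search. The sketch below is a minimal illustration, assuming both image streams already carry timestamps expressed on a common (synchronized) clock and that the first-camera timestamps are sorted:

```python
import bisect

def closest_first_index(first_timestamps, second_timestamp):
    """Index of the first image whose timestamp is nearest to the second image's
    timestamp, whether captured earlier or later (first_timestamps sorted)."""
    i = bisect.bisect_left(first_timestamps, second_timestamp)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(first_timestamps)]
    return min(candidates, key=lambda j: abs(first_timestamps[j] - second_timestamp))
```

Restricting the candidates to index i - 1 only (the earlier frame) or i only (the later frame) yields the alternative association rules mentioned above.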
The first camera 202 is associated with a first coordinate system, and the second camera 204 is associated with a second coordinate system. The first and second coordinate systems are related by a physical correlation (e.g., represented by a rotation and translation matrix). An object is located at a first position in the first coordinate system and at a second position in the second coordinate system, and the first position is related to the second position according to the physical correlation. In some embodiments, the physical correlation between the first and second coordinate systems includes a plurality of displacement parameters describing a three-dimensional displacement between the two coordinate systems and a plurality of rotation parameters describing a three-dimensional rotation between them. Calibration is performed to determine the physical correlation (304). Specifically, the same object is captured by the first camera 202 and the second camera 204, and its first location in the first image and second location in the second image are collected as calibration data. The calibration data are used to determine the physical correlation between the first and second coordinate systems (306). In one example, the object includes a plurality of fixed positions on a checkerboard (e.g., as in Figs. 6A and 6B).
In some embodiments, a plurality of first images are captured by the first camera 202, and a plurality of second images are captured by the second camera 204 simultaneously with the first images. Human motion data are recorded in the first images of the first camera 202 and in the second images of the second camera 204, respectively (308).
Information on first keypoints of the human body 206 is extracted from the first images, and the first keypoint information includes the locations of the first keypoints in the first coordinate system of the first camera 202. Using the physical correlation, the locations of the first keypoints in the first coordinate system are converted into locations of second keypoints, which relate to body parts of the human body 206, in the second coordinate system of the second image. In some embodiments, one or more additional keypoints are missing. The one or more additional missing keypoints are derived (e.g., by interpolation) from the second keypoints converted from the information on the first keypoints attached to the human body 206. Thus, data post-processing (310) is performed to compute the keypoints for the first and second images and to derive the one or more additional missing keypoints for the second image.
According to the physical correlation between the first coordinate system of the first camera 202 and the second coordinate system of the second camera 204, keypoints in the first images captured by the first camera 202 are converted into keypoints, related to body parts of the human body 206, in the second images captured by the second camera 204. Annotations associated with the keypoints of the first images are also projected onto the second images captured by the second camera 204 (312). Through these methods, process 300 is used to generate data for human-body-related algorithms. As data-driven algorithms become prevalent, accurate and automated data labeling and generation become critical, and process 300 provides a data generation scheme that does not require extensive manual labeling.
In some embodiments, the second image is associated with its two temporally closest first images, one captured earlier and the other captured later than the second image. The two temporally closest first images are optionally captured by the same first camera 202 or by different first cameras 202. The two first images carry two different sets of keypoints, which are temporally interpolated into one set and converted into keypoints, related to body parts of the human body 206, in the second image captured by the second camera 204. Alternatively, the two different sets of keypoints of the two first images are converted into two different sets of keypoints in the second coordinate system, where they are temporally interpolated to obtain the keypoints of the second image.
Fig. 4 is an example flowchart of a process 302 of synchronizing a plurality of first cameras with a second camera in a scene, according to some embodiments. The server 102 is one of a local computer machine located in the scene and a remote server communicatively coupled to the first cameras 202 and the second camera 204 through one or more communication networks 108 (402). When the server 102 is a local computer machine, each of the plurality of first cameras 202 and the second camera 204 is coupled to the local computer machine by a local area network (e.g., a WiFi network) or a wired link. When the server 102 is a remote server, each of the plurality of first cameras 202 and the second camera 204 is coupled to the remote server 102 at least through a wide area network (e.g., a cellular network). A software application runs on the local computer machine or the remote server to receive image data from the plurality of first cameras 202 and the second camera 204 and to process the image data as needed.
The system times tracked by the first cameras 202 and the second camera 204 may not be exactly the same, and calibration is therefore required to ensure that the image data captured by the first cameras 202 and the second camera 204 are synchronized. In some embodiments, each first camera 202 transmits a test signal to the server 102, the test signal including a first timestamp recording the transmission time tracked according to the first camera's clock. The server 102 receives the test signal and records a second timestamp of the reception time tracked according to the server's clock. Specifically, the server 102 receives the test signal and determines the time difference between the first camera time and the server time. In some embodiments, the server 102 determines a delay time of the test signal and subtracts the delay time from the time difference. In some embodiments, the delay time is negligible compared to the time difference. The time difference is then used to synchronize the first camera 202 and the server 102 (404). In some embodiments, the second camera 204 is similarly synchronized (406) with the server 102, optionally accounting for an associated delay, so that the times tracked by the first cameras 202, the second camera 204, and the server 102 can be mutually calibrated, e.g., against the server time of the server 102. Alternatively, in some embodiments, the server 102 is synchronized (408) with the second camera 204, optionally accounting for an associated delay, so that the times tracked by the first cameras 202, the second camera 204, and the server 102 can be mutually calibrated, e.g., against the second camera time of the second camera 204.
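A minimal sketch of this offset computation, assuming the test signal carries the sender's transmission timestamp and that the one-way delay is either estimated (e.g., as half of a measured round trip) or treated as negligible; the function name and parameters are illustrative:

```python
def clock_offset(send_time_sender_clock, receive_time_receiver_clock, one_way_delay=0.0):
    """Offset to add to the sender's clock so it matches the receiver's clock.
    The estimated one-way delay is subtracted from the raw difference; leaving
    it at zero corresponds to treating the delay as negligible."""
    return (receive_time_receiver_clock - send_time_sender_clock) - one_way_delay

# Example: a first camera reports sending at t=100.000 s (its clock); the server
# receives the signal at t=102.350 s (server clock) and estimates a 5 ms delay.
offset = clock_offset(100.000, 102.350, one_way_delay=0.005)   # 2.345 s
```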
In some embodiments, first camera 202 and second camera 204 communicate data directly with each other and server 102 is not involved in process 300. One of the first cameras 202 sends a test signal to the second camera 204, and the second camera 204 tracks the time of receipt and determines the time difference between the first camera 202 and the second camera 204. The second camera 204 optionally subtracts the delay time from the time difference. The time difference is used to synchronize the first camera 202 and the second camera 204. Conversely, in some embodiments, the second camera 204 sends a test signal to one of the first cameras 202, the first camera 202 tracks the time of receipt and determines the time difference between the first camera 202 and the second camera 204. The time difference is used to synchronize the first camera 202 and the second camera 204, wherein the delay time is optionally subtracted from the time difference.
Fig. 5 is an example flow chart of a process 304 of recording calibration data from the plurality of first cameras 202 and the second camera 204, in accordance with some embodiments. The plurality of first cameras 202 and the second camera 204 are present in a scene that includes one or more objects associated with a plurality of keypoints.
In some embodiments, physical markers are disposed at the keypoints, or at known locations relative to the keypoints, and emit signals at a predefined marker frequency (e.g., 0-200 Hz). After the first cameras 202 begin recording (502), first images are captured to record the three-dimensional positions of the plurality of physical markers (504). Alternatively, in some embodiments, the keypoints need not be marked with physical markers, and the first images include the three-dimensional positions of multiple keypoints of one or more objects in the scene. The first cameras 202 send the first images frame by frame to the server 102 (504). Each first image includes a first timestamp that records the first frame time at which the corresponding first image was captured. The server 102 receives and stores the first images 508 captured by the first cameras 202, including the three-dimensional locations of the physical markers or keypoints and the corresponding first timestamps of the first images (506).
After the second camera 204 begins recording the second image (502), the second camera 204 captures the second image (510) and sends the second image to the server 102 on a frame-by-frame basis. Each second image 512 includes a second timestamp that records a second frame time when the corresponding second image was captured by the second camera 204. The system times of the first camera 202, the second camera 204, and the server 102 have been previously calibrated, and therefore, at least one of the first timestamp and the second timestamp is adjusted to synchronize the first image captured by the first camera 202 with the second image captured by the second camera 204.
Fig. 6A and 6B are two test images 600 and 620 for spatially calibrating a plurality of first cameras 202 and a second camera 204, and Fig. 6C is a flow chart of a process 306 for spatially calibrating the first cameras 202 and the second camera 204, according to some embodiments. One of the first cameras 202 captures a first test image 600 (602), and the second camera 204 captures a second test image 620 (604); the second test image 620 and the first test image 600 are captured simultaneously in the same scene. Because of the different positions and orientations of the first camera 202 and the second camera 204, the first test image 600 and the second test image 620 are taken from two different perspectives. Both the first test image 600 and the second test image 620 include a checkerboard 606 and a plurality of keypoints 608 (e.g., 608A, 608B, 608C, and 608D) disposed at a plurality of predetermined positions on the checkerboard 606.
The first test image 600 is associated with a first timestamp for recording a first frame time when the first test image 600 was captured. The second test image 620 is associated with a second timestamp for recording a second frame time when the second test image 620 was captured. In one example, the first camera 202 is an RGB camera for capturing test images at a first frame rate of 200FPS and the second camera 204 is a camera integrated in the cell phone 104C for capturing test images at a second frame rate of 30 FPS. In some embodiments, the first test image 600, the first timestamp, the second test image 620, and the second timestamp are integrated (synchronized) and processed in the server 102 or the second camera 204 to spatially calibrate the first camera 202 and the second camera 204, i.e., to determine a physical correlation between the coordinate systems of the first camera 202 and the second camera 204.
The first test image 600 is one of a sequence of consecutive first test images taken by the first camera 202, and the second test image 620 is one of a sequence of consecutive second test images taken by the second camera 204. For each second test image 620, the server 102 identifies the temporally closest first test image (i.e., first test image 600) (610), which is captured substantially simultaneously with the second test image 620 and includes the keypoints of the second test image 620. In some embodiments, each second test image 620 is paired with its temporally closest first test image 600, regardless of whether that first test image 600 was captured earlier or later than the second test image 620. Alternatively, in some embodiments, each second test image 620 is paired with its temporally closest first test image 600 captured earlier than the second test image 620 and including the keypoints of the second test image 620. Alternatively, in some embodiments, each second test image 620 is paired with its temporally closest first test image 600 captured later than the second test image 620 and including the keypoints of the second test image 620. Alternatively, in some embodiments, each second test image 620 is associated with its two temporally closest first test images 600, one captured earlier and the other captured later than the second test image 620.
In some embodiments, each long side of the checkerboard 606 corresponds to two keypoints 608. The locations of the keypoints 608 are readily derived in the first coordinate system from the first test image 600. In addition to the keypoints 608, the positions of the four corners 612 of the checkerboard 606 and the positions of the intermediate points 614 between the keypoints 608 in the first test image 600 are readily obtained from the positions of the keypoints 608. The locations of a subset or all of the keypoints 608, the four corners 612, and the intermediate points 614 are identified on the second test image 620 and compared with the corresponding locations in the first test image 600 to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204.
In some embodiments, a perspective-n-point (PnP) method is used to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204. Alternatively, in some embodiments, a random sample consensus (RANSAC) method is used to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204 (616). Optionally, the physical correlation is represented by a rotation and translation matrix (618) that maps the locations of the keypoints 608, corners 612, or intermediate points 614 identified from the first test image 600 to corresponding locations in the second coordinate system of the second camera 204.
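As one possible realization of this step, OpenCV provides a RANSAC-based PnP solver. The sketch below is an illustration under assumptions rather than the patent's implementation: it presumes the checkerboard points' 3D positions in the first (scene) coordinate system, their detected 2D pixel locations in the second test image, and the second camera's intrinsic matrix are all available.

```python
import cv2
import numpy as np

def calibrate_extrinsics(points_3d_first, points_2d_second, K, dist_coeffs=None):
    """Estimate the rotation matrix R and translation vector t mapping the first
    (scene) coordinate system into the second camera's coordinate system."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)                      # assume an undistorted camera
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d_first.astype(np.float32),            # N x 3 checkerboard points (scene)
        points_2d_second.astype(np.float32),           # N x 2 detections in second image
        K.astype(np.float32),
        dist_coeffs,
    )
    R, _ = cv2.Rodrigues(rvec)                         # rotation vector -> 3x3 matrix
    return R, tvec.reshape(3)
```

The resulting pair (R, t) is then the rotation and translation matrix referred to above and can be plugged directly into a keypoint conversion routine such as the earlier sketch.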
In other words, the physical correlation between the first and second coordinate systems is determined by obtaining one or more first test images 600 of the scene from the first camera 202, in which a plurality of first test keypoints 608, 612, or 614 are to be detected, and obtaining one or more second test images 620 of the scene from the second camera 204, in which a plurality of second test keypoints 608, 612, or 614 are to be detected. The second test image 620 is captured at the same time as the first test image 600. The first test keypoints have known physical locations in the scene relative to the second test keypoints. The first test keypoints and the second test keypoints are detected from the first test image 600 and the second test image 620, respectively.
The physical correlation between the first and second coordinate systems is determined based on the known physical locations of the first test keypoints in the scene relative to the second test keypoints. In one example, the first test keypoints detected from the first test image 600 are the keypoints 608, and the second test keypoints detected from the second test image 620 are the corners 612 and intermediate points 614. The locations of the keypoints 608 in each first test image 600 are used to derive the locations of the corners 612 and intermediate points 614 in the first test image 600. The locations of the corners 612 and intermediate points 614 derived in the first test image 600 are compared with the corresponding locations in the second test image 620 to determine the physical correlation. More specifically, the plurality of first test keypoints includes a first marking unit (e.g., 608A) and a second marking unit (e.g., 608B), and the plurality of second test keypoints includes a third marking unit (e.g., 614A) located midway between the positions of the first marking unit and the second marking unit.
Further, in some embodiments, the one or more first test images 600 include a sequence of first image frames, and each first test keypoint is contained in a subset of the first test images 600. The one or more second test images 620 include a sequence of second image frames, each second test keypoint corresponds to a respective subset of the first test keypoints, and each second test keypoint is contained in a subset of the second test images 620. Further, in some embodiments, at least a subset of the first test keypoints and the second test keypoints are attached to a checkerboard. The checkerboard is moved through a plurality of checkerboard poses in the scene and is recorded in the first sequence of image frames 600 and the second sequence of image frames 620. The checkerboard poses differ from one another in position or orientation.
In some embodiments, the one or more first test images comprise a single first test image 600, and the one or more second test images comprise a single second test image 620. Each second test keypoint corresponds to a respective subset of the first test keypoints (e.g., is derived from a subset of the first test keypoints). In addition, in some embodiments not illustrated in Figs. 6A-6C, the first test keypoints and the second test keypoints are marked on a three-dimensional box having multiple sides, each side covered by a checkerboard pattern. The three-dimensional box is recorded in the first test image 600 and the second test image 620.
Fig. 7 is a flow diagram of a process 700 for annotating keypoints on a second image captured by the second camera 204 based on a plurality of first images captured by the plurality of first cameras 202, in accordance with some embodiments. The plurality of first cameras 202 capture a plurality of first images (702), each first image covering a corresponding portion of the scene in which the first cameras 202 and the second camera 204 are located. Each first image is a two-dimensional image. In some cases, an object is present in a first subset of the first images and, due to occlusion, absent from a second subset of the first images. The first and second subsets of the first images are used jointly (704) by a volumetric triangulation method to accurately locate the keypoints of objects in the scene. In some embodiments, the locations of the plurality of first keypoints are identified in the first coordinate system of the scene associated with the plurality of first cameras 202 (706). The second camera 204 captures a second image of the scene simultaneously with the plurality of first cameras 202 (708). In some embodiments, the second image includes a plurality of second keypoints corresponding to a subset of the plurality of first keypoints captured in a subset of the first images. The second image is matched (710) with each of the first images to identify the subset of the first images. The plurality of second keypoints are represented in the second coordinate system of the second camera 204. Further details on volumetric triangulation are provided below with reference to Fig. 8.
A physical correlation is determined between the first coordinate system of the scene and the second coordinate system of the second camera 204. In some embodiments, the first cameras 202 are fixed in the scene, and the first coordinate system of the first cameras 202 corresponds to an absolute coordinate system fixed relative to the scene. The second camera 204 moves in the scene, and the second coordinate system of the second camera 204 is a relative coordinate system that varies with respect to the absolute coordinate system. In some embodiments, the physical correlation between the first and second coordinate systems includes a rotation and translation matrix. The locations of the plurality of first keypoints are converted from the first coordinate system into the second coordinate system based on the physical correlation between the two coordinate systems (712). The second image is automatically annotated with the plurality of second keypoints according to the converted locations of the subset of the plurality of first keypoints in the second coordinate system (714). In some embodiments, the second image comprises at least one of an RGB image and a time-of-flight image.
In some embodiments, after the locations of the plurality of keypoints are identified in the second coordinates, additional keypoint locations are interpolated from the converted locations of the plurality of keypoints in the second coordinates, and the additional keypoint locations are associated with additional keypoints that are not among the plurality of keypoints. In some embodiments, a subset of the converted locations of the plurality of keypoints is fitted to the human body 206. A geometric center of the subset of converted locations is identified, and based on the geometric center, a location on the human body corresponding to that subset of converted locations is identified. For example, a subset of the plurality of keypoints corresponds to the eyes, ears, mouth, and neck, and its geometric center is used to locate the head of the human body 206. In some embodiments, the second image labeled with keypoints is used as training data to train a deep learning model for subsequent use. The deep learning model is independent of the data processing model used to identify the first keypoints in the first images captured by the plurality of first cameras 202.
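A minimal sketch of the interpolation and geometric-center fitting described above; the keypoint indices and interpolation endpoints are placeholders chosen for illustration, not values from the described embodiments.

```python
import numpy as np

def head_center_from_keypoints(converted_keypoints, subset_indices):
    """Estimate a head location as the geometric center of a subset of
    converted keypoints (e.g., eyes, ears, mouth, and neck; indices assumed)."""
    subset = converted_keypoints[subset_indices]   # (K, 2) or (K, 3)
    return subset.mean(axis=0)

def interpolate_additional_keypoint(p_a, p_b, alpha=0.5):
    """Interpolate an additional keypoint between two annotated keypoints;
    the choice of endpoints and the weight alpha are assumptions."""
    return (1.0 - alpha) * np.asarray(p_a) + alpha * np.asarray(p_b)
```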
In some embodiments, process 700 is performed in a label-free manner, i.e., without requiring the person 206 to wear specific clothing or attach physical markers. A customized calibration pipeline enables generation of body annotation data for images from an external device. In particular, the calibration method conveniently synchronizes the time of the server 102 with the different cameras 202 and 204 and calibrates the coordinate systems of the cameras 202 and 204. With the first cameras 202 set at different perspectives, a data-driven volumetric triangulation technique is applied to accurately calculate three-dimensional joint positions. In some embodiments, a data-driven deep learning network is applied to fuse the multi-view data collected from the plurality of first cameras 202 into accurate three-dimensional human joint labels (e.g., in a three-dimensional map). With these methods, a computer system may be coupled to multiple ToF cameras or RGB cameras and applied to automatically and accurately annotate two-dimensional/three-dimensional human joint locations on images captured by unsynchronized camera devices.
Fig. 8 is a flowchart of a process 800 of identifying keypoints in a plurality of first images captured by a plurality of first cameras 202, for example using volumetric triangulation, in accordance with some embodiments. The plurality of first cameras 202 are fixed at different locations in the scene and capture the plurality of first images simultaneously. Image data comprising the plurality of first images is passed into a two-dimensional backbone network (e.g., ResNet-152) to generate a plurality of two-dimensional feature maps. For example, a first camera 202A captures a first image (802), and a first two-dimensional feature map is extracted from the first image using the two-dimensional backbone (804); a first camera 202B captures another first image (806), and a second two-dimensional feature map is extracted from the other first image using the two-dimensional backbone (808). The two-dimensional feature maps extracted from the plurality of first images are projected into a plurality of volumes (810), e.g., aggregated across the viewing angles. Specifically, in some embodiments, each two-dimensional feature map corresponds to a first image captured by a respective first camera 202 and is projected into the volume according to the position and orientation of that first camera 202.
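The per-view unprojection and aggregation of operation 810 could look roughly like the following NumPy sketch. It uses nearest-neighbor sampling and a simple mean over views for brevity; a production pipeline would more likely use bilinear sampling and learned or softmax-weighted aggregation, and the projection-matrix inputs are assumptions.

```python
import numpy as np

def aggregate_feature_volume(feature_maps, proj_matrices, voxel_centers):
    """Project per-view 2D feature maps into one aggregated feature volume.

    feature_maps: list of (C, H, W) arrays, one per first camera.
    proj_matrices: list of (3, 4) projection matrices K [R | t], one per camera.
    voxel_centers: (V, 3) 3D centers of the voxels spanning the scene volume.
    Returns a (V, C) aggregated volume (mean over the views that see each voxel).
    """
    V = voxel_centers.shape[0]
    C = feature_maps[0].shape[0]
    accum = np.zeros((V, C), dtype=np.float32)
    count = np.zeros((V, 1), dtype=np.float32)
    homog = np.concatenate([voxel_centers, np.ones((V, 1))], axis=1)  # (V, 4)
    for feats, P in zip(feature_maps, proj_matrices):
        _, H, W = feats.shape
        uvw = homog @ P.T                       # project voxel centers, (V, 3)
        u = uvw[:, 0] / uvw[:, 2]
        v = uvw[:, 1] / uvw[:, 2]
        # Keep only voxels that project in front of and inside this view.
        valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui = u[valid].astype(int)
        vi = v[valid].astype(int)
        accum[valid] += feats[:, vi, ui].T      # nearest-neighbor sampling
        count[valid] += 1.0
    return accum / np.maximum(count, 1.0)
```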
The volumes are passed (812) into a three-dimensional convolutional neural network (CNN) to generate a plurality of three-dimensional heat maps. A three-dimensional convolutional neural network is one example of a heat map neural network. A three-dimensional location of each keypoint is determined from the plurality of three-dimensional heat maps (814) using a softmax function (i.e., a normalized exponential function), thereby increasing the accuracy of the three-dimensional keypoint locations determined from the first images. Thus, the data processing model includes the two-dimensional backbone network, the three-dimensional convolutional neural network, and the softmax function, and is trained before it is applied to infer the three-dimensional locations of the keypoints in the first images. Different neural networks in the data processing model may be trained individually or jointly. In some embodiments, the data processing model is trained in the server 102, and each first image is provided to the server 102 for processing using the data processing model. Alternatively, in some embodiments, the data processing model is trained in the server 102 and provided to the first or second camera, which processes each first image using the data processing model.
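Operation 814, i.e., applying the softmax over each three-dimensional heat map and taking the expectation over voxel coordinates (often called a soft-argmax), can be sketched as follows; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_argmax_3d(heatmaps, grid_coords):
    """Turn per-joint 3D heat maps into 3D keypoint locations.

    heatmaps: (J, D, H, W) raw heat map volumes, one per joint (torch tensor).
    grid_coords: (D*H*W, 3) 3D coordinates of the voxel centers (torch tensor).
    Applies the normalized exponential (softmax) over each volume and takes
    the expectation over voxel coordinates, giving sub-voxel 3D positions.
    """
    J = heatmaps.shape[0]
    weights = F.softmax(heatmaps.reshape(J, -1), dim=1)   # (J, D*H*W)
    return weights @ grid_coords                          # (J, 3)
```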
Fig. 9 is a flow chart of a method 900 of automatically annotating an image in accordance with some embodiments. For convenience, the method 900 is described as being performed by a computer system (e.g., the client device 104, the server 102, or a combination thereof). Examples of the client device 104 include a cell phone 104C and AR glasses 104E. In one example, the method 900 is used to annotate human keypoints in captured images. For example, an image may be captured by a second camera (e.g., a camera of the cell phone 104C or the AR glasses 104E) and annotated locally, or streamed to the server 102 (e.g., stored in the storage device 106 or a database of the server 102) for annotation. The same person is also contained in one or more first images captured by a subset of the plurality of first cameras, and the marks associated with the keypoints to be annotated, or the keypoints themselves, can be derived from the first images and used to guide annotation of the keypoints in the second image.
Optionally, the method 900 is governed by instructions stored in a non-transitory computer-readable storage medium and executed by one or more processors of the computer system. Each of the operations shown in fig. 9 may correspond to instructions stored in a computer memory or non-transitory computer-readable storage medium (e.g., the memory 1006 of the computer system 1000 in fig. 10). The computer-readable storage medium may include a magnetic or optical disk storage device, a solid state storage device (e.g., flash memory), or another non-volatile storage device. The instructions stored on the computer-readable storage medium may include at least one of source code, assembly language code, object code, or another instruction format interpretable by one or more processors. Some operations in the method 900 may be combined and/or the order of some operations may be changed.
The computer system obtains a plurality of first images of a scene captured simultaneously by a plurality of first cameras (e.g., RGB cameras, ToF cameras) (902). Each first image is captured by a corresponding first camera disposed at a different location in the scene. In some embodiments, the plurality of first images are considered to be captured simultaneously when they are captured within a time window (e.g., within 5 milliseconds). The computer system generates a plurality of two-dimensional feature maps from the plurality of first images (904). Each first image corresponds to a respective subset of the two-dimensional feature maps. In some embodiments, for each first image, the corresponding subset of two-dimensional feature maps is generated from the first image using a corresponding backbone neural network, and the corresponding backbone neural network is trained separately from, or end-to-end jointly with, the heat map neural network and the other backbone neural networks.
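A trivial check of the "captured simultaneously" condition under the example 5 ms window might look like this (the threshold is only the example value mentioned above):

```python
def captured_simultaneously(timestamps_ms, window_ms=5.0):
    """Treat a set of first-image capture timestamps (in milliseconds) as
    simultaneous if they all fall within a small time window."""
    return (max(timestamps_ms) - min(timestamps_ms)) <= window_ms
```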
The computer system projects the plurality of feature maps into a plurality of aggregated volumes of the scene (906) and generates, using a heat map neural network, a plurality of three-dimensional heat maps corresponding to the plurality of aggregated volumes of the scene (908). In some embodiments, a normalized exponential function is applied to each three-dimensional heat map to identify the location of a corresponding one of the keypoints in the scene. The computer system automatically and without user intervention identifies locations of a plurality of keypoints in the scene from the plurality of three-dimensional heat maps (910). Each keypoint corresponds to a joint of a person in the scene.
In some embodiments, the locations of the plurality of keypoints are identified in first coordinates of the scene (912). The computer system obtains a second image of the scene captured by the second camera 204 simultaneously with the plurality of first images (914) and determines a correlation between the first coordinates of the scene and second coordinates of the second camera (916). In some embodiments, the second camera is configured to capture a color image, a monochrome image, or a depth image. In some embodiments, the plurality of first images are considered to be captured simultaneously with the second image when they are captured within a time window of the second image (e.g., within 5 milliseconds). The computer system converts the locations of the plurality of keypoints from the first coordinates into the second coordinates based on the correlation between the first coordinates and the second coordinates (918), and automatically annotates the second image with the plurality of keypoints based on the converted locations of the plurality of keypoints in the second coordinates (920). Further, in some embodiments, the physical correlation between the first and second coordinates includes a plurality of displacement parameters related to a three-dimensional displacement between the first and second coordinates, and a plurality of rotation parameters related to a three-dimensional rotation between the first and second coordinates.
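One possible way to assemble the physical correlation from three rotation parameters and three displacement parameters is sketched below; the Euler-angle convention is an assumption, since the parameterization is not specified above.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def correlation_from_parameters(rotation_params_xyz, displacement_params):
    """Build the rotation matrix R and translation vector t of the physical
    correlation from three rotation parameters (Euler angles, an assumed
    parameterization, in radians) and three displacement parameters."""
    R = Rotation.from_euler("xyz", rotation_params_xyz).as_matrix()  # (3, 3)
    t = np.asarray(displacement_params, dtype=float)                 # (3,)
    return R, t
```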
Further, in some embodiments, the computer system interpolates additional keypoint locations from the converted locations of the plurality of keypoints in the second coordinates, and the additional keypoint locations are associated with additional keypoints that are not among the plurality of keypoints. In some embodiments, a subset of the converted locations of the plurality of keypoints is fitted to the human body. The computer system identifies a geometric center of the subset of converted locations of the plurality of keypoints and, based on the geometric center, identifies a location on the human body corresponding to that subset of converted locations. In some embodiments, the computer system trains a deep learning model using the second image labeled with the plurality of keypoints.
Further, in some embodiments, the correlation between the first coordinates and the second coordinates is determined using a plurality of first test images and one or more second test images. The computer system obtains a plurality of first test images of the scene from the plurality of first cameras 202 and obtains one or more second test images of the scene from the second camera 204. The second test images are captured simultaneously with the first test images. From the plurality of first test images, the computer system detects a position of a first test point in the first coordinates of the scene, and it detects a position of a second test point in the second coordinates of the second camera 204. The second test point has a known physical location in the first coordinates relative to the first test point. That is, the second test point overlaps the first test point, or the second test point has a known displacement relative to the first test point. Based on the known physical location of the second test point relative to the first test point in the scene, the computer system derives the correlation between the first coordinates and the second coordinates. In this application, a test point may also be referred to as a test keypoint.
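When the matched test-point positions are available in 3D in both coordinate systems, the correlation (R, t) can be derived by rigid alignment of the point sets. The SVD-based Kabsch sketch below assumes already-paired 3D points; it is one standard way to solve this step, not necessarily the exact calibration used in the described embodiments.

```python
import numpy as np

def derive_correlation(points_first, points_second):
    """Estimate the rotation R and translation t mapping the first coordinates
    to the second coordinates from matched test-point positions.

    points_first, points_second: (N, 3) paired test-point positions.
    Returns (R, t) such that points_second ≈ points_first @ R.T + t.
    """
    c1 = points_first.mean(axis=0)
    c2 = points_second.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (points_first - c1).T @ (points_second - c2)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c2 - R @ c1
    return R, t
```

The recovered R and t can then be used directly for the transformation of operation 712 and the conversion of operation 918.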
It should be understood that the particular order of the operations described above is merely exemplary and is not intended to indicate that it is the only order in which the operations may be performed. One of ordinary skill in the art will recognize various ways to annotate keypoints in images as described herein. Furthermore, it should be noted that the details of the other processes described above with respect to figs. 3-8 are also applicable in an analogous manner to the method 900 described above with respect to fig. 9. For brevity, these details are not repeated here.
Fig. 10 is a schematic block diagram illustrating a computer system 1000 in accordance with some embodiments. The computer system 1000 includes the server 102, the client device 104, the storage device 106, or a combination thereof. The computer system 1000 is used to implement any of the methods described above as shown in figs. 3-10. The computer system 1000 typically includes one or more processing units (CPUs) 1002, one or more network interfaces 1004, memory 1006, and one or more communication buses 1008 for interconnecting these components (sometimes called a chipset). The computer system 1000 includes one or more input devices 1010 that facilitate user input, such as a keyboard, a mouse, a voice command input unit or microphone, a touch screen display, a touch-sensitive tablet, a gesture-capturing camera, or other input buttons or controls. Further, in some embodiments, the client device 104 of the computer system 1000 uses microphone and voice recognition or camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, such as images of a graphic sequence code printed on an electronic device. The computer system 1000 also includes one or more output devices 1012 for presenting user interfaces and displaying content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a global positioning system (GPS) or other geolocation receiver, for determining the location of the client device 104.
The memory 1006 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Optionally, the memory 1006 includes one or more storage devices remotely located from the one or more processing units 1002. The memory 1006, or alternatively the non-volatile memory within the memory 1006, includes a non-transitory computer-readable storage medium. In some embodiments, the memory 1006, or the non-transitory computer-readable storage medium of the memory 1006, stores the following programs, modules, and data structures, or a subset or superset thereof:
an operating system 1014 including programs for handling various basic system services and for performing hardware related tasks;
a network communication module 1016 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage device 106) through one or more network interfaces 1004 (wired or wireless) and one or more communication networks 108, such as the internet, other wide area networks, local area networks, metropolitan area networks, etc.;
A user interface module 1018 for displaying information (e.g., graphical user interfaces of applications 1024, gadgets, websites and their web pages and/or games, audio and/or video content, text, etc.) on each client device 104 via one or more output devices 1012 (e.g., a display, speakers, etc.);
an input processing module 1020 for detecting one or more user inputs or interactions from one or more input devices 1010 and decoding the detected inputs or interactions;
a web browser module 1022 for navigating, requesting (e.g., via HTTP) and displaying websites and their web pages, including a web interface for logging into a user account associated with the client device 104 or other electronic device, controlling the client or electronic device if associated with the user account, and editing and viewing settings and data associated with the user account;
one or more user applications 1024 (e.g., games, social networking applications, smart home applications, and/or other web-based or non-web-based applications) executed by the computer system 1000 for controlling other electronic devices and viewing data captured by such devices;
Model training module 1026 for receiving training data (e.g., training data 1042) and building a data processing model (e.g., data processing module 1028) to process content data (e.g., video, visual, or audio data) collected or acquired by client device 104;
a data processing module 1028 for processing content data using the data processing models 1044 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, wherein, in some embodiments, the data processing module 1028 is associated with one of the plurality of user applications 1024 to process content data in response to a user instruction received from that user application 1024;
a camera calibration module 1030 for calibrating two cameras in time and space, and determining a physical correlation between the first coordinates of the plurality of first cameras and the second coordinates of the second camera;
a keypoint annotation module 1032 for identifying, with the first cameras, locations of keypoints of an object (e.g., the human 206), converting the identified keypoint locations into locations of keypoints in a second image captured by the second camera, and annotating the keypoints in the second image; and
One or more databases 1034 for storing at least the following data:
device settings 1036, including generic device settings (e.g., service level, device model, storage capacity, processing power, communication power, etc.) for one or more of server 102 or client device 104;
user account information 1038 for one or more user applications 1024, such as user name, security questions, account history data, user preferences, and predefined account settings;
network parameters 1040 of the one or more communication networks 108, such as IP address, subnet mask, default gateway, DNS server, and hostname;
training data 1042 for training the one or more data processing models 1044;
data processing models 1044 for processing content data (e.g., video, visual, or audio data) using deep learning techniques; and
content data and results 1046 that are respectively acquired by, and output to, the client device 104 of the computer system 1000, wherein the content data includes images captured by the first cameras 202 and the second camera 204, locations of keypoints in the first images captured by the first cameras 202, and/or annotated keypoint information of the second image captured by the second camera 204.
Optionally, one or more databases 1034 are stored in one of server 102, client device 104, and storage device 106 of computer system 1000. Optionally, one or more databases 1034 are distributed in more than one of server 102, client device 104, and storage device 106 of computer system 1000. In some embodiments, more than one copy of the data described above is stored in different devices, e.g., two copies of data processing model 1044 are stored in server 102 and storage device 106, respectively.
Each of the elements identified above may be stored in one or more of the previously mentioned storage devices and corresponds to a set of instructions for performing the functions described above. The above-identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, the memory 1006 optionally stores a subset of the modules and data structures described above. Furthermore, the memory 1006 optionally stores additional modules and data structures not described above.
The terminology used herein in describing the various described embodiments is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various embodiments described and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Furthermore, it should be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
The term "if" as used herein is to be interpreted in the context of "when" or "in" or "responsive to determination" or "responsive to detection" or "based on determination". Likewise, the phrase "if a condition or event is determined" or "if detected" is also understood to be "determining" or "responding to a determination" or "detecting a condition or event" or "responding to a determination of detecting a condition or event" depending on the context.
For ease of explanation, the foregoing description has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of operation and the practical application, thereby enabling others skilled in the art to understand the invention.
Although the various figures illustrate some logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or split. Although some reordering or other groupings are specifically mentioned, other groupings will be apparent to those of ordinary skill in the art, and thus the ordering and groupings described herein are not an exhaustive list of alternatives. Furthermore, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.

Claims (15)

1. A method for automatically labeling an image, comprising:
acquiring a plurality of first images of a scene captured simultaneously by a plurality of first cameras, wherein each first image is captured by a respective first camera disposed at a different location in the scene;
Generating a plurality of two-dimensional feature maps from the plurality of first images, each first image corresponding to a subset of the respective two-dimensional feature maps;
projecting the plurality of two-dimensional feature maps into a plurality of aggregated volumes of the scene;
generating a plurality of three-dimensional heat maps corresponding to a plurality of aggregate volumes of the scene using a heat map neural network; and
from the plurality of three-dimensional heat maps, locations of a plurality of keypoints in the scene are identified automatically and without user intervention, each keypoint corresponding to a joint of a person in the scene.
2. The method of claim 1, wherein, for each first image, a corresponding subset of two-dimensional feature maps is generated from the first image using a corresponding backbone neural network, and the corresponding backbone neural network is trained separately from, or end-to-end jointly with, the heat map neural network and the other backbone neural networks.
3. The method of claim 1 or 2, wherein identifying locations of a plurality of keypoints in the scene from the plurality of three-dimensional heatmaps further comprises:
a normalized exponential function is applied to each of the plurality of three-dimensional heat maps to identify a location of a respective keypoint of the plurality of keypoints in the scene.
4. The method of any one of the preceding claims, wherein the plurality of first images are captured simultaneously when the plurality of first images are captured within a time window.
5. The method of any one of the preceding claims, wherein each of the plurality of first cameras comprises a time-of-flight camera.
6. The method of any of the preceding claims, wherein identifying the locations of the plurality of keypoints in the first coordinate of the scene further comprises:
acquiring a second image of the scene captured simultaneously with the plurality of first images and captured by a second camera;
determining a correlation between the first coordinate of the scene and a second coordinate of the second camera;
converting the positions of the plurality of keypoints from the first coordinate into the second coordinate according to the correlation between the first coordinate and the second coordinate; and
automatically labeling the second image with the plurality of keypoints according to the converted positions of the plurality of keypoints in the second coordinate.
7. The method of claim 6, wherein labeling the second image with the plurality of keypoints further comprises:
interpolating additional keypoint locations from the transformed locations of the plurality of keypoints in the second coordinate; and
the additional keypoint locations are correlated with additional keypoints that are not among the plurality of keypoints.
8. The method of claim 6, wherein labeling the second image with the plurality of keypoints further comprises:
fitting a subset of the converted positions of the plurality of keypoints to a human body;
identifying a geometric center of a subset of the converted locations of the plurality of keypoints; and
identifying a position of the human body corresponding to the subset of the converted positions of the plurality of keypoints according to the geometric center.
9. The method as recited in claim 6, further comprising:
training a deep learning model using the second image labeled with the plurality of keypoints.
10. The method of claim 6, wherein the correlation between the first coordinate and the second coordinate comprises a plurality of displacement parameters related to three-dimensional displacement between the first coordinate and the second coordinate, and a plurality of rotation parameters related to three-dimensional rotation between the first coordinate and the second coordinate.
11. The method of claim 6, wherein determining the correlation between the first coordinate and the second coordinate further comprises:
acquiring a plurality of first test images of the scene from the plurality of first cameras;
acquiring one or more second test images of the scene from the second camera, wherein the second test images and the plurality of first test images are captured at the same time;
detecting a position of a first test point in the first coordinates of the scene from the plurality of first test images;
detecting a position of a second test point in the second coordinate of the second camera, the second test point having a known physical position in the first coordinate relative to the first test point; and
the correlation between the first and second coordinates is derived from a known physical position of the second test point in the scene relative to the first test point.
12. The method of claim 6, wherein the second camera is configured to capture a color image, a monochrome image, or a depth image.
13. The method of claim 6, wherein the second camera is mounted on a mobile device or augmented reality glasses.
14. A computer system, comprising:
one or more processors; and
a memory having instructions stored therein, the instructions being executable by the one or more processors to implement the method of any of claims 1-13.
15. A non-transitory computer-readable storage medium having instructions stored therein, the instructions being executable by one or more processors to implement the method of any of claims 1-13.
CN202180100820.6A 2021-08-24 2021-08-24 Automatic data-driven human skeleton labeling Pending CN117769710A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/047317 WO2023027691A1 (en) 2021-08-24 2021-08-24 Automatic data-driven human skeleton labelling

Publications (1)

Publication Number Publication Date
CN117769710A true CN117769710A (en) 2024-03-26

Family

ID=85323056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180100820.6A Pending CN117769710A (en) 2021-08-24 2021-08-24 Automatic data-driven human skeleton labeling

Country Status (2)

Country Link
CN (1) CN117769710A (en)
WO (1) WO2023027691A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7482040B2 (en) * 2018-06-14 2024-05-13 マジック リープ, インコーポレイテッド Augmented Reality Deep Gesture Network
US11429842B2 (en) * 2019-02-24 2022-08-30 Microsoft Technology Licensing, Llc Neural network for skeletons from input images

Also Published As

Publication number Publication date
WO2023027691A1 (en) 2023-03-02

Similar Documents

Publication Publication Date Title
US10929980B2 (en) Fiducial marker patterns, their automatic detection in images, and applications thereof
US10499002B2 (en) Information processing apparatus and information processing method
WO2019242262A1 (en) Augmented reality-based remote guidance method and device, terminal, and storage medium
JP4532856B2 (en) Position and orientation measurement method and apparatus
US20190333478A1 (en) Adaptive fiducials for image match recognition and tracking
EP3477543A1 (en) System and method using augmented reality for efficient collection of training data for machine learning
US20120293613A1 (en) System and method for capturing and editing panoramic images
WO2011118282A1 (en) Server using world coordinate system database and terminal
WO2017126172A1 (en) Information processing device, information processing method, and recording medium
AU2013219082A1 (en) Image processing device, and computer program product
US11288871B2 (en) Web-based remote assistance system with context and content-aware 3D hand gesture visualization
US20210347053A1 (en) Virtual presence for telerobotics in a dynamic scene
JPWO2021076757A5 (en)
CN110717994A (en) Method for realizing remote video interaction and related equipment
EP3007098A1 (en) Method, device and system for realizing visual identification
CN117769710A (en) Automatic data-driven human skeleton labeling
CN117597710A (en) Automatic data-driven human skeleton labeling
CN111242107B (en) Method and electronic device for setting virtual object in space
TWI759764B (en) Superimpose virtual object method based on optical communitation device, electric apparatus, and computer readable storage medium
CN112417904B (en) Method and electronic device for presenting information related to an optical communication device
WO2020244576A1 (en) Method for superimposing virtual object on the basis of optical communication apparatus, and corresponding electronic device
KR102618591B1 (en) An automated calibration system for calculating intrinsic parameter and extrinsic parameter of a camera module for precise tracking a real object, a calibration method, and a method for tracking a real object in an image based on the calibration method and augmenting a virtual model on the real object
KR20190048738A (en) Apparatus and method for providing augmented reality contents
KR20180117379A (en) Device for providing augmented reality album, server for the same and method for the same
WO2023277903A1 (en) Dual camera based monocular slam architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination