CN117597710A - Automatic data-driven human skeleton labeling


Info

Publication number
CN117597710A
Authority
CN
China
Prior art keywords
image
camera
test
coordinate
marks
Prior art date
Legal status
Pending
Application number
CN202180099736.7A
Other languages
Chinese (zh)
Inventor
李众
杜翔宇
全书学
徐毅
Current Assignee
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date
Filing date
Publication date
Application filed by Innopeak Technology Inc
Publication of CN117597710A


Classifications

    • G06T 7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/292: Image analysis; analysis of motion; multi-camera tracking
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/10024: Image acquisition modality; color image
    • G06T 2207/20076: Special algorithmic details; probabilistic image processing
    • G06T 2207/20081: Special algorithmic details; training; learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30196: Subject of image; human being; person
    • G06T 2207/30204: Subject of image; marker
    • G06T 2207/30208: Subject of image; marker matrix

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to image annotation. The computer system acquires a first image and a second image captured simultaneously by two cameras of different types. The first image records a plurality of markers, each marker being attached to an object and corresponding to a key point on the object in the first image. The computer system determines a physical correlation between a first coordinate of the first camera and a second coordinate of the second camera and identifies locations, in the first coordinate, of the plurality of markers recorded in the first image. The positions of the markers in the first coordinate are converted into positions of the key points in the second coordinate according to the physical correlation between the first coordinate and the second coordinate. The computer system automatically annotates the second image with the plurality of keypoints.

Description

Automatic data-driven human skeleton labeling
Technical Field
The present application relates to data processing techniques including, but not limited to, methods, systems, and non-transitory computer readable media for identifying keypoints in images related to objects.
Background
Human body pose estimation requires a large amount of data in which human body key points are labeled in images. Such keypoint labels may be synthesized in the image, created manually, or identified automatically. Automatically identified labels consume the fewest human and computing resources while still offering reasonable accuracy. However, automatic label identification often requires specific imaging techniques that are not available on conventional imaging devices. A human body pose estimation mechanism that labels key points of a human body in images captured by a conventional camera would therefore be highly advantageous.
Disclosure of Invention
Accordingly, there is a need for a human body pose estimation mechanism for marking human body keypoints in images, particularly images taken by conventional cameras (e.g., cell phone cameras or augmented reality (AR) glasses cameras). To this end, the present application aims at marking key points in images captured by a second camera (such as a conventional RGB camera) using the automatic marker detection function of a first camera. The first camera and the second camera are synchronized in time and, more importantly, spatially aligned to determine a physical correlation between the two coordinates of the first camera and the second camera. The physical correlation is optionally represented by a rotation and translation matrix. The first camera captures a first image of the scene, in which an object carries a plurality of physical markers whose positions can be readily obtained from the first image. According to the physical correlation between the two coordinates of the first camera and the second camera, the positions of the physical markers in the first image are converted into positions of a plurality of key points in a second image captured by the second camera. Optionally, additional missing keypoints are filled in on the second image based on the plurality of keypoints. The keypoints and/or additional missing keypoints are thereby annotated on the second image automatically and without user intervention.
According to one aspect, a method of automatically annotating an image is performed at a computer system. The method includes acquiring a first image of a scene captured by a first camera. The first image records a plurality of markers. Each marker is attached to an object and corresponds to a keypoint on the object in the first image. The method further includes acquiring a second image of the scene captured by a second camera simultaneously with the first image. The first camera and the second camera are cameras of different types. The method further includes determining a physical correlation between a first coordinate of the first camera and a second coordinate of the second camera; identifying locations, in the first coordinate, of the plurality of markers recorded in the first image; and converting the positions of the plurality of markers in the first coordinate into positions of a plurality of key points in the second coordinate according to the physical correlation between the first coordinate and the second coordinate. The method further includes annotating the second image with the plurality of keypoints automatically and without user intervention. In some embodiments, the second image annotated with the plurality of keypoints is used to train a deep learning model.
According to another aspect, some embodiments include a computer system comprising one or more processors and memory having instructions stored therein, the instructions being executable by the one or more processors to implement any of the methods described above.
According to yet another aspect, some embodiments include a non-transitory computer-readable storage medium having instructions stored therein that are executed by one or more processors to implement any of the methods described above.
Drawings
For a better understanding of the various embodiments described herein, reference is made to the following detailed description, taken in conjunction with the accompanying drawings, in which like reference numerals refer to like parts throughout the drawings.
FIG. 1 is an example data processing environment having one or more servers communicatively coupled with one or more client devices, according to some embodiments.
Fig. 2 is an example local imaging environment in which multiple imaging devices are used to capture images, in accordance with some embodiments.
FIG. 3 is an example flow diagram of an image annotation process according to some embodiments.
Fig. 4 is an example flow diagram of a process of synchronizing one or more first cameras with a second camera in a scene, in accordance with some embodiments.
Fig. 5 is an example flow chart of a process of recording calibration data from a first camera and a second camera, according to some embodiments.
Fig. 6A and 6B are two test images for spatially calibrating a first camera 202 and a second camera 204, and fig. 6C is a flow chart of a process 306 for spatially calibrating the first camera and the second camera, according to some embodiments.
Fig. 7 is a flow chart of a process of determining human keypoints in a scene, according to some embodiments.
Fig. 8A is an image captured by a camera according to some embodiments, fig. 8B is an image including a two-dimensional bone model superimposed on the image shown in fig. 8A according to some embodiments, and fig. 8C is a flowchart of an image annotation process according to some embodiments.
Fig. 9A is an image captured by a camera in accordance with some embodiments, and fig. 9B is coordinates including a two-dimensional skeletal model of a human body corresponding to the image shown in fig. 9A in accordance with some embodiments.
Fig. 10 is a flow chart of a method of automatically annotating an image in accordance with some embodiments.
FIG. 11 is a schematic diagram illustrating a computer system according to some embodiments.
Like reference numerals refer to like parts throughout the various views of the drawings.
Detailed Description
Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives can be used and that the subject matter can be practiced without these specific details without departing from the scope of the claims. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented at a variety of electronic devices having digital video capabilities.
Various embodiments of the present application are directed to automatically labeling key points in an image captured by a second camera (e.g., a conventional RGB camera) using an automatic marker detection function of a first camera. The first camera and the second camera are automatically synchronized in time and automatically calibrated in space, and a physical correlation between the two coordinates of the first camera and the second camera is determined. The positions of physical markers are detected in a first image taken by the first camera and converted to positions in a second image taken by the second camera simultaneously with the first image. Keypoints are automatically identified and annotated on the second image based on the converted positions of the physical markers and other known information. Optionally, first key points are marked on the first image taken by the first camera according to the positions of the physical markers. The marked first key points are converted to corresponding second key points in the second image taken by the second camera simultaneously with the first image, so that the second key points are quickly and accurately annotated on the second image.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. The one or more client devices 104 may be, for example, a desktop computer 104A, a tablet computer 104B, a cell phone 104C, an imaging device 104D, a head mounted display (also referred to as AR glasses) 104E, or a smart, multi-sensing, network-connected home device (such as a thermostat). Each client device 104 may collect data or user input, execute a user application, and display the output results on its user interface. The collected data or user input may be processed locally (e.g., for training and/or prediction) on the client device 104 and/or remotely by the server 102. One or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104 and, in some embodiments, process data and user inputs received from the client devices 104 when the client devices 104 run user applications. In some embodiments, data processing environment 100 also includes a storage device 106 for storing data related to the servers 102, the client devices 104, and applications executing on the client devices 104. For example, the storage device 106 may store video content for training a machine learning model (e.g., a deep learning network) and/or video content to which a trained machine learning model may be applied to determine one or more operations associated with the video content.
The one or more servers 102 may enable real-time data communication with client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 may perform data processing tasks that the client devices 104 cannot or prefer not to complete locally. For example, a client device 104 includes a game console that executes an interactive online game application. The game console receives user instructions and sends the user instructions to the game server 102 along with user data. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream to the game console for display together with other client devices in the same game session as the game console. As another example, the client devices 104 include a network monitoring camera 104D and a cell phone 104C. The network monitoring camera collects video data and streams the video data to the monitoring camera server 102 in real time. While the video data is optionally preprocessed on the monitoring camera 104D, the monitoring camera server 102 processes the video data to identify motion events or audio events in the video data and shares information about those events with the cell phone 104C, thereby enabling a user of the cell phone 104C to remotely monitor, in real time, events occurring in the vicinity of the network monitoring camera 104D.
One or more servers 102, one or more client devices 104, and the storage device 106 are communicatively coupled to one another through one or more communication networks 108, which are the media used to provide communication links between these devices and computers connected together in the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination thereof. Optionally, the one or more communication networks 108 are implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, Universal Serial Bus (USB), FireWire, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi, Voice over IP (VoIP), Wi-MAX, or any other suitable communication protocol. Connections to the one or more communication networks 108 may be established directly (e.g., using a 3G/4G connection to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent dedicated whole-home control node), or through any combination thereof. Thus, the one or more communication networks 108 may represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) acquired by an application running on the client device 104 to identify information contained in the content data, match the content data to other data, classify the content data, or synthesize related content data. In these deep learning techniques, a data processing model is created based on one or more neural networks to process the content data. The data processing model is trained with training data before it is applied to process the content data. In some embodiments, model training and data processing are both implemented locally on each client device 104 (e.g., client device 104C). The client device 104C obtains training data from one or more servers 102 or the storage device 106 and applies the training data to train the data processing model. After model training, the client device 104C obtains content data (e.g., captures video data via an internal camera) and processes the content data locally using the trained data processing model. Alternatively, in some embodiments, both model training and data processing are implemented remotely on a server 102 (e.g., server 102A) associated with a client device 104 (e.g., client device 104A). Server 102A obtains the training data from itself, another server 102, or the storage device 106, and applies the training data to train the data processing model. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing model, receives the data processing results from the server 102A, and displays the results on a user interface (e.g., a user interface associated with the application). The client device 104A performs no data processing itself, or only a small amount of data processing, on the content data before transmitting the content data to the server 102A. Furthermore, in some embodiments, data processing is performed locally on a client device 104 (e.g., client device 104B), while model training is performed remotely on a server 102 (e.g., server 102B) associated with the client device 104B. Server 102B obtains the training data from itself, another server 102, or the storage device 106, and applies the training data to train the data processing model. The trained data processing model is optionally stored in the server 102B or the storage device 106. The client device 104B imports the trained data processing model from the server 102B or the storage device 106, processes the content data using the data processing model, and generates data processing results to be displayed locally on a user interface.
Fig. 2 is an example local imaging environment 200 with multiple client devices 104, according to some embodiments. The plurality of client devices 104 includes a first imaging device 202A, a second imaging device 204, and the server 102. The first imaging device 202A and the second imaging device 204 are disposed in a scene and are used to capture images of respective fields of view associated with the same scene. The first imaging device 202A is a first type of imaging device (e.g., an infrared camera) and the second imaging device 204 is a second type of imaging device (e.g., a visible light camera). The second type is different from the first type. In some embodiments, the first imaging device 202A and the second imaging device 204 are communicatively coupled to each other only directly through the local area network 110 (e.g., a Bluetooth communication link). Alternatively, in some embodiments, the first imaging device 202A and the second imaging device 204 may be communicatively coupled to each other via the remote area network 108. In some embodiments, the first imaging device 202A and the second imaging device 204 are communicatively coupled to the server 102 via at least one of a local area network and a remote area network. The server 102 is configured to process images captured by the imaging devices 202A and 204 and/or to communicate images or related data between the imaging devices 202A and 204, in conjunction with the imaging devices 202A and 204. For example, the server 102 is a local computer machine disposed in the scene and communicates with the first imaging device 202A and the second imaging device 204 via a local area network.
The first imaging device 202A has a first field of view 208 and the second imaging device 204 has a second field of view 210 that shares a common portion with the first field of view 208. A human body 206 is located in the scene and simultaneously appears in the second field of view 210 of the second camera 204 and the first field of view 208 of the first camera 202A. The human body 206 is captured by the second camera 204 and the first camera 202A and is visible in both a first image captured by the first imaging device 202A and a second image captured by the second imaging device 204, albeit from two different angles. The first image corresponds to a first coordinate and the second image corresponds to a second coordinate. The physical correlation between the first and second coordinates may be calibrated and used to convert a position in the first coordinate of the first image to a position in the second coordinate of the second image.
In some embodiments, the human body 206 carries a plurality of physical markers. Each physical marker is optionally attached to a joint of the human body 206 or to a body part having a known position relative to a joint of the human body 206. When the first imaging device 202A captures a first image including the human body 206, the plurality of physical markers are recorded in the first image. From the physical markers, a plurality of key points corresponding to the human body 206 (particularly the joints of the human body 206) are identified. In some embodiments, the first imaging device 202A and the physical markers facilitate detection of the physical markers based on unique imaging characteristics. For example, the first imaging device 202A includes an infrared camera with an infrared emitter, and the physical markers are configured to reflect infrared light. The physical markers appear distinctly (e.g., with a higher brightness level) in the first image, which is an infrared image, and can therefore be identified simply and accurately by an infrared image processing algorithm. In some embodiments, the first imaging device 202A is configured to locally identify the locations of the physical markers in the first image. In some embodiments, the first imaging device 202A is configured to provide the first image to the server 102 or the second imaging device 204, and the server 102 or the second imaging device 204 is configured to identify the locations of the physical markers in the first image.
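As an illustration of this detection step, the following Python sketch thresholds a bright-marker infrared frame and returns blob centroids. It is only a minimal example, assuming an 8-bit grayscale frame and the OpenCV/NumPy libraries; the patent does not prescribe a particular detection algorithm, and the threshold and minimum-area values are hypothetical.

    import cv2
    import numpy as np

    def detect_marker_centroids(ir_image, brightness_threshold=200, min_area=4):
        # ir_image: 8-bit single-channel infrared frame in which reflective
        # markers appear much brighter than the background.
        _, binary = cv2.threshold(ir_image, brightness_threshold, 255, cv2.THRESH_BINARY)
        num_labels, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
        # Label 0 is the background; keep blobs above a small area to reject noise.
        keep = stats[1:, cv2.CC_STAT_AREA] > min_area
        return centroids[1:][keep]          # (N, 2) array of (x, y) marker centroids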
In some embodiments, the locations of the physical markers in the first image are converted into locations in the second image based on the physical correlation between the coordinates of the first image and the second image. The converted locations in the second image are used to identify the human keypoints in the second image. Alternatively, in some embodiments, the keypoints of the human body 206 are identified and tracked based on the locations of the physical markers in the first image captured by the first imaging device 202A. The keypoints associated with the physical markers in the first image are converted into keypoints in the second image captured by the second camera 204 according to the physical correlation between the coordinates of the first image and the second image. The keypoints in the second image are connected together to generate a skeletal model of the human body 206 in the second image, thereby correlating the keypoints in the second image with different body parts of the human body 206. In this way, these keypoints may be annotated on the second image in relation to body parts (e.g., joints) of the human body 206.
In some embodiments, the plurality of client devices 104 includes a plurality of first imaging devices 202, including the first imaging device 202A. The plurality of first imaging devices 202 are fixed at different locations in the scene and have different first fields of view 208 that optionally overlap one another. As the human body 206 changes position in the scene, the human body 206 remains in the second field of view 210 of the second imaging device 204 at all times, for example, because the second imaging device 204 is adjusted such that the human body 206 remains partially or fully in the second field of view 210. In some cases, at a first moment in time, the human body 206 is located in the first field of view of the first imaging device 202A. At a second moment in time, the human body 206 is not present in the first field of view of the first imaging device 202A, but is present in the first field of view of a first imaging device 202B. For the first moment, keypoints associated with body parts of the human body 206 are identified from a first image taken by the first imaging device 202A and converted into the second coordinate of a second image taken by the second imaging device 204 simultaneously with the first image. For the second moment, keypoints related to body parts of the human body 206 are identified from a third image taken by the first imaging device 202B and converted into the second coordinate of another second image taken by the second imaging device 204 simultaneously with the third image.
Thus, as the human body 206 moves in the scene, a first sequence of images taken sequentially by the plurality of first imaging devices 202 is used to identify key points related to body parts of the human body 206 in a second sequence of images taken by the second imaging device 204 simultaneously with the first sequence of images. For example, the first image sequence is captured sequentially by first imaging devices 202A, 202B, 202C, 202D, and 202E and includes a respective number of consecutive first images from each of the first imaging devices 202A-202E, depending on the speed of movement of the human body 206. It should be noted that, in some contexts, the plurality of first imaging devices 202 are collectively referred to as the first camera 202, which includes a plurality of individual first cameras (e.g., the first camera 202A).
In some embodiments, at a particular moment in time, the human body 206 is present in the first fields of view of both first imaging devices 202A and 202B as well as in the second field of view of the second imaging device 204. One of the two images taken by the two first imaging devices 202A and 202B is selected, in accordance with device selection criteria, to identify key points associated with body parts of the human body 206 in the second image taken by the second imaging device 204. For example, according to the device selection criteria, the first imaging device 202 having a first field of view that overlaps more with the second field of view is selected to determine the keypoints in the second image. As another example, the first imaging device 202 that is physically closer to the second imaging device 204 is selected to help identify the keypoints in the second image. Alternatively, in some embodiments, both images captured by the two first imaging devices 202A and 202B are applied to help identify the key points in the second image captured by the second imaging device 204 that are related to body parts of the human body 206. In some embodiments, the two images have some common keypoints and some different keypoints and are complementary to each other for identifying the keypoints of the human body 206. The identified keypoints of the human body are the union of the common keypoints and the differing keypoints of the two images. In some embodiments, both images have the same keypoints, which are identified as the keypoints of the human body 206. The first imaging devices 202A and 202B share the same first coordinate (e.g., one that is fixed relative to the scene), and the identified keypoints are converted into the second coordinate of a second image captured by the second imaging device 204 simultaneously with both of the two images.
Fig. 3 is an example flow diagram of an image annotation process 300 according to some embodiments. The image annotation process 300 is performed jointly by the first camera 202 and the second camera 204. In some embodiments, the image annotation process 300 also involves the server 102, which, in conjunction with the first camera 202 and the second camera 204, processes images and/or communicates images or related data between the first camera 202 and the second camera 204. In one example, the first camera 202 is an infrared camera and the first image captured by the first camera 202 is an infrared image. Alternatively, in another example, the first camera 202 is a high-end visible light camera system with built-in marker detection or key point annotation functionality. Such a first camera 202, however, is expensive and cannot be integrated into conventional consumer electronics devices. Conversely, in one example, the second camera 204 may be mounted on the cell phone 104C or the AR glasses 104E for capturing color images, monochrome images, or depth images. Thus, the second image captured by the second camera 204 may be annotated with keypoints according to the marker detection or keypoint annotation functionality of the first camera 202.
Specifically, the second image captured by the second camera 204 is correlated with the first image captured by the first camera 202, which records the physical markers attached to the human body 206. The first image captured by the first camera 202 is synchronized (302) with the second image captured by the second camera 204. Each image captured by the first camera 202 and the second camera 204 is associated with a timestamp that records the time at which the image was captured. The first images captured by the first camera 202 have a first frame rate (e.g., 240 frames per second (FPS)), and the second images captured by the second camera 204 have a second frame rate (e.g., 30 FPS). In some embodiments, each second image is associated with its temporally closest first image, whether the closest first image was taken earlier or later than the second image. Alternatively, in some embodiments, each second image is associated with its temporally closest first image that was taken earlier than the second image. Alternatively, in some embodiments, each second image is associated with its temporally closest first image that was taken later than the second image. Alternatively, in some embodiments, each second image is associated with its two temporally closest first images, one of which is captured earlier than the second image and the other of which is captured later than the second image.
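A minimal sketch of this frame association, assuming timestamps already share a common time base and using NumPy; the matching policy shown (nearest neighbor regardless of direction) is only one of the alternatives described above.

    import numpy as np

    def associate_frames(first_timestamps, second_timestamps):
        # For each second-camera frame (e.g., 30 FPS), return the index of the
        # temporally closest first-camera frame (e.g., 240 FPS).
        first_ts = np.asarray(first_timestamps, dtype=float)
        return [int(np.argmin(np.abs(first_ts - float(t)))) for t in second_timestamps]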
The first camera 202 is associated with a first coordinate and the second camera 204 is associated with a second coordinate. The first and second coordinates are physically related according to a physical correlation (e.g., represented by a rotation and translation matrix). An object located at a first position in the first coordinate is located at a second position in the second coordinate, and the first position is related to the second position according to the physical correlation. In some embodiments, the physical correlation between the first and second coordinates includes a plurality of displacement parameters related to a three-dimensional displacement between the first and second coordinates, and a plurality of rotation parameters related to a three-dimensional rotation between the first and second coordinates. Calibration is performed to determine the physical correlation (304). In particular, the same object may be captured by the first camera 202 and the second camera 204, and its first location in the first image and second location in the second image are collected as calibration data. The calibration data are used to determine the physical correlation between the first coordinate and the second coordinate (306). In one example, the object includes a plurality of points disposed at fixed positions relative to a checkerboard (e.g., in fig. 6A and 6B).
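The rotation and translation representation mentioned above can be applied to a 3D point as in the following sketch, a hypothetical NumPy helper; the patent itself does not fix a particular parameterization of the physical correlation.

    import numpy as np

    def first_to_second(points_first, R, t):
        # Map 3D points from the first coordinate to the second coordinate:
        # p_second = R * p_first + t, with R a 3x3 rotation matrix and t a 3-vector.
        points_first = np.asarray(points_first, dtype=float).reshape(-1, 3)
        return (R @ points_first.T).T + np.asarray(t, dtype=float)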
In some embodiments, a plurality of first images are captured by the first camera 202 and a plurality of second images are captured by the second camera 204 simultaneously with the first images. The first camera 202 and the second camera 204 thereby record human motion data in the first images and the second images (308).
From the first image, information of the physical markers attached to the human body 206 is extracted. In some embodiments, the information of the physical markers is converted into locations of the physical markers in the first image, and these locations are used, together with the physical correlation, to identify the keypoints in the second image. Alternatively, in some embodiments, the information of the physical markers is converted into locations of keypoints in the first image that are related to body parts of the human body 206, and the locations of the keypoints are converted into locations in the second image using the physical correlation. In some embodiments, the physical markers do not correspond to key points associated with all body parts of the human body 206, and one or more additional keypoints are missing. The one or more additional missing keypoints are derived (e.g., by interpolation) from the keypoints converted from the information of the physical markers attached to the human body 206. Accordingly, data post-processing (310) is performed to calculate the keypoints associated with the physical markers and to derive the one or more additional missing keypoints from the first image and/or the physical correlation.
According to the physical correlation between the first coordinate of the first camera 202 and the second coordinate of the second camera 204, the key points related to the first image are converted into key points related to body parts of the human body 206 in the second image captured by the second camera 204. Annotations associated with the keypoints of the first image are also projected into the second image captured by the second camera 204 (312). Alternatively, the positions of the physical markers associated with the first image are converted into positions in the second image captured by the second camera 204 according to the physical correlation, and the converted positions of the physical markers are used to identify key points associated with body parts of the human body 206 in the second image. Through these methods, process 300 is used to generate data for human-body-related algorithms. In view of the growing trend toward data-driven algorithms, accurate and automated data annotation and generation become critical, and process 300 provides a data generation scheme that does not require a large amount of manual annotation.
In some embodiments, the second image is related to the two first images closest to it in time, one of which is captured earlier than the second image and the other of which is captured later than the second image. The two temporally closest first images are optionally taken by the same first camera 202 or by different first cameras 202. The two first images have two different sets of keypoints, which are temporally interpolated and converted into the keypoints in the second image taken by the second camera 204 that are related to body parts of the human body 206. Alternatively, the two different sets of keypoints of the two first images are converted into two different sets of keypoints in the second coordinate, from which the keypoints in the second image are temporally interpolated.
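For illustration, a linear temporal interpolation between the two bracketing keypoint sets could look like the sketch below (a hypothetical helper; the patent does not specify the interpolation scheme).

    import numpy as np

    def interpolate_keypoints(kp_before, kp_after, t_before, t_after, t_second):
        # Blend two keypoint sets (N x 2 or N x 3 arrays) captured at t_before
        # and t_after toward the capture time t_second of the second image.
        w = (t_second - t_before) / (t_after - t_before)
        return (1.0 - w) * np.asarray(kp_before, dtype=float) + w * np.asarray(kp_after, dtype=float)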
Fig. 4 is an example flow diagram of a process 302 of synchronizing one or more first cameras 202 with a second camera 204 in a scene, according to some embodiments. The server 102 includes one of a local computer machine located in the scene and a remote server communicatively coupled to the first camera 202 and the second camera 204 through one or more communication networks 108 (402). When the server 102 comprises a local computer machine, the one or more first cameras 202 and the second camera 204 are each coupled to the local computer machine by a local area network (e.g., a WiFi network) or a wired link. When the server 102 comprises a remote server, each of the one or more first cameras 202 and the second camera 204 is coupled to the remote server 102 via at least a wide area network (e.g., a cellular network). A software application runs on the local computer machine or remote server to receive image data from the one or more first cameras 202 and the second camera 204 and to process the image data as needed.
The system times tracked by the first camera 202 and the second camera 204 may not be exactly the same, and calibration is therefore required to ensure that the image data captured by the first camera 202 and the second camera 204 are synchronized. In some embodiments, the second camera 204 sends a test signal (404) to the server 102 at multiple times, e.g., every half second for five seconds. At each instant, the test signal includes a first timestamp recording the time of transmission tracked according to the second camera time. The server 102 receives the test signal and records a second timestamp of the time of receipt tracked according to the server time. Specifically, the server 102 receives the test signals, determines a time difference between the second camera time and the server time, and optionally averages the time differences of the different test signals. In some embodiments, the server 102 determines a delay time of the test signal (406) and subtracts the delay time from the time difference. In some embodiments, the delay time is negligible compared with the time difference. The time difference is then used to synchronize the second camera 204 and the server 102 (408). In some embodiments, each first camera 202 is similarly synchronized with the server 102, so that the times tracked by the first camera 202, the second camera 204, and the server 102 can be calibrated to one another.
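A minimal sketch of this offset estimation, assuming matched send and receive timestamps in seconds and an optional estimated link delay (all names hypothetical):

    import numpy as np

    def estimate_clock_offset(send_times_camera, receive_times_server, link_delay=0.0):
        # Average the per-signal differences between server receive time and
        # camera send time, then subtract the estimated transmission delay.
        diffs = np.asarray(receive_times_server, dtype=float) - np.asarray(send_times_camera, dtype=float)
        return float(np.mean(diffs)) - link_delay

The resulting offset is then applied to one device's timestamps so that both image streams share a common time base.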
In some embodiments, the first camera 202 and the second camera 204 communicate data directly with each other and the server 102 is not involved in the image annotation process 300. The first camera 202 sends a test signal to the second camera 204, and the second camera 204 tracks the time of receipt and determines the time difference between the first camera 202 and the second camera 204. The second camera 204 optionally subtracts the delay time from the time difference. The time difference is used to synchronize the first camera 202 and the second camera 204. Conversely, in some embodiments, the second camera 204 sends a test signal to the first camera 202, the first camera 202 tracks the time of receipt and determines the time difference between the first camera 202 and the second camera 204. The time difference is used to synchronize the first camera 202 and the second camera 204, wherein the delay time is optionally subtracted from the time difference.
Fig. 5 is an example flowchart of a process 304 of recording calibration data from a first camera and a second camera, according to some embodiments. A plurality of physical markers are attached to an object (e.g., the human body 206). In some embodiments, each physical marker transmits a signal at a predefined marker frequency (e.g., 0-200 Hz). After the first camera 202 begins recording the first images (502), the first images are captured to record the three-dimensional positions of the plurality of physical markers (504). The first camera 202 sends the first images frame by frame to the server 102 (504). Each first image includes a first timestamp that records a first frame time at which the corresponding first image was captured. The server 102 receives and stores the first images 508 taken by the first camera 202, including the three-dimensional positions of the physical markers and the corresponding first timestamps (506). After the second camera 204 begins recording the second images (502), the second camera 204 captures the second images (510) and sends the second images to the server 102 on a frame-by-frame basis. Each second image 512 includes a second timestamp that records a second frame time at which the corresponding second image was captured by the second camera 204. The system times of the first camera 202, the second camera 204, and the server 102 have been calibrated beforehand, and therefore at least one of the first timestamp and the second timestamp is adjusted to synchronize the first images captured by the first camera 202 with the second images captured by the second camera 204.
Fig. 6A and 6B are two test images 600 and 620 for spatially calibrating the first camera 202 and the second camera 204, and fig. 6C is a flow chart of a process 306 for spatially calibrating the first camera 202 and the second camera 204, according to some embodiments. The first camera 202 captures a first test image 600 (602), the second camera 204 captures a second test image 620 (604), and the second test image 620 and the first test image 600 are captured simultaneously in the same scene. Because of the different positions and orientations of the first camera 202 and the second camera 204, the first test image 600 and the second test image 620 are taken from two different perspectives. Both the first test image 600 and the second test image 620 include a checkerboard 606 and a plurality of physical markers 608 (e.g., four markers 608A, 608B, 608C, and 608D) disposed at a plurality of predetermined positions relative to the checkerboard 606. The first test image 600 is associated with a first timestamp recording a first frame time at which the first test image 600 was captured. The second test image 620 is associated with a second timestamp recording a second frame time at which the second test image 620 was captured. In one example, the first camera 202 is an RGB camera capturing test images at a first frame rate of 200 FPS, and the second camera 204 is a camera integrated in the cell phone 104C capturing test images at a second frame rate of 30 FPS. In some embodiments, the first test image 600, the first timestamp, the second test image 620, and the second timestamp are integrated (synchronized) and processed in the server 102 or the second camera 204 to spatially calibrate the first camera 202 and the second camera 204, i.e., to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204.
The first test image 600 is part of a sequence of consecutive first test images taken by the first camera 202, and the second test image 620 is part of a sequence of consecutive second test images taken by the second camera 204. For each second test image 620, the server 102 identifies the temporally closest first test image 600 (610), which is taken substantially simultaneously with the second test image 620. In some embodiments, each second test image 620 is associated with its temporally closest first test image 600, whether the closest first test image 600 was taken earlier or later than the second test image 620. Alternatively, in some embodiments, each second test image 620 is associated with its temporally closest first test image 600 taken earlier than the second test image 620. Alternatively, in some embodiments, each second test image 620 is associated with its temporally closest first test image 600 taken later than the second test image 620. Alternatively, in some embodiments, each second test image is associated with its two temporally closest first test images, one of which is captured earlier than the second test image and the other of which is captured later than the second test image.
In some embodiments, each long side of the checkerboard 606 is approximated by a pair of physical markers 608. The locations of the physical markers 608 are readily derived in the first coordinate from the first test image 600. In addition to the physical markers 608, the positions of the four corners 612 of the checkerboard 606 and the positions of the midpoints 614 between the physical markers 608 in the first test image 600 are readily obtained from the positions of the physical markers 608. The locations of a subset or all of the physical markers 608, the four corners 612, and the midpoints 614 are identified in the second test image 620 and compared with the corresponding locations in the first test image 600 to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204.
In some embodiments, a Perspective-n-Point (PnP) method may be used to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204. Alternatively, in some embodiments, a random sample consensus (RANSAC) method is used to determine the physical correlation between the coordinate systems of the first camera 202 and the second camera 204 (616). Optionally, the physical correlation (618) is represented by a rotation and translation matrix that relates the locations of the physical markers 608, corners 612, or midpoints 614 identified from the first test image 600 to the corresponding locations in the second coordinate of the second camera 204.
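As a concrete illustration, OpenCV's RANSAC-based PnP solver can estimate such a rotation and translation from 3D marker positions expressed in the first coordinate and their detected pixel locations in a second test image. This is only a sketch under the assumption that the second camera's intrinsic matrix K and distortion coefficients are known; the patent does not mandate this particular solver.

    import cv2
    import numpy as np

    def calibrate_extrinsics(marker_points_3d, marker_pixels_2d, K, dist_coeffs):
        # 3D points are expressed in the first coordinate; 2D points are their
        # detected pixel locations in the second camera's test image.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            np.asarray(marker_points_3d, dtype=np.float32),
            np.asarray(marker_pixels_2d, dtype=np.float32),
            K, dist_coeffs)
        if not ok:
            raise RuntimeError("RANSAC PnP did not converge to a pose")
        R, _ = cv2.Rodrigues(rvec)      # 3x3 rotation matrix
        return R, tvec.reshape(3)       # translation vector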
In other words, the physical correlation between the first coordinate and the second coordinate is determined as follows. One or more first test images 600 of the scene are obtained from the first camera 202, from which a plurality of first test marks (e.g., 608, 612, or 614 in fig. 6A) are to be detected, and one or more second test images 620 of the scene are obtained from the second camera 204, from which a plurality of second test marks (e.g., 608, 612, or 614 in fig. 6B) are to be detected. The second test images 620 are captured at the same times as the first test images 600. The first test marks have known physical positions in the scene relative to the second test marks. The first test marks and the second test marks are detected from the first test images 600 and the second test images 620, respectively.
The physical correlation between the first coordinate and the second coordinate is determined based on the known physical locations of the first test marks in the scene relative to the second test marks. In one example, the first test marks detected from the first test image 600 are the physical markers 608, and the second test marks detected from the second test image 620 are the corners 612 and midpoints 614. The locations of the physical markers 608 in each first test image 600 are used to derive the locations of the corners 612 and midpoints 614 in the first test image 600. The locations of the corners 612 and midpoints 614 derived in the first test image 600 are compared with the corresponding locations in the second test image 620 to determine the physical correlation. More specifically, the plurality of first test marks includes a first marking unit (e.g., 608A) and a second marking unit (e.g., 608B), and the plurality of second test marks includes a third marking unit (e.g., 614A) located at the midpoint between the locations of the first marking unit and the second marking unit.
Further, in some embodiments, the one or more first test images 600 include a sequence of first image frames, and each first test mark is contained in a subset of the first test images 600. The one or more second test images 620 comprise a sequence of second image frames, each second test mark corresponds to a respective subset of the first test marks, and each second test mark is contained in a subset of the second test images 620. Further, in some embodiments, at least a subset of the first test marks and the second test marks are attached to the checkerboard. The checkerboard moves through a plurality of checkerboard poses in the scene and is recorded in the first and second sequences of image frames 600 and 620. The checkerboard poses differ from one another in position or orientation.
In some embodiments, the one or more first test images comprise a single first test image 600 and the one or more second test images comprise a single second test image 620. Each second test mark corresponds to (e.g., is derived from) a respective subset of the first test marks. In addition, in some embodiments not illustrated in fig. 6A-6C, the first test marks and the second test marks are marked on a three-dimensional box having a plurality of sides, each of which is covered by a checkerboard pattern. The three-dimensional box is recorded in the first test image 600 and the second test image 620.
Fig. 7 is a flow diagram of a process 700 of determining human keypoints in a scene, according to some embodiments. A plurality of physical markers 608 are attached to known locations of the human body 206 (e.g., joints or other known locations displaced from the joints of the human body 206). The first camera 202 and the second camera 204 are controlled to capture images of the human body 206 including the physical markers 608. The first camera 202 is configured such that the physical markers 608 are significantly highlighted (e.g., have a different color or a higher brightness level) in the first image captured by the first camera 202, thereby facilitating detection of the physical markers 608 in the first image. The first camera 202 or the server 102 detects (702) the positions of the plurality of physical markers 608 in the first coordinate associated with the first camera 202 from the first image. The positions of the plurality of physical markers 608 in the first coordinate are converted into the second coordinate associated with the second camera 204 using the previously calibrated physical correlation (704). Since each physical marker 608 is associated with a body part when attached to the human body 206, both the position of the physical marker 608 in the first image and the converted position of the physical marker 608 in the second image are annotated with information of that body part of the human body 206.
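A sketch of this conversion step, projecting marker positions given in the first coordinate into pixel coordinates of the second image with OpenCV; it assumes the extrinsics (rvec, tvec) from the calibration above and the second camera's intrinsics K, and is only illustrative.

    import cv2
    import numpy as np

    def project_markers_to_second_image(marker_points_3d, rvec, tvec, K, dist_coeffs):
        # Apply the calibrated rotation/translation and the second camera's
        # intrinsics to obtain (x, y) pixel locations in the second image.
        pixels, _ = cv2.projectPoints(
            np.asarray(marker_points_3d, dtype=np.float32), rvec, tvec, K, dist_coeffs)
        return pixels.reshape(-1, 2)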
In some embodiments, the first camera 202 and the second camera 204 continuously capture a first image sequence and a second image sequence, respectively. At certain moments, a subset of the physical markers 608 cannot be tracked in a subset of the first image sequence, for example, because of rapid movement or occlusion. The trajectory of each physical marker 608 is tracked and fitted (706) from the first image sequence, and, optionally, the positions of the missing physical markers 608 at those moments are filled in for the subset of the first image sequence. Alternatively, in some embodiments, the locations of the missing physical markers 608 are filled in by interpolation from the known locations of adjacent physical markers 608.
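The temporal gap filling could, for example, interpolate each coordinate of a marker's trajectory over time, as in this sketch (a hypothetical helper; occluded samples are assumed to be stored as NaN rows).

    import numpy as np

    def fill_missing_positions(timestamps, positions):
        # positions: (N, 3) trajectory of one marker with NaNs where it was
        # occluded; each axis is interpolated over time across the gaps.
        positions = np.asarray(positions, dtype=float).copy()
        t = np.asarray(timestamps, dtype=float)
        for axis in range(positions.shape[1]):
            col = positions[:, axis]
            missing = np.isnan(col)
            if missing.any() and (~missing).any():
                col[missing] = np.interp(t[missing], t[~missing], col[~missing])
        return positions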
In some cases, the physical markers 608 are attached to the surface of the human body 206 and are offset from the corresponding key points inside the human body 206. In some embodiments, the locations of the physical markers 608 are used to determine the shape of a body part of the human body 206. For example, the geometric center of adjacent physical markers 608 is determined to locate a particular keypoint.
As described above, the plurality of keypoints correspond to one or more human bodies (e.g., the human body 206), and in process 700 the second image is annotated with the plurality of keypoints. Optionally, the converted positions of the plurality of markers 608 in the second coordinate of the second image are in one-to-one correspondence with the plurality of key points. In some embodiments, additional marker positions are interpolated (708) from the converted positions of the plurality of markers 608, the additional marker positions being associated with additional keypoints that are not among the plurality of keypoints corresponding to the plurality of markers 608. In some embodiments, a subset of the converted positions of the plurality of markers 608 in the second coordinate is fitted to identify the geometric center of that subset of converted positions (710). From the geometric center, a keypoint of the human body 206 corresponding to the subset of converted marker positions is further derived (712). In some embodiments, the second image annotated with the plurality of keypoints may be used to train a deep learning model.
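For illustration, deriving a keypoint from a subset of converted marker positions and interpolating an additional keypoint could look like the following sketch (hypothetical helpers; the patent does not prescribe specific fitting rules).

    import numpy as np

    def keypoint_from_markers(marker_positions):
        # A keypoint taken as the geometric center of the surface-mounted
        # markers surrounding the corresponding body part.
        return np.asarray(marker_positions, dtype=float).mean(axis=0)

    def interpolate_additional_keypoint(kp_a, kp_b, ratio=0.5):
        # An additional keypoint without its own marker, placed at a fixed
        # ratio along the segment between two existing keypoints.
        return (1.0 - ratio) * np.asarray(kp_a, dtype=float) + ratio * np.asarray(kp_b, dtype=float)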
Fig. 8A is an image 800 taken by a camera (e.g., the second camera 204) according to some embodiments, fig. 8B is an image 820 including a two-dimensional skeletal model 822 superimposed over the image shown in fig. 8A according to some embodiments, and fig. 8C is a flowchart of an image annotation process 850 according to some embodiments. The first camera 202 captures a first image (802). The second camera 204 captures a second image 800 in the same scene at the same time that the first image is captured (804). The temporally closest first image, taken substantially simultaneously with the second image 800, is identified (806). The positions of the plurality of physical markers 608 are projected from the first coordinate associated with the first camera 202 onto positions in the second coordinate associated with the second camera 204 using the pre-calibrated physical correlation between the first coordinate and the second coordinate, e.g., using a rotation and translation matrix (808).
In some embodiments, a subset of the positions in the second coordinate are directly related to body parts of the human body 206, and keypoints are annotated on the second image according to this subset of positions (810). Alternatively, in some embodiments, these positions are used to interpolate some missing keypoints of the human body 206. Additionally and optionally, in some embodiments, these positions are displaced from certain body parts of the human body 206 and are therefore used to identify the true geometric centers of the body parts, which are labeled with one or more keypoints. Referring to fig. 8B, the two-dimensional skeletal model 822 of the human body 206 includes a plurality of key points 824 that are labeled according to the positions of the plurality of physical markers 608 attached to the human body 206. The plurality of keypoints 824 are connected to form the two-dimensional skeletal model 822.
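Connecting the annotated keypoints into a 2D skeleton overlay could be done as in the sketch below. The joint names and edge list are hypothetical, since the patent does not fix a particular skeleton topology; OpenCV drawing calls are assumed.

    import cv2

    # Hypothetical skeleton topology (joint-name pairs to connect).
    SKELETON_EDGES = [
        ("head", "neck"), ("neck", "left_shoulder"), ("neck", "right_shoulder"),
        ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
        ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
        ("neck", "pelvis"), ("pelvis", "left_knee"), ("left_knee", "left_ankle"),
        ("pelvis", "right_knee"), ("right_knee", "right_ankle"),
    ]

    def draw_skeleton(image, keypoints):
        # keypoints: dict mapping joint name to (x, y) pixel coordinates in the
        # second image; draws limbs and joint markers on the image in place.
        for a, b in SKELETON_EDGES:
            if a in keypoints and b in keypoints:
                pa = tuple(int(v) for v in keypoints[a])
                pb = tuple(int(v) for v in keypoints[b])
                cv2.line(image, pa, pb, (0, 255, 0), 2)
                cv2.circle(image, pa, 4, (0, 0, 255), -1)
                cv2.circle(image, pb, 4, (0, 0, 255), -1)
        return image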
Fig. 9A is an image 900 captured by a camera in accordance with some embodiments, and fig. 9B shows a coordinate space 920 including a two-dimensional bone model 922 corresponding to the human body shown in fig. 9A in accordance with some embodiments. The image 900 includes the human body 206 with a plurality of physical markers 608 attached. In some embodiments, the locations of the physical markers 608 in the first coordinate of the image 900 are converted to locations in the second coordinate 920 of the second image, and the converted locations are used to mark and annotate keypoints 924 in the second coordinate 920 of the second image. Alternatively, in some embodiments, the locations of the physical markers 608 in the first coordinate of the image 900 are used to mark and annotate first keypoints in the first coordinate of the first image. The positions of the first keypoints in the first coordinate are then converted into positions in the second coordinate 920 of the second image, and the keypoints 924 are annotated at the converted positions. Referring to fig. 9B, the keypoints 924 identified for the second image are shown separately, overlaid on the coordinate space 920 associated with the second image and the second camera 204, without using the image 900 or the second image as a background. The keypoints 924 are connected to form the two-dimensional skeletal model 922.
In some embodiments, a marker 608 is attached to a keypoint 924 of the human body (e.g., corresponding to the neck), and the keypoint 924 is annotated according to the location of that marker 608. In some embodiments, a marker 608 is attached to a keypoint 924 of the human body but is missing in the image. The keypoint 924 is then interpolated from neighboring keypoints 924 in the same image, or from the same keypoint identified in previous and/or subsequent images. In some embodiments, markers 608 are disposed adjacent to a keypoint 924 of the human body 206, and the keypoint 924 is derived from the locations of those markers 608. For example, the location of a marker 608 is used, together with other markers 608, to identify the geometric center of a body part (e.g., the head), and, based on the identified geometric center, a keypoint 924 is annotated on the body part.
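For the case where a keypoint is missing in one frame, the temporal interpolation from the same keypoint in the previous and subsequent images could look like the following sketch (the helper name and the use of plain linear interpolation between timestamps are assumptions).

```python
import numpy as np

def interpolate_missing_keypoint(prev_xy, next_xy, prev_time, next_time, time):
    """Linearly interpolate a keypoint that is missing at `time` from the same
    keypoint found in the previous and subsequent images.

    prev_xy, next_xy: (x, y) positions of the keypoint in the neighboring frames.
    prev_time, next_time, time: timestamps of those frames and the missing frame.
    """
    alpha = (time - prev_time) / (next_time - prev_time)
    return (1.0 - alpha) * np.asarray(prev_xy) + alpha * np.asarray(next_xy)
```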
Fig. 10 is a flowchart of a method 1000 of automatically annotating an image in accordance with some embodiments. For convenience, the method 1000 is described as being performed by a computer system (e.g., a client device 104, a server 102, or a combination thereof). Examples of the client device 104 include a cell phone 104C and AR glasses 104E. In one example, the method 1000 is used to annotate human keypoints in captured images. For example, an image may be captured by a second camera (e.g., a camera of the cell phone 104C or AR glasses 104E) and annotated locally, or streamed to the server 102 (e.g., stored in the storage device 106 or a database of the server 102) for annotation. The same person is also contained in a first image taken by a first camera, and the markers associated with the keypoints to be annotated can be readily derived from the first image and used to guide the annotation of the keypoints in the second image.
Optionally, the method 1000 is governed by instructions that are stored in a non-transitory computer-readable storage medium and executed by one or more processors of the computer system. Each of the operations shown in fig. 10 may correspond to instructions stored in a computer memory or non-transitory computer-readable storage medium (e.g., memory 1106 of the computer system 1100 in fig. 11). The computer-readable storage medium may include a magnetic or optical disk storage device, a solid state storage device (e.g., flash memory), or another non-volatile storage device. The instructions stored on the computer-readable storage medium may include at least one of source code, assembly language code, object code, or another instruction format that is interpretable by one or more processors. Some operations in the method 1000 may be combined and/or the order of some operations may be changed.
Specifically, the computer system acquires a first image of a scene that is captured by the first camera 202 and records a plurality of markers 608 (1002), and acquires a second image of the scene that is captured by the second camera 204 simultaneously with the first image (1004). The first camera 202 and the second camera 204 are different types of cameras. Each marker 608 is attached to an object (e.g., the human body 206) and corresponds to a keypoint on the object in the first image. In some embodiments, the first image comprises an infrared image in which the plurality of markers, which are attached to the object, appear as distinct optical features (e.g., brightness levels). In some embodiments, the second camera 204 is mounted on a mobile device or AR glasses. In some embodiments, the second image is one of a color image, a monochrome image, and a depth image. In some embodiments, the first camera is synchronized with the second camera, and each of the first image and the second image is identified by a respective timestamp.
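Because the markers appear as distinct bright features in the infrared image, their image locations can be recovered with simple thresholding and blob centroids, as in the sketch below. The brightness threshold is an assumed value, the input is assumed to be a single-channel 8-bit frame, and the contour call follows the OpenCV 4.x return convention.

```python
import cv2

def detect_ir_markers(ir_image, brightness_threshold=220):
    """Locate retroreflective markers in an infrared frame, where they appear
    as small, very bright blobs.

    ir_image: single-channel uint8 infrared frame.
    Returns a list of (x, y) blob centroids in the first camera's image plane.
    """
    _, mask = cv2.threshold(ir_image, brightness_threshold, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for contour in contours:
        m = cv2.moments(contour)
        if m["m00"] > 0:                      # skip degenerate blobs
            centers.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centers
```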
In some embodiments, the first camera 202 includes a plurality of first camera portions distributed in the scene. The first image is one of a plurality of images taken substantially simultaneously by the plurality of first camera portions, is captured by one of the first camera portions, and is associated with the first coordinate. The field of view of that first camera portion overlaps with the field of view of the second camera. The locations of the plurality of markers are identified in the first coordinate of the first camera using the first image. More details regarding the first camera portions are described above with reference to fig. 2.
The computer system determines a physical correlation between the first coordinate of the first camera 202 and the second coordinate of the second camera 204 (1006), and identifies the locations of the plurality of markers 608 recorded in the first image in the first coordinate of the first camera 202 (1008). In some embodiments, the physical correlation between the first and second coordinates includes a plurality of displacement parameters related to a three-dimensional displacement between the first and second coordinates, and a plurality of rotation parameters related to a three-dimensional rotation between the first and second coordinates.
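As a sketch, the three rotation parameters and three displacement parameters could be assembled into a single homogeneous transform as shown below. The axis convention, rotation order, and parameterization are assumptions; the description does not fix them.

```python
import numpy as np

def correlation_to_transform(rotation_params, displacement_params):
    """Assemble the physical correlation between the first and second
    coordinates as a 4x4 homogeneous transform.

    rotation_params: (rx, ry, rz) rotations in radians about the x, y, z axes.
    displacement_params: (tx, ty, tz) three-dimensional displacement.
    """
    rx, ry, rz = rotation_params
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz), np.cos(rz), 0],
                   [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx          # assumed Z-Y-X composition order
    T[:3, 3] = displacement_params
    return T
```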
Based on the physical correlation between the first and second coordinates, the computer system converts the locations of the plurality of markers in the first coordinate to locations of a plurality of keypoints in the second coordinate (1010). The computer system annotates the second image with the plurality of keypoints automatically and without user intervention (1012). In some embodiments, a deep learning model is trained with the second image annotated with the plurality of keypoints (1014). In some embodiments, the plurality of keypoints corresponds to one or more human bodies, and the second image is annotated with the plurality of keypoints, the converted positions of the markers being in one-to-one correspondence with the plurality of keypoints. In some embodiments, a plurality of first keypoints are identified on the first image based on the positions of the plurality of markers, and the positions of the plurality of first keypoints in the first image are converted into the positions of the plurality of keypoints in the second image. Alternatively, in some embodiments, the locations of the plurality of markers in the first coordinate are converted to locations of the plurality of markers in the second coordinate of the second image, and the plurality of keypoints are identified in the second image according to the converted positions of the markers. In some embodiments, additional marker positions are interpolated from the converted positions of the plurality of markers and are associated with additional keypoints that are not among the plurality of keypoints. Alternatively, in some embodiments, the additional keypoints are directly interpolated from the plurality of keypoints in the second image. In some embodiments, a subset of the converted positions of the plurality of markers is fitted to a body part of the human body, a geometric center of the subset is identified in the second image, and, from the geometric center, a set of keypoints of the human body corresponding to the subset of the converted positions is identified.
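The annotated second image can then feed model training. The sketch below packs the annotated keypoints into a COCO-style keypoint record, which is a common format for 2D pose training data; its use here is an assumption, since the description does not prescribe an annotation format.

```python
def to_coco_keypoint_annotation(image_id, keypoints_xy, visibility,
                                annotation_id=1, category_id=1):
    """Pack annotated keypoints into a COCO-style keypoint record.

    keypoints_xy: (K, 2) pixel positions of the K keypoints in the second image.
    visibility: (K,) ints, 0 = not labeled, 1 = labeled but occluded, 2 = visible.
    """
    flat = []
    for (x, y), v in zip(keypoints_xy, visibility):
        flat.extend([float(x), float(y), int(v)])   # COCO stores [x, y, v] triples
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "keypoints": flat,
        "num_keypoints": int(sum(1 for v in visibility if v > 0)),
    }
```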
In some embodiments, referring to fig. 6A-6C, the physical correlation between the first coordinate and the second coordinate is determined as follows. The computer system obtains one or more first test images 600 of the scene from the first camera 202, in which a plurality of first test marks are to be detected, and one or more second test images 620 of the scene from the second camera 204, in which a plurality of second test marks are to be detected. The first test marks have known physical positions in the scene relative to the second test marks, and the second test images 620 are taken simultaneously with the first test images 600. The computer system detects the first test marks and the second test marks from the first test images 600 and the second test images 620, respectively. From the known physical positions of the first test marks in the scene relative to the second test marks, the computer system derives the physical correlation between the first coordinate and the second coordinate.
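One way to realize this derivation, assuming corresponding test-mark positions are available as 3D points expressed in both coordinates, is a rigid Procrustes/Kabsch fit; this is a standard technique offered here as a sketch, not the method mandated by the description.

```python
import numpy as np

def rigid_transform_from_correspondences(points_first, points_second):
    """Estimate the rotation R and translation t that map test-mark positions
    expressed in the first coordinate onto the same marks in the second
    coordinate (Kabsch algorithm).

    points_first, points_second: (N, 3) corresponding positions, N >= 3.
    Returns (R, t) such that points_second ~= points_first @ R.T + t.
    """
    c_first = points_first.mean(axis=0)
    c_second = points_second.mean(axis=0)
    H = (points_first - c_first).T @ (points_second - c_second)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_second - R @ c_first
    return R, t
```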
Further, in some embodiments, the one or more first test images 600 include a first sequence of image frames, and each first test mark is contained in a subset of the first test images 600. The one or more second test images 620 include a second sequence of image frames, and each second test mark corresponds to a respective subset of the first test marks and is contained in a subset of the second test images 620. Further, in some embodiments, subsets of the first and second test marks are attached to a checkerboard, which is moved through a plurality of checkerboard poses in the scene and recorded in the first and second sequences of image frames. The plurality of checkerboard poses have positions or orientations that differ from one another.
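Detection of the checkerboard in each recorded pose can be sketched with OpenCV's corner finder as below. The pattern size is an assumed value, and for single-channel (e.g., infrared) frames the color conversion is skipped.

```python
import cv2

def detect_checkerboard_corners(image_sequence, pattern_size=(9, 6)):
    """Collect checkerboard corner detections over a sequence of frames in
    which the board is moved through several poses.

    Returns a list of (frame_index, corners) for frames where the full
    pattern was found.
    """
    detections = []
    for i, frame in enumerate(image_sequence):
        # Convert BGR color frames to grayscale; IR frames are used as-is.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame
        found, corners = cv2.findChessboardCorners(gray, pattern_size)
        if found:
            corners = cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
            detections.append((i, corners))
    return detections
```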
Further, in some embodiments, the one or more first test images 600 comprise a single first test image and the one or more second test images 620 comprise a single second test image, and each second test mark corresponds to a respective subset of the first test marks. Further, in some embodiments, the first test marks and the second test marks are marked on a three-dimensional box having a plurality of sides covered by a checkerboard, and the three-dimensional box is recorded in the first test image 600 and the second test image 620.
Further, in some embodiments, the plurality of first test marks includes a first mark unit and a second mark unit, and the plurality of second test marks includes a third mark unit located at an intermediate point between the locations of the first mark unit and the second mark unit.
It should be understood that the particular order of the operations in fig. 10 is merely exemplary and is not intended to indicate that it is the only order in which the operations can be performed. One of ordinary skill in the art will recognize various ways of annotating keypoints in images as described herein. Furthermore, it should be noted that the details of the other processes described above with respect to figs. 3-9 may also be applied in an analogous manner to the method 1000 described above with respect to fig. 10. For brevity, those details are not repeated here.
Fig. 11 is a schematic diagram illustrating a computer system 1100 according to some embodiments. The computer system 1100 includes a server 102, a client device 104, a storage device 106, or a combination thereof. The computer system 1100 is used to implement any of the methods described above with respect to figs. 3-10. The computer system 1100 typically includes one or more processing units (CPUs) 1102, one or more network interfaces 1104, memory 1106, and one or more communication buses 1108 for interconnecting these components (sometimes called a chipset). The computer system 1100 includes one or more input devices 1110 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive tablet, a gesture-capturing camera, or other input buttons or controls. Further, in some embodiments, the client device 104 of the computer system 1100 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of a graphic sequence code printed on an electronic device. The computer system 1100 also includes one or more output devices 1112 for presenting user interfaces and displaying content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
Memory 1106 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Optionally, the memory 1106 includes one or more storage devices remotely located from the one or more processing units 1102. The memory 1106, or alternatively the non-volatile memory within the memory 1106, includes a non-transitory computer-readable storage medium. In some embodiments, the memory 1106, or the non-transitory computer-readable storage medium of the memory 1106, stores the following programs, modules, and data structures, or a subset or superset thereof:
an operating system 1114 including programs for handling various basic system services and for performing hardware related tasks;
a network communication module 1116 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage device 106) via one or more network interfaces 1104 (wired or wireless) and one or more communication networks 108, such as the internet, other wide area networks, local area networks, metropolitan area networks, etc.;
a user interface module 1118 for displaying information (e.g., graphical user interfaces of applications 1124, gadgets, websites and their web pages and/or games, audio and/or video content, text, etc.) on each client device 104 via one or more output devices 1112 (e.g., a display, speakers, etc.);
an input processing module 1120 for detecting one or more user inputs or interactions from one or more input devices 1110 and decoding the detected inputs or interactions;
a web browser module 1122 for navigating, requesting (e.g., via HTTP) and displaying websites and their web pages, including a web interface for logging into user accounts associated with the client device 104 or other electronic devices, controlling the client or electronic devices if associated with the user accounts, and editing and viewing settings and data associated with the user accounts;
one or more user applications 1124 (e.g., games, social networking applications, smart home applications, and/or other web-based or non-web-based applications for controlling other electronic devices and viewing data captured by such devices) executed by the computer system 1100;
a model training module 1126 for receiving training data (e.g., training data 1142) and building a data processing model (e.g., data processing module 1128) for processing content data (e.g., video, visual, or audio data) collected or acquired by the client device 104;
a data processing module 1128 for processing content data using the data processing models 1144 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, where, in some embodiments, the data processing module 1128 is associated with one of the one or more user applications 1124 to process content data in response to user instructions received from that user application 1124;
a camera calibration module 1130 for calibrating the two cameras in time and space, determining a physical correlation between a first coordinate of the first camera and a second coordinate of the second camera;
a keypoint annotation module 1132 for identifying the positions of the physical markers in the first image captured by the first camera, converting the identified positions into positions of keypoints in the second image captured by the second camera, and annotating the keypoints in the second image accordingly; and
one or more databases 1134 for storing at least the following data:
device settings 1136, including one or more generic device settings (e.g., service level, device model, storage capacity, processing power, communication power, etc.) of server 102 or client device 104;
user account information 1138 of one or more user applications 1124, such as user name, security questions, account history data, user preferences, and predefined account settings;
network parameters 1140 of one or more communication networks 108, such as IP address, subnet mask, default gateway, DNS server, and hostname;
training data 1142 for training one or more data processing models 1144;
a data processing model 1144 for processing content data (e.g., video, visual, audio data) using deep learning techniques; and
content data and results 1146 that are respectively acquired by the client device 104 of the computer system 1100 and outputted to the client device 104, wherein the content data includes images captured by the first camera 202 and the second camera 204, locations of the physical markers in the first image captured by the first camera 202, and/or keypoint information annotated in the second image captured by the second camera 204.
Optionally, one or more databases 1134 are stored in one of server 102, client device 104, and storage device 106 of computer system 1100. Optionally, one or more databases 1134 are distributed among more than one of server 102, client device 104, and storage device 106 of computer system 1100. In some embodiments, more than one copy of the data is stored in different devices, e.g., two copies of data processing model 1144 are stored in server 102 and storage device 106, respectively.
Each of the elements identified above may be stored in one or more of the storage devices mentioned above and corresponds to a set of instructions for performing the functions described above. The above-identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, the memory 1106 optionally stores a subset of the modules and data structures identified above. Furthermore, the memory 1106 optionally stores additional modules and data structures not described above.
The terminology used herein in describing the various described embodiments is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various embodiments described and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Furthermore, it should be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
The term "if" as used herein is to be interpreted in the context of "when" or "in" or "responsive to determination" or "responsive to detection" or "based on determination". Likewise, the phrase "if a condition or event is determined" or "if detected" is also understood to be "determining" or "responding to a determination" or "detecting a condition or event" or "responding to a determination of detecting a condition or event" depending on the context.
For ease of explanation, the foregoing description has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of operation and the practical application, thereby enabling others skilled in the art to understand the invention.
Although the various figures illustrate some logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or split. Although some reordering or other groupings are specifically mentioned, other groupings will be apparent to those of ordinary skill in the art, and thus the ordering and groupings described herein are not an exhaustive list of alternatives. Furthermore, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.

Claims (21)

1. A method of automatically annotating an image, comprising:
acquiring a first image of a scene captured by a first camera, the first image recording a plurality of markers, each marker being attached to an object and corresponding to a key point on the object in the first image;
acquiring a second image of the scene captured by a second camera simultaneously with the first image, wherein the first camera and the second camera are cameras of different types;
determining a physical correlation between a first coordinate of the first camera and a second coordinate of the second camera;
identifying locations of the plurality of markers recorded in the first image in the first coordinate;
converting the positions of the plurality of markers in the first coordinate into positions of a plurality of keypoints in the second coordinate according to the physical correlation between the first coordinate and the second coordinate; and
annotating the second image with the plurality of keypoints automatically and without user intervention.
2. The method of claim 1, wherein said converting the locations of the plurality of markers in the first coordinate to the locations of the plurality of keypoints in the second coordinate further comprises:
identifying a plurality of first keypoints in the first image according to the positions of the plurality of markers; and
converting the positions of the plurality of first keypoints in the first image into the positions of the plurality of keypoints in the second image.
3. The method of claim 1, wherein said converting the locations of the plurality of markers in the first coordinate to the locations of the plurality of keypoints in the second coordinate further comprises:
converting the positions of the plurality of markers in the first coordinate to positions of the plurality of markers in the second coordinate of the second image; and
identifying the plurality of keypoints in the second image according to the converted positions of the plurality of markers.
4. A method according to any one of claims 1 to 3, further comprising:
interpolating additional marker positions from the positions of the plurality of markers; and
identifying, based on the additional marker positions, additional keypoints that are not among the plurality of keypoints.
5. A method according to any one of claims 1 to 3, further comprising:
interpolating additional keypoints from the plurality of keypoints.
6. The method of claim 1, wherein the plurality of keypoints correspond to one or more human bodies;
the converting the positions of the plurality of marks in the first coordinate to the positions of the plurality of key points in the second coordinate further includes:
converting the positions of the plurality of markers in the first coordinate to positions of the plurality of markers in the second coordinate;
fitting a subset of the converted positions of the plurality of markers to a body part of the human body;
identifying a geometric center of the subset of the converted positions of the plurality of markers; and
identifying, from the geometric center, a set of keypoints of the human body corresponding to the subset of the converted positions of the plurality of markers.
7. The method according to any of the preceding claims, further comprising:
training a deep learning model using the second image annotated with the plurality of keypoints.
8. The method of any of the preceding claims, wherein the first image comprises an infrared image on which the plurality of markers appear with a pronounced optical feature.
9. The method according to any of the preceding claims, wherein the physical correlation between the first and second coordinates comprises a plurality of displacement parameters related to a three-dimensional displacement between the first and second coordinates, and a plurality of rotation parameters related to a three-dimensional rotation between the first and second coordinates.
10. The method according to any of the preceding claims, characterized in that,
the first camera comprises a plurality of first camera parts distributed in the scene;
the first image includes one of a plurality of images taken by the plurality of first camera portions substantially simultaneously;
one of the plurality of images is captured by one of the plurality of first camera portions and is associated with the first coordinate;
a field of view of one of the plurality of first camera portions overlaps with a field of view of the second camera; and
the locations of the plurality of markers are identified in the first coordinate of the first camera using the first image.
11. The method according to any of the preceding claims, further comprising:
synchronizing the first camera and the second camera, wherein each of the first image and the second image is identified by a respective timestamp.
12. The method of any of the preceding claims, wherein the second camera is mounted on a mobile device or augmented reality glasses.
13. The method according to any of the preceding claims, wherein the second image is one of a color image, a monochrome image and a depth image.
14. The method of any of the preceding claims, wherein determining a physical correlation between a first coordinate of the first camera and a second coordinate of the second camera further comprises:
obtaining one or more first test images of the scene from the first camera, the one or more first test images including a plurality of first test marks to be detected;
obtaining one or more second test images of the scene from the second camera, the one or more second test images including a plurality of second test marks to be detected, wherein the plurality of first test marks have known physical positions in the scene relative to the plurality of second test marks, and the one or more second test images are taken simultaneously with the one or more first test images;
detecting the plurality of first test marks and the plurality of second test marks from the first test image and the second test image, respectively; and
deriving the physical correlation between the first coordinate and the second coordinate from the known physical positions of the plurality of first test marks in the scene relative to the plurality of second test marks.
15. The method of claim 14, wherein the one or more first test images comprise a first sequence of image frames, each first test mark being included in a subset of the plurality of first test images, and the one or more second test images comprise a second sequence of image frames, each second test mark corresponding to a subset of the first test marks and being included in a subset of the plurality of second test images, respectively.
16. The method of claim 15, wherein a subset of the plurality of first test marks and the plurality of second test marks are attached to a checkerboard, the checkerboard moving in a plurality of checkerboard poses in the scene and being recorded in the first image frame sequence and the second image frame sequence, the plurality of checkerboard poses having different positions or orientations from one another.
17. The method of claim 14, wherein the one or more first test images comprise a single first test image and the one or more second test images comprise a single second test image, each second test mark corresponding to a respective subset of the first test marks.
18. The method of claim 17, wherein the plurality of first test marks and the plurality of second test marks are marked on a three-dimensional box having a plurality of sides covered by a checkerboard, the three-dimensional box being recorded in the first test image and the second test image.
19. The method of claim 14, wherein the plurality of first test marks comprises a first mark unit and a second mark unit, and the plurality of second test marks comprises a third mark unit located at an intermediate point between the locations of the first mark unit and the second mark unit.
20. A computer system, comprising:
one or more processors; and
a memory having instructions stored therein, the instructions being executable by the one or more processors to implement the method of any of claims 1-19.
21. A non-transitory computer-readable storage medium having instructions stored therein, the instructions being executable by one or more processors to implement the method of any of claims 1-19.