CN115963917B - Visual data processing apparatus and visual data processing method - Google Patents

Visual data processing apparatus and visual data processing method

Info

Publication number
CN115963917B
CN115963917B
Authority
CN
China
Prior art keywords
data, visual, vision, visual data, sensors
Prior art date
Legal status
Active
Application number
CN202211661741.5A
Other languages
Chinese (zh)
Other versions
CN115963917A (en)
Inventor
彭昊天
冯志强
杨黔生
陈睿智
赵晨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211661741.5A
Publication of CN115963917A
Application granted
Publication of CN115963917B
Status: Active

Landscapes

  • Image Processing (AREA)

Abstract

The present disclosure provides a visual data processing apparatus and a visual data processing method, relating to the field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision and deep learning, and applicable to scenarios such as the metaverse, computer vision and virtual digital humans. A specific implementation scheme is as follows: a plurality of vision sensors are configured to, in response to detecting a synchronous acquisition instruction, acquire visual data of a target object at the same moment according to the synchronous acquisition instruction, obtaining visual data corresponding to each of the plurality of vision sensors, where the viewing angle ranges of the plurality of vision sensors are different from each other; a main processor is connected to the plurality of vision sensors through a first data line and is configured to, in response to receiving the visual data from the plurality of vision sensors, process the plurality of visual data to obtain pose data of the target object; and a main power supply is connected to the plurality of vision sensors and the main processor through a first power line and is configured to supply power to the plurality of vision sensors and the main processor.

Description

Visual data processing apparatus and visual data processing method
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse, computer vision and virtual digital humans. In particular, it relates to a visual data processing apparatus and a visual data processing method.
Background
With the development of artificial intelligence technology, computer vision technology has been widely applied. Computer vision may refer to techniques that use an image capture device to recognize, track and measure a target object, and use a computer to perform image processing so as to obtain an image better suited to observation by the human eye or detection by a device.
Disclosure of Invention
The present disclosure provides a visual data processing apparatus and a visual data processing method.
According to an aspect of the present disclosure, there is provided a visual data processing apparatus, comprising: a plurality of vision sensors configured to, in response to detecting a synchronous acquisition instruction, acquire visual data of a target object at the same moment according to the synchronous acquisition instruction, obtaining visual data corresponding to each of the plurality of vision sensors, wherein the viewing angle ranges of the plurality of vision sensors are different from each other; a main processor connected to the plurality of vision sensors through a first data line and configured to, in response to receiving the visual data from the plurality of vision sensors, process the plurality of visual data to obtain pose data of the target object; and a main power supply connected to the plurality of vision sensors and the main processor through a first power line and configured to supply power to the plurality of vision sensors and the main processor.
According to another aspect of the present disclosure, there is provided a visual data processing method, comprising: a plurality of vision sensors, in response to detecting a synchronous acquisition instruction, acquire visual data of a target object at the same moment according to the synchronous acquisition instruction, obtaining visual data corresponding to each of the plurality of vision sensors, wherein the viewing angle ranges of the plurality of vision sensors are different from each other; and a main processor, in response to receiving the visual data from the plurality of vision sensors through a first data line, processes the plurality of visual data to obtain pose data of the target object; wherein a main power supply supplies power to the plurality of vision sensors and the main processor through a first power line.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a structural schematic diagram of a visual data processing apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a structural schematic diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 3 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 7 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure;
FIG. 8 schematically illustrates an example schematic diagram of a visual data processing apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates an example schematic diagram of a visual data processing apparatus according to another embodiment of the present disclosure; and
Fig. 10 schematically shows a flowchart of a visual data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Visual data of a target object may be acquired with a single-view vision sensor, and motion capture may then be performed on the basis of that visual data.
However, real application scenes are complex, and occlusion or blurring may occur, so the motion capture effect based on a single-view vision sensor is poor. In addition, because the target object may be unstable, motion capture based on a single-view vision sensor also performs poorly.
To this end, the present disclosure proposes a visual data processing apparatus. For example, a plurality of vision sensors are configured to, in response to detecting a synchronous acquisition instruction, acquire visual data of a target object at the same moment according to the synchronous acquisition instruction, obtaining visual data corresponding to each of the plurality of vision sensors. The viewing angle ranges of the plurality of vision sensors are different from each other. A main processor is connected to the plurality of vision sensors through a first data line and is configured to, in response to receiving the visual data from the plurality of vision sensors, process the plurality of visual data to obtain pose data of the target object. A main power supply is connected to the plurality of vision sensors and the main processor through a first power line and is configured to supply power to the plurality of vision sensors and the main processor.
According to embodiments of the present disclosure, since the visual data of the target object are acquired by the plurality of vision sensors according to the synchronous acquisition instruction, the visual data corresponding to the plurality of vision sensors can represent the target object in different viewing angle ranges at the same moment, which improves the accuracy of the visual data. Since data transmission between the main processor and the vision sensors is carried out through the first data line, the stability of that data transmission can be ensured. In addition, since the pose data of the target object are obtained by the main processor processing the visual data corresponding to the plurality of vision sensors, the accuracy of the pose data is improved.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
Fig. 1 schematically illustrates a structural diagram of a visual data processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 1, the visual data processing apparatus 100 may include a plurality of vision sensors 110, a main processor 120, and a main power supply 130. The plurality of vision sensors 110 may include vision sensor 110_1, vision sensor 110_2, …, vision sensor 110_M. M may be an integer greater than or equal to 1, and m ∈ {1, 2, …, M-1, M}.
The main processor 120 may be connected with the plurality of vision sensors 110 through the first data line 101. The main power supply 130 may be connected to the plurality of vision sensors 110 and the main processor 120 through the first power line 102.
The plurality of vision sensors 110 may be configured to acquire vision data of the target object at the same time according to the synchronous acquisition instruction in response to detecting the synchronous acquisition instruction, resulting in vision data corresponding to the plurality of vision sensors 110. The viewing angle ranges of the plurality of vision sensors 110 may be different from each other.
The main processor 120 may be configured to process the plurality of visual data to obtain pose data of the target object in response to receiving the visual data from the plurality of visual sensors 110.
The primary power source 130 may be configured to power the plurality of vision sensors 110 and the primary processor 120.
According to embodiments of the present disclosure, the vision sensor 110_m may have multi-frame visual data synchronization capability. The deployment positions of the plurality of vision sensors 110 may be configured according to actual service requirements, as long as the sensors can cooperatively obtain visual data of a predetermined viewing angle range of the target object, which is not limited herein.
For example, the predetermined viewing angle range may be 360 °. The viewing angle ranges of the plurality of vision sensors 110 may be different from each other. Among the plurality of vision sensors 110, there may be vision sensors whose viewing angle ranges overlap. Alternatively, there may be no vision sensor in the plurality of vision sensors 110 in which the viewing angle ranges overlap.
For example, each of the plurality of vision sensors 110 may be disposed at a first predetermined location spaced a first predetermined distance from the target object. Alternatively, some of the plurality of vision sensors 110 may be disposed at a second predetermined position spaced apart from the target object by a second predetermined distance, and others of the plurality of vision sensors 110 may be disposed at a third predetermined position spaced apart from the target object by a third predetermined distance. For example, the first predetermined distance may be 10 meters, the second predetermined distance may be 5 meters, and the third predetermined distance may be 8 meters.
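For illustration only, one possible deployment consistent with the above is sketched below; it assumes the M vision sensors are evenly spaced on a circle around the target object, and the function name and values are hypothetical rather than taken from the disclosure:

import math

def ring_positions(num_sensors, radius_m):
    # Hypothetical layout: place num_sensors evenly on a circle of radius
    # radius_m around the target object at the origin, so that the sensors'
    # viewing angle ranges differ and together cover 360 degrees.
    positions = []
    for k in range(num_sensors):
        theta = 2.0 * math.pi * k / num_sensors
        x, y = radius_m * math.cos(theta), radius_m * math.sin(theta)
        yaw_deg = (math.degrees(theta) + 180.0) % 360.0  # face the target
        positions.append((x, y, yaw_deg))
    return positions

print(ring_positions(num_sensors=8, radius_m=10.0))  # e.g., 8 sensors at 10 meters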
According to an embodiment of the present disclosure, the vision sensor 110_m may include at least one of: a monocular vision sensor, a binocular vision sensor, a laser vision sensor, a structured light vision sensor, and a TOF (Time of Flight) vision sensor.
For example, in the case where the vision sensor 110_m includes a binocular vision sensor, in response to detecting the synchronous acquisition instruction, the binocular vision sensor may be used to acquire visual data of the target object at the same moment based on the parallax principle. In this case, the visual data corresponding to each of the plurality of vision sensors may include three-dimensional geometric information of the target object.
For example, in the case where the vision sensor 110_m includes a laser vision sensor, in response to detecting the synchronous acquisition instruction, the laser vision sensor may be used to acquire visual data of the target object at the same moment based on the triangulation principle. In this case, the visual data corresponding to each of the plurality of vision sensors may include three-dimensional geometric information of the target object.
For example, in the case where the vision sensor 110_m includes a structured light vision sensor, in response to detecting the synchronous acquisition instruction, the structured light vision sensor may be used to acquire visual data of the target object at the same moment by projecting controllable speckle light spots onto the surface of the target object. In this case, the visual data corresponding to each of the plurality of vision sensors may include surface characteristic information of the target object.
For example, in the case where the vision sensor 110_m includes a TOF vision sensor, in response to detecting the synchronous acquisition instruction, the TOF vision sensor may be used to acquire visual data of the target object at the same moment by measuring the time of flight. In this case, the visual data corresponding to each of the plurality of vision sensors may include surface characteristic information of the target object.
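For illustration, the depth-recovery principles named above reduce to simple formulas. The following is a minimal sketch; the function names and numeric values are hypothetical, not taken from the disclosure:

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    # Parallax principle of a binocular vision sensor: Z = f * B / d, with
    # focal length f in pixels, baseline B in meters, disparity d in pixels.
    return focal_px * baseline_m / disparity_px

def depth_from_tof(round_trip_s, c=3.0e8):
    # TOF principle: depth = c * t / 2, where t is the measured round-trip
    # time of flight of the emitted light and c is the speed of light.
    return c * round_trip_s / 2.0

print(depth_from_disparity(focal_px=1200.0, baseline_m=0.12, disparity_px=36.0))  # 4.0 m
print(depth_from_tof(round_trip_s=26.7e-9))  # about 4.0 m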
According to embodiments of the present disclosure, the types of the plurality of visual sensors 110 may be configured according to actual business requirements, and are not limited herein. For example, the plurality of vision sensors 110 may include a binocular vision sensor and a structured light vision sensor, in which case the vision data corresponding to each of the plurality of vision sensors may include three-dimensional geometric information and surface characteristic information of the target object at the same time. Alternatively, the plurality of vision sensors 110 may include a laser vision sensor and a TOF vision sensor, in which case the vision data corresponding to each of the plurality of vision sensors may include both three-dimensional geometric information and surface feature information of the target object.
According to an embodiment of the present disclosure, the main processor 120 may include at least one of: an image signal processor (Image Signal Processor, ISP), a digital signal processor (Digital Signal Processor, DSP) and a central processing unit (Central Processing Unit, CPU) in a server. Alternatively, the main processor 120 may further include at least one of: an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), and a microprocessor.
According to an embodiment of the present disclosure, the image signal processor may include at least one of: a first input unit, a first arithmetic unit, a first controller, a first output unit and a first bus. The digital signal processor may include at least one of: a second input unit, a second arithmetic unit, a second controller, a second output unit and a second bus. The central processing unit may include at least one of: a third input unit, a third arithmetic unit, a third controller, a third output unit and a third bus.
For example, where the main processor 120 includes an image signal processor and a digital signal processor, in response to receiving visual data from the plurality of vision sensors, the image signal processor may perform at least one of the following operations on the visual data from the plurality of vision sensors: black level compensation (Black Level Compensation, BLC), lens shading correction (Lens Shading Correction, LSC), bad pixel correction (Bad Pixel Correction, BPC), color interpolation (i.e., demosaicing), Bayer noise removal, automatic white balance (Automatic White Balance, AWB) correction, color correction (Color Correction, CC), gamma correction, color space conversion, and color and contrast enhancement. The color space conversion may be converting the RGB format into the YUV format. After the YUV-format data of the plurality of vision sensors are obtained, they may be transmitted to the digital signal processor through an I/O interface, so that the digital signal processor can process the YUV-format data of the plurality of vision sensors to obtain the pose data of the target object. "R" in "RGB" may denote "Red", "G" may denote "Green", and "B" may denote "Blue". "Y" in "YUV" may denote "Luminance", while "U" and "V" may denote the two "Chrominance" components.
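For illustration, the color space conversion step can be sketched as follows. The disclosure does not specify the conversion matrix, so the common BT.601 analog coefficients are assumed here:

import numpy as np

def rgb_to_yuv_bt601(rgb):
    # Convert normalized RGB values to YUV: Y is the luminance component,
    # U and V are the chrominance components (BT.601 coefficients assumed).
    m = np.array([[ 0.299,    0.587,    0.114  ],
                  [-0.14713, -0.28886,  0.436  ],
                  [ 0.615,   -0.51499, -0.10001]])
    return rgb @ m.T

pixel = np.array([0.5, 0.25, 0.75])  # R, G, B in [0, 1]
print(rgb_to_yuv_bt601(pixel))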
For example, in the case where the main processor 120 is a central processing unit in a server, in response to receiving visual data from the plurality of vision sensors, the central processing unit may process the plurality of visual data to obtain the pose data of the target object. The server may be any type of server providing various functions. For example, the server may be a cloud server (i.e., a cloud computing server or cloud host), a host product in the cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server incorporating a blockchain.
According to embodiments of the present disclosure, the main power supply 130 may include at least one of: an alternating current (Alternating Current, AC) main power supply and a direct current (Direct Current, DC) main power supply. The AC main power supply may include at least one of: a parameter-adjusting AC main power supply, an auto-adjusting AC main power supply and a switching AC main power supply. The DC main power supply may include at least one of: a chemical DC main power supply, a linear DC main power supply and a switching DC main power supply.
According to an embodiment of the present disclosure, the first data line 101 may include at least one of: a first data line using a COM interface, a first data line using a USB interface, a first data line using a Type-C interface and a first data line using a Micro-USB interface. The number of first data lines 101, the length of the first data lines 101, the deployment manner of the first data lines 101, the number of interfaces of the first data lines 101, and the deployment manner of those interfaces may be set according to actual service requirements, which is not limited herein. For example, the length of the first data line 101 may be 10 meters, the number of first data lines 101 and the number of their interfaces may be adapted to the number of the plurality of vision sensors 110, and the deployment manner of the first data lines 101 and of their interfaces may be adapted to the deployment positions of the plurality of vision sensors 110.
According to an embodiment of the present disclosure, the outer side of the first data line 101 may be coated with a first insulating layer to ensure the thermal stability and electrical insulation of the first data line 101, thereby extending its service life.
According to an embodiment of the present disclosure, the first power line 102 may include at least one of: a first standard power line and a first interconnect power line. The first standard power cord may include a first power plug and a first connector. The first interconnect power line may include a first plug and a second connector. The number of the first power lines 102, the arrangement manner of the first power lines 102, the number of plugs of the first power lines 102, and the arrangement manner of plugs of the first power lines 102 may be set according to actual service requirements, which is not limited herein. For example, the number of the first power lines 102 and the number of plugs of the first power lines 102 may be adapted to the sum of the numbers of the plurality of vision sensors 110 and the main processor 120, and the arrangement of the first power lines 102 and the arrangement of plugs of the first power lines 102 may be adapted to the arrangement positions of the plurality of vision sensors 110 and the main processor 120.
According to an embodiment of the present disclosure, the outer side of the first power line 102 may be coated with a second insulating layer to ensure the thermal stability and electrical insulation of the first power line 102, thereby extending its service life.
According to embodiments of the present disclosure, a target object may refer to an object of interest during operation of a visual data processing apparatus. The target objects may include static target objects and dynamic target objects. For example, static target objects may include static persons, static objects, static scenes, and the like. Alternatively, dynamic target objects may include dynamic characters, dynamic objects, dynamic scenes, and the like.
According to embodiments of the present disclosure, the target object may include a virtual object. A virtual object may refer to a virtual character having a digitized appearance, and may include at least one of: a two-dimensional virtual object and a three-dimensional virtual object. Virtual objects may also be referred to as digital humans. A three-dimensional virtual object may have character features, character behaviors and character thought. The character features may include at least one of a character's appearance, gender, personality and the like. The character behaviors may include at least one of language expression ability, expression change ability, limb movement expression ability and the like. Character thought may refer to the ability to identify the external environment and interact with the user. Three-dimensional virtual objects may include realistic digital humans and non-realistic digital humans. Realistic digital humans may include at least one of service-type digital humans and media-type digital humans. For example, a service-type digital human may be a digital human applied to customer service or to a service assistant, and the like. A media-type digital human may be a digital human applied to livestream e-commerce or to broadcast media, and the like. A non-realistic digital human may include, for example, a digital human used in a navigation application.
According to an embodiment of the present disclosure, the target object may be located at a center of gravity position of a shape constituted by the plurality of vision sensors 110. In this case, each of the plurality of vision sensors 110 may acquire vision data of the target object.
According to embodiments of the present disclosure, since the visual data of the target object are acquired by the plurality of vision sensors according to the synchronous acquisition instruction, the visual data corresponding to the plurality of vision sensors can represent the target object in different viewing angle ranges at the same moment, which improves the accuracy of the visual data. Since data transmission between the main processor and the vision sensors is carried out through the first data line, the stability of that data transmission can be ensured. In addition, since the pose data of the target object are obtained by the main processor processing the visual data corresponding to the plurality of vision sensors, the accuracy of the pose data is improved.
The visual data processing apparatus 100 according to an embodiment of the present disclosure is further described below with reference to fig. 2 to 9 in connection with specific embodiments.
Fig. 2 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 2, the visual data processing apparatus 200 may include a plurality of vision sensors 210, a main processor 220, and a main power supply 230. The plurality of vision sensors 210 may include vision sensor 210_1, vision sensor 210_2, …, vision sensor 210_N-1, and vision sensor 210_N. N may be an integer greater than or equal to 1, and n ∈ {1, 2, …, N-1, N}.
The main processor 220 may be connected with the plurality of vision sensors 210 through the first data line 201. The main power supply 230 may be connected to the plurality of vision sensors 210 and the main processor 220 through the first power line 202. The plurality of vision sensors 210 may be connected by a synchronization line.
For the description of the plurality of vision sensors 210, the main processor 220, the main power supply 230, the first data line 201, and the first power line 202, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101, and the first power line 102, which are not repeated herein.
According to embodiments of the present disclosure, the synchronization lines 203_1, …, 203_N-1 may be used to transmit synchronization signals among vision sensor 210_1, vision sensor 210_2, …, vision sensor 210_N-1 and vision sensor 210_N, to enable multi-frame synchronization between vision sensors with different viewing angle ranges.
According to embodiments of the present disclosure, the connection manner of the synchronization lines may be set according to actual service requirements, which is not limited herein. A synchronization line may be used to connect every two adjacent vision sensors; for example, vision sensor 210_1 and vision sensor 210_2 may be connected by synchronization line 203_1, and so on, and vision sensor 210_N-1 and vision sensor 210_N may be connected by synchronization line 203_N-1. Alternatively, a synchronization line may be used to connect any two vision sensors; for example, vision sensor 210_1 and vision sensor 210_N-1 may be connected by a synchronization line.
According to the embodiment of the present disclosure, the deployment manner of the synchronization line may be set according to the actual service requirement, which is not limited herein. For example, the manner in which the synchronization line is deployed may be adapted to the deployment locations of the plurality of visual sensors 210. Alternatively, the deployment manner of the synchronization line may be adapted to the position of the target object.
According to the embodiment of the disclosure, since the plurality of visual sensors are connected through the synchronization line, the synchronization line can be used for transmitting the synchronization signal, so that multi-frame synchronization between the visual sensors of different visual angle ranges can be realized, thereby improving accuracy of visual data.
According to embodiments of the present disclosure, the main processor 220 may be configured to perform keypoint detection on the plurality of visual data to obtain a plurality of keypoint sets, and to determine the pose data of the target object according to the plurality of keypoint sets.
According to embodiments of the present disclosure, the main processor 220 may be further configured to perform an internal and external reference calibration process on the visual data of the plurality of visual sensors prior to keypoint detection of the plurality of visual data.
According to an embodiment of the present disclosure, after obtaining the visual data subjected to the internal and external reference calibration processing, the main processor 220 may perform keypoint detection on the plurality of visual data to obtain a plurality of keypoint sets. Each of the plurality of keypoint sets may correspond to frame-synchronized visual data from a different viewing angle. The keypoint detection manner may include one of the following: a point-of-interest detection manner and a dense extraction manner. The point-of-interest detection manner may include at least one of: the Harris corner detector, the LoG (Laplacian of Gaussian) operator and the DoG (Difference of Gaussians) operator. The dense extraction manner may include at least one of: scale-invariant feature transform (Scale-Invariant Feature Transform, SIFT), the FAST (Features from Accelerated Segment Test) feature detection algorithm, histograms of oriented gradients (Histogram of Oriented Gradients, HOG), and local binary patterns (Local Binary Pattern, LBP).
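For illustration, a minimal DoG-style detector might look as follows. This is a sketch only; the disclosure does not prescribe parameters, and sigma, k and the threshold below are hypothetical:

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_keypoints(image, sigma=1.6, k=1.6, threshold=0.02):
    # Difference of Gaussians: blur at two scales and subtract.
    dog = gaussian_filter(image, sigma * k) - gaussian_filter(image, sigma)
    resp = np.abs(dog)
    # Keep local maxima of the response that exceed the threshold.
    peaks = (resp == maximum_filter(resp, size=3)) & (resp > threshold)
    return np.argwhere(peaks)  # (row, col) coordinates of candidate keypoints

image = np.random.rand(64, 64)
print(dog_keypoints(image)[:5])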
According to embodiments of the present disclosure, after the plurality of keypoint sets are obtained, the pose data of the target object may be determined according to the plurality of keypoint sets. For example, for each of the plurality of keypoint sets, keypoint fusion processing may be performed on the keypoints in that set to obtain a fused keypoint corresponding to that keypoint set. The keypoint fusion processing may include at least one of: keypoint averaging and keypoint weighting.
According to embodiments of the present disclosure, after the fused keypoints corresponding to each keypoint set are obtained, they may be processed using a three-dimensional human body model (Skinned Multi-Person Linear model, SMPL) to obtain the pose data of the target object. For example, the fused keypoint corresponding to each keypoint set may be projected onto a two-dimensional plane using the three-dimensional human body model, obtaining a two-dimensional keypoint corresponding to each keypoint set, and a reprojection error may be determined from the two-dimensional keypoints corresponding to the keypoint sets. Alternatively, the reprojection error may be determined from the plurality of fused keypoints. The pose data of the target object are then determined according to the reprojection error.
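For illustration, weighted keypoint fusion and a pinhole-camera reprojection error can be sketched as follows. The SMPL projection itself is not reproduced here, and all shapes, weights and camera intrinsics are hypothetical:

import numpy as np

def fuse_keypoints(keypoint_sets, weights=None):
    # Weighted average across views; keypoint_sets has shape
    # (num_views, num_keypoints, 3).
    return np.average(np.asarray(keypoint_sets), axis=0, weights=weights)

def reprojection_error(points_3d, points_2d, K):
    # Project fused 3D keypoints with intrinsic matrix K and compare with
    # the observed 2D keypoints.
    proj = (K @ points_3d.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return float(np.mean(np.linalg.norm(proj - points_2d, axis=1)))

views = np.random.rand(3, 17, 3) + np.array([0.0, 0.0, 2.0])  # 3 views, 17 joints
fused = fuse_keypoints(views, weights=[0.5, 0.3, 0.2])
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
observed = np.random.rand(17, 2) * np.array([640.0, 480.0])
print(reprojection_error(fused, observed, K))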
According to an embodiment of the present disclosure, the main processor 220 may be configured to perform feature extraction on the plurality of visual data to obtain a keypoint feature map of at least one scale corresponding to each of the plurality of visual data, and to obtain the plurality of keypoint sets according to the keypoint maps of at least one scale corresponding to the plurality of visual data.
According to embodiments of the present disclosure, scale may refer to image resolution. Each scale may have at least one keypoint feature map corresponding to the scale.
According to embodiments of the present disclosure, the image may be processed based on a single-stage serial method to obtain a keypoint feature map of at least one scale. Alternatively, the image may be processed based on a multi-stage serial method, or based on a multi-stage parallel method, to obtain a keypoint feature map of at least one scale.
According to the embodiment of the disclosure, since the key point feature map of at least one scale can provide richer information, the key point set is obtained by using the key point feature map of at least one scale, so that the accuracy of the key point set is improved.
According to an embodiment of the present disclosure, the main processor 220 may be configured to perform feature extraction of U stages on the visual data to obtain at least one keypoint feature map corresponding to the U-th stage, and to obtain the keypoint feature map of at least one scale corresponding to the visual data according to the at least one keypoint feature map corresponding to the U-th stage.
According to an embodiment of the present disclosure, the u-th stage may have T_u parallel levels. Keypoint feature maps of the same parallel level have the same image resolution, and keypoint feature maps of different parallel levels have different image resolutions.
According to embodiments of the present disclosure, U may be an integer greater than or equal to 1, u may be an integer greater than or equal to 1 and less than or equal to U, and T_u may be an integer greater than or equal to 1.
According to an embodiment of the present disclosure, the U stages may include an input stage, intermediate stages and an output stage. The input stage may refer to the 1st stage, the output stage to the U-th stage, and the intermediate stages to the 2nd through (U-1)-th stages. The numbers of parallel levels of the stages may be the same or different. In the 1st through (U-1)-th stages, the current stage may have at least one more parallel level than the previous stage, and the U-th stage may have the same number of parallel levels as the (U-1)-th stage. U may be configured according to actual service requirements, which is not limited herein. For example, U = 4. In the 1st through 3rd stages, each stage may have at least one more parallel level than the previous stage: the 1st stage has T_1 = 2 parallel levels, the 2nd stage has T_2 = 3 parallel levels, the 3rd stage has T_3 = 4 parallel levels, and the 4th stage has T_4 = 4 parallel levels.
According to embodiments of the present disclosure, the image resolutions of keypoint feature maps of the same parallel level are the same, and those of different parallel levels are different; for example, the image resolution of the keypoint feature map of the current parallel level is smaller than that of the keypoint feature map of the level above it. The image resolution of the keypoint feature map of the current parallel level of the current stage may be determined from the image resolution of the keypoint feature map of the upper parallel level of the previous stage; for example, it may be obtained by downsampling that resolution.
According to an embodiment of the present disclosure, in the case where U > 1, performing feature extraction of U stages on the visual data to obtain the at least one keypoint feature map corresponding to the U-th stage may include: in response to u = 1, performing feature extraction on the visual data to obtain an intermediate keypoint feature map of at least one scale corresponding to the 1st stage, and obtaining the keypoint feature map of at least one scale corresponding to the 1st stage according to that intermediate keypoint feature map; and in response to 1 < u ≤ U, performing feature extraction on the keypoint feature map of at least one scale corresponding to the (u-1)-th stage to obtain an intermediate keypoint feature map of at least one scale corresponding to the u-th stage, and obtaining the keypoint feature map of at least one scale corresponding to the u-th stage according to that intermediate keypoint feature map.
According to an embodiment of the present disclosure, in the case where U = 1, performing feature extraction of U stages on the visual data to obtain the at least one keypoint feature map may include: performing feature extraction on the visual data to obtain an intermediate keypoint feature map of at least one scale corresponding to the 1st stage, and obtaining the keypoint feature map of at least one scale corresponding to the 1st stage according to that intermediate keypoint feature map.
According to an embodiment of the present disclosure, obtaining a keypoint feature map of at least one scale according to at least one keypoint feature map corresponding to the U-th stage may include: at least one keypoint feature map corresponding to the U-th stage may be determined as a keypoint feature map of at least one scale.
According to embodiments of the present disclosure, since the image resolutions of keypoint feature maps of the same parallel level are the same while those of different parallel levels are different, a high-resolution feature representation can be maintained throughout the feature extraction process while parallel levels from high resolution to low resolution are gradually added. Deep semantic information is extracted directly on the high-resolution feature representation rather than used merely as a supplement to the low-level feature information of the image, so that the representation has sufficient classification capability while the loss of effective spatial resolution is avoided. The at least one parallel level can capture contextual information and acquire rich global and local information. In addition, information is repeatedly exchanged across the parallel levels to realize multi-scale fusion of features, yielding more accurate keypoint position information and thereby improving the accuracy of the keypoint sets.
According to an embodiment of the present disclosure, where U is an integer greater than 1, the main processor 220 may be configured to convolve the at least one keypoint feature map corresponding to the (u-1)-th stage to obtain at least one intermediate keypoint feature map corresponding to the u-th stage, and to perform feature fusion on the at least one intermediate keypoint feature map corresponding to the u-th stage to obtain the at least one keypoint feature map corresponding to the u-th stage.
According to embodiments of the present disclosure, U may be an integer greater than 1 and less than or equal to U.
According to embodiments of the present disclosure, for each keypoint feature map in the at least one keypoint feature map corresponding to the (u-1)-th stage, convolution processing may be performed on that keypoint feature map to obtain an intermediate keypoint feature map of the u-th stage; in this way, the at least one intermediate keypoint feature map of the u-th stage may be obtained.
According to an embodiment of the present disclosure, performing feature fusion on the at least one intermediate keypoint feature map corresponding to the u-th stage to obtain the at least one keypoint feature map corresponding to the u-th stage may include: for each intermediate keypoint feature map in the at least one intermediate keypoint feature map corresponding to the u-th stage, fusing that intermediate keypoint feature map with the intermediate keypoint feature maps of other parallel levels to obtain the keypoint feature map corresponding to it. The other parallel levels may refer to at least some of the parallel levels of the u-th stage other than the parallel level where that intermediate keypoint feature map is located.
According to an embodiment of the present disclosure, the main processor 220 may be configured to, for the i-th parallel level of the T_u parallel levels, obtain the keypoint feature map corresponding to the i-th parallel level from the other intermediate keypoint feature maps corresponding to the i-th parallel level and the intermediate keypoint feature map corresponding to the i-th parallel level.
According to an embodiment of the present disclosure, the other intermediate keypoint feature maps corresponding to the i-th parallel level are the intermediate keypoint feature maps corresponding to at least part of the T_u parallel levels other than the i-th. i may be an integer greater than or equal to 1 and less than or equal to T_u.
According to an embodiment of the present disclosure, in the case where 1 < i < T_u, at least one first other intermediate keypoint feature map is up-sampled to obtain an up-sampled keypoint feature map corresponding to each such feature map, and at least one second other intermediate keypoint feature map is down-sampled to obtain a down-sampled keypoint feature map corresponding to each such feature map. A first other intermediate keypoint feature map may refer to an other intermediate keypoint feature map at a parallel level greater than the i-th among the T_u parallel levels, and a second other intermediate keypoint feature map to one at a parallel level less than the i-th. The image resolution of each up-sampled keypoint feature map and each down-sampled keypoint feature map is the same as that of the intermediate keypoint feature map of the i-th parallel level.
According to an embodiment of the present disclosure, in the case where i = 1, at least one first other intermediate keypoint feature map is up-sampled to obtain an up-sampled keypoint feature map corresponding to each such feature map. A first other intermediate keypoint feature map may refer to an other intermediate keypoint feature map at a parallel level greater than the 1st among the T_u parallel levels. The image resolution of each up-sampled keypoint feature map is the same as that of the intermediate keypoint feature map of the 1st parallel level.
According to an embodiment of the present disclosure, in the case of i=i, at least one second other intermediate keypoint feature map is downsampled, resulting in a downsampled keypoint feature map corresponding to the at least one second other intermediate keypoint feature map. The second other intermediate keypoint feature map may refer to other intermediate keypoint feature maps in less than the i-th parallel level in the T u parallel levels. The resolution of the downsampled keypoint feature map is the same as the resolution of the intermediate keypoint feature map of the i parallel hierarchies.
According to embodiments of the present disclosure, the keypoint feature map corresponding to the i-th parallel level is obtained from the up-sampled keypoint feature maps corresponding to the at least one first other intermediate keypoint feature map, the down-sampled keypoint feature maps corresponding to the at least one second other intermediate keypoint feature map, and the intermediate keypoint feature map of the i-th parallel level. For example, these feature maps may be fused to obtain the keypoint feature map corresponding to the i-th parallel level. The fusion may include at least one of: concatenation and element-wise addition.
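For illustration, the resolution alignment and element-wise-addition fusion described above can be sketched as follows; nearest-neighbor resampling is assumed merely as a stand-in for the up-sampling and down-sampling operations, which the disclosure does not specify:

import numpy as np

def resize_nearest(feat, out_h, out_w):
    # Nearest-neighbor resampling, used here for both up- and down-sampling.
    h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[np.ix_(rows, cols)]

def fuse_parallel_levels(feature_maps, i):
    # Bring every other level's intermediate keypoint feature map to the
    # resolution of the i-th parallel level, then fuse by element-wise addition.
    target_h, target_w = feature_maps[i].shape
    fused = feature_maps[i].copy()
    for j, feat in enumerate(feature_maps):
        if j != i:
            fused += resize_nearest(feat, target_h, target_w)
    return fused

levels = [np.random.rand(32, 32), np.random.rand(16, 16), np.random.rand(8, 8)]
print(fuse_parallel_levels(levels, i=1).shape)  # (16, 16)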
According to an embodiment of the present disclosure, the main processor 220 may be further configured to process the plurality of visual data using a self-contact detection model to obtain contact detection information, and to optimize the pose data according to the contact detection information.
According to embodiments of the present disclosure, the self-contact detection model may be derived by training a deep learning model using sample visual data and sample contact tag information of the sample visual data.
According to embodiments of the present disclosure, a self-contact detection model may be used to determine contact detection information between various locations of a target object. The contact detection information may be used to characterize the contact state between two sites. The contact state may include contact and non-contact. The contact detection information between the respective portions of the target object may include at least one of: contact detection information between both hands of the target object, contact detection information between the hands and the waist of the target object, contact detection information between the hands and the face of the target object, contact detection information between the hands and the legs of the target object, and the like.
According to an embodiment of the present disclosure, obtaining the self-contact detection model by training a deep learning model using sample visual data and the sample contact label information of the sample visual data may include: inputting the sample visual data into the deep learning model to obtain sample contact detection information; obtaining a loss function value from the sample contact detection information and the sample contact label information based on a loss function; and adjusting the model parameters of the deep learning model according to the loss function value until a predetermined end condition is satisfied. The deep learning model obtained when the predetermined end condition is satisfied is determined to be the self-contact detection model. The predetermined end condition may include at least one of: the loss function value converging and a maximum number of training rounds being reached. The model structure of the deep learning model may be configured according to actual service requirements, which is not limited herein.
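For illustration, the predetermined end condition (loss convergence or a maximum number of training rounds) can be sketched with a deliberately simple stand-in model; this is not the disclosed deep learning model, and all hyperparameters and data below are hypothetical:

import numpy as np

def train_self_contact_model(x, y, lr=0.1, max_rounds=1000, tol=1e-6):
    # Logistic classifier as a stand-in: x are sample visual features,
    # y are contact (1) / non-contact (0) labels.
    w = np.zeros(x.shape[1])
    prev_loss = np.inf
    for _ in range(max_rounds):                    # maximum training rounds
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        if abs(prev_loss - loss) < tol:            # loss function value converged
            break
        w -= lr * (x.T @ (p - y)) / len(y)         # adjust model parameters
        prev_loss = loss
    return w

x = np.random.randn(200, 8)            # hypothetical features from visual data
y = (x[:, 0] > 0).astype(float)        # hypothetical contact labels
print(train_self_contact_model(x, y))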
According to embodiments of the present disclosure, after the contact detection information is obtained, pose data may be optimized according to the contact detection information, for example, pose data may be optimized according to the contact detection information and the inverse dynamics method.
According to embodiments of the present disclosure, since the contact detection information is obtained by processing the plurality of visual data with the self-contact detection model, and the pose data are optimized according to the contact detection information, the accuracy of the pose data is improved.
Fig. 3 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 3, the visual data processing apparatus 300 may include a plurality of vision sensors 310, a main processor 320, and a main power supply 330. The plurality of vision sensors 310 may include a primary vision sensor 311 and at least one auxiliary vision sensor 312. The at least one auxiliary vision sensor 312 may include auxiliary vision sensor 312_1, auxiliary vision sensor 312_2, …, auxiliary vision sensor 312_P. P may be an integer greater than or equal to 1, and p ∈ {1, 2, …, P-1, P}.
The main processor 320 may be connected with the plurality of vision sensors 310 through the first data line 301. The main power supply 330 may be connected to the plurality of vision sensors 310 and the main processor 320 through the first power line 302.
The primary vision sensor 311 may be configured to generate a synchronous acquisition instruction in response to detecting the start instruction.
The at least one auxiliary vision sensor 312 may be configured to acquire visual data of the target object in response to receiving the synchronous acquisition instruction from the primary vision sensor 311.
For the description of the plurality of vision sensors 310, the main processor 320, the main power supply 330, the first data line 301 and the first power line 302, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101 and the first power line 102, which are not repeated herein.
According to an embodiment of the present disclosure, the main processor 320 may be configured to transmit a first time to the primary vision sensor 311 and the at least one auxiliary vision sensor 312 through the first data line 301. The first time may serve as a reference time for the primary vision sensor 311 and the at least one auxiliary vision sensor 312, so that the primary vision sensor 311 and the at least one auxiliary vision sensor 312 connected to the main processor 320 through the first data line 301 have the same reference time.
For example, the code that generates the start instruction may be edited in advance as the first script. The main processor 320 may be configured to run a first script to generate the start-up instruction in response to detecting a visual data processing operation of the user. The initiation instruction may include a second time. The second time may be the current system time at which the main processor 320 generated the startup instruction. After generating the start-up instruction, the main processor 320 may send the start-up instruction to the main vision sensor 311.
According to an embodiment of the present disclosure, the code that generates the synchronous acquisition instruction may be edited in advance as a second script. The primary vision sensor 311 may be configured to run the second script to generate the synchronous acquisition instruction in response to detecting the start instruction. The synchronous acquisition instruction may include a fourth time. For example, the primary vision sensor 311 may determine the fourth time based on the second time and a third time. The second time may refer to the time included in the start instruction. The third time may refer to the time at which the primary vision sensor 311 detected the start instruction. The fourth time may be the current system time at which the primary vision sensor 311 generates the synchronous acquisition instruction.
According to an embodiment of the present disclosure, after generating the synchronous acquisition instruction, the primary vision sensor 311 may transmit the synchronous acquisition instruction to the at least one auxiliary vision sensor 312 through a synchronous line, so that the at least one auxiliary vision sensor 312 acquires vision data of a target object at the same time according to the synchronous acquisition instruction, and obtains vision data corresponding to the at least one auxiliary vision sensor 312. For example, the primary vision sensor 311 may send a synchronous acquisition instruction to the secondary vision sensor 312_1 via the synchronous line 303_1.
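For illustration, the timing relationship among the second, third, fourth and fifth times can be sketched as follows; the disclosure does not specify how the fourth time is derived, so simple delay compensation is assumed here, and all names below are hypothetical:

import time

def generate_sync_instruction(second_time, third_time):
    # Primary vision sensor: estimate the start-instruction transmission delay
    # from the second time (carried in the start instruction) and the third
    # time (when the start instruction was detected), then pick a fourth time.
    latency = third_time - second_time
    return {"capture_at": time.time() + latency}

def on_sync_instruction(sensor_id, instruction):
    # Auxiliary vision sensor: wait until the shared capture time, so that all
    # sensors acquire visual data of the target object at the same moment.
    delay = instruction["capture_at"] - time.time()
    if delay > 0:
        time.sleep(delay)
    print(f"sensor {sensor_id}: capturing frame")

instruction = generate_sync_instruction(second_time=100.000, third_time=100.002)
for sensor_id in ("aux_1", "aux_2"):
    on_sync_instruction(sensor_id, instruction)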
According to embodiments of the present disclosure, the primary vision sensor 311 may include at least one of: binocular vision sensors, laser vision sensors, structured light vision sensors, and TOF vision sensors.
According to embodiments of the present disclosure, the at least one auxiliary vision sensor 312 may be configured to acquire visual data of the target object at the same moment in response to receiving the synchronous acquisition instruction from the primary vision sensor 311, obtaining the visual data corresponding to the at least one auxiliary vision sensor 312. The same moment may refer to a fifth time, i.e., the current system time at which the at least one auxiliary vision sensor 312 receives the synchronous acquisition instruction from the primary vision sensor 311. In addition, the primary vision sensor 311 may itself acquire visual data of the target object at the fifth time, obtaining the visual data corresponding to the primary vision sensor 311.
According to an embodiment of the present disclosure, the at least one secondary vision sensor 312 may include at least one of: binocular vision sensors, laser vision sensors, structured light vision sensors, and TOF vision sensors.
According to embodiments of the present disclosure, the specific deployment locations of the primary vision sensor 311 and the at least one auxiliary vision sensor 312 may be configured according to actual business requirements, and are not limited herein. For example, the primary vision sensor 311 and the at least one auxiliary vision sensor 312 may each be disposed at a fourth predetermined location spaced a fourth predetermined distance from the target object. Alternatively, the primary vision sensor 311 may be disposed at a fifth predetermined location spaced a fifth predetermined distance from the target object, and the at least one auxiliary vision sensor 312 may be disposed at a sixth predetermined location spaced a sixth predetermined distance from the primary vision sensor 311. For example, the fourth predetermined distance may be 10 meters, the fifth predetermined distance may be 5 meters, and the sixth predetermined distance may be 5 meters.
According to embodiments of the present disclosure, the synchronous acquisition instruction is generated by the primary vision sensor in response to detecting the start instruction; on this basis, the at least one auxiliary vision sensor acquires the visual data of the target object in response to receiving the synchronous acquisition instruction, so that visual data in different viewing angle ranges at the same moment can be obtained, which further improves the accuracy of the visual data.
Fig. 4 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 4, the visual data processing apparatus 400 may include a plurality of vision sensors 410, a main processor 420, a main power supply 430, and an auxiliary power supply 440. The plurality of vision sensors 410 may include vision sensor 410_1, vision sensor 410_2, …, vision sensor 410_Q. Q may be an integer greater than or equal to 1, and q ∈ {1, 2, …, Q-1, Q}.
The main processor 420 may be connected with the plurality of vision sensors 410 through the first data line 401. The main power supply 430 may be connected to the plurality of vision sensors 410 and the main processor 420 through the first power line 402. The auxiliary power source 440 may be connected to the first data line 401 through the second power line 404.
The auxiliary power supply 440 may be configured to supply power to the first data line 401.
For the description of the plurality of vision sensors 410, the main processor 420, the main power supply 430, the first data line 401 and the first power line 402, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101 and the first power line 102, which are not repeated herein.
According to an embodiment of the present disclosure, the auxiliary power source 440 may include at least one of: an AC auxiliary power supply and a DC auxiliary power supply. The AC auxiliary power supply may include at least one of: a parameter-adjusting AC auxiliary power supply, an auto-adjusting AC auxiliary power supply, and a switching AC auxiliary power supply. The DC auxiliary power supply may include at least one of: a chemical DC auxiliary power supply, a linear DC auxiliary power supply, and a switch-mode DC auxiliary power supply.
According to an embodiment of the present disclosure, the second power line 404 may include at least one of: a second standard power line and a second interconnect power line. The second standard power line may include a second power plug and a third connector. The second interconnect power line may include a second plug and a fourth connector. The number of the second power lines 404, the arrangement manner of the second power lines 404, the number of plugs of the second power lines 404, and the arrangement manner of the plugs of the second power lines 404 may be set according to actual service requirements, which is not limited herein.
For example, the number of the second power lines 404 may be adapted to the number of the first data lines 401, the number of plugs of the second power lines 404 may be adapted to the number of interfaces of the first data lines 401, the arrangement manner of the second power lines 404 may be adapted to the arrangement manner of the first data lines 401, and the arrangement manner of the plugs of the second power lines 404 may be adapted to the arrangement manner of the interfaces of the first data lines 401.
According to the embodiment of the disclosure, since the auxiliary power supply can be connected with the first data line through the second power line to supply power to the first data line, stability and reliability of data transmission over the first data line can be ensured.

According to an embodiment of the present disclosure, the main power source 430 may be a portable power source.
In accordance with an embodiment of the present disclosure, in the case where the main power source 430 is a portable power source, the main power source 430 may include at least one first portable power source. The at least one first portable power source may be connected in parallel through a first bus for synchronous charging or synchronous discharging, so as to accommodate scenarios with higher power requirements. Alternatively, a part of the first portable power sources among the at least one first portable power source may be controlled to charge or discharge, so as to reduce the wear on the first portable power sources.
According to an embodiment of the present disclosure, each of the at least one first portable power source may include a first power source housing, a first solar panel, and a first display. The first solar panel may be disposed on a surface of the first power supply housing. The first solar panel may be configured to receive solar light and convert light energy generated by the solar light into electrical energy, thereby improving the sustainability of the main power source 430 in an outdoor scenario.
According to the embodiment of the disclosure, since the main power supply can use the mobile power supply, the sustainability of the main power supply in an outdoor scene can be improved, and the use effect of the visual data processing equipment in the outdoor scene is further ensured.
The auxiliary power source 440 may be a portable power source according to an embodiment of the present disclosure.
In accordance with an embodiment of the present disclosure, in the case where the auxiliary power source 440 is a portable power source, the auxiliary power source 440 may include at least one second portable power source. The at least one second portable power source may be connected in parallel through a second bus for synchronous charging or synchronous discharging, so as to accommodate scenarios with higher power requirements. Alternatively, a part of the second portable power sources among the at least one second portable power source may be controlled to charge or discharge, so as to reduce the wear on the second portable power sources.
According to an embodiment of the present disclosure, each of the at least one second portable power source may include a second power source housing, a second solar panel, and a second display. The second solar panel may be disposed on a surface of the second power supply housing. The second solar panel may be configured to receive solar light and convert light energy generated by the solar light into electrical energy, thereby improving the sustainability of the auxiliary power supply 440 in an outdoor scenario.
According to the embodiment of the disclosure, since the auxiliary power supply can use the mobile power supply, the sustainability of the auxiliary power supply in an outdoor scene can be improved, and the use effect of the visual data processing device in the outdoor scene is further ensured.
Fig. 5 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 5, the visual data processing apparatus 500 may include a plurality of visual sensors 510, a main processor 520, a main power supply 530, and a plurality of first auxiliary processors 550. The plurality of vision sensors 510 may include vision sensor 510_1, vision sensor 510_2, …, vision sensor 510_k, …, vision sensor 510_K. The plurality of first auxiliary processors 550 may include first auxiliary processor 550_1, first auxiliary processor 550_2, …, first auxiliary processor 550_k, …, first auxiliary processor 550_K. K may be an integer greater than or equal to 1, k ∈ {1, 2, …, K-1, K}.
The main processor 520 may be connected to the plurality of vision sensors 510 through the first data line 501. The main power supply 530 may be connected to the plurality of vision sensors 510 and the main processor 520 through the first power line 502. The plurality of first auxiliary processors 550 may be connected to the plurality of visual sensors 510 through second data lines. The main processor 520 may be connected to a plurality of first auxiliary processors 550 through third data lines 506.
The plurality of first auxiliary processors 550 may be configured to, in response to receiving original visual data from the plurality of visual sensors 510, process the plurality of original visual data to obtain a plurality of visual data.
According to embodiments of the present disclosure, the first auxiliary processors may be in one-to-one correspondence with the vision sensors. For example, the first auxiliary processor 550_1 may be connected to the vision sensor 510_1 through the second data line 505_1, the first auxiliary processor 550_2 may be connected to the vision sensor 510_2 through the second data line 505_2, …, the first auxiliary processor 550_k may be connected to the vision sensor 510_k through the second data line 505_k, …, and the first auxiliary processor 550_K may be connected to the vision sensor 510_K through the second data line 505_K.
The main processor 520 may be configured to respond to receiving visual data from the plurality of first auxiliary processors 550.
For the description of the plurality of vision sensors 510, the main processor 520, the main power supply 530, the first data line 501, and the first power line 502, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101, and the first power line 102, which are not repeated herein.
According to an embodiment of the present disclosure, the plurality of first auxiliary processors 550 may include at least one of: an image signal processor, a digital signal processor, and a central processor in a server. Alternatively, the plurality of first auxiliary processors 550 may further include at least one of: application specific integrated circuits, field programmable gate arrays, and microprocessors. The number of the plurality of first auxiliary processors 550 may be adapted to the number of the plurality of visual sensors 510, and the plurality of first auxiliary processors 550 may be deployed in a manner adapted to the deployment location of the plurality of visual sensors 510.
According to embodiments of the present disclosure, the original visual data of the visual sensor corresponding to each first auxiliary processor may be preprocessed by the plurality of first auxiliary processors 550, respectively, to obtain the first processed visual data corresponding to each of the plurality of visual sensors 510. After the first processed visual data is obtained, it may be transmitted to the main processor 520, so that the main processor 520, after receiving the first processed visual data corresponding to each of the plurality of visual sensors 510, processes it to obtain the pose data of the target object.
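As an illustrative, non-limiting sketch, the division of labor just described can be emulated as follows; the process pool standing in for the one-to-one first auxiliary processors, the decimation used as the preprocessing step, and the stacking used as a stand-in for pose estimation are assumptions for illustration only.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def preprocess(raw_frame: np.ndarray) -> np.ndarray:
    """Hypothetical per-sensor preprocessing (first auxiliary processor):
    downsample the original visual data so less data reaches the main processor."""
    return raw_frame[::2, ::2]  # naive 2x2 decimation as a stand-in

def main_processor(frames) -> np.ndarray:
    """Stand-in for the main processor: receives the K preprocessed frames.
    A real implementation would estimate pose data here."""
    return np.stack(list(frames))  # shape: (K, H/2, W/2)

if __name__ == "__main__":
    # Original visual data from K = 4 vision sensors.
    raw = [np.random.randint(0, 256, (480, 640), dtype=np.uint8) for _ in range(4)]
    # One worker per sensor, mirroring the one-to-one correspondence.
    with ProcessPoolExecutor(max_workers=4) as pool:
        fused = main_processor(pool.map(preprocess, raw))
    print(fused.shape, fused.dtype)  # (4, 240, 320) uint8
```

Because each frame is reduced before it crosses the third data line, the main processor receives a fraction of the original traffic, which is the benefit summarized at the end of this embodiment.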
According to the embodiments of the present disclosure, the number of the second data lines, the length of the second data lines, the deployment manner of the second data lines, the number of interfaces of the second data lines, and the deployment manner of the interfaces of the second data lines may be set according to actual service requirements, which is not limited herein. For example, the number of the second data lines 505_1, 505_2, …, 505_k, …, 505_K and the number of their interfaces may be adapted to the number of the plurality of visual sensors 510. The deployment manner of the second data lines 505_1, 505_2, …, 505_k, …, 505_K and of their interfaces may be adapted to the deployment positions of the plurality of visual sensors 510.
According to the embodiment of the present disclosure, the number of third data lines 506, the length of the third data lines 506, the arrangement mode of the third data lines 506, the number of interfaces of the third data lines 506, and the arrangement mode of the interfaces of the third data lines 506 may be set according to actual service requirements, which is not limited herein. For example, the length of the third data line 506 may be 5 meters, and the number of third data lines 506 and the number of interfaces of the third data line 506 may be adapted to the number of the plurality of first auxiliary processors 550. The deployment of the third data line 506 and the deployment of the interface may be adapted to the deployment location of the plurality of first auxiliary processors 550.
According to an embodiment of the present disclosure, the second data lines 505_1, 505_2, …, 505_k, …, 505_K and the third data line 506 may include at least one of: a data line using a COM interface, a data line using a USB interface, a data line using a Type-C interface, and a data line using a Micro-USB interface.
According to the embodiment of the disclosure, since the plurality of visual data are processed by the plurality of first auxiliary processors, the data transmission amount between the first auxiliary processors and the main processor is reduced, and the stability and reliability of data transmission between the first auxiliary processors and the main processor are improved.
Fig. 6 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 6, the visual data processing apparatus 600 may include a plurality of visual sensors 610, a main processor 620, a main power supply 630, and a second auxiliary processor 660. The plurality of vision sensors 610 may include vision sensor 610_1, vision sensor 610_2, …, vision sensor 610_r, …, vision sensor 610_R. R may be an integer greater than or equal to 1, r ∈ {1, 2, …, R-1, R}.
The main processor 620 may be connected with the plurality of vision sensors 610 through a first data line 601. The main power source 630 may be connected to the plurality of vision sensors 610 and the main processor 620 through the first power line 602. The second auxiliary processor 660 may be connected to the plurality of vision sensors 610 through a fourth data line 607. The main processor 620 may be connected with the second auxiliary processor 660 through a fifth data line 608.
The second auxiliary processor 660 may be configured to, in response to receiving original visual data from the plurality of visual sensors 610, process the plurality of original visual data to obtain a plurality of visual data.
The main processor 620 may be configured to respond to receiving visual data from the second auxiliary processor 660.
For the description of the plurality of vision sensors 610, the main processor 620, the main power supply 630, the first data line 601, and the first power line 602, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101, and the first power line 102, which are not repeated herein.
According to an embodiment of the present disclosure, the second auxiliary processor 660 may include at least one of: an image signal processor, a digital signal processor, and a central processor in a server. Alternatively, the second auxiliary processor 660 may further include at least one of: application specific integrated circuits, field programmable gate arrays, and microprocessors. The number of second auxiliary processors 660 may be adapted to the number of main processors 620, and the manner in which the second auxiliary processors 660 are deployed may be adapted to the deployment location of the main processors 620.
According to embodiments of the present disclosure, the original visual data of the plurality of visual sensors 610 may be preprocessed using the second auxiliary processor 660 to obtain the second processed visual data corresponding to each of the plurality of visual sensors 610. After the second processed visual data is obtained, it may be transmitted to the main processor 620, so that the main processor 620, after receiving the second processed visual data corresponding to each of the plurality of visual sensors 610, processes it to obtain the pose data of the target object.
According to the embodiment of the present disclosure, the number of the fourth data lines 607, the length of the fourth data lines 607, the arrangement mode of the fourth data lines 607, the number of interfaces of the fourth data lines 607, and the arrangement mode of the interfaces of the fourth data lines 607 may be set according to actual service requirements, which is not limited herein. For example, the length of the fourth data line 607 may be 7 meters, and the number of the fourth data lines 607 and the number of the fourth data line 607 interfaces may be adapted to the number of the plurality of visual sensors 610. The deployment of the fourth data line 607 and the interface deployment may be adapted to the deployment locations of the plurality of visual sensors 610.
According to the embodiment of the present disclosure, the number of the fifth data lines 608, the length of the fifth data lines 608, the arrangement mode of the fifth data lines 608, the number of interfaces of the fifth data lines 608, and the arrangement mode of the interfaces of the fifth data lines 608 may be set according to actual service requirements, which is not limited herein. For example, the length of the fifth data line 608 may be 6 meters, and the number of the fifth data lines 608 and the number of interfaces of the fifth data line 608 may be adapted to the number of the second auxiliary processors 660. The deployment of the fifth data line 608 and the interface deployment may be adapted to the deployment location of the second auxiliary processor 660.
According to an embodiment of the present disclosure, the fourth data line 607 and the fifth data line 608 may include at least one of: a data line using a COM interface, a data line using a USB interface, a data line using a Type-C interface, and a data line using a Micro-USB interface.
According to the embodiment of the disclosure, since the plurality of visual data are obtained by processing the plurality of original visual data by the second auxiliary processor, the data transmission amount between the second auxiliary processor and the main processor is reduced, and the stability and reliability of data transmission between the second auxiliary processor and the main processor are improved.
Fig. 7 schematically illustrates a structural diagram of a visual data processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 7, the visual data processing apparatus 700 may include a plurality of visual sensors 710, a main processor 720, and a main power supply 730. The plurality of vision sensors 710 may include vision sensor 710_1, vision sensor 710_2, …, vision sensor 710_s, …, vision sensor 710_S. The vision sensor 710_1 may include an acquisition unit 711_1 and a processing unit 712_1. The vision sensor 710_2 may include an acquisition unit 711_2 and a processing unit 712_2. In this manner, the vision sensor 710_s may include an acquisition unit 711_s and a processing unit 712_s, and the vision sensor 710_S may include an acquisition unit 711_S and a processing unit 712_S. S may be an integer greater than or equal to 1, s ∈ {1, 2, …, S-1, S}.
The main processor 720 may be connected to the plurality of vision sensors 710 through a first data line 701. The main power supply 730 may be connected to the plurality of vision sensors 710 and the main processor 720 through the first power line 702.
For the description of the plurality of vision sensors 710, the main processor 720, the main power supply 730, the first data line 701, and the first power line 702, reference may be made to the description of the plurality of vision sensors 110, the main processor 120, the main power supply 130, the first data line 101, and the first power line 102, which are not repeated herein.
The acquisition units 711_1, 711_2, …, 711_s, …, 711_S may be configured to acquire the visual data of the target object at the same time according to the synchronous acquisition instruction in response to detecting the synchronous acquisition instruction, resulting in the original visual data corresponding to the respective visual sensors.
The processing units 712_1, 712_2, …, 712_s, …, 712_S may be configured to process the original visual data to obtain the visual data corresponding to the respective visual sensors.
According to the embodiment of the disclosure, since the visual data is obtained by processing the original visual data by the processing unit included in the visual sensor, the data transmission amount between the visual sensor and the main processor is reduced, and the stability and reliability of the data transmission between the visual sensor and the main processor are improved.
Fig. 8 schematically shows an example schematic diagram of a visual data processing device according to an embodiment of the disclosure.
As shown in fig. 8, the visual data processing apparatus 800 may include a plurality of visual sensors, a main processor 820, and a main power supply 830. The target object may be located at O_a(x_a, y_a, z_a). The number and location of the plurality of visual sensors may be set according to actual business requirements, and are not limited herein.
For example, the plurality of vision sensors may include vision sensor 810_1 at A_1(x_1, y_1, z_1), vision sensor 810_2 at A_2(x_2, y_2, z_2), vision sensor 810_3 at A_3(x_3, y_3, z_3), and vision sensor 810_4 at A_4(x_4, y_4, z_4).
Note that O_a(x_a, y_a, z_a) may be any point located in the first world coordinate system x_a y_a z_a. A_1(x_1, y_1, z_1) may be any point in the first camera coordinate system x_a1 y_a1 z_a1, A_2(x_2, y_2, z_2) may be any point in the second camera coordinate system x_a2 y_a2 z_a2, A_3(x_3, y_3, z_3) may be any point in the third camera coordinate system x_a3 y_a3 z_a3, and A_4(x_4, y_4, z_4) may be any point in the fourth camera coordinate system x_a4 y_a4 z_a4.
The relative positional relationship among the first world coordinate system x_a y_a z_a, the first camera coordinate system x_a1 y_a1 z_a1, the second camera coordinate system x_a2 y_a2 z_a2, the third camera coordinate system x_a3 y_a3 z_a3, and the fourth camera coordinate system x_a4 y_a4 z_a4 may be set according to actual business requirements, and is not limited herein.
The vision sensor 810_1, the vision sensor 810_2, the vision sensor 810_3, and the vision sensor 810_4 may be configured to acquire vision data of a target object at the same time according to a synchronous acquisition instruction in response to detection of the synchronous acquisition instruction, resulting in vision data corresponding to a plurality of vision sensors.
The main processor 820 may be connected to a plurality of vision sensors through a first data line. For example, the main processor 820 may be connected to the vision sensor 810_1 through the first data line 801_1, the main processor 820 may be connected to the vision sensor 810_2 through the first data line 801_2, the main processor 820 may be connected to the vision sensor 810_3 through the first data line 801_3, and the main processor 820 may be connected to the vision sensor 810_4 through the first data line 801_4.
The main power source 830 may be connected to the plurality of vision sensors and the main processor 820 through a first power line. For example, the main power source 830 may be connected to the vision sensor 810_1 through the first power line 802_1, the main power source 830 may be connected to the vision sensor 810_2 through the first power line 802_2, the main power source 830 may be connected to the vision sensor 810_3 through the first power line 802_3, and the main power source 830 may be connected to the vision sensor 810_4 through the first power line 802_4. Alternatively, the main power source 830 may also be connected to the main processor 820 through a first power line 802_a.
The main processor 820 may be configured to process the plurality of visual data to obtain pose data of the target object in response to receiving visual data from the visual sensor 810_1, the visual sensor 810_2, the visual sensor 810_3, and the visual sensor 810_4.
For example, a first coordinate conversion process may be performed on O_a(x_a, y_a, z_a) in the first world coordinate system x_a y_a z_a to obtain a point O_a' located in the first pixel coordinate system. The first coordinate conversion process may include a conversion process from the first world coordinate system x_a y_a z_a to the eleventh camera coordinate system, a conversion process from the eleventh camera coordinate system to the first image coordinate system, and a conversion process from the first image coordinate system to the first pixel coordinate system.
For example, the first camera coordinate system x_a1 y_a1 z_a1, the second camera coordinate system x_a2 y_a2 z_a2, the third camera coordinate system x_a3 y_a3 z_a3, and the fourth camera coordinate system x_a4 y_a4 z_a4 may be subjected to a second coordinate conversion process, resulting in a point A_1', a point A_2', a point A_3', and a point A_4' located in the second pixel coordinate system. The second coordinate conversion process may include a conversion process from the first camera coordinate system x_a1 y_a1 z_a1, the second camera coordinate system x_a2 y_a2 z_a2, the third camera coordinate system x_a3 y_a3 z_a3, and the fourth camera coordinate system x_a4 y_a4 z_a4 to the second image coordinate system, and a conversion process from the second image coordinate system to the second pixel coordinate system.
The first pixel coordinate system and the second pixel coordinate system may be the same or different, and are not limited herein. In the case where the first pixel coordinate system and the second pixel coordinate system belong to the same coordinate system, the main processor 820 may obtain the pose data of the target object according to the point O_a', the point A_1', the point A_2', the point A_3', and the point A_4'.
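As an illustrative, non-limiting sketch, one plausible reading of such a conversion chain (world coordinate system to camera coordinate system to image coordinate system to pixel coordinate system) is the standard pinhole camera model below; the rotation, translation, and intrinsic parameters are made-up numbers, not values from the disclosure.

```python
import numpy as np

# Hypothetical extrinsics of one camera: world -> camera rotation and translation.
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])  # camera 5 m in front of the world origin

# Hypothetical intrinsics: focal lengths and principal point, in pixels.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def world_to_pixel(p_world: np.ndarray) -> np.ndarray:
    """Chain the three conversion processes described above."""
    p_cam = R @ p_world + t   # world coordinate system -> camera coordinate system
    p_img = p_cam / p_cam[2]  # camera coordinate system -> image plane
    p_pix = K @ p_img         # image coordinate system -> pixel coordinate system
    return p_pix[:2]

O_a = np.array([0.1, -0.2, 1.0])  # a target point O_a(x_a, y_a, z_a)
print(world_to_pixel(O_a))        # O_a' in the pixel coordinate system
```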
Fig. 9 schematically shows an example schematic diagram of a visual data processing device according to another embodiment of the present disclosure.
As shown in fig. 9, the visual data processing apparatus 900 may include a plurality of visual sensors, a main processor 920 and a main power source 930. The target object may be located at O_b(x_b, y_b, z_b). The number and location of the plurality of visual sensors may be set according to actual business requirements, and are not limited herein.
For example, the plurality of visual sensors may include visual sensor 910_1 at B_1(x_1, y_1, z_1), visual sensor 910_2 at B_2(x_2, y_2, z_2), visual sensor 910_3 at B_3(x_3, y_3, z_3), visual sensor 910_4 at B_4(x_4, y_4, z_4), visual sensor 910_5 at B_5(x_5, y_5, z_5), and visual sensor 910_6 at B_6(x_6, y_6, z_6).
Note that O_b(x_b, y_b, z_b) may be any point located in the second world coordinate system x_b y_b z_b. B_1(x_1, y_1, z_1) may be any point in the fifth camera coordinate system x_b1 y_b1 z_b1, B_2(x_2, y_2, z_2) may be any point in the sixth camera coordinate system x_b2 y_b2 z_b2, B_3(x_3, y_3, z_3) may be any point in the seventh camera coordinate system x_b3 y_b3 z_b3, B_4(x_4, y_4, z_4) may be any point in the eighth camera coordinate system x_b4 y_b4 z_b4, B_5(x_5, y_5, z_5) may be any point in the ninth camera coordinate system x_b5 y_b5 z_b5, and B_6(x_6, y_6, z_6) may be any point in the tenth camera coordinate system x_b6 y_b6 z_b6.
The relative positional relationship among the second world coordinate system x_b y_b z_b, the fifth camera coordinate system x_b1 y_b1 z_b1, the sixth camera coordinate system x_b2 y_b2 z_b2, the seventh camera coordinate system x_b3 y_b3 z_b3, the eighth camera coordinate system x_b4 y_b4 z_b4, the ninth camera coordinate system x_b5 y_b5 z_b5, and the tenth camera coordinate system x_b6 y_b6 z_b6 may be set according to actual business requirements, and is not limited herein.
The vision sensor 910_1, the vision sensor 910_2, the vision sensor 910_3, the vision sensor 910_4, the vision sensor 910_5, and the vision sensor 910_6 may be configured to acquire vision data of a target object at the same time according to a synchronous acquisition instruction in response to detection of the synchronous acquisition instruction, resulting in vision data corresponding to a plurality of vision sensors.
The main processor 920 may be connected with a plurality of vision sensors through a first data line. For example, the main processor 920 may be connected to the vision sensor 910_1 through a first data line 901_1, the main processor 920 may be connected to the vision sensor 910_2 through a first data line 901_2, the main processor 920 may be connected to the vision sensor 910_3 through a first data line 901_3, the main processor 920 may be connected to the vision sensor 910_4 through a first data line 901_4, the main processor 920 may be connected to the vision sensor 910_5 through a first data line 901_5, and the main processor 920 may be connected to the vision sensor 910_6 through a first data line 901_6.
The main power supply 930 may be connected to the plurality of vision sensors and the main processor 920 through a first power line. For example, the main power source 930 may be connected to the vision sensor 910_1 through the first power line 902_1, the main power source 930 may be connected to the vision sensor 910_2 through the first power line 902_2, the main power source 930 may be connected to the vision sensor 910_3 through the first power line 902_3, the main power source 930 may be connected to the vision sensor 910_4 through the first power line 902_4, the main power source 930 may be connected to the vision sensor 910_5 through the first power line 902_5, and the main power source 930 may be connected to the vision sensor 910_6 through the first power line 902_6. Alternatively, the main power source 930 may also be connected to the main processor 920 through a first power line 902_b.
The main processor 920 may be configured to process the plurality of visual data to obtain pose data of the target object in response to receiving visual data from the visual sensor 910_1, the visual sensor 910_2, the visual sensor 910_3, the visual sensor 910_4, the visual sensor 910_5, and the visual sensor 910_6.
For example, a third coordinate conversion process may be performed on O_b(x_b, y_b, z_b) in the second world coordinate system x_b y_b z_b to obtain a point O_b' located in the third pixel coordinate system. The third coordinate conversion process may include a conversion process from the second world coordinate system x_b y_b z_b to the twelfth camera coordinate system, a conversion process from the twelfth camera coordinate system to the third image coordinate system, and a conversion process from the third image coordinate system to the third pixel coordinate system.
A fourth coordinate conversion process may be performed on the fifth camera coordinate system x_b1 y_b1 z_b1, the sixth camera coordinate system x_b2 y_b2 z_b2, the seventh camera coordinate system x_b3 y_b3 z_b3, the eighth camera coordinate system x_b4 y_b4 z_b4, the ninth camera coordinate system x_b5 y_b5 z_b5, and the tenth camera coordinate system x_b6 y_b6 z_b6, resulting in a point B_1', a point B_2', a point B_3', a point B_4', a point B_5', and a point B_6' located in the fourth pixel coordinate system. The fourth coordinate conversion process may include a conversion process from the fifth through tenth camera coordinate systems to the fourth image coordinate system, and a conversion process from the fourth image coordinate system to the fourth pixel coordinate system.
The third pixel coordinate system and the fourth pixel coordinate system may be the same or different, and are not limited herein. In the case where the third pixel coordinate system and the fourth pixel coordinate system belong to the same coordinate system, the main processor 920 may obtain the pose data of the target object according to the point O_b', the point B_1', the point B_2', the point B_3', the point B_4', the point B_5', and the point B_6'.
The above is only an exemplary embodiment, but is not limited thereto, and other visual data processing apparatuses known in the art may be also included as long as accuracy of visual data and pose data can be improved.
Fig. 10 schematically shows a flowchart of a visual data processing method according to an embodiment of the present disclosure.
As shown in fig. 10, the method 1000 includes operations S1010 to S1020.
In operation S1010, the plurality of vision sensors acquire vision data of the target object at the same time according to the synchronous acquisition instruction in response to detecting the synchronous acquisition instruction, and vision data corresponding to the plurality of vision sensors is obtained.
In operation S1020, the main processor processes the plurality of visual data in response to receiving the visual data from the plurality of visual sensors through the first data line, to obtain pose data of the target object.
According to embodiments of the present disclosure, the viewing angle ranges of the plurality of visual sensors may be different from each other. The primary power source may power the plurality of vision sensors and the primary processor through the first power line.
According to an embodiment of the present disclosure, the visual data processing method described in the embodiment of the present disclosure may be applied to the visual data processing apparatus described in the embodiment of the present disclosure.
According to embodiments of the present disclosure, a plurality of visual sensors may be connected by a synchronization line.
According to an embodiment of the present disclosure, operation S1010 may include the following operations.
The primary vision sensor generates a synchronous acquisition instruction in response to detecting the start instruction. The at least one secondary vision sensor responds to receiving the synchronous acquisition instruction from the primary vision sensor. The primary vision sensor and the at least one secondary vision sensor acquire the vision data of the target object at the same time according to the synchronous acquisition instruction, obtaining the vision data corresponding to the primary vision sensor and the at least one secondary vision sensor.
According to an embodiment of the present disclosure, the plurality of vision sensors includes a primary vision sensor and at least one secondary vision sensor.
According to an embodiment of the present disclosure, the auxiliary power supply supplies power to the first data line through the second power supply line.
According to an embodiment of the present disclosure, the visual data processing method 1000 may further include the following operations.
The first auxiliary processors process the original visual data to obtain visual data in response to receiving the original visual data from the visual sensors through the second data lines.
According to an embodiment of the present disclosure, operation S1020 may include the following operations.
The main processor responds to the receiving of the visual data from the plurality of first auxiliary processors through the third data line, and processes the plurality of visual data to obtain pose data of the target object.
According to an embodiment of the present disclosure, the first auxiliary processors are in one-to-one correspondence with the vision sensors. According to an embodiment of the present disclosure, the visual data processing method 1000 may further include the following operations.
The second auxiliary processor processes the plurality of original visual data to obtain a plurality of visual data in response to receiving the original visual data from the plurality of visual sensors through the fourth data line.
According to an embodiment of the present disclosure, operation S1020 may include the following operations.
The main processor responds to the visual data received from the second auxiliary processor through the fifth data line, and processes the plurality of visual data to obtain pose data of the target object.
According to an embodiment of the present disclosure, a vision sensor includes an acquisition unit and a processing unit.
According to an embodiment of the present disclosure, operation S1010 may include the following operations.
The plurality of acquisition units, in response to detecting the synchronous acquisition instruction, acquire the original visual data of the target object at the same time according to the synchronous acquisition instruction, obtaining the original visual data corresponding to the plurality of visual sensors. The plurality of processing units process the plurality of original visual data to obtain the visual data corresponding to the plurality of visual sensors.
According to embodiments of the present disclosure, the primary power source may be a mobile power source.
According to embodiments of the present disclosure, the auxiliary power source may be a mobile power source.
According to an embodiment of the present disclosure, operation S1020 may include the following operations.
Key point detection is performed on the plurality of visual data to obtain a plurality of key point sets. Pose data of the target object is determined according to the plurality of key point sets.
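As an illustrative, non-limiting sketch of the second step (determining pose data from the plurality of key point sets), linear (direct linear transform, DLT) triangulation is one standard way to turn matching key points observed by several calibrated vision sensors into a 3D point; the projection matrices below are assumptions, and the disclosure does not prescribe this particular method.

```python
import numpy as np

def triangulate(P_list, uv_list) -> np.ndarray:
    """DLT triangulation of one key point observed in several views.

    P_list: 3x4 projection matrices, one per vision sensor (assumed calibrated).
    uv_list: the matching pixel coordinates of the key point in each view.
    """
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)      # least-squares solution of A X = 0
    X = vt[-1]
    return X[:3] / X[3]              # homogeneous -> Euclidean 3D point

# Two hypothetical calibrated views observing the same key point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 4.0, 1.0])
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate([P1, P2], [uv1, uv2]))  # ~ [0.2, 0.1, 4.0]
```

Repeating this for every key point in the plurality of key point sets yields the 3D structure from which the pose data of the target object can be assembled.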
According to an embodiment of the present disclosure, performing keypoint detection on a plurality of visual data, resulting in a plurality of keypoint sets, may include the following operations.
Feature extraction is performed on the plurality of visual data to obtain a key point feature map of at least one scale corresponding to each of the plurality of visual data. The plurality of key point sets are obtained according to the key point feature map of at least one scale corresponding to each of the plurality of visual data.
According to an embodiment of the present disclosure, feature extraction is performed on a plurality of visual data, and a key point feature map of at least one scale corresponding to each of the plurality of visual data is obtained, which may include the following operations.
Feature extraction of U stages is performed on the visual data to obtain at least one key point feature map corresponding to the U-th stage. The key point feature map of at least one scale corresponding to the visual data is obtained according to the at least one key point feature map corresponding to the U-th stage.
According to an embodiment of the present disclosure, the u-th stage has T_u parallel hierarchies. The image resolution of the key point feature maps of the same parallel hierarchy is the same. The image resolutions of the key point feature maps of different parallel hierarchies are different.
According to an embodiment of the present disclosure, U is an integer greater than or equal to 1. u is an integer greater than or equal to 1 and less than or equal to U. T_u is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, in a case where U is an integer greater than 1, performing feature extraction of U stages on the visual data to obtain at least one key point feature map corresponding to a U-th stage may include the following operations.
Convolution processing is performed on the at least one key point feature map corresponding to the (u-1)-th stage to obtain at least one intermediate key point feature map corresponding to the u-th stage. Feature fusion is performed on the at least one intermediate key point feature map corresponding to the u-th stage to obtain the at least one key point feature map corresponding to the u-th stage.
According to an embodiment of the present disclosure, u is an integer greater than 1 and less than or equal to U.
According to an embodiment of the present disclosure, feature fusion is performed on at least one intermediate keypoint feature map corresponding to the u-th stage, to obtain at least one keypoint feature map corresponding to the u-th stage, which may include the following operations.
For the i-th parallel hierarchy among the T_u parallel hierarchies, the key point feature map corresponding to the i-th parallel hierarchy is obtained according to the intermediate key point feature map corresponding to the i-th parallel hierarchy and the other intermediate key point feature maps corresponding to the i-th parallel hierarchy.
According to an embodiment of the present disclosure, the other intermediate key point feature maps corresponding to the i-th parallel hierarchy are the intermediate key point feature maps corresponding to at least part of the T_u parallel hierarchies other than the i-th parallel hierarchy. i is an integer greater than or equal to 1 and less than or equal to T_u.
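As an illustrative, non-limiting sketch, the fusion rule just described can be emulated as follows: the key point feature map of the i-th parallel hierarchy combines its own intermediate map with the other hierarchies' intermediate maps resampled to its image resolution. The nearest-neighbour resampling and the element-wise summation are assumptions for illustration; the disclosure does not fix these choices.

```python
import numpy as np

def resample(fm: np.ndarray, shape) -> np.ndarray:
    """Nearest-neighbour resampling to a target (H, W); a stand-in for the
    up-/down-sampling between parallel hierarchies of different resolutions."""
    h, w = fm.shape
    H, W = shape
    return fm[np.ix_(np.arange(H) * h // H, np.arange(W) * w // W)]

def fuse_hierarchies(intermediate_maps):
    """For each of the T_u parallel hierarchies, sum its own intermediate
    key point feature map with the other hierarchies' maps resampled to it."""
    fused = []
    for i, fm_i in enumerate(intermediate_maps):
        acc = fm_i.copy()
        for j, fm_j in enumerate(intermediate_maps):
            if j != i:
                acc += resample(fm_j, fm_i.shape)
        fused.append(acc)  # the i-th hierarchy keeps its own resolution
    return fused

# T_u = 3 parallel hierarchies with different image resolutions.
maps = [np.random.rand(64, 64), np.random.rand(32, 32), np.random.rand(16, 16)]
print([fm.shape for fm in fuse_hierarchies(maps)])  # 64x64, 32x32, 16x16
```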
According to an embodiment of the present disclosure, the above-described visual data processing method may further include the following operations.
The main processor processes the plurality of visual data by using a self-contact detection model to obtain contact detection information, and optimizes the pose data according to the contact detection information.
According to an embodiment of the present disclosure, the self-contact detection model is derived by training a deep learning model using sample visual data and sample contact tag information of the sample visual data.
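As an illustrative, non-limiting sketch, the supervision just described (sample visual data paired with sample contact tag information) can be emulated with a logistic-regression stand-in for the deep learning model; the feature dimension, the synthetic labels, and the binary cross-entropy objective are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: flattened sample visual data and binary
# sample contact tag information (1 = self-contact present).
X = rng.normal(size=(256, 32))            # 256 samples, 32 features each
y = (X[:, 0] + X[:, 1] > 0).astype(float) # synthetic contact labels

w = np.zeros(32)
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient-descent training under a binary cross-entropy loss, standing in
# for training the deep learning model on the sample data.
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

# Inference: contact detection information for new visual data, which the
# main processor could then use to optimize the pose data.
x_new = rng.normal(size=32)
print(f"self-contact probability: {sigmoid(x_new @ w + b):.3f}")
```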
The above is only an exemplary embodiment, but is not limited thereto, and other visual data processing methods known in the art may be included as long as accuracy of visual data and pose data can be improved.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (24)

1. A visual data processing apparatus comprising:
the visual sensors are configured to respond to detection of a synchronous acquisition instruction, acquire visual data of a target object at the same moment according to the synchronous acquisition instruction, and obtain visual data corresponding to the visual sensors, wherein the visual angle ranges of the visual sensors are different from each other;
The main processor is connected with the plurality of vision sensors through a first data line and is configured to respond to receiving vision data from the plurality of vision sensors, process the plurality of vision data and obtain pose data of the target object; and
A main power supply connected to the plurality of vision sensors and the main processor through a first power line and configured to supply power to the plurality of vision sensors and the main processor;
the main processor is configured to perform feature extraction of U stages on the visual data to obtain at least one key point feature map corresponding to a U-th stage, obtain a key point feature map of at least one scale corresponding to the visual data according to the at least one key point feature map corresponding to the U-th stage, obtain a plurality of key point sets according to the key point feature map of at least one scale corresponding to the visual data, and determine pose data of the target object according to the plurality of key point sets;
The u-th stage is provided with T_u parallel hierarchies, the image resolutions of the key point feature maps of the same parallel hierarchy are the same, and the image resolutions of the key point feature maps of different parallel hierarchies are different;
Wherein U and T_u are integers greater than or equal to 1, and u is an integer greater than or equal to 1 and less than or equal to U.
2. The apparatus of claim 1, wherein the plurality of vision sensors are connected by a synchronization line.
3. The apparatus of claim 2, wherein the plurality of vision sensors includes a primary vision sensor and at least one secondary vision sensor;
The main vision sensor is configured to respond to the detection of a starting instruction and generate the synchronous acquisition instruction; and
The at least one secondary vision sensor is configured to respond to receiving a synchronous acquisition instruction from the primary vision sensor.
4. A device according to any one of claims 1 to 3, further comprising:
And the auxiliary power supply is connected with the first data line through a second power line and is configured to supply power to the first data line.
5. A device according to any one of claims 1 to 3, further comprising:
The first auxiliary processors are connected with the vision sensors through second data lines and are configured to process the original vision data to obtain the vision data in response to receiving the original vision data from the vision sensors, wherein the first auxiliary processors are in one-to-one correspondence with the vision sensors; and
The main processor is connected with the plurality of first auxiliary processors through a third data line and is configured to respond to receiving visual data from the plurality of first auxiliary processors.
6. A device according to any one of claims 1 to 3, further comprising:
the second auxiliary processor is connected with the plurality of visual sensors through a fourth data line and is configured to process a plurality of original visual data to obtain a plurality of visual data in response to receiving the original visual data from the plurality of visual sensors; and
The primary processor, coupled to the second secondary processor via a fifth data line, is configured to respond to receiving visual data from the second secondary processor.
7. A device according to any one of claims 1 to 3, wherein the vision sensor comprises:
The acquisition unit is configured to respond to the synchronous acquisition instruction, acquire the visual data of the target object at the same moment according to the synchronous acquisition instruction, and obtain the original visual data corresponding to the visual sensor; and
And the processing unit is configured to process the original visual data to obtain visual data corresponding to the visual sensor.
8. A device according to any one of claims 1 to 3, wherein the primary power source is a mobile power source.
9. The apparatus of claim 4, wherein the auxiliary power source is a mobile power source.
10. The apparatus of claim 1, wherein, in the case where U is an integer greater than 1,
The main processor is configured to perform convolution processing on the at least one key point feature map corresponding to the (u-1)-th stage to obtain at least one intermediate key point feature map corresponding to the u-th stage, and perform feature fusion on the at least one intermediate key point feature map corresponding to the u-th stage to obtain the at least one key point feature map corresponding to the u-th stage;
Wherein u is an integer greater than 1 and less than or equal to U.
11. The apparatus of claim 10, wherein,
The main processor is configured to, for the i-th parallel hierarchy among the T_u parallel hierarchies, obtain the key point feature map corresponding to the i-th parallel hierarchy according to the other intermediate key point feature maps corresponding to the i-th parallel hierarchy and the intermediate key point feature map corresponding to the i-th parallel hierarchy;
The other intermediate key point feature maps corresponding to the i-th parallel hierarchy are the intermediate key point feature maps corresponding to at least part of the parallel hierarchies other than the i-th parallel hierarchy among the T_u parallel hierarchies, and i is an integer greater than or equal to 1 and less than or equal to T_u.
12. The apparatus according to any one of claims 1 to 3, wherein,
The main processor is further configured to process a plurality of visual data by using a self-contact detection model to obtain contact detection information, and optimize the pose data according to the contact detection information;
The self-contact detection model is obtained by training a deep learning model by using sample visual data and sample contact tag information of the sample visual data.
13. A method of visual data processing comprising:
A plurality of vision sensors respond to detection of a synchronous acquisition instruction, acquire vision data of a target object at the same moment according to the synchronous acquisition instruction, and acquire vision data corresponding to the vision sensors, wherein the vision angle ranges of the vision sensors are different from each other; and
The main processor responds to the receiving of the visual data from the plurality of visual sensors through the first data line, processes the plurality of visual data and obtains pose data of the target object;
wherein a primary power supply supplies power to the plurality of vision sensors and the primary processor through a first power supply line;
the processing the visual data to obtain pose data of the target object includes:
Performing key point detection on the plurality of visual data to obtain a plurality of key point sets, and determining pose data of the target object according to the plurality of key point sets;
The detecting the key points of the visual data to obtain a plurality of key point sets includes:
performing feature extraction of U stages on the visual data to obtain at least one key point feature map corresponding to a U-th stage;
Obtaining a key point feature map of at least one scale corresponding to the visual data according to the at least one key point feature map corresponding to the U-th stage; and
Obtaining a plurality of key point sets according to the key point feature diagrams of at least one scale corresponding to the visual data;
The u-th stage is provided with T_u parallel hierarchies, the image resolutions of the key point feature maps of the same parallel hierarchy are the same, and the image resolutions of the key point feature maps of different parallel hierarchies are different;
Wherein U and T_u are integers greater than or equal to 1, and u is an integer greater than or equal to 1 and less than or equal to U.
14. The method of claim 13, wherein the plurality of vision sensors are connected by a synchronization line.
15. The method of claim 13 or 14, wherein the plurality of vision sensors includes a primary vision sensor and at least one secondary vision sensor;
Wherein the acquiring, by the plurality of vision sensors in response to detecting the synchronous acquisition instruction, the vision data of the target object at the same time according to the synchronous acquisition instruction comprises:
the main vision sensor responds to the detection of the starting instruction and generates a synchronous acquisition instruction;
The at least one secondary vision sensor is responsive to receiving a synchronous acquisition instruction from the primary vision sensor; and
And the main vision sensor and the at least one auxiliary vision sensor acquire the vision data of the target object at the same moment according to the synchronous acquisition instruction to obtain the vision data corresponding to the main vision sensor and the at least one auxiliary vision sensor.
16. A method according to claim 13 or 14, wherein an auxiliary power supply supplies power to the first data line via a second power supply line.
17. The method of claim 13 or 14, further comprising:
The plurality of first auxiliary processors, in response to receiving original visual data from the plurality of visual sensors through second data lines, process the original visual data to obtain the visual data, wherein the first auxiliary processors are in one-to-one correspondence with the visual sensors;
The main processor processes a plurality of visual data in response to receiving the visual data from the plurality of visual sensors through a first data line to obtain pose data of the target object, and the method comprises the following steps:
And the main processor responds to receiving the visual data from the plurality of first auxiliary processors through a third data line, and processes the plurality of visual data to obtain the pose data of the target object.
18. The method of claim 13 or 14, further comprising:
The second auxiliary processor, in response to receiving original visual data from the plurality of visual sensors through a fourth data line, processes the plurality of original visual data to obtain a plurality of visual data;
The main processor processes a plurality of visual data in response to receiving the visual data from the plurality of visual sensors through a first data line to obtain pose data of the target object, and the method comprises the following steps:
And the main processor responds to the receiving of the visual data from the second auxiliary processor through a fifth data line, and processes a plurality of visual data to obtain the pose data of the target object.
19. The method of claim 13 or 14, wherein the vision sensor comprises an acquisition unit and a processing unit;
Wherein the acquiring, by the plurality of vision sensors in response to detecting the synchronous acquisition instruction, the vision data of the target object at the same time according to the synchronous acquisition instruction comprises:
The plurality of acquisition units, in response to detecting the synchronous acquisition instruction, acquire the original visual data of the target object at the same time according to the synchronous acquisition instruction, obtaining the original visual data corresponding to the plurality of visual sensors; and
And the processing units process the original visual data to obtain visual data corresponding to the visual sensors.
20. The method of claim 13 or 14, wherein the primary power source is a mobile power source.
21. The method of claim 16, wherein the auxiliary power source is a mobile power source.
22. The method of claim 13, wherein, in a case where U is an integer greater than 1, the performing feature extraction of the visual data in U stages to obtain at least one keypoint feature map corresponding to a U-th stage includes:
Performing convolution processing on the at least one key point feature map corresponding to the (u-1)-th stage to obtain at least one intermediate key point feature map corresponding to the u-th stage; and
Performing feature fusion on the at least one intermediate key point feature map corresponding to the u-th stage to obtain the at least one key point feature map corresponding to the u-th stage;
Wherein u is an integer greater than 1 and less than or equal to U.
23. The method of claim 22, wherein the feature fusing the at least one intermediate keypoint feature map corresponding to the u-th stage to obtain the at least one keypoint feature map corresponding to the u-th stage comprises:
for the i-th parallel hierarchy among the T_u parallel hierarchies,
obtaining the key point feature map corresponding to the i-th parallel hierarchy according to the other intermediate key point feature maps corresponding to the i-th parallel hierarchy and the intermediate key point feature map corresponding to the i-th parallel hierarchy;
Wherein the other intermediate key point feature maps corresponding to the i-th parallel hierarchy are the intermediate key point feature maps corresponding to at least part of the parallel hierarchies other than the i-th parallel hierarchy among the T_u parallel hierarchies, and i is an integer greater than or equal to 1 and less than or equal to T_u.
24. The method of claim 13 or 14, further comprising:
The main processor processes a plurality of visual data by using a self-contact detection model to obtain contact detection information, and optimizes the pose data according to the contact detection information;
The self-contact detection model is obtained by training a deep learning model by using sample visual data and sample contact tag information of the sample visual data.
CN202211661741.5A 2022-12-22 2022-12-22 Visual data processing apparatus and visual data processing method Active CN115963917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211661741.5A CN115963917B (en) 2022-12-22 2022-12-22 Visual data processing apparatus and visual data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211661741.5A CN115963917B (en) 2022-12-22 2022-12-22 Visual data processing apparatus and visual data processing method

Publications (2)

Publication Number Publication Date
CN115963917A CN115963917A (en) 2023-04-14
CN115963917B true CN115963917B (en) 2024-04-16

Family

ID=87357402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211661741.5A Active CN115963917B (en) 2022-12-22 2022-12-22 Visual data processing apparatus and visual data processing method

Country Status (1)

Country Link
CN (1) CN115963917B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080098880A (en) * 2007-05-07 2008-11-12 삼성전기주식회사 Camera module
US10719744B2 (en) * 2017-12-28 2020-07-21 Intel Corporation Automated semantic inference of visual features and scenes
CN111382647B (en) * 2018-12-29 2021-07-30 广州市百果园信息技术有限公司 Picture processing method, device, equipment and storage medium
US11676018B2 (en) * 2020-01-30 2023-06-13 Mediatek Inc. Feature extraction with keypoint resampling and fusion (KRF)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10867217B1 (en) * 2017-09-01 2020-12-15 Objectvideo Labs, Llc Fusion of visual and non-visual information for training deep learning models
CN107911687A (en) * 2017-12-11 2018-04-13 中国科学院长春光学精密机械与物理研究所 Teleoperation of robot auxiliary system based on binocular stereo vision
CN112802114A (en) * 2019-11-13 2021-05-14 浙江舜宇智能光学技术有限公司 Multi-vision sensor fusion device and method and electronic equipment
CN111415388A (en) * 2020-03-17 2020-07-14 Oppo广东移动通信有限公司 Visual positioning method and terminal
CN111750821A (en) * 2020-07-10 2020-10-09 江苏集萃智能光电系统研究所有限公司 Pose parameter measuring method, device and system and storage medium
CN114078148A (en) * 2020-08-19 2022-02-22 北京万集科技股份有限公司 Visual sensor, information acquisition system and road side base station
CN112203077A (en) * 2020-08-21 2021-01-08 中国科学院西安光学精密机械研究所 Colorful glimmer multi-view stereoscopic vision camera and data fusion method thereof
WO2022213883A1 (en) * 2021-04-07 2022-10-13 影石创新科技股份有限公司 Synchronous exposure circuit, exposure method and exposure device for multiple visual sensors
CN113420719A (en) * 2021-07-20 2021-09-21 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium
CN113642431A (en) * 2021-07-29 2021-11-12 北京百度网讯科技有限公司 Training method and device of target detection model, electronic equipment and storage medium
CN113706692A (en) * 2021-08-25 2021-11-26 北京百度网讯科技有限公司 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, electronic device, and storage medium
CN115082639A (en) * 2022-06-15 2022-09-20 北京百度网讯科技有限公司 Image generation method and device, electronic equipment and storage medium
CN115170815A (en) * 2022-06-20 2022-10-11 北京百度网讯科技有限公司 Method, device and medium for processing visual task and training model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Real-time pose estimation of rigid objects using RGB-D imagery; Umar Asif; IEEE Xplore; full text *
Research on 2D human pose estimation with an ASPP-based high-resolution convolutional neural network; Shen Xiaofeng; Wang Chunjia; Modern Computer (No. 13); full text *
Design of a binocular image acquisition system for robots; Liang Fayun; Machinery Design & Manufacture (No. 8); full text *

Also Published As

Publication number Publication date
CN115963917A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN107993216B (en) Image fusion method and equipment, storage medium and terminal thereof
US9727775B2 (en) Method and system of curved object recognition using image matching for image processing
JP6338595B2 (en) Mobile device based text detection and tracking
EP3135033B1 (en) Structured stereo
CN113514008A (en) Three-dimensional scanning method, three-dimensional scanning system, and computer-readable storage medium
WO2020134818A1 (en) Image processing method and related product
WO2015036056A1 (en) Method and system for determining a model of at least part of a real object
KR102455468B1 (en) Method and apparatus for reconstructing three dimensional model of object
JP7566028B2 (en) Learning lighting from diverse portraits
CN109247068A (en) Method and apparatus for rolling shutter compensation
US20150043817A1 (en) Image processing method, image processing apparatus and image processing program
CN108460794B (en) Binocular three-dimensional infrared salient target detection method and system
CN114742866A (en) Image registration method and device, storage medium and electronic equipment
Won et al. Active 3D shape acquisition using smartphones
Nyland et al. Capturing, processing, and rendering real-world scenes
CN115963917B (en) Visual data processing apparatus and visual data processing method
CN112598806A (en) Virtual fitting method and device based on artificial intelligence, computer equipment and medium
CN111582120A (en) Method and terminal device for capturing eyeball activity characteristics
WO2019198446A1 (en) Detection device, detection method, information processing device, and information processing program
EP4075789A1 (en) Imaging device, imaging method, and program
KR20150011714A (en) Device for determining orientation of picture
KR102146839B1 (en) System and method for building real-time virtual reality
CN111582121A (en) Method for capturing facial expression features, terminal device and computer-readable storage medium
CN116645468B (en) Human body three-dimensional modeling method, method and device for training human body structure to generate model
JP2016071496A (en) Information terminal device, method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant