CN116452741A - Object reconstruction method, object reconstruction model training method, device and equipment - Google Patents

Object reconstruction method, object reconstruction model training method, device and equipment

Info

Publication number
CN116452741A
Authority
CN
China
Prior art keywords
sample
sequence
feature
potential
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310431145.6A
Other languages
Chinese (zh)
Other versions
CN116452741B (en)
Inventor
吕以豪
卢飞翔
李龙腾
张良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310431145.6A priority Critical patent/CN116452741B/en
Publication of CN116452741A publication Critical patent/CN116452741A/en
Application granted granted Critical
Publication of CN116452741B publication Critical patent/CN116452741B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/04 Indexing scheme for image data processing or generation, in general involving 3D image data

Abstract

The disclosure provides an object reconstruction method, an object reconstruction model training method, an object reconstruction device, an electronic device, a storage medium and a program product, and relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality and deep learning. The specific implementation scheme is as follows: respectively performing feature extraction on a plurality of images to be processed in an image sequence to be processed to obtain an initial feature map sequence, wherein the image sequence to be processed comprises an object to be reconstructed; for each initial feature map in the initial feature map sequence, generating global features and a local feature group of the object based on the initial feature map to obtain a global feature sequence and a local feature group sequence; generating an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature group sequence; and reconstructing the object based on the object model parameter sequence to obtain a target model sequence.

Description

Object reconstruction method, object reconstruction model training method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the fields of computer vision, augmented reality, virtual reality, and deep learning, and more particularly, to an object reconstruction method, an object reconstruction model training method, an apparatus, an electronic device, a storage medium, and a program product.
Background
Computer vision technology is a science that studies how to make a computer "see". Computer vision technology can be applied to scenes such as image recognition, image semantic understanding, image retrieval, three-dimensional object reconstruction, virtual reality, simultaneous localization and mapping, and the like. For each scene, how to use computer vision technology so that the generated result is reasonable and accurate is worth exploring.
Disclosure of Invention
The present disclosure provides an object reconstruction method, an object reconstruction model training method, an object reconstruction device, an electronic device, a storage medium, and a program product.
According to an aspect of the present disclosure, there is provided an object reconstruction method including: respectively performing feature extraction on a plurality of images to be processed in an image sequence to be processed to obtain an initial feature map sequence, wherein the image sequence to be processed comprises an object to be reconstructed; for each initial feature map in the initial feature map sequence, generating global features and a local feature group of the object based on the initial feature map to obtain a global feature sequence and a local feature group sequence; generating an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature group sequence; and reconstructing the object based on the object model parameter sequence to obtain a target model sequence.
According to another aspect of the present disclosure, there is provided a training method of an object reconstruction model, including: respectively performing feature extraction on a plurality of sample images in a sample image sequence to obtain a sample initial feature map sequence, wherein the sample image sequence comprises a sample object to be reconstructed; for each sample initial feature map in the sample initial feature map sequence, generating a sample global feature and a sample local feature group of the sample object based on the sample initial feature map to obtain a sample global feature sequence and a sample local feature group sequence; generating a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and the sample local feature group sequence; and training the object reconstruction model by using the sample object model parameter sequence and a sample object model parameter label sequence matched with the sample image sequence.
According to another aspect of the present disclosure, there is provided an object reconstruction apparatus including: a feature extraction module configured to respectively perform feature extraction on a plurality of images to be processed in an image sequence to be processed to obtain an initial feature map sequence, wherein the image sequence to be processed comprises an object to be reconstructed; a first generation module configured to generate, for each initial feature map in the initial feature map sequence, global features and a local feature group of the object based on the initial feature map to obtain a global feature sequence and a local feature group sequence; a second generation module configured to generate an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature group sequence; and a reconstruction module configured to reconstruct the object based on the object model parameter sequence to obtain a target model sequence.
According to another aspect of the present disclosure, there is provided a training apparatus of an object reconstruction model, including: a sample feature extraction module configured to respectively perform feature extraction on a plurality of sample images in a sample image sequence to obtain a sample initial feature map sequence, wherein the sample image sequence comprises a sample object to be reconstructed; a sample first generation module configured to generate, for each sample initial feature map in the sample initial feature map sequence, a sample global feature and a sample local feature group of the sample object based on the sample initial feature map to obtain a sample global feature sequence and a sample local feature group sequence; a sample second generation module configured to generate a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and the sample local feature group sequence; and a training module configured to train the object reconstruction model by using the sample object model parameter sequence and a sample object model parameter label sequence matched with the sample image sequence.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as disclosed herein.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as disclosed herein.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as disclosed herein.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which object reconstruction methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an object reconstruction method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow diagram of a human reconstruction method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a structural schematic of an object reconstruction model according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a feature extraction method according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a flow chart of a training method of an object reconstruction model according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of an object reconstruction apparatus according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a training apparatus of an object reconstruction model according to an embodiment of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an electronic device adapted to implement an object reconstruction method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides an object reconstruction method, an object reconstruction model training method, an object reconstruction device, an electronic device, a storage medium, and a program product.
According to an embodiment of the present disclosure, there is provided an object reconstruction method including: performing feature extraction on a plurality of images to be processed in an image sequence to be processed respectively to obtain an initial feature map sequence, where the image sequence to be processed includes the object to be reconstructed. For each initial feature map in the initial feature map sequence, global features and a local feature group of the object are generated based on the initial feature map to obtain a global feature sequence and a local feature group sequence. An object model parameter sequence for reconstructing the object is generated based on the global feature sequence and the local feature group sequence. The object is reconstructed based on the object model parameter sequence to obtain a target model sequence.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, application and other processing of the user's personal information all comply with relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 1 schematically illustrates an exemplary system architecture to which object reconstruction methods and apparatuses may be applied according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the object reconstruction method and apparatus may be applied may include a terminal device, but the terminal device may implement the object reconstruction method and apparatus provided by the embodiments of the present disclosure without interaction with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as knowledge-reading applications, web browser applications, search applications, instant messaging tools, mailbox clients and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the object reconstruction method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the object reconstruction apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the object reconstruction method provided by the embodiments of the present disclosure may also be generally performed by the server 105. Accordingly, the object reconstruction apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The object reconstruction method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the object reconstruction apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, during a football match, the terminal devices 101, 102, 103 may acquire a match video and then send the acquired match video to the server 105, and the server 105 performs frame-splitting processing on the match video to obtain an image sequence to be processed; feature extraction is then performed on a plurality of images to be processed in the image sequence to be processed respectively to obtain an initial feature map sequence. The image sequence to be processed includes an object to be reconstructed, such as an athlete. For each initial feature map in the initial feature map sequence, global features and a local feature group of the object are generated based on the initial feature map to obtain a global feature sequence and a local feature group sequence. An object model parameter sequence for reconstructing the object is generated based on the global feature sequence and the local feature group sequence, and a reconstructed target model sequence of the object is finally obtained. Alternatively, these operations may be performed by a server or a server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically illustrates a flow chart of an object reconstruction method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S240.
In operation S210, feature extraction is performed on a plurality of images to be processed in the image sequence to be processed, respectively, to obtain an initial feature map sequence. The sequence of images to be processed comprises the object to be reconstructed.
In operation S220, for each initial feature map in the initial feature map sequence, global features and local feature sets for the object are generated based on the initial feature map, resulting in a global feature sequence and a local feature set sequence.
In operation S230, an object model parameter sequence for reconstructing an object is generated based on the global feature sequence and the local feature group sequence.
In operation S240, the object is reconstructed based on the object model parameter sequence to obtain a target model sequence.
According to embodiments of the present disclosure, the object to be modeled in the image to be processed may include a person object, an animal object, or a mechanical body object. The plurality of images to be processed in the image sequence to be processed are arranged in time order. The object reconstruction method provided by the embodiments of the present disclosure may be used to generate an object model parameter sequence for reconstructing the object based on the image sequence to be processed. The object model parameter sequence may include a plurality of object model parameters in one-to-one correspondence with the plurality of images to be processed in the image sequence to be processed. A target model sequence of the object is generated based on the object model parameter sequence. Each target model in the target model sequence may be used as a virtual model, e.g. a simulation model characterizing the motion and pose of the object. The object reconstruction method provided by the embodiments of the present disclosure can be applied to scenes such as virtual reality, augmented reality, human-machine interaction, and action recognition.
For example, when the object reconstruction method provided by the embodiments of the present disclosure is applied to an action recognition scene, a video of an athlete during training or competition may be collected, and the video including the athlete is used as the image sequence to be processed. Each target model in the target model sequence includes information such as the athlete's posture and action. The target model sequence may be used to determine the coherence of the athlete's movements. In addition, based on key target models in the target model sequence, information such as the swing frequency, amplitude and angle of the athlete's arms and legs can be determined, and whether the athlete's actions meet the standards can be counted and quantitatively analyzed in combination with the indexes of key links.
According to an embodiment of the present disclosure, a feature extraction network may be used to respectively perform feature extraction on the image sequence to be processed to obtain an initial feature map sequence. The initial feature map sequence includes a plurality of initial feature maps. The initial feature map may be used to characterize the image to be processed. The type of the initial feature map is not limited, and may be, for example, a feature map of N×H×W, where N represents the number of channels of the initial feature map, H represents the height of the initial feature map, and W represents the width of the initial feature map. The feature extraction network may include at least one of: a CNN (Convolutional Neural Network), ResNet (Residual Network), or ShuffleNet (a lightweight network).
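As a concrete illustration of this feature extraction step, the sketch below produces one N×H×W initial feature map per image to be processed. The choice of a ResNet-50 backbone from torchvision and the resulting channel count (N = 2048) are assumptions made for illustration only; the embodiment merely requires some feature extraction network such as a CNN, ResNet or ShuffleNet.

```python
# A minimal sketch of per-frame feature extraction. The ResNet-50 backbone from
# torchvision and the channel count N = 2048 are assumptions; the embodiment only
# requires an N x H x W initial feature map per image to be processed.
import torch
import torchvision


class FrameFeatureExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep everything up to the last convolutional stage; drop pooling and fc.
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (L, 3, H_img, W_img), one entry per image to be processed.
        # Returns the initial feature map sequence of shape (L, N, H, W).
        return self.backbone(frames)


initial_feature_maps = FrameFeatureExtractor()(torch.randn(3, 3, 224, 224))
```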
According to embodiments of the present disclosure, a global feature of the object may refer to a global feature that excludes background information in the image to be processed. Feature extraction may be performed on the initial feature map to obtain a high-order global feature relative to the initial feature map, but the present disclosure is not limited thereto; the initial feature map with the background information excluded may also be used as the global feature, as long as the global feature contains the global semantic information of the object to be modeled.
According to an embodiment of the present disclosure, a local feature group of the object includes a plurality of local features. Each local feature in the local feature group may refer to a local feature that excludes background information in the image to be processed. Feature extraction may be performed on the initial feature map to obtain a high-order global feature relative to the initial feature map, and the global feature may then be split to obtain the local feature group, but the present disclosure is not limited thereto; the initial feature map with the background information excluded may also be directly split to obtain the local feature group, as long as each local feature contains part of the semantic information of the object to be modeled.
According to an embodiment of the present disclosure, generating the object model parameter sequence based on the global feature sequence and the local feature group sequence may include: inputting the global feature sequence and the local feature group sequence into an object reconstruction model to generate the object model parameter sequence. The object model parameter sequence is in one-to-one correspondence with the initial feature map sequence. The object reconstruction model may include one or more of a graph convolutional network, a codec, or a generative adversarial network.
According to an embodiment of the present disclosure, each object model parameter in the object model parameter sequence comprises a pose parameter and a shape parameter. Reconstructing the object based on the object model parameter sequence to obtain the target model sequence may include: for each object model parameter in the object model parameter sequence, generating a target model based on the pose parameter and the shape parameter, so as to obtain a target model sequence in one-to-one correspondence with the object model parameter sequence.
According to embodiments of the present disclosure, object model parameters may include pose parameters and shape parameters. The pose parameters may include the rotation angle of the individual joints of the subject, such as the rotation angle of the joints of the subject's head, wrist, etc. Shape parameters may include the outline shape and size of the object, such as height, thickness, etc. of the object.
According to an embodiment of the present disclosure, generating the target model based on the pose parameter and the shape parameter may include: inputting the pose parameters and the shape parameters into a pose simulation model to obtain the target model. However, the present disclosure is not limited thereto; the pose parameters, the shape parameters and camera parameters may also be input into the pose simulation model to obtain the target model. The camera parameters may be three-dimensional data; for example, the camera parameters include a scaling and an offset of the projection mapping of pixels between three-dimensional space and two-dimensional space.
According to embodiments of the present disclosure, the pose simulation model may include a preset three-dimensional model whose morphological changes are controlled by a fixed number of parameters. For example, the object model parameters may be input into the pose simulation model to obtain an updated pose simulation model, which presents a form matching each joint angle of the object in the image to be processed. In the case where the object to be reconstructed is a human object, the pose simulation model may include an SMPL (Skinned Multi-Person Linear) model. Three-dimensional key point information of the human body can be obtained based on the target model.
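The following is a minimal sketch of reconstructing a single target model from pose and shape parameters with an SMPL-style pose simulation model. The use of the third-party smplx package, its argument names, the model file path and the parameter sizes are assumptions; the embodiment only states that an SMPL model may serve as the pose simulation model.

```python
# A sketch of reconstructing one target model from pose and shape parameters.
# The smplx package, its argument names and the model path are assumptions; the
# embodiment only states that an SMPL pose simulation model may be used.
import torch
import smplx

smpl = smplx.create("path/to/smpl_models", model_type="smpl")  # hypothetical path

shape_params = torch.zeros(1, 10)   # outline shape and size of the object
pose_params = torch.zeros(1, 69)    # rotation angles of the body joints
global_orient = torch.zeros(1, 3)   # overall orientation of the object

output = smpl(betas=shape_params, body_pose=pose_params, global_orient=global_orient)
target_model_vertices = output.vertices  # surface of the updated pose simulation model
keypoints_3d = output.joints             # three-dimensional key point information
```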
According to an embodiment of the present disclosure, the target model is generated based on the object model parameters and the pose simulation model. This enables end-to-end three-dimensional model reconstruction while making the reconstructed three-dimensional model both accurate and fast to obtain.
According to the embodiments of the present disclosure, the object model parameters are generated using the global features and local feature groups with background information excluded, so that noise caused by background information in the image to be processed is eliminated and the accuracy of the generated object model parameters is improved. In addition, the overall characteristics of the object to be modeled can be expressed through the global features, and the local characteristics of different parts of the object to be modeled can be expressed through the local feature groups, so that the object model parameters are generated based on both overall and detail features, which improves the feasibility of object reconstruction and the rationality of the reconstructed object model. Furthermore, since the object model parameter sequence is generated based on the global feature sequence and the local feature group sequence, the temporal correlation between the images to be processed in the sequence, such as consecutive video frames, can be fully utilized, so that the target model sequence generated based on the object model parameter sequence has low jitter and high smoothness in time.
Fig. 3 schematically shows a flow diagram of a human reconstruction method according to an embodiment of the present disclosure.
As shown in fig. 3, the image sequence to be processed includes a first image to be processed 311, a second image to be processed 312, and a third image to be processed 313, which are arranged in time order. A first object P310 and a second object P320 to be modeled are included in each of the first image to be processed 311, the second image to be processed 312, and the third image to be processed 313. In the first image to be processed 311, the first object P310 and the second object P320 both face the front, with their legs together and their hands hanging naturally. In the second image to be processed 312, the first object P310 faces the second object P320, with legs together and hands hanging naturally, while the second object P320 faces the front, with legs apart and hands crossed in front of the chest. In the third image to be processed 313, the first object P310 stands with legs apart and both arms bent and pointing to the right, while the second object P320 faces the first object P310, with hands behind the back and legs apart. Feature extraction may be performed on the first image to be processed 311, the second image to be processed 312, and the third image to be processed 313 respectively, resulting in an initial feature map sequence 320. Based on the initial feature map sequence 320, a target feature sequence 330 is obtained. Each target feature in the target feature sequence 330 includes a global feature and a local feature group. Based on the target feature sequence 330, an object model parameter sequence 340 is derived. The object model parameter sequence 340 is input into the SMPL model to obtain a target model sequence. The target model sequence includes a first target model 351 that matches the poses of the two objects in the first image to be processed 311, a second target model 352 that matches the poses of the two objects in the second image to be processed 312, and a third target model 353 that matches the poses of the two objects in the third image to be processed 313.
According to a related example, the image sequence to be processed may be processed with an object reconstruction model, e.g. a codec (Transformer) containing a self-attention mechanism, to obtain a target model sequence. The data of the body surface nodes in each target model in the target model sequence are directly generated by the object reconstruction model.
Compared with directly obtaining the target model sequence from the object reconstruction model, the object reconstruction method provided by the embodiments of the present disclosure combines the pose simulation model to generate the target model sequence based on the object model parameter sequence, which reduces the amount of data to be processed while improving the reconstruction effect of the target model sequence, so that the generated target model sequence has good smoothness and the problem of an uneven skin surface of the target models in the target model sequence is avoided.
According to an embodiment of the present disclosure, for operation S230 as shown in fig. 2, generating an object model parameter sequence for reconstructing an object based on a global feature sequence and a local feature group sequence may include the following operations.
For example, a first potential code set sequence is derived based on the global feature sequence and the local feature set sequence. Based on the first potential code set sequence, a global potential code is generated. A second potential code set sequence is derived based on the global potential code and the first potential code set sequence. An object model parameter sequence is generated based on the second potential encoding set sequence.
Operation S230 as shown in fig. 2 may be performed using the object reconstruction model according to an embodiment of the present disclosure. The object reconstruction model may include a first reconstruction module, a second reconstruction module, and an output layer. For example, the global feature sequence and the local feature set sequence may be input into the first reconstruction module, resulting in a first potential code set sequence. Based on the first potential code set sequence, a global potential code is generated. The global potential code and the first potential code set sequence are input into the second reconstruction module to obtain a second potential code set sequence. The second potential code set sequence is input into the output layer to obtain the object model parameter sequence.
According to an embodiment of the present disclosure, the first reconstruction module may include a visual codec, but is not limited thereto, and may include only the encoder of a visual codec; any network structure that performs feature fusion with a self-attention mechanism may be used.
According to an embodiment of the present disclosure, the second reconstruction module may likewise include a visual codec, but is not limited thereto, and may include only the encoder of a visual codec; any network structure that performs feature fusion with a self-attention mechanism may be used.
According to embodiments of the present disclosure, the output layer may include a fully connected layer and an activation layer.
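As one possible reading of such an output layer, the sketch below maps a second potential code to object model parameters. The feature dimension, the extra linear layer and the split into 72 pose, 10 shape and 3 camera parameters are assumptions borrowed from common SMPL-style regression heads and are not fixed by the present disclosure.

```python
# A sketch of the output layer: fully connected and activation layers mapping a
# second potential code to object model parameters. The feature dimension and the
# 72/10/3 split into pose, shape and camera parameters are assumptions.
import torch

d = 256  # assumed dimension of a second potential code
output_layer = torch.nn.Sequential(
    torch.nn.Linear(d, d),            # fully connected layer
    torch.nn.ReLU(),                  # activation layer
    torch.nn.Linear(d, 72 + 10 + 3),  # regress pose, shape and camera parameters
)

second_code = torch.randn(1, d)
pose, shape, camera = output_layer(second_code).split([72, 10, 3], dim=-1)
```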
According to an alternative embodiment of the present disclosure, the object reconstruction model may include a first visual codec, a second visual codec, and an output layer. The respective encoders of the first visual codec and the second visual codec include a cascade of a self-attention mechanism, a normalization layer, and a feedforward layer. The respective decoders of the first visual codec and the second visual codec include a cascade of a self-attention mechanism, a normalization layer, and a feedforward layer.
According to the embodiment of the disclosure, the global feature sequence and the local feature group sequence are integrated in a multi-round coding manner, so that the feature integration capability is improved.
According to a related example, a global reconstruction method may be used to obtain object model parameters, e.g., a global feature sequence is input into a first visual codec to obtain a first potential code set sequence. The first potential coding group sequence is input into a second visual codec to obtain a second potential coding group sequence. And inputting the second potential coding group sequence to an output layer to obtain an object model parameter sequence. But is not limited thereto. The object model parameters may also be obtained by a local reconstruction method, for example, a plurality of local feature set sequences are input into a first visual codec to obtain a first potential code set sequence. The first potential coding group sequence is input into a second visual codec to obtain a second potential coding group sequence. And inputting the second potential coding group sequence to an output layer to obtain an object model parameter sequence.
Compared with the global reconstruction method and the local reconstruction method, the object reconstruction method provided by the embodiment of the disclosure can utilize the plurality of local features in the local feature group to improve the attention of key local features and highlight the association relationship among the plurality of local features. And combining the plurality of local features with the global features, so that the key semantic information is highlighted while the semantic information contained in the first potential coding group sequence and the second potential coding group sequence which are obtained based on the plurality of local features and the global features is comprehensive. In addition, the global feature sequence and the local feature group sequence are processed simultaneously, so that the object reconstruction model can combine the global features and the local feature groups among different time sequences, and the common features, the different features and the associated features among different images to be processed in the time sequences are highlighted, so that the generated object model parameter sequences have continuity and authenticity among each other.
According to an embodiment of the present disclosure, deriving the first potential code group sequence based on the global feature sequence and the local feature group sequence may include: for each global feature in the global feature sequence, determining, from the local feature group sequence, a target local feature group that matches the global feature, where the global feature and the matching target local feature group are image features of the same image to be processed; obtaining a first fusion feature based on the target local feature group; and obtaining a first potential code group based on a predetermined number of global features and the first fusion feature.
For example, a set of target local features, e.g., M target local features, may be input simultaneously as query features, key features, and value features into an encoder of a first visual codec, resulting in a first fusion feature. A predetermined number of global features, such as N global features, are used as query features, and the first fusion feature is used as both key features and value features, and is input to a decoder of the first visual codec to obtain a first set of potential encodings, such as N first potential encodings. Thereby obtaining a first potential code group sequence corresponding to the global feature sequence.
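A minimal sketch of this query/key/value flow, using PyTorch multi-head attention, is given below. The feature dimension, head count, single-layer structure and the example values of M and N are assumptions for illustration; the visual codec of the embodiment additionally contains normalization and feedforward layers.

```python
# A sketch of the encoder/decoder attention described above. Dimensions, head
# count and the single attention layer are assumptions for illustration.
import torch

d = 256
encoder_attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
decoder_attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

local_group = torch.randn(1, 12, d)  # M = 12 local features of one frame (assumed M)
global_feats = torch.randn(1, 2, d)  # N = 2 global features, one per object (assumed N)

# Encoder: the local features act as query, key and value -> first fusion feature.
first_fusion, _ = encoder_attn(local_group, local_group, local_group)

# Decoder: global features as query, first fusion feature as key and value
# -> a first potential code set with N entries.
first_code_set, _ = decoder_attn(global_feats, first_fusion, first_fusion)
```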
According to an alternative embodiment of the present disclosure, obtaining the first latent code set based on the predetermined number of global features and the first fusion feature may include repeating the following operation until the current round is equal to a first predetermined round threshold, and taking the first latent code set of that round as the final first latent code set: in the case where the current round is determined to be smaller than the first predetermined round threshold, obtaining the first latent code set of the next round based on the first fusion feature and the first latent code set of the current round.
According to the embodiment of the disclosure, through the iterative process, the first potential coding set of the current round is input into the decoder of the first visual coding decoder to obtain the first potential coding set of the next round, so that a process of gradually optimizing the potential coding set is formed, and further the potential coding set is gradually optimized, so that the generation of an accurate and effective first potential coding set sequence is facilitated.
According to an embodiment of the present disclosure, obtaining the first latent code set based on the predetermined number of global features and the first fusion feature may further include: determining the number of objects in each of the plurality of images to be processed in the image sequence to be processed; taking the maximum of these object numbers as the predetermined number; and obtaining the first latent code set of the first round based on the first fusion feature and the predetermined number of global features.
According to the embodiment of the disclosure, the global features of the number of objects are used as query vectors and are simultaneously input into the decoder of the first visual codec, so that a plurality of objects can be simultaneously processed, the accuracy of object model parameters is ensured, and meanwhile, the processing efficiency is improved.
According to an embodiment of the present disclosure, generating a global potential code based on a first potential code group sequence includes: for each first potential code set in the first potential code set sequence, an averaged first potential code is obtained based on the first potential code set. And splicing the averaged first potential coding sequences in one-to-one correspondence with the first potential coding group sequences to obtain global potential codes.
For example, the first potential code set sequence includes a first potential code set S1, a first potential code set S2, and a first potential code set S3. The N first potential codes in the first potential code set S1 are averaged to obtain an averaged first potential code S1'. Similarly, an averaged first potential code S2' corresponding to the first potential code set S2 and an averaged first potential code S3' corresponding to the first potential code set S3 are obtained. The averaged first potential codes, in one-to-one correspondence with the first potential code set sequence, are spliced to obtain the global potential code, i.e. the global potential code is the concatenation of the averaged first potential codes S1', S2' and S3'.
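The sketch below illustrates the averaging-and-splicing step with placeholder tensors. The tensor layout (L first potential code sets, each with N codes of dimension d) is an assumption, and since the text does not fix whether the splice runs along the sequence axis or the feature axis, the averaged codes are kept here as a length-L sequence.

```python
# A sketch of averaging each first potential code set and splicing the results
# into the global potential code; the tensor layout (L sets of N codes of
# dimension d) is an assumption for illustration.
import torch

L, N, d = 3, 2, 256
first_code_sets = torch.randn(L, N, d)               # first potential code set sequence

averaged_codes = first_code_sets.mean(dim=1)          # S1', S2', S3', ...: shape (L, d)
global_potential_code = averaged_codes.unsqueeze(0)   # spliced into one sequence: (1, L, d)
```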
According to the embodiment of the disclosure, the global potential codes are obtained by splicing the averaged first potential code sequences, so that the global potential codes integrate all the features in the image sequence to be processed, the global potential codes fully utilize the relevance among time sequence features in the sequence, and a second potential code group obtained by using the global potential codes is accurate and effective.
According to an embodiment of the present disclosure, deriving a second potential coding group sequence based on the global potential coding and the first potential coding group sequence may include: a second fusion feature is generated based on the global potential encoding. And obtaining a second latent code set based on the second fusion feature and the first latent code set for each first latent code set in the first latent code set sequence.
For example, the global potential code may be input into the encoder of the second visual codec as a query feature, a key feature, and a value feature, simultaneously, resulting in a second fusion feature. The first potential code group, such as N first potential codes, is used as query characteristics, and the second fusion characteristics are used as key characteristics and value characteristics simultaneously, and are input into a decoder of a second visual coding decoder, so that a second potential code group, such as N second potential codes, is obtained.
According to an alternative embodiment of the present disclosure, obtaining the second latent code set based on the second fusion feature and the first latent code set may include repeating the following operation until the current round is equal to a second predetermined round threshold, and taking the second latent code set of that round as the final second latent code set: in the case where the current round is determined to be smaller than the second predetermined round threshold, obtaining the second latent code set of the next round based on the second fusion feature and the second latent code set of the current round.
According to an embodiment of the present disclosure, obtaining the second latent code set based on the second fusion feature and the first latent code set may further include: and obtaining a second latent code set of the first round based on the second fusion characteristic and the first latent code set.
According to the embodiment of the disclosure, through the iterative process, the current round of second potential coding set is input into the decoder of the second visual coding decoder to obtain the next round of second potential coding set, so that a process of gradually optimizing the potential coding set is formed, and the potential coding set is gradually optimized, so that the generation of an accurate and effective second potential coding set sequence is facilitated, and the generation of accurate and effective object model parameters is facilitated.
Fig. 4 schematically illustrates a structural schematic of an object reconstruction model according to an embodiment of the present disclosure.
As shown in fig. 4, the object reconstruction model includes a first visual codec, a second visual codec, and an output layer M450. The first visual codec includes a first encoder M410 and a first decoder M420. The second visual codec includes a second encoder M430 and a second decoder M440.
As shown in fig. 4, the sequence of images to be processed 410 includes L images to be processed. Based on the image sequence to be processed 410, a global feature sequence 420 and a local feature set sequence 430 are obtained. The global feature sequence 420 includes L global features that correspond one-to-one to the L images to be processed. The local feature group sequence 430 includes L local feature groups that correspond one-to-one to the L images to be processed. Each local feature group includes M local features. M is an integer greater than 1. L is an integer greater than 1.
As shown in fig. 4, the local feature group sequence 430 may be input into the first encoder M410 as the query feature Q, the key feature K, and the value feature V simultaneously, resulting in a first fused feature sequence 440. The first fused feature sequence 440 includes L first fused features that are in one-to-one correspondence with the L local feature groups. The number of objects in the image to be processed is determined to be N. N is an integer greater than or equal to 1. Each global feature in the global feature sequence 420 is replicated N times, resulting in a global feature set sequence 421. Each global feature group includes N global features. The global feature group sequence 421 is used as a query feature Q, and the first fusion feature sequence 440 is used as a key feature K and a value feature V to be input into the first decoder M420, so as to obtain a first potential coding group sequence in the first round. The first round first potential code set sequence includes L first round first potential code sets that are in one-to-one correspondence with the first fusion signature sequence 440. Each first-round first potential encoding group comprises N first-round first potential encodings in one-to-one correspondence with the global feature group.
As shown in fig. 4, the first fusion feature sequence 440 and the first-round first potential coding group sequence are input into the first decoder M420, resulting in the second-round first potential coding group sequence. In the case where it is determined that the second round is smaller than a first predetermined round threshold, e.g. a predetermined round threshold I, the second-round first potential coding group sequence is input as the query feature Q together with the first fusion feature sequence 440 into the first decoder M420, resulting in the third-round first potential coding group sequence. I is an integer greater than or equal to 2. And so on: in the case where it is determined that the i-th round is smaller than the predetermined round threshold I, the i-th-round first potential coding group sequence is input as the query feature Q together with the first fusion feature sequence 440 into the first decoder M420, resulting in the (i+1)-th-round first potential coding group sequence. In the case where it is determined that the i-th round is equal to the predetermined round threshold I, the i-th-round first potential coding group sequence is taken as the first potential coding group sequence.
As shown in fig. 4, for each first potential code set in the first potential code set sequence, an averaged first potential code is obtained based on the first potential code set. The averaged first potential coding sequences, which are in one-to-one correspondence with the first potential coding group sequences, are spliced to obtain the global potential coding 450.
As shown in fig. 4, the global potential code 450 may be input to the second encoder M430 as the query feature Q, the key feature K, and the value feature V simultaneously, resulting in a second fusion feature 460. The first potential coding group sequence is used as a query feature Q, and the second fusion feature 460 is used as a key feature K and a value feature V to be input into the second decoder M440, so that the second potential coding group sequence of the first round is obtained.
As shown in fig. 4, the second fusion feature 460 and the first-round second potential encoding group sequence are input into the second decoder M440, resulting in the second-round second potential encoding group sequence. In the case where it is determined that the second round is smaller than a second predetermined round threshold, e.g. the predetermined round threshold I, the second-round second potential encoding group sequence is input as the query feature Q together with the second fusion feature 460 into the second decoder M440, resulting in the third-round second potential encoding group sequence. And so on: in the case where it is determined that the i-th round is smaller than the predetermined round threshold I, the i-th-round second potential encoding group sequence is input as the query feature Q together with the second fusion feature 460 into the second decoder M440, resulting in the (i+1)-th-round second potential encoding group sequence. In the case where it is determined that the i-th round is equal to the predetermined round threshold I, the i-th-round second potential encoding group sequence is taken as the second potential encoding group sequence.
As shown in fig. 4, the second potential encoding group sequence is input into the output layer M450, resulting in the object model parameter sequence 470.
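The iterative refinement used for both visual codecs in fig. 4 can be summarized by the sketch below: the decoder is applied repeatedly, with each round's potential coding group serving as the next round's query, until the predetermined round threshold I is reached. The single-attention-layer decoder, the placeholder shapes and the value I = 3 are assumptions for illustration.

```python
# A sketch of the iterative refinement: the decoder is re-applied round after
# round, each round's potential coding group becoming the next round's query,
# until the predetermined round threshold I is reached.
import torch


def refine(decoder_attn: torch.nn.MultiheadAttention,
           queries: torch.Tensor,
           fusion_feature: torch.Tensor,
           num_rounds: int) -> torch.Tensor:
    codes = queries
    for _ in range(num_rounds):  # rounds 1 .. I
        codes, _ = decoder_attn(codes, fusion_feature, fusion_feature)
    return codes


decoder = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
queries = torch.randn(1, 2, 256)   # e.g. N global features of one frame
fusion = torch.randn(1, 12, 256)   # e.g. the first fusion feature of that frame
first_code_set = refine(decoder, queries, fusion, num_rounds=3)  # I = 3 assumed
```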
According to an embodiment of the present disclosure, the initial feature map includes a background region and a foreground region containing the object.
According to an embodiment of the present disclosure, for operation S220 as shown in fig. 2, generating global features and local feature groups with respect to an object based on an initial feature map may include a feature extraction method as shown in fig. 5.
Fig. 5 schematically shows a flow chart of a feature extraction method according to an embodiment of the disclosure.
As shown in fig. 5, the method includes operations S510 to S540.
In operation S510, feature extraction is performed on an image to be processed, and an initial feature map is obtained.
In operation S520, a mask map is generated based on the initial feature map.
In operation S530, a plurality of target pixels are determined from a plurality of pixels of the mask map based on the pixel values of the plurality of pixels.
In operation S540, a target feature for the object is generated based on the plurality of target pixels and the initial feature map.
According to embodiments of the present disclosure, the target features may include global features and local feature sets.
According to an embodiment of the present disclosure, the image to be processed includes a background region and a foreground region containing the object. The background region in the initial feature map may be removed using a mask map (Mask) by a method such as that shown in fig. 5, so that the target feature is a feature of the foreground region containing the object. The target feature is therefore a feature from which the background region has been removed, with a high proportion of effective features and low noise.
According to an embodiment of the present disclosure, for operation S520 as shown in fig. 5, generating the mask map based on the initial feature map may include: convolving the initial feature map to obtain a convolved feature map; activating the convolved feature map to obtain an activated feature map; and performing a normalization operation on the activated feature map to obtain the mask map.
According to an embodiment of the present disclosure, generating the mask map based on the initial feature map may include: inputting the initial feature map into a feature extraction network to obtain the mask map. The feature extraction network may include a convolution layer, an activation layer, and a normalization layer stacked in sequence. The number of stacked layers of each of the convolution layer, the activation layer, and the normalization layer is not limited. The convolution layer may comprise a convolutional neural network, the activation layer may comprise a linear rectification function (Rectified Linear Unit, ReLU), and the normalization layer (Normalization) performs the normalization operation.
According to the embodiment of the disclosure, the initial feature map may be convolved with a convolution layer to obtain a convolved feature map. And activating the convolved feature map by using an activation layer to obtain an activated feature map. And normalizing the activated feature map by using a normalization layer to obtain a mask map.
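A minimal sketch of this mask-map branch is shown below: a convolution, an activation and a normalization applied to the initial feature map. The channel count of the input feature map and the min-max normalization that maps pixel values into [0, 1] are assumptions; the embodiment does not fix the concrete normalization operation.

```python
# A minimal sketch of the mask-map branch: convolution, activation and
# normalization over the initial feature map. The input channel count and the
# min-max normalization into [0, 1] are assumptions.
import torch


class MaskHead(torch.nn.Module):
    def __init__(self, in_channels: int = 2048):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_channels, 1, kernel_size=1)  # convolution layer
        self.act = torch.nn.ReLU()                                  # activation layer

    def forward(self, feature_maps: torch.Tensor) -> torch.Tensor:
        # feature_maps: (L, N, H, W) initial feature maps.
        x = self.act(self.conv(feature_maps))  # convolved and activated feature map
        # Normalization layer: rescale every map so its pixel values lie in [0, 1].
        x_min = x.amin(dim=(-2, -1), keepdim=True)
        x_max = x.amax(dim=(-2, -1), keepdim=True)
        return (x - x_min) / (x_max - x_min + 1e-6)  # mask maps, shape (L, 1, H, W)
```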
According to the embodiments of the present disclosure, the initial feature map is processed by the feature extraction network provided by the embodiments of the present disclosure to generate the mask map. Since the network structure of the feature extraction network is lightweight, it is simple to implement and efficient in processing.
According to an embodiment of the present disclosure, a mask map includes a plurality of pixels and pixel values corresponding to the plurality of pixels one by one. A plurality of target pixels are determined from the plurality of pixels based on the pixel values of each of the plurality of pixels. The target pixel is used to characterize a pixel of the object.
According to embodiments of the present disclosure, a predetermined pixel threshold may be set. In the case where the pixel value is greater than the predetermined pixel threshold value, a pixel corresponding to the pixel value is taken as the target pixel. In the case where the pixel value is smaller than the predetermined pixel threshold value, a pixel corresponding to the pixel value is regarded as a non-target pixel, for example, a pixel for characterizing a background region.
According to the embodiments of the present disclosure, the mask map is used to distinguish non-target pixels from target pixels, so that the background region information is filtered out of the global feature and the plurality of local features obtained from the target pixels and the initial feature map, and the noise in the global feature and the plurality of local features is accordingly small.
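A thresholding step of this kind can be written, purely as an illustration, as follows; the threshold value of 0.5 is an assumed example and not a value prescribed by the present disclosure.

```python
import torch

def select_target_pixels(mask: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean map that is True at target (foreground) pixels.

    mask: (H, W) mask map with pixel values in [0, 1].
    threshold: the predetermined pixel threshold (assumed example value).
    """
    return mask > threshold
```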
According to embodiments of the present disclosure, determining global features based on a plurality of target pixels and an initial feature map may include the following operations.
For example, based on the initial feature map, feature vectors for each of a plurality of target pixels are determined. A global feature is determined based on a first number of pixels of the plurality of target pixels and a feature vector for each of the plurality of target pixels.
According to an embodiment of the present disclosure, the global feature $F_g$ may be determined by the following formula (1):

$$F_g = \frac{1}{n} \sum_{(i,j) \in \mathrm{Forward}} M_{i,j} \tag{1}$$

where $n$ represents the first number of target pixels in the mask map; $\mathrm{Forward}$ is the global set of target pixels in the mask map, made up of the plurality of target pixels; and $M_{i,j}$ is the feature vector of the target pixel at pixel position $(i, j)$ in the initial feature map.
According to the embodiments of the present disclosure, with the global feature extraction method provided by the embodiments of the present disclosure, the features of the foreground region where the object is located in the initial feature map can be uniformly extracted using the target pixels in the mask map. The global feature extraction principle is intuitive, the calculation is simple, and the processing efficiency is high.
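Under the reading that formula (1) averages the feature vectors of the target pixels, a minimal sketch reusing the boolean map from the previous sketch might look as follows.

```python
import torch

def global_feature(feature_map: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Average the feature vectors of all target pixels, following formula (1).

    feature_map: (C, H, W) initial feature map; M[i, j] is the C-dim vector at (i, j).
    target_mask: (H, W) boolean map of target pixels (the set Forward).
    Returns the (C,) global feature F_g.
    """
    n = target_mask.sum().clamp(min=1)       # first number of target pixels
    selected = feature_map[:, target_mask]   # (C, n) feature vectors of target pixels
    return selected.sum(dim=1) / n
```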
According to embodiments of the present disclosure, determining a plurality of local features based on a target pixel and an initial feature map may include the following operations.
For example, the plurality of target pixels are classified based on their positional relationship with each other, and a plurality of pixel sets are obtained. Each set of pixels includes a plurality of target pixels of the same class. For each of the plurality of pixel sets, determining a local feature corresponding to the pixel set based on the initial feature map, resulting in a local feature set.
According to an embodiment of the present disclosure, classifying the plurality of target pixels based on the positional relationship between them to obtain a plurality of pixel sets may include the following. The plurality of pixel sets may include M pixel sets. M target pixels whose positions are uniformly distributed are selected from the plurality of target pixels as M reference target pixels, and the positions of the M reference target pixels are taken as M reference positions. The M reference target pixels are divided into M categories. The target pixels other than the M reference target pixels may be referred to as to-be-classified target pixels. For each to-be-classified target pixel, the following operations may be performed. For example, for a to-be-classified target pixel S1, whether S1 belongs to the same category as a reference target pixel is determined according to the distance between the position of S1 and each of the M reference positions. In the case where the distance between the position of S1 and the m1-th reference position among the M reference positions is smaller than a predetermined threshold, it is determined that S1 and the reference target pixel m1 belong to the same category, and S1 and the reference target pixel m1 are placed in the same pixel set.
According to an embodiment of the present disclosure, classifying the plurality of target pixels based on the positional relationship between them to obtain a plurality of pixel sets may further include: determining an outline frame containing the plurality of target pixels based on their positional relationship; dividing the outline frame into a plurality of sub-frames; and treating the target pixels within the same sub-frame as one category, which forms one pixel set. In this way, a plurality of pixel sets is obtained.
According to an embodiment of the present disclosure, determining the local feature corresponding to a pixel set based on the initial feature map may include: for each target pixel in the pixel set, determining the feature vector of the target pixel based on the initial feature map; and determining the local feature based on the second number of pixels (the number of target pixels in the pixel set) and the feature vector of each of these target pixels.
According to embodiments of the present disclosure, the local feature (denoted here as $F_k$) may be determined by the following formula (2):

$$F_k = \frac{1}{n_k} \sum_{(i,j) \in f_k} M_{i,j} \tag{2}$$

where $f_k$ represents the pixel set of the $k$-th category among the M categories; $n_k$ represents the second number of pixels in the pixel set of the $k$-th category; and $M_{i,j}$ is the feature vector of the target pixel at pixel position $(i, j)$ in the initial feature map.
According to the embodiments of the present disclosure, with the local feature generation method provided by the embodiments of the present disclosure, a predetermined number of local features of the foreground region can be extracted quickly and accurately, without being limited by the size or shape of the target. The determination principle is intuitive, the implementation is simple, and the feasibility is strong.
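As one possible reading of the sub-frame variant described above, the outline frame of the target pixels can be split into a regular grid of sub-frames, each yielding one local feature via formula (2). The grid size below is an assumption introduced only for illustration.

```python
import torch
from typing import List

def local_features(feature_map: torch.Tensor,
                   target_mask: torch.Tensor,
                   grid: int = 2) -> List[torch.Tensor]:
    """Sketch of the sub-frame variant: split the outline frame of the target
    pixels into grid x grid sub-frames and average features within each one
    (formula (2)). Assumes at least one target pixel; grid size is illustrative.
    """
    ys, xs = torch.nonzero(target_mask, as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1   # outline frame rows
    x0, x1 = xs.min().item(), xs.max().item() + 1   # outline frame columns
    features = []
    for gy in range(grid):
        for gx in range(grid):
            # boundaries of the current sub-frame inside the outline frame
            ya = y0 + (y1 - y0) * gy // grid
            yb = y0 + (y1 - y0) * (gy + 1) // grid
            xa = x0 + (x1 - x0) * gx // grid
            xb = x0 + (x1 - x0) * (gx + 1) // grid
            sub_mask = torch.zeros_like(target_mask)
            sub_mask[ya:yb, xa:xb] = target_mask[ya:yb, xa:xb]
            n_k = sub_mask.sum()                    # second number of pixels n_k
            if n_k == 0:
                continue                            # skip empty sub-frames
            features.append(feature_map[:, sub_mask].sum(dim=1) / n_k)
    return features
```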
Fig. 6 schematically illustrates a flowchart of a training method of an object reconstruction model according to an embodiment of the present disclosure.
As shown in fig. 6, the method includes operations S610 to S640.
In operation S610, feature extraction is performed on a plurality of sample images in a sample image sequence, respectively, to obtain a sample initial feature map sequence. The sample image sequence comprises a sample object to be reconstructed.
In operation S620, for each sample initial feature map in the sample initial feature map sequence, a sample global feature and a sample local feature set for the sample object are generated based on the sample initial feature map, resulting in a sample global feature sequence and a sample local feature set sequence.
In operation S630, a sample object model parameter sequence for reconstructing the sample object is generated based on the sample global feature sequence and the sample local feature group sequence.
In operation S640, the object reconstruction model is trained using the sample object model parameter sequence and the sample object model parameter tag sequence matched to the sample image sequence.
According to the embodiment of the disclosure, the sample object model parameters are generated by using the sample global features and the plurality of sample local features excluding the background information, noise caused by the background information in the sample image is eliminated, and the precision of the generated sample object model parameters is further improved, so that the training precision and efficiency of the object reconstruction model can be improved.
According to an embodiment of the present disclosure, generating a sample object model parameter sequence for reconstructing a sample object based on a sample global feature sequence and a sample local feature set sequence may include: and obtaining a first potential coding group sequence of the sample based on the sample global feature sequence and the sample local feature group sequence. Based on the sample first potential code set sequence, a sample global potential code is generated. And obtaining a sample second potential coding group sequence based on the sample global potential coding and the sample first potential coding group sequence. A sample object model parameter sequence is generated based on the sample second potential code set sequence.
According to an embodiment of the present disclosure, obtaining the sample first potential code group sequence based on the sample global feature sequence and the sample local feature group sequence may include: for each sample global feature in the sample global feature sequence, determining a sample target local feature group matching the sample global feature from the sample local feature group sequence, wherein the sample global feature and the sample target local feature group matching it are image features of the same sample image; obtaining a sample first fusion feature based on the sample target local feature group; and obtaining a sample first potential code group based on a preset number of sample global features and the sample first fusion feature.
According to an embodiment of the present disclosure, obtaining a first set of potential codes for a sample based on a predetermined number of global features of the sample and a first fusion feature of the sample includes: repeating the following operation until the current round is equal to a third preset round threshold value, and taking the first potential coding group of the samples of the current round as the first potential coding group of the samples: and under the condition that the current round is determined to be smaller than a third preset round threshold value, obtaining a first latent code set of samples of the next round based on the first fusion characteristic of the samples and the first latent code set of samples of the current round.
According to an embodiment of the present disclosure, obtaining a first set of potential codes for a sample based on a predetermined number of global features for the sample and a first fusion feature for the sample, further includes: and obtaining a first potential coding group of the first round samples based on the first fusion characteristic of the samples and the global characteristic of the preset number of samples.
According to an embodiment of the present disclosure, generating a sample global potential code based on a sample first potential code set sequence includes: for each sample first potential code set in the sequence of sample first potential code sets, an averaged sample first potential code is obtained based on the sample first potential code set. And splicing the average sample first potential coding sequences corresponding to the sample first potential coding group sequences one by one to obtain a sample global potential code.
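Read literally, this step averages each frame's first potential code group and splices the per-frame averages into one vector. A minimal sketch under that reading, with an assumed layout of one code group per frame:

```python
import torch
from typing import List

def global_potential_code(first_code_groups: List[torch.Tensor]) -> torch.Tensor:
    """Average each frame's first potential code group and splice the per-frame
    averages together, following the description above.

    first_code_groups: one (K, D) tensor of K potential codes per frame.
    Returns a (T * D,) global potential code for a sequence of T frames.
    """
    averaged = [group.mean(dim=0) for group in first_code_groups]  # one (D,) averaged code per frame
    return torch.cat(averaged, dim=0)                              # splice along the feature dimension
```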
According to an embodiment of the present disclosure, obtaining a sample second potential coding group sequence based on a sample global potential coding and a sample first potential coding group sequence includes: a sample second fusion feature is generated based on the sample global potential encoding. And aiming at each sample first potential coding group in the sample first potential coding group sequence, obtaining a sample second potential coding group based on the sample second fusion characteristic and the sample first potential coding group.
According to an embodiment of the present disclosure, obtaining a sample second latent code set based on a sample second fusion feature and the sample first latent code set includes: repeating the following operation until the current round is equal to a fourth preset round threshold value, and taking the second potential coding group of the samples of the current round as a second potential coding group of the samples: and under the condition that the current round is determined to be smaller than a fourth preset round threshold value, obtaining a second latent code set of samples of the next round based on the second fusion characteristic of the samples and the second latent code set of samples of the current round.
According to an embodiment of the disclosure, based on the sample second fusion feature and the sample first latent code set, obtaining the sample second latent code set further includes: and obtaining a first round of sample second latent code set based on the sample second fusion characteristic and the sample first latent code set.
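The round-based refinement in the two preceding paragraphs can be sketched as a simple loop; the refiner module (for instance a cross-attention or MLP block) and the number of rounds are assumptions introduced only for illustration.

```python
import torch
from torch import nn

def refine_second_codes(fusion_feature: torch.Tensor,
                        first_codes: torch.Tensor,
                        refiner: nn.Module,
                        rounds: int = 3) -> torch.Tensor:
    """Sketch of the round-based refinement: starting from the first potential
    code group, each round combines the second fusion feature with the current
    codes to produce the next round's codes, until the round counter reaches
    the preset threshold. `refiner` and `rounds` are illustrative assumptions.
    """
    codes = refiner(first_codes, fusion_feature)   # first-round second potential code group
    for _ in range(rounds - 1):                    # remaining rounds up to the preset threshold
        codes = refiner(codes, fusion_feature)
    return codes
```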
According to an embodiment of the present disclosure, each sample object model parameter in the sequence of sample object model parameters comprises a sample pose parameter and a sample shape parameter. The training method of the object reconstruction model may further include: for each sample object model parameter in the sample object model parameter sequence, generating a sample target model based on the sample pose parameter and the sample shape parameter, and determining sample three-dimensional key point information based on the sample target model.
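For illustration, one way to realize this step is to feed the pose and shape parameters into a differentiable parametric model (for example an SMPL-style body model) and read the 3D joints off the resulting sample target model; the interface of `parametric_model` below is an assumption, not an API defined by the present disclosure.

```python
import torch

def keypoints_from_params(pose: torch.Tensor,
                          shape: torch.Tensor,
                          parametric_model) -> torch.Tensor:
    """Sketch of deriving 3D key points from one set of sample object model
    parameters. `parametric_model` stands for any differentiable parametric
    mesh model mapping pose and shape parameters to a mesh and its 3D joints;
    its exact interface is an assumption made here for illustration.
    """
    _vertices, joints3d = parametric_model(pose, shape)  # sample target model
    return joints3d                                      # sample 3D key point information
```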
According to an embodiment of the present disclosure, training the object reconstruction model using the sample object model parameter sequence and the sample object model parameter tag sequence matched with the sample image sequence comprises: generating a first loss value based on the sample object model parameter sequence and the sample object model parameter tag sequence matched with the sample image sequence; generating a second loss value based on the sample three-dimensional key point information sequence matched with the sample image sequence and the key point labels matched with the sample image sequence; and training the object reconstruction model based on the first loss value and the second loss value. However, the present disclosure is not limited thereto. Alternatively, the first loss value may be generated based on the sample object model parameter sequence and the sample object model parameter tag sequence matched with the sample image sequence, and the object reconstruction model may be trained based on the first loss value alone.
According to the embodiment of the disclosure, compared with a method for training the object reconstruction model based on the first loss value, training the object reconstruction model based on the first loss value and the second loss value can enable the accuracy of the trained object reconstruction model to be high and improve the robustness.
According to an embodiment of the present disclosure, the first loss value $L_1$ may be determined by a formula of the following form (formula (3)):

$$L_1 = \sum_{l=1}^{L}\left(\left\|\theta_l-\theta_l'\right\| + \left\|\beta_l-\beta_l'\right\|\right) \tag{3}$$

where $\theta_l$ represents the $l$-th sample pose parameter in the sample object model parameter sequence; $\theta_l'$ represents the $l$-th pose parameter tag in the model parameter tag sequence; $\beta_l$ represents the $l$-th sample shape parameter in the sample object model parameter sequence; $\beta_l'$ represents the $l$-th shape parameter label in the model parameter label sequence; and $L$ is the number of sample object model parameters in the sequence of sample object model parameters.
According to an embodiment of the present disclosure, the second loss value $L_2$ may be determined by a formula of the following form (formula (4)):

$$L_2 = \sum_{l=1}^{L}\left\|J_l-J_l'\right\| \tag{4}$$

where $J_l$ represents the $l$-th piece of sample three-dimensional key point information in the sample three-dimensional key point information sequence, and $J_l'$ represents the $l$-th key point label in the sequence of key point labels.
According to an embodiment of the present disclosure, training the object reconstruction model based on the first loss value and the second loss value may include: performing a weighted summation of the first loss value and the second loss value to obtain a target loss value, and adjusting parameters of the object reconstruction model based on the target loss value until a preset training condition is reached. The preset training condition may include: a predetermined number of training rounds is reached, or the target loss value converges.
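A sketch of the weighted combination of the two losses follows; the use of mean squared error as the distance in formulas (3) and (4) and the default weights are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def target_loss(pred_pose, gt_pose, pred_shape, gt_shape,
                pred_joints, gt_joints,
                w1: float = 1.0, w2: float = 1.0) -> torch.Tensor:
    """Weighted sum of the parameter loss (formula (3)) and the 3D key point
    loss (formula (4)). MSE and the default weights are assumptions."""
    loss1 = F.mse_loss(pred_pose, gt_pose) + F.mse_loss(pred_shape, gt_shape)  # first loss value
    loss2 = F.mse_loss(pred_joints, gt_joints)                                  # second loss value
    return w1 * loss1 + w2 * loss2                                              # target loss value
```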
According to an embodiment of the present disclosure, generating a sample global feature and a plurality of sample local features with respect to a sample object based on a sample initial feature map includes: based on the sample initial feature map, a sample mask map is generated. The sample mask map comprises a plurality of sample pixel points and pixel values corresponding to the plurality of sample pixel points one by one. A plurality of sample target pixels are determined from a plurality of sample pixels based on the sample mask map, wherein the sample target pixels are used to characterize pixels of a sample object. Based on the plurality of target pixel points and the sample initial feature map, a sample global feature and a plurality of sample local features are determined.
According to an embodiment of the present disclosure, determining a sample global feature based on a plurality of target sample pixels and a sample initial feature map includes: based on the sample initial feature map, a sample feature vector of each of a plurality of sample target pixel points is determined. And determining the sample global feature based on the number of the sample first pixel points of the plurality of sample target pixel points and the sample feature vectors of the plurality of sample target pixel points.
According to an embodiment of the present disclosure, determining a plurality of sample local features based on a sample target pixel point and a sample initial feature map includes: classifying the plurality of sample target pixels based on the position relationship among the plurality of sample target pixels to obtain a plurality of sample pixel sets. Each sample pixel set includes a plurality of sample target pixels of the same class. For each sample pixel point set in the plurality of sample pixel point sets, determining sample local features corresponding to the sample pixel point sets based on the sample initial feature map, and obtaining a plurality of sample local features.
According to an embodiment of the present disclosure, determining a sample local feature corresponding to a sample set of pixels based on a sample initial feature map includes: for each sample target pixel point in the sample pixel point set, determining a sample feature vector of the sample target pixel point based on the sample initial feature map; and
determining the sample local feature based on the number of sample second pixel points of the plurality of sample target pixel points in the sample pixel point set and the sample feature vectors of the plurality of sample target pixel points.
According to an embodiment of the present disclosure, generating a sample mask map based on a sample initial feature map includes: and convolving the initial characteristic diagram of the sample to obtain a characteristic diagram after the sample is convolved. And activating the characteristic diagram after sample convolution to obtain the characteristic diagram after sample activation. And carrying out normalization operation on the characteristic map after the sample activation to obtain a sample mask map.
It should be noted that, in the embodiments of the present disclosure, term pairs such as the image to be processed and the sample image, the initial feature map and the sample initial feature map, the global feature and the sample global feature, and the local feature and the sample local feature are given different names only for ease of understanding; there is no difference in their operations or properties. For details of the training method of the object reconstruction model, reference may be made to the description of the object reconstruction method, which is not repeated here.
Fig. 7 schematically shows a block diagram of an object reconstruction apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the object reconstruction apparatus 700 includes: the feature extraction module 710, the first generation module 720, the second generation module 730, and the reconstruction module 740.
The feature extraction module 710 is configured to perform feature extraction on a plurality of images to be processed in the image sequence to obtain an initial feature map sequence. The sequence of images to be processed comprises the object to be reconstructed.
A first generating module 720, configured to generate, for each initial feature map in the initial feature map sequence, a global feature and a local feature set related to the object based on the initial feature map, to obtain a global feature sequence and a local feature set sequence.
A second generating module 730, configured to generate an object model parameter sequence for reconstructing an object based on the global feature sequence and the local feature set sequence.
The reconstruction module 740 is configured to reconstruct the object based on the object model parameter sequence to obtain a target model sequence.
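Purely as a structural illustration of fig. 7, the four modules can be composed as follows; the concrete sub-networks passed in are placeholders, and their interfaces are assumptions.

```python
from torch import nn

class ObjectReconstructionApparatus(nn.Module):
    """Structural sketch of the apparatus 700: feature extraction -> first
    generation -> second generation -> reconstruction. The sub-networks are
    placeholders (assumptions), not implementations from the disclosure."""

    def __init__(self, feature_extractor, first_generator, second_generator, reconstructor):
        super().__init__()
        self.feature_extraction_module = feature_extractor  # module 710
        self.first_generation_module = first_generator      # module 720
        self.second_generation_module = second_generator    # module 730
        self.reconstruction_module = reconstructor           # module 740

    def forward(self, image_sequence):
        feature_maps = self.feature_extraction_module(image_sequence)
        global_feats, local_feat_groups = self.first_generation_module(feature_maps)
        model_params = self.second_generation_module(global_feats, local_feat_groups)
        return self.reconstruction_module(model_params)
```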
According to an embodiment of the present disclosure, the second generating module includes: the device comprises a first coding submodule, a second coding submodule, a third coding submodule and a first generating submodule.
The first coding submodule is used for obtaining a first potential coding group sequence based on the global feature sequence and the local feature group sequence.
A second encoding submodule for generating a global potential encoding based on the first potential encoding group sequence.
And a third coding submodule, configured to obtain a second potential coding set sequence based on the global potential coding and the first potential coding set sequence.
A first generation sub-module for generating a sequence of object model parameters based on the second sequence of potential encoding sets.
According to an embodiment of the present disclosure, a first encoding submodule includes: the first determining unit, the first encoding unit and the second encoding unit.
A first determination unit, configured to determine, for each global feature in the sequence of global features, a target local feature group matching the global feature from the sequence of local feature groups. The global feature and the target local feature group matched with the global feature are image features of the same image to be processed.
the first coding unit is used for obtaining a first fusion feature based on the target local feature group.
And the second coding unit is used for obtaining a first potential coding group based on a preset number of global features and the first fusion features.
According to an embodiment of the present disclosure, the second encoding unit is configured to:
repeating the following operation until the current round is equal to a first preset round threshold value, and taking the first potential coding group of the current round as a first potential coding group:
and under the condition that the current round is determined to be smaller than a first preset round threshold value, obtaining a first latent code set of the next round based on the first fusion characteristic and the first latent code set of the current round.
According to an embodiment of the present disclosure, the first encoding unit is further configured to:
and obtaining a first latent code set of the first round based on the first fusion feature and a preset number of global features.
According to an embodiment of the present disclosure, the second encoding submodule includes: a third encoding unit and a fourth encoding unit.
And a third encoding unit, configured to obtain, for each first potential encoding group in the first potential encoding group sequence, an averaged first potential encoding based on the first potential encoding group.
And the fourth coding unit is used for splicing the averaged first potential coding sequences which are in one-to-one correspondence with the first potential coding group sequences to obtain global potential codes.
According to an embodiment of the present disclosure, the third encoding submodule includes: a fifth encoding unit and a sixth encoding unit.
And a fifth encoding unit, configured to generate a second fusion feature based on the global potential encoding.
And a sixth encoding unit, configured to obtain, for each first latent encoding group in the first latent encoding group sequence, a second latent encoding group based on the second fusion feature and the first latent encoding group.
According to an embodiment of the present disclosure, the sixth encoding unit is configured to:
repeating the following operation until the current round is equal to a second preset round threshold value, and taking the second potential coding group of the current round as a second potential coding group:
and under the condition that the current round is determined to be smaller than a second preset round threshold value, obtaining a second latent code set of the next round based on the second fusion characteristic and the second latent code set of the current round.
According to an embodiment of the present disclosure, the sixth encoding unit is further configured to:
and obtaining a second latent code set of the first round based on the second fusion characteristic and the first latent code set.
According to an embodiment of the present disclosure, each object model parameter in the sequence of object model parameters comprises a pose parameter and a shape parameter.
According to an embodiment of the present disclosure, the initial feature map includes a background region and a foreground region containing the object.
According to an embodiment of the present disclosure, the first generation module includes: the system comprises a second generation sub-module, a first determination sub-module and a second determination sub-module.
And the second generation sub-module is used for generating a mask map based on the initial feature map. The mask map comprises a plurality of pixel points and pixel values corresponding to the pixel points one by one.
The first determining submodule is used for determining a plurality of target pixel points from the plurality of pixel points based on the pixel values of the plurality of pixel points, wherein the target pixel points are used for representing the pixel points of the object.
And the second determining submodule is used for determining the global feature and the local feature group based on the target pixel points and the initial feature map.
According to an embodiment of the present disclosure, the second determination submodule includes: a second determination unit and a third determination unit.
And a second determining unit for determining the feature vector of each of the plurality of target pixel points based on the initial feature map.
And a third determining unit, configured to determine the global feature based on the first pixel number of the plurality of target pixels and the feature vectors of the plurality of target pixels.
According to an embodiment of the present disclosure, the second determination submodule includes: a classification unit and a fourth determination unit.
The classification unit is used for classifying the plurality of target pixel points based on the position relation among the plurality of target pixel points to obtain a plurality of pixel point sets. Each pixel set includes a plurality of target pixels of the same class.
And a fourth determining unit, configured to determine, for each of the plurality of pixel point sets, a local feature corresponding to the pixel point set based on the initial feature map, and obtain a local feature group.
According to an embodiment of the present disclosure, the fourth determination unit includes: a first determination subunit and a second determination subunit.
The first determining subunit is configured to determine, for each target pixel in the set of pixels, a feature vector of the target pixel based on the initial feature map.
And the second determining subunit is used for determining the local feature based on the second pixel point quantity of the plurality of target pixel points in the pixel point set and the feature vector of each of the plurality of target pixel points.
According to an embodiment of the present disclosure, the second generation submodule includes: a convolution unit, an activation unit and a normalization unit.
And the convolution unit is used for convoluting the initial feature map to obtain a convolved feature map.
And the activating unit is used for activating the characteristic map after convolution to obtain the characteristic map after activation.
And the normalization unit is used for performing normalization operation on the activated feature map to obtain a mask map.
Fig. 8 schematically shows a block diagram of a training apparatus of an object reconstruction model according to an embodiment of the present disclosure.
As shown in fig. 8, a training apparatus 800 for reconstructing a model of an object includes: sample feature extraction module 810, sample first generation module 820, sample second generation module 830, and training module 840.
The sample feature extraction module 810 is configured to perform feature extraction on a plurality of sample images in the sample image sequence, so as to obtain a sample initial feature map sequence. The sample image sequence comprises a sample object to be reconstructed.
The sample first generating module 820 is configured to generate, for each sample initial feature map in the sample initial feature map sequence, a sample global feature and a sample local feature set related to the sample object based on the sample initial feature map, and obtain a sample global feature sequence and a sample local feature set sequence.
A sample second generation module 830 is configured to generate a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and the sample local feature set sequence.
The training module 840 is configured to train the object reconstruction model by using the sample object model parameter sequence and the sample object model parameter tag sequence matched with the sample image sequence.
According to an embodiment of the present disclosure, the sample second generation module includes: a sample first encoding submodule, a sample second encoding submodule, a sample third encoding submodule, and a sample first generating submodule.
And the sample first coding submodule is used for obtaining a sample first potential coding group sequence based on the sample global feature sequence and the sample local feature group sequence.
And the sample second coding submodule is used for generating a sample global potential code based on the sample first potential code group sequence.
And the sample third coding submodule is used for obtaining a sample second potential coding group sequence based on the sample global potential coding and the sample first potential coding group sequence.
And the sample first generation submodule is used for generating a sample object model parameter sequence based on the sample second potential coding group sequence.
According to an embodiment of the present disclosure, a sample first encoding submodule includes: sample first determining unit, sample first encoding unit and sample second encoding unit.
And the sample first determining unit is used for determining a sample target local feature group matched with the sample global feature from the sample local feature group sequence aiming at each sample global feature in the sample global feature sequence. The sample global feature and the sample target local feature group matched with the sample global feature are image features of the same sample image.
And the sample first coding unit is used for obtaining a sample first fusion characteristic based on the sample target local characteristic group.
And the sample second coding unit is used for obtaining a sample first potential coding group based on the global features of a preset number of samples and the first fusion features of the samples.
According to an embodiment of the present disclosure, the sample second encoding unit is configured to:
repeating the following operation until the current round is equal to a third preset round threshold value, and taking the first potential coding group of the samples of the current round as the first potential coding group of the samples:
and under the condition that the current round is determined to be smaller than a third preset round threshold value, obtaining a first latent code set of samples of the next round based on the first fusion characteristic of the samples and the first latent code set of samples of the current round.
According to an embodiment of the present disclosure, the sample second encoding unit is further configured to:
and obtaining a first potential coding group of the first round samples based on the first fusion characteristic of the samples and the global characteristic of the preset number of samples.
According to an embodiment of the present disclosure, the sample second encoding submodule includes: a sample third encoding unit and a sample fourth encoding unit.
And the sample third coding unit is used for obtaining an average sample first potential code based on the sample first potential code group for each sample first potential code group in the sample first potential code group sequence.
And the sample fourth coding unit is used for splicing the average sample first potential coding sequences corresponding to the sample first potential coding group sequences one by one to obtain a sample global potential code.
According to an embodiment of the present disclosure, the sample third encoding submodule includes: a sample fifth encoding unit and a sample sixth encoding unit.
And the sample fifth coding unit is used for generating a sample second fusion characteristic based on the sample global potential coding.
And the sample sixth coding unit is used for obtaining a sample second potential coding set based on the sample second fusion characteristic and the sample first potential coding set aiming at each sample first potential coding set in the sample first potential coding set sequence.
According to an embodiment of the present disclosure, the sample sixth encoding unit is configured to:
repeating the following operation until the current round is equal to a fourth preset round threshold value, and taking the second potential coding group of the samples of the current round as a second potential coding group of the samples:
and under the condition that the current round is determined to be smaller than a fourth preset round threshold value, obtaining a second latent code set of samples of the next round based on the second fusion characteristic of the samples and the second latent code set of samples of the current round.
According to an embodiment of the present disclosure, the sample sixth encoding unit is further configured to:
and obtaining a first round of sample second latent code set based on the sample second fusion characteristic and the sample first latent code set.
According to an embodiment of the present disclosure, each sample object model parameter in the sequence of sample object model parameters comprises a sample pose parameter and a sample shape parameter.
According to an embodiment of the present disclosure, the training apparatus of the object reconstruction model further includes: and a sample third generation module and a sample determination module.
And a sample third generation module, configured to generate, for each sample object model parameter in the sample object model parameter sequence, a sample target model based on the sample pose parameter and the sample shape parameter.
And the sample determining module is used for determining sample three-dimensional key point information based on the sample target model.
According to an embodiment of the present disclosure, a training module includes: the first loss determination sub-module, the second loss determination sub-module, and the training sub-module.
A first loss determination submodule for generating a first loss value based on the sample object model parameter sequence and the sample object model parameter tag sequence matched with the sample image sequence.
And a second loss determination sub-module for generating a second loss value based on the sample three-dimensional keypoint information sequence matching the sample image sequence and the keypoint label matching the sample image sequence.
And the training sub-module is used for training the object reconstruction model based on the first loss value and the second loss value.
According to an embodiment of the present disclosure, a sample first generation module includes: the system comprises a sample second generation sub-module, a sample first determination sub-module and a sample second determination sub-module.
And the sample second generation submodule is used for generating a sample mask graph based on the sample initial feature graph. The sample mask map comprises a plurality of sample pixel points and pixel values corresponding to the plurality of sample pixel points one by one.
And the sample first determining submodule is used for determining a plurality of sample target pixel points from a plurality of sample pixel points based on the sample mask map. The sample target pixel is used to characterize the pixel of the sample object.
And the sample second determining submodule is used for determining a sample global characteristic and a plurality of sample local characteristics based on the plurality of target pixel points and the sample initial characteristic diagram.
According to an embodiment of the present disclosure, the sample second determination submodule includes: and a sample second determination unit and a sample third determination unit.
And the sample second determining unit is used for determining the sample feature vector of each of the plurality of sample target pixel points based on the sample initial feature map.
And the sample third determining unit is used for determining the global sample characteristic based on the number of the first sample pixels of the plurality of sample target pixels and the sample characteristic vectors of the plurality of sample target pixels.
According to an embodiment of the present disclosure, the sample second determination submodule includes: and a sample classification unit and a sample fourth determination unit.
The sample classification unit is used for classifying the plurality of sample target pixel points based on the position relation among the plurality of sample target pixel points to obtain a plurality of sample pixel point sets. Each sample pixel set includes a plurality of sample target pixels of the same class.
The fourth sample determining unit is configured to determine, for each of the plurality of sample pixel point sets, a sample local feature corresponding to the sample pixel point set based on the sample initial feature map, and obtain a plurality of sample local features.
According to an embodiment of the present disclosure, the sample fourth determination unit includes: a sample first determination subunit and a sample second determination subunit.
A sample first determining subunit, configured to determine, for each sample target pixel in the set of sample pixels, a sample feature vector of the sample target pixel based on the sample initial feature map.
And the sample second determining subunit is used for determining the local sample characteristics based on the number of sample second pixel points of the plurality of sample target pixel points in the sample pixel point set and the sample characteristic vectors of the plurality of sample target pixel points.
According to an embodiment of the present disclosure, the sample second generation submodule includes: a sample convolution unit, a sample activation unit and a sample normalization unit.
The sample convolution unit is used for convolving the initial characteristic diagram of the sample to obtain a characteristic diagram after sample convolution.
And the sample activating unit is used for activating the characteristic diagram after sample convolution to obtain the characteristic diagram after sample activation.
And the sample normalization unit is used for performing normalization operation on the characteristic map after the sample activation to obtain a sample mask map.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as an embodiment of the present disclosure.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, the object reconstruction method. For example, in some embodiments, the object reconstruction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the object reconstruction method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the object reconstruction method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (63)

1. An object reconstruction method, comprising:
respectively extracting features of a plurality of images to be processed in an image sequence to be processed to obtain an initial feature image sequence, wherein the image sequence to be processed comprises an object to be reconstructed;
generating global features and local feature groups related to the object based on the initial feature graphs aiming at each initial feature graph in the initial feature graph sequence to obtain a global feature sequence and a local feature group sequence;
Generating an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature set sequence; and
reconstructing the object based on the object model parameter sequence to obtain a target model sequence.
2. The method of claim 1, wherein the generating an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature set sequence comprises:
obtaining a first potential coding group sequence based on the global feature sequence and the local feature group sequence;
generating a global potential code based on the first potential code set sequence;
obtaining a second potential coding group sequence based on the global potential coding and the first potential coding group sequence; and
the object model parameter sequence is generated based on the second potential encoding set sequence.
3. The method of claim 2, wherein the deriving a first potential code set sequence based on the global feature sequence and the local feature set sequence comprises:
for each global feature in the sequence of global features,
determining a target local feature group matched with the global feature from the local feature group sequence, wherein the global feature and the target local feature group matched with the global feature are image features of the same image to be processed;
Obtaining a first fusion feature based on the target local feature group; and
obtaining a first latent code set based on a preset number of the global features and the first fusion features.
4. A method according to claim 3, wherein said deriving a first set of latent codes based on a predetermined number of said global features and said first fusion features comprises:
repeating the following operation until the current round is equal to a first preset round threshold value, and taking a first potential coding group of the current round as the first potential coding group:
and under the condition that the current round is determined to be smaller than the first preset round threshold value, obtaining a next round of first latent code set based on the first fusion characteristic and the current round first latent code set.
5. The method of claim 4, wherein the deriving a first set of potential codes based on a predetermined number of the global features and the first fusion features further comprises:
and obtaining a first latent code set of the first round based on the first fusion feature and a preset number of global features.
6. The method of claim 2, wherein the generating a global potential code based on the first potential code set sequence comprises:
For each first potential code set in the first potential code set sequence, obtaining an average first potential code based on the first potential code set; and
splicing the averaged first potential coding sequences in one-to-one correspondence with the first potential coding group sequences to obtain the global potential coding.
7. The method of claim 2, wherein the deriving a second potential coding group sequence based on the global potential coding and the first potential coding group sequence comprises:
generating a second fusion feature based on the global potential code; and
for each first potential code set in the first potential code set sequence, deriving the second potential code set based on the second fusion feature and the first potential code set.
8. The method of claim 7, wherein the deriving the second set of latent codes based on the second fusion feature and the first set of latent codes comprises:
repeating the following operation until the current round is equal to a second preset round threshold value, and taking a second potential coding group of the current round as the second potential coding group:
and under the condition that the current round is determined to be smaller than the second preset round threshold value, obtaining a second latent code set of the next round based on the second fusion characteristic and the second latent code set of the current round.
9. The method of claim 8, wherein the deriving the second set of latent codes based on the second fusion feature and the first set of latent codes further comprises:
and obtaining a first round of second latent code set based on the second fusion characteristic and the first latent code set.
10. The method according to any one of claims 1 to 9, wherein the initial feature map comprises a background region and a foreground region containing the object,
the generating global features and local feature sets for the object based on the initial feature map includes:
generating a mask map based on the initial feature map, wherein the mask map comprises a plurality of pixel points and pixel values corresponding to the pixel points one by one;
determining a plurality of target pixel points from the plurality of pixel points based on the pixel values of the plurality of pixel points, wherein the target pixel points are used for representing the pixel points of the object; and
the global feature and the local feature set are determined based on a plurality of the target pixel points and the initial feature map.
11. The method of claim 10, wherein the determining the global feature based on the plurality of target pixel points and the initial feature map comprises:
Determining the feature vector of each of the plurality of target pixel points based on the initial feature map; and
determining the global feature based on the first pixel point quantity of the target pixel points and the feature vectors of the target pixel points.
12. The method of claim 10 or 11, wherein the determining the local feature set based on the target pixel point and the initial feature map comprises:
classifying a plurality of target pixel points based on the position relation among the target pixel points to obtain a plurality of pixel point sets, wherein each pixel point set comprises a plurality of target pixel points in the same category; and
determining local features corresponding to the pixel point sets based on the initial feature map for each pixel point set in the pixel point sets to obtain the local feature set.
13. The method of claim 12, wherein the determining local features corresponding to the set of pixels based on the initial feature map comprises:
for each target pixel point in the pixel point set, determining a feature vector of the target pixel point based on the initial feature map; and
and determining the local feature based on the second pixel point number of the plurality of target pixel points in the pixel point set and the feature vectors of the plurality of target pixel points.
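For illustration, claims 12 and 13 together could look like the following sketch, where k-means over pixel coordinates stands in for the unspecified positional grouping rule; the number of sets and the averaging rule are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def compute_local_feature_set(feature_map: np.ndarray, target_pixels: np.ndarray, num_sets: int = 4):
        # Group the target pixel points by their positions; k-means is only one possible
        # positional-relationship criterion (connected components would also fit the claim).
        labels = KMeans(n_clusters=num_sets, n_init=10).fit_predict(target_pixels.astype(float))
        local_features = []
        for k in range(num_sets):
            pixels = target_pixels[labels == k]
            vectors = feature_map[pixels[:, 0], pixels[:, 1]]          # feature vectors of this pixel point set
            local_features.append(vectors.sum(axis=0) / len(pixels))   # mean over the set's pixel point number
        return local_features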
14. The method of claim 10, wherein the generating a mask map based on the initial feature map comprises:
convolving the initial feature map to obtain a convolved feature map;
activating the convolved feature map to obtain an activated feature map; and
and performing a normalization operation on the activated feature map to obtain the mask map.
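The three steps of claim 14 (convolution, activation, normalization) map naturally onto a small module such as the sketch below; the 1x1 kernel, ReLU activation, and sigmoid normalization are assumptions, since the claim leaves these choices open.

    import torch
    import torch.nn as nn

    class MaskHead(nn.Module):
        # conv -> activation -> normalization, mirroring the three steps of claim 14
        def __init__(self, in_channels: int):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)  # convolve the initial feature map
            self.act = nn.ReLU()                                  # activate the convolved feature map
            self.norm = nn.Sigmoid()                              # normalize pixel values into [0, 1]

        def forward(self, initial_feature_map: torch.Tensor) -> torch.Tensor:
            return self.norm(self.act(self.conv(initial_feature_map)))  # (B, 1, H, W) mask map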
15. A method of training an object reconstruction model, comprising:
respectively performing feature extraction on a plurality of sample images in a sample image sequence to obtain a sample initial feature map sequence, wherein the sample image sequence comprises a sample object to be reconstructed;
generating a sample global feature and a sample local feature group about the sample object based on the sample initial feature map for each sample initial feature map in the sample initial feature map sequence to obtain a sample global feature sequence and a sample local feature group sequence;
generating a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and the sample local feature set sequence; and
and training the object reconstruction model using the sample object model parameter sequence and a sample object model parameter label sequence matched with the sample image sequence.
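For illustration, a single training step consistent with claim 15 might look like the sketch below; the wrapper `model`, the MSE objective, and the optimizer usage are assumptions, not the patent's prescribed training procedure.

    import torch
    import torch.nn.functional as F

    def training_step(model, sample_images, parameter_labels, optimizer):
        # `model` is assumed to wrap the feature-extraction and parameter-generation steps
        # of claim 15 and to output a sample object model parameter sequence.
        predicted = model(sample_images)                   # (T, P) predicted parameter sequence
        loss = F.mse_loss(predicted, parameter_labels)     # compare with the label sequence
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()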
16. The method of claim 15, wherein the generating a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and sample local feature set sequence comprises:
obtaining a sample first potential coding group sequence based on the sample global feature sequence and the sample local feature group sequence;
generating a sample global potential code based on the sample first potential code group sequence;
obtaining a sample second potential coding group sequence based on the sample global potential coding and the sample first potential coding group sequence; and
the sample object model parameter sequence is generated based on the sample second potential coding group sequence.
17. The method of claim 16, wherein the obtaining a sample first potential coding group sequence based on the sample global feature sequence and the sample local feature group sequence comprises:
for each sample global feature in the sequence of sample global features,
determining a sample target local feature group matched with the sample global feature from the sample local feature group sequence, wherein the sample global feature and the sample target local feature group matched with the sample global feature are image features of the same sample image;
obtaining a sample first fusion feature based on the sample target local feature group; and
and obtaining a sample first potential coding group based on a preset number of the sample global features and the sample first fusion feature.
18. The method of claim 17, wherein the obtaining a sample first potential coding group based on a preset number of the sample global features and the sample first fusion feature comprises:
repeating the following operation until the current round is equal to a third preset round threshold value, and taking the sample first potential coding group of the current round as the sample first potential coding group:
and under the condition that the current round is determined to be smaller than the third preset round threshold value, obtaining a sample first potential coding group of the next round based on the sample first fusion feature and the sample first potential coding group of the current round.
19. The method of claim 18, wherein the obtaining a sample first potential coding group based on a preset number of the sample global features and the sample first fusion feature further comprises:
and obtaining a first-round sample first potential coding group based on the sample first fusion feature and a preset number of the sample global features.
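As one possible reading of the first-round step in claim 19, the sketch below replicates the global feature a preset number of times and pairs each copy with the fusion feature; concatenation is an assumed combination rule, and all names are hypothetical.

    import torch

    def init_sample_first_code_group(sample_global_feature: torch.Tensor,
                                     sample_first_fusion_feature: torch.Tensor,
                                     preset_number: int) -> torch.Tensor:
        seeds = sample_global_feature.unsqueeze(0).repeat(preset_number, 1)         # (preset_number, dim)
        fusion = sample_first_fusion_feature.unsqueeze(0).repeat(preset_number, 1)  # broadcast fusion feature
        return torch.cat([seeds, fusion], dim=-1)  # first-round sample first potential coding group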
20. The method of claim 16, wherein the generating a sample global potential code based on the sample first potential coding group sequence comprises:
for each sample first potential coding group in the sample first potential coding group sequence, obtaining an averaged sample first potential code based on the sample first potential coding group; and
and splicing the averaged sample first potential codes, which correspond one to one to the sample first potential coding groups in the sample first potential coding group sequence, to obtain the sample global potential code.
21. The method of claim 16, wherein the deriving a sample second potential coding set sequence based on the sample global potential coding and the sample first potential coding set sequence comprises:
generating a sample second fusion feature based on the sample global potential code; and
and for each sample first potential coding group in the sample first potential coding group sequence, obtaining the sample second potential coding group based on the sample second fusion feature and the sample first potential coding group.
22. The method of claim 21, wherein the obtaining the sample second potential coding group based on the sample second fusion feature and the sample first potential coding group comprises:
repeating the following operation until the current round is equal to a fourth preset round threshold value, and taking the sample second potential coding group of the current round as the sample second potential coding group:
and under the condition that the current round is determined to be smaller than the fourth preset round threshold value, obtaining a sample second potential coding group of the next round based on the sample second fusion feature and the sample second potential coding group of the current round.
23. The method of claim 22, wherein the obtaining the sample second potential coding group based on the sample second fusion feature and the sample first potential coding group further comprises:
and obtaining a first-round sample second potential coding group based on the sample second fusion feature and the sample first potential coding group.
24. The method of any of claims 15 to 23, wherein each sample object model parameter in the sequence of sample object model parameters comprises a sample pose parameter and a sample posture parameter, the method further comprising:
for each sample object model parameter in the sequence of sample object model parameters,
generating a sample target model based on the sample pose parameter and the sample posture parameter; and
and determining sample three-dimensional key point information based on the sample target model.
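For illustration, the target-model and key-point steps of claim 24 could be sketched as below, assuming a parametric mesh model (for example an SMPL-style body model) and a vertex-to-keypoint regressor; both `parametric_model` and `keypoint_regressor` are hypothetical and not specified by the patent.

    import torch

    def keypoints_from_parameters(sample_pose_parameter: torch.Tensor,
                                  sample_posture_parameter: torch.Tensor,
                                  parametric_model,
                                  keypoint_regressor: torch.Tensor) -> torch.Tensor:
        # parametric_model: callable returning (V, 3) mesh vertices (the sample target model);
        # keypoint_regressor: (K, V) weights mapping vertices to key points.
        vertices = parametric_model(sample_pose_parameter, sample_posture_parameter)
        return keypoint_regressor @ vertices  # (K, 3) sample three-dimensional key point information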
25. The method of claim 24, wherein the training the object reconstruction model using the sample object model parameter sequence and a sample object model parameter tag sequence that matches the sample image sequence comprises:
generating a first loss value based on the sample object model parameter sequence and a sample object model parameter tag sequence matched with the sample image sequence;
generating a second loss value based on a sample three-dimensional key point information sequence matched with the sample image sequence and a key point label matched with the sample image sequence; and
the object reconstruction model is trained based on the first loss value and the second loss value.
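A minimal sketch of the two-part objective in claim 25 is given below; the choice of MSE and L1 losses and the weighting factor are assumptions, since the claim only requires a first loss on the parameter sequence and a second loss on the 3D key points.

    import torch.nn.functional as F

    def combined_loss(predicted_parameters, parameter_labels,
                      predicted_keypoints, keypoint_labels, keypoint_weight=1.0):
        first_loss = F.mse_loss(predicted_parameters, parameter_labels)   # loss on the parameter sequence
        second_loss = F.l1_loss(predicted_keypoints, keypoint_labels)     # loss on the 3D key points
        return first_loss + keypoint_weight * second_loss                 # joint objective for training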
26. The method of any of claims 15 to 25, wherein generating a sample global feature and a plurality of sample local features for the sample object based on the sample initial feature map comprises:
generating a sample mask map based on the sample initial feature map, wherein the sample mask map comprises a plurality of sample pixel points and pixel values corresponding to the plurality of sample pixel points one by one;
determining a plurality of sample target pixel points from the plurality of sample pixel points based on the sample mask map, wherein the sample target pixel points are pixel points representing the sample object; and
the sample global feature and the plurality of sample local features are determined based on the plurality of sample target pixel points and the sample initial feature map.
27. The method of claim 26, wherein the determining the sample global feature based on the plurality of sample target pixel points and the sample initial feature map comprises:
determining respective sample feature vectors of a plurality of sample target pixel points based on the sample initial feature map; and
and determining the sample global feature based on the sample first pixel point number of the plurality of sample target pixel points and the sample feature vectors of the plurality of sample target pixel points.
28. The method of claim 26 or 27, wherein the determining the plurality of sample local features based on the sample target pixel points and the sample initial feature map comprises:
classifying a plurality of sample target pixel points based on the position relation among the plurality of sample target pixel points to obtain a plurality of sample pixel point sets, wherein each sample pixel point set comprises a plurality of sample target pixel points in the same category; and
and determining sample local features corresponding to the sample pixel point sets based on the sample initial feature map for each sample pixel point set in the plurality of sample pixel point sets to obtain the plurality of sample local features.
29. The method of claim 28, wherein the determining, based on the sample initial feature map, a sample local feature corresponding to the set of sample pixels comprises:
for each sample target pixel point in the sample pixel point set, determining a sample feature vector of the sample target pixel point based on the sample initial feature map; and
and determining the sample local feature based on the sample second pixel point number of the plurality of sample target pixel points in the sample pixel point set and the sample feature vector of each of the plurality of sample target pixel points.
30. The method of claim 26, wherein the generating a sample mask map based on the sample initial feature map comprises:
convolving the initial characteristic diagram of the sample to obtain a characteristic diagram after sample convolution;
activating the characteristic map after sample convolution to obtain a characteristic map after sample activation; and
and carrying out normalization operation on the characteristic map after the sample activation to obtain the sample mask map.
31. An object reconstruction apparatus, comprising:
the device comprises a feature extraction module, a feature extraction module and a reconstruction module, wherein the feature extraction module is used for respectively extracting features of a plurality of images to be processed in an image sequence to be processed to obtain an initial feature image sequence, and the image sequence to be processed comprises an object to be reconstructed;
the first generation module is used for generating global features and local feature groups of the object according to the initial feature images in the initial feature image sequence to obtain a global feature sequence and a local feature group sequence;
a second generation module, configured to generate an object model parameter sequence for reconstructing the object based on the global feature sequence and the local feature set sequence; and
and the reconstruction module is used for reconstructing the object based on the object model parameter sequence to obtain a target model sequence.
32. The apparatus of claim 31, wherein the second generation module comprises:
the first coding submodule is used for obtaining a first potential coding group sequence based on the global feature sequence and the local feature group sequence;
a second encoding submodule for generating a global potential encoding based on the first potential encoding group sequence;
a third coding sub-module, configured to obtain a second potential coding group sequence based on the global potential coding and the first potential coding group sequence; and
a first generation sub-module for generating the object model parameter sequence based on the second potential encoding group sequence.
33. The apparatus of claim 32, wherein the first encoding submodule comprises:
a first determination unit for, for each global feature in the sequence of global features,
determining a target local feature group matched with the global feature from the local feature group sequence, wherein the global feature and the target local feature group matched with the global feature are image features of the same image to be processed;
the first coding unit is used for obtaining a first fusion characteristic based on the target local characteristic group; and
and the second coding unit is used for obtaining a first potential coding group based on a preset number of the global features and the first fusion feature.
34. The apparatus of claim 33, wherein the second encoding unit is configured to:
repeating the following operation until the current round is equal to a first preset round threshold value, and taking a first potential coding group of the current round as the first potential coding group:
and under the condition that the current round is determined to be smaller than the first preset round threshold value, obtaining a first potential coding group of the next round based on the first fusion feature and the first potential coding group of the current round.
35. The apparatus of claim 34, wherein the second encoding unit is further configured to:
and obtaining a first-round first potential coding group based on the first fusion feature and a preset number of the global features.
36. The apparatus of claim 32, wherein the second encoding submodule comprises:
a third encoding unit, configured to obtain, for each first potential encoding group in the first potential encoding group sequence, an averaged first potential encoding based on the first potential encoding group; and
and the fourth encoding unit is used for splicing the averaged first potential encodings, which correspond one to one to the first potential encoding groups in the first potential encoding group sequence, to obtain the global potential encoding.
37. The apparatus of claim 32, wherein the third encoding submodule comprises:
a fifth encoding unit, configured to generate a second fusion feature based on the global potential encoding; and
a sixth encoding unit configured to obtain, for each first potential encoding group in the first potential encoding group sequence, the second potential encoding group based on the second fusion feature and the first potential encoding group.
38. The apparatus of claim 37, wherein the sixth encoding unit is configured to:
repeating the following operation until the current round is equal to a second preset round threshold value, and taking a second potential coding group of the current round as the second potential coding group:
and under the condition that the current round is determined to be smaller than the second preset round threshold value, obtaining a second potential coding group of the next round based on the second fusion feature and the second potential coding group of the current round.
39. The apparatus of claim 38, wherein the sixth encoding unit is further configured to:
and obtaining a first-round second potential coding group based on the second fusion feature and the first potential coding group.
40. The apparatus of any one of claims 31 to 39, wherein the initial feature map comprises a background region and a foreground region containing the object,
the first generation module includes:
the second generation submodule is used for generating a mask map based on the initial feature map, wherein the mask map comprises a plurality of pixel points and pixel values corresponding to the pixel points one by one;
a first determining sub-module, configured to determine a plurality of target pixel points from the plurality of pixel points based on pixel values of each of the plurality of pixel points, where the target pixel points are used to characterize a pixel point of the object; and
and the second determining submodule is used for determining the global feature and the local feature group based on a plurality of target pixel points and the initial feature map.
41. The apparatus of claim 40, wherein the second determination submodule includes:
a second determining unit, configured to determine feature vectors of each of the plurality of target pixel points based on the initial feature map; and
and the third determining unit is used for determining the global feature based on the first pixel point quantity of the target pixel points and the feature vectors of the target pixel points.
42. The apparatus of claim 40 or 41, wherein the second determination submodule comprises:
the classification unit is used for classifying the plurality of target pixel points based on the position relation among the plurality of target pixel points to obtain a plurality of pixel point sets, wherein each pixel point set comprises a plurality of target pixel points in the same category; and
and a fourth determining unit, configured to determine, for each of the plurality of pixel point sets, a local feature corresponding to the pixel point set based on the initial feature map, and obtain the local feature set.
43. The apparatus of claim 42, wherein the fourth determining unit comprises:
a first determining subunit, configured to determine, for each target pixel in the set of pixels, a feature vector of the target pixel based on the initial feature map; and
and the second determining subunit is used for determining the local feature based on the second pixel point quantity of the plurality of target pixel points in the pixel point set and the feature vector of each of the plurality of target pixel points.
44. The apparatus of claim 40, wherein the second generation submodule includes:
the convolution unit is used for convolving the initial feature map to obtain a convolved feature map;
the activating unit is used for activating the characteristic map after convolution to obtain an activated characteristic map; and
and the normalization unit is used for performing normalization operation on the activated feature map to obtain the mask map.
45. A training apparatus for an object reconstruction model, comprising:
a sample feature extraction module, configured to respectively perform feature extraction on a plurality of sample images in a sample image sequence to obtain a sample initial feature map sequence, wherein the sample image sequence comprises a sample object to be reconstructed;
a sample first generating module, configured to generate, for each sample initial feature map in the sample initial feature map sequence, a sample global feature and a sample local feature set related to the sample object based on the sample initial feature map, to obtain a sample global feature sequence and a sample local feature set sequence;
a sample second generation module, configured to generate a sample object model parameter sequence for reconstructing the sample object based on the sample global feature sequence and the sample local feature set sequence; and
and the training module is used for training the object reconstruction model by using the sample object model parameter sequence and the sample object model parameter label sequence matched with the sample image sequence.
46. The apparatus of claim 45, wherein the sample second generation module comprises:
a sample first coding submodule, configured to obtain a sample first potential coding group sequence based on the sample global feature sequence and the sample local feature group sequence;
a sample second encoding submodule for generating a sample global potential code based on the sample first potential code group sequence;
a sample third coding submodule, configured to obtain a sample second potential coding group sequence based on the sample global potential coding and the sample first potential coding group sequence; and
a sample first generation sub-module for generating the sample object model parameter sequence based on the sample second potential code set sequence.
47. The apparatus of claim 46, wherein the sample first encoding submodule comprises:
a sample first determination unit for, for each sample global feature in the sequence of sample global features,
determining a sample target local feature group matched with the sample global feature from the sample local feature group sequence, wherein the sample global feature and the sample target local feature group matched with the sample global feature are image features of the same sample image;
the sample first coding unit is used for obtaining a sample first fusion feature based on the sample target local feature group; and
and the sample second coding unit is used for obtaining a sample first potential coding group based on a preset number of the sample global features and the sample first fusion feature.
48. The apparatus of claim 47, wherein the sample second encoding unit is configured to:
repeating the following operation until the current round is equal to a third preset round threshold value, and taking the sample first potential coding group of the current round as the sample first potential coding group:
and under the condition that the current round is determined to be smaller than the third preset round threshold value, obtaining a sample first potential coding group of the next round based on the sample first fusion feature and the sample first potential coding group of the current round.
49. The apparatus of claim 48, wherein the sample second encoding unit is further configured to:
and obtaining a first-round sample first potential coding group based on the sample first fusion feature and a preset number of the sample global features.
50. The apparatus of claim 46, wherein the sample second encoding submodule comprises:
a sample third coding unit, configured to obtain, for each sample first potential coding group in the sample first potential coding group sequence, an averaged sample first potential code based on the sample first potential coding group; and
and the sample fourth coding unit is used for splicing the averaged sample first potential codes, which correspond one to one to the sample first potential coding groups in the sample first potential coding group sequence, to obtain the sample global potential coding.
51. The apparatus of claim 46, wherein the sample third encoding submodule comprises:
a sample fifth encoding unit, configured to generate a sample second fusion feature based on the sample global potential encoding; and
and a sample sixth coding unit, configured to obtain, for each sample first potential coding group in the sample first potential coding group sequence, the sample second potential coding group based on the sample second fusion feature and the sample first potential coding group.
52. The apparatus of claim 51, wherein the sample sixth encoding unit is configured to:
repeating the following operation until the current round is equal to a fourth preset round threshold value, and taking the sample second potential coding group of the current round as the sample second potential coding group:
and under the condition that the current round is determined to be smaller than the fourth preset round threshold value, obtaining a sample second potential coding group of the next round based on the sample second fusion feature and the sample second potential coding group of the current round.
53. The apparatus of claim 52, wherein the sample sixth encoding unit is further configured to:
and obtaining a first-round sample second potential coding group based on the sample second fusion feature and the sample first potential coding group.
54. The apparatus of any one of claims 45 to 53, wherein each sample object model parameter in the sequence of sample object model parameters comprises a sample pose parameter and a sample posture parameter, the training apparatus of the object reconstruction model further comprising:
a sample third generation module, configured to generate a sample target model based on the sample pose parameter and the sample posture parameter for each sample object model parameter in the sample object model parameter sequence; and
and the sample determining module is used for determining sample three-dimensional key point information based on the sample target model.
55. The apparatus of claim 54, wherein the training module comprises:
a first loss determination submodule for generating a first loss value based on the sample object model parameter sequence and a sample object model parameter tag sequence matched with the sample image sequence;
a second loss determination submodule for generating a second loss value based on a sample three-dimensional key point information sequence matched with the sample image sequence and a key point label matched with the sample image sequence; and
a training sub-module for training the object reconstruction model based on the first loss value and the second loss value.
56. The apparatus of any one of claims 45 to 55, wherein the sample first generation module comprises:
a sample second generation sub-module, configured to generate a sample mask map based on the sample initial feature map, where the sample mask map includes a plurality of sample pixel points and pixel values corresponding to the plurality of sample pixel points one-to-one;
a sample first determining submodule, configured to determine a plurality of sample target pixels from the plurality of sample pixels based on the sample mask map, where the sample target pixels are used to characterize pixels of the sample object; and
and a sample second determining sub-module, configured to determine the sample global feature and the plurality of sample local features based on the plurality of sample target pixel points and the sample initial feature map.
57. The apparatus of claim 56, wherein the sample second determination submodule includes:
a sample second determining unit, configured to determine, based on the sample initial feature map, a sample feature vector of each of a plurality of sample target pixel points; and
and a sample third determining unit, configured to determine the sample global feature based on the sample first pixel point number of the plurality of sample target pixel points and the sample feature vectors of the plurality of sample target pixel points.
58. The apparatus of claim 56 or 57, wherein the sample second determination submodule comprises:
the sample classification unit is used for classifying the plurality of sample target pixel points based on the position relation among the plurality of sample target pixel points to obtain a plurality of sample pixel point sets, wherein each sample pixel point set comprises a plurality of sample target pixel points of the same category; and
a fourth sample determining unit, configured to determine, for each sample pixel point set in the plurality of sample pixel point sets, a sample local feature corresponding to the sample pixel point set based on the sample initial feature map, and obtain the plurality of sample local features.
59. The apparatus of claim 58, wherein the sample fourth determination unit comprises:
a sample first determining subunit, configured to determine, for each sample target pixel in the sample pixel set, a sample feature vector of the sample target pixel based on the sample initial feature map; and
and the sample second determining subunit is used for determining the sample local feature based on the sample second pixel point number of the plurality of sample target pixel points in the sample pixel point set and the sample feature vector of each of the plurality of sample target pixel points.
60. The apparatus of claim 56, wherein the sample second generation submodule includes:
the sample convolution unit is used for convolving the sample initial feature map to obtain a feature map after sample convolution;
the sample activating unit is used for activating the characteristic diagram after sample convolution to obtain a characteristic diagram after sample activation; and
and the sample normalization unit is used for performing normalization operation on the characteristic map after the sample activation to obtain the sample mask map.
61. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 30.
62. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 30.
63. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 30.
CN202310431145.6A 2023-04-20 2023-04-20 Object reconstruction method, object reconstruction model training method, device and equipment Active CN116452741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310431145.6A CN116452741B (en) 2023-04-20 2023-04-20 Object reconstruction method, object reconstruction model training method, device and equipment

Publications (2)

Publication Number Publication Date
CN116452741A true CN116452741A (en) 2023-07-18
CN116452741B CN116452741B (en) 2024-03-01

Family

ID=87128439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310431145.6A Active CN116452741B (en) 2023-04-20 2023-04-20 Object reconstruction method, object reconstruction model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN116452741B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268583A (en) * 2014-09-16 2015-01-07 上海交通大学 Pedestrian re-recognition method and system based on color area features
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN112861776A (en) * 2021-03-05 2021-05-28 罗普特科技集团股份有限公司 Human body posture analysis method and system based on dense key points
CN112950470A (en) * 2021-02-26 2021-06-11 南开大学 Video super-resolution reconstruction method and system based on time domain feature fusion
CN113989593A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Image processing method, search method, training method, device, equipment and medium
CN114529456A (en) * 2022-02-21 2022-05-24 深圳大学 Super-resolution processing method, device, equipment and medium for video
WO2022121031A1 (en) * 2020-12-10 2022-06-16 广州广电运通金融电子股份有限公司 Finger vein image restoration method based on partial convolution and mask updating
CN114724144A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Text recognition method, model training method, device, equipment and medium
CN114880509A (en) * 2022-02-25 2022-08-09 北京百度网讯科技有限公司 Model training method, searching method, device, electronic equipment and storage medium
CN115035566A (en) * 2022-05-07 2022-09-09 北京大学深圳医院 Expression recognition method and device, computer equipment and computer-readable storage medium
CN115294349A (en) * 2022-06-29 2022-11-04 北京百度网讯科技有限公司 Method and device for training model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116452741B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN111277912B (en) Image processing method and device and electronic equipment
CN113420719B (en) Method and device for generating motion capture data, electronic equipment and storage medium
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN113591918B (en) Training method of image processing model, image processing method, device and equipment
CN107291845A (en) A kind of film based on trailer recommends method and system
CN115359383B (en) Cross-modal feature extraction and retrieval and model training method, device and medium
CN111680550A (en) Emotion information identification method and device, storage medium and computer equipment
CN116452741B (en) Object reconstruction method, object reconstruction model training method, device and equipment
CN116229095A (en) Model training method, visual task processing method, device and equipment
CN115147547B (en) Human body reconstruction method and device
CN114926322B (en) Image generation method, device, electronic equipment and storage medium
CN116167426A (en) Training method of face key point positioning model and face key point positioning method
CN113610856B (en) Method and device for training image segmentation model and image segmentation
US11961249B2 (en) Generating stereo-based dense depth images
Shao et al. Joint facial action unit recognition and self-supervised optical flow estimation
CN116363316A (en) Object reconstruction method, object reconstruction model training method, device and equipment
Niu et al. Multi-view 3D Smooth Human Pose Estimation based on Heatmap Filtering and Spatio-temporal Information
CN111783609A (en) Pedestrian re-identification method, device, equipment and computer readable storage medium
CN116385643B (en) Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment
CN116612495B (en) Image processing method and device
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN113378773B (en) Gesture recognition method, gesture recognition device, gesture recognition apparatus, gesture recognition storage medium, and gesture recognition program product
CN116453220B (en) Target object posture determining method, training device and electronic equipment
CN115830640B (en) Human body posture recognition and model training method, device, equipment and medium
CN116091857B (en) Training method of image processing model, image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant