CN117237542B - Three-dimensional human body model generation method and device based on text - Google Patents

Three-dimensional human body model generation method and device based on text

Info

Publication number
CN117237542B
CN117237542B (application CN202311492951.0A)
Authority
CN
China
Prior art keywords
image
text
dimensional
model
human body
Prior art date
Legal status
Active
Application number
CN202311492951.0A
Other languages
Chinese (zh)
Other versions
CN117237542A (en)
Inventor
张小梅
董文恺
雷震
Current Assignee
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202311492951.0A
Publication of CN117237542A
Application granted
Publication of CN117237542B
Legal status: Active
Anticipated expiration

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention provides a text-based three-dimensional human body model generation method and device, applied in the technical field of computer vision. The method includes the following steps: acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object; generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.

Description

Three-dimensional human body model generation method and device based on text
Technical Field
The invention relates to the technical field of computer vision, and in particular to a text-based three-dimensional human body model generation method and device.
Background
In general, three-dimensional human body models are needed as auxiliary tools in many fields, such as virtual reality, film special effects, game development, medical simulation, human motion analysis, and education and training.
In the prior art, a three-dimensional human body model is usually created by manual modeling or animation tools. However, these methods can only be carried out by professionals in the relevant field, so not only is the application threshold high, but the model creation efficiency is also low.
Disclosure of Invention
The invention provides a text-based three-dimensional human body model generation method and device, which are used to solve the problems in the prior art that creating a three-dimensional human body model has a high application threshold and low model creation efficiency.
The invention provides a text-based three-dimensional human body model generation method, which includes: acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object; generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
According to the text-based three-dimensional human body model generation method provided by the invention, generating the first image of the target object according to the first text includes: generating a third image of the target object according to the first text; identifying the face region of the target object in the third image, and refining the face region with a face refinement model to obtain the first image; wherein the refinement is used to enhance details and personalized features of the face region.
According to the text-based three-dimensional human body model generation method provided by the invention, after identifying the face region of the target object in the third image, the method further includes: cropping the face region of the target object from the third image to obtain a fourth image; generating text for the fourth image with an image captioning method to obtain a second text, and determining a first sub-text describing the face region from the first text; and updating model parameters of the face refinement model based on the second text and the first sub-text.
According to the text-based three-dimensional human body model generation method provided by the invention, generating the second image of the three-dimensional human body model includes: initializing the parameterized three-dimensional human body model to obtain a three-dimensional human body; rendering the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; inputting the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weighting the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used to extract geometric features of the three-dimensional human body model.
According to the text-based three-dimensional human body model generation method provided by the invention, updating the model parameters of the three-dimensional human body model based on the first image and the second image to obtain the three-dimensional model of the target object includes: calculating a first loss function based on probability density distillation from the first image and the second image; and updating the model parameters of the three-dimensional human body model according to the first loss function to obtain the three-dimensional model of the target object.
The invention also provides a text-based three-dimensional human body model generation device, which includes an acquisition module and a processing module. The acquisition module can be used to acquire a first text, wherein the first text is used for describing the appearance characteristics of a target object. The processing module can be used to generate a first image of the target object according to the first text; initialize a parameterized three-dimensional human body model and generate a second image of the three-dimensional human body model; and update model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
According to the text-based three-dimensional human body model generation device provided by the invention, the processing module can be used to generate a third image of the target object according to the first text; identify the face region of the target object in the third image, and refine the face region with a face refinement model to obtain the first image; wherein the refinement is used to enhance details and personalized features of the face region.
According to the text-based three-dimensional human body model generation device provided by the invention, the processing module can be used to crop the face region of the target object from the third image to obtain a fourth image; generate text for the fourth image with an image captioning method to obtain a second text, and determine a first sub-text describing the face region from the first text; and update model parameters of the face refinement model based on the second text and the first sub-text.
According to the text-based three-dimensional human body model generation device provided by the invention, the processing module can be used to initialize the parameterized three-dimensional human body model to obtain a three-dimensional human body; render the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; input the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weight the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used to extract geometric features of the three-dimensional human body model.
According to the text-based three-dimensional human body model generation device provided by the invention, the processing module can be used to calculate a first loss function based on probability density distillation from the first image and the second image, and update the model parameters of the three-dimensional human body model according to the first loss function to obtain the three-dimensional model of the target object.
The invention also provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the text-based three-dimensional human body model generation methods described above.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the text-based three-dimensional human body model generation methods described above.
With the text-based three-dimensional human body model generation method and device of the invention, a first text describing the appearance characteristics of a target object can be acquired; a first image of the target object is generated according to the first text; a parameterized three-dimensional human body model is initialized and a second image of the three-dimensional human body model is generated; and model parameters of the three-dimensional human body model are updated based on the first image and the second image to obtain a three-dimensional model of the target object, wherein the second image is a two-dimensional image of the three-dimensional human body model. With this scheme, on the one hand, the three-dimensional model of the target object can be generated from the first text, that is, a three-dimensional model can be created through a text description alone, so the application threshold for creating three-dimensional models is lowered and non-professionals can also create three-dimensional models on their own; on the other hand, in generating the three-dimensional model from the parameterized three-dimensional human body model, no three-dimensional human body annotations and no large amount of training data are required, so computation is saved and the creation efficiency of the three-dimensional model is improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first schematic flowchart of the text-based three-dimensional human body model generation method provided by the present invention;
FIG. 2 is a second schematic flowchart of the text-based three-dimensional human body model generation method provided by the present invention;
FIG. 3 is a schematic structural diagram of the text-based three-dimensional human body model generation device provided by the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present invention is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
In order to clearly describe the technical solution of the embodiment of the present invention, in the embodiment of the present invention, the words "first", "second", etc. are used to distinguish identical items or similar items having substantially the same function and effect, and those skilled in the art will understand that the words "first", "second", etc. are not limited in number and execution order.
Some exemplary embodiments of the invention have been described for illustrative purposes; it should be understood that the invention may be practiced otherwise than as specifically described herein.
The technical solution of the invention is described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in FIG. 1, an embodiment of the present invention provides a text-based three-dimensional human body model generation method that can be applied to a text-based three-dimensional human body model generating device. The text-based three-dimensional human body model generation method may include S101-S104:
S101, the three-dimensional human body model generating device acquires a first text.
Wherein the first text is used for describing the appearance characteristics of the target object. The appearance characteristics may include clothing features, body-shape features, state features, and the like. The target object may be a human subject.
For example, if a user wants to generate a three-dimensional human body model, the user may describe the appearance of the desired human body model in text and then input the description as the first text to the three-dimensional human body model generating device. For example, the first text T may be "a girl with double eyelids, wearing a hat and a down jacket".
S102, the three-dimensional human body model generating device generates a first image of the target object according to the first text.
Optionally, the three-dimensional human body model generating device may generate a third image of the target object from the first text; identify the face region of the target object in the third image, and refine the face region with a face refinement model to obtain the first image; wherein the refinement is used to enhance details and personalized features of the face region.
Specifically, as shown in FIG. 2, the three-dimensional human body model generating device may input the first text $T$ into a two-dimensional diffusion model $G_{2D}$ to obtain a third image of the target object; the two-dimensional diffusion model is a pre-trained model capable of generating a two-dimensional image from a text prompt. The three-dimensional human body model generating device can then identify the face region of the target object in the third image through a face refinement model $F$, and refine the face region with $F$ to obtain the first image. The third image may be expressed as $I_{3} = G_{2D}(T)$, and the face region of the target object in the first image may be expressed as $I_{\mathrm{face}} = F(I_{3})$.
Optionally, the network of the two-dimensional diffusion model may be a diffusion network, for example a pre-trained Imagen model.
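Imagen itself is not publicly released, so, as a non-limiting sketch of this text-to-image step only, an open-source latent diffusion pipeline (Stable Diffusion via the diffusers library) can stand in for the two-dimensional diffusion model; the checkpoint name and the prompt below are illustrative assumptions, not the configuration of the invention.

```python
# Illustrative sketch only: an open-source Stable Diffusion pipeline stands in for the
# pre-trained two-dimensional diffusion model; the checkpoint and prompt are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # stand-in checkpoint, not the patent's model
    torch_dtype=torch.float16,
).to("cuda")

first_text = "a girl with double eyelids, wearing a hat and a down jacket"
third_image = pipe(first_text).images[0]   # the "third image" generated from the first text
third_image.save("third_image.png")
```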
Optionally, the face refinement model includes a face detection algorithm. The face detection algorithm can perform attribute recognition on the face region of the generated image to determine whether the features described by the first text have been generated, and further optimize facial details and personalized features.
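A minimal sketch of the face-detection and cropping step follows, assuming an OpenCV Haar-cascade detector as a stand-in; the text does not name a specific detector, and the input file name continues the previous sketch.

```python
# Minimal face-detection and cropping sketch; the Haar cascade is an assumed stand-in,
# since the text does not name a specific detector.
import cv2

def crop_face_region(image_path: str):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None                      # no face found in the generated image
    x, y, w, h = boxes[0]                # take the first detection
    return image[y:y + h, x:x + w]       # cropped face region

face_region = crop_face_region("third_image.png")
```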
Based on this scheme, since the third image of the target object can be generated from the first text, a bridge from text toward the three-dimensional model can be constructed, alleviating the scarcity of available data for three-dimensional scenes; and since the face region can be refined with the face refinement model to obtain the first image, facial details and personalized features can be optimized, thereby improving the recognizability of the three-dimensional model.
Optionally, after identifying the face region of the target object in the third image, the three-dimensional human body model generating device may crop the face region of the target object from the third image to obtain a fourth image; generate text for the fourth image with an image captioning method to obtain a second text, and determine a first sub-text describing the face region from the first text; and update model parameters of the face refinement model based on the second text and the first sub-text.
Specifically, with continued reference to FIG. 2, the three-dimensional human body model generating device may further optimize and train the face refinement model after recognizing the face region of the target object in the third image. The device crops the face region of the target object from the third image to obtain a fourth image, then uses an image captioning method $C$ to describe the image features of the fourth image in text, obtaining a second text $T_{2}$, and determines a first sub-text $T_{\mathrm{face}}$ describing the face region from the first text. Finally, a cross-entropy loss is calculated based on the second text $T_{2}$ and the first sub-text $T_{\mathrm{face}}$, and the face refinement model is trained with this loss. The second text can be expressed as $T_{2} = C(I_{4})$, where $I_{4}$ is the fourth image, and the cross-entropy loss can be expressed as $\mathcal{L}_{\mathrm{CE}} = -\sum_{i} p_{T_{\mathrm{face}}}(i)\,\log p_{T_{2}}(i)$, where the sum runs over the token vocabulary.
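As an illustrative sketch only, the cross-entropy between the second text and the first sub-text can be computed at the token level; treating the first sub-text's token ids as targets for the captioner's per-token logits is an assumption, since the exact formulation of the loss is not spelled out here.

```python
# Sketch of a token-level cross-entropy between the caption of the fourth image
# (second text) and the first sub-text describing the face. Using the sub-text's
# token ids as targets for the captioner's logits is an assumption.
import torch
import torch.nn.functional as F

def caption_cross_entropy(caption_logits: torch.Tensor,
                          subtext_token_ids: torch.Tensor) -> torch.Tensor:
    """
    caption_logits:    (seq_len, vocab_size) per-token logits from the image
                       captioning model applied to the cropped face (fourth image).
    subtext_token_ids: (seq_len,) token ids of the first sub-text.
    """
    return F.cross_entropy(caption_logits, subtext_token_ids)

# Backpropagating this loss through the captioning/refinement pipeline updates the
# face refinement model's parameters, as described above.
```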
Based on this scheme, the model parameters of the face refinement model can be updated based on the second text and the first sub-text, so that both pre-training and optimization of the face refinement model can be realized.
S103, the three-dimensional human body model generating device performs initialization processing on the parameterized three-dimensional human body model and generates a second image of the three-dimensional human body model.
The parameterized three-dimensional human body model refers to a basic three-dimensional human body model constructed from parameters; for example, if a user wants to generate a three-dimensional human figure, the parameterized three-dimensional human body model may be a human body model constructed from a set of parameters. The second image is a two-dimensional image of the three-dimensional human body model.
Optionally, the three-dimensional human body model may be an SMPL-X parametric model, which is a parameterized three-dimensional human body model that includes the body, hands, face, and other parts.
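SMPL-X is distributed as an open-source Python package (smplx); a minimal sketch of instantiating and initializing the parameterized model follows, where the model folder path and the neutral-gender setting are assumptions.

```python
# Minimal SMPL-X instantiation sketch using the open-source `smplx` package; the
# model folder path and the neutral-gender choice are assumptions.
import smplx

body_model = smplx.create(
    "models/",              # folder with downloaded SMPL-X model files (assumed path)
    model_type="smplx",
    gender="neutral",
)

# With default (zero) shape, pose and expression parameters, the forward pass yields
# the initialized three-dimensional human body.
output = body_model(return_verts=True)
vertices = output.vertices  # (1, 10475, 3) mesh vertices of the initialized body
faces = body_model.faces
```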
Optionally, the three-dimensional human body model generating device may initialize the parameterized three-dimensional human body model to obtain a three-dimensional human body; render the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; input the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weight the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used to extract geometric features of the three-dimensional human body model.
Specifically, with continued reference to FIG. 2, the three-dimensional human body model generating device may initialize the SMPL-X parameterized model to obtain a three-dimensional human body, and then render it from different viewing angles $v$ to obtain fifth images $I_{5}^{(v)}$, which are two-dimensional images. Meanwhile, the three-dimensional human body model generating device can also input the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body, and then weight the fifth images and the normal map to obtain the second image $I_{2}$.
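A minimal sketch of forming the second image from a rendered view and the normal map follows; the renderer and normal-map extractor are reduced to placeholders, and the element-wise weighted sum with weight `lam` is an assumption, since the text only states that the fifth image and the normal map are weighted without giving the exact combination.

```python
# Sketch of combining a rendered view (fifth image) with the normal map to form the
# second image. The renderer and normal-map extractor are placeholders, and the
# element-wise weighted sum with weight `lam` is an assumption.
import torch

H, W = 256, 256

def render_view(vertices, faces, view_angle):
    """Placeholder for the differentiable renderer producing an RGB view."""
    return torch.rand(H, W, 3)   # stand-in output for illustration only

def extract_normal_map(vertices, faces, view_angle):
    """Placeholder for the normal map extractor capturing geometric features."""
    return torch.rand(H, W, 3)   # stand-in output for illustration only

def second_image_at(vertices, faces, view_angle, lam: float = 0.5):
    fifth_image = render_view(vertices, faces, view_angle)
    normal_map = extract_normal_map(vertices, faces, view_angle)
    return lam * fifth_image + (1.0 - lam) * normal_map   # assumed weighting scheme

# One second image per sampled viewing angle:
# second_images = [second_image_at(vertices, faces, v) for v in view_angles]
```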
Optionally, the SMPL-X parameterized model may be trained with a loss function based on probability density distillation. The gradient of this loss can be expressed as $\nabla_{\theta}\mathcal{L} = \mathbb{E}_{t,\epsilon}\big[w(t)\,(\hat{\epsilon}_{\phi}(x_t; T, t) - \epsilon)\,\partial x/\partial\theta\big]$, where $\hat{\epsilon}_{\phi}$ represents the pre-trained diffusion model, $\epsilon$ represents the corresponding noise generation model, and $w(t)$ represents a weighting function that depends on the noise level.
Optionally, the normal map extractor may be an open differentiable renderer (Open Differentiable Renderer, OpenDR).
Optionally, the fifth image can be expressed as $I_{5}^{(v)} = \mathcal{R}(B, v)$, where $\mathcal{R}$ denotes the renderer, $B$ the initialized three-dimensional human body, and $v$ the viewing angle.
Based on this scheme, the second image is obtained by weighting the fifth images and the normal map, so it contains both the geometric features of the three-dimensional human body and two-dimensional views of the three-dimensional human body from different viewing angles, which helps keep the geometry and texture consistent with the description in the first text.
S104, the three-dimensional human body model generating device updates model parameters of the three-dimensional human body model based on the first image and the second image so as to obtain a three-dimensional model of the target object.
Specifically, with continued reference to FIG. 2, the three-dimensional human body model generating device may calculate a first loss function based on probability density distillation from the first image and the second image, and update the model parameters of the three-dimensional human body model according to the first loss function to obtain the three-dimensional model of the target object.
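A schematic sketch of one probability-density-distillation (score-distillation) update step follows; the toy noise schedule, the weighting w(t), and the placeholder stand-ins for the frozen diffusion model and the differentiable render are assumptions for illustration, not the invention's exact implementation.

```python
# Schematic probability-density-distillation (score distillation) optimization loop.
# The frozen diffusion model and the differentiable render are placeholders, and the
# noise schedule and weighting w(t) are simplified assumptions.
import torch

def diffusion_predict_noise(noisy_image, t):
    """Placeholder for the frozen pre-trained 2D diffusion model's noise prediction."""
    return torch.randn_like(noisy_image)   # a real diffusion model would be called here

def render_second_image(params):
    """Placeholder differentiable 'render': maps model parameters to an image tensor."""
    return params.mean() + torch.zeros(3, 64, 64)

params = torch.zeros(10, requires_grad=True)        # e.g. SMPL-X shape coefficients
optimizer = torch.optim.Adam([params], lr=1e-2)
num_timesteps = 1000

for step in range(100):
    second_image = render_second_image(params)
    t = torch.randint(1, num_timesteps, (1,))
    noise = torch.randn_like(second_image)
    alpha_t = 1.0 - t.float() / num_timesteps        # toy noise schedule (assumption)
    noisy = alpha_t.sqrt() * second_image + (1.0 - alpha_t).sqrt() * noise

    with torch.no_grad():
        pred_noise = diffusion_predict_noise(noisy, t)   # frozen diffusion prior

    w_t = 1.0 - alpha_t                               # noise-dependent weight (assumption)
    # Score distillation treats w_t * (pred_noise - noise) as the gradient with respect
    # to the rendered image; the surrogate loss below reproduces exactly that gradient.
    loss = (w_t * (pred_noise - noise) * second_image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```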
In the embodiment of the invention, on the one hand, since the three-dimensional model of the target object can be generated from the first text, that is, a three-dimensional model can be created through a text description alone, the application threshold for creating three-dimensional models is lowered and non-professionals can also create three-dimensional models on their own; on the other hand, in generating the three-dimensional model from the parameterized three-dimensional human body model, no three-dimensional human body annotations and no large amount of training data are required, so computation is saved and the creation efficiency of the three-dimensional model is improved.
The foregoing description of the solution provided by the embodiments of the present invention has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the text-based three-dimensional human body model generation method provided by the embodiment of the invention, the execution subject may be a text-based three-dimensional human body model generating device, or a control module in that device for carrying out the text-based three-dimensional human body model generation method. In the embodiment of the invention, the text-based three-dimensional human body model generating device provided by the embodiment of the invention is described by taking the device executing the text-based three-dimensional human body model generation method as an example.
It should be noted that, in the embodiment of the present invention, the text-based three-dimensional human body model generating device may be divided into functional modules according to the above method example; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. Optionally, the division of modules in the embodiment of the present invention is schematic and is merely a logical division of functions; other division manners may be used in practice.
As shown in FIG. 3, an embodiment of the present invention provides a text-based three-dimensional human body model generating device 300. The text-based three-dimensional human body model generating device 300 includes an acquisition module 301 and a processing module 302. The acquisition module 301 may be configured to acquire a first text, where the first text is used to describe the appearance characteristics of the target object. The processing module 302 may be configured to generate a first image of the target object according to the first text; initialize a parameterized three-dimensional human body model and generate a second image of the three-dimensional human body model; and update model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
Optionally, the processing module 302 may be configured to generate a third image of the target object according to the first text; identify the face region of the target object in the third image, and refine the face region with a face refinement model to obtain the first image; wherein the refinement is used to enhance details and personalized features of the face region.
Optionally, the processing module 302 may be configured to crop the face region of the target object from the third image to obtain a fourth image; generate text for the fourth image with an image captioning method to obtain a second text, and determine a first sub-text describing the face region from the first text; and update model parameters of the face refinement model based on the second text and the first sub-text.
Optionally, the processing module 302 may be configured to initialize the parameterized three-dimensional human body model to obtain a three-dimensional human body; render the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; input the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weight the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used to extract geometric features of the three-dimensional human body model.
Optionally, the processing module 302 may be configured to calculate a first loss function based on probability density distillation from the first image and the second image, and update the model parameters of the three-dimensional human body model according to the first loss function to obtain the three-dimensional model of the target object.
In the embodiment of the invention, on the one hand, since the three-dimensional model of the target object can be generated from the first text, that is, a three-dimensional model can be created through a text description alone, the application threshold for creating three-dimensional models is lowered and non-professionals can also create three-dimensional models on their own; on the other hand, in generating the three-dimensional model from the parameterized three-dimensional human body model, no three-dimensional human body annotations and no large amount of training data are required, so computation is saved and the creation efficiency of the three-dimensional model is improved.
FIG. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communication interface (Communications Interface) 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform the text-based three-dimensional human body model generation method, which includes: acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object; generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the text-based three-dimensional human body model generation method provided above, the method comprising: acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object; generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text-based three-dimensional human body model generation method provided above, the method comprising: acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object; generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object; wherein the second image is a two-dimensional image of the three-dimensional human body model.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A text-based three-dimensional human body model generation method, comprising:
acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object;
generating a first image of the target object according to the first text;
initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model;
updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object;
wherein the second image is a two-dimensional image of the three-dimensional human body model;
the initializing the parameterized three-dimensional human body model and generating the second image of the three-dimensional human body model includes: initializing the parameterized three-dimensional human body model to obtain a three-dimensional human body; rendering the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; inputting the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weighting the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used for extracting geometric features of the three-dimensional human body model;
the generating the first image of the target object according to the first text includes: generating a third image of the target object according to the first text; identifying a face region of the target object in the third image, and refining the face region with a face refinement model to obtain the first image; wherein the refinement is used for enhancing details and personalized features of the face region;
after the identifying the face region of the target object in the third image, the method further comprises: cropping the face region of the target object from the third image to obtain a fourth image; generating text for the fourth image with an image captioning method to obtain a second text, and determining a first sub-text describing the face region from the first text; and updating model parameters of the face refinement model based on the second text and the first sub-text.
2. The text-based three-dimensional human body model generation method according to claim 1, wherein the updating the model parameters of the three-dimensional human body model based on the first image and the second image to obtain the three-dimensional model of the target object includes:
calculating a first loss function based on probability density distillation based on the first image and the second image;
and updating model parameters of the three-dimensional human body model according to the first loss function so as to obtain a three-dimensional model of the target object.
3. A text-based three-dimensional human body model generation device, comprising: an acquisition module and a processing module;
the acquisition module is used for acquiring a first text, wherein the first text is used for describing the appearance characteristics of a target object;
the processing module is used for generating a first image of the target object according to the first text; initializing a parameterized three-dimensional human body model and generating a second image of the three-dimensional human body model; and updating model parameters of the three-dimensional human body model based on the first image and the second image to obtain a three-dimensional model of the target object;
wherein the second image is a two-dimensional image of the three-dimensional human body model;
the processing module is used for initializing the parameterized three-dimensional human body model to obtain a three-dimensional human body; rendering the three-dimensional human body from different viewing angles to obtain fifth images at the different angles; inputting the three-dimensional human body into a normal map extractor to obtain a normal map of the three-dimensional human body; and weighting the fifth images and the normal map to obtain the second image; wherein the normal map extractor is used for extracting geometric features of the three-dimensional human body model;
the processing module is used for generating a third image of the target object according to the first text; identifying a face region of the target object in the third image, and refining the face region with a face refinement model to obtain the first image; wherein the refinement is used for enhancing details and personalized features of the face region;
the processing module is further used for cropping the face region of the target object from the third image to obtain a fourth image; generating text for the fourth image with an image captioning method to obtain a second text, and determining a first sub-text describing the face region from the first text; and updating model parameters of the face refinement model based on the second text and the first sub-text.
4. The text-based three-dimensional human body model generation device according to claim 3, wherein the processing module is used for calculating a first loss function based on probability density distillation from the first image and the second image, and updating the model parameters of the three-dimensional human body model according to the first loss function to obtain the three-dimensional model of the target object.
CN202311492951.0A 2023-11-10 2023-11-10 Three-dimensional human body model generation method and device based on text Active CN117237542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311492951.0A CN117237542B (en) 2023-11-10 2023-11-10 Three-dimensional human body model generation method and device based on text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311492951.0A CN117237542B (en) 2023-11-10 2023-11-10 Three-dimensional human body model generation method and device based on text

Publications (2)

Publication Number Publication Date
CN117237542A CN117237542A (en) 2023-12-15
CN117237542B true CN117237542B (en) 2024-02-13

Family

ID=89098558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311492951.0A Active CN117237542B (en) 2023-11-10 2023-11-10 Three-dimensional human body model generation method and device based on text

Country Status (1)

Country Link
CN (1) CN117237542B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN116778061A (en) * 2023-08-24 2023-09-19 浙江大学 Three-dimensional object generation method based on non-realistic picture
CN116934924A (en) * 2023-08-01 2023-10-24 广州虎牙科技有限公司 Cartoon image generation method and device and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11922640B2 (en) * 2021-03-08 2024-03-05 Toyota Research Institute, Inc. Semi-supervised 3D object tracking in videos via 2D semantic keypoints

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN116934924A (en) * 2023-08-01 2023-10-24 广州虎牙科技有限公司 Cartoon image generation method and device and computer equipment
CN116778061A (en) * 2023-08-24 2023-09-19 浙江大学 Three-dimensional object generation method based on non-realistic picture

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Three-dimensional human body modeling method based on two-dimensional point cloud images; Zhang Guangpian; Ji Zhongping; Computer Engineering and Applications (Issue 19); full text *
Zero-shot text-driven virtual human generation method based on a depth-conditioned diffusion model; Wang Ji et al.; Journal of Graphics; pages 1-12 of main text *

Also Published As

Publication number Publication date
CN117237542A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN110717977B (en) Method, device, computer equipment and storage medium for processing game character face
US10922882B2 (en) Terrain generation system
CN111243050B (en) Portrait simple drawing figure generation method and system and painting robot
US9262853B2 (en) Virtual scene generation based on imagery
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN113838176A (en) Model training method, three-dimensional face image generation method and equipment
CN112102303A (en) Semantic image analogy method for generating countermeasure network based on single image
JP7129529B2 (en) UV mapping to 3D objects using artificial intelligence
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN116188912A (en) Training method, device, medium and equipment for image synthesis model of theme image
KR102197653B1 (en) Method, system and computer program for sketch-to-line translation
CN113436058B (en) Character virtual clothes changing method, terminal equipment and storage medium
CN117237542B (en) Three-dimensional human body model generation method and device based on text
CN114998490B (en) Virtual object generation method, device, equipment and storage medium
CN114119923B (en) Three-dimensional face reconstruction method and device and electronic equipment
CN113496225B (en) Image processing method, image processing device, computer equipment and storage medium
CN112990123B (en) Image processing method, apparatus, computer device and medium
He et al. Text-based image style transfer and synthesis
Vesdapunt et al. JNR: Joint-based neural rig representation for compact 3D face modeling
CN110084872B (en) Data-driven smoke animation synthesis method and system
CN113592971A (en) Virtual human body image generation method, system, equipment and medium
CN114004751A (en) Image processing method and related equipment thereof
CN118037897B (en) Character attribute editing method and device based on regional style correction
Bader et al. SID-avatar database: A 3D Avatar Dataset for virtual world research
Rezaei et al. Hybrid filter blending to maintain facial expressions in rendered human portraits

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant