CN118035493A - Image generation method, device, equipment, storage medium and program product - Google Patents

Image generation method, device, equipment, storage medium and program product

Info

Publication number
CN118035493A
Authority
CN
China
Prior art keywords
style
image
style information
text
information
Prior art date
Legal status
Pending
Application number
CN202311860241.9A
Other languages
Chinese (zh)
Inventor
周彧聪
王力
韩烁
王子豪
杨斌
Current Assignee
Shanghai Xiyu Jizhi Technology Co ltd
Original Assignee
Shanghai Xiyu Jizhi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiyu Jizhi Technology Co ltd filed Critical Shanghai Xiyu Jizhi Technology Co ltd
Priority to CN202311860241.9A priority Critical patent/CN118035493A/en
Publication of CN118035493A publication Critical patent/CN118035493A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides an image generation method, apparatus, device, storage medium, and program product. The method includes: acquiring style information from a user-specified reference image through a style extraction model; acquiring a text description; and generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model. The methods provided by the present disclosure can extract a style from an image and use it to generate images.

Description

Image generation method, device, equipment, storage medium and program product
Technical Field
The present disclosure relates to the field of text-to-image generation, and in particular to an image generation method, apparatus, device, storage medium, and program product.
Background
Artificial Intelligence Generated Content (AIGC) is a technology that, based on generative AI methods such as generative adversarial networks and large-scale pre-trained models, learns from and identifies existing data to generate related content with suitable generalization ability. The key idea of AIGC is to use artificial intelligence algorithms to generate content with a certain degree of creativity and quality. Through model training and learning on large amounts of data, AIGC can generate content related to a text description input by a user. However, current AIGC technology can only generate images from text descriptions entered by the user; when a user wants an image in a particular style, the style of the generated image is often not what the user intended, because styles are difficult to describe in text.
Disclosure of Invention
The embodiments of this specification provide an image generation method, apparatus, device, storage medium, and program product that can extract styles from images.
In a first aspect, one or more embodiments of this specification provide an image generation method, including: acquiring style information from a user-specified reference image through a style extraction model; acquiring a text description; and generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model.
Optionally, the style information includes at least one of: color scheme information, local texture information, material information, brushstroke information, illumination information, and composition information.
According to some embodiments of the present disclosure, generating an artificial intelligence image or artificial intelligence video from the style information and the text description through the text-to-image model further includes: optimizing at least part of the style information.
Optionally, optimizing at least part of the style information includes: optimizing at least part of the style information based on the text description.
Optionally, optimizing at least part of the style information includes: acquiring a user-specified optimization direction, where the optimization direction includes at least one of the following: enhancing or weakening light and shadow, improving close-up or distant-view composition, increasing or decreasing saturation, and enhancing or reducing the sense of depth.
According to some embodiments of the present disclosure, the method further includes: acquiring a reference weight for the style information of the reference image, where the reference weight indicates the relative weight of the style information versus the text description in generating the artificial intelligence image or artificial intelligence video. Generating the artificial intelligence image or video then includes: generating an artificial intelligence image from the reference weight of the style information of the reference image, the style information, and the text description through a text-to-image model; or generating an artificial intelligence video from the reference weight of the style information of the reference image, the style information, and the text description through a text-to-video model.
According to some embodiments of the present disclosure, where the user specifies at least two reference images, the method further includes: acquiring style weights corresponding to each of the at least two reference images, where the style weights indicate the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or artificial intelligence video. Acquiring style information from the user-specified reference image through the style extraction model then includes: acquiring, through the style extraction model, style information from the at least two user-specified reference images and their style weights, where the style information is a fusion of the style information of the at least two reference images according to the style weights.
According to some embodiments of the present disclosure, where the user specifies at least two reference images, the method further includes: acquiring style weights corresponding to each of the at least two reference images, where the style weights indicate the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or artificial intelligence video. Generating the artificial intelligence image or video then includes: generating an artificial intelligence image from the style weights of the at least two reference images, their style information, and the text description through a text-to-image model, or generating an artificial intelligence video from the same inputs through a text-to-video model.
Optionally, the user-specified reference image includes a reference image uploaded by the user and/or a template image selected by the user.
According to some embodiments of the present disclosure, the style extraction model includes a mapping space model, and acquiring style information from the user-specified reference image through the style extraction model includes: generating, through the mapping space model, a mapping space based on the user-specified reference image, where the mapping space contains the style information of the reference image and comprises a vector with multiple dimensions.
According to some embodiments of the present disclosure, acquiring style information from the user-specified reference image through the style extraction model includes: dividing the reference image into multiple sub-images, shuffling them, and stitching them back together; and acquiring style information from the re-stitched reference image through the style extraction model.
According to some embodiments of the present disclosure, acquiring style information from the user-specified reference image includes: deforming the content of the reference image; and acquiring style information from the deformed reference image through the style extraction model.
According to some embodiments of the present disclosure, acquiring style information from the user-specified reference image includes: acquiring pixel distribution histograms of the different colors in the reference image; and acquiring style information from those histograms through the style extraction model.
In a second aspect, one or more embodiments of this specification provide an image generation apparatus, including: a style extraction model for acquiring style information from a user-specified reference image; an acquisition module for acquiring a text description; and a text-to-image model for generating an artificial intelligence image from the style information and the text description, or a text-to-video model for generating an artificial intelligence video from the style information and the text description.
In a third aspect, one or more embodiments of this specification provide an image generation device, including: a processor and a memory storing executable code which, when executed by the processor, causes the processor to perform any of the methods described above.
In a fourth aspect, one or more embodiments of this specification provide a computer-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the electronic device to perform any of the methods described above.
In a fifth aspect, one or more embodiments of this specification provide a computer program product comprising a computer program or computer-executable instructions which, when executed by a processor, implement any of the methods described above.
In the embodiments of this specification, the style extraction model extracts the style information of the user-specified reference image and inputs it into the text-to-image or text-to-video model, so that instead of describing the style in text, the user only needs to find a reference image in a similar style to obtain an AI image in that style.
Drawings
The present specification is further illustrated by way of example embodiments, described in detail with reference to the accompanying drawings. These embodiments are non-limiting, and like numbers indicate like structures or steps.
Fig. 1 is a flow chart of an embodiment of an image generation method of the present specification.
Fig. 2 is a flow chart of an embodiment of an image generation method of the present specification.
Fig. 3 is a flow chart of an embodiment of the image generation method of the present specification.
Fig. 4 is a schematic block diagram of an embodiment of an image generating apparatus of the present specification.
Fig. 5 is a schematic structural view of an embodiment of the image generating apparatus of the present specification.
Detailed Description
To describe the technical solutions of the embodiments of this specification more clearly, the embodiments are described in detail below with reference to the accompanying drawings. Obviously, the following descriptions are only some examples or embodiments of this specification, and those skilled in the art can apply the disclosed technical solutions or means to other situations without inventive effort.
It should be appreciated that the terms "system," "apparatus," "unit," and/or "module" as used herein are one method for distinguishing between different components, elements, parts, portions, or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
Unless otherwise indicated, terms used in this specification to describe components, elements, and the like are not limited to the singular and may include the plural. In general, the terms "comprises," "comprising," and the like indicate the presence of the identified steps or elements but do not constitute an exclusive list; the described method or apparatus may include other steps or components.
It should also be understood that the term "and/or" as used in this specification is intended to encompass any or all possible combinations of one or more of the associated listed items. Although the terms "first," "second," "third," etc. may be used in this specification to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present description. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present specification, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Flowcharts are used in this specification to describe the steps of operations performed by an apparatus or system of related embodiments, but the order in which the steps are described should not be construed as a limitation on the order in which the steps are performed unless otherwise indicated. One of ordinary skill in the art may adjust the order of execution of the steps based on knowledge information conveyed by embodiments of the present description, including, but not limited to, exchange of precedence relationships, merging of multiple steps, and splitting of a step.
Fig. 1 is a flow chart of an embodiment of an image generation method of the present specification. As shown in fig. 1, the method includes the following steps.
Step S101, style information is acquired according to a reference image designated by a user through a style extraction model.
In some embodiments, there may be multiple ways of acquiring the user-specified reference image. For example, an image uploaded by the user is taken as the user-specified reference image; or template images of different styles are provided for the user to choose from, and the template image selected by the user is taken as the user-specified reference image.
Alternatively, the number of user-specified reference images may be one or at least two. When at least two reference images are specified, the style extraction model may extract and output the style information of each reference image separately, or it may extract the style information of each reference image, fuse them, and output the fused style information.
Various types of style information can be extracted. For example, the style information includes at least one of: color scheme information, local texture information, material information, brushstroke information, illumination information, and composition information.
Step S102, acquiring a text description.
In some embodiments, the text description may be input directly by the user. Optionally, multiple template descriptions may be provided, and the template description selected by the user is used as the text description. The template descriptions may be generated online in real time when triggered by the user, or may be pre-trained and pre-screened scene descriptions that are displayed for the user to select when triggered. Template descriptions can be obtained in various ways; for example, 10,000 text descriptions are generated by a language model, 2,000 descriptions that produce better images are then manually screened from them and stored, and when a user clicks "AI one-click generation" without entering anything, these 2,000 descriptions can be offered for selection, or one of them can be chosen at random without user selection.
Optionally, the text description input by the user may be passed to a language model, which expands and embellishes it; the expanded and embellished text description is then input into the text-to-image or text-to-video model, which helps generate higher-quality AI images or AI videos. There are various ways to expand and embellish. Optionally, the language model expands the user's text while adhering to its semantics, covering the subject itself in the user's text and/or details of the environment. In some examples, the language model generates at least one of the following kinds of supplemental information to complete the expansion: scene information, composition information, character action and expression information, and clothing and accessory information.
Scene information may include conventional scene content, such as indoor or outdoor, and the corresponding conventional scene elements; for example, if the user enters "a girl in a classroom", the language model may add common classroom objects such as a blackboard and desks. Composition information may include whether the generated image is a distant, middle, or close view, the position of the character in the image, and so on. Optionally, composition information may be supplemented according to the scene information and the composition rules corresponding to different scenes.
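A minimal sketch of this expansion step follows, assuming a generic chat-style language model exposed as a callable llm; the function name expand_description and the instruction wording are illustrative assumptions, not from the patent.

```python
# Sketch of the prompt-expansion step; `llm` is any text-in/text-out callable.
def expand_description(llm, user_text: str) -> str:
    """Expand a short user description while preserving its semantics."""
    instruction = (
        "Expand the following image description. Keep the original subject "
        "unchanged, and add supplemental details such as scene elements, "
        "composition (distant/middle/close view, subject position), character "
        "actions and expressions, and clothing or accessories:\n" + user_text
    )
    return llm(instruction)

# Example: expand_description(llm, "a girl in a classroom") might add a
# blackboard, desks, a middle-view composition, and the girl's expression.
```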
Step S103, generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model.
When generating the AI image or AI video, optionally, the text-to-image or text-to-video model may treat the style information and the text description as equally important; or the model may by default treat the style information as more important than the text description; or it may by default treat the text description as more important than the style information. Alternatively, a reference weight for the style information of the reference image may be acquired, indicating the relative weight of the style information versus the text description in generating the artificial intelligence image or video; the text-to-image or text-to-video model then generates the AI image or AI video from the reference weight, the style information, and the text description. Optionally, the reference weight may be obtained from user input.
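As one way to picture how a reference weight might trade style off against text, the following sketch assumes both signals have already been encoded as vectors and blends them before conditioning the generator; the concatenation scheme is an illustrative assumption, since the patent does not fix the mechanism.

```python
import numpy as np

def build_conditioning(style_vec: np.ndarray,
                       text_vec: np.ndarray,
                       reference_weight: float = 0.5) -> np.ndarray:
    """Blend style and text conditioning by the style's reference weight."""
    # reference_weight in [0, 1]: 1.0 conditions generation entirely on style,
    # 0.0 entirely on the text description.
    return np.concatenate([reference_weight * style_vec,
                           (1.0 - reference_weight) * text_vec])
```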
When generating AI pictures, a user may want a picture in a specific style, such as a cartoon, photographic, or cinematic style; say the user wants a Hong Kong film style, even specifically the style of Zhou Xingchi's films, rather than a typical Hollywood movie style; or perhaps a very abstract style in which the user usually draws. Such styles are hard to describe concretely and hard to express accurately in text. In the embodiments of this specification, the style extraction model extracts the style information of the user-specified reference image and inputs it into the text-to-image or text-to-video model, so that instead of describing the style in text, the user only needs to find a reference image in a similar style to obtain an AI image in that style.
In an example where the style extraction model outputs fused style information for at least two reference images, optionally, style weights for the at least two reference images are also acquired, indicating the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or video; the style information of the reference images is then fused according to these style weights. Optionally, the style weights may be obtained from user input. Fusion can be done in various ways; in one example, the fusion of style information is performed by the style extraction model.
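A weighted average is one plausible realization of this fusion; the sketch below assumes each reference image has already been encoded as a style embedding, and the normalization step is an illustrative choice rather than the patent's stated method.

```python
import numpy as np

def fuse_styles(style_vecs, style_weights):
    """Fuse per-image style embeddings into one, weighted by style weights."""
    w = np.asarray(style_weights, dtype=np.float64)
    w = w / w.sum()  # normalize so the weights sum to 1
    return sum(wi * v for wi, v in zip(w, style_vecs))

# Example: fuse two reference styles with a 70/30 split.
# fused = fuse_styles([style_a, style_b], [0.7, 0.3])
```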
Fig. 2 is a flow chart of an embodiment of an image generation method of the present specification. As shown in fig. 2, the method includes the following steps.
Step S201, obtaining style weights corresponding to the at least two reference images respectively.
The style weights indicate the relative weights with which the style information of the at least two reference images is referenced in generating the artificial intelligence image or video. Among the reference images, the one with a larger style weight contributes more of its style to the fused style information than one with a smaller style weight.
Step S202, acquiring, through a style extraction model, style information from the at least two user-specified reference images and their style weights, where the style information is a fusion of the style information of the at least two reference images according to the style weights.
Step S203, a text description is acquired.
Step S204, generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model.
In another example, the fusion of style information is performed by the text-to-image model or the text-to-video model. Fig. 3 is a flow chart of an embodiment of the image generation method of the present specification. As shown in fig. 3, the method includes the following steps.
Step S301, acquiring, through a style extraction model, the style information of each of at least two user-specified reference images.
Step S302, acquiring the style weights corresponding to the at least two reference images, where the style weights indicate the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or video.
Step S303, acquiring a text description.
Step S304, generating an artificial intelligence image from the style weights of the at least two reference images, their style information, and the text description through a text-to-image model, or generating an artificial intelligence video from the same inputs through a text-to-video model.
The style information acquired in step S301, the style weights acquired in step S302, and the text description acquired in step S303 are input together into the text-to-image or text-to-video model; the model fuses the style information of the at least two reference images according to their style weights and generates the AI image or AI video from the fused style information and the text description.
Optionally, the style extraction model in the embodiments of this specification is obtained by training a mapping (embedding) space model for styles. The mapping space model may adopt the existing BIT model architecture. Its training data may contain image works of different styles, for example artworks, paintings, and artistic or photographic works in various styles. The mapping space model has good style-learning ability, generalization ability, and robustness, and can quickly acquire style information from one or a few reference images. When a reference image is input into the style extraction model, the model outputs, based on the reference image, a vector with many dimensions (e.g., thousands of dimensions) in the mapping space that reflects the style information of the reference image. The mapping space is input into the text-to-image or text-to-video model together with the text description.
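The following sketch illustrates the general shape of such a mapping space model: an image encoder that maps a reference image to a single high-dimensional style vector. The generic CNN backbone and the 2048-dimensional output are stand-in assumptions for the backbone the patent names, not its actual architecture.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a reference image to one high-dimensional style embedding."""

    def __init__(self, embed_dim: int = 2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling discards spatial layout
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> style embedding: (batch, embed_dim)
        h = self.features(image).flatten(1)
        return self.proj(h)
```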
Style extraction models can be trained in various ways. In one example, the style extraction model is trained together with the text-to-image or text-to-video model. Taking the text-to-image model as an example: first, an initial mapping space model and an initial text-to-image model are constructed, and multiple sample images of different styles are used as training samples. A training sample is input into the initial mapping space model to obtain a sample mapping space, and the sample mapping space and a sample text description are then input into the text-to-image model to generate an AI picture. Finally, a loss function is constructed based on the pixel-level similarity between the generated AI picture and the sample style image, and increasing this similarity drives training, optimizing the mapping space model alone or optimizing the mapping space model and the text-to-image model together. The initial text-to-image model used in training may be an existing one or a separately trained one, without limitation.
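A minimal sketch of one such joint training step follows, assuming style_encoder is the mapping space model and t2i_model is the text-to-image model; using an L1 pixel loss for the "pixel-level similarity" is an assumption, as the patent does not fix the exact loss.

```python
import torch.nn.functional as F

def train_step(style_encoder, t2i_model, optimizer, sample_image, sample_text):
    """One joint optimization step over the mapping space and text-to-image models."""
    style_vec = style_encoder(sample_image)        # sample mapping space
    generated = t2i_model(style_vec, sample_text)  # generated AI picture
    loss = F.l1_loss(generated, sample_image)      # pixel-level similarity (assumed L1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```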
Since a sample image used to train the style extraction model contains not only style information but also content information, directly inputting it into the model for training would likely cause the style extraction model to learn not only the style of the sample image but also its content. To avoid the content of sample images affecting the training of the style extraction model, in some examples, sample images are preprocessed before being input for training, where the preprocessing removes the content of the sample image. The style of an image is preserved even when its content is no longer identifiable. Therefore, making the image content unidentifiable before training avoids the influence of content on model training.
In one example, the preprocessing includes dividing a sample image into multiple sub-images and shuffling them; the image obtained by re-stitching the shuffled sub-images is then input into the style extraction model for training. Randomly shuffling and recombining the sub-images preserves the style of the original image while making its content unidentifiable; the style extraction model extracts features from the recombined image and obtains a mapping space reflecting the style information. This shuffling effectively avoids the influence of sample-image content on the style extraction model and lets the model better learn the different styles in various images.
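A minimal sketch of this patch-shuffle preprocessing, with an illustrative 4x4 grid (any remainder pixels at the edges are dropped):

```python
import random
import numpy as np

def shuffle_patches(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Split an (H, W, C) image into a grid of sub-images, shuffle, re-stitch."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    random.shuffle(patches)  # destroys recognizable content, keeps local style
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)
```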
In one example, the preprocessing includes deforming the content of the sample image; the deformed sample image is then input into the style extraction model for training. Deforming the sample image makes its content unidentifiable and effectively avoids the influence of the content on the style extraction model. The deformation may include one or more of scaling, rotation, tilting, twisting, affine, and perspective deformations.
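A minimal sketch of the deformation preprocessing, composing several of the deformations named above; the parameter ranges are illustrative assumptions.

```python
from torchvision import transforms

deform = transforms.Compose([
    transforms.RandomAffine(degrees=30,        # rotation
                            scale=(0.7, 1.3),  # scaling
                            shear=20),         # tilt/shear
    transforms.RandomPerspective(distortion_scale=0.5, p=1.0),
])
# deformed = deform(pil_image)  # content becomes hard to identify; style remains
```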
In one example, the preprocessing includes acquiring pixel distribution histograms of the different colors in the sample image and then inputting those histograms into the style extraction model for training. Extracting the per-color pixel distribution histograms of the sample image makes the image content unidentifiable and effectively avoids its influence on the style extraction model.
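A minimal sketch of the histogram preprocessing, with 256 bins per channel as an illustrative choice for 8-bit images:

```python
import numpy as np

def color_histograms(image: np.ndarray, bins: int = 256) -> np.ndarray:
    """Per-channel pixel distribution histograms for an (H, W, 3) uint8 image."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    hists = np.stack(hists).astype(np.float64)
    return hists / hists.sum(axis=1, keepdims=True)  # normalize each channel
```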
Optionally, when style information is acquired from the user-specified reference image through the style extraction model in step S101, the user-specified reference image is first divided into multiple sub-images, shuffled, and re-stitched, and style information is then acquired from the re-stitched reference image through the style extraction model; or the content of the user-specified reference image is first deformed, and style information is acquired from the deformed reference image through the style extraction model; or the pixel distribution histograms of the different colors in the user-specified reference image are first acquired, and style information is acquired from those histograms through the style extraction model.
In some examples, before an artificial intelligence image or video is generated from the style information and the text description through the text-to-image model, at least part of the style information is optimized. For example, a style optimization option may be provided; when the user enables the option, the style information of the reference image uploaded by the user is automatically analyzed and at least part of it is optimized. For example, if the composition of the uploaded reference image is relatively poor but its color scheme is very good, the composition aspect can be automatically optimized to obtain an optimized style that improves the composition information while retaining the other style information of the uploaded reference image; the optimized style information is then input into the text-to-image or text-to-video model to generate an image or video.
In some examples, optimizing at least part of the style information includes optimizing it based on the text description: during style optimization, the content the user wants to generate is determined from the text description input by the user, and the style is optimized accordingly.
In some examples, optimizing at least part of the style information includes acquiring a user-specified optimization direction, where the optimization direction includes at least one of: enhancing or weakening light and shadow, improving close-up or distant-view composition, increasing or decreasing saturation, and enhancing or reducing the sense of depth. For example, different optimization directions may be offered as different options for the user to select. After uploading or selecting a reference image, the user can choose different style-enhancement directions to adjust its style information. In an example where the user can adjust the style weights of different reference images, optionally, the optimization strength for the style information of each reference image may also be determined from the style weights input by the user; for example, the style information of a reference image with a higher style weight is optimized more strongly.
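One conceivable realization, sketched under heavy assumptions, treats a user-selected optimization direction as an offset in the style embedding space and scales it by the style weight; the patent does not specify how directions act on the style information, so both the direction vector and the scaling are illustrative.

```python
import numpy as np

def optimize_style(style_vec: np.ndarray,
                   direction_vec: np.ndarray,
                   style_weight: float = 1.0,
                   strength: float = 0.1) -> np.ndarray:
    """Shift a style embedding along a chosen optimization direction."""
    # A higher style weight leads to a stronger optimization of that
    # reference image's style, as described above.
    return style_vec + style_weight * strength * direction_vec
```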
Fig. 4 is a schematic block diagram of one embodiment of an image generating apparatus of the present specification. As shown in fig. 4, the image generating apparatus 400 includes a style extraction model 401, a first acquisition module 402, and a text-to-image/text-to-video model 403.
A style extraction model 401 for acquiring style information from a reference image specified by a user.
A first obtaining module 402, configured to obtain a text description.
A text-to-image/text-to-video model 403 for generating an artificial intelligence image or artificial intelligence video from the style information and the text description.
Optionally, the style information includes at least one of: color scheme information, local texture information, material information, brushstroke information, illumination information, and composition information.
In some embodiments, the user-specified reference image is a single reference image, and the apparatus 400 further includes: a second acquisition module for acquiring a reference weight for the style information of the reference image, where the reference weight indicates the relative weight of the style information versus the text description in generating the artificial intelligence image or video. When generating an artificial intelligence image or video from the style information and the text description, the text-to-image/text-to-video model 403 is specifically configured to generate it from the reference weight of the style information of the reference image, the style information, and the text description.
In some embodiments, the user-specified reference image is at least two reference images, and the apparatus 400 further includes: a third acquisition module for acquiring the style weights corresponding to the at least two reference images, where the style weights indicate the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or video. When acquiring style information from a user-specified reference image, the style extraction model 401 is specifically configured to acquire style information from the at least two user-specified reference images and their style weights, where the style information is a fusion of the style information of the at least two reference images according to the style weights.
In some embodiments, the user-specified reference image is at least two reference images, and the apparatus further includes: a fourth acquisition module for acquiring the style weights corresponding to the at least two reference images, where the style weights indicate the relative weights with which the style information of each reference image is referenced in generating the artificial intelligence image or video. When generating an artificial intelligence image or video from the style information and the text description, the text-to-image/text-to-video model 403 is specifically configured to generate it from the style weights of the at least two reference images, their style information, and the text description.
Optionally, the user-specified reference image includes a reference image uploaded by the user and/or a template image selected by the user.
In some embodiments, the style extraction model 401 includes a mapping space model, and acquiring style information from a user-specified reference image through the mapping space model includes: generating, through the mapping space model, a mapping space based on the user-specified reference image, where the mapping space contains the style information of the reference image and comprises a vector with multiple dimensions.
In some embodiments, when acquiring style information from a user-specified reference image, the style extraction model 401 is specifically configured to divide the reference image into multiple sub-images, shuffle them, re-stitch them, and acquire style information from the re-stitched reference image.
In some embodiments, obtaining style information from a user-specified reference image includes: deforming the content of the reference image; and acquiring style information based on the deformed reference image through the style extraction model.
In some embodiments, obtaining style information from a user-specified reference image includes: acquiring pixel distribution histograms of different colors in the reference image; and acquiring style information based on the pixel distribution histograms of different colors through the style extraction model.
In some embodiments, before the text-to-image/text-to-video model 403 generates the artificial intelligence image or video from the style information and the text description, at least part of the style information is optimized.
Optionally, the optimizing at least part of the style information includes: optimizing at least part of the style information based on the textual description.
Optionally, optimizing at least part of the style information includes: acquiring a user-specified optimization direction, where the optimization direction includes at least one of: enhancing or weakening light and shadow, improving close-up or distant-view composition, increasing or decreasing saturation, and enhancing or reducing the sense of depth.
Some embodiments of the present description may also be implemented as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) that, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform some or all of the steps of the above-described methods according to the present description.
Alternatively, some embodiments of this specification may be implemented as a computer program product comprising a computer program or computer-executable instructions which, when executed by a processor, implement some or all of the steps of the above-described methods of this specification.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 501 and a memory 502. The memory 502 stores a computer program, and the processor 501 executes the image generation method described above when the computer program is invoked. Further, the electronic device may also include a bus, a microphone, a speaker, a display, and a camera. The processor 501, the memory 502, the microphone, the speaker, the display, and the camera may communicate via the bus, or by other means such as wireless transmission.
It is to be appreciated that the processor in the embodiments of this specification may be a central processing unit (CPU), but may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this specification may be implemented by hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the instructions produce, in whole or in part, the flows or functions according to the embodiments of this specification. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in or transmitted via a computer-readable storage medium, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
It will be appreciated that the various numerical numbers referred to in the embodiments of the present specification are merely for ease of description and are not intended to limit the scope of the embodiments of the present specification.
The embodiments of this specification have been described above. The above description is illustrative, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. An image generation method, comprising:
acquiring style information from a user-specified reference image through a style extraction model;
acquiring a text description;
and generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model.
2. The method of claim 1, wherein the style information comprises at least one of:
color scheme information, local texture information, material information, brushstroke information, illumination information, and composition information.
3. The method of claim 1, wherein generating an artificial intelligence image or artificial intelligence video from the style information and the text description through a text-to-image model further comprises:
optimizing at least part of the style information.
4. A method according to claim 3, wherein said optimizing at least part of said style information comprises:
optimizing at least part of the style information based on the text description.
5. A method according to claim 3, wherein said optimizing at least part of said style information comprises:
acquiring a user-specified optimization direction, wherein the optimization direction comprises at least one of the following: enhancing or weakening light and shadow, improving close-up or distant-view composition, increasing or decreasing saturation, and enhancing or reducing the sense of depth.
6. The method according to claim 1, wherein the method further comprises:
acquiring a reference weight for the style information of the reference image, wherein the reference weight indicates the relative weight of the style information versus the text description in generating the artificial intelligence image or artificial intelligence video;
wherein the generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model, comprises:
generating an artificial intelligence image from the reference weight of the style information of the reference image, the style information, and the text description through a text-to-image model; or
generating an artificial intelligence video from the reference weight of the style information of the reference image, the style information, and the text description through a text-to-video model.
7. The method of claim 1, wherein the user-specified reference image is at least two reference images, the method further comprising:
acquiring style weights corresponding to each of the at least two reference images, wherein the style weights indicate the relative weights with which the style information of the at least two reference images is referenced in generating the artificial intelligence image or artificial intelligence video;
wherein the acquiring style information from the user-specified reference image through the style extraction model comprises:
acquiring, through a style extraction model, style information from the at least two user-specified reference images and the style weights of the at least two reference images, wherein the style information is a fusion of the style information of the at least two reference images according to the style weights.
8. The method of claim 1, wherein the user-specified reference image is at least two reference images, the method further comprising:
acquiring style weights corresponding to each of the at least two reference images, wherein the style weights indicate the relative weights with which the style information of the at least two reference images is referenced in generating the artificial intelligence image or artificial intelligence video;
wherein the generating an artificial intelligence image from the style information and the text description through a text-to-image model, or generating an artificial intelligence video from the style information and the text description through a text-to-video model, comprises:
generating an artificial intelligence image from the style weights of the at least two reference images, the style information of the at least two reference images, and the text description through a text-to-image model, or generating an artificial intelligence video from the style weights of the at least two reference images, the style information of the at least two reference images, and the text description through a text-to-video model.
9. The method of claim 1, wherein the user-specified reference image comprises a reference image uploaded by the user and/or a template image selected by the user.
10. The method of claim 1, wherein the style extraction model comprises a mapping space model,
wherein the acquiring style information from the user-specified reference image through the style extraction model comprises:
generating, through the mapping space model, a mapping space based on the user-specified reference image, wherein the mapping space contains the style information of the reference image and comprises a vector with multiple dimensions.
11. The method of claim 1, wherein the acquiring style information from the user-specified reference image through the style extraction model comprises:
dividing the reference image into a plurality of sub-images, shuffling them, and stitching them back together; and
acquiring style information from the re-stitched reference image through the style extraction model.
12. The method of claim 1, wherein the acquiring style information from the user-specified reference image comprises:
deforming the content of the reference image; and
acquiring style information from the deformed reference image through the style extraction model.
13. The method of claim 1, wherein the acquiring style information from the user-specified reference image comprises:
acquiring pixel distribution histograms of the different colors in the reference image; and
acquiring style information from the pixel distribution histograms through the style extraction model.
14. An image generating apparatus, comprising:
a style extraction model for acquiring style information from a user-specified reference image;
a first acquisition module for acquiring a text description; and
a text-to-image model for generating an artificial intelligence image from the style information and the text description, or a text-to-video model for generating an artificial intelligence video from the style information and the text description.
15. An image generating apparatus, characterized by comprising:
a processor and a memory storing executable code which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 13.
16. A computer readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 13.
17. A computer program product comprising a computer program or computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 13.
CN202311860241.9A 2023-12-31 2023-12-31 Image generation method, device, equipment, storage medium and program product Pending CN118035493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311860241.9A CN118035493A (en) 2023-12-31 2023-12-31 Image generation method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311860241.9A CN118035493A (en) 2023-12-31 2023-12-31 Image generation method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN118035493A true CN118035493A (en) 2024-05-14

Family

ID=91003081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311860241.9A Pending CN118035493A (en) 2023-12-31 2023-12-31 Image generation method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118035493A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188250A (en) * 2023-01-29 2023-05-30 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN116433468A (en) * 2023-02-22 2023-07-14 特赞(上海)信息科技有限公司 Data processing method and device for image generation
CN116958323A (en) * 2023-07-05 2023-10-27 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment, storage medium and program product

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188250A (en) * 2023-01-29 2023-05-30 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN116433468A (en) * 2023-02-22 2023-07-14 特赞(上海)信息科技有限公司 Data processing method and device for image generation
CN116958323A (en) * 2023-07-05 2023-10-27 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment, storage medium and program product


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination