CN116385597B - Text mapping method and device - Google Patents

Text mapping method and device

Info

Publication number
CN116385597B
CN116385597B (application CN202310231486.9A)
Authority
CN
China
Prior art keywords
text
image
context
content
characteristic data
Prior art date
Legal status
Active
Application number
CN202310231486.9A
Other languages
Chinese (zh)
Other versions
CN116385597A (en)
Inventor
秦鹏达
潘禧辰
李裕宏
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310231486.9A
Publication of CN116385597A
Application granted
Publication of CN116385597B
Legal status: Active


Classifications

    • G06T 11/60 Editing figures and text; Combining figures or text
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/256 Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06F 40/30 Semantic analysis
    • G06Q 30/0643 Graphical representation of items or shoppers
    • H04N 21/2187 Live feed
    • H04N 21/234 Processing of video elementary streams
    • H04N 21/23418 Analysing video streams, e.g. detecting features or characteristics
    • H04N 21/47815 Electronic shopping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Accounting & Taxation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Character Input (AREA)

Abstract

The application discloses a text mapping method and device, a text generation image model processing method and device, and an electronic device. For any content text in a content text sequence that constitutes a story, the text mapping method generates the corresponding target image based on the target content text, its context content text, and the context image generated from the context content text. Because the method accounts not only for the correlation between the target image and the target content text, but also for the correlation between the target image and the context content text, as well as the semantic consistency between the target image and the context image corresponding to the context content text, the content continuity and plot consistency of the story illustrations can be effectively improved.

Description

Text mapping method and device
Technical Field
The application relates to the technical field of image processing, and in particular to a text mapping method and device, a text generation image model processing method and device, and an electronic device.
Background
The speed of content creation is an important factor influencing traffic scale. Among the various multimedia formats, image information is more intuitive and has greater visual impact than text information, and therefore spreads more easily, so artificial intelligence techniques for generating images from text have become a research hotspot.
Currently, a typical machine learning model for generating images from textual descriptions is an image generation model based on a diffusion model. However, in the course of implementing the present invention, the inventors found that this technical solution has at least the following problem: a diffusion-model-based image generation model focuses only on generating a single picture and cannot generate a series of images with content continuity and plot consistency. For example, in scenarios such as producing comics, illustrating children's story books, illustrating novels, and illustrating self-media articles, the characters, scenes, and styles in the generated pictures cannot maintain the continuity and consistency of the content and plot.
Disclosure of Invention
The application provides a text mapping method to solve the problem of poor content consistency within illustration sequences in the prior art. The application additionally provides a text mapping device, a text generation image model processing method and device, and an electronic device.
The application provides a text mapping method, which comprises the following steps:
acquiring a content text sequence of a target object;
acquiring first characteristic data of a target content text in the text sequence;
acquiring at least one context text of the target content text and a context image corresponding to the context text;
acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image;
and generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
Optionally, the generating the target image corresponding to the target content text according to the first feature data and the at least one second feature data corresponding to the at least one context content text includes:
acquiring a noise image;
and removing noise from the noise image according to the first characteristic data and the at least one second characteristic data through a diffusion model, and taking the image after noise removal as the target image.
Optionally, the method further comprises:
extracting a first feature map of the noise image;
the removing noise from the noise image according to the first feature data and the at least one second feature data through a diffusion model, and taking the image after removing noise as the target image, includes:
removing noise from the first feature map according to the first feature data and the at least one second feature data through a diffusion model, and taking the image after noise removal as a second feature map;
and up-sampling the second feature map, and taking the up-sampled image as the target image.
Optionally, the acquiring the first feature data of the target content text in the text sequence includes:
performing word vector and word position embedding processing on the target content text to form third characteristic data of the target content text;
extracting fourth feature data of the target content text from the third feature data;
and acquiring the first characteristic data according to the fourth characteristic data and the target text type information.
Optionally, the obtaining, according to the context text and the corresponding context image, second feature data corresponding to the context text includes:
and carrying out multi-mode joint coding on the context text and the corresponding context image to form second characteristic data of image-text fusion.
Optionally, the performing multi-mode joint encoding on the context text and the corresponding context image to form second feature data of image-text fusion includes:
performing word vector and word position embedding processing on the context text to form fifth characteristic data of the context text;
dividing the context image into a plurality of sub-images;
acquiring sixth characteristic data of the context image according to the characteristic data of the plurality of subgraphs and subgraph position information;
acquiring seventh feature data of the context content text according to the fifth feature data and the sixth feature data;
and acquiring the second characteristic data according to the seventh characteristic data, the context text type information and the context text serial number information.
Optionally, the method further comprises constructing a text generation image model, the model comprising: a condition information encoding network and an image generating network;
the conditional information encoding network includes: a first feature data acquisition network and at least one second feature data acquisition network;
the first characteristic data acquisition network is used for acquiring the first characteristic data according to the target content text;
the second characteristic data acquisition network is used for carrying out multi-mode joint coding on the context content text and the context image to form second characteristic data;
the image generation network is used for generating the target image according to the first characteristic data and the at least one second characteristic data.
Optionally, the constructing of the text generation image model includes:
acquiring a text and a corresponding image which are irrelevant to the target object, and forming a first training sample set;
learning from the first training sample set to obtain a text generation image model;
acquiring a text and a corresponding image related to the target object to form a second training sample set;
and adjusting parameters of the text generated image model according to the second training sample set.
Optionally, the acquiring text and corresponding image related to the target object includes:
acquiring at least one character description information, at least one scene description information and/or at least one picture style description information;
generating, by the text generation image model learned from the first training sample set, at least one character image design picture according to the at least one character description information; generating at least one scene design picture according to the at least one scene description information; and/or generating at least one picture style design picture according to the at least one picture style description information;
taking the character description information and the character image design picture as a second training sample; taking the scene description information and the scene design picture as a second training sample; and/or taking the picture style description information and the picture style design picture as a second training sample.
Optionally, the method further comprises:
acquiring a new text and a corresponding image related to the target object to form a third training sample set;
and adjusting parameters of the text generation image model according to the third training sample set, wherein the adjusted model is used for generating matching images for the text to be processed of the target object.
Optionally, the acquiring of the newly added text and the corresponding image related to the target object includes:
acquiring at least one newly added character description information, at least one newly added scene description information and/or at least one newly added picture style description information;
generating, by the text generation image model learned from the second training sample set, at least one newly added character image design picture according to the at least one newly added character description information; generating at least one newly added scene design picture according to the at least one newly added scene description information; and/or generating at least one newly added picture style design picture according to the at least one newly added picture style description information;
taking the newly added character description information and the newly added character image design picture as a third training sample; taking the newly added scene description information and the newly added scene design picture as a third training sample; and/or taking the newly added picture style description information and the newly added picture style design picture as a third training sample.
The application also provides a text mapping device, comprising:
a text sequence obtaining unit for obtaining a content text sequence of the target object;
a first feature data acquisition unit, configured to acquire first feature data of a target content text in the text sequence;
a context data obtaining unit, configured to obtain at least one context text of the target content text and a context image corresponding to the context text;
a second feature data obtaining unit, configured to obtain second feature data corresponding to the context text according to the context text and the corresponding context image;
and the image generation unit is used for generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
The application also provides a text generation image model processing method, which comprises the following steps:
acquiring a text and a corresponding image irrelevant to a target object to form a first training sample set;
learning from the first training sample set to obtain a text generation image model, wherein the model comprises a condition information coding network and an image generation network;
acquiring a text and a corresponding image related to the target object to form a second training sample set;
and adjusting parameters of the model according to the second training sample set.
The application also provides a text generation image model processing device, which comprises:
the first training sample acquisition unit is used for acquiring texts and corresponding images irrelevant to the target object to form a first training sample set;
the first training unit is used for learning from the first training sample set to obtain a text generation image model, and the model comprises a condition information coding network and an image generation network;
the second training sample acquisition unit is used for acquiring texts and corresponding images related to the target object to form a second training sample set;
and the second training unit is used for adjusting parameters of the model according to the second training sample set.
The application also provides a story mapping method, which comprises the following steps:
receiving a content text sequence of a target story submitted by a client;
acquiring first characteristic data of the content text;
acquiring at least one context content text of the content text and a context image corresponding to the context content text;
acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image;
generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text;
and sending an image sequence corresponding to the content text sequence to the client.
Optionally, the method further comprises:
receiving text and corresponding images related to the target story submitted by the client;
and learning, according to the text and the corresponding image related to the target story, to obtain a text generation image model for generating the target image.
Optionally, the method further comprises:
receiving a new text and a corresponding image related to the target story submitted by the client;
and adjusting the text to generate an image model according to the newly added text and the corresponding image.
The application also provides a story mapping method, which comprises the following steps:
acquiring a content text sequence of a target story;
the content text sequence is sent to a server side, so that the server side obtains first characteristic data of the content text; acquiring at least one context content text of the content text and a context image corresponding to the context content text; acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image; generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text;
and displaying the image sequence which is returned by the server and corresponds to the content text sequence.
The application also provides a commodity live broadcast method, which comprises the following steps:
receiving a description content sequence of a target commodity submitted by a client;
acquiring first characteristic data of the descriptive content;
acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content;
acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image;
generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content;
and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
The application also provides a commodity live broadcast method, which comprises the following steps:
acquiring a description content sequence of a target commodity;
the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
The application also provides a commodity release method, which comprises the following steps:
receiving a description content sequence of a target commodity submitted by a client;
acquiring first characteristic data of the descriptive content;
acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content;
acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image;
generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content;
and publishing an image sequence corresponding to the description content sequence to a commodity detail page of the target commodity.
The application also provides a commodity release method, which comprises the following steps:
acquiring a description content sequence of a target commodity;
the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing an image sequence corresponding to the description content sequence to a commodity detail page of the target commodity.
The present application also provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the application has the following advantages:
according to the text mapping method provided by the embodiments of the application, for any content text in a content text sequence that constitutes a story, the target image corresponding to the target content text is generated based on the target content text, its context content text, and the context image generated from the context content text. Because the method accounts not only for the correlation between the target image and the target content text, but also for the correlation between the target image and the context content text, as well as the semantic consistency between the target image and the context image corresponding to the context content text, the content continuity and plot consistency of the story illustrations can be effectively improved.
Drawings
FIG. 1 is a flow diagram of an embodiment of the text mapping method provided herein;
FIG. 2 is a schematic diagram of an embodiment of the text mapping method provided herein;
FIG. 3 is a schematic diagram of an embodiment of the text mapping method provided herein;
FIG. 4 is a further schematic diagram of an embodiment of the text mapping method provided herein;
FIG. 5 is a flowchart of adding training data before starting the mapping in the embodiment of the text mapping method provided in the present application;
FIG. 6 is a schematic flow chart of adding training data during the mapping process in the embodiment of the text mapping method provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
In the application, a text mapping method and device, a text generation image model processing method and device and an electronic device are provided. The various schemes are described in detail below in the examples.
First embodiment
Please refer to fig. 1, which is a flowchart of a text mapping method of the present application. In this embodiment, the method may include the steps of:
Step S101: and acquiring a content text sequence of the target object.
The target object includes a content text sequence consisting of a plurality of context-related content texts. For example, the target object is a children's story comprising a plurality of text segments, and each text segment is illustrated to obtain a comic-style version of the story.
As shown in fig. 2, the method provided in the embodiment of the present application may be applied to a scenario of illustrating a story. The scenario involves a first user client and a first service end. The first user client is used by a story viewer and may be a terminal device such as a personal computer, a smart phone, or a tablet computer. The first service end is deployed with a text mapping system, whose input data is the content text sequence of a story and whose output data is the illustration sequence of the story. In a specific implementation, the text mapping system may, through a story mapping module, use a text generation image model to obtain the illustration sequence of the story according to the content text sequence of the story.
Step S103: and acquiring first characteristic data of target content text in the text sequence.
The first feature data refers to text feature data of the target content text. The method provided by the embodiment of the application can execute the processing of step S103 on each text segment in the text sequence, and obtain the first feature data of each text segment.
Step S105: and acquiring at least one context content text of the target content text and a context image corresponding to the context content text.
For any target content text in the text sequence, one or more context content texts and the context images corresponding to those context content texts can be obtained. The context content text includes, but is not limited to, content text preceding the target content text, and may also include content text following it. The context content text may be not only content text adjacent to the target content text, but also content text separated from the target content text by other content texts.
The context image corresponding to the context text may be a corresponding image generated for the context text by using the method provided in the embodiment of the present application, or may be a corresponding image generated or specified by other means.
Step S107: and acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image.
The second feature data refers to image-text fusion feature data of a context content text and its corresponding image, for example, the content text at the moment immediately preceding the target content text together with its corresponding image. The image-text fusion feature data may be a simple superposition of image feature data and text feature data, or may be feature data obtained by jointly encoding the image and the text.
Step S109: and generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
In one example, the first feature data of the target content text in the text sequence is obtained through a condition information encoding network (also referred to as an autoregressive feature extraction network) of a text generation image model; the second feature data corresponding to each context content text is obtained according to the context content text and the corresponding context image; and the target image corresponding to the target content text is generated, through an image generation network of the text generation image model, according to the first feature data and the at least one second feature data corresponding to the at least one context content text.
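As an illustration of how these pieces could fit together, the following Python sketch (PyTorch-style) assembles the conditioning information and calls an image generation network for a single frame. All function names, signatures, tensor shapes, and hyperparameters are assumptions made for illustration only; tokenization and the internals of the encoders are omitted, and nothing here is prescribed by the patent.

```python
import torch

def generate_target_image(target_text, context_pairs, text_encoder,
                          context_encoder, image_generator):
    """Generate the image for one content text, conditioned on its context.

    target_text     : the n-th frame content text
    context_pairs   : list of (context_text, context_image) tuples, most recent first
    text_encoder    : network producing the first feature data from the target text
    context_encoder : network producing second feature data from a (text, image) pair
    image_generator : network (e.g. a diffusion model) that denoises an image
                      conditioned on the concatenated feature data
    """
    # First feature data: text-only features of the target content text.
    first_feat = text_encoder(target_text)                         # (1, L_t, d)

    # Second feature data: one image-text fusion feature per context frame.
    second_feats = [
        context_encoder(ctx_text, ctx_image, frame_offset=k + 1)   # (1, L_c, d)
        for k, (ctx_text, ctx_image) in enumerate(context_pairs)
    ]

    # Conditioning sequence = first feature data followed by all second feature data.
    cond = torch.cat([first_feat, *second_feats], dim=1)

    # The image generation network turns a noise image into the target image
    # under this conditioning (see the diffusion sketch further below).
    noise = torch.randn(1, 3, 256, 256)
    return image_generator(noise, cond)
```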
In a specific implementation, the condition information encoding network may include a first feature acquiring network and at least one second feature acquiring network, where the first feature acquiring network is configured to acquire first feature data of a target content text in the text sequence, and the second feature acquiring network is configured to acquire second feature data corresponding to a context content text according to the context content text and a corresponding context image at a certain moment.
The plurality of second feature data acquisition networks respectively correspond to context content texts at different moments. As shown in fig. 2, the image corresponding to the target content text may be generated according to the target content text, the m context content texts preceding it, and the images respectively corresponding to those context content texts. In this case, the autoregressive feature extraction network may include m second feature data acquisition networks. Regarding the target content text as the n-th frame text, the 1st second feature data acquisition network corresponds to the (n-1)-th frame text, the 2nd corresponds to the (n-2)-th frame text, and so on, with the m-th corresponding to the (n-m)-th frame text. In the process of generating the story illustrations, if the target content text is the 2nd frame text, the 1st second feature data acquisition network corresponds to the 1st frame text and the image generation network outputs the 2nd frame image; if the target content text is the 3rd frame text, the 1st second feature data acquisition network may correspond to the 2nd frame text, the 2nd may correspond to the 1st frame text, and the image generation network outputs the 3rd frame image.
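Building on the previous sketch, the autoregressive story-illustration loop with a sliding window of m preceding frames could look roughly as follows. The helper generate_target_image is the assumed function from the sketch above; m and all other names remain illustrative.

```python
def illustrate_story(content_texts, m, text_encoder, context_encoder, image_generator):
    """Generate one image per content text, frame by frame.

    For the n-th frame, up to m preceding (text, image) pairs are used as context,
    so earlier generated images feed back into later frames (autoregressive).
    """
    images = []
    for n, target_text in enumerate(content_texts):
        # Sliding window: frames n-1, n-2, ..., n-m (only those that exist).
        window = list(range(n - 1, max(n - m, 0) - 1, -1))
        context_pairs = [(content_texts[j], images[j]) for j in window]

        image = generate_target_image(
            target_text, context_pairs, text_encoder, context_encoder, image_generator
        )
        images.append(image)
    return images
```

Limiting the window to the m most recent frames keeps the conditioning sequence bounded, while earlier frames still influence later ones indirectly through the chain of already generated images.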
In one example, the first feature data acquisition network may acquire the first feature data of the target content text in the following manner: 1) Performing word vector and word position embedding processing on the target content text to form third characteristic data of the target content text; 2) Extracting fourth feature data of the target content text from the third feature data; 3) And acquiring the first characteristic data according to the fourth characteristic data and the target text type information.
A piece of text may include a plurality of words, and the word position refers to the sequential position of a word in the text, such as the 1st word, the 2nd word, and so on. By carrying out word position embedding processing on the target content text, input order information is injected into the model, which can effectively improve feature extraction accuracy.
The feature type corresponding to the first feature data acquisition network is the target text type, which indicates that the feature data output by this network is the first feature data of the target content text (the description of the picture to be generated). Correspondingly, the feature type corresponding to the second feature data acquisition network is the context text type, which indicates that the feature data output by that network is the fusion feature data of a context content text and its corresponding image, namely the second feature data. By adopting this processing mode, the image generation network can determine the type of each piece of feature data, thereby improving accuracy.
As shown in fig. 3, in a specific implementation, the first feature data acquisition network may perform word vector embedding processing on the target content text through a text embedding layer (Text Embedding), and extract the fourth feature data of the target content text from the third feature data through a text modeling layer (Text Model). Fig. 3 also shows the embedded data for the feature type and the word positions; feature type "0" represents the target text type, and feature type "1" represents the context text type.
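A minimal PyTorch sketch of such a first feature data acquisition network is given below, assuming a transformer-based text model; the vocabulary size, embedding dimension, layer count, and class name are illustrative assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn

class FirstFeatureNetwork(nn.Module):
    """Encodes the target content text into the first feature data (sketch)."""

    def __init__(self, vocab_size=30000, d_model=512, max_len=128, n_layers=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)    # word vector embedding
        self.pos_emb = nn.Embedding(max_len, d_model)         # word position embedding
        self.type_emb = nn.Embedding(2, d_model)               # 0 = target text, 1 = context text
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.text_model = nn.TransformerEncoder(enc_layer, n_layers)  # "Text Model" layer

    def forward(self, token_ids):                               # token_ids: (B, L)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        third_feat = self.word_emb(token_ids) + self.pos_emb(positions)   # third feature data
        fourth_feat = self.text_model(third_feat)                          # fourth feature data
        type_ids = torch.zeros_like(token_ids)                             # target text type = 0
        first_feat = fourth_feat + self.type_emb(type_ids)                 # first feature data
        return first_feat
```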
In one example, the second feature data acquisition network may acquire the second feature data in the following manner: carrying out multi-mode joint coding on the context content text and the corresponding context image to form second feature data of image-text fusion. By jointly encoding the context text and the image, the accuracy of the second feature data can be effectively improved.
In specific implementation, the second feature data acquiring network may acquire the second feature data in the following manner: 1) Performing word vector and word position embedding processing on the context text to form fifth characteristic data of the context text; 2) Dividing the context image into a plurality of sub-images; 3) Acquiring sixth characteristic data of the context image according to the characteristic data of the plurality of subgraphs and subgraph position information; 4) Acquiring seventh feature data of the context content text according to the fifth feature data and the sixth feature data; 5) And acquiring the second characteristic data according to the seventh characteristic data, the context text type information and the context text serial number information. The feature type of the second feature data is a context text type.
The word vector and word position embedding process for the context text is similar to the word vector and word position embedding process for the target text, and will not be repeated here.
In a specific implementation, the corresponding context image can be divided into a number of sub-images matching the number of words in the context content text; the feature data of each sub-image is extracted and the sub-image position information is embedded, so that the context content text and the corresponding image can be better jointly encoded, which effectively improves the accuracy of the second feature data.
In particular, corresponding time information (context text sequence number information) may be embedded in each of the plurality of second feature data; for example, the sequence number "n-1" is embedded in the second feature data corresponding to the context content text of the (n-1)-th frame, and the sequence number "n-m" is embedded in the second feature data corresponding to the context content text of the (n-m)-th frame. By adopting this processing mode, the image generation network can determine the time information of the context content text corresponding to each piece of second feature data and can weight the plural pieces of second feature data differently according to their time information, thereby improving the content consistency between the target image and the context images.
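The following PyTorch sketch illustrates a second feature data acquisition network along the lines of steps 1) to 5) above: the context content text is embedded into the fifth feature data, the context image is split into sub-images whose features plus position embeddings form the sixth feature data, both are jointly encoded into the seventh feature data, and type and frame sequence-number embeddings are added to obtain the second feature data. All sizes, the patch-based image splitting, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SecondFeatureNetwork(nn.Module):
    """Jointly encodes a context content text and its context image (sketch)."""

    def __init__(self, vocab_size=30000, d_model=512, max_len=128,
                 patch=32, img_size=256, max_frames=64, n_layers=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.word_pos_emb = nn.Embedding(max_len, d_model)
        self.patch_proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # split into sub-images
        self.patch_pos_emb = nn.Embedding(n_patches, d_model)                     # sub-image position info
        self.type_emb = nn.Embedding(2, d_model)             # 1 = context text type
        self.frame_emb = nn.Embedding(max_frames, d_model)   # context frame sequence number
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(enc_layer, n_layers)

    def forward(self, token_ids, image, frame_offset):
        # Fifth feature data: word vector + word position embedding of the context text.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        fifth = self.word_emb(token_ids) + self.word_pos_emb(positions)           # (B, L, d)

        # Sixth feature data: sub-image features + sub-image position embeddings.
        patches = self.patch_proj(image).flatten(2).transpose(1, 2)               # (B, P, d)
        patch_pos = torch.arange(patches.size(1), device=image.device)
        sixth = patches + self.patch_pos_emb(patch_pos)

        # Seventh feature data: joint (multi-modal) encoding of text and image tokens.
        seventh = self.joint_encoder(torch.cat([fifth, sixth], dim=1))

        # Second feature data: add context-text type and frame sequence-number embeddings.
        type_vec = self.type_emb(torch.ones(1, dtype=torch.long, device=image.device))
        frame_vec = self.frame_emb(torch.tensor([frame_offset], device=image.device))
        return seventh + type_vec + frame_vec
```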
In one example, step S109 may be implemented as follows: acquiring a noise image; and removing noise from the noise image according to the first feature data and the at least one second feature data through a diffusion model, and taking the denoised image as the target image. In this implementation, the image generation network adopts a diffusion model; since diffusion models are characterized by high image generation quality, the image quality of the story illustrations can be effectively improved.
In a specific implementation, step S109 may employ a diffusion model or another relatively mature image generation network. In this embodiment, the method may further include the step of extracting a first feature map of the noise image; accordingly, step S109 may be implemented as follows: removing noise from the first feature map according to the first feature data and the at least one second feature data through a diffusion model, and taking the denoised result as a second feature map; and up-sampling the second feature map, and taking the up-sampled image as the target image. The first feature map is smaller in size than the noise image, and using the diffusion model to process this smaller feature map can effectively improve the efficiency of image generation.
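To make the denoising step concrete, the sketch below shows a generic DDPM-style sampling loop over a small feature map, conditioned on the concatenated first and second feature data, followed by a simple up-sampling step. The noise schedule, the denoiser interface, and the use of bilinear interpolation in place of a learned decoder are assumptions; the patent does not specify these details.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_target_image(denoiser, cond, latent_shape=(1, 4, 32, 32),
                        steps=50, out_size=256):
    """DDPM-style ancestral sampling on a small feature map, then up-sampling.

    denoiser(x_t, t, cond) is assumed to predict the noise added at step t,
    conditioned on the first + second feature data `cond`.
    """
    # Linear beta schedule (assumed); alpha_bar is the cumulative product.
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(latent_shape)                       # noisy first feature map
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)      # predicted noise
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean                                    # denoised second feature map

    # Up-sample the denoised feature map to the target image resolution
    # (in practice this would be a learned decoder / super-resolution module).
    return F.interpolate(x, size=(out_size, out_size), mode="bilinear",
                         align_corners=False)
```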
The text mapping process has now been described; the processing of the text generation image model used in the text mapping process is described below.
In one example, the method may further comprise the steps of: building a text-generated image model, the model comprising: a condition information encoding network and an image generating network; the conditional information encoding network includes: a first feature data acquisition network and at least one second feature data acquisition network; the first characteristic data acquisition network is used for acquiring the first characteristic data according to the target content text; the second characteristic data acquisition network is used for carrying out multi-mode joint coding on the context content text and the context image to form second characteristic data; the image generation network is used for generating the target image according to the first characteristic data and the at least one second characteristic data.
In a specific implementation, constructing the text generation image model may include the following steps: 1) acquiring texts and corresponding images which are irrelevant to the target object, to form a first training sample set; 2) learning from the first training sample set to obtain a text generation image model; 3) acquiring texts and corresponding images related to the target object, to form a second training sample set; 4) adjusting parameters of the text generation image model according to the second training sample set.
The text generation image model learned from the first training sample set, which is irrelevant to the target object, is a general-purpose model applicable to various objects, whereas the model obtained by adjusting it according to the second training sample set related to the target object is a dedicated model adapted to the target object. By adopting this processing mode, custom characters, custom scenes, and custom styles for the target object can be supported, so the accuracy of text mapping can be effectively improved and the usage experience and flexibility can be improved. The following table shows the data of the first training sample set and the second training sample set.
In this table, the training samples related to story A are the second training samples, and the other training samples are the first training samples.
As shown in fig. 4 and fig. 5, in a specific implementation, acquiring the text and the corresponding image related to the target object may include the following sub-steps: 1) acquiring at least one character description information, at least one scene description information and/or at least one picture style description information; 2) generating, by the text generation image model learned from the first training sample set, at least one character image design picture according to the at least one character description information; generating at least one scene design picture according to the at least one scene description information; and/or generating at least one picture style design picture according to the at least one picture style description information; 3) taking the character description information and the character image design picture as a second training sample; taking the scene description information and the scene design picture as a second training sample; and/or taking the picture style description information and the picture style design picture as a second training sample. By adopting this processing mode, the manager of the story illustration task can input one or more of character description information, scene description information, and picture style description information in an orderly manner, and the system automatically generates images corresponding to the description information, so that texts and corresponding images related to the target object are acquired automatically. This can effectively improve the efficiency of generating story materials and thus the efficiency of story illustration.
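A rough sketch of this two-stage procedure is given below: the model is first pre-trained on the generic first training sample set, the pre-trained model is then used to bootstrap design pictures from character, scene, and style descriptions to form the second training sample set, and the model is fine-tuned on those samples. The training step uses a generic diffusion noise-prediction loss, and the optimizers, learning rates, and helper names (including sample_target_image from the earlier sketch) are assumptions, not details given by the patent.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, images, cond, steps=1000):
    """One noise-prediction training step (generic diffusion objective, assumed)."""
    t = torch.randint(0, steps, (images.size(0),))
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(images)
    noisy = torch.sqrt(alpha_bars) * images + torch.sqrt(1 - alpha_bars) * noise
    return F.mse_loss(model(noisy, t, cond), noise)

def train_two_stage(model, encode, generic_set, description_texts, epochs=(10, 3)):
    """Stage 1: generic pre-training; Stage 2: fine-tune on bootstrapped target samples."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(epochs[0]):
        for text, image in generic_set:                  # first training sample set
            loss = diffusion_training_step(model, image, encode(text))
            opt.zero_grad(); loss.backward(); opt.step()

    # Bootstrap the second training sample set: generate character / scene / style
    # design pictures from the description texts with the pre-trained model.
    target_set = [(desc, sample_target_image(model, encode(desc)))
                  for desc in description_texts]

    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # smaller LR for fine-tuning
    for _ in range(epochs[1]):
        for text, image in target_set:                   # second training sample set
            loss = diffusion_training_step(model, image, encode(text))
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```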
As shown in fig. 6, in a specific implementation, the method may further include the following steps: acquiring newly added texts and corresponding images related to the target object, to form a third training sample set; and adjusting parameters of the text generation image model according to the third training sample set, the adjusted model being used for generating matching images for the text to be processed of the target object. By adopting this processing mode, story materials, such as character design pictures for newly added characters, can be added during the story mapping process, which further improves flexibility and convenience and better matches users' actual usage habits.
In a specific implementation, acquiring the newly added text and the corresponding image related to the target object may include the following sub-steps: 1) acquiring at least one newly added character description information, at least one newly added scene description information and/or at least one newly added picture style description information; 2) generating, by the text generation image model learned from the second training sample set, at least one newly added character image design picture according to the at least one newly added character description information; generating at least one newly added scene design picture according to the at least one newly added scene description information; and/or generating at least one newly added picture style design picture according to the at least one newly added picture style description information; 3) taking the newly added character description information and the newly added character image design picture as a third training sample; taking the newly added scene description information and the newly added scene design picture as a third training sample; and/or taking the newly added picture style description information and the newly added picture style design picture as a third training sample.
As can be seen from the foregoing embodiment, in the text mapping method provided by the embodiments of the present application, for any content text in the content text sequence that constitutes a story, the target image corresponding to the target content text is generated based on the target content text, its context content text, and the context image generated from the context content text. Because the method accounts not only for the correlation between the target image and the target content text, but also for the correlation between the target image and the context content text, as well as the semantic consistency between the target image and the context image corresponding to the context content text, the content continuity and plot consistency of the story illustrations can be effectively improved.
Second embodiment
In the above embodiment, a text mapping method is provided, and correspondingly, the application also provides a text mapping device. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The application additionally provides a text mapping device. The device comprises: a text sequence acquisition unit, a first feature data acquisition unit, a context data acquisition unit, a second feature data acquisition unit, and an image generation unit.
A text sequence obtaining unit for obtaining a content text sequence of the target object; a first feature data acquisition unit, configured to acquire first feature data of a target content text in the text sequence; a context data obtaining unit, configured to obtain at least one context text of the target content text and a context image corresponding to the context text; a second feature data obtaining unit, configured to obtain second feature data corresponding to the context text according to the context text and the corresponding context image; and the image generation unit is used for generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
In one example, the image generation unit is specifically configured to acquire a noise image; and removing noise from the noise image according to the first characteristic data and the at least one second characteristic data through a diffusion model, and taking the image after noise removal as the target image.
In one example, the apparatus further comprises: a feature map extracting unit configured to extract a first feature map of the noise image; the image generating unit is specifically configured to remove noise from the first feature map according to the first feature data and the at least one second feature data through a diffusion model, and use the image after noise removal as a second feature map; and up-sampling the second characteristic diagram, and taking the up-sampled image as the target image.
In one example, the first feature data obtaining unit is specifically configured to perform word vector and word position embedding processing on the target content text to form third feature data of the target content text; extracting fourth feature data of the target content text from the third feature data; and acquiring the first characteristic data according to the fourth characteristic data and the target text type information.
In one example, the second feature data obtaining unit is specifically configured to perform multi-mode joint encoding on the context text and the corresponding context image to form second feature data of image-text fusion.
In one example, the second feature data obtaining unit is specifically configured to perform word vector and word position embedding processing on the context content text to form fifth feature data of the context content text; dividing the context image into a plurality of sub-images; acquiring sixth characteristic data of the context image according to the characteristic data of the plurality of subgraphs and subgraph position information; acquiring seventh feature data of the context content text according to the fifth feature data and the sixth feature data; and acquiring the second characteristic data according to the seventh characteristic data, the context text type information and the context text serial number information.
In one example, the apparatus further comprises: a model construction unit for constructing a text-generated image model, the model comprising: a condition information encoding network and an image generating network; the conditional information encoding network includes: a first feature data acquisition network and at least one second feature data acquisition network; the first characteristic data acquisition network is used for acquiring the first characteristic data according to the target content text; the second characteristic data acquisition network is used for carrying out multi-mode joint coding on the context content text and the context image to form second characteristic data; the image generation network is used for generating the target image according to the first characteristic data and the at least one second characteristic data.
In one example, the model building unit is specifically configured to obtain a text and a corresponding image that are unrelated to the target object, and form a first training sample set; learning from the first training sample set to obtain a text to generate an image model; acquiring a text and a corresponding image related to the target object to form a second training sample set; and adjusting parameters of the text generated image model according to the second training sample set.
In one example, the model building unit is specifically configured to obtain at least one character description information, at least one scene description information, and/or at least one picture style description information; generate, by the text generation image model learned from the first training sample set, at least one character image design picture according to the at least one character description information; generate at least one scene design picture according to the at least one scene description information; and/or generate at least one picture style design picture according to the at least one picture style description information; take the character description information and the character image design picture as a second training sample; take the scene description information and the scene design picture as a second training sample; and/or take the picture style description information and the picture style design picture as a second training sample.
In one example, the apparatus further comprises a model adjusting unit, configured to acquire newly added texts and corresponding images related to the target object to form a third training sample set, and to adjust parameters of the text generation image model according to the third training sample set, the adjusted model being used for generating matching images for the text to be processed of the target object.
In one example, the model adjusting unit is specifically configured to acquire at least one newly added character description information, at least one newly added scene description information, and/or at least one newly added picture style description information; generate, by the text generation image model learned from the second training sample set, at least one newly added character image design picture according to the at least one newly added character description information; generate at least one newly added scene design picture according to the at least one newly added scene description information; and/or generate at least one newly added picture style design picture according to the at least one newly added picture style description information; take the newly added character description information and the newly added character image design picture as a third training sample; take the newly added scene description information and the newly added scene design picture as a third training sample; and/or take the newly added picture style description information and the newly added picture style design picture as a third training sample.
Third embodiment
In the above embodiment, a text mapping method is provided, and correspondingly, the application also provides an electronic device. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The electronic device of the present embodiment includes a processor and a memory. The memory is used for storing a program for implementing any one of the text mapping methods described above; after the device is powered on, the program of the method is run by the processor.
The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In particular implementations, the electronic device may further include one or more of the following: a power component, an input/output (I/O) interface, and a communication component. The power component provides power to the various components of the electronic device and may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device. The I/O interface provides an interface between the processor and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The communication component is configured to facilitate wired or wireless communication between the electronic device and a user device (e.g., a smart phone, a tablet, etc.).
Fourth embodiment
In the above embodiment, a text mapping method is provided, and correspondingly, the application also provides a text generated image model processing method. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to method embodiment one, the description is relatively simple, and reference is made to the description of method embodiments in part. The method embodiments described below are merely illustrative.
The application additionally provides a text generation image model processing method, which comprises the following steps:
step 1: and acquiring a text and a corresponding image which are irrelevant to the target object, and forming a first training sample set.
Step 2: and learning from the first training sample set to obtain a text generated image model, wherein the model comprises a condition information coding network and an image generating network.
Step 3: and acquiring a text and a corresponding image related to the target object to form a second training sample set.
Step 4: and adjusting parameters of the model according to the second training sample set.
The condition information encoding network includes a first feature data acquisition network and at least one second feature data acquisition network. The model is used for: acquiring, through the first feature data acquisition network, first characteristic data of a target content text in a content text sequence of the target object; acquiring at least one context content text of the target content text and a context image corresponding to the context content text; carrying out, through the second feature data acquisition network, multi-modal joint coding on the context content text of the target content text and the context image corresponding to the context content text to form second characteristic data; and generating, through the image generation network, a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text of the target content text.
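A minimal sketch of the condition information coding network described above follows, assuming transformer-style encoders; the vocabulary size, feature dimension, patch size, and module names are illustrative assumptions. The image generation network, which would condition on these features (for example through cross-attention in a diffusion model), is only indicated in a comment.

```python
import torch
import torch.nn as nn

class FirstFeatureEncoder(nn.Module):
    """Encodes the target content text into first characteristic data."""

    def __init__(self, vocab_size: int = 30000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens: torch.Tensor) -> torch.Tensor:  # (batch, seq_len)
        return self.encoder(self.embed(text_tokens))

class SecondFeatureEncoder(nn.Module):
    """Jointly encodes a context content text and its context image (multi-modal joint coding)."""

    def __init__(self, vocab_size: int = 30000, dim: int = 512, patch_dim: int = 3 * 16 * 16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)  # sub-images of the context image become tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, seq_len); image_patches: (batch, n_patches, patch_dim)
        tokens = torch.cat([self.text_embed(text_tokens), self.patch_proj(image_patches)], dim=1)
        return self.encoder(tokens)

# The image generation network would then generate the target image conditioned on
# the first characteristic data and the second characteristic data of each context
# content text, e.g. via cross-attention inside a diffusion denoising network.
```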
The method provided in this embodiment corresponds to the text-generated image model part of the first embodiment and will not be described again here. The method provided by the embodiment of the application can support custom characters, custom scenes and custom styles for the target object, so that the accuracy of text mapping can be effectively improved, and the use experience and flexibility can be improved.
Fifth embodiment
In the above embodiment, a text-generated image model processing method is provided, and correspondingly, the application also provides a text-generated image model processing device. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a text-generated image model processing apparatus including: the first training sample acquisition unit is used for acquiring texts and corresponding images irrelevant to the target object to form a first training sample set; the first training unit is used for learning from the first training sample set to obtain a text generation image model, and the model comprises a condition information coding network and an image generation network; the second training sample acquisition unit is used for acquiring texts and corresponding images related to the target object to form a second training sample set; and the second training unit is used for adjusting parameters of the model according to the second training sample set.
Sixth embodiment
In the above embodiment, a text-generated image model processing method is provided, and corresponding to the text-generated image model processing method, the application also provides an electronic device. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The electronic device of the present embodiment includes: a processor and a memory. The memory is used for storing a program implementing any one of the text-generated image model processing methods described above; after the device is powered on, the program of the method is run by the processor.
The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In particular implementations, the electronic device may further include one or more of the following: a power component, an input/output (I/O) interface, and a communication component. The power component provides power to the various components of the electronic device and may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device. The I/O interface provides an interface between the processor and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The communication component is configured to facilitate wired or wireless communication between the electronic device and a user device (e.g., a smart phone, a tablet, etc.).
Seventh embodiment
In the above embodiment, a text mapping method is provided; correspondingly, the present application further provides a story mapping method for a server. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a story mapping method comprising:
step 1: a sequence of content text of a target story submitted by a client is received.
Clients include, but are not limited to, terminal devices such as personal computers, smart phones, and tablet computers. The client acquires a content text sequence of the target story and sends the content text sequence to the server. The server performs the following steps 2 to 5 for each content text in the content text sequence.
Step 2: and acquiring first characteristic data of the content text.
Step 3: and acquiring at least one context content text of the content text and a context image corresponding to the context content text.
Step 4: and acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image.
Step 5: and generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text.
Step 6: and sending an image sequence corresponding to the content text sequence to the client.
The server sends the image sequence corresponding to the content text sequence to the client, and the client displays the image sequence.
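The server-side flow of steps 1 to 6 can be summarised with the following minimal Python sketch. The helper functions are placeholders standing in for the text-generated image model, and using the two preceding content texts (together with the images already generated for them) as context is an illustrative assumption.

```python
from typing import Any, List

def encode_content_text(text: str) -> Any:
    """Step 2 placeholder: first characteristic data of the content text."""
    return {"text": text}

def encode_context(context_text: str, context_image: Any) -> Any:
    """Step 4 placeholder: multi-modal joint coding of a context content text and its image."""
    return {"text": context_text, "image": context_image}

def generate_image(first_features: Any, context_features: List[Any]) -> Any:
    """Step 5 placeholder: image generation conditioned on first and second characteristic data."""
    return {"conditioned_on": (first_features, context_features)}

def story_mapping(content_texts: List[str], window: int = 2) -> List[Any]:
    """Runs steps 2-5 for each content text; earlier generated images serve as context images."""
    images: List[Any] = []
    for i, text in enumerate(content_texts):
        first_features = encode_content_text(text)
        context_features = [
            encode_context(content_texts[j], images[j])
            for j in range(max(0, i - window), i)
        ]
        images.append(generate_image(first_features, context_features))
    return images  # step 6: the image sequence sent back to the client
```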
In one example, the method may further comprise the steps of: receiving a text and a corresponding image related to the target story submitted by the client; and learning, according to the text and the corresponding image related to the target story, a text-generated image model for generating the target image. With this processing mode, custom characters, custom scenes and custom styles can be supported for the target story, so that the accuracy of text mapping can be effectively improved, and the use experience and flexibility can be improved.
In one example, the method may further comprise the steps of: receiving a newly added text and a corresponding image related to the target story submitted by the client; and adjusting the text-generated image model according to the newly added text and the corresponding image. With this processing mode, story materials, such as character design pictures for newly added characters, can be added during the story mapping process, which further improves flexibility and convenience and better matches users' actual usage habits.
As can be seen from the above embodiments, in the story mapping method provided in the embodiments of the present application, for any content text in the sequence of content texts that constitutes a story, the corresponding image is generated based on the content text itself, its context content text, and the context images generated from the context content text. Because the degree of correlation between the generated image and the content text, the degree of correlation between the generated image and the context content text, and the semantic consistency between the generated image and the context images corresponding to the context content text are all taken into account, the content continuity and plot consistency of the story mapping can be effectively improved.
Eighth embodiment
In the above embodiment, a story mapping method is provided; correspondingly, the present application also provides a story mapping method for a client. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a story mapping method comprising:
step 1: a sequence of content text for the target story is obtained.
Step 2: and sending the content text sequence to a server.
The server acquires first characteristic data of the content text; acquiring at least one context content text of the content text and a context image corresponding to the context content text; acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image; generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text;
step 3: and displaying the image sequence which is returned by the server and corresponds to the content text sequence.
Ninth embodiment
In the above embodiment, a text mapping method is provided; correspondingly, the present application also provides a commodity live broadcast method for a server. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a commodity live broadcast method, which comprises the following steps:
step 1: and receiving the description content sequence of the target commodity submitted by the client.
Clients include, but are not limited to: and terminal equipment such as personal computers, smart phones, tablet computers and the like. The method comprises the steps that a client obtains a description content sequence of a target commodity; and sending the description content sequence to a server. The server performs the following steps 2 to 5 for each description in the description sequence.
Step 2: first characteristic data of the descriptive content is acquired.
Step 3: and acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content.
Step 4: and acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image.
Step 5: and generating an image corresponding to the descriptive content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual descriptive content.
Step 6: and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
The server publishes the image sequence corresponding to the description content sequence to the live broadcast platform, and the user side displays the image sequence through the live broadcast platform, so that users can learn about the commodity more conveniently and commodity transactions are promoted.
In one example, the method may further comprise the steps of: receiving a text and a corresponding image related to the target commodity submitted by the client; and learning, according to the text and the corresponding image related to the target commodity, a text-generated image model for generating the image. With this processing mode, custom characters, custom scenes and custom styles can be supported for the target commodity, so that the accuracy of the commodity mapping can be effectively improved, and the use experience and flexibility can be improved.
In one example, the method may further comprise the steps of: receiving a newly added text and a corresponding image related to the target commodity submitted by the client; and adjusting the text-generated image model according to the newly added text and the corresponding image. With this processing mode, commodity materials can be added during the commodity mapping process, which further improves flexibility and convenience and better matches users' actual usage habits.
As can be seen from the above embodiments, in the commodity live broadcast method provided in the embodiments of the present application, for any description content in the description content sequence that describes a commodity, the corresponding image is generated based on the description content itself, its context description content, and the context images generated from the context description content. Because the degree of correlation between the image and the corresponding description content, the degree of correlation between the image and the context description content, and the semantic consistency between the image and the context images corresponding to the context description content are all taken into account, the content continuity and plot consistency of the commodity mapping can be effectively improved, thereby improving the commodity live broadcast effect.
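The live-broadcast variant differs from the story mapping flow mainly in its final step, so the sketch below shows only that step: pushing the generated image sequence to a live broadcast platform. The publish helper and the live room identifier are hypothetical placeholders.

```python
from typing import Any, List

def publish_to_live_platform(images: List[Any], live_room_id: str) -> None:
    """Step 6 placeholder: publish the image sequence corresponding to the
    description content sequence to the live broadcast platform."""
    for index, image in enumerate(images):
        # A real integration would call the live platform's publishing API here.
        print(f"live room {live_room_id}: publishing image {index} -> {image!r}")
```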
Tenth embodiment
In the foregoing embodiment, a commodity live broadcast method is provided; correspondingly, the present application further provides a commodity live broadcast method for a client. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a commodity live broadcast method, which comprises the following steps:
step 1: a descriptive content sequence of the target commodity is obtained.
Step 2: the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
Eleventh embodiment
In the above embodiment, a text mapping method is provided; correspondingly, the present application also provides a commodity publishing method for a server. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a commodity release method, which comprises the following steps:
step 1: and receiving the description content sequence of the target commodity submitted by the client.
Clients include, but are not limited to: and terminal equipment such as personal computers, smart phones, tablet computers and the like. The method comprises the steps that a client obtains a description content sequence of a target commodity; and sending the description content sequence to a server. The server performs the following steps 2 to 5 for each description in the description sequence.
Step 2: first characteristic data of the descriptive content is acquired.
Step 3: and acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content.
Step 4: and acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image.
Step 5: and generating an image corresponding to the descriptive content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual descriptive content.
Step 6: and publishing an image sequence corresponding to the descriptive content sequence to an item detail page of the target item.
The server publishes the image sequence corresponding to the description content sequence to the commodity detail page of the target commodity, so that users can learn more about the commodity and commodity transactions are promoted.
In one example, the method may further comprise the steps of: receiving a text and a corresponding image related to the target commodity submitted by the client; and learning, according to the text and the corresponding image related to the target commodity, a text-generated image model for generating the image. With this processing mode, custom characters, custom scenes and custom styles can be supported for the target commodity, so that the accuracy of the commodity mapping can be effectively improved, and the use experience and flexibility can be improved.
In one example, the method may further comprise the steps of: receiving a newly added text and a corresponding image related to the target commodity submitted by the client; and adjusting the text-generated image model according to the newly added text and the corresponding image. With this processing mode, commodity materials can be added during the commodity mapping process, which further improves flexibility and convenience and better matches users' actual usage habits.
As can be seen from the above embodiments, in the commodity publishing method provided in the embodiments of the present application, for any description content in the description content sequence that describes a commodity, the corresponding image is generated based on the description content itself, its context description content, and the context images generated from the context description content. Because the degree of correlation between the image and the corresponding description content, the degree of correlation between the image and the context description content, and the semantic consistency between the image and the context images corresponding to the context description content are all taken into account, the content continuity and plot consistency of the commodity mapping can be effectively improved, and the richness of the published commodity content is improved.
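For the publishing variant, the distinguishing step is attaching the generated image sequence to the commodity detail page. The sketch below simply pairs each description content with its generated image for rendering on that page; the data structure is an assumption for illustration only.

```python
from typing import Any, Dict, List

def build_detail_page_section(descriptions: List[str], images: List[Any]) -> List[Dict[str, Any]]:
    """Pair each description content with its generated image so the detail page can
    render the text and its accompanying picture together."""
    return [{"text": text, "image": image} for text, image in zip(descriptions, images)]
```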
Twelfth embodiment
In the above embodiment, a commodity publishing method is provided; correspondingly, the present application also provides a commodity publishing method for a client. The method corresponds to the embodiment of the method described above. Since this method embodiment is substantially similar to the first method embodiment, the description is relatively simple; for relevant points, reference is made to the description of the first method embodiment. The method embodiment described below is merely illustrative.
The application additionally provides a commodity release method, which comprises the following steps:
step 1: a descriptive content sequence of the target commodity is obtained.
Step 2: the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing an image sequence corresponding to the descriptive content sequence to an item detail page of the target item.
It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the schemes described herein only within the scope permitted by the applicable laws and regulations of the relevant country and on the premise of compliance (for example, with the user's explicit consent and with the user actually notified).
While the preferred embodiment has been described, it is not intended to limit the invention thereto, and any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, so that the scope of the present invention shall be defined by the claims of the present application.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (22)

1. A text mapping method, comprising:
acquiring a content text sequence of a target object;
acquiring first characteristic data of a target content text in the text sequence;
acquiring at least one context text of the target content text and a context image corresponding to the context text;
acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image;
and generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
2. The method of claim 1, wherein generating the target image corresponding to the target content text from the first feature data and at least one second feature data corresponding to the at least one contextual content text comprises:
acquiring a noise image;
and removing noise from the noise image according to the first characteristic data and the at least one second characteristic data through a diffusion model, and taking the image after noise removal as the target image.
3. The method of claim 2, wherein:
the method further comprises the steps of:
extracting a first feature map of the noise image;
the removing noise from the noise image according to the first feature data and the at least one second feature data through a diffusion model, and taking the image after removing noise as the target image, includes:
removing noise from the first feature image according to the first feature data and the at least one second feature data through a diffusion model, and taking the image after noise removal as a second feature image;
and up-sampling the second characteristic diagram, and taking the up-sampled image as the target image.
4. The method of claim 1, wherein the obtaining the first feature data of the target content text in the text sequence comprises:
performing word vector and word position embedding processing on the target content text to form third characteristic data of the target content text;
extracting fourth feature data of the target content text from the third feature data;
and acquiring the first characteristic data according to the fourth characteristic data and the target text type information.
5. The method according to claim 1, wherein the obtaining second feature data corresponding to the context text according to the context text and the corresponding context image includes:
and carrying out multi-modal joint coding on the context text and the corresponding context image to form second characteristic data of image-text fusion.
6. The method of claim 5, wherein the multi-modal joint coding of the context text and the corresponding context image to form the second feature data of image-text fusion comprises:
performing word vector and word position embedding processing on the context text to form fifth characteristic data of the context text;
dividing the context image into a plurality of sub-images;
acquiring sixth characteristic data of the context image according to the characteristic data of the plurality of sub-images and sub-image position information;
acquiring seventh feature data of the context content text according to the fifth feature data and the sixth feature data;
and acquiring the second characteristic data according to the seventh characteristic data, the context text type information and the context text serial number information.
7. The method of claim 1, further comprising:
building a text-generated image model, the model comprising: a condition information encoding network and an image generating network;
the conditional information encoding network includes: a first feature data acquisition network and at least one second feature data acquisition network;
the first characteristic data acquisition network is used for acquiring the first characteristic data according to the target content text;
the second characteristic data acquisition network is used for carrying out multi-modal joint coding on the context content text and the context image to form second characteristic data;
the image generation network is used for generating the target image according to the first characteristic data and the at least one second characteristic data.
8. The method of claim 7, wherein said building a text-generated image model comprises:
acquiring a text and a corresponding image which are irrelevant to the target object, and forming a first training sample set;
learning from the first training sample set to obtain a text to generate an image model;
acquiring a text and a corresponding image related to the target object to form a second training sample set;
and adjusting parameters of the text generated image model according to the second training sample set.
9. The method of claim 8, wherein:
the acquiring the text and the corresponding image related to the target object comprises the following steps:
acquiring at least one character description information, at least one scene description information and/or at least one picture style description information;
generating, by the text-generated image model learned from the first training sample set, at least one character image design picture according to the at least one character description information; generating at least one scene design picture according to the at least one scene description information; and/or generating at least one picture style design picture according to the at least one picture style description information;
taking the character description information and the character image design picture as a second training sample; taking the scene description information and the scene design picture as a second training sample; and/or taking the picture style description information and the picture style design picture as a second training sample.
10. The method as recited in claim 8, further comprising:
acquiring a new text and a corresponding image related to the target object to form a third training sample set;
and adjusting parameters of the text-generated image model according to the third training sample set, wherein the model is used for generating an image for the to-be-processed text of the target object.
11. The method of claim 10, wherein:
the obtaining the new text and the corresponding image related to the target object comprises the following steps:
acquiring at least one newly added character description information, at least one newly added scene description information and/or at least one newly added picture style description information;
generating, by the text-generated image model learned from the second training sample set, at least one newly added character image design picture according to the at least one newly added character description information; generating at least one newly added scene design picture according to the at least one newly added scene description information; and/or generating at least one newly added picture style design picture according to the at least one newly added picture style description information;
taking the newly added character description information and the newly added character image design picture as a third training sample; taking the newly added scene description information and the newly added scene design picture as a third training sample; and/or taking the newly added picture style description information and the newly added picture style design picture as a third training sample.
12. A text-to-graphics apparatus, comprising:
a text sequence obtaining unit for obtaining a content text sequence of the target object;
a first feature data acquisition unit, configured to acquire first feature data of a target content text in the text sequence;
a context data obtaining unit, configured to obtain at least one context text of the target content text and a context image corresponding to the context text;
a second feature data obtaining unit, configured to obtain second feature data corresponding to the context text according to the context text and the corresponding context image;
and the image generation unit is used for generating a target image corresponding to the target content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one context content text.
13. A text-generating image model processing method, characterized by comprising:
acquiring a text and a corresponding image irrelevant to a target object to form a first training sample set;
learning from the first training sample set to obtain a text generation image model, wherein the model comprises a condition information coding network and an image generation network;
acquiring a text and a corresponding image related to the target object to form a second training sample set;
according to the second training sample set, adjusting parameters of the model;
wherein the condition information encoding network comprises: a first feature data acquisition network and at least one second feature data acquisition network; the first characteristic data acquisition network is used for acquiring first characteristic data of a target text according to the target text; the second feature data acquisition network is used for acquiring second feature data corresponding to the context text according to the context text of the target text and the context image corresponding to the context text; the image generation network is used for generating a target image corresponding to the target text according to the first characteristic data and at least one second characteristic data corresponding to at least one context text.
14. A text-generating image model processing apparatus, comprising:
the first training sample acquisition unit is used for acquiring texts and corresponding images irrelevant to the target object to form a first training sample set;
the first training unit is used for learning from the first training sample set to obtain a text generation image model, and the model comprises a condition information coding network and an image generation network;
the second training sample acquisition unit is used for acquiring texts and corresponding images related to the target object to form a second training sample set;
the second training unit is used for adjusting parameters of the model according to the second training sample set;
wherein the condition information encoding network comprises: a first feature data acquisition network and at least one second feature data acquisition network; the first characteristic data acquisition network is used for acquiring first characteristic data of a target text according to the target text; the second feature data acquisition network is used for acquiring second feature data corresponding to the context text according to the context text of the target text and the context image corresponding to the context text; the image generation network is used for generating a target image corresponding to the target text according to the first characteristic data and at least one second characteristic data corresponding to at least one context text.
15. A story mapping method, comprising:
receiving a content text sequence of a target story submitted by a client;
acquiring first characteristic data of the content text;
acquiring at least one context content text of the content text and a context image corresponding to the context content text;
acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image;
generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text;
and sending an image sequence corresponding to the content text sequence to the client.
16. The method as recited in claim 15, further comprising:
receiving text and corresponding images related to the target story submitted by the client;
and according to the text and the corresponding image related to the target story, learning to obtain a text generation image model for generating an image corresponding to the content text.
17. The method as recited in claim 15, further comprising:
receiving a new text and a corresponding image related to the target story submitted by the client;
and adjusting the text-generated image model according to the newly added text and the corresponding image.
18. A story mapping method, comprising:
acquiring a content text sequence of a target story;
the content text sequence is sent to a server side, so that the server side obtains first characteristic data of the content text; acquiring at least one context content text of the content text and a context image corresponding to the context content text; acquiring second characteristic data corresponding to the context text according to the context text and the corresponding context image; generating an image corresponding to the content text according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual content text;
and displaying the image sequence which is returned by the server and corresponds to the content text sequence.
19. A commodity direct broadcast method, comprising:
receiving a description content sequence of a target commodity submitted by a client;
acquiring first characteristic data of the descriptive content;
acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content;
Acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image;
generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content;
and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
20. A commodity direct broadcast method, comprising:
acquiring a description content sequence of a target commodity;
the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing the image sequence corresponding to the description content sequence to a live broadcast platform.
21. A commodity distribution method, comprising:
receiving a description content sequence of a target commodity submitted by a client;
acquiring first characteristic data of the descriptive content;
acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content;
acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image;
generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content;
and publishing an image sequence corresponding to the descriptive content sequence to an item detail page of the target item.
22. A commodity distribution method, comprising:
acquiring a description content sequence of a target commodity;
the description content sequence is sent to a server side, so that the server side obtains first characteristic data of the description content; acquiring at least one context descriptive content of the descriptive content and a context image corresponding to the context descriptive content; acquiring second characteristic data corresponding to the context description content according to the context description content and the corresponding context image; generating an image corresponding to the description content according to the first characteristic data and at least one second characteristic data corresponding to the at least one contextual description content; and publishing an image sequence corresponding to the descriptive content sequence to an item detail page of the target item.
CN202310231486.9A 2023-03-03 2023-03-03 Text mapping method and device Active CN116385597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310231486.9A CN116385597B (en) 2023-03-03 2023-03-03 Text mapping method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310231486.9A CN116385597B (en) 2023-03-03 2023-03-03 Text mapping method and device

Publications (2)

Publication Number Publication Date
CN116385597A CN116385597A (en) 2023-07-04
CN116385597B true CN116385597B (en) 2024-02-02

Family

ID=86966572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310231486.9A Active CN116385597B (en) 2023-03-03 2023-03-03 Text mapping method and device

Country Status (1)

Country Link
CN (1) CN116385597B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909654A (en) * 2019-11-18 2020-03-24 深圳市商汤科技有限公司 Training image generation method and device, electronic equipment and storage medium
CN112818159A (en) * 2021-02-24 2021-05-18 上海交通大学 Image description text generation method based on generation countermeasure network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140603B (en) * 2021-12-08 2022-11-11 北京百度网讯科技有限公司 Training method of virtual image generation model and virtual image generation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909654A (en) * 2019-11-18 2020-03-24 深圳市商汤科技有限公司 Training image generation method and device, electronic equipment and storage medium
CN112818159A (en) * 2021-02-24 2021-05-18 上海交通大学 Image description text generation method based on generation countermeasure network

Also Published As

Publication number Publication date
CN116385597A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN110458918B (en) Method and device for outputting information
CN107251006B (en) Gallery of messages with shared interests
WO2018212822A1 (en) Suggested actions for images
CN113377971B (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN109117778B (en) Information processing method, information processing apparatus, server, and storage medium
US20160306505A1 (en) Computer-implemented methods and systems for automatically creating and displaying instant presentations from selected visual content items
CN113348486A (en) Image display with selective motion description
CN103988202A (en) Image attractiveness based indexing and searching
CN105763420B (en) A kind of method and device of automatic information reply
US20170109339A1 (en) Application program activation method, user terminal, and server
US20170116521A1 (en) Tag processing method and device
CN103200224A (en) Method and device and terminal of information sharing
CN107748780B (en) Recovery method and device for file of recycle bin
CN110678861A (en) Image selection suggestions
WO2019085625A1 (en) Emotion picture recommendation method and apparatus
CN110827058A (en) Multimedia promotion resource insertion method, equipment and computer readable medium
CN111259245B (en) Work pushing method, device and storage medium
CN114404960A (en) Cloud game resource data processing method and device, computer equipment and storage medium
EP4080507A1 (en) Method and apparatus for editing object, electronic device and storage medium
CN112235632A (en) Video processing method and device and server
CN112258214A (en) Video delivery method and device and server
CN109116718B (en) Method and device for setting alarm clock
CN116385597B (en) Text mapping method and device
CN111741365B (en) Video composition data processing method, system, device and storage medium
CN112287173A (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant