CN117493602A - Training method of image-text conversion model, information interaction method and related device - Google Patents

Training method of image-text conversion model, information interaction method and related device

Info

Publication number
CN117493602A
CN117493602A
Authority
CN
China
Prior art keywords
text
training
picture
sememe
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310965865.0A
Other languages
Chinese (zh)
Inventor
白安琪
蒋宁
陆全
夏粉
吴海英
肖冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202310965865.0A priority Critical patent/CN117493602A/en
Publication of CN117493602A publication Critical patent/CN117493602A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a training method of an image-text conversion model, an information interaction method and related devices, which are used to improve the conversion accuracy of the image-text conversion model. The training method comprises the following steps: training a sememe extraction model based on a training corpus and first sememe information, wherein the training corpus comprises training text and training pictures, and the first sememe information comprises a sememe sequence of the training text and a sememe sequence of the training pictures; performing image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus, wherein the converted corpus comprises a first picture for describing the training text and a first text for describing the training pictures; performing sememe extraction on the converted corpus through the trained sememe extraction model to obtain a sememe sequence of the first picture and a sememe sequence of the first text; and updating and training the image-text conversion model based on the first sememe information, the sememe sequence of the first picture and the sememe sequence of the first text.

Description

Training method of image-text conversion model, information interaction method and related device
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a training method of an image-text conversion model, an information interaction method and related devices.
Background
With the rapid development of information technology, image-text conversion has become one of the key technologies in various business scenarios. A model is trained on image-text corpora; by inputting a piece of text into the trained model, the text can be quickly converted into a picture, and by inputting a picture or a video into the trained model, it can be quickly converted into text. However, the image-text conversion model in the related art cannot accurately represent the semantics of the image-text corpus, so the result obtained after converting the corpus differs greatly from the original corpus in semantics, and the accuracy is low.
Disclosure of Invention
The embodiments of the present application aim to provide a training method of an image-text conversion model, an information interaction method and related devices, which are used to improve the conversion accuracy of the image-text conversion model.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical scheme:
In a first aspect, an embodiment of the present application provides a training method of an image-text conversion model, including:
training a sememe extraction model based on a training corpus and first sememe information, wherein the training corpus comprises training text and training pictures, and the first sememe information comprises a sememe sequence of the training text and a sememe sequence of the training pictures;
performing image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus, wherein the converted corpus comprises a first picture for describing the training text and a first text for describing the training pictures;
performing sememe extraction on the converted corpus through the trained sememe extraction model to obtain a sememe sequence of the first picture and a sememe sequence of the first text;
and updating and training the image-text conversion model based on the first sememe information, the sememe sequence of the first picture and the sememe sequence of the first text.
According to the training method of the image-text conversion model provided by the embodiment of the application, training text and its corresponding training pictures are used as a training corpus; the image-text conversion model is trained with the training corpus, and a sememe extraction model is trained with the training corpus and its sememe information, so that the sememe extraction model can extract the sememes in any text or picture. Since the sememes of a text are its smallest semantic units, different texts can be distinguished essentially without being affected by context, and since the sememes of a picture are its smallest semantic units, sememes can be used to distinguish different corpora essentially from the perspective of the cognitive psychology and cognitive imagery behind language. On this basis, in the training process of the image-text conversion model, the sememe extraction model is used to extract the sememes of the converted corpus output by the image-text conversion model during training, and the image-text conversion model is then updated with the sememe information of the training corpus and the corresponding converted corpus, so that the image-text conversion model can learn, with sememes as the granularity, how to perform image-text conversion on the basis of accurately understanding the semantics of the training corpus, thereby improving the conversion accuracy of the image-text conversion model.
In a second aspect, an embodiment of the present application provides an information interaction method, including:
displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
acquiring interaction information for the virtual object, wherein the interaction information comprises at least one of the following information: target text and target picture;
performing image-text conversion on the interaction information through a target image-text conversion model to obtain a converted corpus, wherein the target image-text conversion model is obtained by training based on the training method of the image-text conversion model in the first aspect, and the converted corpus comprises at least one of the following information: a predicted picture for describing the target text, and a predicted text for describing the target picture;
and displaying the converted corpus in the scene picture.
According to the information interaction method provided by the embodiment of the application, the target image-text conversion model obtained by training with the above training method is used to perform image-text conversion on the interaction information for the virtual scene, and the result is displayed in the scene picture; for example, an input text is converted into a picture for describing the text and displayed in the scene picture, or an input picture is converted into a text for describing the picture and displayed in the scene picture. In this way, the diversity and flexibility of interaction modes in the virtual scene can be improved. In addition, the target image-text conversion model performs image-text conversion, with sememes as the granularity, on the basis of accurately understanding the interaction information, so the converted corpus can accurately represent the semantics of the interaction information, and the interaction accuracy is improved.
In a third aspect, an embodiment of the present application provides a training device for an image-text conversion model, including:
a first training unit, configured to train a sememe extraction model based on a training corpus and first sememe information, wherein the training corpus comprises training text and training pictures, and the first sememe information comprises a sememe sequence of the training text and a sememe sequence of the training pictures;
a first conversion unit, configured to perform image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus, wherein the converted corpus comprises a first picture for describing the training text and a first text for describing the training pictures;
an extraction unit, configured to perform sememe extraction on the converted corpus through the trained sememe extraction model to obtain a sememe sequence of the first picture and a sememe sequence of the first text;
and a second training unit, configured to update and train the image-text conversion model based on the first sememe information, the sememe sequence of the first picture and the sememe sequence of the first text.
In a fourth aspect, an embodiment of the present application provides an information interaction device, including:
the display unit is used for displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
a second obtaining unit, configured to obtain interaction information for the virtual object, wherein the interaction information comprises at least one of the following information: target text and target picture;
a second conversion unit, configured to perform image-text conversion on the interaction information through a target image-text conversion model to obtain a converted corpus, wherein the target image-text conversion model is obtained by training based on the training method of the image-text conversion model in the first aspect, and the converted corpus comprises at least one of the following information: a predicted picture for describing the target text, and a predicted text for describing the target picture;
the display unit is further configured to display the converted corpus in the scene picture.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the training method of the image-text conversion model according to the first aspect; alternatively, the processor is configured to execute the instructions to implement the information interaction method according to the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the training method of the image-text conversion model according to the first aspect; alternatively, the instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the information interaction method according to the second aspect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flowchart of a training method of an image-text conversion model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another training method of an image-text conversion model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of yet another training method of an image-text conversion model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a testing process of an image-text conversion model according to an embodiment of the present application;
fig. 5 is a schematic flowchart of an information interaction method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device for an image-text conversion model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an information interaction device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present application without creative effort fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and the claims are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that the data so used may be interchanged where appropriate so that the embodiments of the present application can be implemented in an order other than that illustrated or described herein. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Description of related concepts:
Cognitive linguistics: a branch of linguistics that takes second-generation cognitive science and experiential philosophy as its theoretical background. It arose in opposition to the transformational-generative grammar of mainstream linguistics and began to take shape from the late 1980s to the early 1990s.
Image (imagery): a concept in cognitive linguistic theory. Image analysis is a psychological "gestalt" analysis, that is, an analysis of meaning that emphasizes "synthesis" rather than "decomposition". It focuses on overall, comprehensive analysis, maintains the combination of subjective cognition and objective reality, and makes the analysis results more plausible. For example, "jump on horseback" carries the sense that "both the start and the end of the jump are on the horse's back", while "jump onto the horse" carries the sense that "the start of the jump is not on the horse's back, and the end of the jump is on the horse's back". Based on the progressive development of current natural language processing (Natural Language Processing, NLP) research in the fields of natural language understanding (Natural Language Understanding, NLU) and natural language generation (Natural Language Generation, NLG), it is considered that the time is ripe to recognize, display and discriminate the cognitive imagery behind language by means of NLP technology.
Sememe: a term in modern semantics referring to the smallest units of meaning (content units) in a language, also called meaning elements; a sememe is information obtained by abstracting the common essential features of things as perceived by humans. The sememes of a text are the sememes that make up the words in the text and are the smallest semantic units of the text; for example, the sememes of the word "sister" include [relative] [sibling] [senior] [female], and the sememes of the word "man" include [animal] [higher animal] [male] [adult]. A picture sememe is a minimal subgraph obtained by dividing a picture according to different positioning indexes and is the smallest semantic unit of the picture.
The image-text conversion model in the related art cannot accurately represent the semantics of the image-text corpus, so the result obtained after converting the corpus differs greatly from the original corpus in semantics, and the accuracy is low. In view of this, the embodiments of the present application provide a training method of an image-text conversion model, in which training text and its corresponding training pictures are used as a training corpus, the image-text conversion model is trained with the training corpus, and a sememe extraction model is trained with the training corpus and its sememe information, so that the sememe extraction model can extract the sememes in any text or picture. Since the sememes of a text are its smallest semantic units, different texts can be distinguished essentially without being affected by context, and since the sememes of a picture are its smallest semantic units, sememes can be used to distinguish different corpora essentially from the perspective of the cognitive psychology and cognitive imagery behind language. On this basis, in the training process of the image-text conversion model, the sememe extraction model is used to extract the sememes of the converted corpus output by the image-text conversion model during training, and the image-text conversion model is then updated with the sememe information of the training corpus and the corresponding converted corpus, so that the image-text conversion model can learn, with sememes as the granularity, how to perform image-text conversion on the basis of accurately understanding the semantics of the training corpus, thereby improving the conversion accuracy of the image-text conversion model.
On this basis, the embodiments of the present application further provide an information interaction method for a virtual scene, in which the target image-text conversion model obtained by training with the above training method is used to perform image-text conversion on interaction information for virtual objects in the virtual scene; for example, an input text is converted into a picture for describing the text and displayed in the scene picture, or an input picture is converted into a text for describing the picture and displayed in the scene picture. In this way, the diversity and flexibility of interaction modes in the virtual scene can be improved. In addition, the target image-text conversion model performs image-text conversion, with sememes as the granularity, on the basis of accurately understanding the interaction information, so the converted corpus can accurately represent the semantics of the interaction information, and the interaction accuracy is improved.
It should be understood that the training method of the image-text conversion model and the information interaction method provided in the embodiments of the present application may be executed by an electronic device or by software installed in the electronic device. The electronic device referred to herein may include a terminal device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, a smart home appliance, a smart watch, a vehicle-mounted terminal or an aircraft; alternatively, the electronic device may include a server, such as an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, which shows a schematic flowchart of a training method of an image-text conversion model according to an embodiment of the present application, the method includes the following steps:
s102, acquiring training corpus and first semaphorium information.
The training corpus comprises training texts and training pictures, the training pictures can be used for describing corresponding training texts, and the training texts can also be used for describing corresponding training pictures. In the implementation, the training texts and the training pictures contained in the training corpus can be multiple, one training text can correspond to multiple training pictures, one training picture can also correspond to multiple training texts, and the training texts corresponding to the training pictures are used for description. In practical application, the number of the training corpuses is multiple, and multiple training corpuses can be respectively obtained from the development data set and the annotation data set. The open data set refers to a corpus which is disclosed on the internet, and the labeling data set refers to a corpus which is obtained by manually labeling a task.
The first semaphorium information includes a semaphorium sequence of training text and a semaphorium sequence of training pictures. The training text semanteme sequence is obtained by arranging the semanteme of each word according to the sequence of each word in the training text. For example, the training text is composed of word 1 and word 2, the semanteme of word 1 includes [ semanteme 1] [ semanteme 2], and the semanteme of word 2 includes [ semanteme 3] [ semanteme 4], then the semanteme sequence of the training text is: [ Yisu 1] [ Yisu 2] [ Yisu 3] [ Yisu 4].
The artificial element sequence of the training picture is obtained by arranging a plurality of artificial elements in the training picture in sequence, and specifically, the artificial element sequence of the training picture can be obtained by arranging minimum subgraphs obtained by dividing pictures by different positioning indexes according to an index sequence. For example, the training picture is divided into sub-pictures 1 to 4, each sub-picture is a semanteme of the training picture, and the sub-pictures are arranged according to the index sequence, so that a semanteme sequence of the training picture can be obtained: [ sub-1 ] [ sub-2 ] [ sub-3 ] [ sub-4 ].
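As an illustration only, the following Python sketch shows how such sememe sequences might be assembled under assumed data structures; the per-word sememe annotations and the positioning indexes are hypothetical placeholders rather than data from this application.

```python
# Hypothetical per-word sememe annotations (labels are illustrative only).
WORD_SEMEMES = {
    "sister": ["relative", "sibling", "senior", "female"],
    "man": ["animal", "higher animal", "male", "adult"],
}

def text_sememe_sequence(words):
    """Concatenate the sememes of each word in the order the words appear in the text."""
    sequence = []
    for word in words:
        sequence.extend(WORD_SEMEMES.get(word, []))
    return sequence

def picture_sememe_sequence(indexed_subgraphs):
    """Arrange the minimal subgraphs of a picture by their positioning index."""
    return [subgraph for _, subgraph in sorted(indexed_subgraphs.items())]

print(text_sememe_sequence(["sister", "man"]))
print(picture_sememe_sequence({2: "subgraph 2", 1: "subgraph 1", 3: "subgraph 3"}))
```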
S104, training a sememe extraction model based on the training corpus and the first sememe information.
The sememe extraction model can be used to perform sememe extraction on any text or any picture. In S104, the training corpus may be used as samples and the sememe information of the training corpus as labels, and the sememe extraction model may be iteratively trained in a supervised manner until a preset training stop condition is satisfied. The sememe extraction model may be any pre-trained language model commonly used in the art, such as RoBERTa, which is not limited in the embodiments of the present application. The training stop condition may be set according to actual needs, for example, the number of training iterations reaching a preset number, or the extraction accuracy of the sememe extraction model exceeding a preset accuracy, which is not limited in the embodiments of the present application.
Optionally, considering that the sememes of a text differ from those of a picture and are therefore extracted in different ways, in order to accurately extract the corresponding sememes from texts and pictures and provide accurate and reliable data support for the subsequent training of the image-text conversion model, the sememe extraction model may comprise a text sememe extraction model for extracting text sememes and a picture sememe extraction model for extracting picture sememes, where the text sememe extraction model is trained based on the training text and its sememe sequence, and the picture sememe extraction model is trained based on the training pictures and their sememe sequences.
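A minimal, self-contained training sketch in Python/PyTorch is given below. It assumes the extractor is a pretrained encoder with a sememe classification head; the bag-of-characters featurizer and the toy labels are stand-ins so the sketch runs end to end, and are not part of the claimed method.

```python
import torch
from torch import nn

VOCAB, NUM_SEMEMES = 128, 16

def featurize(text):
    # Placeholder feature extraction; a real extractor would use a pretrained
    # language model such as RoBERTa instead of character counts.
    vector = torch.zeros(VOCAB)
    for ch in text:
        vector[ord(ch) % VOCAB] += 1.0
    return vector

class SememeExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(VOCAB, NUM_SEMEMES)

    def forward(self, features):
        return self.head(features)  # multi-label sememe logits

def train_extractor(model, samples, labels, steps=50, lr=1e-2):
    """Supervised training: corpus items are samples, their sememe sequences are labels."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()          # each sememe is an independent label
    for _ in range(steps):                    # stop condition: preset iteration count
        for text, label in zip(samples, labels):
            optimizer.zero_grad()
            loss = loss_fn(model(featurize(text)), label)
            loss.backward()                   # backpropagation
            optimizer.step()
    return model

# Toy corpus with hypothetical sememe indicator vectors as labels.
texts = ["sister", "man"]
labels = [torch.zeros(NUM_SEMEMES).index_fill_(0, torch.tensor([0, 1, 2, 3]), 1.0),
          torch.zeros(NUM_SEMEMES).index_fill_(0, torch.tensor([4, 5, 6, 7]), 1.0)]
train_extractor(SememeExtractor(), texts, labels)
```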
S106, performing image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus.
The converted corpus comprises a first picture for describing the training text and a first text for describing the training picture. The first picture is obtained by converting the training text with the image-text conversion model, and may specifically be a picture that the image-text conversion model generates, based on the sememe sequence of the training text, to describe the semantics of the training text. For example, if the training text is "jump on horseback", the image-text conversion model may parse the semantics of the training text based on its sememe sequence and generate a picture expressing "a person jumping on a horse's back" as the first picture.
The first text is obtained by converting the training picture with the image-text conversion model, and may specifically be a text that the image-text conversion model generates, based on the sememe sequence of the training picture, to describe the semantics of the training picture. For example, if the training picture is a picture expressing "a person jumping on a horse's back", the image-text conversion model may parse the semantics of the training picture based on its sememe sequence and generate the first text "jump on horseback".
The image-text conversion model in the embodiment of the present application may have any suitable structure, and may specifically be selected according to actual needs, which is not limited in the embodiment of the present application.
Optionally, considering that the process of generating a picture from a text differs from the process of generating a text from a picture, in order to achieve accurate conversion of the corpus, the image-text conversion model may comprise a picture generation model and a text generation model. The picture generation model is used to generate a corresponding picture based on a text, and the text generation model is used to generate a corresponding text based on a picture. Accordingly, as shown in fig. 2, the above S106 may include: S1062, generating the first picture based on the training text through the picture generation model; and S1064, generating the first text based on the training picture through the text generation model.
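The division of labor between the two sub-models can be sketched at the interface level as follows; the stub implementations below only mimic the data flow of S1062 and S1064 and are not an actual generative model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Picture:
    subgraphs: List[str]           # the picture's minimal subgraphs (its sememes)

class PictureGenerationModel:
    """Stub for S1062: generate a picture describing the input text."""
    def generate(self, text: str) -> Picture:
        return Picture(subgraphs=[f"subgraph({word})" for word in text.split()])

class TextGenerationModel:
    """Stub for S1064: generate a text describing the input picture."""
    def generate(self, picture: Picture) -> str:
        return " ".join(s[len("subgraph("):-1] for s in picture.subgraphs)

picture_model, text_model = PictureGenerationModel(), TextGenerationModel()
first_picture = picture_model.generate("jump on horseback")   # text -> first picture
first_text = text_model.generate(first_picture)               # picture -> first text
print(first_picture, first_text, sep="\n")
```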
S108, performing sememe extraction on the converted corpus through the trained sememe extraction model to obtain a sememe sequence of the first picture and a sememe sequence of the first text.
Specifically, the first text is input into the trained text sememe extraction model for sememe extraction to obtain the sememe sequence of the first text, and the first picture is input into the trained picture sememe extraction model for sememe extraction to obtain the sememe sequence of the first picture.
S110, updating and training the image-text conversion model based on the first sememe information, the sememe sequence of the first picture and the sememe sequence of the first text.
The difference between the training picture and the first picture in their sememe sequences can accurately reflect, with sememes as the granularity, the semantic difference between the training picture and the first picture, and thus the effect of the image-text conversion model in converting text into pictures; the difference between the training text and the first text in their sememe sequences can accurately reflect, with sememes as the granularity, the semantic difference between the training text and the first text, and thus the effect of the image-text conversion model in converting pictures into text. Based on this, the image-text conversion model can be updated by combining the difference between the training picture and the first picture in their sememe sequences with the difference between the training text and the first text in their sememe sequences, so that the image-text conversion model can accurately understand the respective semantics of the training text and its corresponding training picture from the supervision signals provided by the sememe information of the training corpus, and learn how to accurately convert a text into the corresponding picture and how to accurately convert a picture into the corresponding text.
In an alternative embodiment, first difference information between the sememe sequence of the first picture and the sememe sequence of the training picture, and second difference information between the sememe sequence of the first text and the sememe sequence of the training text, may be acquired respectively; then, the conversion loss of the image-text conversion model is determined based on the first difference information and the second difference information, and the image-text conversion model is updated based on this conversion loss using a back-propagation algorithm, a gradient descent algorithm or the like. The conversion loss of the image-text conversion model is used to represent the difference between the training corpus and the corresponding converted corpus.
In another alternative embodiment, where the image-text conversion model comprises a picture generation model and a text generation model, the update training of the image-text conversion model comprises update training of the picture generation model and update training of the text generation model. Specifically, as shown in fig. 2, the above S110 may include: S1102, acquiring first difference information between the sememe sequence of the first picture and the sememe sequence of the training picture, and updating and training the picture generation model based on the first difference information; and S1104, acquiring second difference information between the sememe sequence of the first text and the sememe sequence of the training text, and updating and training the text generation model based on the second difference information.
For example, for the picture generation model, a difference set between the sememe sequence of the first picture and the sememe sequence of the training picture may be taken as the first difference information; then, the conversion loss of the picture generation model is determined based on the number of sememes contained in the difference set, and the picture generation model is updated based on this conversion loss using various update methods commonly used in the art, such as a back-propagation algorithm or a gradient descent algorithm; the above process is repeated until the picture generation model satisfies a preset training stop condition. The conversion loss of the picture generation model is used to represent the semantic difference between the training picture and the first picture. The number of sememes contained in the difference set has a preset correspondence with the conversion loss of the picture generation model: the more sememes the difference set contains, the larger the conversion loss of the picture generation model; conversely, the fewer sememes it contains, the smaller the conversion loss.
It can be understood that updating and training the picture generation model in the above manner enables the picture generation model to accurately understand, with sememes as the granularity, the semantics of the training picture from the supervision signal provided by the sememe sequence of the training picture, fully learn how to accurately convert the training text into the training picture based on those semantics, and thus acquire the ability to accurately perform text-to-picture conversion.
Similarly, for the text generation model, a difference set between the sememe sequence of the first text and the sememe sequence of the training text may be taken as the second difference information; then, the conversion loss of the text generation model is determined based on the number of sememes contained in the difference set, and the text generation model is updated based on this conversion loss using various update methods commonly used in the art, such as a back-propagation algorithm or a gradient descent algorithm; the above process is repeated until the text generation model satisfies a preset training stop condition. The conversion loss of the text generation model is used to represent the semantic difference between the training text and the first text. The number of sememes contained in the difference set has a preset correspondence with the conversion loss of the text generation model: the more sememes the difference set contains, the larger the conversion loss of the text generation model; conversely, the fewer sememes it contains, the smaller the conversion loss.
It can be understood that updating and training the text generation model in the above manner enables the text generation model to accurately understand, with sememes as the granularity, the semantics of the training text from the supervision signal provided by the sememe sequence of the training text, fully learn how to accurately convert the training picture into the training text based on those semantics, and thus acquire the ability to accurately perform picture-to-text conversion.
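Under one reading of the scheme above, in which the difference set is taken as the symmetric difference of the two sememe sequences (treated as sets) and the loss grows with its size, the core computation can be sketched as follows; the mapping from the sememe count to a loss value is a hypothetical linear one, since only a preset correspondence is required.

```python
def sememe_difference(predicted_sememes, reference_sememes):
    """Difference set between the predicted and reference sememe sequences."""
    return set(predicted_sememes) ^ set(reference_sememes)   # symmetric difference

def conversion_loss(predicted_sememes, reference_sememes, scale=1.0):
    """Loss grows with the number of sememes in the difference set; an exact
    sememe match (empty difference set) gives zero loss."""
    return scale * len(sememe_difference(predicted_sememes, reference_sememes))

training_picture = ["subgraph 1", "subgraph 2", "subgraph 3"]
first_picture = ["subgraph 1", "subgraph 2", "subgraph 4"]
print(conversion_loss(first_picture, training_picture))       # -> 2.0
```

The same computation applies to the text generation model, with the sememe sequences of the first text and the training text in place of the picture sememe sequences.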
In another embodiment of the present application, in order to enable the trained image-text conversion model to fully mine, at the sememe level, the cognitive psychology and cognitive imagery behind a text, accurately understand the semantics of the text so as to accurately perform text-to-picture processing, and to fully exploit the cognitive psychology and cognitive imagery behind a picture, accurately understand the semantics of the picture so as to accurately perform picture-to-text processing, the training method provided in the embodiments of the present application further includes, after S110, testing the image-text conversion model after update training, which may specifically include testing the picture generation model after update training and testing the text generation model after update training.
In implementation, as shown in fig. 3, after S110, the method provided in the embodiment of the present application may further include:
S112, acquiring a first test corpus and second sememe information.
The first test corpus comprises a first test text and a first test picture, and the second sememe information comprises a sememe sequence of the first test text and a sememe sequence of the first test picture.
The specific implementation manner of S112 is similar to the specific implementation manner of S102, and will not be repeated.
S114, testing the picture generation model after update training based on the first test text and its sememe sequence.
In order to verify that the picture generation model after update training can accurately understand text semantics, S114 may include: converting the first test text with the picture generation model after update training to obtain a second picture for describing the first test text, and converting the second picture with the text generation model after update training to obtain a second text for describing the second picture; then, performing sememe extraction on the second text through the sememe extraction model to obtain a sememe sequence of the second text; and finally, determining a test result of the picture generation model after update training based on third difference information between the sememe sequence of the second text and the sememe sequence of the first test text.
For example, as shown in fig. 4 (a), the first test corpus includes text 1 and its corresponding picture 1. Text 1 is input into the picture generation model after update training for conversion to obtain a picture 1' for describing text 1, and picture 1' is input into the text generation model after update training for conversion to obtain a text 2 for describing picture 1'; text 2 is then input into the text sememe extraction model for sememe extraction to obtain the sememe sequence of text 2; then, the difference set between the sememe sequence of text 1 and the sememe sequence of text 2 is taken as the third difference information. The fewer sememes the difference set contains, the better the conversion effect of the picture generation model; if the number of sememes contained in the difference set is smaller than a preset number, it is determined that the picture generation model after update training passes the test; otherwise, it is determined that the picture generation model after update training fails the test.
If the test is not passed, the above S1062 to S114 are repeated until the picture generation model after update training passes the test. The picture generation model that passes the test can then be used as the target picture generation model for text-to-picture processing.
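A compact round-trip check corresponding to S114 might look like the following; the stub conversion functions and the pass threshold are placeholders used only to show the control flow.

```python
def test_picture_generation(text_to_picture, picture_to_text, extract_text_sememes,
                            test_text, test_text_sememes, preset_number=2):
    """Text -> picture -> text round trip; pass if the recovered text's sememes
    differ from the test text's sememes by fewer than preset_number items."""
    picture = text_to_picture(test_text)          # text 1  -> picture 1'
    recovered_text = picture_to_text(picture)     # picture 1' -> text 2
    difference = set(extract_text_sememes(recovered_text)) ^ set(test_text_sememes)
    return len(difference) < preset_number

# Identity-like stubs so the sketch runs; real models come from the training above.
passed = test_picture_generation(
    text_to_picture=lambda text: text.split(),    # "picture" = list of subgraph tags
    picture_to_text=lambda picture: " ".join(picture),
    extract_text_sememes=lambda text: text.split(),
    test_text="jump on horseback",
    test_text_sememes=["jump", "on", "horseback"],
)
print(passed)  # -> True
```

The test for the text generation model in S116 below is symmetric, with the picture sememe extraction model applied to the round-tripped picture.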
S116, testing the text generation model after update training based on the first test picture and its sememe sequence.
In order to verify that the text generation model after update training can accurately understand picture semantics, S116 may include: converting the first test picture with the text generation model after update training to obtain a third text for describing the first test picture, and converting the third text with the picture generation model after update training to obtain a third picture for describing the third text; then, performing sememe extraction on the third picture through the sememe extraction model to obtain a sememe sequence of the third picture; and finally, determining a test result of the text generation model after update training based on fourth difference information between the sememe sequence of the third picture and the sememe sequence of the first test picture.
For example, as shown in fig. 4 (b), picture 1 is input into the text generation model after update training for conversion to obtain a text 1' for describing picture 1, and text 1' is then input into the picture generation model after update training for conversion to obtain a picture 2 for describing text 1'; picture 2 is then input into the picture sememe extraction model for sememe extraction to obtain the sememe sequence of picture 2; then, the difference set between the sememe sequence of picture 1 and the sememe sequence of picture 2 is taken as the fourth difference information. The fewer sememes the difference set contains, the better the conversion effect of the text generation model; if the number of sememes contained in the difference set is smaller than a preset number, it is determined that the text generation model after update training passes the test; otherwise, it is determined that the text generation model after update training fails the test.
If the test is not passed, the above S1064 to S116 are repeated until the text generation model after update training passes the test. Further, the text generation model that passes the test can be used as the target text generation model for picture-to-text processing.
In another embodiment of the present application, considering that the matching degree between the training text and the training picture in the training corpus affects the training effect of the image-text conversion model, in order to introduce the influence of the matching degree on image-text conversion accuracy into the training process and thereby further improve the conversion accuracy of the trained image-text conversion model, there may be multiple image-text conversion models with the same structure; multiple training corpora with different degrees of image-text matching are used to train these models, and the best one is selected from the multiple trained image-text conversion models.
In specific implementation, the training corpus and the first sememe information comprise multiple training corpora and the first sememe information of each training corpus, where the multiple training corpora correspond one-to-one to the multiple image-text conversion models, that is, each training corpus is used to train one image-text conversion model, and the matching degree between the training text and the training pictures differs across training corpora. Correspondingly, in the above S106, for each training corpus, the training corpus is converted with the image-text conversion model corresponding to that training corpus to obtain the converted corpus corresponding to that training corpus. Correspondingly, in the above S110, for each training corpus, the image-text conversion model corresponding to that training corpus is updated and trained based on the first sememe information of the training corpus and the sememe sequence of the first picture and the sememe sequence of the first text in the corresponding converted corpus. Further, a second test corpus and third sememe information are acquired, where the second test corpus comprises a second test text and a second test picture corresponding to the second test text, and the third sememe information comprises a sememe sequence of the second test text and a sememe sequence of the second test picture; the multiple image-text conversion models after update training are tested based on the second test corpus and the third sememe information, and a target image-text conversion model for image-text conversion is determined from the multiple image-text conversion models after update training based on the test results.
It is worth noting that the specific implementation of testing the multiple image-text conversion models after update training based on the second test corpus and its sememe information is similar to the specific implementation of S114 to S116 above and is not repeated.
In the embodiments of the present application, the matching degree between the training text and the training picture in each training corpus is used to represent how semantically similar the training text and the training picture are. The higher the matching degree between the training text and the training picture, the more similar they are in semantics.
In practical applications, each training corpus is annotated with the mismatched items between the training text and the training picture, and the matching degree between them can be determined based on the number of mismatched items. For example, if the number of mismatched items between the training text and the corresponding training picture is smaller than a first preset number (for example, 1), it can be determined that the matching degree between them is high; if the number of mismatched items is greater than a second preset number (for example, 2), it can be determined that the matching degree is low; and if the number of mismatched items is between the first preset number and the second preset number, it can be determined that the matching degree is medium.
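A possible bucketing of the mismatch count into a matching degree, with the preset numbers treated as tunable thresholds, is sketched below.

```python
def matching_degree(num_mismatched_items, first_preset=1, second_preset=2):
    """Map the number of mismatched items between a training text and its
    training picture to a coarse matching degree."""
    if num_mismatched_items < first_preset:
        return "high"
    if num_mismatched_items > second_preset:
        return "low"
    return "medium"

print(matching_degree(0), matching_degree(1), matching_degree(3))  # high medium low
```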
For example, suppose the multiple training corpora include training corpus 1, training corpus 2 and training corpus 3, whose matching degrees are high, medium and low respectively, and the image-text conversion models include image-text conversion model 1, image-text conversion model 2 and image-text conversion model 3. First, according to the training method described above, image-text conversion model 1 is trained based on training corpus 1 and its sememe information, image-text conversion model 2 is trained based on training corpus 2 and its sememe information, and image-text conversion model 3 is trained based on training corpus 3 and its sememe information; then, each image-text conversion model after update training is tested with the second test corpus and its sememe information to obtain a test result, which can indicate whether the test is passed and the difference information between the sememe sequences of the second test corpus before and after conversion; further, the model with the smallest difference information is selected from all the image-text conversion models after update training as the target image-text conversion model.
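The selection step can be sketched as choosing the candidate whose converted second test corpus deviates least, in sememes, from the reference; the inputs below are hypothetical precomputed sememe sequences.

```python
def select_target_model(models, reference_sememes, converted_sememes_per_model):
    """Return the trained model with the smallest sememe difference on the second test corpus."""
    def difference_size(converted):
        return len(set(converted) ^ set(reference_sememes))
    return min(zip(models, converted_sememes_per_model),
               key=lambda pair: difference_size(pair[1]))[0]

best = select_target_model(
    models=["model 1", "model 2", "model 3"],
    reference_sememes=["s1", "s2", "s3"],
    converted_sememes_per_model=[["s1", "s2"], ["s1", "s2", "s3"], ["s4"]],
)
print(best)  # -> model 2
```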
According to the training method of the image-text conversion model provided by the embodiments of the present application, training text and training pictures are used as a training corpus, the image-text conversion model is trained with the training corpus, and a sememe extraction model is trained with the training corpus and its sememe information, so that the sememe extraction model can extract the sememes in any text or picture. Since the sememes of a text are its smallest semantic units, different texts can be distinguished essentially without being affected by context, and since the sememes of a picture are its smallest semantic units, sememes can be used to distinguish different corpora essentially from the perspective of the cognitive psychology and cognitive imagery behind language. On this basis, in the training process of the image-text conversion model, the sememe extraction model is used to extract the sememes of the converted corpus output by the image-text conversion model during training, and the image-text conversion model is then updated with the sememe information of the training corpus and the corresponding converted corpus, so that the image-text conversion model can learn, with sememes as the granularity, how to perform image-text conversion on the basis of accurately understanding the semantics of the training corpus, thereby improving the conversion accuracy of the image-text conversion model.
The image-text conversion model obtained by training can be applied to various business scenarios requiring image-text conversion, such as information interaction in a virtual scene, financial business handling scenarios, multi-party dialogue scenarios and the like, which is not limited in the embodiments of the present application. Information interaction in a virtual scene is described below as an example.
Referring to fig. 5, a flow chart of an information interaction method provided in an embodiment of the present application may include the following steps:
s502, displaying a scene picture of the virtual scene.
The virtual scene is a digital scene that can be constructed by the terminal device through virtual network technology, virtual reality technology, digital communication technology and the like. The virtual scene contains virtual objects, which may include, for example but not limited to, virtual characters, virtual items and the like. The virtual scene may be a virtual reality scene simulating the real world, an augmented reality scene, or the like.
S504, interaction information for the virtual object is acquired.
The interaction information includes at least one of the following information: target text and target picture.
S506, performing image-text conversion on the interaction information through the target image-text conversion model to obtain a converted corpus.
The target image-text conversion model is obtained by training based on the training method of the image-text conversion model provided in the embodiments of the present application. The converted corpus includes at least one of the following information: a predicted picture for describing the target text, and a predicted text for describing the target picture.
In specific implementation, the target image-text conversion model comprises a target text generation model and a target picture generation model. The target picture is input into the target text generation model to obtain the corresponding predicted text, and the target text is input into the target picture generation model to obtain the corresponding predicted picture.
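The routing of interaction information to the two sub-models (S506) can be sketched as follows; the lambda stand-ins only illustrate the data flow and are not the real models.

```python
def convert_interaction_info(info, target_picture_model, target_text_model):
    """Send target text to the target picture generation model and target
    pictures to the target text generation model."""
    converted = {}
    if "target_text" in info:
        converted["predicted_picture"] = target_picture_model(info["target_text"])
    if "target_picture" in info:
        converted["predicted_text"] = target_text_model(info["target_picture"])
    return converted

result = convert_interaction_info(
    {"target_text": "jump on horseback"},
    target_picture_model=lambda text: f"<picture describing: {text}>",
    target_text_model=lambda picture: "<text describing the picture>",
)
print(result)
```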
S508, displaying the converted corpus in the scene picture.
For example, the interaction information for the virtual object includes the text "jump on horseback"; after the text is converted by the target image-text conversion model, a picture expressing "a person jumping on a horse's back" is obtained and displayed above the virtual object in the scene picture. For another example, the interaction information for the virtual object includes a picture expressing "a person jumping on a horse's back"; after the picture is converted by the target image-text conversion model, the text "jump on horseback" is obtained and displayed above the virtual object in the scene picture.
In the embodiments of the present application, different interaction information can be acquired for different virtual scenes and displayed in the scene picture after conversion. In a virtual dialogue scene, the virtual objects include at least two virtual characters corresponding to the dialogue parties, and the dialogue parties can interact in the virtual scene through the virtual characters.
In an alternative implementation, the above S504 includes: acquiring real-time dialogue speech of the dialogue party corresponding to any virtual character, and performing speech recognition on the real-time dialogue speech to obtain the target text; and the above S508 includes: displaying the predicted picture at a first designated position of that virtual character.
In practical applications, the real-time dialogue speech of a dialogue party is collected in real time by the terminal device of that dialogue party, and the real-time dialogue speech is converted into the target text through automatic speech recognition (Automatic Speech Recognition, ASR). In addition, the first designated position may be set according to actual needs, for example, above, below, to the left of or to the right of the virtual character, which is not limited in the embodiments of the present application.
It can be understood that the real-time dialogue speech of each dialogue party is converted into the corresponding text, which is then converted into a picture by the target image-text conversion model and displayed at the first designated position of the corresponding virtual character, so that the dialogue parties can interact in the virtual scene by voice, which improves the flexibility of interaction; in addition, the semantics expressed by each dialogue party are displayed visually, which increases the interest of the dialogue and also helps the dialogue parties review and reflect on the content they have expressed.
In another alternative implementation, the above S504 includes: acquiring a real-time video stream of the dialogue party corresponding to any virtual character, and performing framing processing on the real-time video stream to obtain multiple video frames arranged in time order as the target pictures; and the above S508 includes: displaying the predicted text at a second designated position of that virtual character.
In practical applications, the real-time video stream of a dialogue party is obtained by the terminal device of that dialogue party filming the dialogue party; multiple video frames are obtained after framing the real-time video stream, and the predicted text is obtained after the multiple video frames are converted by the target image-text conversion model. In addition, the second designated position may be set according to actual needs, for example, above, below, to the left of or to the right of the virtual character, which is not limited in the embodiments of the present application.
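One possible way to frame a video stream into time-ordered target pictures, using OpenCV as an assumed implementation choice (the sampling interval and frame budget are illustrative), is sketched below.

```python
import cv2

def frames_from_stream(source, every_nth=5, max_frames=16):
    """Read a (live or recorded) video stream and keep every n-th frame,
    in timestamp order, for use as target pictures."""
    capture = cv2.VideoCapture(source)   # camera index or stream URL/path
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = capture.read()
        if not ok:
            break                        # stream ended or read failed
        if index % every_nth == 0:
            frames.append(frame)         # frames stay in time order
        index += 1
    capture.release()
    return frames
```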
It can be understood that the real-time video stream of each dialogue party is processed into corresponding pictures, which are then converted into text by the target image-text conversion model and displayed at the second designated position of the corresponding virtual character, so that the dialogue parties can interact in the virtual scene through body movements, which improves the flexibility of interaction; in addition, the semantics expressed by each dialogue party are displayed visually, which increases the interest of the dialogue and also helps the dialogue parties review and reflect on the content they have expressed.
Optionally, after performing frame division processing on the real-time video stream to obtain a plurality of video frames based on time sequence arrangement, the information interaction method provided in the embodiment of the application may further include: analyzing the plurality of video frames to obtain limb action information of a dialogue party corresponding to the virtual character; and controlling any virtual character to execute corresponding actions based on the limb action information.
In practical application, various image processing technologies commonly used in the art can be adopted to detect and analyze each video frame, the limb action of the dialogue party is determined based on the analysis result of each video frame, and the virtual character corresponding to the dialogue party is then controlled to execute the limb action. In this way, the limb actions of the dialogue party are visually displayed, which increases the interest of the dialogue, facilitates the dialogue party's monitoring and review of its own limb actions, and enhances the other dialogue parties' understanding of the semantics expressed by that party.
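By way of illustration only, the limb-action analysis can be sketched as follows. The callables pose_estimator.detect and action_classifier are hypothetical placeholders; any pose estimation or action recognition method commonly used in the art may be substituted.

def analyse_limb_action(pose_estimator, action_classifier, frames):
    """Derive limb action information from a list of video frames arranged in time sequence."""
    # Per-frame body keypoints produced by a placeholder pose estimator.
    keypoint_sequence = [pose_estimator.detect(frame) for frame in frames]
    # A placeholder recognizer maps the keypoint sequence to an action label, e.g. "wave" or "nod".
    limb_action = action_classifier(keypoint_sequence)
    return limb_action

The returned limb action information can then be used to control the corresponding virtual character to execute the same action in the scene picture.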
According to the information interaction method provided by the embodiment of the application, the target image-text conversion model obtained through training by the training method is utilized to carry out image-text conversion on interaction information of virtual objects in a virtual scene and then display the interaction information in a scene picture, for example, an input text is converted into a picture for describing the text and is displayed in the scene picture, or the input picture is converted into a text for describing the picture and is displayed in the scene picture. Therefore, the diversity and flexibility of interaction modes in the virtual scene can be improved; in addition, the target image-text conversion model takes the semanteme as granularity, and performs image-text conversion on the basis of accurately understanding the interaction information, so that the corpus obtained by conversion can accurately represent the semantics of the interaction information, and the interaction accuracy is improved.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiment of the application also provides a training device of the image-text conversion model. Referring to fig. 6, which is a schematic structural diagram of a training device 600 for a graphic conversion model according to an embodiment of the present application, the device 600 may include:
a first training unit 610, configured to train a semanteme extraction model based on a training corpus and first semanteme information, where the training corpus includes training text and training pictures, and the first semanteme information includes the semanteme sequence of the training text and the semanteme sequence of the training pictures;
a first conversion unit 620, configured to perform image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus, where the converted corpus includes a first picture for describing the training text and a first text for describing the training picture;
an extracting unit 630, configured to perform semanteme extraction on the converted corpus through the trained semanteme extraction model, so as to obtain the semanteme sequence of the first picture and the semanteme sequence of the first text;
and a second training unit 640, configured to update and train the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture, and the semanteme sequence of the first text.
Optionally, the image-text conversion model includes a picture generation model and a text generation model, the first picture is generated by the picture generation model based on the training text, and the first text is generated by the text generation model based on the training picture;
the second training unit 640 is specifically configured to: acquire first difference information between the semanteme sequence of the first picture and the semanteme sequence of the training picture, and update and train the picture generation model based on the first difference information; and acquire second difference information between the first text and the training text, and update and train the text generation model based on the second difference information.
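By way of illustration only, one update-training step can be sketched as follows, assuming the image-text conversion model is implemented with PyTorch, the semanteme sequences are available as dense tensors that remain differentiable with respect to the corresponding generator, and the difference information is measured by mean squared error; these are assumptions made for the example rather than requirements of the application.

import torch.nn.functional as F

def update_step(picture_opt, text_opt,
                seq_first_picture, seq_training_picture,
                seq_first_text, seq_training_text):
    """One update-training step driven by the first and second difference information."""
    # First difference information: gap between the first picture's and the training picture's
    # semanteme sequences; used to update the picture generation model via backpropagation.
    first_difference = F.mse_loss(seq_first_picture, seq_training_picture)
    picture_opt.zero_grad()
    first_difference.backward()
    picture_opt.step()
    # Second difference information: gap between the first text and the training text,
    # approximated here on their semanteme-sequence tensors; updates the text generation model.
    second_difference = F.mse_loss(seq_first_text, seq_training_text)
    text_opt.zero_grad()
    second_difference.backward()
    text_opt.step()
    return first_difference.item(), second_difference.item()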
Optionally, the apparatus 600 further comprises a test unit;
the test unit is used for: after the second training unit performs update training on the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text, acquiring a first test corpus and second semanteme information, wherein the first test corpus comprises a first test text and a first test picture, and the second semanteme information comprises the semanteme sequence of the first test text and the semanteme sequence of the first test picture; testing the updated and trained picture generation model based on the first test text and its semanteme sequence, and if the test fails, repeating the steps from acquiring the first difference information between the semanteme sequence of the first picture and the semanteme sequence of the training picture until the updated and trained picture generation model passes the test; and testing the updated and trained text generation model based on the first test picture and its semanteme sequence, and if the test fails, repeating the steps from acquiring the second difference information between the semanteme sequence of the first text and the semanteme sequence of the training text until the updated and trained text generation model passes the test.
Optionally, the test unit is specifically configured to: convert the first test text through the updated and trained picture generation model to obtain a second picture for describing the first test text, and convert the second picture through the updated and trained text generation model to obtain a second text for describing the second picture; perform semanteme extraction on the second text through the trained semanteme extraction model to obtain the semanteme sequence of the second text; and determine a test result of the updated and trained picture generation model based on third difference information between the semanteme sequence of the second text and the semanteme sequence of the first test text.
Optionally, the test unit is specifically configured to: convert the first test picture through the updated and trained text generation model to obtain a third text for describing the first test picture, and convert the third text through the updated and trained picture generation model to obtain a third picture for describing the third text; perform semanteme extraction on the third picture through the trained semanteme extraction model to obtain the semanteme sequence of the third picture; and determine a test result of the updated and trained text generation model based on fourth difference information between the semanteme sequence of the third picture and the semanteme sequence of the first test picture.
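By way of illustration only, the round-trip test of the updated and trained picture generation model can be sketched as follows; the cosine-similarity criterion and the 0.9 threshold are illustrative assumptions for deciding whether the test passes, and the three callables are placeholders for the corresponding models.

import torch.nn.functional as F

def test_picture_generator(picture_generator, text_generator, semanteme_extractor,
                           first_test_text, seq_first_test_text, threshold=0.9):
    """Text -> second picture -> second text -> semanteme sequence, then compare."""
    second_picture = picture_generator(first_test_text)   # picture describing the first test text
    second_text = text_generator(second_picture)           # text describing the second picture
    seq_second_text = semanteme_extractor(second_text)     # semanteme sequence of the second text
    # Third difference information, expressed here as a cosine similarity between the
    # second text's and the first test text's semanteme-sequence tensors.
    similarity = F.cosine_similarity(seq_second_text, seq_first_test_text, dim=-1).mean()
    return bool(similarity >= threshold)                    # True means the model passes the test

The symmetric test of the updated and trained text generation model follows the same pattern with the first test picture and the fourth difference information.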
Optionally, the number of the image-text conversion models is multiple, and the structures of the image-text conversion models are the same; the training corpus and the first semanteme information respectively comprise a plurality of training corpuses and the first semanteme information of each training corpus, wherein the plurality of training corpuses correspond one-to-one to the plurality of image-text conversion models, and the matching degree between the training text and the training picture in different training corpuses is different;
the first conversion unit is specifically configured to: for each training corpus, convert the training corpus through the image-text conversion model corresponding to the training corpus to obtain the converted corpus corresponding to the training corpus;
the second training unit is specifically configured to: for each training corpus, update and train the image-text conversion model corresponding to the training corpus based on the first semanteme information of the training corpus as well as the semanteme sequence of the first picture and the semanteme sequence of the first text in the converted corpus corresponding to the training corpus; acquire a second test corpus and third semanteme information, wherein the second test corpus comprises a second test text and a corresponding second test picture, and the third semanteme information comprises the semanteme sequence of the second test text and the semanteme sequence of the second test picture; test the plurality of updated and trained image-text conversion models based on the second test corpus and the third semanteme information; and determine, based on the test results, a target image-text conversion model for image-text conversion from the plurality of updated and trained image-text conversion models.
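By way of illustration only, selecting the target image-text conversion model from the updated and trained models can be sketched as follows; score_model is a hypothetical placeholder that evaluates one model on the second test corpus against the third semanteme information and returns a score where higher means better.

def select_target_model(models, second_test_corpus, third_semanteme_info, score_model):
    """Pick the target image-text conversion model according to the test results."""
    scored = [(score_model(m, second_test_corpus, third_semanteme_info), m) for m in models]
    best_score, target_model = max(scored, key=lambda pair: pair[0])  # highest test score wins
    return target_model, best_score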
Optionally, the semanteme extraction model includes a text semanteme extraction model and a picture semanteme extraction model, the text semanteme extraction model is obtained by training based on the training text and the semanteme sequence thereof, and the picture semanteme extraction model is obtained by training based on the training picture and the semanteme sequence thereof.
Obviously, the training device for the image-text conversion model provided by the embodiment of the application can serve as the execution body of the training method for the image-text conversion model shown in fig. 1, and can therefore realize the functions of that training method. Since the principle is the same, details are not repeated here.
The embodiment of the application also provides an information interaction device. Referring to fig. 7, which is a schematic structural diagram of an information interaction device 700 provided in an embodiment of the present application, the device 700 may include:
a display unit 710, configured to display a scene picture of a virtual scene, where the virtual scene includes a virtual object;
a second obtaining unit 720, configured to obtain interaction information for the virtual object, where the interaction information includes at least one of the following information: target text and target picture;
the second converting unit 730 is configured to perform an image-text conversion on the interaction information through a target image-text conversion model, so as to obtain a converted corpus, where the target image-text conversion model is obtained by training based on the training method of the image-text conversion model according to the embodiment of the present application, and the converted corpus includes at least one of the following information: a predicted picture for describing the target text, a predicted text for describing the target picture;
The display unit 710 is further configured to display the converted corpus in the scene frame.
Optionally, the virtual object includes virtual characters corresponding to at least two dialogue parties respectively;
the second obtaining unit is specifically configured to: acquiring real-time dialogue voice of a dialogue party corresponding to any virtual character; performing voice recognition on the real-time dialogue voice to obtain the target text;
the display unit is specifically configured to: and displaying the predicted picture at the first appointed position of any virtual character.
Optionally, the virtual object includes virtual characters corresponding to at least two dialogue parties respectively;
the second obtaining unit is specifically configured to: acquiring a real-time video stream of a dialogue party corresponding to any virtual character; carrying out framing treatment on the real-time video stream to obtain a plurality of video frames based on time sequence arrangement as the target picture;
the display unit is specifically configured to: and displaying the predicted text at a second appointed position of any virtual character.
Optionally, the apparatus further comprises:
the analyzing unit is used for analyzing the plurality of video frames after the second obtaining unit carries out frame division processing on the real-time video stream to obtain a plurality of video frames based on time sequence arrangement so as to obtain limb action information of a conversation party corresponding to any virtual character;
And the control unit is used for controlling any virtual character to execute corresponding actions based on the limb action information.
Obviously, the information interaction device provided in the embodiment of the present application can serve as the execution body of the information interaction method shown in fig. 5, and can therefore realize the functions of that method. Since the principle is the same, details are not repeated here.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 8, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. The buses may be classified into address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer-operating instructions. The memory may include memory and non-volatile storage and provide instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to form the training device of the graphic conversion model on the logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
training a semanteme extraction model based on a training corpus and first semanteme information, wherein the training corpus comprises training text and training pictures, and the first semanteme information comprises the semanteme sequence of the training text and the semanteme sequence of the training pictures;
performing image-text conversion on the training corpus through an image-text conversion model to obtain conversion corpus, wherein the conversion corpus comprises a first picture for describing the training text and a first text for describing the training picture;
performing the semanteme extraction on the converted corpus through the trained semanteme extraction model to obtain a semanteme sequence of the first picture and a semanteme sequence of the first text;
and updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text.
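By way of illustration only, the four operations above can be combined into the following end-to-end sketch, in which each callable is a placeholder for the corresponding model or step and the semanteme extraction model is assumed to have already been trained on the training corpus and the first semanteme information.

def train_image_text_conversion(semanteme_extractor, picture_generator, text_generator,
                                training_text, training_picture,
                                seq_training_text, seq_training_picture, update_fn):
    """One pass of the training flow executed by the processor."""
    # Image-text conversion of the training corpus into the converted corpus.
    first_picture = picture_generator(training_text)      # first picture describing the training text
    first_text = text_generator(training_picture)         # first text describing the training picture
    # Semanteme extraction on the converted corpus through the trained semanteme extraction model.
    seq_first_picture = semanteme_extractor(first_picture)
    seq_first_text = semanteme_extractor(first_text)
    # Update-train the image-text conversion model against the first semanteme information;
    # update_fn is a placeholder for whatever update rule is used (e.g. the earlier sketch).
    return update_fn(seq_first_picture, seq_training_picture, seq_first_text, seq_training_text)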
Alternatively, the processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the information interaction device on the logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
acquiring interaction information aiming at the virtual object, wherein the interaction information comprises at least one of the following information: target text and target picture;
performing image-text conversion on the interaction information through a target image-text conversion model to obtain conversion corpus, wherein the target image-text conversion model is obtained by training based on the image-text conversion model training method in the embodiment of the application, and the conversion corpus comprises at least one of the following information: a predicted picture for describing the target text, a predicted text for describing the target picture;
and displaying the conversion corpus in the scene picture.
The method executed by the training device of the graphic conversion model disclosed in the embodiment shown in fig. 1 of the present application, or the method executed by the information interaction device disclosed in the embodiment shown in fig. 5 of the present application, may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The electronic device may further execute the method of fig. 1 and implement the functions of the training device of the image-text conversion model in the embodiment shown in fig. 1 to 4, or the electronic device may further execute the method of fig. 5 and implement the functions of the information interaction device in the embodiment shown in fig. 5, which are not described herein.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flow is not limited to each logic unit, but may be hardware or a logic device.
The present embodiments also provide a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to:
training a semanteme extraction model based on a training corpus and first semanteme information, wherein the training corpus comprises training text and training pictures, and the first semanteme information comprises the semanteme sequence of the training text and the semanteme sequence of the training pictures;
Performing image-text conversion on the training corpus through an image-text conversion model to obtain conversion corpus, wherein the conversion corpus comprises a first picture for describing the training text and a first text for describing the training picture;
performing the semanteme extraction on the converted corpus through the trained semanteme extraction model to obtain a semanteme sequence of the first picture and a semanteme sequence of the first text;
and updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text.
Alternatively, the instructions, when executed by a portable electronic device comprising a plurality of applications, enable the portable electronic device to perform the method of the embodiment shown in fig. 5, and in particular to:
displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
acquiring interaction information aiming at the virtual object, wherein the interaction information comprises at least one of the following information: target text and target picture;
performing image-text conversion on the interaction information through a target image-text conversion model to obtain conversion corpus, wherein the target image-text conversion model is obtained by training based on the image-text conversion model training method in the embodiment of the application, and the conversion corpus comprises at least one of the following information: a predicted picture for describing the target text, a predicted text for describing the target picture;
And displaying the conversion corpus in the scene picture.
In summary, the foregoing description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (12)

1. The training method of the image-text conversion model is characterized by comprising the following steps of:
training a semanteme extraction model based on a training corpus and first semanteme information, wherein the training corpus comprises training texts and training pictures, and the first semanteme information comprises semanteme sequences of the training texts and semanteme sequences of the training pictures;
Performing image-text conversion on the training corpus through an image-text conversion model to obtain conversion corpus, wherein the conversion corpus comprises a first picture for describing the training text and a first text for describing the training picture;
performing the semanteme extraction on the converted corpus through the trained semanteme extraction model to obtain a semanteme sequence of the first picture and a semanteme sequence of the first text;
and updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text.
2. The method of claim 1, wherein the image-text conversion model comprises a picture generation model and a text generation model, the first picture being generated by the picture generation model based on the training text, and the first text being generated by the text generation model based on the training picture;
the updating training of the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text comprises the following steps:
acquiring first difference information between the semanteme sequence of the first picture and the semanteme sequence of the training picture, and updating and training the picture generation model based on the first difference information;
And acquiring second difference information between the first text and the training text, and updating and training the text generation model based on the second difference information.
3. The method of claim 2, wherein after updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text, the method further comprises:
acquiring a first test corpus and second semanteme information, wherein the first test corpus comprises a first test text and a first test picture, and the second semanteme information comprises a semanteme sequence of the first test text and a semanteme sequence of the first test picture;
testing the updated and trained picture generation model based on the first test text and its semanteme sequence, and if the test fails, repeating the steps from acquiring the first difference information between the semanteme sequence of the first picture and the semanteme sequence of the training picture until the updated and trained picture generation model passes the test;
and testing the updated and trained text generation model based on the first test picture and its semanteme sequence, and if the test fails, repeating the steps from acquiring the second difference information between the semanteme sequence of the first text and the semanteme sequence of the training text until the updated and trained text generation model passes the test.
4. The method of claim 3, wherein the testing the updated and trained picture generation model based on the first test text and its semanteme sequence comprises:
converting the first test text through the updated and trained picture generation model to obtain a second picture for describing the first test text, and converting the second picture through the updated and trained text generation model to obtain a second text for describing the second picture;
performing semanteme extraction on the second text through the trained semanteme extraction model to obtain the semanteme sequence of the second text;
and determining a test result of the updated and trained picture generation model based on third difference information between the semanteme sequence of the second text and the semanteme sequence of the first test text.
5. The method of claim 3, wherein the testing the updated and trained text generation model based on the first test picture and its semanteme sequence comprises:
converting the first test picture through the updated and trained text generation model to obtain a third text for describing the first test picture, and converting the third text through the updated and trained picture generation model to obtain a third picture for describing the third text;
performing semanteme extraction on the third picture through the trained semanteme extraction model to obtain the semanteme sequence of the third picture;
and determining a test result of the updated and trained text generation model based on fourth difference information between the semanteme sequence of the third picture and the semanteme sequence of the first test picture.
6. The method according to claim 1, wherein the number of the image-text conversion models is multiple, and the structures of the image-text conversion models are the same; the training corpus and the first semanteme information respectively comprise a plurality of training corpuses and the first semanteme information of each training corpus, wherein the plurality of training corpuses correspond one-to-one to the plurality of image-text conversion models, and the matching degree between the training text and the training picture in different training corpuses is different;
the performing image-text conversion on the training corpus through an image-text conversion model to obtain a converted corpus comprises:
for each training corpus, converting the training corpus through the image-text conversion model corresponding to the training corpus to obtain the converted corpus corresponding to the training corpus;
the updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text comprises:
for each training corpus, updating and training the image-text conversion model corresponding to the training corpus based on the first semanteme information of the training corpus as well as the semanteme sequence of the first picture and the semanteme sequence of the first text in the converted corpus corresponding to the training corpus;
acquiring a second test corpus and third semanteme information, wherein the second test corpus comprises a second test text and a second test picture, and the third semanteme information comprises a semanteme sequence of the second test text and a semanteme sequence of the second test picture;
testing the plurality of updated and trained image-text conversion models based on the second test corpus and the third semanteme information;
and determining, based on the test results, a target image-text conversion model for image-text conversion from the plurality of updated and trained image-text conversion models.
7. The method according to any one of claims 1 to 6, wherein the semanteme extraction model comprises a text semanteme extraction model and a picture semanteme extraction model, the text semanteme extraction model being trained based on the training text and its semanteme sequence, the picture semanteme extraction model being trained based on the training picture and its semanteme sequence.
8. An information interaction method, comprising:
displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
acquiring interaction information aiming at the virtual object, wherein the interaction information comprises at least one of the following information: target text and target picture;
performing image-text conversion on the interactive information through a target image-text conversion model to obtain conversion corpus, wherein the target image-text conversion model is obtained by training based on the training method of the image-text conversion model according to any one of claims 1 to 7, and the conversion corpus comprises at least one of the following information: a predicted picture for describing the target text, a predicted text for describing the target picture;
And displaying the conversion corpus in the scene picture.
9. A training device for a graphic conversion model, comprising:
a first training unit, configured to train a semanteme extraction model based on a training corpus and first semanteme information, wherein the training corpus comprises training texts and training pictures, and the first semanteme information comprises semanteme sequences of the training texts and semanteme sequences of the training pictures;
the first conversion unit is used for carrying out image-text conversion on the training corpus through an image-text conversion model to obtain conversion corpus, wherein the conversion corpus comprises a first picture used for describing the training text and a first text used for describing the training picture;
the extraction unit is used for carrying out the semanteme extraction on the converted corpus through the trained semanteme extraction model to obtain the semanteme sequence of the first picture and the semanteme sequence of the first text;
and the second training unit is used for updating and training the image-text conversion model based on the first semanteme information, the semanteme sequence of the first picture and the semanteme sequence of the first text.
10. An information interaction device, comprising:
The display unit is used for displaying a scene picture of a virtual scene, wherein the virtual scene comprises a virtual object;
a second obtaining unit, configured to obtain interaction information for the virtual object, where the interaction information includes at least one of the following information: target text and target picture;
the second conversion unit is configured to perform image-text conversion on the interaction information through a target image-text conversion model, so as to obtain a conversion corpus, where the target image-text conversion model is obtained by training based on the training method of the image-text conversion model according to any one of claims 1 to 7, and the conversion corpus includes at least one of the following information: a predicted picture for describing the target text, a predicted text for describing the target picture;
the display unit is further configured to display the converted corpus in the scene picture.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the graphic conversion model as claimed in any one of claims 1 to 7; alternatively, the processor is configured to execute the instructions to implement the information interaction method of claim 8.
12. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the training method of the graphic conversion model according to any one of claims 1 to 7; alternatively, the instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the information interaction method of claim 8.
CN202310965865.0A 2023-07-31 2023-07-31 Training method of graphic conversion model, information interaction method and related equipment Pending CN117493602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310965865.0A CN117493602A (en) 2023-07-31 2023-07-31 Training method of graphic conversion model, information interaction method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310965865.0A CN117493602A (en) 2023-07-31 2023-07-31 Training method of graphic conversion model, information interaction method and related equipment

Publications (1)

Publication Number Publication Date
CN117493602A true CN117493602A (en) 2024-02-02

Family

ID=89671449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310965865.0A Pending CN117493602A (en) 2023-07-31 2023-07-31 Training method of graphic conversion model, information interaction method and related equipment

Country Status (1)

Country Link
CN (1) CN117493602A (en)

Similar Documents

Publication Publication Date Title
US10824874B2 (en) Method and apparatus for processing video
CN108536681B (en) Intelligent question-answering method, device, equipment and storage medium based on emotion analysis
US20230103340A1 (en) Information generating method and apparatus, device, storage medium, and program product
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
JP2023535709A (en) Language expression model system, pre-training method, device, device and medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN114339450B (en) Video comment generation method, system, device and storage medium
CN116824278B (en) Image content analysis method, device, equipment and medium
JP7355865B2 (en) Video processing methods, apparatus, devices and storage media
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113704460A (en) Text classification method and device, electronic equipment and storage medium
US11328127B2 (en) Apparatus and method for providing shared contents based on emoticon grammar for NLP on open user participation platform for AI answer dictionary and data set preprocessing
CN113516972B (en) Speech recognition method, device, computer equipment and storage medium
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN116662496A (en) Information extraction method, and method and device for training question-answering processing model
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN116682110A (en) Image processing method, device, equipment and medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN117493602A (en) Training method of graphic conversion model, information interaction method and related equipment
CN112749553B (en) Text information processing method and device for video file and server
CN115525740A (en) Method and device for generating dialogue response sentence, electronic equipment and storage medium
CN112328751A (en) Method and device for processing text
CN112153424A (en) Content pushing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination