CN111445545B - Text transfer mapping method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN111445545B
CN111445545B (application CN202010124986.9A)
Authority
CN
China
Prior art keywords
text
scene
target object
map
canvas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010124986.9A
Other languages
Chinese (zh)
Other versions
CN111445545A (en)
Inventor
谢文珍
黄恺
冯富森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Future Vipkid Ltd
Original Assignee
Future Vipkid Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Future Vipkid Ltd
Priority to CN202010124986.9A
Publication of CN111445545A
Application granted
Publication of CN111445545B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00: 2D [Two Dimensional] image generation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application disclose a text-to-map method and device, a storage medium, and electronic equipment. The method comprises: obtaining text features of a target text; drawing a scene canvas corresponding to the text features; determining a target object to be drawn based on the text features and the scene canvas; determining attribute features of the target object from the target object, the text features, and the scene canvas; drawing the target object on the scene canvas; and adjusting the target object based on the attribute features to generate a map corresponding to the target text. With the embodiments of this application, the degree to which the map fits the scene actually described by the text can be improved, and the accuracy of map generation can be increased.

Description

Text transfer mapping method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular to a text-to-map method, a text-to-map device, a storage medium, and an electronic device.
Background
Text-to-map conversion is an application of text-to-image generation technology that converts text into vivid pictures. For example, in online education, teaching texts (such as a student's written compositions or dialogues) can be converted into pictures, which stimulates the student's interest in language learning and improves the effect of online teaching.
At present, text is usually encoded during text-to-map conversion, map objects are retrieved from a map library according to keywords (such as nouns in the text) obtained after encoding, and the map objects are combined on a canvas to generate the map. However, generating a map by combining keywords in this way can ignore the semantics actually expressed by the text, so the map hardly fits the scene the text actually describes, which affects the accuracy of map generation.
Disclosure of Invention
The embodiments of this application provide a text-to-map method and device, a storage medium, and electronic equipment, which can improve the degree to which a map fits the scene actually described by the text and improve the accuracy of map generation. The technical scheme is as follows:
In a first aspect, an embodiment of the present application provides a text-to-map method, where the method includes:
acquiring text features of a target text, and drawing a scene canvas corresponding to the text features;
determining a target object to be drawn based on the text features and the scene canvas;
determining attribute characteristics of the target object according to the target object, the text characteristics and the scene canvas;
and drawing the target object on the scene canvas, and adjusting the target object based on the attribute characteristics to generate a map corresponding to the target text.
In a second aspect, an embodiment of the present application provides a text-to-map apparatus, including:
the scene canvas drawing module is used for obtaining text characteristics of the target text and drawing scene canvases corresponding to the text characteristics;
the target object determining module is used for determining a target object to be drawn based on the text characteristics and the scene canvas;
the attribute characteristic determining module is used for determining attribute characteristics of the target object according to the target object, the text characteristics and the scene canvas;
and the map generation module is used for drawing the target object on the scene canvas, adjusting the target object based on the attribute characteristics and generating a map corresponding to the target text.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-described method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiments of the application has at least the following beneficial effects:
In one or more embodiments of the present application, a terminal obtains text features of a target text, draws a scene canvas corresponding to the text features, determines a target object to be drawn based on the text features and the scene canvas, determines attribute features of the target object according to the target object, the text features, and the scene canvas, draws the target object on the scene canvas, and adjusts the target object based on the attribute features to generate a map corresponding to the target text. The scene canvas corresponding to the target text, the objects to be drawn, and the attribute features of those objects are determined step by step from the text features of the target text, and the image is adjusted according to the attribute features (position, action, posture, and the like) of each object as it is drawn on the scene canvas. This avoids the problem that a map generated from keywords alone hardly fits the scene the text actually describes; a map with an accurate scene, clear objects, and clear attributes can be generated that is closer to the semantics actually expressed by the text, so the fit between the map and the scene actually described by the text is improved, and the accuracy of map generation is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a text-to-map method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an attribute determination model related to the text-to-map method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of another attribute determination model related to the text-to-map method according to an embodiment of the present application;
FIG. 4 is a flowchart of another text-to-map method according to an embodiment of the present application;
FIGS. 5 to 10 are schematic diagrams of map conversion related to the text-to-map method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a text-to-map device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a scene canvas drawing module according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a target object determining module according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of an attribute determining module according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a map generating module according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of another text-to-map device according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
In the description of the present application, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless expressly specified and limited otherwise, "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, and may include other steps or elements not listed or inherent to such a process, method, article, or apparatus. The specific meaning of the above terms in the present application will be understood by those of ordinary skill in the art on a case-by-case basis. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
The present application will be described in detail with reference to specific examples.
In one embodiment, as shown in fig. 1, a text-to-map method is proposed. The method may be implemented by a computer program and may run on a text-to-map device based on a von Neumann system. The computer program may be integrated in an application or may run as a stand-alone tool-class application.
Specifically, the text-to-map method comprises the following steps:
step 101: and acquiring text characteristics of the target text, and drawing scene canvas corresponding to the text characteristics.
The target text refers to text containing textual content; for example, the target text may be text collected from a user's writing in an online education scene. The text content contained in the target text can generally be understood as a written expression of a language, typically a sentence or a combination of sentences having a complete, systematic meaning. Taking English as an example, the target text may be at least one word, at least one sentence, or at least one paragraph; the target text is a practical application form of language and in implementation generally refers to some passage of language.
Text features refer to the textual attributes specific to unstructured data expressed in characters. Taking a written composition as an example, text features include textual elements such as the author's intention, data, topic description, and underlying feature meaning. Text features can express the semantics of the object being described, and the various characteristics of that semantics in a language environment. Taking a short English text as an example, the text features may be the constituent letters, the word order, word emotion information, mutual information, and the like.
Constituent letters are the letters that make up a word and their sequential relationship.
Word order is the order of the words that make up a sentence.
Word emotion information is the emotional meaning a word expresses in the sentence, which can be understood as whether the word is commendatory or derogatory, elevated or lowly, happy or sad, and so on.
Mutual information refers to the statistical dependence between a word and a category, and is often used to measure the degree of association between two objects.
Specifically, after acquiring a target text input by a user, the terminal acquires text features of the target text by using a text feature acquisition model.
Optionally, the text feature acquisition model may use a context-framework-based method of acquiring text feature information: the feature elements of the text content (sentences, words, characters, symbols, etc.) are first determined, and language semantic analysis is then integrated into a statistical algorithm to extract and process the text content contained in the target text, obtaining the text features of the target text. It may also use an ontology-based text feature acquisition method, i.e., an ontology model that takes the text content as input and outputs the text features of the target text. It may also use a concept-feature extraction method based on a knowledge network, i.e., a feature acquisition method based on conceptual features: on the basis of a vector space model (Vector Space Model, VSM), language semantic analysis is performed on the text content, the language semantic information of words is acquired using the knowledge-network database, words with the same language semantics are mapped to the same subject concept to obtain clustered words, and the clustered words are used as the feature items of the text vectors of the VSM model for model operation, and so on. It should be noted that there are many ways to obtain the text features of the target text; one or a combination of the above may be used, which is not limited herein.
Specifically, after the terminal acquires the text features of the target text, it draws a scene canvas corresponding to the text features. The scene canvas can be understood as the background map of the final map, an initial scene map to which elements will be added, and the like.
In a possible implementation manner, after the terminal obtains the text features of the target text, it may extract the key elements of the text features (key sentences, key words, key symbols, etc.) using a key-feature extraction method based on the text context, integrate semantic analysis into a statistical algorithm to extract the key elements contained in the text features, determine the scene topic corresponding to those key elements (such as a beach, mountain, or river scene topic), and then match a corresponding scene canvas in a preset map index library according to the scene topic.
Step 102: determining a target object to be drawn based on the text features and the scene canvas.
The target object can be understood as any drawable figure that can be drawn or inserted, and the figure can be changed and refined. Drawable figures include predefined shapes, curves, lines, etc.; in the embodiments of the present application, the target object may be a person to be drawn (such as a cartoon character), an animal, a plant, a vehicle, a building, and so on. For example, a teenager, a sponge, a duck, or a wolf may be the target object to be drawn.
Specifically, after obtaining the text features of the target text and drawing the corresponding scene canvas according to the text features, the terminal inputs the text features and the scene canvas into an object determination model based on an attention mechanism, which outputs the target object to be drawn. That is, taking the text features and the scene canvas as the input of a neural network model (the object determination model), the target object to be drawn at each time step of drawing on the scene canvas is predicted; for example, a teenager to be drawn is predicted at a first time step t1, and a duck to be drawn is predicted at a second time step t2.
In the embodiments of the present application, the attention mechanism covers at least two aspects: deciding which part of the input needs to be focused on, and allocating limited information-processing resources to the important part. Introducing an attention mechanism over the scene canvas and the text features can highlight the more critical image portions of the scene canvas, such as higher-priority target objects to be drawn on the current scene canvas. For example, in a specific application scenario, the target text may contain described objects, attribute information of those objects, scene information, and so on. After the attention mechanism is introduced into the object determination model, the object parts contained in the text features can be highlighted, and the non-object parts of the feature map (attribute information such as color, emotion, action, and position) can be weakened, so that subsequent processing focuses on the highlighted parts in order to determine the next target object to be drawn.
In this embodiment, the object determination model is a neural network model composed of a plurality of nodes whose simple nonlinear analog processing elements are densely interconnected; it is a system model that mimics biological neurons. The neural network model is formed by connecting the input of at least one node with the output of each node, similar to the synaptic connections of real neurons. Each neuron expresses a specific output function, i.e., an excitation function, and the connection between every two neurons carries a connection strength, i.e., a weighting value acting on the signal passing through the connection. In this embodiment, a large number of scene canvases and corresponding text features are input into a neural network model based on an attention mechanism for training, yielding a trained object determination model. The object determination model has the capabilities of key-information feature extraction, semantic knowledge summarization, and learning and memorization in the process of determining the target object; usually, the information or knowledge learned by the neural network model is stored in the connection matrix between unit nodes.
Alternatively, the neural network model may be implemented based on one or a combination of a convolutional neural network (Convolutional Neural Network, CNN) model, a deep neural network (Deep Neural Network, DNN) model, a recurrent neural network (Recurrent Neural Networks, RNN) model, an embedding model, a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model, a logistic regression (Logistic Regression, LR) model, and the like.
Specifically, when the terminal obtains a large amount of sample data containing text features and scene canvases, the sample data is annotated; annotation can be understood as labeling the key information (the objects to be drawn) corresponding to the sample data. The text features and scene canvases are input into an initial object determination model, and the object determination model is trained on the annotated sample data to obtain a trained object determination model.
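The following minimal sketch illustrates this kind of supervised training loop, assuming a PyTorch implementation; the model interface, the sample field names, and the cross-entropy objective are illustrative assumptions rather than details fixed by this application.

```python
# Hypothetical training sketch for the attention-based object determination model.
# The model interface and sample field names ("text_feat", "canvas_feat", "object_id")
# are illustrative assumptions, not details defined by this application.
import torch
import torch.nn as nn

def train_object_model(model: nn.Module, data_loader, num_epochs: int = 10) -> nn.Module:
    """Train on samples annotated with the object to draw at each time step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()  # object prediction treated as classification over an object vocabulary
    model.train()
    for _ in range(num_epochs):
        for sample in data_loader:
            text_feat = sample["text_feat"]      # encoded text features
            canvas_feat = sample["canvas_feat"]  # encoded scene canvas
            target_obj = sample["object_id"]     # annotated key information: the object to draw
            logits = model(text_feat, canvas_feat)  # likelihood scores over candidate objects
            loss = criterion(logits, target_obj)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```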
Step 103: determining attribute features of the target object according to the target object, the text features, and the scene canvas.
Attribute features are the characteristics or attributes describing the target object, including but not limited to facial features, clothing features, emotional features, behavioral features, and the like. For example, words such as "beautiful", "gentle and elegant", or "gray" may be used to describe characteristics of the target object. Likewise, words such as running, fighting, resting, playing, and accompanying may be used to describe characteristics of the target object.
Specifically, after determining the target object to be drawn, the terminal inputs the target object, the text features, and the scene canvas into an attribute determination model based on an attention mechanism, which outputs the attribute features of the target object. That is, taking the target object, the text features, and the scene canvas as the input of a neural network model (the attribute determination model), the attribute features of the target object at each time step of drawing on the scene canvas are predicted; for example, the expression, behavior, and other features of the teenager to be drawn are predicted at the first time step t1, and the appearance, behavior, and other features of the duck to be drawn are predicted at the second time step t2.
In the embodiments of the present application, the attention mechanism covers at least two aspects: deciding which part of the input needs to be focused on, and allocating limited information-processing resources to the important part. Introducing an attention mechanism over the target object, the text features, and the scene canvas can highlight the more critical image portions of the scene canvas, such as the attribute features of higher-priority target objects on the current scene canvas. For example, in a specific application scenario, the target text may contain described objects, attribute information of those objects, scene information, and so on. After the attention mechanism is introduced into the attribute determination model, the attribute parts contained in the text features (attribute information such as position, color, emotion, and action) can be highlighted, and the object parts of the feature map (objects such as people, animals, and plants) can be weakened, so that subsequent processing focuses on the highlighted parts in order to determine the attribute features corresponding to the next target object to be drawn.
The attention-based attribute determination model can, guided by the context semantics of the target text carried by the input text features, encode the relevant content (i.e., the attribute information) of the target object to be drawn, determine attribute information of the target object such as its position in the scene canvas, and finally output the attribute features of the target object after encoding by the attribute determination model.
Alternatively, the attention-based attribute determination model may be a decoder model in the seq2seq framework. As shown in fig. 2, fig. 2 is a schematic structural diagram of an attribute determination model. In fig. 2, the target object x1, the text feature x2, and the scene canvas x3 form the input x of the attribute determination model; h1, h2, ..., hn are the neural network computing units of the decoder model, and each connection between two computing units carries a connection strength, i.e., a weighting value acting on the signal passing through the connection. The model has the capabilities of key-information feature extraction, semantic knowledge summarization, and learning and memorization in the process of determining the attributes, and the information or knowledge learned by the neural network model is usually stored in the connection matrix between the computing units. It should be noted that in the model shown in fig. 2, the attribute y output at the previous time is taken as the input at the current time, while the input x participates in the computation only as the initial state, and later computations are independent of the input x. For example, the output attribute y1 corresponding to computing unit h1 at the previous time is used as the input of computing unit h2 at the current time. The model thus produces an output y (i.e., the attribute features) containing attribute y1, attribute y2, ..., attribute yn.
In a possible embodiment, the structure of the attribute determination model may be the decoder model structure shown in fig. 3. In the model shown in fig. 3, the attribute y output at the previous time is likewise taken as the input at the current time, but the input x participates in the computation of every neural network computing unit as a steady state, i.e., later computations remain related to the input x. For example, the output attribute y1 corresponding to computing unit h1 at the previous time is used as the input of computing unit h2 at the current time, and the input x also participates in the computation at h2 and every later unit. The model thus produces an output y (i.e., the attribute features) containing attribute y1, attribute y2, ..., attribute yn.
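A minimal sketch of these two decoder variants is given below, assuming PyTorch and a GRUCell in the role of the computing units h1, ..., hn; the dimensions, the readout layer, and the way the steady-state input x is mixed in are assumptions for illustration only.

```python
# Sketch of the two decoder variants of figs. 2 and 3, with a GRUCell standing in for
# the computing units h1, ..., hn. Dimensions, the readout layer, and the way the
# steady-state "input x" is mixed in are assumptions for illustration only.
import torch
import torch.nn as nn

class AttributeDecoderSketch(nn.Module):
    def __init__(self, dim: int = 256, steps: int = 4, feed_x_every_step: bool = False):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)    # plays the role of h1, h2, ..., hn
        self.readout = nn.Linear(dim, dim)  # maps a hidden state to an attribute vector y_i
        self.steps = steps
        self.feed_x = feed_x_every_step     # False: fig. 2 variant; True: fig. 3 variant

    def forward(self, x: torch.Tensor) -> list:
        # x: fused input of target object, text features and scene canvas, shape (batch, dim)
        hidden = x                           # "input x" initialises the first state
        y_prev = torch.zeros_like(x)
        outputs = []
        for _ in range(self.steps):
            step_in = y_prev + x if self.feed_x else y_prev  # fig. 3 keeps conditioning on x
            hidden = self.cell(step_in, hidden)
            y_prev = self.readout(hidden)    # output attribute y_i, fed back at the next step
            outputs.append(y_prev)
        return outputs                       # attribute y1, y2, ..., yn
```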
In the embodiments of the application, after the initial attribute determination model is created, a large amount of sample data comprising scene canvases, target objects, and text features is acquired and input into the attention-based neural network model for training, yielding a trained attribute determination model. The attribute determination model may be trained using a dynamic time warping (DTW) based training method, a vector quantization (VQ) based training method, a hidden Markov model (HMM) based training method over the time series of image signals, or the like.
Step 104: drawing the target object on the scene canvas, and adjusting the target object based on the attribute features to generate a map corresponding to the target text.
The map can be understood as the image of the text's semantics into which the terminal finally converts the target text when executing the text-to-map method of the embodiments of the application.
In one possible implementation, the terminal may invoke an object generation program to initially generate a feature map of the target object on the scene canvas. A feature map is an image initially generated from an object feature vector. In this embodiment, the feature map may be a low-resolution image, for example an image with a resolution of 32×32 or 64×64. Each object corresponds to an object feature vector on the terminal; the terminal can obtain the object feature vector corresponding to the target object and then generate the feature map from that vector. The terminal can also, synchronously or asynchronously, adjust the feature map of the target object on the scene canvas based on the attribute features; the adjustment may further include adjusting the image portions associated with the target object, for example the image background (plants, animals, etc.) associated with it. In a specific application scenario, for example using a bird dataset (CUB), the text to be processed is input as "a gray bird with a white chest; the gray bird is in a bad mood". The terminal generates a feature map of the bird on the scene canvas according to the bird's feature vector, and can synchronously or asynchronously adjust the feature map of the target object on the scene canvas according to the attribute features corresponding to the bird object: the bird's appearance is adjusted to gray, its chest area to white, its emotional features so that its expression appears dispirited, and the image portions associated with the bird, i.e., the scene environment, are adjusted toward an overcast, rainy appearance (such as adding clouds and raindrops and correspondingly adjusting the brightness and contrast of the scene), and so on. After the target object is adjusted according to the attribute features, the adjusted scene canvas is obtained, which is the map corresponding to the target text.
In another possible implementation manner, the terminal stores a map index library in which at least a plurality of map elements (maps corresponding to objects) are stored. The terminal can obtain the map object corresponding to the target object from the map index library, add the map object to the scene canvas, and synchronously or asynchronously adjust the map of the target object on the scene canvas based on the attribute features; the adjustment may further include adjusting the image portions associated with the target object, for example the image background (plants, animals, etc.) associated with it.
In the embodiments of the application, a terminal acquires text features of a target text, draws a scene canvas corresponding to the text features, determines a target object to be drawn based on the text features and the scene canvas, determines attribute features of the target object according to the target object, the text features, and the scene canvas, draws the target object on the scene canvas, and adjusts the target object based on the attribute features to generate a map corresponding to the target text. The scene canvas corresponding to the target text, the objects to be drawn, and the attribute features of those objects are determined step by step from the text features of the target text, and the image is adjusted according to the attribute features (position, action, posture, and the like) of each object as it is drawn on the scene canvas. This avoids the problem that a map generated from keywords alone hardly fits the scene the text actually describes; a map with an accurate scene, clear objects, and clear attributes can be generated that is closer to the semantics actually expressed by the text, so the fit between the map and the scene actually described by the text is improved, and the accuracy of map generation is improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating another embodiment of a text-to-map method according to the present application. Specific:
step 201: and inputting the target text into a text encoder, and outputting text features corresponding to the target text.
The text encoder outputs an encoded feature representation of the target text input to it, i.e., the text features corresponding to the target text; in practical applications the text features are usually characterized in the form of encoding vectors.
Specifically, the text encoder can compress the target text using a deep neural network to obtain the encoding vector corresponding to each moment. The specific approach is to use a long short-term memory network in the deep neural network: each text element of the target text (word, sentence, symbol, etc.) is input into the network in sequence, and the hidden-layer representation hi corresponding to each moment (i.e., each time step) is obtained.
Specifically, using a long short-term memory network (LSTM) in the deep neural network, a text element of the target text (such as a word or character) is first input into the LSTM, which compresses the word into a vector and passes the compressed vector on to the next moment. The recurrent network at the next moment takes the compressed vector from the previous moment and the next text element of the original text as input, compresses them into a new vector, and passes it on to the next moment. The encoding vectors for every moment, obtained after all of the text has been compressed, are the feature information needed when decoding (including attribute decoding, object decoding, scene decoding, etc.). In this embodiment the number of moments equals the number of words in the sentence, and the hidden-layer vector corresponding to each moment is the vector into which the LSTM has compressed the words. The text features corresponding to the target text are obtained through this encoding process.
In one specific implementation scenario, the text encoder may be a GRU network, an evolution of the LSTM, i.e., a single-layer bidirectional recurrent network (BiGRU) with gated recurrent units (GRUs). It takes the linear embedding of each text element (each word) as input, and the hidden dimension of each direction may be a fixed dimension, such as 256. Pre-trained parameters from the GloVe method are used here to initialize the word-embedding network that makes up the text encoder. The word-embedding vectors are fixed, and the text features are used for abstract scene and semantic layout generation in the subsequent steps that generate the composite image, i.e., the map.
For example, the encoding process of the text encoder can be characterized by the following formula, computed for each text element (word) of a given target text:

h_i = BiGRU(x_i, h_{i-1})

where BiGRU denotes a neural network with bidirectional GRU units, x_i is the word-embedding vector corresponding to the i-th word, and h_i is the context hidden vector (expressed in the form of a feature vector) corresponding to the encoded text element. The pair e_i = [x_i; h_i] formed by the context hidden vector h_i and the word-embedding vector x_i is the output of the text encoder, i.e., the text features.
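A minimal sketch of such a bidirectional-GRU text encoder is shown below, assuming PyTorch; the vocabulary size, dimensions, and the optional GloVe initialisation are illustrative assumptions.

```python
# Minimal sketch of the text encoder: a single-layer bidirectional GRU over word
# embeddings, returning the pairs e_i = [x_i; h_i] as text features. The vocabulary
# size, dimensions, and the optional GloVe initialisation are assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 10000, emb_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # could be initialised from GloVe vectors
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len) word indices of the target text
        x = self.embed(token_ids)        # word-embedding vectors x_i
        h, last = self.bigru(x)          # context hidden vectors h_i (both directions)
        e = torch.cat([x, h], dim=-1)    # text features e_i = [x_i; h_i]
        return e, last

# Usage: encode an already tokenised sentence of 8 words.
encoder = TextEncoder()
tokens = torch.randint(0, 10000, (1, 8))
text_features, _ = encoder(tokens)
print(text_features.shape)  # torch.Size([1, 8, 300 + 2 * 256])
```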
Step 202: extracting a scene theme corresponding to the text features, indexing a scene map corresponding to the scene theme in a preset map index library, and determining the scene map as the scene canvas.
The mapping index library is a pre-established image library containing a large amount of mapping materials, and the image library contains scene maps corresponding to a plurality of scene topics.
Specifically, after acquiring text features of a target text, a terminal extracts scene topics corresponding to the text features, indexes scene maps corresponding to the scene topics in a preset map index library, and determines the scene maps as scene canvas. Wherein the scene canvas can be understood as a background map of a map, an initial scene map of an element to be added, and the like.
In a possible implementation manner, after the terminal obtains the text features of the target text, it may extract the key elements of the text features (key sentences, key words, key symbols, etc.) using a key-feature extraction method based on the text context, integrate semantic analysis into a statistical algorithm to extract the key elements contained in the text features, determine the scene topic corresponding to those key elements (such as a beach, mountain, or river scene topic), then match the corresponding scene map in the preset map index library according to the scene topic, and, once the scene map is found, determine it as the scene canvas.
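A hedged sketch of this keyword-to-scene-canvas lookup is shown below; the keyword table, the library paths, and the matching rule are purely illustrative stand-ins, since the application does not fix a concrete data structure for the map index library.

```python
# Hedged sketch of the keyword-to-scene-canvas lookup. The keyword table, library
# paths, and matching rule are illustrative stand-ins; the application does not fix
# a concrete data structure for the map index library.
from typing import List, Optional

THEME_KEYWORDS = {
    "beach": {"beach", "sand", "sandbox", "sea"},
    "mountain": {"mountain", "hill", "cliff"},
    "river": {"river", "stream", "bank"},
}

MAP_INDEX_LIBRARY = {  # scene theme -> scene map in the preset map index library
    "beach": "scenes/beach.png",
    "mountain": "scenes/mountain.png",
    "river": "scenes/river.png",
}

def extract_scene_theme(key_elements: List[str]) -> Optional[str]:
    """Return the first scene theme whose keyword set overlaps the key elements."""
    words = {w.lower() for w in key_elements}
    for theme, keywords in THEME_KEYWORDS.items():
        if words & keywords:
            return theme
    return None

def index_scene_canvas(key_elements: List[str]) -> Optional[str]:
    theme = extract_scene_theme(key_elements)
    return MAP_INDEX_LIBRARY.get(theme) if theme else None

print(index_scene_canvas(["Amy", "sitting", "sandbox"]))  # scenes/beach.png
```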
Step 203: inputting the scene canvas into a convolutional network for scene encoding, and outputting the scene-encoded scene feature map.
The scene feature map can be understood as a scene feature map obtained by identifying an image to be identified through a convolution network. In the embodiment of the application, the scene feature map is obtained by identifying a scene canvas through a convolution network, and generally, the scene feature map comprises at least one scene feature value.
For example, the scene feature map may be obtained by extracting scene feature values of an image to be identified through a convolutional network (CNN network), and the specific process may be:
the convolution network comprises one or more convolution kernels for extracting feature information from the pixel matrix of the scene canvas, the pixel matrix of the image to be identified is traversed by the convolution kernels according to a certain step length, the scene canvas is input into the convolution network for scene coding, at least one scene feature value can be obtained, and a scene feature map is formed by the at least one scene feature value.
In a specific embodiment, the convolutional network may be a convolutional GRU network consisting of at least one gated recurrent unit (ConvGRU unit). Each convolution layer in the convolutional GRU network has a 3×3 convolution kernel, a stride (i.e., step size) of 1, and a hidden dimension of 512. Each convolution input is padded so that the scene feature map has the same spatial resolution as the input scene canvas. The hidden state of the convolutional network's hidden layer is initialized by spatially replicating the last hidden state of the text features output by the text encoder.
Illustratively, the convolutional network can be characterized by the following formula, taking the scene canvas B_t as the input of the convolutional network:

h_t = ConvGRU(B_t, h_{t-1})

where h_t is the scene state of the current time step (which can also be understood as the scene feature values), ConvGRU(·) is the convolutional network with gated recurrent units, and h_{t-1} is the scene state of the historical time step. Each h_t captures the temporal dynamics of every spatial (grid) position in the scene. The aggregated h_t, the output of the convolutional network representing the current scene state, is a scene feature map of size C×H×W, where C is the number of channels (determined by the numbers of input and output channels) and H and W are the height and width of the convolution output.
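A minimal sketch of a single ConvGRU unit of this kind follows, assuming PyTorch; the 3×3 kernel, stride 1, and padding follow the description above, while the gate formulation and dimensions are common ConvGRU conventions assumed for illustration.

```python
# Minimal sketch of a single ConvGRU unit for scene encoding. The 3x3 kernel, stride 1
# and padding follow the description above; the gate formulation and dimensions are
# common ConvGRU conventions assumed for illustration.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        pad = k // 2  # padding keeps the scene feature map at the resolution of the input canvas
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, stride=1, padding=pad)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, stride=1, padding=pad)

    def forward(self, b_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([b_t, h_prev], dim=1)))
        z, r = zr.chunk(2, dim=1)                     # update and reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([b_t, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_tilde         # new scene state for the current time step

# One encoding step: a 3-channel canvas and a 512-dimensional scene state on a 64x64 grid.
cell = ConvGRUCell(in_ch=3, hid_ch=512)
canvas = torch.zeros(1, 3, 64, 64)
h = torch.zeros(1, 512, 64, 64)
scene_feature_map = cell(canvas, h)  # C x H x W scene feature map per batch element
```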
Step 204: inputting the text features and the scene feature map into an object decoder, and outputting the target object.
The object decoder is an attention-based neural network model that outputs the target object according to the likelihood scores of all candidate objects. It takes the recurrent scene state in the scene feature map and the text features as input and outputs the target object to be drawn at the current time step. In practical application, the object decoder is the object determination model, and includes but is not limited to a scene pooling part, a text attention part, and an object convolution part.
In this embodiment, the object decoder is composed of a plurality of nodes whose simple nonlinear analog processing elements are densely interconnected; it is a system model that mimics biological neurons. The object decoder is formed by connecting the input of at least one node with the output of each node, similar to the synaptic connections of real neurons. Each neuron expresses a specific output function, i.e., an excitation function, and the connection between every two neurons carries a connection strength, i.e., a weighting value acting on the signal passing through the connection. In this embodiment, a large number of scene canvases and corresponding text features are input into an attention-based neural network model for training, yielding a trained object decoder; the object decoder has the capabilities of key-information feature extraction, semantic knowledge summarization, and learning and memorization in the process of determining the target object, and the information or knowledge it learns is usually stored in the connection matrix between unit nodes. After the text features and the scene feature map are input into the object decoder, the scene pooling part pools the scene feature map to collect the scene spatial context required for object prediction, for example whether the canvas corresponding to the scene feature map at the current time step has a target object added as well as the history objects already added; the pooling operation drives the pooling neural unit to fuse the attended object spatial features into a scene attention vector. The text attention part of the object decoder then processes the scene attention vector and the text features: through the attention mechanism it focuses on the semantic context of the target text, highlights the object parts contained in the corresponding text features, and weakens the non-object parts (attribute information such as color, emotion, action, and position), so as to determine the text information about the object to be drawn that is contained in the text features; this text information is characterized by a text attention vector. Finally, the object convolution part of the object decoder convolves the text attention vector and the scene attention vector, and the output layer of the convolutional network outputs the target object.
In one possible implementation, in order to make the object to be drawn predicted by the object decoder more accurate, the drawn history object may be taken into account to improve the accuracy of the object decoder's predictions. A history object can be understood as an object drawn at a time before the current time (i.e., the current time step); in practical applications it is typically the object drawn at the previous time (i.e., the previous time step). Assuming the object corresponding to the current time step T is target object 1, the history object is the object drawn at time T-1 (i.e., the previous time step). The specific steps are as follows:
1. The terminal obtains the drawn history object, specifically the object O_{t-1} predicted by the object decoder at the previous time step. In general, the object O_{t-1} is encoded and represented as a high-dimensional object feature vector.
2. The input scene feature map is pooled to obtain the pooled first scene attention vector u_t^0, i.e.

u_t^0 = AvgPooling(ψ_0(h_t))

where AvgPooling(·) denotes the scene pooling part of the object decoder, typically a pooling layer; ψ_0 is a convolutional network of the object decoder; and h_t is the scene state of the current time step. Processing through the convolutional network ψ_0 allows the spatial state of h_t to be attended to.
The terminal inputs the text features, the scene feature map, and the history object into the object decoder, and the scene pooling part of the object decoder pools the scene feature map to collect the scene spatial context required for object prediction, i.e., h_t; for example, whether the canvas corresponding to the scene feature map at the current time step has a target object added as well as the history objects already added. The pooling neural unit of the pooling layer then fuses the attended object spatial features into the scene attention vector, giving the pooled first scene attention vector u_t^0.
3. The first scene attention vector, the history object, and the text features are input into the text attention part of the object decoder, i.e., the first text attention module, which outputs the first text attention vector, i.e.

first text attention vector = φ_1(u_t^0, O_{t-1}, e)

where φ_1 is the text attention part of the object decoder (the first text attention module), u_t^0 is the first scene attention vector, O_{t-1} is the history object, and e is the text features.
The first text attention module processes the scene attention vector, the history object, and the text features. Specifically, through the attention mechanism the inputs u_t^0 and O_{t-1} are used to attend to the semantic context of the target text, i.e., the text features e; the object parts contained in the corresponding text features are highlighted, while the non-object parts (attribute information such as color, emotion, action, and position) are weakened, so as to determine the text information about the object to be drawn contained in the text features; this text information is characterized by the first text attention vector.
4. The first scene attention vector, the history object, and the first text attention vector are input into the object convolution network, which outputs the target object to be drawn. The object convolution network is typically a convolutional perceptron with a preset number of layers, for example 2. Taking the first scene attention vector, the history object, and the first text attention vector as input, the fully connected layer of the convolutional perceptron integrates the features of each input into the output layer, and an excitation function (softmax) is used to predict the likelihood of the next object, so as to output the object with the highest likelihood, i.e., the target object.
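The sketch below ties points 1 to 4 together in one object-decoder step, assuming PyTorch; average pooling, dot-product text attention, and a two-layer perceptron head stand in for the scene pooling part, the first text attention module, and the object convolution network, and all dimensions are assumptions.

```python
# Sketch of one object-decoder step covering points 1-4 above, assuming PyTorch.
# Average pooling, dot-product text attention and a two-layer perceptron head stand in
# for the scene pooling part, the first text attention module and the object convolution
# network; all dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectDecoderStep(nn.Module):
    def __init__(self, scene_ch=512, text_dim=812, obj_dim=128, num_objects=80):
        super().__init__()
        self.psi0 = nn.Conv2d(scene_ch, scene_ch, 3, padding=1)  # convolutional part before pooling
        self.query = nn.Linear(scene_ch + obj_dim, text_dim)     # builds the text-attention query
        self.head = nn.Sequential(                                # two-layer perceptron head
            nn.Linear(scene_ch + obj_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, num_objects),
        )

    def forward(self, scene_map, prev_obj, text_feats):
        # scene_map: (B, C, H, W); prev_obj: (B, obj_dim); text_feats: (B, L, text_dim)
        u0 = F.adaptive_avg_pool2d(self.psi0(scene_map), 1).flatten(1)  # first scene attention vector
        q = self.query(torch.cat([u0, prev_obj], dim=-1)).unsqueeze(1)  # attention query
        attn = torch.softmax((q * text_feats).sum(-1), dim=-1)          # attention scores over words
        c0 = (attn.unsqueeze(-1) * text_feats).sum(1)                   # first text attention vector
        logits = self.head(torch.cat([u0, prev_obj, c0], dim=-1))
        return torch.softmax(logits, dim=-1)  # likelihood of each candidate next object
```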
Step 205: inputting the text features and the target object into a second text attention module, and outputting a second text attention vector.
The second text attention module, the scene convolution network, and the attribute convolution network together form an attribute decoder, an attention-based neural network model. Introducing attention over the target object, the text features, and the scene canvas allows the attribute decoder to highlight the more critical image portions of the scene canvas, such as the attribute features of higher-priority target objects on the current scene canvas. In practical application, the attribute decoder composed of the second text attention module, the scene convolution network, and the attribute convolution network is the attribute determination model. It should be noted that, depending on the specific implementation environment, the attribute decoder includes but is not limited to the second text attention module, the scene convolution network, and the attribute convolution network.
Specifically, the terminal inputs the text features and the target object into the text attention part of the attribute decoder, i.e., the second text attention module, which outputs the second text attention vector, i.e.

second text attention vector = φ_2(O_t, e)

where φ_2 is the text attention part of the attribute decoder (the second text attention module), O_t is the target object, and e is the text features.
The second text attention module processes the target object and the text features. Specifically, through the attention mechanism the input O_t is used to attend to the semantic context of the target text, i.e., the text features e; the attribute parts contained in the corresponding text features (attribute information such as color, emotion, action, and position) are highlighted while the object parts are weakened. The computation matrix contained in the text attention module is pre-trained on a large amount of sample data, and the learned attention scores determine the text information about the attributes of the object to be drawn contained in the text features; this text information is characterized by the second text attention vector.
Step 206: inputting the scene feature map and the second text attention vector into a scene convolution network, and outputting a second scene attention vector.
Specifically, the scene convolution network is an image-based (scene-feature-map-based) attention module through which the scene information relevant to the object to be added to the scene canvas can be collected, such as the scene information around the position where the object is to be added. Typically the scene convolution network may be a scene spatial attention module consisting of two convolution layers, i.e.

second scene attention vector = ψ_a(h_t, second text attention vector)

where ψ_a is the scene convolution network and h_t is the scene feature map.
Specifically, the scene convolution network includes, for example, an input layer, convolution layers, a pooling layer, a fully connected layer, and an output layer. In some embodiments, the scene feature map and the second text attention vector are input into the scene convolution network, for example received through its input layer, which can normalize the input data (the scene feature map and the second text attention vector) to improve the learning efficiency and performance of the scene convolution network. The scene feature map then undergoes feature extraction and computation in the convolution layers, is passed to the pooling layer for text and scene feature selection and information filtering, and is integrated through the fully connected layer into the output layer, which outputs the second scene attention vector. By inputting the scene feature map and the second text attention vector into the scene convolutional neural network, more feature information can be accumulated on top of the scene feature map guided by the text attention vector, yielding a scene-content characterization of the feature map, i.e., the second scene attention vector jointly corresponding to the scene feature map and the second text attention vector.
Step 207: inputting the second scene attention vector, the target object, and the second text attention vector into an attribute convolution network, and outputting the attribute features of the target object.
The attribute convolution network is typically a convolutional perceptron with a preset number of layers, such as a convolutional network (CNN) with 4 layers. Taking the second scene attention vector, the target object, and the second text attention vector as input, the fully connected layer of the convolutional perceptron integrates the features of each input into the output layer, and an excitation function (softmax) is used to predict at least one attribute of the next target object, so as to output the attribute features of the target object. The attribute distribution P(·) is expressed as follows:

P(l_t, attributes) = Θ(second scene attention vector, O_t, second text attention vector)

where Θ is the attribute convolution network, whose output layer has 1 + Σ_k R^k output channels, where R_t^k denotes the discrete range of the k-th attribute at the current time step and l_t is the location attribute among the attribute features. In practical application, the first channel of Θ's output layer uses the softmax function to predict the likelihood of the object's location over the spatial domain; the remaining channels predict the attributes of each attended grid location (determined by the scene attention vector). During training, the likelihood at the ground-truth location is used to compute the loss. At each attribute-determination time step, a location is first sampled from the attribute convolution network, and the attribute information corresponding to the target object is then collected from the sampled location, until attribute information has been collected for all sampled locations predicted from the second scene attention vector (i.e., every attended grid location), so that the output layer outputs the attribute features of the target object using the excitation function (softmax).
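A corresponding sketch of one attribute-decoder step follows, assuming PyTorch; a linear query, a two-convolution-layer scene attention module, and a 1×1 convolution head approximate φ_2, ψ_a, and Θ, and the number and size of the attribute ranges are assumptions chosen for illustration.

```python
# Corresponding sketch of one attribute-decoder step, assuming PyTorch. A linear query,
# a two-convolution-layer scene attention module and a 1x1 convolution head approximate
# the second text attention module, the scene convolution network and the attribute
# convolution network; the number and size of the attribute ranges are assumptions.
import torch
import torch.nn as nn

class AttributeDecoderStep(nn.Module):
    def __init__(self, scene_ch=512, text_dim=812, obj_dim=128, attr_ranges=(8, 4, 6)):
        super().__init__()
        self.phi2 = nn.Linear(obj_dim, text_dim)         # second text attention (query from the object)
        self.psi_a = nn.Sequential(                       # scene attention: two convolution layers
            nn.Conv2d(scene_ch + text_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
        )
        out_ch = 1 + sum(attr_ranges)                     # one location channel plus sum of ranges R^k
        self.theta = nn.Conv2d(256 + obj_dim + text_dim, out_ch, 1)  # attribute convolution network

    def forward(self, scene_map, obj_vec, text_feats):
        # scene_map: (B, C, H, W); obj_vec: (B, obj_dim); text_feats: (B, L, text_dim)
        q = self.phi2(obj_vec).unsqueeze(1)
        attn = torch.softmax((q * text_feats).sum(-1), dim=-1)
        c_a = (attn.unsqueeze(-1) * text_feats).sum(1)    # second text attention vector
        b, _, h, w = scene_map.shape
        c_map = c_a[:, :, None, None].expand(-1, -1, h, w)
        u_a = self.psi_a(torch.cat([scene_map, c_map], dim=1))      # second scene attention map
        o_map = obj_vec[:, :, None, None].expand(-1, -1, h, w)
        out = self.theta(torch.cat([u_a, o_map, c_map], dim=1))     # (B, 1 + sum R^k, H, W)
        loc_logits, attr_logits = out[:, :1], out[:, 1:]
        loc = torch.softmax(loc_logits.flatten(2), dim=-1).view(b, 1, h, w)  # location likelihood
        return loc, attr_logits
```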
Step 208: indexing the map object corresponding to the target object in a preset map index library.
Specifically, the terminal stores a mapping index library, the mapping index library at least stores a plurality of mapping elements (mapping corresponding to the object), the terminal can acquire the target mapping element from the mapping index library by adopting an image retrieval technology, and the target mapping element is used as the mapping object corresponding to the target object.
The map elements of the map index library may be obtained by taking all or some sample images from an existing image database (such as the WIDER FACE dataset, the IJB-C test set, the AVA dataset, or the COCO dataset) as map elements, and/or by capturing sample images in a real environment with a device that has a photographing function. A large number of sample images are acquired and preprocessed, where the preprocessing includes digitization, geometric transformation, normalization, smoothing, restoration, enhancement, and similar processing steps, to obtain the processed map elements.
Alternatively, the image retrieval technique may be an image retrieval algorithm including, but not limited to, a locality-sensitive hashing (Locality Sensitive Hashing, LSH) algorithm, a spectral hashing (SH) algorithm, a supervised discrete hashing (SDH) retrieval algorithm, a vector of locally aggregated descriptors (vector of locally aggregated descriptors, VLAD) retrieval algorithm, a K-D tree retrieval algorithm, or the like.
In a possible embodiment, in each map element included in the image database, when the acquired sample image is stored as the map element, an image identifier may be allocated to the map element, where the image identifier may be an image id, a number, a specific character string, or the like, and the map element is identified by the image identifier. Then, during retrieval, the image identification of the target object can be directly obtained based on the target object, and the map elements corresponding to the image identification are queried in an image database.
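An illustrative sketch of looking up a map element through such an image identifier is shown below; the identifier scheme and the in-memory dictionaries are hypothetical stand-ins for whatever image database and retrieval algorithm (LSH, VLAD, K-D tree, etc.) is actually used.

```python
# Illustrative sketch of resolving the target object to a map element through an image
# identifier assigned when the sample image was stored. The identifier scheme and the
# in-memory dictionaries are hypothetical stand-ins for the actual image database and
# retrieval algorithm (LSH, VLAD, K-D tree, ...).
from dataclasses import dataclass
from typing import Optional

@dataclass
class MapElement:
    image_id: str
    path: str

MAP_INDEX = {
    "obj_duck_001": MapElement("obj_duck_001", "elements/duck.png"),
    "obj_boy_001": MapElement("obj_boy_001", "elements/boy.png"),
}

OBJECT_TO_IMAGE_ID = {"duck": "obj_duck_001", "teenager": "obj_boy_001"}

def lookup_map_object(target_object: str) -> Optional[MapElement]:
    """Resolve the target object to an image id, then query the map index library."""
    image_id = OBJECT_TO_IMAGE_ID.get(target_object)
    return MAP_INDEX.get(image_id) if image_id else None

print(lookup_map_object("duck"))  # MapElement(image_id='obj_duck_001', path='elements/duck.png')
```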
Step 209: inputting the attribute features, the map object, and the scene canvas into a canvas adjustment model, and outputting the map corresponding to the target text.
The canvas adjustment model may be a neural-network-based model used to draw and adjust the map object on the scene canvas according to the model's input (the attribute features, the map object, and the scene canvas), and to output the adjusted scene canvas as the map corresponding to the target text.
In the embodiments of the application, the canvas adjustment model draws the currently determined map object onto the scene canvas based on the attribute features. Instead of directly generating a high-resolution image, it simplifies the image adjustment task: the image distribution is modeled step by step, in layers, from low resolution to high resolution according to the attribute features, and the map object on the scene canvas is adjusted accordingly. The adjustment may further include adjusting the image portions associated with the target object, for example the image background (plants, animals, etc.) associated with it. In a specific application scenario, for example, the text to be processed is input as "a gray bird with a white chest; the gray bird is in a bad mood". The terminal draws a map object of the bird on the scene canvas according to the bird's feature vector, and can synchronously or asynchronously adjust the map of the target object on the scene canvas according to the attribute features corresponding to the bird object: the bird's appearance is adjusted to gray, its chest area to white, its emotional features so that its expression appears dispirited, and the image portions associated with the bird, i.e., the scene environment, are adjusted toward an overcast, rainy appearance (such as adding clouds and raindrops and correspondingly adjusting the brightness and contrast of the scene), and so on. After the target object is adjusted according to the attribute features, the adjusted scene canvas is obtained, which is the map corresponding to the target text.
In practical applications, the target text generally corresponds to a plurality of objects to be drawn during text-to-map conversion, and the adding and drawing process of each object may correspond to one time step. It can be understood that, within one time step, the scene canvas may draw only one object, or a plurality of objects may be drawn synchronously or asynchronously; this embodiment is not particularly limited in this respect. After all objects have been added, the canvas adjustment model completes the image adjustment of the scene canvas, and at this point the map corresponding to the target text is output.
Illustratively, the target text may be:
Tim is holding a hotdog. Amy is sitting in the sandbox. Amy is holding the shovel.
When the terminal processes this target text, the text transfer mapping method can be used to generate the map step by step, specifically:
At time step T1: the target text is text-encoded to obtain the text features corresponding to the target text. A scene theme 1 corresponding to the text features is extracted, and a scene map corresponding to scene theme 1 is indexed in the preset map index library; the scene map is shown in fig. 5, fig. 5 is a schematic diagram of the scene map, and the scene map shown in fig. 5 is determined as the scene canvas.
At time step T2, the target object of interest to be drawn may be the "sandbox". The attribute features (location, size, color, spatial relationship, etc.) of the "sandbox" are then determined, the target object "sandbox" is drawn on the scene canvas, and the target object is adjusted based on the attribute features. The adjusted scene map is shown in fig. 6, in which it can be seen that a "sandbox" has been drawn.
At time step T3, the target object of interest to be drawn may be "Tim", and the history object is the "sandbox". The attribute features (location, size, color, spatial relationship, etc.) of "Tim" are then determined, such as "holding". The target object "Tim" is drawn on the scene canvas, and the target object is adjusted based on the attribute features. The adjusted scene map is shown in fig. 7, in which it can be seen that the character "Tim" has been drawn, where "Tim" stands beside the "sandbox" and the action is holding with the left hand.
At time step T4, the target object of interest to be drawn may be "Amy", and the history object is "Tim". The attribute features (location, size, color, spatial relationship, etc.) of "Amy" are then determined, such as "holding" and "sitting". The target object "Amy" is drawn on the scene canvas, and the target object is adjusted based on the attribute features. The adjusted scene map is shown in fig. 8, in which it can be seen that the character "Amy" has been drawn, where "Amy" is positioned in the "sandbox" and the action is holding with the left hand.
At time step T5, the target object of interest to be drawn may be the "hotdog", and the history object is "Tim". The attribute features (position, size, color, spatial relationship, etc.) of the "hotdog" are then determined, such as "holding" and "Tim holding". The target object "hotdog" is drawn on the scene canvas, and the target object is adjusted based on the attribute features. The adjusted scene map is shown in fig. 9, in which it can be seen that the item "hotdog" has been drawn, where the "hotdog" is held in "Tim"'s hand.
At time step T6, the target object of interest to be drawn may be the "shovel", and the history object is "Amy". The attribute features (location, size, color, spatial relationship, etc.) of the "shovel" are then determined, such as "holding" and "Amy holding". The target object "shovel" is drawn on the scene canvas, and the target object is adjusted based on the attribute features. The adjusted scene map is shown in fig. 10, in which it can be seen that the item "shovel" has been drawn, where the "shovel" is held in "Amy"'s hands.
At this point, all objects have been added, the canvas adjustment model completes the image adjustment of the scene canvas, and the map corresponding to the target text is output, as shown in fig. 10. The above target text is merely provided for a better understanding of the embodiments of the present application, and the added details referred to in the explanation do not constitute a specific limitation.
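The time-step flow of the example above can be summarized by the following sketch; every component callable (text_encoder, scene_indexer, object_decoder, attribute_net, canvas_adjuster) is a hypothetical stand-in with an assumed signature, not an interface defined by the embodiment.

```python
def generate_map(target_text, text_encoder, scene_indexer, object_decoder,
                 attribute_net, canvas_adjuster):
    """Illustrative sketch of the step-by-step text-to-map flow."""
    text_feat = text_encoder(target_text)            # time step T1: text features
    canvas = scene_indexer(text_feat)                # scene canvas from the index library
    history = []
    while True:                                      # time steps T2, T3, ...: one object each
        obj = object_decoder(text_feat, canvas, history)
        if obj is None:                              # no more objects described by the text
            break
        attrs = attribute_net(obj, text_feat, canvas)
        canvas = canvas_adjuster(attrs, obj, canvas) # draw and adjust the object
        history.append(obj)
    return canvas                                    # adjusted canvas = map for the text
```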
In one possible implementation, the canvas adjustment model may be an end-to-end generative adversarial model that simulates a series of multi-scale image distributions. The canvas adjustment model may consist of a number of generators and discriminators. Map elements of different resolutions are drawn on each branch of the model for generation and correction. On each branch, the generator captures the image distribution of the scene canvas at the corresponding resolution, and the discriminator distinguishes the generated image from the real image of the corresponding size, so that the generators are trained jointly to approximate the multi-layer distribution. This helps to ensure that the generated map is semantically close to the real semantics of the target text, so that the generated map has higher accuracy.
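For illustration, a minimal sketch of such a multi-branch generative adversarial arrangement is given below; the channel counts, depths, and conditioning interface are assumptions and do not reproduce the exact architecture of the embodiment.

```python
import torch
import torch.nn as nn

class MultiScaleCanvasGAN(nn.Module):
    """Sketch: one generator branch and one discriminator per resolution (assumptions only)."""

    def __init__(self, cond_dim=256, base_ch=64, n_branches=3):
        super().__init__()
        self.base_ch = base_ch
        self.stem = nn.Linear(cond_dim, base_ch * 8 * 8)   # coarsest 8x8 feature grid
        self.generators = nn.ModuleList(
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(base_ch, base_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            ) for _ in range(n_branches))
        self.to_rgb = nn.ModuleList(
            nn.Conv2d(base_ch, 3, 1) for _ in range(n_branches))
        # One discriminator per branch judges real vs. generated images of that size.
        self.discriminators = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(3, base_ch, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(base_ch, 1),
            ) for _ in range(n_branches))

    def forward(self, cond):
        x = self.stem(cond).view(-1, self.base_ch, 8, 8)
        images = []
        for gen, rgb in zip(self.generators, self.to_rgb):
            x = gen(x)                  # capture the canvas distribution at this resolution
            images.append(rgb(x))       # image judged by the same-scale discriminator in training
        return images                   # low-to-high resolution outputs
```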
In a possible implementation, the canvas adjustment model may be an image synthesis model based on map retrieval. Specifically, the canvas adjustment model may retrieve, based on the appearance vector in the attribute features, at least one matching map element from the preset map index library, obtain patches (that is, all or part of the image of a map element) from those map elements, and use each patch to adjust the target object on the scene canvas and the portion associated with the target object. More specifically, an appearance vector is predicted for each position in the feature map corresponding to the scene canvas, and similar patches are retrieved from the map index library according to the appearance vector. When the canvas adjustment model is created, a patch embedder based on a CNN is trained; in the process of drawing object maps on the scene canvas according to the retrieved foreground patches, the patch embedder of the canvas adjustment model embeds the foreground patches into the scene canvas, specifically by reducing a foreground patch to a vector of a specified dimension, such as a one-dimensional vector F_t. A triplet embedding method is used to compute a triplet loss during model processing, so as to reduce the Euclidean distance between the appearance vector l_t predicted for the corresponding position and F_t. After the target object has been adjusted according to the attribute features, the adjusted scene canvas, namely the map corresponding to the target text, is obtained.
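The patch embedder and the triplet objective described above may be pictured with the following sketch; the CNN layout, the negative-sampling strategy, and the vector dimension are illustrative assumptions rather than the exact construction of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedder(nn.Module):
    """Sketch of a CNN patch embedder producing a fixed-dimension vector F_t (assumptions only)."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, patch):
        return F.normalize(self.net(patch), dim=-1)   # one embedding vector per patch


def triplet_loss(appearance_vec, pos_patch_emb, neg_patch_emb, margin=0.2):
    # Pull the appearance vector l_t towards the matching patch embedding F_t
    # and push it away from a non-matching patch, reducing the Euclidean
    # distance between l_t and F_t.
    d_pos = (appearance_vec - pos_patch_emb).pow(2).sum(dim=-1)
    d_neg = (appearance_vec - neg_patch_emb).pow(2).sum(dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```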
In the embodiment of the application, the terminal acquires the text features of a target text, draws a scene canvas corresponding to the text features, determines a target object to be drawn based on the text features and the scene canvas, determines attribute features of the target object according to the target object, the text features, and the scene canvas, draws the target object on the scene canvas, and adjusts the target object based on the attribute features to generate a map corresponding to the target text. The scene canvas corresponding to the target text, the objects to be drawn, and the attribute features of those objects are determined step by step from the text features of the target text, and the image is adjusted accordingly based on the attribute features (position, action, posture, etc.) of each object as it is drawn on the scene canvas. This avoids the problem that a map generated from keywords alone is difficult to fit to the scene actually described by the text, and allows a map with an accurate scene, clear objects, and clear attributes to be generated, which is closer to the semantics actually expressed by the text, thereby improving the degree of fit between the map and the scene described by the text and improving the accuracy of map generation.
The following are apparatus embodiments of the present application, which may be used to perform the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 11, a schematic structural diagram of a text-to-map apparatus according to an exemplary embodiment of the present application is shown. The text-to-map apparatus may be implemented as all or part of a device by software, hardware, or a combination of both. The apparatus 1 includes a scene canvas drawing module 11, a target object determining module 12, an attribute feature determining module 13, and a map generation module 14.
The scene canvas drawing module 11 is used for obtaining text characteristics of the target text and drawing scene canvases corresponding to the text characteristics;
a target object determining module 12, configured to determine a target object to be drawn based on the text feature and the scene canvas;
an attribute feature determining module 13, configured to determine an attribute feature of the target object according to the target object, the text feature, and the scene canvas;
and the map generation module 14 is used for drawing the target object on the scene canvas, adjusting the target object based on the attribute characteristics and generating a map corresponding to the target text.
Optionally, as shown in fig. 12, the scene canvas drawing module 11 includes:
a text feature output unit 111, configured to input a target text into a text encoder, and output a text feature corresponding to the target text;
the scene canvas determining unit 112 is configured to extract a scene topic corresponding to the text feature, index a scene map corresponding to the scene topic in a preset map index library, and determine the scene map as a scene canvas.
Optionally, as shown in fig. 16, the apparatus 1 further includes:
and the scene characteristic diagram coding module 15 is used for inputting the scene canvas into a convolution network to perform scene coding and outputting a scene characteristic diagram after scene coding.
Optionally, the target object determining module 12 is specifically configured to:
and inputting the text characteristic and the scene characteristic diagram into an object decoder, and outputting the target object.
Optionally, as shown in fig. 16, the apparatus 1 further includes:
a history object acquisition module 16 for acquiring a drawn history object;
as shown in fig. 13, the target object determining module 12 includes:
a vector pooling unit 121, configured to pool the scene feature map to obtain a pooled first scene attention vector;
A text vector output unit 122, configured to input the first scene attention vector, the history object, and the text feature to a first text attention device, and output a first text attention vector;
and a target object output unit 123, configured to input the first scene attention vector, the history object, and the first text attention vector into an object convolution network, and output the target object.
Optionally, as shown in fig. 14, the attribute feature determining module 13 includes:
a text vector output unit 131, configured to input the text feature and the target object to a second text attention device, and output a second text attention vector;
a scene vector output unit 132, configured to input the scene feature map and the second text attention vector into a scene convolution network, and output a second scene attention vector;
and an attribute feature output unit 133, configured to input the second scene attention vector, the target object, and the second text attention vector into an attribute convolution network, and output an attribute feature of the target object.
Optionally, as shown in fig. 15, the map generating module 14 includes:
A mapping object drawing unit 141, configured to index, in a preset mapping index library, a mapping object corresponding to the target object;
and a scene canvas adjustment unit 142, configured to input the attribute feature, the map object, and the scene canvas into a canvas adjustment model, and output a map corresponding to the target text.
It should be noted that, when the text transfer mapping apparatus provided in the above embodiment performs the text transfer mapping method, the division into the above functional modules is merely used as an example for illustration; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text transfer mapping apparatus and the text transfer mapping method provided in the above embodiments belong to the same concept; the detailed implementation process is embodied in the method embodiments and is not repeated here.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the embodiment of the application, the terminal acquires the text features of a target text, draws a scene canvas corresponding to the text features, determines a target object to be drawn based on the text features and the scene canvas, determines attribute features of the target object according to the target object, the text features, and the scene canvas, draws the target object on the scene canvas, and adjusts the target object based on the attribute features to generate a map corresponding to the target text. The scene canvas corresponding to the target text, the objects to be drawn, and the attribute features of those objects are determined step by step from the text features of the target text, and the image is adjusted accordingly based on the attribute features (position, action, posture, etc.) of each object as it is drawn on the scene canvas. This avoids the problem that a map generated from keywords alone is difficult to fit to the scene actually described by the text, and allows a map with an accurate scene, clear objects, and clear attributes to be generated, which is closer to the semantics actually expressed by the text, thereby improving the degree of fit between the map and the scene described by the text and improving the accuracy of map generation.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, and the instructions are adapted to be loaded by a processor to perform the text transfer mapping method according to the embodiments shown in fig. 1 to 10; for the specific execution process, reference may be made to the specific description of the embodiments shown in fig. 1 to 10, which is not repeated herein.
The present application further provides a computer program product storing at least one instruction, where the at least one instruction is loaded and executed by a processor to perform the text transfer mapping method according to the embodiments shown in fig. 1 to 10; for the specific execution process, reference may be made to the specific description of the embodiments shown in fig. 1 to 10, which is not repeated herein.
Referring to fig. 17, a schematic structural diagram of an electronic device is provided in an embodiment of the present application. As shown in fig. 17, the electronic device 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, a memory 1005, at least one communication bus 1002.
Wherein the communication bus 1002 is used to enable connected communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may further include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Wherein the processor 1001 may include one or more processing cores. The processor 1001 connects various parts within the entire electronic device 1000 using various interfaces and lines, and performs various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1005, and by calling data stored in the memory 1005. Optionally, the processor 1001 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 1001 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed on the display screen; the modem is used to handle wireless communications. It will be appreciated that the modem may alternatively not be integrated into the processor 1001 and may be implemented by a single chip.
The memory 1005 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 1005 includes a non-transitory computer-readable storage medium. The memory 1005 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1005 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above-described method embodiments, and the like; the data storage area may store the data and the like referred to in the above method embodiments. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 17, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a text transfer map application.
In the electronic device 1000 shown in fig. 17, the user interface 1003 is mainly used for providing an input interface for a user, and acquiring data input by the user; and the processor 1001 may be configured to invoke the text-to-map application program stored in the memory 1005, and specifically perform the following operations:
Acquiring text characteristics of a target text, and drawing scene canvas corresponding to the text characteristics;
determining a target object to be drawn based on the text features and the scene canvas;
determining attribute characteristics of the target object according to the target object, the text characteristics and the scene canvas;
and drawing the target object on the scene canvas, and adjusting the target object based on the attribute characteristics to generate a map corresponding to the target text.
In one embodiment, when executing the text feature of the acquisition target text and drawing the scene canvas corresponding to the text feature, the processor 1001 specifically performs the following operations:
inputting a target text into a text encoder, and outputting text features corresponding to the target text;
and extracting a scene theme corresponding to the text feature, indexing a scene map corresponding to the scene theme in a preset map index library, and determining the scene map as a scene canvas.
In one embodiment, after executing the text feature of the acquisition target text and drawing the scene canvas corresponding to the text feature, the processor 1001 further performs the following operations:
And inputting the scene canvas into a convolution network to perform scene coding, and outputting a scene characteristic diagram after scene coding.
In one embodiment, the processor 1001, when executing the determining the target object to be drawn based on the text feature and the scene canvas, specifically performs the following operations:
and inputting the text characteristic and the scene characteristic diagram into an object decoder, and outputting the target object.
In one embodiment, the processor 1001, when executing the method of text transfer mapping, further performs the following operations:
acquiring a drawn history object;
the inputting the text feature and the scene feature map into an object decoder, and outputting the target object includes:
pooling the scene feature map to obtain a pooled first scene attention vector;
inputting the first scene attention vector, the history object and the text feature to a first text attention device, and outputting a first text attention vector;
and inputting the first scene attention vector, the historical objects and the first text attention vector into an object convolution network, and outputting the target object.
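For illustration only, the pooling, first text attention, and object convolution steps above can be sketched as follows; the multi-head attention form, the use of a small MLP in place of the object convolution network, the history-object embedding, and all dimensions are assumptions and do not describe the exact networks of the embodiment.

```python
import torch
import torch.nn as nn

class ObjectDecoder(nn.Module):
    """Sketch of the object decoding pipeline (illustrative assumptions only)."""

    def __init__(self, feat_ch=64, text_dim=256, obj_vocab=100):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=4,
                                               batch_first=True)
        self.query_proj = nn.Linear(feat_ch + text_dim, text_dim)
        self.object_net = nn.Sequential(
            nn.Linear(feat_ch + text_dim + text_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, obj_vocab),               # scores over candidate objects
        )

    def forward(self, scene_feat_map, text_feats, history_emb):
        # history_emb: assumed (batch, text_dim) embedding of the drawn history objects.
        # Pool the scene feature map into a first scene attention vector.
        scene_vec = scene_feat_map.mean(dim=(2, 3))
        # First text attention vector from the scene vector and history objects.
        query = self.query_proj(torch.cat([scene_vec, history_emb], dim=-1)).unsqueeze(1)
        text_vec, _ = self.text_attn(query, text_feats, text_feats)
        text_vec = text_vec.squeeze(1)
        # Object network (MLP stand-in for the object convolution network) outputs the target object.
        return self.object_net(torch.cat([scene_vec, history_emb, text_vec], dim=-1))
```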
In one embodiment, the processor 1001, when executing the determining the attribute characteristics of the target object according to the target object, the text characteristics and the scene canvas, specifically performs the following operations:
inputting the text feature and the target object to a second text attention device, and outputting a second text attention vector;
inputting the scene feature map and the second text attention vector into a scene convolution network, and outputting a second scene attention vector;
and inputting the second scene attention vector, the target object and the second text attention vector into an attribute convolution network, and outputting attribute characteristics of the target object.
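Similarly, the second text attention, scene convolution, and attribute convolution steps can be sketched as follows; the shapes, layer choices, and the way the text vector is broadcast over the scene feature map are illustrative assumptions, not the exact construction of the embodiment.

```python
import torch
import torch.nn as nn

class AttributeNetwork(nn.Module):
    """Sketch of the attribute feature pipeline (illustrative assumptions only)."""

    def __init__(self, feat_ch=64, text_dim=256, obj_dim=128, attr_dim=128):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=4,
                                               batch_first=True)
        self.obj_to_query = nn.Linear(obj_dim, text_dim)
        self.scene_conv = nn.Sequential(
            nn.Conv2d(feat_ch + text_dim, feat_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # second scene attention vector
        )
        self.attr_net = nn.Sequential(
            nn.Linear(feat_ch + obj_dim + text_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, attr_dim),                  # attribute features of the target object
        )

    def forward(self, scene_feat_map, text_feats, obj_emb):
        query = self.obj_to_query(obj_emb).unsqueeze(1)
        text_vec, _ = self.text_attn(query, text_feats, text_feats)   # second text attention vector
        text_vec = text_vec.squeeze(1)
        b, _, h, w = scene_feat_map.shape
        text_map = text_vec[:, :, None, None].expand(b, -1, h, w)
        scene_vec = self.scene_conv(torch.cat([scene_feat_map, text_map], dim=1))
        return self.attr_net(torch.cat([scene_vec, obj_emb, text_vec], dim=-1))
```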
In one embodiment, when the processor 1001 draws the target object on the scene canvas and adjusts the target object based on the attribute feature to generate a map corresponding to the target text, the following operations are specifically executed:
indexing a mapping object corresponding to the target object in a preset mapping index library;
and inputting the attribute characteristics, the map object and the scene canvas into a canvas adjustment model, and outputting the map corresponding to the target text.
In the embodiment of the application, the terminal acquires the text features of a target text, draws a scene canvas corresponding to the text features, determines a target object to be drawn based on the text features and the scene canvas, determines attribute features of the target object according to the target object, the text features, and the scene canvas, draws the target object on the scene canvas, and adjusts the target object based on the attribute features to generate a map corresponding to the target text. The scene canvas corresponding to the target text, the objects to be drawn, and the attribute features of those objects are determined step by step from the text features of the target text, and the image is adjusted accordingly based on the attribute features (position, action, posture, etc.) of each object as it is drawn on the scene canvas. This avoids the problem that a map generated from keywords alone is difficult to fit to the scene actually described by the text, and allows a map with an accurate scene, clear objects, and clear attributes to be generated, which is closer to the semantics actually expressed by the text, thereby improving the degree of fit between the map and the scene described by the text and improving the accuracy of map generation.
It will be clear to a person skilled in the art that the solution according to the application can be implemented by means of software and/or hardware. "Unit" and "module" in this specification refer to software and/or hardware capable of performing a specific function, either alone or in combination with other components, such as field-programmable gate arrays (Field-Programmable Gate Array, FPGA), integrated circuits (Integrated Circuit, IC), etc.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some service interfaces, devices or units, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash disk, a Read-Only Memory (ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be performed by hardware associated with a program that is stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.

Claims (8)

1. A method of text transfer mapping, the method comprising:
Acquiring text characteristics of a target text, and drawing scene canvas corresponding to the text characteristics;
inputting the scene canvas into a convolution network to perform scene coding, and outputting a scene feature map after the scene coding;
determining a target object to be drawn based on the text features and the scene canvas;
inputting the text feature and the target object to a second text attention device, and outputting a second text attention vector;
inputting the scene feature map and the second text attention vector into a scene convolution network, and outputting a second scene attention vector;
inputting the second scene attention vector, the target object and the second text attention vector into an attribute convolution network, and outputting attribute characteristics of the target object;
and drawing the target object on the scene canvas, and adjusting the target object based on the attribute characteristics to generate a map corresponding to the target text.
2. The method of claim 1, wherein the obtaining text features of the target text and drawing a scene canvas corresponding to the text features comprises:
inputting a target text into a text encoder, and outputting text features corresponding to the target text;
And extracting a scene theme corresponding to the text feature, indexing a scene map corresponding to the scene theme in a preset map index library, and determining the scene map as a scene canvas.
3. The method of claim 1, wherein the determining a target object to be drawn based on the text feature and the scene canvas comprises:
and inputting the text characteristic and the scene characteristic diagram into an object decoder, and outputting the target object.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a drawn history object;
the inputting the text feature and the scene feature map into an object decoder, and outputting the target object includes:
pooling the scene feature map to obtain a pooled first scene attention vector;
inputting the first scene attention vector, the history object and the text feature to a first text attention device, and outputting a first text attention vector;
and inputting the first scene attention vector, the historical objects and the first text attention vector into an object convolution network, and outputting the target object.
5. The method of claim 1, wherein the drawing the target object on the scene canvas and adjusting the target object based on the attribute features to generate a map corresponding to the target text comprises:
indexing a mapping object corresponding to the target object in a preset mapping index library;
and inputting the attribute characteristics, the map object and the scene canvas into a canvas adjustment model, and outputting the map corresponding to the target text.
6. A text-to-map apparatus, the apparatus comprising:
the scene canvas drawing module is used for obtaining text characteristics of the target text and drawing scene canvases corresponding to the text characteristics;
the scene feature map coding module is used for inputting the scene canvas into a convolution network to perform scene coding and outputting a scene feature map after scene coding;
the target object determining module is used for determining a target object to be drawn based on the text characteristics and the scene canvas;
an attribute feature determination module comprising: a text vector output unit, configured to input the text feature and the target object to a second text attention device, and output a second text attention vector; the scene vector output unit is used for inputting the scene feature map and the second text attention vector into a scene convolution network and outputting the second scene attention vector; the attribute feature output unit is used for inputting the second scene attention vector, the target object and the second text attention vector into an attribute convolution network and outputting attribute features of the target object;
And the map generation module is used for drawing the target object on the scene canvas, adjusting the target object based on the attribute characteristics and generating a map corresponding to the target text.
7. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 5.
8. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-5.
CN202010124986.9A 2020-02-27 2020-02-27 Text transfer mapping method and device, storage medium and electronic equipment Active CN111445545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124986.9A CN111445545B (en) 2020-02-27 2020-02-27 Text transfer mapping method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124986.9A CN111445545B (en) 2020-02-27 2020-02-27 Text transfer mapping method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111445545A CN111445545A (en) 2020-07-24
CN111445545B true CN111445545B (en) 2023-08-18

Family

ID=71652646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124986.9A Active CN111445545B (en) 2020-02-27 2020-02-27 Text transfer mapping method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111445545B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448477B (en) * 2021-08-31 2021-11-23 南昌航空大学 Interactive image editing method and device, readable storage medium and electronic equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7400774B2 (en) * 2002-09-06 2008-07-15 The Regents Of The University Of California Encoding and decoding of digital data using cues derivable at a decoder
KR101652009B1 (en) * 2009-03-17 2016-08-29 삼성전자주식회사 Apparatus and method for producing animation of web text
US10152571B1 (en) * 2017-05-25 2018-12-11 Enlitic, Inc. Chest x-ray differential diagnosis system
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5428754A (en) * 1988-03-23 1995-06-27 3Dlabs Ltd Computer system with clock shared between processors executing separate instruction streams
US5802381A (en) * 1995-02-21 1998-09-01 Fuji Xerox Co., Ltd. Text editor for converting text format to correspond to an output method
JP5866064B2 (en) * 2013-04-09 2016-02-17 株式会社日立国際電気 Image search device, image search method, and recording medium
US10074200B1 (en) * 2015-04-22 2018-09-11 Amazon Technologies, Inc. Generation of imagery from descriptive text
CN110019675A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 A kind of method and device of keyword extraction
CN109522975A (en) * 2018-09-18 2019-03-26 平安科技(深圳)有限公司 Handwriting samples generation method, device, computer equipment and storage medium
CN109493400A (en) * 2018-09-18 2019-03-19 平安科技(深圳)有限公司 Handwriting samples generation method, device, computer equipment and storage medium
CN109448132A (en) * 2018-10-25 2019-03-08 北京小米移动软件有限公司 Display control method and device, electronic equipment, computer readable storage medium
CN110532381A (en) * 2019-07-15 2019-12-03 中国平安人寿保险股份有限公司 A kind of text vector acquisition methods, device, computer equipment and storage medium
CN110413788A (en) * 2019-07-30 2019-11-05 携程计算机技术(上海)有限公司 Prediction technique, system, equipment and the storage medium of the scene type of session text
CN110489747A (en) * 2019-07-31 2019-11-22 北京大米科技有限公司 A kind of image processing method, device, storage medium and electronic equipment
CN110705208A (en) * 2019-09-19 2020-01-17 Oppo广东移动通信有限公司 Text display method and device, computer readable storage medium and electronic equipment
CN114970513A (en) * 2022-04-22 2022-08-30 武汉轻工大学 Image generation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fine-grained text to image generation with attentional generative adversarial networks; Tao Xu et al.; 《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》; 20181116; pages 1316-1324 *

Also Published As

Publication number Publication date
CN111445545A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
CN108875807B (en) Image description method based on multiple attention and multiple scales
JP7193252B2 (en) Captioning image regions
CN106980683B (en) Blog text abstract generating method based on deep learning
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN111368993B (en) Data processing method and related equipment
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN112487182A (en) Training method of text processing model, and text processing method and device
CN110084281A (en) Image generating method, the compression method of neural network and relevant apparatus, equipment
CN106973244A (en) Using it is Weakly supervised for image match somebody with somebody captions
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN111475622A (en) Text classification method, device, terminal and storage medium
CN113723166A (en) Content identification method and device, computer equipment and storage medium
WO2023236977A1 (en) Data processing method and related device
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN111881292B (en) Text classification method and device
CN113505193A (en) Data processing method and related equipment
Halvardsson et al. Interpretation of swedish sign language using convolutional neural networks and transfer learning
CN113656563A (en) Neural network searching method and related equipment
CN111079374A (en) Font generation method, device and storage medium
CN114529785A (en) Model training method, video generation method and device, equipment and medium
Malakan et al. Vision transformer based model for describing a set of images as a story
CN115292439A (en) Data processing method and related equipment
Ashrafi et al. Development of image dataset using hand gesture recognition system for progression of sign language translator
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113657272A (en) Micro-video classification method and system based on missing data completion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant