Picture processing method and device and electronic equipment

Info

Publication number
CN110619357A
Authority
CN
China
Prior art keywords
picture
training
word vector
processed
information
Prior art date
Legal status
Granted
Application number
CN201910810070.6A
Other languages
Chinese (zh)
Other versions
CN110619357B (en)
Inventor
胡先军
田凯
李斌
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201910810070.6A
Publication of CN110619357A
Application granted
Publication of CN110619357B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a picture processing method, a picture processing device and electronic equipment. The method comprises the following steps: acquiring a picture to be processed; processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information and the prior information comprises relevancy information between related texts; and generating, according to the target word vector, a target text matched with the context of the picture to be processed. Because the preset model is trained based on prior information that comprises relevancy information between related texts, its prediction capability is strong; the text subsequently determined by the preset model to match the context of a picture can therefore better meet user requirements without manual labeling, so the production efficiency of expression packages can be improved.

Description

Picture processing method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing a picture, and an electronic device.
Background
With the continuous development of terminal technology and communication technology, terminals bring great convenience to people's communication: people can communicate not only by making calls but also through various chat applications, such as audio/video chat, voice chat and text chat.
During text chatting, a user can input not only plain text but also emoji, expressions, emoticons and the like to liven up the chat atmosphere and increase the fun of chatting. Emoticons can include picture emoticons and image-text emoticons (a picture plus a text matched with the picture's context); compared with plain picture emoticons, image-text emoticons express meaning more directly and are more popular with the public. Therefore, to meet users' requirements, more image-text emoticons can be made for users to use.
One key step in making an image-text emoticon is generating a text (such as a humorous text) matched with the context of the picture. In the prior art there is a scheme for generating an image summary (image captioning), but in essence such a scheme only describes what is shown in a picture; as shown in fig. 1, the generated summary is "a girl and a dog", and the scheme cannot generate a text for the picture that matches its context. If users need image-text emoticons, they have to manually label each picture with a text matched with its context, so the emoticon production efficiency is low.
Disclosure of Invention
The embodiment of the invention provides a picture processing method for improving the production efficiency of an emoticon.
Correspondingly, the embodiment of the invention also provides a picture processing device and electronic equipment, which are used for ensuring the realization and application of the method.
In order to solve the above problem, an embodiment of the present invention discloses a picture processing method, which specifically includes: acquiring a picture to be processed; processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of relevant texts; and generating a target text matched with the context of the picture to be processed according to the target word vector.
Optionally, the preset model is an encoder-decoder model, and the processing the picture to be processed according to the preset model to determine the target word vector corresponding to the picture to be processed includes: the encoder extracts the characteristics of the picture to be processed and outputs characteristic information corresponding to the picture to be processed; and the decoder processes the characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the input parameters of the decoder include an input word vector and feature information; the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed, which comprises the following steps: initializing the input word vector; cascading the initialized input word vector and the feature information of the picture to be processed to obtain cascade feature information; and the decoder processing the cascade feature information and outputting a target word vector corresponding to the picture to be processed.
Optionally, the decoder is a bidirectional network, and the decoder processes the cascade feature information and outputs a target word vector corresponding to the picture to be processed, including: the decoder performs forward operation on the cascade feature information to obtain a corresponding forward word vector; the decoder performs reverse operation on the cascade feature information to obtain a corresponding reverse word vector; and taking, of the forward word vector and the reverse word vector, the one with the higher probability as the target word vector corresponding to the picture to be processed.
Optionally, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the reverse operation process, each unit outputs a reverse word vector with the maximum probability.
Optionally, there are a plurality of target word vectors; generating, according to the target word vectors, a target text matched with the context of the picture to be processed comprises: converting each target word vector into a corresponding text, and determining the output order corresponding to each target word vector; and splicing the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
Optionally, the method further comprises: and synthesizing the picture to be processed and the target text to generate a corresponding expression package.
Optionally, the method further comprises the step of training the preset model: collecting corpora, establishing relevancy information among texts in the corpora, and determining prior information according to the relevancy information; collecting training data, wherein the training data comprises training texts and training pictures; and pre-loading the prior information by a preset model, and training by adopting the training data.
Optionally, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: the encoder performs feature extraction on the first training picture and outputs first training feature information corresponding to the first training picture; the decoder preloads the prior information, processes the first training characteristic information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputs a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
Optionally, the training pictures include a second training picture, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a reference word vector corresponding to the second training picture; the encoder performs feature extraction on the second training picture and outputs second training feature information corresponding to the second training picture; the decoder preloads the prior information, processes second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputs a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
Optionally, the training text comprises a second training text, and the preset model is an encoder-decoder model; the preset model is preloaded with the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a third training picture corresponding to the second training text; the encoder performs feature extraction on the third training picture and outputs third training feature information corresponding to the third training picture; the decoder preloads the prior information, processes the third training feature information and a reference word vector of the second training text based on the prior information, and outputs a third training word vector and a corresponding third probability; and adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
Optionally, the processing, based on the prior information, the first training feature information corresponding to the first training picture and the reference word vector of the first training text, and outputting a first training word vector and a corresponding first probability includes: cascading first training characteristic information corresponding to the first training picture with a reference word vector of a first training text to obtain corresponding first cascading training characteristic information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
The embodiment of the invention also discloses a picture processing device, which specifically comprises: the image acquisition module is used for acquiring an image to be processed; the word vector determination module is used for processing the picture to be processed according to a preset model and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of related texts; and the text generation module is used for generating a target text matched with the context of the picture to be processed according to the target word vector.
Optionally, the preset model is an encoder-decoder model, and the word vector determination module includes: the characteristic extraction submodule is used for calling the encoder to extract the characteristics of the picture to be processed and outputting the characteristic information corresponding to the picture to be processed; and the word vector processing submodule is used for calling the decoder to process the characteristic information and outputting a target word vector corresponding to the picture to be processed.
Optionally, the input parameters of the decoder include an input word vector and feature information; the word vector processing submodule comprises: the initialization unit is used for initializing the input word vector; the cascade unit is used for cascading the initialized input word vector and the feature information of the picture to be processed to obtain cascade feature information; and the word vector output unit is used for calling the decoder to process the cascade feature information and outputting a target word vector corresponding to the picture to be processed.
Optionally, the decoder is a bidirectional network, and the word vector output unit is configured to invoke the decoder to perform forward operation on the cascade feature information to obtain a corresponding forward word vector; invoke the decoder to perform reverse operation on the cascade feature information to obtain a corresponding reverse word vector; and take, of the forward word vector and the reverse word vector, the one with the higher probability as the target word vector corresponding to the picture to be processed.
Optionally, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the reverse operation process, each unit outputs a reverse word vector with the maximum probability.
Optionally, there are a plurality of target word vectors; the text generation module is used for converting each target word vector into a corresponding text and determining the output order corresponding to each target word vector; and splicing the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
Optionally, the apparatus further comprises: and the expression package generating module is used for synthesizing the picture to be processed and the target text to generate a corresponding expression package.
Optionally, the apparatus further comprises: the prior information determining module is used for collecting the linguistic data, establishing relevancy information among texts in the linguistic data, and determining prior information according to the relevancy information; the data collection module is used for collecting training data, and the training data comprises training texts and training pictures; and the training module is used for pre-loading the prior information by a preset model and training by adopting the training data.
Optionally, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the training module comprises: the first preset model training submodule is used for calling the encoder to perform feature extraction on the first training picture and outputting first training feature information corresponding to the first training picture; calling the decoder to preload the prior information, processing first training characteristic information corresponding to the first training picture and a reference word vector of a first training text based on the prior information, and outputting a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
Optionally, the training pictures include a second training picture, and the preset model is an encoder-decoder model; the training module comprises: the second preset model training submodule is used for constructing a reference word vector corresponding to the second training picture; calling the encoder to perform feature extraction on the second training picture, and outputting second training feature information corresponding to the second training picture; calling the decoder to preload the prior information, processing second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputting a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
Optionally, the training text comprises a second training text, and the preset model is an encoder-decoder model; the training module comprises: the third preset model training submodule is used for constructing a third training picture corresponding to the second training text; calling the encoder to perform feature extraction on the third training picture, and outputting third training feature information corresponding to the third training picture; calling the decoder to preload the prior information, processing the third training feature information and a reference word vector of the second training text based on the prior information, and outputting a third training word vector and a corresponding third probability; and adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
Optionally, the first preset model training sub-module is configured to cascade first training feature information corresponding to the first training picture and a reference word vector of a first training text to obtain corresponding first cascade training feature information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the picture processing method according to any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, and the one or more programs include instructions for: acquiring a picture to be processed; processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of relevant texts; and generating a target text matched with the context of the picture to be processed according to the target word vector.
Optionally, the preset model is an encoder-decoder model, and the processing the picture to be processed according to the preset model to determine the target word vector corresponding to the picture to be processed includes: the encoder extracts the characteristics of the picture to be processed and outputs characteristic information corresponding to the picture to be processed; and the decoder processes the characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the input parameters of the decoder include an input word vector and feature information; the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed, which comprises the following steps: initializing the input word vector; cascading the initialized input word vector and the feature information of the picture to be processed to obtain cascade feature information; and the decoder processing the cascade feature information and outputting a target word vector corresponding to the picture to be processed.
Optionally, the decoder is a bidirectional network, and the decoder processes the cascade feature information and outputs a target word vector corresponding to the picture to be processed, including: the decoder performs forward operation on the cascade feature information to obtain a corresponding forward word vector; the decoder performs reverse operation on the cascade feature information to obtain a corresponding reverse word vector; and taking, of the forward word vector and the reverse word vector, the one with the higher probability as the target word vector corresponding to the picture to be processed.
Optionally, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the reverse operation process, each unit outputs a reverse word vector with the maximum probability.
Optionally, there are a plurality of target word vectors; generating, according to the target word vectors, a target text matched with the context of the picture to be processed comprises: converting each target word vector into a corresponding text, and determining the output order corresponding to each target word vector; and splicing the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
Optionally, the one or more programs further include instructions for: synthesizing the picture to be processed and the target text to generate a corresponding expression package.
Optionally, the method further comprises the following steps of training the preset model: collecting corpora, establishing relevancy information among texts in the corpora, and determining prior information according to the relevancy information; collecting training data, wherein the training data comprises training texts and training pictures; and pre-loading the prior information by a preset model, and training by adopting the training data.
Optionally, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: the encoder performs feature extraction on the first training picture and outputs first training feature information corresponding to the first training picture; the decoder preloads the prior information, processes the first training characteristic information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputs a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
Optionally, the training pictures include a second training picture, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a reference word vector corresponding to the second training picture; the encoder performs feature extraction on the second training picture and outputs second training feature information corresponding to the second training picture; the decoder preloads the prior information, processes second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputs a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
Optionally, the training text comprises a second training text, and the preset model is an encoder-decoder model; the preset model is preloaded with the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a third training picture corresponding to the second training text; the encoder performs feature extraction on the third training picture and outputs third training feature information corresponding to the third training picture; the decoder preloads the prior information, processes the third training feature information and a reference word vector of the second training text based on the prior information, and outputs a third training word vector and a corresponding third probability; and adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
Optionally, the processing, based on the prior information, the first training feature information corresponding to the first training picture and the reference word vector of the first training text, and outputting a first training word vector and a corresponding first probability includes: cascading first training characteristic information corresponding to the first training picture with a reference word vector of a first training text to obtain corresponding first cascading training characteristic information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, a picture to be processed can be obtained, the picture to be processed is processed according to a preset model, a target word vector corresponding to the picture to be processed is determined, and a target text matched with the context of the picture to be processed is then generated according to the target word vector. Because the preset model is trained based on prior information that comprises relevancy information between related texts, its prediction capability is strong; the text subsequently determined by the preset model to match the context of a picture can therefore better meet user requirements without manual labeling, so the production efficiency of expression packages can be improved.
Drawings
FIG. 1 is a diagram of a prior art picture;
FIG. 2 is a flowchart illustrating steps of an embodiment of a method for processing pictures according to the present invention;
FIG. 3a is a flowchart illustrating the steps of a method for training a preset model according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of an encoder-decoder model of the present invention;
FIG. 3c is a schematic diagram of a cascade process of the present invention;
FIG. 4 is a flow chart of steps in an alternative embodiment of a picture processing method of the present invention;
FIG. 5 is a schematic view of an emoticon of the present invention;
FIG. 6 is a block diagram of an embodiment of a picture processing apparatus according to the present invention;
FIG. 7 is a block diagram of an alternative embodiment of a picture processing apparatus according to the present invention;
FIG. 8 illustrates a block diagram of an electronic device for picture processing in accordance with an exemplary embodiment;
fig. 9 is a schematic structural diagram of an electronic device for picture processing according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core ideas of the embodiment of the invention is that a text matched with the context of a picture to be processed is determined according to a preset model; the preset model is trained based on prior information containing relevancy information between related texts, so its prediction capability is strong. Texts determined by the preset model to match picture contexts can therefore better meet users' requirements without manual labeling, thereby improving the production efficiency of expression packages.
Referring to fig. 2, a flowchart illustrating the steps of an embodiment of a picture processing method of the present invention is shown; the method may specifically include the following steps:
step 202, obtaining a picture to be processed.
In the embodiment of the invention, a picture for which a text matched with its context is to be determined can be obtained and taken as the picture to be processed; a text matched with the context of the picture to be processed is then determined. The picture to be processed may be a picture set by a user, a picture in a local album, a picture collected by the user, a network-side picture, and the like, which is not limited in this embodiment of the present invention. The pictures to be processed may be of various types, such as cartoons and screenshots of character expressions in film and television works, which is also not limited in this embodiment of the present invention.
Step 204, processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information between related texts.
In the embodiment of the present invention, a preset model may be trained in advance, and then the preset model is used to process the to-be-processed picture, for example, feature extraction is performed first, and then a word vector (which may be referred to as a target word vector subsequently) corresponding to a text that is matched with the context of the to-be-processed picture is determined based on the extracted features.
In the training process, relevancy information between any two texts (characters, words or sentences) can be determined in advance; the relevancy information can be used for representing the degree of relevancy between the semantics of the two texts. In an example of the present invention, the relevancy information may be represented by distance information (which may have multiple dimensions) between the two texts; in this case the distance is inversely proportional to the degree of relevancy, so relevancy information smaller than a threshold may be determined as prior information. The threshold may be set as required, which is not limited in this embodiment of the present invention. For example, suppose the corpus contains three texts roughly meaning "I'm so sad", "I'm so heartbroken" and "I'm so happy", and relevancy information is computed for each pair: R1 between "I'm so sad" and "I'm so heartbroken", R2 between "I'm so sad" and "I'm so happy", and R3 between "I'm so heartbroken" and "I'm so happy". Obviously, the semantics of "I'm so sad" and "I'm so heartbroken" are highly related, while both are weakly related to "I'm so happy", so R1 is smaller than R2 and R3; if R1 is smaller than the threshold, R1 can be used as prior information. Then, before the preset model is trained with training data, it can be preloaded with the prior information so that it learns the relevancy information of related texts, and training with the training data is carried out on that basis. Training on the basis of the relevancy information can improve the prediction capability of the trained preset model and correspondingly the accuracy of determining the target word vector, thereby improving the efficiency of making expression packages; the specific training process is described later.
And step 206, generating a target text matched with the context of the picture to be processed according to the target word vector.
Then, a text corresponding to the target word vector can be searched based on the dictionary and determined as the target text matched with the context of the picture to be processed. When there is one target word vector, the text corresponding to that word vector is determined as the target text; when there are a plurality of target word vectors, the texts corresponding to the word vectors can be combined into one text to obtain the target text.
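To make this lookup concrete, a minimal Python sketch follows, under the assumption that the dictionary (lexicon) maps each word to its reference word vector and that nearest-neighbor distance decides the match; all names and values are illustrative, not from the patent.

```python
import numpy as np

def vectors_to_text(target_vectors, lexicon):
    """Map each target word vector to its nearest lexicon entry, then splice
    the words in output order into the target text (step 206)."""
    words = list(lexicon.keys())
    mat = np.stack([lexicon[w] for w in words])        # (vocab, dim)
    pieces = []
    for vec in target_vectors:                         # already in output order
        dists = np.linalg.norm(mat - vec, axis=1)      # distance to every word
        pieces.append(words[int(dists.argmin())])
    return " ".join(pieces)  # Chinese text would be spliced without separators

lexicon = {"stay": np.array([0.9, 0.1]), "away": np.array([0.1, 0.9])}
print(vectors_to_text([np.array([0.8, 0.2]), np.array([0.2, 0.8])], lexicon))
# -> "stay away"
```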
In an example of the present invention, as shown in fig. 1, a picture to be processed may be obtained and processed according to the preset model, the target word vector corresponding to the picture to be processed is determined, and a target text matched with the context of the picture to be processed, such as "stay away from me", is generated according to the target word vector.
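Later steps of the method synthesize the picture and the generated target text into an expression package; a minimal sketch with Pillow is shown below, where the text position and color are illustrative assumptions.

```python
from PIL import Image, ImageDraw

def make_emoticon(picture_path, target_text, out_path):
    """Overlay the generated target text on the picture to form an
    image-text expression package."""
    img = Image.open(picture_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    # Render the generated caption along the bottom edge of the picture.
    draw.text((w // 10, int(h * 0.85)), target_text, fill="white")
    img.save(out_path)

# make_emoticon("girl_and_dog.png", "stay away from me", "emoticon.png")  # hypothetical files
```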
In summary, in the embodiment of the present invention, a picture to be processed may be obtained, the picture to be processed is processed according to a preset model, a target word vector corresponding to the picture to be processed is determined, and a target text matched with the context of the picture to be processed is generated according to the target word vector. Because the preset model is trained based on prior information that comprises relevancy information between related texts, its prediction capability is strong; the text subsequently determined by the preset model to match the context of a picture can therefore better meet user requirements without manual labeling, so the production efficiency of expression packages can be improved.
In another embodiment of the present invention, the preset model may be an encoder-decoder model, where the encoder may be a CNN (Convolutional Neural Network) or another network, which is not limited in this embodiment of the present invention; the decoder may be a Recurrent Neural Network (RNN), such as a basic RNN, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU), and other networks may of course be used, which is also not limited in this embodiment of the present invention. In an example of the present invention, the network corresponding to the decoder may be a bidirectional network, so as to improve the prediction capability of the preset model; the generated target text matched with the context of the picture to be processed is then of higher quality, which further improves the production efficiency of emoticons.
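A minimal PyTorch sketch of this architecture follows; the layer sizes are illustrative assumptions, and it collapses the patent's unit-by-unit chaining into a standard bidirectional LSTM layer for brevity.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """CNN encoder: picture in, feature information X out."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, img):                  # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))   # X: (B, feat_dim)

class BiLSTMDecoder(nn.Module):
    """Bidirectional LSTM decoder over cascaded (X, word vector) inputs."""
    def __init__(self, feat_dim=256, emb_dim=128, vocab=5000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + emb_dim, 256,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 256, vocab)

    def forward(self, X, word_vecs):         # word_vecs: (B, T, emb_dim)
        T = word_vecs.size(1)
        # Cascade the picture features with each input word vector W1..WT.
        cascaded = torch.cat([X.unsqueeze(1).expand(-1, T, -1), word_vecs], dim=-1)
        h, _ = self.lstm(cascaded)
        return self.out(h)                   # per-step vocabulary logits
```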
In the embodiment of the invention, prior information and training data can be obtained in advance in the training process, then a preset model can be preloaded with the prior information, and then the training data is adopted for training; the method comprises the following specific steps:
referring to fig. 3a, a flowchart illustrating steps of an embodiment of a method for training a preset model according to the present invention is shown, which may specifically include the following steps:
step 302, collecting corpora, establishing relevancy information among texts in the corpora, and determining prior information according to the relevancy information.
In the embodiment of the invention, a large amount of linguistic data can be collected, and then prior information is generated based on the collected linguistic data; the manner of collecting the corpus may include multiple manners, for example, hotspot information may be collected from a social platform as the corpus, hotspot comments may be collected from a forum as the corpus, input information of each user may be collected as the corpus, and the like, which is not limited in this embodiment of the present invention.
In an example of the present invention, distance information between any two texts in the corpus may be calculated and used as the relevancy information of the two texts: the word vectors corresponding to the two texts are determined respectively, and a natural language processing model is then used to compute the distance information between the two word vectors. The distance may cover multiple dimensions and may be a multi-dimensional matrix. Each piece of distance information can then be compared with a threshold to find the distance information smaller than the threshold, and the prior information is generated from the distance information smaller than the threshold, as sketched below.
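A sketch of this step, under the assumption that each text is embedded as a single vector and Euclidean distance stands in for the (inverse) relevancy; embed() is a hypothetical stand-in for the natural language processing model.

```python
import itertools
import numpy as np

def build_prior(corpus, embed, threshold=1.0):
    """Keep only text pairs whose distance falls below the threshold;
    smaller distance = higher relevancy, per the description above."""
    prior = []
    for a, b in itertools.combinations(corpus, 2):
        dist = float(np.linalg.norm(embed(a) - embed(b)))
        if dist < threshold:
            prior.append((a, b, dist))
    return prior

# Toy embedding that just counts sentiment words, for illustration only.
corpus = ["I'm so sad", "I'm so heartbroken", "I'm so happy"]
embed = lambda t: np.array([t.count("sad") + t.count("heartbroken"),
                            t.count("happy")], dtype=float)
print(build_prior(corpus, embed))   # keeps only the (sad, heartbroken) pair
```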
Step 304, collecting training data, wherein the training data comprises training texts and training pictures.
In the embodiment of the present invention, a large number of training pictures and training texts may be collected as training data. The training pictures may include first training pictures and second training pictures, and the training texts may include first training texts and second training texts, where a first training text is a text matched with the context of a corresponding first training picture, no training text matches the context of a second training picture, and a second training text does not match the context of any training picture. Training data may be collected in various ways; for example, popular emoticons, popular jokes and the like may be collected from social platforms and forums as training data, or emoticons, jokes and the like may be collected from users' input information as training data, which is not limited in this embodiment of the present invention.
In addition, the training texts can be preprocessed to generate a corresponding lexicon: all the training texts can be de-duplicated, and each de-duplicated text is then segmented to obtain the corresponding word segments; the word vector corresponding to each word segment (subsequently usable as a reference word vector) is determined, and the lexicon is generated from the obtained reference word vectors, as sketched below. The lexicon may thus comprise the reference word vectors corresponding to the first training texts and the reference word vectors corresponding to the second training texts.
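A sketch of this preprocessing; segment() and vectorize() are hypothetical stand-ins for a Chinese word segmenter and a word-embedding lookup.

```python
def build_lexicon(training_texts, segment, vectorize):
    """De-duplicate texts, segment them into words, and map each word
    to its reference word vector."""
    lexicon = {}
    for text in set(training_texts):             # de-duplication
        for word in segment(text):               # word segmentation
            if word not in lexicon:
                lexicon[word] = vectorize(word)  # reference word vector
    return lexicon

# lexicon = build_lexicon(texts, segment=jieba.lcut, vectorize=w2v.__getitem__)  # hypothetical
```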
The preset model may then be trained using the first training pictures with the first training texts, the second training pictures, and the second training texts, respectively, as shown in step 306.
And 306, preloading the prior information by a preset model, and training by adopting the training data.
In the embodiment of the invention, the preset model can pre-load the prior information so as to integrate the prior information into the preset model, so that the preset model can learn the relevancy information of the relevant text and has the prior capability corresponding to the prior information; then, on the basis of the prior information, training is carried out by adopting training data, and the prediction capability of the preset model can be further improved.
The following describes the training process of the encoder-decoder model, taking a decoder that is a bidirectional LSTM network as an example; the encoder-decoder model is shown in fig. 3b. In fig. 3b the encoder output is X (the feature information of a picture), and the decoder takes two input parameters: the input word vectors (shown as W1-W4) and the feature information (such as X); the output of the decoder is the target word vectors shown as T1-T4. The numbers of W and T may be set as required, which is not limited in this embodiment of the present invention.
In one example of the present invention, training the encoder-decoder model using the first training picture and the first training text may include the following sub-steps 22-26:
and a substep 22, performing feature extraction on the first training picture by the encoder, and outputting first training feature information corresponding to the first training picture.
In the embodiment of the present invention, the first training picture may be input into the encoder, the encoder performs feature extraction on the first training picture, obtains first training feature information corresponding to the first training picture, such as X in fig. 3b, and then inputs the first training feature information into the decoder. The first training feature information may be a matrix of M × N, where M and N are positive integers.
And a substep 24, preloading the prior information by the decoder, processing the first training feature information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputting a first training word vector and a corresponding first probability.
And a substep 26 of adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
In the embodiment of the invention, the decoder can preload the prior information before the encoder outputs the first training feature information, and the reference word vector corresponding to the first training text can be searched from the pre-established lexicon. Then, after the prior information is loaded, the reference word vector corresponding to the first training text is taken as the input word vector and the first training feature information corresponding to the first training picture is taken as the input feature information; the first training feature information and the reference word vector of the first training text are processed based on the prior information, and the first training word vector and the corresponding first probability are output.
The dimensionality of the first training feature information corresponding to the first training picture and the dimensionality of the reference word vector corresponding to the first training text may be different; therefore, after the dimensionalities of the first training feature information and the reference word vector are unified, the first training feature information and the corresponding reference word vector are cascaded, for example by row cascading, and the decoder then processes the cascaded training feature information.
In an example of the present invention, as shown in fig. 3c, the first training feature information of the first training picture and the reference word vectors corresponding to the first training text may be cascaded as follows. The plurality of reference word vectors corresponding to the first training text are obtained, and mean-removal processing is performed on each word vector to avoid excessive differences among the numerical values of its dimensions; the mean value of the word vectors can be calculated and subtracted from the value of each dimension of each word vector. If the first training feature information and the reference word vectors need to be row-cascaded, the number of columns of the matrix corresponding to the first training feature information is compared with the number of columns of the reference word vectors; if the first training feature information has more columns, the reference word vectors are widened to the same number of columns and the newly added columns are filled; the two are then cascaded, as sketched below. A method of column-cascading the first training feature information and the reference word vectors may also be used, which is similar to row cascading and is not described here again.
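A numpy sketch of the row cascade, assuming zero filling for the newly added columns (the patent only says they are filled):

```python
import numpy as np

def row_cascade(feat, word_vecs):
    """feat: (M, N) picture feature matrix; word_vecs: (K, D) reference vectors."""
    word_vecs = word_vecs - word_vecs.mean()       # mean-removal processing
    n, d = feat.shape[1], word_vecs.shape[1]
    if d < n:                                      # widen the word vectors
        word_vecs = np.pad(word_vecs, ((0, 0), (0, n - d)))
    elif n < d:                                    # widen the feature matrix
        feat = np.pad(feat, ((0, 0), (0, d - n)))
    return np.vstack([feat, word_vecs])            # row cascade
```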
The first training text may include a plurality of word segments, each word segment has a corresponding reference word vector, and the arrangement order of the reference word vectors corresponds to the arrangement order of the segments in the first training text. For example, the first training text "I am a bit sad today" may include four word segments, "I", "today", "a bit" and "sad", corresponding in sequence to reference word vectors A1, A2, A3 and A4, with A1 arranged before A2, A2 before A3 and A3 before A4.
The decoder may include a plurality of units, where the input of the first unit is the first training feature information together with the first reference word vector corresponding to the first training text, and the input of each other unit is the output of the previous unit together with the reference word vector in the corresponding order; the output of each unit can be referred to as a first training word vector, such as G11/G12/G13/G14 in fig. 3b.
As shown in fig. 3b, the decoder includes 4 units; during training, the input word vectors W1, W2, W3 and W4 correspond to A1, A2, A3 and A4:
in the forward training process: input of the first unit: inputting a word vector W1 and first training feature information, wherein the output of the first unit is a first training word vector G11; input of the second unit: g11 and an input word vector W2, the output of the second unit being a first training word vector G12; input of the third unit: g12 and an input word vector W3, the output of the third unit being a first training word vector G13; input to the fourth unit: g13 and an input word vector W4, the output of the fourth unit being a first training word vector G14.
In the reverse training process: input of the fourth unit: the first training feature information and the input word vector W1, the output of the fourth unit being a first training word vector G24; input of the third unit: G24 and the input word vector W2, the output of the third unit being a first training word vector G23; input of the second unit: G23 and the input word vector W3, the output of the second unit being a first training word vector G22; input of the first unit: G22 and the input word vector W4, the output of the first unit being a first training word vector G21.
Each unit may output a plurality of first training word vectors and the first probabilities corresponding to the respective first training word vectors, where the first probabilities may be determined by using a softmax function; the first K first training word vectors with the largest first probabilities are then output to the next unit. K may be a positive integer and may be set as required, which is not limited in this embodiment of the present invention.
In addition, for the first training word vector output in the forward training process and the first training word vector output in the reverse training process of each unit, a softmax function can also be adopted to determine a corresponding first probability; and then, based on the first training word vector output in the forward training process and the first training word vector output in the reverse training process, the first training word vector with the largest first probability is selected as the final output of the unit, namely the target word vector, such as T1/T2/T3/T4.
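A sketch of this final per-unit selection, assuming the forward and reverse passes each yield per-step logits (for example, the two directional halves of a bidirectional LSTM); the top-K passing between units is omitted for brevity.

```python
import torch

def pick_targets(fwd_logits, rev_logits):
    """For each step, keep whichever direction's best word has the higher
    softmax probability, mirroring the forward/reverse selection above."""
    fwd_p = torch.softmax(fwd_logits, dim=-1)   # (T, vocab)
    rev_p = torch.softmax(rev_logits, dim=-1)
    targets = []
    for t in range(fwd_p.size(0)):
        fp, fi = fwd_p[t].max(dim=-1)           # best forward word and prob
        rp, ri = rev_p[t].max(dim=-1)           # best reverse word and prob
        targets.append(int(fi) if fp >= rp else int(ri))
    return targets                              # word indices T1..Tn
```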
In the training process, a loss function such as a cross-entropy function can be determined according to the first training word vector output by each unit and the corresponding first probability; reverse derivation is performed according to the loss function and the weights of the encoder-decoder model are adjusted. The reverse derivation according to the loss function and the adjustment of the weights continue until T1, T2, T3 and T4 are at minimum distance from the corresponding W1, W2, W3 and W4.
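A sketch of one such weight update, reusing the encoder/decoder names from the earlier model sketch; the optimizer choice is an assumption.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, picture, word_vecs, target_ids):
    X = encoder(picture)                        # training feature information
    logits = decoder(X, word_vecs)              # (B, T, vocab)
    # Cross-entropy loss over the per-step word probabilities.
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                             # reverse derivation
    optimizer.step()                            # adjust the model weights
    return float(loss)

# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
```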
In one embodiment of the present invention, the output result of each unit may be taken as the result after a disturbance variable is added, which introduces an offset, increases the randomness of the generated results, and helps avoid local minima. In one example, a formula of the following form may be used:
f(p_i) = p_i^(1/T) / Σ_j p_j^(1/T)

wherein T is the disturbance variable, p_i is the first probability corresponding to the i-th first training word vector, p_j is the first probability corresponding to the j-th first training word vector, and f(p_i) is the probability of the i-th first training word vector after the disturbance variable is added.
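A sketch of this perturbation under the form given above; sampling from the perturbed distribution instead of taking the argmax is what adds the randomness.

```python
import numpy as np

def perturb(probs, T=1.5):
    """Reweight probabilities with disturbance variable T (assumed form above)."""
    p = np.asarray(probs, dtype=float) ** (1.0 / T)
    return p / p.sum()

rng = np.random.default_rng(0)
p = perturb([0.7, 0.2, 0.1])
print(p, rng.choice(3, p=p))   # perturbed distribution and one sampled index
```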
In an embodiment of the present invention, training the encoder-decoder model using the second training picture may include the following sub-steps:
and a substep 42 of constructing a reference word vector corresponding to the second training picture.
And a substep 44, performing feature extraction on the second training picture by the encoder, and outputting second training feature information corresponding to the second training picture.
And a substep 46, preloading the prior information by the decoder, processing second training feature information and a reference word vector corresponding to the second training picture based on the prior information, and outputting a second training word vector and a corresponding second probability.
And a substep 48, adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
In the embodiment of the invention, the second training picture differs from the first training picture in that no text in the training texts matches its context; therefore, a corresponding reference word vector may be constructed for the second training picture. For example, if the number of input word vectors in fig. 3b is 4, four reference word vectors may be constructed for the second training picture, such as B1, B2, B3 and B4, where B1, B2, B3 and B4 may all be vectors whose every dimension has the value 1. To facilitate cascading the feature information of the second training picture with these reference word vectors, the dimensions of the reference word vectors may be configured to match the dimensions of the feature information corresponding to the second training picture, for example the same number of rows or columns as the corresponding feature matrix. B1, B2, B3 and B4 then correspond to the input word vectors W1, W2, W3 and W4, respectively, and the encoder-decoder model is trained with the second training picture and the correspondingly constructed reference word vectors; the training process is similar to the above sub-steps 22-26 and is not repeated here.
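A sketch of sub-step 42's constructed reference word vectors; the count and dimension are illustrative assumptions.

```python
import numpy as np

def build_picture_reference(num_vectors=4, dim=256):
    """All-ones placeholder reference word vectors (B1..B4) for a training
    picture that has no context-matched text."""
    return [np.ones(dim) for _ in range(num_vectors)]
```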
In one embodiment of the present invention, training the encoder-decoder model using the second training text may comprise the following sub-steps:
and a substep 62 of constructing a third training picture corresponding to the second training text.
And a substep 64, performing feature extraction on the third training picture by the encoder, and outputting third training feature information corresponding to the third training picture.
And a substep 66, preloading the prior information by the decoder, processing the third training feature information and the reference word vector of the second training text based on the prior information, and outputting a third training word vector and a corresponding third probability.
And a substep 68 of adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
In the embodiment of the invention, the second training text differs from the first training text in that it does not match the context of any training picture; therefore, a third training picture corresponding to the second training text may be constructed. For example, a picture in which every pixel is set to a set value, such as 1, is constructed as the third training picture corresponding to the second training text; the set value may be chosen as required, which is not limited in this embodiment of the present invention. The encoder-decoder model may then be trained using the second training text and the correspondingly constructed third training picture; the training process is similar to the above sub-steps 22-26 and is not described here again.
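A sketch of sub-step 62's constructed placeholder picture; the size is an illustrative assumption, the set value follows the example above.

```python
import numpy as np

def build_placeholder_picture(height=64, width=64, channels=3, value=1):
    """Constant-valued third training picture for a text that matches the
    context of no real training picture."""
    return np.full((height, width, channels), value, dtype=np.uint8)
```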
In summary, in the embodiment of the present invention, corpora may be collected, relevancy information between texts in the corpora may be established, and prior information may be determined according to the relevancy information; training data comprising training texts and training pictures is collected; the preset model is then preloaded with the prior information and trained with the training data. The preset model can thus be trained with the training data on the basis of the learned relevancy information of related texts, which improves the prediction capability of the preset model.
Secondly, in the embodiment of the invention, for a second training picture for which the training text contains no contextually matched text, a corresponding reference word vector may be constructed during training, and the second training picture together with the correspondingly constructed reference word vector is used as training data to train the preset model; for a second training text that matches none of the training pictures in context, a corresponding third training picture may be constructed, and the second training text together with the correspondingly constructed third training picture is used as training data to train the preset model. This enriches the training data, improves the prediction capability of the preset model, and further improves the production efficiency of expression packages.
In this embodiment of the present invention, processing the first training feature information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputting a first training word vector and a corresponding first probability, includes: cascading the first training feature information corresponding to the first training picture with the reference word vector of the first training text to obtain corresponding first cascaded training feature information; and processing the first cascaded training feature information based on the prior information, and outputting a first training word vector and a corresponding first probability. Semi-supervised training of the preset model is thereby realized, which saves labor cost and improves training efficiency.
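One such training step might look as follows; the decoder structure, vocabulary size, and all dimensions here are assumptions for illustration, not the embodiment's actual implementation:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one training step, assuming an LSTM decoder with a linear
# output layer; `vocab_size` and all dimensions are illustrative assumptions.
vocab_size, feature_dim, num_steps = 5000, 256, 4
decoder = torch.nn.LSTM(input_size=2 * feature_dim, hidden_size=feature_dim,
                        batch_first=True)
to_vocab = torch.nn.Linear(feature_dim, vocab_size)

features = torch.randn(feature_dim)                 # first training feature information
word_vectors = torch.randn(num_steps, feature_dim)  # reference word vectors of the text

# Cascade, decode, and adjust the weights from the output probabilities.
cascaded = torch.cat([features.expand(num_steps, -1), word_vectors], dim=1)
hidden, _ = decoder(cascaded.unsqueeze(0))          # (1, 4, feature_dim)
logits = to_vocab(hidden)                           # (1, 4, vocab_size)

targets = torch.randint(vocab_size, (1, num_steps)) # ground-truth word ids
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                     # gradients adjust the model weights
```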
Referring to fig. 4, a flowchart illustrating steps of an alternative embodiment of the picture processing method of the present invention is shown, which may specifically include the following steps:
Step 402: acquire a picture to be processed.
In the embodiment of the invention, if an image-text emoticon needs to be generated from a picture, the picture may be acquired and determined as the picture to be processed; the picture to be processed may then be processed according to a preset model to determine a target word vector corresponding to the picture to be processed, which may be implemented through steps 404 to 406:
In an example of the present invention, the preset model may be an encoder-decoder model; the picture to be processed may then be input into the encoder-decoder model, which determines the target word vector corresponding to the picture to be processed.
Step 404: the encoder performs feature extraction on the picture to be processed and outputs feature information corresponding to the picture to be processed.
The picture to be processed may be input into the encoder; the encoder may perform feature extraction on the picture to be processed to extract the corresponding feature information, and then output the feature information to the decoder.
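A minimal encoder sketch follows; the embodiment does not fix an architecture, so the ResNet-18 backbone here is only an illustrative assumption:

```python
import torch
import torchvision

# Minimal encoder sketch: a CNN backbone extracts feature information from
# the picture to be processed. ResNet-18 is an assumption, not the patent's choice.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()        # keep the features, drop the classifier head

picture = torch.randn(1, 3, 224, 224)    # stand-in for the picture to be processed
feature_info = backbone(picture)         # (1, 512) feature information for the decoder
```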
Step 406: the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed.
In an example of the present invention, the decoder may be a bidirectional LSTM network, and the encoder-decoder model may be as shown in fig. 3B. During use, the input word vectors W1-W4 may be initialized, and the initialized input word vectors may then be processed together with the feature information of the picture to be processed; step 406 may include the following sub-steps:
Substep 82: initialize the input word vector.
Substep 84: cascade the initialized input word vector with the feature information of the picture to be processed to obtain cascaded feature information.
Substep 86: the decoder processes the cascaded feature information and outputs the target word vector corresponding to the picture to be processed.
Therefore, the input word vector may be initialized first, where the initialization may use default values or random values, which is not limited in the embodiment of the present invention; the feature information is then cascaded with the initialized input word vector in a manner similar to that used during training, which is not repeated here. The decoder then processes the cascaded feature information and outputs the target word vector corresponding to the picture to be processed, as in the following sub-steps:
Substep S2: the decoder performs a forward operation on the cascaded feature information to obtain corresponding forward word vectors.
Substep S4: the decoder performs a reverse operation on the cascaded feature information to obtain corresponding reverse word vectors.
Substep S6: for each unit, whichever of the forward word vector and the reverse word vector has the higher probability is taken as the target word vector corresponding to the picture to be processed.
In an optional embodiment of the present invention, during the forward operation, each unit outputs the forward word vector with the highest probability; during the reverse operation, each unit outputs the reverse word vector with the highest probability.
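A hedged sketch of this bidirectional selection rule, with all sizes assumed for illustration:

```python
import torch

# Hedged sketch: decode the cascaded feature information forward and backward,
# take each unit's highest-probability word in each direction, then keep
# whichever direction scored higher per unit. All sizes are assumptions.
vocab_size, num_steps, in_dim = 5000, 4, 512
lstm = torch.nn.LSTM(in_dim, vocab_size, batch_first=True, bidirectional=True)

cascaded = torch.randn(1, num_steps, in_dim)   # cascaded feature information
out, _ = lstm(cascaded)                        # (1, 4, 2 * vocab_size)
fwd, rev = out[..., :vocab_size], out[..., vocab_size:]

fwd_prob, fwd_ids = fwd.softmax(-1).max(-1)    # best forward word per unit
rev_prob, rev_ids = rev.softmax(-1).max(-1)    # best reverse word per unit
target_ids = torch.where(fwd_prob >= rev_prob, fwd_ids, rev_ids)  # per-unit winner
```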
Step 408: generate a target text matched with the context of the picture to be processed according to the target word vector.
In the embodiment of the invention, a dictionary may be searched based on the target word vector to determine the text corresponding to the target word vector; the target text matched with the context of the picture to be processed is then generated from the text corresponding to the target word vector.
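A hypothetical sketch of such a dictionary lookup by nearest neighbour; the dictionary words and the embedding table are made up for illustration:

```python
import torch

# Hypothetical dictionary lookup: map a target word vector back to text by
# nearest neighbour in an embedding table; table and words are assumptions.
dictionary = ["I", "today", "a little", "sad"]
embedding_table = torch.randn(len(dictionary), 256)   # one row per dictionary word

target_vector = embedding_table[1] + 0.01 * torch.randn(256)  # noisy "today"
idx = torch.cdist(target_vector.unsqueeze(0), embedding_table).argmin()
print(dictionary[int(idx)])  # -> "today"
```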
The decoder may include a plurality of units, so there may correspondingly be a plurality of target word vectors; step 408 may include the following sub-steps:
Substep S8: convert the target word vectors into corresponding texts and determine the output order corresponding to each target word vector.
Substep S10: splice the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
In the embodiment of the invention, each output unit of the decoder has a corresponding arrangement order, so the output order of the target word vector corresponding to each output unit may be determined from the arrangement order of the output units; the texts corresponding to the target word vectors may then be spliced in that output order to generate the target text matched with the context of the picture to be processed, for example by appending the text corresponding to the next target word vector after the text corresponding to the previous one. As shown in fig. 3B, the text corresponding to T1 is "I", the text corresponding to T2 is "today", the text corresponding to T3 is "a little", and the text corresponding to T4 is "sad"; splicing the texts corresponding to the four target word vectors then yields "I'm a little sad today" (the pieces concatenate directly in the original Chinese word order).
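A minimal sketch of the splicing step (the English renderings of the pieces are illustrative):

```python
# Minimal splicing sketch: join the texts in each target word vector's output order.
pieces = {1: "I", 2: "today", 3: "a little", 4: "sad"}  # T1..T4, illustrative
target_text = " ".join(pieces[i] for i in sorted(pieces))
print(target_text)  # "I today a little sad" -- one sentence in the original Chinese order
```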
Step 410: synthesize the picture to be processed and the target text to generate a corresponding expression package.
The picture to be processed and the target text may be synthesized to generate a corresponding image-text expression package. For example, a target position of the target text relative to the picture to be processed may be determined, and the corresponding target text is then added at that target position; the target position may be above, below, or elsewhere on the picture to be processed according to user requirements, which is not limited in the embodiment of the present invention. For example, if the picture to be processed is as shown in fig. 1 and the obtained target text is "distant point", synthesizing the two generates the expression package shown in fig. 5.
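A hedged Pillow sketch of this synthesis step; placing the caption below the picture and using the default font are assumptions, not details fixed by the embodiment:

```python
from PIL import Image, ImageDraw

# Hedged synthesis sketch: paste the picture onto a slightly taller canvas and
# draw the target text in the strip below it. Sizes and position are assumptions.
picture = Image.new("RGB", (240, 240), "white")   # stand-in for fig. 1
target_text = "distant point"

canvas = Image.new("RGB", (240, 270), "white")    # extra strip for the caption
canvas.paste(picture, (0, 0))
ImageDraw.Draw(canvas).text((10, 245), target_text, fill="black")
canvas.save("emoticon.png")
```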
In an optional embodiment of the present invention, after the target text corresponding to each picture to be processed is determined, the target text may be used as a label of that picture to be processed. Likewise, after the picture to be processed and the target text are synthesized into an expression package, the target text corresponding to each expression package may be used as a label of that expression package. When a user later queries expression packages and/or pictures to be processed by entering a query term on a search platform, the search engine can match the query term against the labels of the expression packages and/or pictures to be processed, and return the expression packages and/or pictures that match the query term.
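A hypothetical sketch of such label-based matching; the in-memory index and file names stand in for a real search engine:

```python
# Hypothetical label-based matching; the in-memory index and file names are
# made up, standing in for a real search platform's index.
labels = {
    "pack_001.png": "distant point",
    "pack_002.png": "I'm a little sad today",
}

def search(query: str) -> list[str]:
    # Return every expression package whose label contains the query term.
    return [name for name, label in labels.items() if query in label]

print(search("sad"))  # -> ['pack_002.png']
```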
In summary, in the embodiment of the present invention, a picture to be processed may be acquired and processed according to a preset model to determine a corresponding target word vector, and a target text matched with the context of the picture may be generated according to the target word vector. Because the preset model is trained based on prior information that includes the relevancy information of related texts, its prediction capability is high; the text subsequently determined by the preset model to match the picture's context can therefore better meet user requirements without manual labeling, which improves the production efficiency of expression packages.
Secondly, in the embodiment of the present invention, the decoder is a bidirectional network, and the decoder's processing of the cascaded feature information to output the target word vector corresponding to the picture to be processed includes: the decoder performs a forward operation on the cascaded feature information to obtain corresponding forward word vectors; the decoder performs a reverse operation on the cascaded feature information to obtain corresponding reverse word vectors; and, for each unit, whichever of the forward word vector and the reverse word vector has the higher probability is taken as the target word vector corresponding to the picture to be processed. This improves the accuracy of the determined target text, thereby further improving the production efficiency of expression packages.
Thirdly, in the embodiment of the invention, during the forward operation each unit outputs the forward word vector with the highest probability, and during the reverse operation each unit outputs the reverse word vector with the highest probability; this improves the accuracy of the determined target text and thereby further improves the production efficiency of expression packages.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a picture processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a picture obtaining module 602, configured to obtain a picture to be processed;
a word vector determining module 604, configured to process the picture to be processed according to a preset model, and determine a target word vector corresponding to the picture to be processed, where the preset model is trained based on prior information, and the prior information includes relevancy information of related texts;
a text generating module 606, configured to generate a target text matched with the context of the to-be-processed picture according to the target word vector.
Referring to fig. 7, a block diagram of an alternative embodiment of a picture processing apparatus of the present invention is shown.
In an alternative embodiment of the present invention, the preset model is an encoder-decoder model, and the word vector determining module 604 includes:
a feature extraction submodule 6042, configured to invoke the encoder to perform feature extraction on the to-be-processed picture, and output feature information corresponding to the to-be-processed picture;
and a word vector processing submodule 6044, configured to invoke the decoder to process the feature information, and output a target word vector corresponding to the to-be-processed picture.
In an optional embodiment of the present invention, the input parameters of the encoder include an input word vector and feature information; the word vector processing sub-module 6044, including:
an initialization unit 60442, configured to initialize the input word vector;
a cascade unit 60444, configured to cascade the initialized input word vector and feature information of the to-be-processed picture to obtain cascade feature information;
and a word vector output unit 60446, configured to invoke the decoder to process the cascade feature information, and output a target word vector corresponding to the picture to be processed.
In an alternative embodiment of the invention, the decoder is a bi-directional network,
the word vector output unit 60446 is configured to invoke the decoder to perform a forward operation on the cascaded feature information to obtain corresponding forward word vectors, and a reverse operation on the cascaded feature information to obtain corresponding reverse word vectors; for each unit, whichever of the forward word vector and the reverse word vector has the higher probability is taken as the target word vector corresponding to the picture to be processed.
In an alternative embodiment of the present invention, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the process of inverse operation, each unit outputs the inverse word vector with the maximum probability.
In an optional embodiment of the present invention, the target word vector includes a plurality of vectors; the text generating module 606 is configured to convert the target word vectors into corresponding texts and determine an output sequence corresponding to each target word vector; and splicing the corresponding texts according to the output sequence of each target word vector to generate a target text matched with the context of the picture to be processed.
In an optional embodiment of the present invention, the apparatus further comprises:
and an expression package generating module 608, configured to synthesize the to-be-processed picture and the target text, and generate a corresponding expression package.
In an optional embodiment of the present invention, the apparatus further comprises:
a priori information determining module 610, configured to collect corpora, establish relevancy information between texts in the corpora, and determine priori information according to the relevancy information;
a data collection module 612, configured to collect training data, where the training data includes training texts and training pictures;
and the training module 614 is configured to pre-load the prior information by using a preset model, and train by using the training data.
In an optional embodiment of the present invention, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the training module 614 includes:
the first preset model training submodule 6142 is configured to invoke the encoder to perform feature extraction on the first training picture, and output first training feature information corresponding to the first training picture; calling the decoder to preload the prior information, processing first training characteristic information corresponding to the first training picture and a reference word vector of a first training text based on the prior information, and outputting a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
In an optional embodiment of the present invention, the training pictures include a second training picture, and the preset model is a coder-decoder model; the training module 614 includes:
a second preset model training submodule 6144, configured to construct a reference word vector corresponding to the second training picture; calling the encoder to perform feature extraction on the second training picture, and outputting second training feature information corresponding to the second training picture; calling the decoder to preload the prior information, processing second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputting a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
In an optional embodiment of the present invention, the training texts include a second training text, and the preset model is an encoder-decoder model; the training module 614 includes:
a third preset model training submodule 6146, configured to construct a third training picture corresponding to the second training text; call the encoder to perform feature extraction on the third training picture and output third training feature information corresponding to the third training picture; call the decoder to preload the prior information, process the third training feature information and the reference word vector of the second training text based on the prior information, and output a third training word vector and a corresponding third probability; and adjust the weight of the preset model according to the third training word vector and the corresponding third probability.
In an optional embodiment of the present invention, the first preset model training sub-module 6142 is configured to cascade the first training feature information corresponding to the first training picture and the reference word vector of the first training text to obtain corresponding first cascade training feature information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
In summary, in the embodiment of the present invention, a picture to be processed may be acquired and processed according to a preset model to determine a corresponding target word vector, and a target text matched with the context of the picture may be generated according to the target word vector. Because the preset model is trained based on prior information that includes the relevancy information of related texts, its prediction capability is high; the text subsequently determined by the preset model to match the picture's context can therefore better meet user requirements without manual labeling, which improves the production efficiency of expression packages.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Fig. 8 is a block diagram illustrating a structure of an electronic device 800 for picture processing according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 806 provide power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of picture processing, the method comprising: acquiring a picture to be processed; processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of relevant texts; and generating a target text matched with the context of the picture to be processed according to the target word vector.
Optionally, the preset model is an encoder-decoder model, and the processing the picture to be processed according to the preset model to determine the target word vector corresponding to the picture to be processed includes: the encoder extracts the characteristics of the picture to be processed and outputs characteristic information corresponding to the picture to be processed; and the decoder processes the characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the input parameters of the encoder include an input word vector and feature information; the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed, and the method comprises the following steps: initializing the input word vector; cascading the initialized input word vector and the characteristic information of the picture to be processed to obtain cascading characteristic information; and the decoder processes the cascade characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the decoder is a bidirectional network, and the decoder processing the cascaded feature information and outputting a target word vector corresponding to the picture to be processed includes: the decoder performs a forward operation on the cascaded feature information to obtain corresponding forward word vectors; the decoder performs a reverse operation on the cascaded feature information to obtain corresponding reverse word vectors; and, for each unit, whichever of the forward word vector and the reverse word vector has the higher probability is taken as the target word vector corresponding to the picture to be processed.
Optionally, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the process of inverse operation, each unit outputs the inverse word vector with the maximum probability.
Optionally, there are a plurality of target word vectors; generating a target text matched with the context of the picture to be processed according to the target word vectors includes: converting the target word vectors into corresponding texts, and determining the output order corresponding to each target word vector; and splicing the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
Optionally, the method further comprises: synthesizing the picture to be processed and the target text to generate a corresponding expression package.
Optionally, the method further comprises the step of training the preset model: collecting corpora, establishing relevancy information among texts in the corpora, and determining prior information according to the relevancy information; collecting training data, wherein the training data comprises training texts and training pictures; and pre-loading the prior information by a preset model, and training by adopting the training data.
Optionally, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: the encoder performs feature extraction on the first training picture and outputs first training feature information corresponding to the first training picture; the decoder preloads the prior information, processes the first training characteristic information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputs a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
Optionally, the training pictures include a second training picture, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a reference word vector corresponding to the second training picture; the encoder performs feature extraction on the second training picture and outputs second training feature information corresponding to the second training picture; the decoder preloads the prior information, processes second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputs a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
Optionally, the training text comprises a second training text, and the preset model is an encoder-decoder model; the preset model preloads the prior information and is trained using the training data through the following steps: constructing a third training picture corresponding to the second training text; the encoder performs feature extraction on the third training picture and outputs third training feature information corresponding to the third training picture; the decoder preloads the prior information, processes the third training feature information and the reference word vector of the second training text based on the prior information, and outputs a third training word vector and a corresponding third probability; and adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
Optionally, the processing, based on the prior information, the first training feature information corresponding to the first training picture and the reference word vector of the first training text, and outputting a first training word vector and a corresponding first probability includes: cascading first training characteristic information corresponding to the first training picture with a reference word vector of a first training text to obtain corresponding first cascading training characteristic information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
Fig. 9 is a schematic structural diagram of an electronic device 900 for picture processing according to another exemplary embodiment of the present invention. The electronic device 900 may be a server, which may vary widely depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors) and memory 932, one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. Memory 932 and storage media 930 can be, among other things, transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 922 may be arranged to communicate with the storage medium 930 to execute a series of instruction operations in the storage medium 930 on the server.
The server may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, one or more keyboards 956, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring a picture to be processed; processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of related texts; and generating a target text matched with the context of the picture to be processed according to the target word vector.
Optionally, the preset model is an encoder-decoder model, and the processing the picture to be processed according to the preset model to determine the target word vector corresponding to the picture to be processed includes: the encoder extracts the characteristics of the picture to be processed and outputs characteristic information corresponding to the picture to be processed; and the decoder processes the characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the input parameters of the encoder include an input word vector and feature information; the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed, and the method comprises the following steps: initializing the input word vector; cascading the initialized input word vector and the characteristic information of the picture to be processed to obtain cascading characteristic information; and the decoder processes the cascade characteristic information and outputs a target word vector corresponding to the picture to be processed.
Optionally, the decoder is a bidirectional network, and the decoder processing the cascaded feature information and outputting a target word vector corresponding to the picture to be processed includes: the decoder performs a forward operation on the cascaded feature information to obtain corresponding forward word vectors; the decoder performs a reverse operation on the cascaded feature information to obtain corresponding reverse word vectors; and, for each unit, whichever of the forward word vector and the reverse word vector has the higher probability is taken as the target word vector corresponding to the picture to be processed.
Optionally, the decoder comprises a plurality of units; in the forward operation process, each unit outputs a forward word vector with the maximum probability; in the process of inverse operation, each unit outputs the inverse word vector with the maximum probability.
Optionally, there are a plurality of target word vectors; generating a target text matched with the context of the picture to be processed according to the target word vectors includes: converting the target word vectors into corresponding texts, and determining the output order corresponding to each target word vector; and splicing the corresponding texts according to the output order of each target word vector to generate the target text matched with the context of the picture to be processed.
Optionally, the operations further include: synthesizing the picture to be processed and the target text to generate a corresponding expression package.
Optionally, the operations further include the following steps for training the preset model: collecting corpora, establishing relevancy information among texts in the corpora, and determining the prior information according to the relevancy information; collecting training data, wherein the training data comprises training texts and training pictures; and preloading the prior information into the preset model and training with the training data.
Optionally, the training picture includes a first training picture, the training text includes a first training text, the first training text is a text corresponding to the first training picture and matched with a context, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: the encoder performs feature extraction on the first training picture and outputs first training feature information corresponding to the first training picture; the decoder preloads the prior information, processes the first training characteristic information corresponding to the first training picture and the reference word vector of the first training text based on the prior information, and outputs a first training word vector and a corresponding first probability; and adjusting the weight of the preset model according to the first training word vector and the corresponding first probability.
Optionally, the training pictures include a second training picture, and the preset model is an encoder-decoder model; the preset model pre-loads the prior information, and the training is performed by adopting the training data, which comprises the following steps: constructing a reference word vector corresponding to the second training picture; the encoder performs feature extraction on the second training picture and outputs second training feature information corresponding to the second training picture; the decoder preloads the prior information, processes second training characteristic information and a reference word vector corresponding to the second training picture based on the prior information, and outputs a second training word vector and a corresponding second probability; and adjusting the weight of the preset model according to the second training word vector and the corresponding second probability.
Optionally, the training text comprises a second training text, and the preset model is an encoder-decoder model; the preset model preloads the prior information and is trained using the training data through the following steps: constructing a third training picture corresponding to the second training text; the encoder performs feature extraction on the third training picture and outputs third training feature information corresponding to the third training picture; the decoder preloads the prior information, processes the third training feature information and the reference word vector of the second training text based on the prior information, and outputs a third training word vector and a corresponding third probability; and adjusting the weight of the preset model according to the third training word vector and the corresponding third probability.
Optionally, the processing, based on the prior information, the first training feature information corresponding to the first training picture and the reference word vector of the first training text, and outputting a first training word vector and a corresponding first probability includes: cascading first training characteristic information corresponding to the first training picture with a reference word vector of a first training text to obtain corresponding first cascading training characteristic information; and processing the first cascade training characteristic information based on the prior information, and outputting a first training word vector and a corresponding first probability.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The picture processing method, the picture processing apparatus, and the electronic device provided above have been described in detail; specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A picture processing method, comprising:
acquiring a picture to be processed;
processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of relevant texts;
and generating a target text matched with the context of the picture to be processed according to the target word vector.
2. The method of claim 1, wherein the preset model is an encoder-decoder model,
the processing the picture to be processed according to the preset model, and determining the target word vector corresponding to the picture to be processed comprises:
the encoder extracts the characteristics of the picture to be processed and outputs characteristic information corresponding to the picture to be processed;
and the decoder processes the characteristic information and outputs a target word vector corresponding to the picture to be processed.
3. The method of claim 2, wherein the input parameters of the encoder comprise an input word vector and feature information;
the decoder processes the feature information and outputs a target word vector corresponding to the picture to be processed, and the method comprises the following steps:
initializing the input word vector;
cascading the initialized input word vector and the characteristic information of the picture to be processed to obtain cascading characteristic information;
and the decoder processes the cascade characteristic information and outputs a target word vector corresponding to the picture to be processed.
4. The method according to claim 3, wherein the decoder is a bidirectional network, and the decoder processes the concatenated feature information and outputs a target word vector corresponding to the picture to be processed, including:
the decoder performs forward operation on the cascade characteristic information to obtain a corresponding forward word vector;
the decoder performs reverse operation on the cascade characteristic information to obtain a corresponding reverse word vector;
and taking, for each unit, whichever of the forward word vector and the reverse word vector has the higher probability as the target word vector corresponding to the picture to be processed.
5. The method of claim 4, wherein the decoder comprises a plurality of units;
in the forward operation process, each unit outputs a forward word vector with the maximum probability;
in the process of inverse operation, each unit outputs the inverse word vector with the maximum probability.
6. The method of claim 5, wherein there are a plurality of target word vectors;
generating a target text matched with the context of the picture to be processed according to the target word vector, wherein the generating comprises:
converting the target word vectors into corresponding texts, and determining the output sequence corresponding to each target word vector;
and splicing the corresponding texts according to the output sequence of each target word vector to generate a target text matched with the context of the picture to be processed.
7. The method of claim 1, further comprising:
and synthesizing the picture to be processed and the target text to generate a corresponding expression package.
8. A picture processing apparatus, comprising:
the image acquisition module is used for acquiring an image to be processed;
the word vector determination module is used for processing the picture to be processed according to a preset model and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of related texts;
and the text generation module is used for generating a target text matched with the context of the picture to be processed according to the target word vector.
9. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the picture processing method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring a picture to be processed;
processing the picture to be processed according to a preset model, and determining a target word vector corresponding to the picture to be processed, wherein the preset model is trained based on prior information, and the prior information comprises relevancy information of relevant texts;
and generating a target text matched with the context of the picture to be processed according to the target word vector.
CN201910810070.6A 2019-08-29 2019-08-29 Picture processing method and device and electronic equipment Active CN110619357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810070.6A CN110619357B (en) 2019-08-29 2019-08-29 Picture processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110619357A true CN110619357A (en) 2019-12-27
CN110619357B CN110619357B (en) 2022-03-04

Family

ID=68922683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810070.6A Active CN110619357B (en) 2019-08-29 2019-08-29 Picture processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110619357B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271643A (en) * 2018-08-08 2019-01-25 北京捷通华声科技股份有限公司 A kind of training method of translation model, interpretation method and device
CN109741423A (en) * 2018-12-28 2019-05-10 北京奇艺世纪科技有限公司 Expression packet generation method and system
CN110162191A (en) * 2019-04-03 2019-08-23 腾讯科技(深圳)有限公司 A kind of expression recommended method, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Abel L. Peirson V et al., "Dank Learning: Generating Memes Using Deep Neural Networks", arXiv:1806.04510v1 *
Cheng Wang et al., "Image Captioning with Deep Bidirectional LSTMs", arXiv:1604.00790v3 *
Jeffrey Pennington et al., "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495971A (en) * 2020-04-08 2021-10-12 百度在线网络技术(北京)有限公司 Data conversion model optimization method and device and electronic equipment
CN114786057A (en) * 2022-03-29 2022-07-22 广州埋堆堆科技有限公司 Video bullet screen generation system based on deep learning and expression package data set
CN117173497A (en) * 2023-11-02 2023-12-05 腾讯科技(深圳)有限公司 Image generation method and device, electronic equipment and storage medium
CN117173497B (en) * 2023-11-02 2024-02-27 腾讯科技(深圳)有限公司 Image generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110619357B (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN110008401B (en) Keyword extraction method, keyword extraction device, and computer-readable storage medium
CN110619357B (en) Picture processing method and device and electronic equipment
CN109871843B (en) Character recognition method and device for character recognition
KR102544453B1 (en) Method and device for processing information, and storage medium
CN107291704B (en) Processing method and device for processing
CN108628813B (en) Processing method and device for processing
CN108304412B (en) Cross-language search method and device for cross-language search
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN111538998B (en) Text encryption method and device, electronic equipment and computer readable storage medium
CN114693905A (en) Text recognition model construction method, text recognition method and device
CN109887492B (en) Data processing method and device and electronic equipment
CN112000766A (en) Data processing method, device and medium
CN111104807A (en) Data processing method and device and electronic equipment
CN109979435B (en) Data processing method and device for data processing
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN109917927B (en) Candidate item determination method and device
CN113589954A (en) Data processing method and device and electronic equipment
CN112181163A (en) Input method, input device and input device
CN108073566B (en) Word segmentation method and device and word segmentation device
CN112987941B (en) Method and device for generating candidate words
CN109213332B (en) Input method and device of expression picture
CN111753069B (en) Semantic retrieval method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant