CN108875758B - Information processing method and device, and information detection method and device - Google Patents

Information processing method and device, and information detection method and device

Info

Publication number
CN108875758B
CN108875758B (application CN201710320880.4A)
Authority
CN
China
Prior art keywords
neural network
feature maps
network model
vector
window
Prior art date
Legal status
Active
Application number
CN201710320880.4A
Other languages
Chinese (zh)
Other versions
CN108875758A (en)
Inventor
侯翠琴
夏迎炬
杨铭
张姝
孙俊
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201710320880.4A
Publication of CN108875758A
Application granted
Publication of CN108875758B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are an information processing method and apparatus, and an information detection method and apparatus, wherein the information processing method includes: extracting a group of feature maps with a predetermined width and a predetermined height from each of the plurality of sample images, wherein the feature maps in the group of feature maps respectively correspond to different image features; and training a word description model based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, the word description model being used to generate corresponding word descriptions from the input images, wherein training the word description model comprises calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of the recurrent neural network model. According to the embodiments of the present disclosure, a more appropriate textual description of an image can be generated.

Description

Information processing method and device, and information detection method and device
Technical Field
The present disclosure relates to the field of information processing, and in particular, to an information processing method and apparatus and an information detection method and apparatus that consider not only a position of a window of interest in an image but also a size of the window of interest.
Background
Understanding image content and describing it in natural language is one of the important issues and ultimate goals in the field of artificial intelligence. Describing an image requires not only identifying the objects in the image, but also describing those objects and the relationships between them in natural language. Describing image content in natural language is therefore a very challenging problem. Nevertheless, there have been attempts to solve it. For example, objects in an image are first detected and the relationships between them are inferred, and natural sentences describing the content of the image are then generated based on templates. There are also end-to-end approaches based on neural network models. Furthermore, some approaches add an attention model to the neural network model, but such an attention model automatically learns only the position of an attention window of fixed size.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In view of the above problems, an object of the present disclosure is to provide an information processing method and apparatus and an information detection method and apparatus that take into account not only the position but also the size of a window of interest in an image.
According to an aspect of the present disclosure, there is provided an information processing method including: extracting a set of feature maps having a predetermined width and a predetermined height from each of a plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and training a word description model based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, the word description model being used to generate corresponding word descriptions from input images, wherein training the word description model may include calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model.
According to another aspect of the present disclosure, there is provided an information processing apparatus including: an extraction unit configured to extract a set of feature maps having a predetermined width and a predetermined height from each of a plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and a training unit configured to train a word description model based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, the word description model being used to generate corresponding word descriptions from input images, wherein training the word description model may include calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model.
According to still another aspect of the present disclosure, there is provided an information detection method including: extracting a set of feature maps having a predetermined width and a predetermined height from an input image, wherein the feature maps in the set of feature maps respectively correspond to different image features; and generating a corresponding textual description of the input image using a trained textual description model based on the extracted set of feature maps, wherein generating the corresponding textual description of the input image using the trained textual description model may include: calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model.
According to other aspects of the present disclosure, there are also provided computer program code and a computer program product for implementing the above-described method according to the present disclosure, and a computer readable storage medium having recorded thereon the computer program code for implementing the above-described method according to the present disclosure.
Additional aspects of the embodiments of the present disclosure are set forth in the description that follows, in which the detailed description fully discloses preferred embodiments of the present disclosure without imposing limitations thereon.
Drawings
The disclosure may be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals are used throughout the figures to designate like or similar components. The accompanying drawings, which are incorporated in and form a part of the specification, further illustrate preferred embodiments of the present disclosure and explain its principles and advantages. In the drawings:
fig. 1 is a flowchart showing an example of a flow of an information processing method according to an embodiment of the present disclosure;
fig. 2 is a flowchart showing an example of a flow of calculating the center and size of a window of interest on a set of feature maps in an information processing method according to an embodiment of the present disclosure;
fig. 3 is a block diagram showing a functional configuration example of an information processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a flowchart illustrating an example of a flow of an information detection method according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an example of an input image and its corresponding textual description, according to an embodiment of the present disclosure;
fig. 6 is a block diagram showing a functional configuration example of an information detection apparatus according to an embodiment of the present disclosure; and
fig. 7 is a block diagram showing an example configuration of a personal computer as an information processing apparatus employable in the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present disclosure are shown in the drawings, and other details not so relevant to the present disclosure are omitted.
The information processing method automatically learns, based on the previous state vector of a recurrent neural network model and the image content, the position and size of the image region that requires attention in the current state; the recurrent neural network model then updates its current state vector and calculates the probability of generating each word based on the previous state vector and the calculated image window of interest; finally, a sentence describing the image is generated.
Embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings.
First, a flow example of an information processing method 100 according to an embodiment of the present disclosure will be described with reference to fig. 1. Fig. 1 is a flowchart illustrating a flow example of an information processing method according to an embodiment of the present disclosure. As shown in fig. 1, an information processing method 100 according to an embodiment of the present disclosure includes an extraction step S102 and a training step S104.
In the extracting step S102, a set of feature maps having a predetermined width and a predetermined height may be extracted from each of the plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features.
Using existing techniques, a set of p feature maps fc having a predetermined width s and a predetermined height r can be extracted from each sample image as fc = CN(image), where image denotes the m × n × c tensor of the image, and m, n and c denote the length, width and number of channels of the sample image, respectively; CN() denotes a transformation function; the extracted feature maps fc form an r × s × p tensor, where p denotes the number of features, i.e., each feature map in the set of p feature maps fc corresponds to one of the p image features.
Preferably, extracting a set of feature maps from each sample image may include extracting a set of feature maps from each sample image using a convolutional neural network model.
As an example, a set of (p) feature maps fc having a predetermined width s and a predetermined height r of each sample image may be extracted with a convolutional neural network, where CN () represents a transformation function implemented with the convolutional neural network.
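As an illustration only (not the patent's prescribed implementation), the following Python sketch extracts such a set of feature maps with a pretrained convolutional network. The choice of torchvision's VGG-16 (requires torchvision 0.13 or later), the 224 x 224 input size, and the random stand-in for a preprocessed sample image are assumptions; the resulting 512 x 7 x 7 output corresponds to p = 512 feature maps of height r = 7 and width s = 7.

```python
import torch
import torchvision.models as models

# VGG-16 convolutional layers as the feature extractor CN(); the pretrained ImageNet
# weights match the VGG-16 initialization mentioned in the training procedure below.
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

image = torch.randn(1, 3, 224, 224)   # stands in for a preprocessed m x n x c sample image

with torch.no_grad():
    fc = cnn(image)                   # 1 x p x r x s feature maps, here 1 x 512 x 7 x 7

print(fc.shape)                       # torch.Size([1, 512, 7, 7])
```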
In the training step S104, a word description model may be trained based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, and the word description model may be used to generate corresponding word descriptions from the input images, wherein training the word description model may include calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of the recurrent neural network model.
As an example, there is a textual description separately labeled for each of a plurality of sample images. A word description model may be trained based on the extracted set of feature maps and the labeled word description, which may be used to generate a corresponding word description from the input image. As an example, the center and size of the window of interest on a set of feature maps may be calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
By way of example, let H_t denote the current state vector of the recurrent neural network model at time t, and let H_{t-1} denote its previous state vector at time t-1. The initial state vector H_0 of the recurrent neural network model is initialized to 0, i.e., H_0 = zeros(hd), where zeros() denotes the all-zero function and hd is the dimension of the state vector of the recurrent neural network model.
Fig. 2 is a flowchart showing an example of a flow of calculating the center and size of a window of interest on a set of feature maps in an information processing method according to an embodiment of the present disclosure. Preferably, calculating the center and size of the window of interest on the set of feature maps may comprise the following steps. In step S202, a first neural network model with a sigmoid function as the activation function may be applied to the set of feature maps to convert the set of feature maps into a vector; as an example, the extracted set of feature maps may be merged into one vector, which is then nonlinearly transformed into f1(fc) by a fully connected layer of the neural network using the sigmoid function as the activation function (an example of the first neural network model), where f1(fc) = σ(W1*fc + b1), σ() denotes the sigmoid function, and W1 and b1 are a parameter matrix and a bias parameter vector, respectively. In step S204, the converted vector may be merged with the previous state vector of the recurrent neural network model, and a second neural network model with the sigmoid function as the activation function may be applied to the merged vector; as an example, the converted vector may be merged with the previous state vector H_{t-1} of the recurrent neural network model to obtain the vector [f1(fc), H_{t-1}], which is then nonlinearly transformed into f2([f1(fc), H_{t-1}]) by a fully connected layer of the neural network with the sigmoid function as the activation function (an example of the second neural network model), where f2([f1(fc), H_{t-1}]) = σ(W2*[f1(fc), H_{t-1}] + b2), and W2 and b2 are a parameter matrix and a bias parameter vector, respectively. In step S206, a third neural network model with the tanh function as the activation function may further be applied to the vector obtained via the second neural network model; as an example, the vector obtained from f2() may be nonlinearly transformed into tanh(f2([f1(fc), H_{t-1}])). In step S208, the vector obtained via the third neural network model may be operated on with the parameter for comparison; as an example, it may be point-multiplied with a vector V (an example of the parameter for comparison), and the result may then be normalized by the σ function to obtain σ(tanh(f2([f1(fc), H_{t-1}])) · V). In step S210, the center and size of the window of interest may be calculated from the result of the operation and the predetermined width and predetermined height; as an example, the learned window position and size may be normalized by the predetermined width and predetermined height as (cs', cr', s', r') = (s, r, s, r) ⊙ σ(tanh(f2([f1(fc), H_{t-1}])) · V), where cs' and cr' denote the center positions of the window of interest on the feature maps fc in the width and height directions, respectively, and s' and r' denote the width and height of the window of interest, respectively.
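Steps S202 to S210 can be illustrated with a small NumPy sketch. All dimensions, the flattening of the feature maps into one vector, and the treatment of the comparison parameter V as a (d x 4) matrix (so that the result is four-dimensional and can scale (s, r, s, r)) are assumptions made for illustration; only the functional forms follow the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

r, s, p, hd, d = 7, 7, 512, 256, 64           # feature-map height/width/channels, state dim

rng = np.random.default_rng(0)
fc = rng.standard_normal((r, s, p))           # a set of feature maps (r x s x p)
H_prev = np.zeros(hd)                         # previous state vector H_{t-1}

# S202: first network with sigmoid activation turns the merged feature maps into a vector
W1, b1 = rng.standard_normal((d, r * s * p)) * 0.01, np.zeros(d)
f1 = sigmoid(W1 @ fc.reshape(-1) + b1)

# S204: merge with H_{t-1}, second network with sigmoid activation
W2, b2 = rng.standard_normal((d, d + hd)) * 0.01, np.zeros(d)
f2 = sigmoid(W2 @ np.concatenate([f1, H_prev]) + b2)

# S206: third network with tanh activation
g = np.tanh(f2)

# S208: combine with the comparison parameter V and renormalize with the sigmoid
V = rng.standard_normal((d, 4)) * 0.01        # assumed shape so the result is 4-dimensional
a = sigmoid(g @ V)                            # values in (0, 1)^4

# S210: scale by the predetermined width/height to get the window center and size
cs, cr, ws, wr = np.array([s, r, s, r]) * a   # (cs', cr', s', r')
print(cs, cr, ws, wr)
```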
Preferably, training the text description model may further include: based on the center and the size of the attention window, attention feature vectors of a set of feature maps are obtained. As an example, the attention feature vector of a set of feature maps may be obtained based on the center and size of the attention window obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to a portion corresponding to the attention window on a set of feature maps to convert the portion into one vector, and regarding the one vector as an attention feature vector.
As an example, let att be a matrix of the same size as the feature maps fc, whose value is 1 at the positions corresponding to the window of interest and 0 elsewhere. Thus fc ⊙ att extracts only the content of the feature maps fc inside the window of interest, i.e., the portion of the set of feature maps corresponding to the window of interest can be denoted by fc ⊙ att. Furthermore, fc ⊙ att may be converted into one vector X_t = f(fc ⊙ att) by one fully connected layer of the neural network (an example of the fourth neural network model), and the vector X_t is taken as the attention feature vector, where f() is a transformation function. The attention feature vector X_t serves as the input of the recurrent neural network model at time t.
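A NumPy sketch of this step is given below. The binary mask att follows the description above, while the mean pooling and the single linear map standing in for the fourth neural network model's transform f() are assumptions, since the exact form of f() is left open here.

```python
import numpy as np

def attention_feature(fc, cs, cr, ws, wr, Wf, bf):
    """fc: (r, s, p) feature maps; (cs, cr) window center; (ws, wr) window width/height."""
    r, s, p = fc.shape
    att = np.zeros((r, s, 1))
    row0, row1 = int(max(cr - wr / 2, 0)), int(min(cr + wr / 2, r))
    col0, col1 = int(max(cs - ws / 2, 0)), int(min(cs + ws / 2, s))
    att[row0:row1, col0:col1, :] = 1.0          # 1 inside the window of interest, 0 elsewhere
    masked = fc * att                           # fc ⊙ att: only the window content survives
    pooled = masked.reshape(-1, p).mean(axis=0) # assumed pooling before the linear map
    return Wf @ pooled + bf                     # X_t = f(fc ⊙ att)

rng = np.random.default_rng(0)
fc = rng.standard_normal((7, 7, 512))
Wf, bf = rng.standard_normal((256, 512)) * 0.01, np.zeros(256)
X_t = attention_feature(fc, cs=3.5, cr=3.5, ws=4.0, wr=4.0, Wf=Wf, bf=bf)
print(X_t.shape)                                # (256,)
```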
Preferably, training the textual description model further comprises: and calculating a current state vector of the recurrent neural network model based on the attention feature vector and a previous state vector in the recurrent neural network model, and obtaining a text description corresponding to the attention window based on the current state vector.
As an example, the current state vector of the recurrent neural network model at the current time t may be calculated based on the attention feature vector X_t and the previous state vector H_{t-1} at time t-1 as H_t = tanh(Wh*H_{t-1} + Wi*X_t + B), where Wh and Wi are parameter matrices and B is a bias parameter vector. A textual description corresponding to the window of interest may then be obtained based on the current state vector.
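As a minimal sketch, the recurrent update of this paragraph can be written directly in NumPy; the dimensions and random parameter values are assumptions.

```python
import numpy as np

hd, xd = 256, 256                               # assumed state and input dimensions
rng = np.random.default_rng(0)
Wh = rng.standard_normal((hd, hd)) * 0.01       # parameter matrix Wh
Wi = rng.standard_normal((hd, xd)) * 0.01       # parameter matrix Wi
B = np.zeros(hd)                                # bias parameter vector B
H_prev, X_t = np.zeros(hd), rng.standard_normal(xd)

H_t = np.tanh(Wh @ H_prev + Wi @ X_t + B)       # current state vector H_t
```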
Preferably, obtaining the textual description corresponding to the window of interest may include: and applying a fifth neural network model to the current state vector so as to calculate the occurrence probability of each word in the predetermined word bank, and determining the word with the maximum occurrence probability as the character description corresponding to the attention window.
As an example, a neural network model with a softmax function as the activation function (an example of the fifth neural network model) may be applied to the current state vector H_t of the recurrent neural network model, thereby calculating the occurrence probability of each word Y_t in the predetermined thesaurus as P(Y_t) = softmax(σ(Wp*H_t + bp)), where Wp and bp are a parameter matrix and a bias parameter vector, respectively. The word with the highest occurrence probability is then determined as the textual description corresponding to the window of interest.
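The word-selection step can be sketched as follows; the toy vocabulary and the dimensions are assumptions, while the composition softmax(σ(Wp*H_t + bp)) and the argmax follow the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["<end>", "girl", "horse", "standing", "beside"]   # toy predetermined thesaurus
rng = np.random.default_rng(0)
H_t = rng.standard_normal(256)
Wp, bp = rng.standard_normal((len(vocab), 256)) * 0.01, np.zeros(len(vocab))

P = softmax(sigmoid(Wp @ H_t + bp))             # P(Y_t) for every word in the thesaurus
word = vocab[int(np.argmax(P))]                 # word with the maximum occurrence probability
print(word, P)
```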
Preferably, the recurrent neural network model may also include a long short term memory network (LSTM) model.
As an example, in the case where the recurrent neural network model is an LSTM model, when the LSTM model is initialized it is necessary to initialize both the state vector H_0 and the cell state vector c_0 of the LSTM model, i.e., H_0 = zeros(hd) and c_0 = zeros(hd), where hd is the dimension of the state.
In the case where the recurrent neural network model is an LSTM model, the position and size of the window of interest and the input X_t of the LSTM model at time t are calculated as described above for the general recurrent neural network model.
The calculation of the current state vector H_t of the LSTM model at time t is described in detail below. The current state vector H_t of the LSTM model at time t depends on the previous state vector H_{t-1}, the previous cell state vector C_{t-1}, and the input X_t at the current time. First, three gate state vectors are calculated based on the previous state vector H_{t-1} and the current input vector X_t: the input gate state vector i_t = σ(Wi*[H_{t-1}, X_t] + bi), the output gate state vector o_t = σ(Wo*[H_{t-1}, X_t] + bo), and the forget gate state vector f_t = σ(Wf*[H_{t-1}, X_t] + bf), where Wi, Wo and Wf are parameter matrices and bi, bo and bf are bias parameter vectors. Then, the current cell state vector C_t and the current state vector H_t are calculated as C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(Wc*[H_{t-1}, X_t] + bc) and H_t = o_t ⊙ tanh(C_t), where Wc and bc are a parameter matrix and a bias parameter vector, respectively. Once the current state vector H_t of the LSTM model at time t has been calculated, the method of obtaining the textual description corresponding to the window of interest based on H_t is the same as described above for the general recurrent neural network model.
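The LSTM update of this paragraph can be sketched in NumPy as below; the dimensions and the random initialization are assumptions, and only the gate and state formulas follow the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(H_prev, C_prev, X_t, params):
    z = np.concatenate([H_prev, X_t])              # [H_{t-1}, X_t]
    i_t = sigmoid(params["Wi"] @ z + params["bi"]) # input gate state vector
    o_t = sigmoid(params["Wo"] @ z + params["bo"]) # output gate state vector
    f_t = sigmoid(params["Wf"] @ z + params["bf"]) # forget gate state vector
    C_t = f_t * C_prev + i_t * np.tanh(params["Wc"] @ z + params["bc"])
    H_t = o_t * np.tanh(C_t)
    return H_t, C_t

hd, xd = 256, 256
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((hd, hd + xd)) * 0.01 for k in ("Wi", "Wo", "Wf", "Wc")}
params.update({k: np.zeros(hd) for k in ("bi", "bo", "bf", "bc")})

H0, C0 = np.zeros(hd), np.zeros(hd)                # H_0 = zeros(hd), c_0 = zeros(hd)
H1, C1 = lstm_step(H0, C0, rng.standard_normal(xd), params)
```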
The above description has taken the current time t as an example to introduce how the center and size of the window of interest on the set of feature maps are calculated based on the set of feature maps fc and the previous state vector H_{t-1} of the recurrent neural network model, and how the current state vector H_t of the recurrent neural network model at the current time t is then calculated to obtain the textual description corresponding to the window of interest. Similarly, the state vectors of the recurrent neural network model at times t+1, t+2, … may be calculated, so as to obtain the textual descriptions corresponding to the windows of interest at times t+1, t+2, …, respectively.
Preferably, training the text description model may further include: for one sample image of the plurality of sample images, when it is determined that the text description corresponding to the attention window is a terminator, training based on the one sample image is terminated.
As an example, for one sample image, when it is determined that the text description corresponding to the attention window is a terminator, it is determined that training based on the one sample image is terminated.
Preferably, the parameters of the textual description model may include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, and parameters for comparison. As an example, training the word description model may include training parameters of the word description model described above.
The training of the textual description model has been described above by taking one sample image as an example. How the textual description model is obtained by training on multiple sample images is described below. For convenience of description, it is assumed that a set of feature maps is extracted from each sample image using a convolutional neural network model (CNN), and that the center and size of the window of interest on the set of feature maps are calculated based on the set of feature maps and the previous state vector of the recurrent neural network model (RNN). Given n training data pairs {(X_i, Y_i)}, i = 1, …, n, where X_i denotes a sample image and Y_i denotes the corresponding textual description, the process of training the textual description model is as follows.
Step 1: Initialize the parameters of the textual description model. The CNN adopts the VGG-16 model and is initialized with the parameters obtained by training VGG-16 on the ImageNet dataset; the parameters of the RNN and the other parameters are also initialized. The number of data sampled per batch is set to batch_size = 64.
Step 2: Sample batch_size data from the training data set without replacement.
Step 3: Based on the current model parameters, calculate the probability P of generating the corresponding textual description for each sampled image, and update the current model parameters by gradient descent with P as the objective function.
Step 4: Repeat steps 1 to 3 until the textual description model converges.
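A minimal training-loop sketch corresponding to the steps above is given below. The CaptionModel class is a hypothetical stand-in for the full textual description model (it only exposes the negative log-probability of a labeled description), and the toy data, dimensions, and epoch count are assumptions; the batch_size = 64, the sampling without replacement, and the plain gradient-descent update follow the procedure above.

```python
import random
import torch
import torch.nn as nn

class CaptionModel(nn.Module):                     # stand-in for the full description model
    def __init__(self, feat_dim=128, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vocab)     # placeholder parameters only

    def neg_log_prob(self, image_feats, caption_ids):
        logits = self.proj(image_feats)                        # (T, vocab)
        logp = torch.log_softmax(logits, dim=-1)
        return -logp.gather(1, caption_ids.unsqueeze(1)).sum() # -log P(Y_i | X_i)

# toy training set: (feature tensor, token-id sequence) pairs standing in for (X_i, Y_i)
data = [(torch.randn(5, 128), torch.randint(0, 1000, (5,))) for _ in range(256)]

model = CaptionModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)             # plain gradient descent
batch_size = 64

for epoch in range(3):                                         # repeat until convergence
    order = random.sample(range(len(data)), len(data))         # sampling without replacement
    for start in range(0, len(order), batch_size):             # draw batch_size samples
        batch = [data[i] for i in order[start:start + batch_size]]
        loss = sum(model.neg_log_prob(x, y) for x, y in batch) / len(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```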
In summary, the information processing method 100 according to the embodiment of the disclosure can automatically learn the position and size of the attention window in the image, and generate the corresponding text description based on the content of the attention window. Since the image area to which attention is paid in generating the current text can be dynamically found based on the history information, a more appropriate text description can be generated.
In correspondence with the above-described information processing method embodiments, the present disclosure also provides embodiments of the following information processing apparatus.
Fig. 3 is a block diagram showing a functional configuration example of an information processing apparatus 300 according to an embodiment of the present disclosure.
As shown in fig. 3, an information processing apparatus 300 according to an embodiment of the present disclosure may include an extraction unit 302 and a training unit 304. Next, a functional configuration example of the extraction unit 302 and the training unit 304 will be described.
In the extraction unit 302, a set of feature maps having a predetermined width and a predetermined height may be extracted from each of the plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features.
Examples of the characteristic map fc can be found in the description of the corresponding position in the above method embodiment, and are not repeated here.
Preferably, extracting a set of feature maps from each sample image may include extracting a set of feature maps from each sample image using a convolutional neural network model.
As an example, a set of (p) feature maps fc having a predetermined width s and a predetermined height r of each sample image can be extracted with a convolutional neural network.
In the training unit 304, a word description model may be trained based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, and the word description model may be used to generate corresponding word descriptions from the input images, wherein training the word description model may include calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of the recurrent neural network model.
As an example, there is a textual description separately labeled for each of a plurality of sample images. A word description model may be trained based on the extracted set of feature maps and the labeled word description, which may be used to generate a corresponding word description from the input image. As an example, the center and size of the window of interest on a set of feature maps may be calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
Preferably, calculating the center and size of the window of interest on the set of feature maps comprises: a first neural network model with a sigmoid function as an activation function can be applied to a set of feature maps to convert the set of feature maps into a vector; the converted vector may be merged with a previous state vector of the recurrent neural network model and a second neural network model with a sigmoid function as an activation function is applied to the merged vector; a third neural network model with a tanh function as an activation function may be further applied to the vector obtained via the second neural network model; vectors obtained via the third neural network model may be operated on with the parameters for comparison; and the center and size of the attention window can be calculated from the result of the operation and the predetermined width and height.
For an example of calculating the center and size of the attention window on the set of feature maps, reference may be made to the description of the corresponding position in the above method embodiment, and this is not repeated here.
Preferably, training the text description model may further include: based on the center and the size of the attention window, attention feature vectors of a set of feature maps are obtained. As an example, the attention feature vector of a set of feature maps may be obtained based on the center and size of the attention window obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to a portion corresponding to the attention window on a set of feature maps to convert the portion into one vector, and regarding the one vector as an attention feature vector.
For an example of obtaining the attention feature vector, reference may be made to the description of the corresponding position in the above method embodiment, and this is not repeated here.
Preferably, training the textual description model further comprises: and calculating a current state vector of the recurrent neural network model based on the attention feature vector and a previous state vector in the recurrent neural network model, and obtaining a text description corresponding to the attention window based on the current state vector.
Examples of calculating the current state vector of the recurrent neural network model can be found in the description of the corresponding positions in the above method embodiments, and are not repeated here.
Preferably, obtaining the textual description corresponding to the window of interest may include: and applying a fifth neural network model to the current state vector so as to calculate the occurrence probability of each word in the predetermined word bank, and determining the word with the maximum occurrence probability as the character description corresponding to the attention window.
Examples of determining the text description corresponding to the attention window can refer to the description of the corresponding position in the above method embodiment, and are not repeated here.
Preferably, the recurrent neural network model may also include an LSTM model.
Examples of the LSTM model can be found in the description of the corresponding positions in the above method embodiments, and are not repeated here.
Preferably, training the text description model may further include: for one sample image of the plurality of sample images, when it is determined that the text description corresponding to the attention window is a terminator, training based on the one sample image is terminated.
As an example, for one sample image, when it is determined that the text description corresponding to the attention window is a terminator, it is determined that training based on the one sample image is terminated.
Preferably, the parameters of the textual description model may include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, and parameters for comparison. As an example, training the word description model may include training parameters of the word description model described above.
An example of training a text description model based on a plurality of sample images may refer to the description of the corresponding positions in the above method embodiments, and is not repeated here.
In summary, the information processing apparatus 300 according to the embodiment of the present disclosure can automatically learn the position and size of the attention window in the image, and generate the corresponding text description based on the content of the attention window. Since the image area to which attention is paid in generating the current text can be dynamically found based on the history information, a more appropriate text description can be generated.
It should be noted that although the functional configuration of the information processing apparatus according to the embodiment of the present disclosure is described above, this is merely an example and not a limitation, and a person skilled in the art may modify the above embodiment according to the principle of the present disclosure, for example, addition, deletion, combination, or the like of functional blocks in the respective embodiments may be made, and such modifications fall within the scope of the present disclosure.
In addition, it should be further noted that the apparatus embodiments herein correspond to the method embodiments described above, and therefore, the content that is not described in detail in the apparatus embodiments may refer to the description of the corresponding location in the method embodiments, and the description is not repeated here.
It should be understood that the machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure may also be configured to perform the above-described information processing method, and thus, the contents not described in detail herein may refer to the description of the previous corresponding location, and the description will not be repeated herein.
Accordingly, storage media carrying the above-described program products comprising machine-executable instructions are also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
According to another aspect of the present disclosure, there is provided an information detection method that considers not only a position of a window of interest in an image but also a size of the window of interest.
Next, a flow example of the information detection method 400 according to an embodiment of the present disclosure will be described with reference to fig. 4. Fig. 4 is a flowchart illustrating an example of a flow of an information detection method 400 according to an embodiment of the present disclosure. As shown in fig. 4, an information detection method 400 according to an embodiment of the present disclosure includes an extraction step S402 and a generation step S404.
In the extracting step S402, a set of feature maps having a predetermined width and a predetermined height may be extracted from the input image, wherein the feature maps in the set of feature maps respectively correspond to different image features.
Using existing techniques, a set of p feature maps fc having a predetermined width s and a predetermined height r can be extracted from the input image as fc = CN(image), where image denotes the m × n × c tensor of the image, and m, n and c denote the length, width and number of channels of the input image, respectively; CN() denotes a transformation function; the extracted feature maps fc form an r × s × p tensor, where p denotes the number of features, i.e., each feature map in the set of p feature maps fc corresponds to one of the p image features.
Preferably, extracting a set of feature maps from the input image may include extracting a set of feature maps from the input image using a convolutional neural network model.
As an example, a set (p) of feature maps fc having a predetermined width s and a predetermined height r of the input image may be extracted with a convolutional neural network, where CN () represents a transformation function implemented with the convolutional neural network.
In the generating step S404, a corresponding text description of the input image may be generated by using the trained text description model based on the extracted set of feature maps, wherein the generating of the corresponding text description of the input image by using the trained text description model may include: the center and size of the window of interest on the set of feature maps is calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
As an example, the trained textual description model may be used to generate a corresponding textual description from the input image. A corresponding textual description of the input image may be generated using the trained textual description model based on the extracted set of feature maps. As an example, the center and size of the window of interest on a set of feature maps may be calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
By way of example, let H_t denote the current state vector of the recurrent neural network model at time t, and let H_{t-1} denote its previous state vector at time t-1. The initial state vector H_0 of the recurrent neural network model is initialized to 0, i.e., H_0 = zeros(hd), where zeros() denotes the all-zero function and hd is the dimension of the state vector of the recurrent neural network model.
Preferably, calculating the center and size of the window of interest on the set of feature maps comprises the following. A first neural network model with a sigmoid function as the activation function may be applied to the set of feature maps to convert the set of feature maps into a vector; as an example, the extracted set of feature maps may be merged into one vector, which is then nonlinearly transformed into f1(fc) by a fully connected layer of the neural network using the sigmoid function as the activation function (an example of the first neural network model), where f1(fc) = σ(W1*fc + b1), σ() denotes the sigmoid function, and W1 and b1 are a parameter matrix and a bias parameter vector, respectively. The converted vector may be merged with the previous state vector of the recurrent neural network model, and a second neural network model with the sigmoid function as the activation function may be applied to the merged vector; as an example, the converted vector may be merged with the previous state vector H_{t-1} of the recurrent neural network model to obtain the vector [f1(fc), H_{t-1}], which is then nonlinearly transformed into f2([f1(fc), H_{t-1}]) by a fully connected layer of the neural network with the sigmoid function as the activation function (an example of the second neural network model), where f2([f1(fc), H_{t-1}]) = σ(W2*[f1(fc), H_{t-1}] + b2), and W2 and b2 are a parameter matrix and a bias parameter vector, respectively. A third neural network model with the tanh function as the activation function may further be applied to the vector obtained via the second neural network model; as an example, the vector obtained from f2() may be nonlinearly transformed into tanh(f2([f1(fc), H_{t-1}])). The vector obtained via the third neural network model may be operated on with the parameter for comparison; as an example, it may be point-multiplied with a vector V (an example of the parameter for comparison), and the result may then be normalized by the σ function to obtain σ(tanh(f2([f1(fc), H_{t-1}])) · V). Finally, the center and size of the window of interest may be calculated from the result of the operation and the predetermined width and predetermined height; as an example, the learned window position and size may be normalized as (cs', cr', s', r') = (s, r, s, r) ⊙ σ(tanh(f2([f1(fc), H_{t-1}])) · V), where cs' and cr' denote the center positions of the window of interest on the feature maps fc in the width and height directions, respectively, and s' and r' denote the width and height of the window of interest, respectively.
Preferably, generating the corresponding text description of the input image using the trained text description model may further include: based on the center and the size of the attention window, attention feature vectors of a set of feature maps are obtained. As an example, the attention feature vector of a set of feature maps may be obtained based on the center and size of the attention window obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to a portion corresponding to the attention window on a set of feature maps to convert the portion into one vector, and regarding the one vector as an attention feature vector.
As an example, let att be a matrix of the same size as the feature maps fc, whose value is 1 at the positions corresponding to the window of interest and 0 elsewhere. Thus fc ⊙ att extracts only the content of the feature maps fc inside the window of interest, i.e., the portion of the set of feature maps corresponding to the window of interest can be denoted by fc ⊙ att. Furthermore, fc ⊙ att may be converted into one vector X_t = f(fc ⊙ att) by one fully connected layer of the neural network (an example of the fourth neural network model), and the vector X_t is taken as the attention feature vector, where f() is a transformation function. The attention feature vector X_t may be used as the input of the recurrent neural network model at time t.
Preferably, generating the respective textual description of the input image using the trained textual description model further comprises: and calculating a current state vector of the recurrent neural network model based on the attention feature vector and a previous state vector in the recurrent neural network model, and obtaining a text description corresponding to the attention window based on the current state vector.
As an example, the current state vector of the recurrent neural network model at the current time t may be calculated based on the attention feature vector X_t and the previous state vector H_{t-1} at time t-1 as H_t = tanh(Wh*H_{t-1} + Wi*X_t + B), where Wh and Wi are parameter matrices and B is a bias parameter vector. A textual description corresponding to the window of interest may then be obtained based on the current state vector.
Preferably, obtaining the textual description corresponding to the window of interest may include: and applying a fifth neural network model to the current state vector so as to calculate the occurrence probability of each word in the predetermined word bank, and determining the word with the maximum occurrence probability as the character description corresponding to the attention window.
As an example, a neural network model with a softmax function as the activation function (an example of the fifth neural network model) may be applied to the current state vector H_t of the recurrent neural network model, thereby calculating the occurrence probability of each word Y_t in the predetermined thesaurus as P(Y_t) = softmax(σ(Wp*H_t + bp)), where Wp and bp are a parameter matrix and a bias parameter vector, respectively. The word with the highest occurrence probability is then determined as the textual description corresponding to the window of interest.
Preferably, the recurrent neural network model may also include a long short term memory network (LSTM) model.
As an example, in the case where the recurrent neural network model is an LSTM model, when the LSTM model is initialized it is necessary to initialize both the state vector H_0 and the cell state vector c_0 of the LSTM model, i.e., H_0 = zeros(hd) and c_0 = zeros(hd), where hd is the dimension of the state.
In the case where the recurrent neural network model is an LSTM model, the position and size of the window of interest are calculated and the input X at time t of the LSTM model is calculatedtAs described above for the general recurrent neural network model.
The calculation of the current state vector H_t of the LSTM model at time t is described in detail below. The current state vector H_t of the LSTM model at time t depends on the previous state vector H_{t-1}, the previous cell state vector C_{t-1}, and the input X_t at the current time. First, three gate state vectors are calculated based on the previous state vector H_{t-1} and the current input vector X_t: the input gate state vector i_t = σ(Wi*[H_{t-1}, X_t] + bi), the output gate state vector o_t = σ(Wo*[H_{t-1}, X_t] + bo), and the forget gate state vector f_t = σ(Wf*[H_{t-1}, X_t] + bf), where Wi, Wo and Wf are parameter matrices and bi, bo and bf are bias parameter vectors. Then, the current cell state vector C_t and the current state vector H_t are calculated as C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(Wc*[H_{t-1}, X_t] + bc) and H_t = o_t ⊙ tanh(C_t), where Wc and bc are a parameter matrix and a bias parameter vector, respectively. Once the current state vector H_t of the LSTM model at time t has been calculated, the method of obtaining the textual description corresponding to the window of interest based on H_t is the same as described above for the general recurrent neural network model.
The above description has taken the current time t as an example to introduce how the center and size of the window of interest on the set of feature maps are calculated based on the set of feature maps fc and the previous state vector H_{t-1} of the recurrent neural network model, and how the current state vector H_t of the recurrent neural network model at the current time t is then calculated to obtain the textual description corresponding to the window of interest. Similarly, the state vectors of the recurrent neural network model at times t+1, t+2, … may be calculated, so as to obtain the textual descriptions corresponding to the windows of interest at times t+1, t+2, …, respectively.
Preferably, generating the corresponding text description of the input image using the trained text description model may further include: when the text description corresponding to the attention window is determined to be the terminator, the generation of the corresponding text description of the input image is terminated.
As an example, when it is determined that the text description corresponding to the window of interest is a terminator, the generation of the corresponding text description of the input image is terminated.
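The detection-time generation loop can be sketched as follows. The names bundled in step_fns are hypothetical stand-ins for the computations described above (window calculation, attention feature extraction, state update, and word selection), and the toy stand-ins at the bottom exist only so the sketch runs end to end; the points being illustrated are the loop structure, termination when the terminator is produced, and a maximum-length guard (the guard is an added assumption, not part of the description above).

```python
import numpy as np

def generate_description(fc, step_fns, max_len=20, end_token="<end>"):
    """step_fns bundles the per-step computations described above (hypothetical names)."""
    H = step_fns["initial_state"]()
    words = []
    for _ in range(max_len):
        window = step_fns["attention_window"](fc, H)   # center and size of the window
        X = step_fns["attention_feature"](fc, window)  # attention feature vector X_t
        H = step_fns["rnn_step"](H, X)                 # new state vector H_t
        word = step_fns["best_word"](H)                # fifth network + argmax
        if word == end_token:                          # terminator ends generation
            break
        words.append(word)
    return " ".join(words)

# toy stand-ins just to make the sketch executable end to end
rng = np.random.default_rng(0)
vocab = ["girl", "horse", "<end>"]
toy = {
    "initial_state": lambda: np.zeros(4),
    "attention_window": lambda fc, H: (3.5, 3.5, 4.0, 4.0),
    "attention_feature": lambda fc, w: rng.standard_normal(4),
    "rnn_step": lambda H, X: np.tanh(H + X),
    "best_word": lambda H: rng.choice(vocab),
}
print(generate_description(rng.standard_normal((7, 7, 512)), toy))
```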
Preferably, the parameters of the trained textual description model may include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, as well as parameters for comparison. The above parameters of the trained word description model may be determined by an information processing method according to an embodiment of the present disclosure.
Fig. 5 is a diagram illustrating an example of an input image and its corresponding textual description according to an embodiment of the present disclosure. The leftmost image in fig. 5 is the input image. The middle images in fig. 5 schematically show the image regions related to the windows of interest in the input image; for example, these regions include an image of the "girl" and an image of the "horse standing beside her", respectively. The rightmost part of fig. 5 is the textual description corresponding to the input image, i.e., "a girl and a horse standing beside her".
In summary, the information detection method 400 according to the embodiment of the disclosure considers the position and size of the attention window in the image, and generates the corresponding text description based on the content of the attention window. Since the image area to which attention is paid in generating the current text can be dynamically found based on the history information, a more appropriate text description can be generated.
Correspondingly to the above information detection method embodiment, the present disclosure also provides the following information detection apparatus embodiment.
Fig. 6 is a block diagram showing a functional configuration example of an information detection apparatus 600 according to an embodiment of the present disclosure.
As shown in fig. 6, an information detecting apparatus 600 according to an embodiment of the present disclosure may include an extracting unit 602 and a generating unit 604. Next, a functional configuration example of the extracting unit 602 and the generating unit 604 will be described.
In the extraction unit 602, a set of feature maps having a predetermined width and a predetermined height may be extracted from the input image, wherein the feature maps in the set of feature maps respectively correspond to different image features.
Examples of the characteristic map fc can be found in the description of the corresponding position in the above method embodiment, and are not repeated here.
Preferably, extracting a set of feature maps from the input image may include extracting a set of feature maps from the input image using a convolutional neural network model.
As an example, a set of (p) feature maps fc of the input image having a predetermined width s and a predetermined height r may be extracted with a convolutional neural network.
In the generating unit 604, a corresponding text description of the input image may be generated using the trained text description model based on the extracted set of feature maps, where generating the corresponding text description of the input image using the trained text description model may include: the center and size of the window of interest on the set of feature maps is calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
As an example, the trained textual description model may be used to generate a corresponding textual description from the input image. A corresponding textual description of the input image may be generated using the trained textual description model based on the extracted set of feature maps. As an example, the center and size of the window of interest on a set of feature maps may be calculated based on the set of feature maps and a previous state vector of the recurrent neural network model.
Preferably, calculating the center and size of the window of interest on the set of feature maps comprises: a first neural network model with a sigmoid function as an activation function can be applied to a set of feature maps to convert the set of feature maps into a vector; the converted vector may be merged with a previous state vector of the recurrent neural network model and a second neural network model with a sigmoid function as an activation function is applied to the merged vector; a third neural network model with a tanh function as an activation function may be further applied to the vector obtained via the second neural network model; vectors obtained via the third neural network model may be operated on with the parameters for comparison; and the center and size of the attention window can be calculated from the result of the operation and the predetermined width and height.
For an example of calculating the center and size of the attention window on the set of feature maps, reference may be made to the description of the corresponding position in the above method embodiment, and this is not repeated here.
Preferably, generating the corresponding text description of the input image using the trained text description model may further include: based on the center and the size of the attention window, attention feature vectors of a set of feature maps are obtained. As an example, the attention feature vector of a set of feature maps may be obtained based on the center and size of the attention window obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to a portion corresponding to the attention window on a set of feature maps to convert the portion into one vector, and regarding the one vector as an attention feature vector.
For an example of obtaining the attention feature vector, reference may be made to the description of the corresponding position in the above method embodiment, and this is not repeated here.
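Continuing the sketch above, one assumed realization of the "one vector" conversion is to crop the portion of every feature map under the attention window and pass it through a fourth, fully connected model; the zero padding and the tanh activation below are illustrative choices, not part of the embodiment.

    def attention_feature(fc, center, size, params):
        # Crop, on every feature map, the portion covered by the attention window.
        (cx, cy), (w, h) = center, size
        x0, x1 = int(max(cx - w / 2, 0)), int(min(cx + w / 2, fc.shape[2]))
        y0, y1 = int(max(cy - h / 2, 0)), int(min(cy + h / 2, fc.shape[1]))
        patch = fc[:, y0:y1, x0:x1]
        # Fourth model: a single dense layer on the zero-padded, flattened patch.
        flat = np.zeros(params["W4"].shape[1])
        n = min(patch.size, flat.size)
        flat[:n] = patch.reshape(-1)[:n]
        return np.tanh(params["W4"] @ flat + params["b4"])  # the attention feature vector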
Preferably, generating the corresponding text description of the input image using the trained text description model may further include: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the attention window based on the current state vector.
Examples of calculating the current state vector of the recurrent neural network model can be found in the description of the corresponding positions in the above method embodiments, and are not repeated here.
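A plain recurrent update is sketched below as a stand-in for this step; the tanh update is an assumption, and the preferred LSTM variant mentioned below would add input, forget, and output gates plus a cell state on top of it.

    def rnn_step(z_t, h_prev, params):
        # Current state vector from the attention feature vector z_t and the previous state vector.
        return np.tanh(params["Wz"] @ z_t + params["Wh"] @ h_prev + params["bh"])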
Preferably, obtaining the text description corresponding to the window of interest may include: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in the predetermined word bank, and determining the word with the maximum occurrence probability as the text description corresponding to the attention window.
Examples of determining the text description corresponding to the attention window can refer to the description of the corresponding position in the above method embodiment, and are not repeated here.
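As an assumed illustration of this step, the fifth model below is a single dense layer followed by a softmax over a predetermined word bank; the softmax normalization is an assumption, since the embodiment only speaks of occurrence probabilities.

    def predict_word(h_t, params, vocab):
        # Fifth model: map the current state vector to one score per word in the word bank.
        logits = params["Wo"] @ h_t + params["bo"]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # occurrence probability of each word
        return vocab[int(np.argmax(probs))], probs    # word with the maximum probability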
Preferably, the recurrent neural network model may also include an LSTM model.
Examples of the LSTM model can be found in the description of the corresponding positions in the above method embodiments, and are not repeated here.
Preferably, generating the corresponding text description of the input image using the trained text description model may further include: terminating the generation of the corresponding text description of the input image when the text description corresponding to the attention window is determined to be a terminator.
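Putting the assumed sketches above together, generation for one input image might proceed as follows; the maximum length cap and the "<end>" terminator symbol are illustrative choices only.

    def generate_caption(fc, params, vocab, r, s, max_len=20, end_token="<end>"):
        # Repeat: attention window -> attention feature vector -> state update -> word,
        # stopping as soon as the terminator is produced.
        h = np.zeros(params["Wh"].shape[0])
        words = []
        for _ in range(max_len):
            center, size = attention_window(fc, h, params, r, s)
            z = attention_feature(fc, center, size, params)
            h = rnn_step(z, h, params)
            word, _ = predict_word(h, params, vocab)
            if word == end_token:
                break
            words.append(word)
        return " ".join(words)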
Preferably, the parameters of the trained textual description model may include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, as well as parameters for comparison. The above parameters of the trained word description model may be determined by an information processing method according to an embodiment of the present disclosure.
In summary, the information detection apparatus 600 according to the embodiment of the present disclosure takes into account the position and size of the attention window in the image and generates a corresponding text description based on the content of that window. Since the image region to be attended to when generating the current text can be found dynamically from the history information, a more appropriate text description can be generated.
It should be noted that although the functional configuration of the information detection apparatus according to the embodiment of the present disclosure is described above, this is merely an example and not a limitation, and a person skilled in the art may modify the above embodiment according to the principle of the present disclosure, for example, addition, deletion, combination, or the like of functional modules in the respective embodiments may be made, and such modifications fall within the scope of the present disclosure.
In addition, it should be further noted that the apparatus embodiments herein correspond to the method embodiments described above, and therefore, the content that is not described in detail in the apparatus embodiments may refer to the description of the corresponding location in the method embodiments, and the description is not repeated here.
It should be understood that the machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure may also be configured to perform the above-described information detection method; therefore, for contents not described in detail herein, reference may be made to the description at the corresponding location above, which is not repeated here.
Accordingly, storage media carrying the above-described program products comprising machine-executable instructions are also included in the present disclosure, including, but not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
Further, it should be noted that the above series of processes and means may also be implemented by software and/or firmware. In the case of implementation by software and/or firmware, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware structure, such as the general-purpose personal computer 700 shown in fig. 7, which is capable of executing various functions and the like when various programs are installed.
In fig. 7, a Central Processing Unit (CPU) 701 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 to a Random Access Memory (RAM) 703. In the RAM 703, data necessary when the CPU 701 executes various processes and the like is also stored as necessary.
The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, and the like. The communication section 709 performs communication processing via a network such as the internet.
A drive 710 is also connected to the input/output interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that the computer program read out therefrom is installed in the storage section 708 as necessary.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 711.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 711 shown in fig. 7 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc-read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a mini-disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage section 708, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.
The preferred embodiments of the present disclosure are described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications within the scope of the appended claims may be made by those skilled in the art, and it should be understood that these changes and modifications naturally will fall within the technical scope of the present disclosure.
For example, a plurality of functions included in one unit may be implemented by separate devices in the above embodiments. Alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may be implemented by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only the processing performed in time series in the described order but also the processing performed in parallel or individually without necessarily being performed in time series. Further, even in the steps processed in time series, needless to say, the order can be changed as appropriate.
In addition, the technique according to the present disclosure can also be configured as follows.
Supplementary note 1. An information processing method, comprising:
extracting a set of feature maps having a predetermined width and a predetermined height from each of the plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and
training a word description model based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, the word description model for generating respective word descriptions from the input images, wherein training the word description model comprises computing a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model.
Supplementary note 2. The information processing method according to Supplementary note 1, wherein extracting the set of feature maps from each sample image includes extracting the set of feature maps from each sample image using a convolutional neural network model.
Supplementary note 3. The information processing method according to Supplementary note 1, wherein calculating the center and size of the attention window on the set of feature maps includes:
applying a first neural network model with a sigmoid function as an activation function to the set of feature maps to convert the set of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model and applying a second neural network model with a sigmoid function as an activation function to the merged vector;
further applying a third neural network model with a tanh function as an activation function to the vector obtained via the second neural network model;
performing an operation on the vector obtained via the third neural network model with the parameters for comparison; and
calculating the center and the size of the attention window according to the result of the operation, the predetermined width, and the predetermined height.
Supplementary note 4. The information processing method according to Supplementary note 1, wherein training the word description model further comprises: obtaining a feature vector of interest of the set of feature maps based on the center and size of the window of interest.
Supplementary note 5. The information processing method according to Supplementary note 4, wherein obtaining the attention feature vector includes: applying a fourth neural network model to a portion of the set of feature maps corresponding to the window of interest to convert the portion into one vector, and treating the one vector as the feature vector of interest.
Supplementary note 6. The information processing method according to Supplementary note 4, wherein training the word description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector in the recurrent neural network model, and obtaining a textual description corresponding to the attention window based on the current state vector.
Supplementary note 7. The information processing method according to Supplementary note 6, wherein obtaining the text description corresponding to the window of interest includes: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined word bank, and determining the word with the maximum occurrence probability as the text description corresponding to the attention window.
Supplementary note 8. The information processing method according to Supplementary note 6, wherein training the word description model further comprises: for one sample image of the plurality of sample images, terminating training performed based on the one sample image when it is determined that the text description corresponding to the attention window is a terminator.
Supplementary note 9. The information processing method according to Supplementary note 7, wherein the parameters of the word description model include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, as well as the parameters for comparison.
Supplementary note 10. The information processing method according to Supplementary note 1, wherein the recurrent neural network model further includes a long short-term memory network.
Supplementary note 11. An information processing apparatus, comprising:
an extraction unit configured to extract a set of feature maps having a predetermined width and a predetermined height from each of a plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and
a training unit configured to train a word description model based on the extracted set of feature maps and word descriptions labeled for the plurality of sample images, the word description model being used to generate respective word descriptions from an input image, wherein training the word description model comprises calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model.
Supplementary note 12. An information detection method, comprising:
extracting a group of feature maps with a predetermined width and a predetermined height from an input image, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
generating a respective textual description of the input image using the trained textual description model based on the extracted set of feature maps, wherein generating the respective textual description of the input image using the trained textual description model comprises: calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of the recurrent neural network model.
Supplementary note 13. The information detection method according to Supplementary note 12, wherein extracting the set of feature maps from the input image includes extracting the set of feature maps from the input image using a convolutional neural network model.
Supplementary note 14. The information detection method according to Supplementary note 12, wherein calculating the center and size of the attention window on the set of feature maps includes:
applying a first neural network model with a sigmoid function as an activation function to the set of feature maps to convert the set of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model and applying a second neural network model with a sigmoid function as an activation function to the merged vector;
further applying a third neural network model with a tanh function as an activation function to the vector obtained via the second neural network model;
performing an operation on the vector obtained via the third neural network model with the parameters for comparison; and
calculating the center and the size of the attention window according to the result of the operation, the predetermined width, and the predetermined height.
Supplementary note 15. The information detection method according to Supplementary note 12, wherein generating the corresponding text description of the input image using the trained text description model further comprises: obtaining a feature vector of interest of the set of feature maps based on the center and size of the window of interest.
Supplementary note 16. The information detection method according to Supplementary note 15, wherein obtaining the attention feature vector includes: applying a fourth neural network model to a portion of the set of feature maps corresponding to the window of interest to convert the portion into one vector, and treating the one vector as the feature vector of interest.
Supplementary note 17. The information detection method according to Supplementary note 15, wherein generating the corresponding text description of the input image using the trained text description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector in the recurrent neural network model, and obtaining a textual description corresponding to the attention window based on the current state vector.
Supplementary note 18. The information detection method according to Supplementary note 17, wherein obtaining the text description corresponding to the window of interest includes: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined word bank, and determining the word with the maximum occurrence probability as the text description corresponding to the attention window.
Supplementary note 19. The information detection method according to Supplementary note 17, wherein generating the corresponding text description of the input image using the trained text description model further comprises: terminating the generation of the corresponding text description of the input image when the text description corresponding to the attention window is determined to be a terminator.
Supplementary note 20. The information detection method according to Supplementary note 18, wherein the parameters of the trained textual description model include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, and parameters of the recurrent neural network model, as well as the parameters for comparison.

Claims (9)

1. An information processing method comprising:
extracting a set of feature maps having a predetermined width and a predetermined height from each of the plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and
training a text description model based on the extracted set of feature maps and text descriptions labeled for the plurality of sample images, the text description model for generating respective text descriptions from input images, wherein training the text description model comprises computing a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model,
wherein calculating the center and size of the window of interest on the set of feature maps comprises:
applying a first neural network model with a sigmoid function as an activation function to the set of feature maps to convert the set of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model and applying a second neural network model with a sigmoid function as an activation function to the merged vector;
further applying a third neural network model with a tanh function as an activation function to the vector obtained via the second neural network model;
performing an operation on the vector obtained via the third neural network model with the parameters for comparison; and
calculating the center and the size of the attention window according to the result of the operation, the predetermined width, and the predetermined height.
2. The information processing method according to claim 1, wherein extracting the set of feature maps from each sample image includes extracting the set of feature maps from each sample image using a convolutional neural network model.
3. The information processing method of claim 1, wherein training the word description model further comprises: obtaining a feature vector of interest of the set of feature maps based on the center and size of the window of interest.
4. The information processing method according to claim 3, wherein obtaining the attention feature vector includes: applying a fourth neural network model to a portion of the set of feature maps corresponding to the window of interest to convert the portion into one vector, and treating the one vector as the feature vector of interest.
5. The information processing method of claim 3, wherein training the word description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector in the recurrent neural network model, and obtaining a textual description corresponding to the attention window based on the current state vector.
6. The information processing method according to claim 5, wherein obtaining the textual description corresponding to the window of interest includes: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined word bank, and determining the word with the maximum occurrence probability as the textual description corresponding to the attention window.
7. The information processing method of claim 5, wherein training the word description model further comprises: for one sample image of the plurality of sample images, terminating training performed based on the one sample image when it is determined that the text description corresponding to the attention window is a terminator.
8. An information processing apparatus comprising:
an extraction unit configured to extract a set of feature maps having a predetermined width and a predetermined height from each of a plurality of sample images, wherein the feature maps in the set of feature maps respectively correspond to different image features; and
a training unit configured to train a text description model based on the extracted set of feature maps and text descriptions labeled for the plurality of sample images, the text description model for generating respective text descriptions from an input image, wherein training the text description model comprises calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of a recurrent neural network model,
wherein calculating the center and size of the window of interest on the set of feature maps comprises:
applying a first neural network model with a sigmoid function as an activation function to the set of feature maps to convert the set of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model and applying a second neural network model with a sigmoid function as an activation function to the merged vector;
further applying a third neural network model with a tanh function as an activation function to the vector obtained via the second neural network model;
performing an operation on the vector obtained via the third neural network model with the parameters for comparison; and
calculating the center and the size of the attention window according to the result of the operation, the predetermined width, and the predetermined height.
9. An information detection method, comprising:
extracting a group of feature maps with a predetermined width and a predetermined height from an input image, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
generating a respective textual description of the input image using the trained textual description model based on the extracted set of feature maps, wherein generating the respective textual description of the input image using the trained textual description model comprises: calculating a center and a size of a window of interest on the set of feature maps based on the set of feature maps and a previous state vector of the recurrent neural network model,
wherein calculating the center and size of the window of interest on the set of feature maps comprises:
applying a first neural network model with a sigmoid function as an activation function to the set of feature maps to convert the set of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model and applying a second neural network model with a sigmoid function as an activation function to the merged vector;
further applying a third neural network model with a tanh function as an activation function to the vector obtained via the second neural network model;
performing an operation on the vector obtained via the third neural network model with the parameters for comparison; and
calculating the center and the size of the attention window according to the result of the operation, the predetermined width, and the predetermined height.
CN201710320880.4A 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device Active CN108875758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710320880.4A CN108875758B (en) 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710320880.4A CN108875758B (en) 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device

Publications (2)

Publication Number Publication Date
CN108875758A CN108875758A (en) 2018-11-23
CN108875758B true CN108875758B (en) 2022-01-11

Family

ID=64287118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710320880.4A Active CN108875758B (en) 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device

Country Status (1)

Country Link
CN (1) CN108875758B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11745727B2 (en) * 2018-01-08 2023-09-05 STEER-Tech, LLC Methods and systems for mapping a parking area for autonomous parking
CN110321918A (en) * 2019-04-28 2019-10-11 厦门大学 The method of public opinion robot system sentiment analysis and image labeling based on microblogging

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765728A (en) * 2014-01-08 2015-07-08 富士通株式会社 Method and device for training neural network and method for determining sparse feature vector
CN105809201A (en) * 2016-03-11 2016-07-27 中国科学院自动化研究所 Identification method and device for autonomously extracting image meaning concepts in biologically-inspired mode
CN105989341A (en) * 2015-02-17 2016-10-05 富士通株式会社 Character recognition method and device
CN106198749A (en) * 2015-05-08 2016-12-07 中国科学院声学研究所 A kind of data fusion method of multiple sensor based on Metal Crack monitoring
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN106484139A (en) * 2016-10-19 2017-03-08 北京新美互通科技有限公司 Emoticon recommends method and device
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8509537B2 (en) * 2010-08-05 2013-08-13 Xerox Corporation Learning weights of fonts for typed samples in handwritten keyword spotting

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765728A (en) * 2014-01-08 2015-07-08 富士通株式会社 Method and device for training neural network and method for determining sparse feature vector
CN105989341A (en) * 2015-02-17 2016-10-05 富士通株式会社 Character recognition method and device
CN106198749A (en) * 2015-05-08 2016-12-07 中国科学院声学研究所 A kind of data fusion method of multiple sensor based on Metal Crack monitoring
CN105809201A (en) * 2016-03-11 2016-07-27 中国科学院自动化研究所 Identification method and device for autonomously extracting image meaning concepts in biologically-inspired mode
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN106484139A (en) * 2016-10-19 2017-03-08 北京新美互通科技有限公司 Emoticon recommends method and device
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Linguistic Feature Vector for the Visual Interpretation of Sign Language;R Bowden et al;《Computer Vision》;20040511;390-401 *
Show and tell: A neural image caption generator;Oriol Vinyals et al;《2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)》;20150612;pp. 3156-3157 *
Deep Models and Their Application in Visual Text Analysis;Zhang Shuye;《China Doctoral Dissertations Full-text Database (Information Science and Technology)》;20170215(No. 2);I138-179 *
Research and Simulation on Optimized Recognition of Specific Text Images in Natural Scenes;Li Yuejie et al;《Computer Simulation》;20170117;Vol. 33(No. 11);357-360 *

Also Published As

Publication number Publication date
CN108875758A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN111444340B (en) Text classification method, device, equipment and storage medium
US20190303535A1 (en) Interpretable bio-medical link prediction using deep neural representation
CN111460838A (en) Pre-training method and device of intelligent translation model and storage medium
US11481605B2 (en) 2D document extractor
CN110852110B (en) Target sentence extraction method, question generation method, and information processing apparatus
WO2022217849A1 (en) Methods and systems for training neural network model for mixed domain and multi-domain tasks
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN111581970B (en) Text recognition method, device and storage medium for network context
US10963647B2 (en) Predicting probability of occurrence of a string using sequence of vectors
CN112819020A (en) Method and device for training classification model and classification method
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN112381079A (en) Image processing method and information processing apparatus
CN108875758B (en) Information processing method and device, and information detection method and device
WO2023168810A1 (en) Method and apparatus for predicting properties of drug molecule, storage medium, and computer device
CN113326940A (en) Knowledge distillation method, device, equipment and medium based on multiple knowledge migration
CN112528989B (en) Description generation method for semantic fine granularity of image
US10360993B2 (en) Extract information from molecular pathway diagram
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN115422362B (en) Text matching method based on artificial intelligence
CN112732896B (en) Target information display method, device, electronic equipment and medium
US20210142006A1 (en) Generating method, non-transitory computer readable recording medium, and information processing apparatus
CN115017321A (en) Knowledge point prediction method and device, storage medium and computer equipment
CN116415587A (en) Information processing apparatus and information processing method
CN113806747A (en) Trojan horse picture detection method and system and computer readable storage medium
CN113361277A (en) Medical named entity recognition modeling method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant