CN108875758A - Information processing method and device and information detecting method and device - Google Patents


Info

Publication number
CN108875758A
CN108875758A (application CN201710320880.4A)
Authority
CN
China
Prior art keywords
feature map
group
text description
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710320880.4A
Other languages
Chinese (zh)
Other versions
CN108875758B (en)
Inventor
侯翠琴
夏迎炬
杨铭
张姝
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201710320880.4A priority Critical patent/CN108875758B/en
Publication of CN108875758A publication Critical patent/CN108875758A/en
Application granted granted Critical
Publication of CN108875758B publication Critical patent/CN108875758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Abstract

An information processing method and device and an information detection method and device are disclosed. The information processing method includes: extracting, from each sample image of multiple sample images, a group of feature maps having a predetermined width and a predetermined height, where the feature maps in the group respectively correspond to different image features; and training a text description model based on the extracted group of feature maps and the text descriptions labeled for the multiple sample images, the text description model being used to generate a corresponding text description from an input image. Training the text description model includes computing the center and size of an attention window on the group of feature maps based on the group of feature maps and the previous state vector of a recurrent neural network model. According to embodiments of the disclosure, a more suitable text description of an image can be generated.

Description

Information processing method and device and information detecting method and device
Technical field
The present disclosure relates to the field of information processing, and in particular to an information processing method and device and an information detection method and device that consider not only the position of an attention window in an image but also the size of the attention window.
Background technique
Understanding image content and describing it in natural language is one of the major problems and ultimate goals of the field of artificial intelligence. Describing an image requires not only recognizing the objects in the image but also describing, in natural language, the objects and the relationships between them. Describing image content in natural language is therefore a very challenging problem. Some existing methods attempt to address this challenge. For example, one approach first detects the objects in an image and infers the relationships between them, and then generates a natural sentence describing the image content from templates. There are also end-to-end methods based on neural network models. In addition, an attention model may be added to the model, but such an attention model only learns automatically the position of an attention window of fixed size.
Summary of the invention
A brief overview of the disclosure is given below in order to provide a basic understanding of some aspects of the disclosure. It should be understood, however, that this overview is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure, nor to limit the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in simplified form as a prelude to the more detailed description given later.
In view of the above problems, an object of the present disclosure is to provide an information processing method and device and an information detection method and device that consider not only the position of the attention window in an image but also the size of the attention window.
According to one aspect of the disclosure, an information processing method is provided, including: extracting, from each sample image of multiple sample images, a group of feature maps having a predetermined width and a predetermined height, where the feature maps in the group respectively correspond to different image features; and training a text description model based on the extracted group of feature maps and the text descriptions labeled for the multiple sample images, the text description model being used to generate a corresponding text description from an input image. Training the text description model may include computing the center and size of the attention window on the group of feature maps based on the group of feature maps and the previous state vector of a recurrent neural network model.
According to another aspect of the present disclosure, an information processing device is provided, including: an extraction unit, which may be configured to extract, from each sample image of multiple sample images, a group of feature maps having a predetermined width and a predetermined height, where the feature maps in the group respectively correspond to different image features; and a training unit, which may be configured to train a text description model based on the extracted group of feature maps and the text descriptions labeled for the multiple sample images, the text description model being used to generate a corresponding text description from an input image. Training the text description model may include computing the center and size of the attention window on the group of feature maps based on the group of feature maps and the previous state vector of a recurrent neural network model.
According to a further aspect of the disclosure, an information detection method is provided, including: extracting, from an input image, a group of feature maps having a predetermined width and a predetermined height, where the feature maps in the group respectively correspond to different image features; and generating, based on the extracted group of feature maps, a text description corresponding to the input image using a trained text description model. Generating the text description corresponding to the input image using the trained text description model may include: computing the center and size of the attention window on the group of feature maps based on the group of feature maps and the previous state vector of a recurrent neural network model.
According to other aspects of the disclosure, computer program code and a computer program product for implementing the above methods according to the disclosure are also provided, as well as a computer-readable storage medium on which the computer program code for implementing the above methods according to the disclosure is recorded.
Other aspects of embodiments of the disclosure are given in the following description, in which the detailed description serves to fully disclose preferred embodiments of the disclosure without limiting it.
Detailed description of the invention
The disclosure may be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference signs are used throughout the drawings to denote the same or similar components. The drawings, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate preferred embodiments of the disclosure and to explain the principles and advantages of the disclosure. In the drawings:
Fig. 1 is a flowchart showing a flow example of an information processing method according to an embodiment of the disclosure;
Fig. 2 is a flowchart showing a flow example of computing the center and size of the attention window on a group of feature maps in the information processing method according to an embodiment of the disclosure;
Fig. 3 is a block diagram showing a functional configuration example of an information processing device according to an embodiment of the disclosure;
Fig. 4 is a flowchart showing a flow example of an information detection method according to an embodiment of the disclosure;
Fig. 5 is a diagram showing examples of input images and their corresponding text descriptions according to an embodiment of the disclosure;
Fig. 6 is a block diagram showing a functional configuration example of an information detection device according to an embodiment of the disclosure; and
Fig. 7 is a block diagram showing an example structure of a personal computer employable as the information processing device in embodiments of the disclosure.
Specific embodiment
Exemplary embodiments of the disclosure are described below in conjunction with the drawings. For clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be understood that, although such development work may be complex and time-consuming, it is merely a routine undertaking for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the disclosure with unnecessary detail, only the device structures and/or processing steps closely related to the scheme of the disclosure are shown in the drawings, while other details of little relevance to the disclosure are omitted.
The present application proposes an information processing method that computes the attention window in an image. Based on the previous state vector of a recurrent neural network model and the image content, the method automatically learns the position and size of the image region that the current state needs to attend to; the recurrent neural network model then updates the current state vector based on the previous state vector and the computed image window to be attended to, computes the probability of generating each word, and finally produces a sentence describing the image.
Embodiments of the disclosure are described in detail below with reference to the drawings.
First, a flow example of an information processing method 100 according to an embodiment of the disclosure is described with reference to Fig. 1. Fig. 1 is a flowchart showing a flow example of the information processing method according to an embodiment of the disclosure. As shown in Fig. 1, the information processing method 100 according to an embodiment of the disclosure includes an extraction step S102 and a training step S104.
In extraction step S102, a group of feature maps having a predetermined width and a predetermined height may be extracted from each sample image of multiple sample images, where the feature maps in the group respectively correspond to different image features.
The prior art may be used to extract, from each sample image, a group of (p) feature maps fc having a predetermined width s and a predetermined height r: fc = CN(image), where image denotes the m*n*c tensor of the image, m, n and c respectively denote the length, width and number of channels of the sample image, and CN() denotes a transform function. The extracted feature maps fc form an r*s*p tensor, where p denotes the number of features, i.e., each feature map in the group of (p) feature maps fc corresponds to one of the p image features.
Preferably, extracting the group of feature maps from each sample image may include extracting the group of feature maps from each sample image using a convolutional neural network model.
As an example, a convolutional neural network may be used to extract, from each sample image, the group of (p) feature maps fc having the predetermined width s and predetermined height r, where CN() denotes the transform function implemented by the convolutional neural network.
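As an illustration of the tensor shapes involved, the following sketch stands in a toy transform for CN(): it pools an m*n*c image tensor down to an r*s*p feature-map group. The pooling-plus-projection transform, the dimensions and all names are hypothetical; a real implementation would use a trained CNN backbone such as VGG-16 rather than this stand-in.

```python
import numpy as np

def extract_feature_maps(image, r=7, s=7, p=4, seed=0):
    """Toy stand-in for fc = CN(image): maps an m*n*c image tensor
    to a group of p feature maps of height r and width s."""
    m, n, c = image.shape
    rng = np.random.default_rng(seed)
    # Hypothetical learned projection from c channels to p features.
    W = rng.standard_normal((c, p)) * 0.1
    # Average-pool the image down to an r*s spatial grid.
    pooled = image.reshape(r, m // r, s, n // s, c).mean(axis=(1, 3))
    fc = np.tanh(pooled @ W)          # r*s*p tensor of feature maps
    return fc

image = np.random.default_rng(1).random((28, 28, 3))   # m=n=28, c=3
fc = extract_feature_maps(image)                        # shape (7, 7, 4)
```

The point of the sketch is only the shape contract stated in the text: whatever CN() is, it must turn the m*n*c image tensor into an r*s*p tensor of feature maps.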
In training step S104, a text description model may be trained based on the extracted group of feature maps and the text descriptions labeled for the multiple sample images; the text description model may be used to generate a corresponding text description from an input image. Training the text description model may include computing the center and size of the attention window on the group of feature maps based on the group of feature maps and the previous state vector of the recurrent neural network model.
As an example, a text description is labeled for each sample image of the multiple sample images. The text description model may be trained based on the extracted group of feature maps and the labeled text descriptions, and may be used to generate a corresponding text description from an input image. As an example, the center and size of the attention window on the group of feature maps may be computed based on the group of feature maps and the previous state vector of the recurrent neural network model.
As an example, let Ht denote the current state vector of the recurrent neural network model at time t, and Ht-1 the previous state vector at time t-1. The initial state vector H0 of the recurrent neural network model is initialized to zero, i.e., H0 = zeros(hd), where zeros() denotes the all-zeros function and hd is the dimension of the state vector of the recurrent neural network model.
Fig. 2 is a flowchart showing a flow example of computing the center and size of the attention window on the group of feature maps in the information processing method according to an embodiment of the disclosure. Preferably, computing the center and size of the attention window on the group of feature maps may include the following steps. In step S202, a first neural network model with the sigmoid function as its activation function may be applied to the group of feature maps to convert the group of feature maps into a vector. As an example, the extracted group of feature maps may be merged into one vector, and a fully connected layer of a neural network with the sigmoid function as the activation function (an example of the first neural network model) may apply a nonlinear transform to the merged vector to obtain f1(fc), so that the group of feature maps is converted into a vector, where the nonlinear transform function is f1(fc) = σ(W1*fc + b1), σ() denotes the sigmoid function, and W1 and b1 are respectively a parameter matrix and a bias parameter vector. In step S204, the converted vector may be merged with the previous state vector of the recurrent neural network model, and a second neural network model with the sigmoid function as its activation function may be applied to the merged vector. As an example, the converted vector may be merged with the previous state vector Ht-1 of the recurrent neural network model to obtain the vector [f1(fc), Ht-1], and a fully connected layer with the sigmoid function as the activation function (an example of the second neural network model) may apply the nonlinear transform f2([f1(fc), Ht-1]) to the obtained vector, where f2([f1(fc), Ht-1]) = σ(W2*[f1(fc), Ht-1] + b2), and W2 and b2 are respectively a parameter matrix and a bias parameter vector. In step S206, a third neural network model with the tanh function as its activation function may further be applied to the vector obtained from the second neural network model. As an example, the vector obtained after the nonlinear transform f2() may be passed through the third neural network model with the tanh function as the activation function to compute tanh(f2([f1(fc), Ht-1])). In step S208, the vector obtained from the third neural network model may be combined in an operation with a parameter used for comparison. As an example, the vector obtained from the third neural network model may be combined by dot product with a vector V (an example of the parameter used for comparison), and the result normalized by the sigmoid function, yielding σ(tanh(f2([f1(fc), Ht-1])) ⊙ V). In step S210, the center and size of the attention window may be computed from the result of the operation and the predetermined width and height. As an example, the learned window position and size may be normalized according to the result of the operation and the predetermined width and height as (cs', cr', s', r') = (s, r, s, r) ⊙ σ(tanh(f2([f1(fc), Ht-1])) ⊙ V), where cs' and cr' respectively denote the center of the attention window on the feature maps fc in the width and height directions, and s' and r' respectively denote the width and height of the attention window.
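The five steps S202–S210 can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the patent does not fix the dimensions of W1, W2 or V, so here V is taken as a matrix mapping the tanh output to the four window quantities, and all sizes and initial values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_window(fc, H_prev, W1, b1, W2, b2, V):
    """Compute centre (cs', cr') and size (s', r') of the attention
    window on the feature-map group fc, following steps S202-S210."""
    r, s, p = fc.shape
    v = fc.reshape(-1)                        # merge the maps into one vector
    f1 = sigmoid(W1 @ v + b1)                 # S202: f1(fc) = sigma(W1*fc + b1)
    merged = np.concatenate([f1, H_prev])     # S204: [f1(fc), H_{t-1}]
    f2 = sigmoid(W2 @ merged + b2)            #        sigma(W2*[...] + b2)
    t = np.tanh(f2)                           # S206: tanh activation
    scores = sigmoid(t @ V)                   # S208: dot with V, then sigmoid
    return np.array([s, r, s, r]) * scores    # S210: scale to window units

rng = np.random.default_rng(0)
r, s, p, hd = 7, 7, 4, 8
W1 = rng.standard_normal((hd, r * s * p)) * 0.1
b1 = np.zeros(hd)
W2 = rng.standard_normal((hd, 2 * hd)) * 0.1
b2 = np.zeros(hd)
V = rng.standard_normal((hd, 4)) * 0.1
fc = rng.random((r, s, p))
window = attention_window(fc, np.zeros(hd), W1, b1, W2, b2, V)
```

Because the final sigmoid keeps each component in (0, 1), the scaling by (s, r, s, r) guarantees the window centre and size always lie within the feature-map extent.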
Preferably, training the text description model may also include: obtaining an attention feature vector of the group of feature maps based on the center and size of the attention window. As an example, the attention feature vector of the group of feature maps may be obtained based on the center and size of the attention window obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to the portion of the group of feature maps corresponding to the attention window to convert that portion into a vector, and taking this vector as the attention feature vector.
As an example, suppose att is a matrix of the same size as the feature maps fc in which the values at positions corresponding to the attention window are 1 and the values at positions outside the attention window are 0. Then fc ⊙ att extracts only the content of the feature maps fc inside the attention window, i.e., fc ⊙ att denotes the portion of the group of feature maps corresponding to the attention window. Furthermore, a fully connected layer of a neural network (an example of the fourth neural network model) may apply the transform Xt = f(fc ⊙ att) to fc ⊙ att, and the transformed vector Xt is taken as the attention feature vector, where f() is a transform function. The attention feature vector Xt may serve as the input of the recurrent neural network model at time t.
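The masking step above can be sketched as follows. The discretization of the continuous window into integer feature-map cells, and the replacement of the learned transform f() by a plain flatten, are both assumptions made only to keep the sketch self-contained.

```python
import numpy as np

def attention_feature(fc, cs, cr, sw, rh):
    """Build the binary mask att (1 inside the window, 0 outside),
    keep fc ⊙ att, and flatten it as a stand-in for X_t = f(fc ⊙ att)."""
    r, s, p = fc.shape
    att = np.zeros((r, s, 1))
    # Discretise the window (centre cs, cr; size sw, rh) to cell indices.
    r0 = max(int(cr - rh / 2.0), 0)
    r1 = min(int(cr + rh / 2.0), r)
    s0 = max(int(cs - sw / 2.0), 0)
    s1 = min(int(cs + sw / 2.0), s)
    att[r0:r1, s0:s1] = 1.0
    masked = fc * att                  # fc ⊙ att: content inside the window only
    return masked.reshape(-1), att    # flatten stands in for the learned f()

rng = np.random.default_rng(0)
fc = rng.random((7, 7, 4))
x_t, att = attention_feature(fc, cs=3.5, cr=3.5, sw=4.0, rh=2.0)
```

Outside the window every feature value is zeroed, so the recurrent model's input at time t depends only on the attended region, as the text requires.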
Preferably, training the text description model further includes: computing the current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining the text description corresponding to the attention window based on the current state vector.
As an example, the current state vector Ht of the recurrent neural network model at the current time t may be computed from the attention feature vector Xt and the previous state vector Ht-1 at time t-1 as Ht = tanh(Wh*Ht-1 + Wi*Xt + B), where Wh and Wi are parameter matrices and B is a bias parameter vector. The text description corresponding to the attention window may then be obtained based on the current state vector.
Preferably, obtaining the text description corresponding to the attention window may include: applying a fifth neural network model to the current state vector to compute the occurrence probability of each word in a predetermined dictionary, and determining the word with the largest occurrence probability as the text description corresponding to the attention window.
As an example, a neural network model with the softmax function as its activation function (an example of the fifth neural network model) may be applied to the current state vector Ht of the recurrent neural network model to compute the occurrence probability of each word Yt in the predetermined dictionary: P(Yt) = softmax(σ(Wp*Ht + bp)), where Wp and bp are respectively a parameter matrix and a bias parameter vector. The word with the largest occurrence probability is determined as the text description corresponding to the attention window.
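The state update Ht = tanh(Wh*Ht-1 + Wi*Xt + B) and the word-probability head P(Yt) = softmax(σ(Wp*Ht + bp)) can be sketched together. All dimensions, the 10-word dictionary, and the random initializations are hypothetical; only the two formulas are taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_step(X_t, H_prev, Wh, Wi, B):
    """H_t = tanh(Wh*H_{t-1} + Wi*X_t + B)"""
    return np.tanh(Wh @ H_prev + Wi @ X_t + B)

def word_probs(H_t, Wp, bp):
    """P(Y_t) = softmax(sigma(Wp*H_t + bp)) over the predetermined dictionary."""
    return softmax(sigmoid(Wp @ H_t + bp))

rng = np.random.default_rng(0)
hd, xd, vocab = 8, 16, 10
Wh = rng.standard_normal((hd, hd)) * 0.1
Wi = rng.standard_normal((hd, xd)) * 0.1
B = np.zeros(hd)
Wp = rng.standard_normal((vocab, hd)) * 0.1
bp = np.zeros(vocab)

H = np.zeros(hd)                    # H_0 = zeros(hd)
X_t = rng.random(xd)                # attention feature vector X_t
H = rnn_step(X_t, H, Wh, Wi, B)
P = word_probs(H, Wp, bp)
word = int(np.argmax(P))            # word with the largest probability
```

At each time step the argmax over P selects one dictionary word; the described method emits that word and repeats until a full stop is produced.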
Preferably, the recurrent neural network model may also be a long short-term memory (LSTM) model.
As an example, in the case where the recurrent neural network model is an LSTM model, both the state vector H0 and the cell state vector c0 of the LSTM model need to be initialized, i.e., H0 = zeros(hd) and c0 = zeros(hd), where hd is the dimension of the state.
In the case where the recurrent neural network model is an LSTM model, computing the position and size of the attention window and computing the input Xt of the LSTM model at time t are the same as described above for the general recurrent neural network model.
The method of computing the current state vector Ht of the LSTM model at time t is described in detail below. The current state vector Ht of the LSTM model at time t depends on the previous state vector Ht-1 of the previous time, the cell state vector Ct-1 of the previous time, and the input Xt at the current time. First, three gate state vectors are computed based on the previous state vector Ht-1 and the current input vector Xt: the input gate state vector it = σ(Wi*[Ht-1, Xt] + bi), the output gate state vector ot = σ(Wo*[Ht-1, Xt] + bo), and the forget gate state vector ft = σ(Wf*[Ht-1, Xt] + bf), where Wi, Wo and Wf are respectively parameter matrices and bi, bo and bf are respectively bias parameter vectors. The current cell state vector Ct and state vector Ht are then computed as: Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wc*[Ht-1, Xt] + bc) and Ht = ot ⊙ tanh(Ct), where Wc and bc are respectively a parameter matrix and a bias parameter vector. Once the current state vector Ht of the LSTM model at time t has been computed, the method of obtaining the text description corresponding to the attention window based on the current state vector Ht is the same as described above for the general recurrent neural network model.
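The LSTM equations above can be sketched directly; the gate formulas follow the text, while the dimensions and random parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM step: gates i, o, f from [H_{t-1}, X_t];
    C_t = f ⊙ C_{t-1} + i ⊙ tanh(Wc*[H_{t-1}, X_t] + bc);
    H_t = o ⊙ tanh(C_t)."""
    Wi, bi, Wo, bo, Wf, bf, Wc, bc = params
    z = np.concatenate([H_prev, X_t])      # [H_{t-1}, X_t]
    i = sigmoid(Wi @ z + bi)               # input gate
    o = sigmoid(Wo @ z + bo)               # output gate
    f = sigmoid(Wf @ z + bf)               # forget gate
    C_t = f * C_prev + i * np.tanh(Wc @ z + bc)
    H_t = o * np.tanh(C_t)
    return H_t, C_t

rng = np.random.default_rng(0)
hd, xd = 8, 16

def wb():
    return rng.standard_normal((hd, hd + xd)) * 0.1, np.zeros(hd)

Wi_, bi = wb(); Wo_, bo = wb(); Wf_, bf = wb(); Wc_, bc = wb()
params = (Wi_, bi, Wo_, bo, Wf_, bf, Wc_, bc)

H, C = np.zeros(hd), np.zeros(hd)          # H_0 = c_0 = zeros(hd)
for t in range(3):
    H, C = lstm_step(rng.random(xd), H, C, params)
```

Note that only the cell state C carries unbounded memory across steps; H is always squashed through o ⊙ tanh(C), which matches the equations in the text.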
The above description, taking the current time t as an example, explains how the center and size of the attention window on the group of feature maps are computed based on the group of feature maps fc and the previous state vector Ht-1 of the recurrent neural network model, and how the current state vector Ht of the recurrent neural network model at the current time t is computed so as to obtain the text description corresponding to the attention window. Similarly, the state vectors of the recurrent neural network model at times t+1, t+2, ... may also be computed, to obtain the text descriptions corresponding to the attention windows at times t+1, t+2, ... respectively.
Preferably, training the text description model may also include: for a sample image of the multiple sample images, when it is determined that the text description corresponding to the attention window is a full stop, ending the training performed based on that sample image.
As an example, for a sample image, when it is determined that the text description corresponding to the attention window is a full stop, it is determined that the training performed based on that sample image is ended.
Preferably, the parameters of the text description model may include the parameters of the convolutional neural network model, the parameters of the first neural network model, the parameters of the second neural network model, the parameters of the third neural network model, the parameters of the fourth neural network model, the parameters of the fifth neural network model, the parameters of the recurrent neural network model, and the parameter used for comparison. As an example, training the text description model may include training the above parameters of the text description model.
The training of the text description model has been described above with a single sample image as an example. How the text description model is obtained by training on multiple sample images is detailed below. For convenience of description, assume that a convolutional neural network model (CNN) is used to extract a group of feature maps from each sample image, and that the center and size of the attention window on the group of feature maps are computed based on the group of feature maps and the previous state vector of the recurrent neural network model (RNN). Given n training data {(Xi, Yi)}, i = 1, ..., n, where Xi denotes a sample image and Yi denotes its corresponding text description, the process of training the text description model is as follows.
Step 1: Initialize the parameters of the text description model, where the CNN uses the VGG-16 model initialized with parameters pre-trained with VGG-16 on the ImageNet dataset, and initialize the parameters of the RNN and the other parameters. Set the batch size of the dataset to batch_size = 64.
Step 2: Sample batch_size data from the training dataset without replacement.
Step 3: Based on the current model parameters, compute the probability P of generating the corresponding text description for each sample image, and update the current model parameters by gradient descent with P as the objective function.
Step 4: Repeat steps 2 to 3 above until the text description model converges.
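The training loop of steps 1–4 can be sketched with a toy model standing in for the full text description model. The logistic scoring model, its manually derived gradient, the data sizes, and the learning rate are all hypothetical; only the loop structure (initialize, sample batches without replacement, gradient step on the objective, repeat until convergence) mirrors the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the n labelled pairs (X_i, Y_i): a linear scoring
# model whose log-likelihood plays the role of the objective P in step 3.
n, d, batch_size = 256, 5, 64
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = (X @ w_true > 0).astype(float)

w = np.zeros(d)                          # step 1: initialize parameters
for epoch in range(200):                 # step 4: repeat until convergence
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):        # step 2: sample batch_size
        b = idx[start:start + batch_size]        #         without replacement
        p = 1.0 / (1.0 + np.exp(-X[b] @ w))      # step 3: model probabilities
        grad = X[b].T @ (p - Y[b]) / len(b)      #         gradient of the
        w -= 0.5 * grad                          #         negative log-likelihood

acc = float(((X @ w > 0) == Y.astype(bool)).mean())
```

In the patented method the gradient would of course flow through the CNN, the attention-window networks and the RNN jointly; the toy model only makes the batching and update pattern concrete.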
In conclusion information processing method 100 according to an embodiment of the present disclosure can learn the concern window in image automatically The position of mouth and size, and the content based on focus window generates corresponding verbal description.Since historical information dynamic can be based on Ground discovery generates the image-region that current character needs to pay close attention to, therefore can generate more suitable verbal description.
Corresponding to the above information processing method embodiment, the disclosure also provides the following information processing device embodiment.
Fig. 3 is a block diagram showing a functional configuration example of an information processing device 300 according to an embodiment of the disclosure.
As shown in Fig. 3, the information processing device 300 according to an embodiment of the disclosure may include an extraction unit 302 and a training unit 304. Functional configuration examples of the extraction unit 302 and the training unit 304 are described below.
The extraction unit 302 may extract, from each sample image of multiple sample images, a group of feature maps having a predetermined width and a predetermined height, where the feature maps in the group respectively correspond to different image features.
For examples of the feature maps fc, refer to the description at the corresponding position in the above method embodiment, which is not repeated here.
Preferably, extracting the group of feature maps from each sample image may include extracting the group of feature maps from each sample image using a convolutional neural network model.
As an example, a convolutional neural network may be used to extract, from each sample image, a group of (p) feature maps fc having the predetermined width s and predetermined height r.
The training unit 304 may train a text description model based on the extracted group of feature maps and the text descriptions labeled for the multiple sample images; the text description model may be used to generate a corresponding text description from an input image. Training the text description model may include computing the center and size of the attention window on the group of feature maps based on the group of feature maps and the previous state vector of the recurrent neural network model.
As an example, a text description is labeled for each sample image of the multiple sample images. The text description model may be trained based on the extracted group of feature maps and the labeled text descriptions, and may be used to generate a corresponding text description from an input image. As an example, the center and size of the attention window on the group of feature maps may be computed based on the group of feature maps and the previous state vector of the recurrent neural network model.
Preferably, computing the center and size of the attention window on the group of feature maps includes: applying to the group of feature maps a first neural network model with the sigmoid function as its activation function to convert the group of feature maps into a vector; merging the converted vector with the previous state vector of the recurrent neural network model and applying to the merged vector a second neural network model with the sigmoid function as its activation function; further applying to the vector obtained from the second neural network model a third neural network model with the tanh function as its activation function; combining the vector obtained from the third neural network model in an operation with the parameter used for comparison; and computing the center and size of the attention window from the result of the operation and the predetermined width and height.
For examples of computing the center and size of the attention window on the group of feature maps, refer to the description at the corresponding position in the above method embodiment, which is not repeated here.
Preferably, training the text description model may further include: obtaining an attention feature vector of the set of feature maps based on the center and size of the focus window. As an example, the attention feature vector of the set of feature maps may be obtained based on the focus window center and size obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to the portion of the set of feature maps corresponding to the focus window, so as to convert that portion into a vector, and taking this vector as the attention feature vector.
For examples of obtaining the attention feature vector, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, training the text description model further includes: calculating the current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
For examples of calculating the current state vector of the recurrent neural network model, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, obtaining the text description corresponding to the focus window may include: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
For examples of determining the text description corresponding to the focus window, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, the recurrent neural network model may be a long short-term memory (LSTM) model.
For examples of the LSTM model, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, training the text description model may further include: for a sample image among the multiple sample images, terminating the training performed on that sample image when the text description determined for the focus window is a full stop.
As an example, for a given sample image, once the text description corresponding to the focus window is determined to be a full stop, the training performed on that sample image is terminated.
Preferably, the parameters of the text description model may include the parameters of the convolutional neural network model, the parameters of the first, second, third, fourth, and fifth neural network models, the parameters of the recurrent neural network model, and the parameter used for comparison. As an example, training the text description model may include training the above parameters of the text description model.
For examples of training the text description model on multiple sample images, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
In conclusion information processing unit 300 according to an embodiment of the present disclosure can learn the concern window in image automatically The position of mouth and size, and the content based on focus window generates corresponding verbal description.Since historical information dynamic can be based on Ground discovery generates the image-region that current character needs to pay close attention to, therefore can generate more suitable verbal description.
It is noted that although the functional configuration of the information processing apparatus according to an embodiment of the present disclosure is described above, this is merely exemplary rather than limiting, and those skilled in the art may modify the above embodiments according to the principles of the present disclosure, for example by adding, deleting, or combining functional modules in the embodiments; all such modifications fall within the scope of the present disclosure.
It is further noted that the apparatus embodiments here correspond to the method embodiments above; content not described in detail in the apparatus embodiments can be found in the description at the corresponding positions of the method embodiments and is not repeated here.
It should be understood that the machine-executable instructions in the storage medium and program product according to embodiments of the present disclosure may also be configured to execute the above information processing method; content not described in detail here can be found in the description at the previous corresponding positions and is not repeated.
Accordingly, a storage medium for carrying the program product including the above machine-executable instructions is also included in the present disclosure. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.
According to another aspect of the present disclosure, an information detecting method is provided, which considers not only the position of the focus window in an image but also the size of the focus window.
Next, a flow example of an information detecting method 400 according to an embodiment of the present disclosure is described with reference to Fig. 4. Fig. 4 is a flowchart showing the flow example of the information detecting method 400 according to an embodiment of the present disclosure. As shown in Fig. 4, the information detecting method 400 according to the embodiment of the present disclosure includes an extraction step S402 and a generation step S404.
In the extraction step S402, a set of feature maps with a predetermined width and a predetermined height can be extracted from an input image, wherein the feature maps in the set correspond to different image features, respectively.
The prior art can be used to extract, from the input image, a set of p feature maps fc with predetermined width s and predetermined height r: fc = CN(image), where image denotes the m*n*c tensor of the image, m, n, and c respectively denote the length, width, and number of channels of the input image, and CN(·) denotes a transforming function. The extracted feature maps fc form an r*s*p tensor, where p denotes the number of features; that is, each feature map in the set of p feature maps fc corresponds to one of the p image features.
Preferably, extracting the set of feature maps from the input image may include extracting the set of feature maps from the input image using a convolutional neural network model.
As an example, a convolutional neural network can be used to extract from the input image the set of p feature maps fc with predetermined width s and predetermined height r, where CN(·) then denotes the transforming function realized by the convolutional neural network.
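The shape relationships above (an m*n*c input image mapped to an r*s*p feature-map tensor fc) can be illustrated with a minimal numpy sketch. This is only an illustration of the tensor shapes: a single random-kernel convolution plus average pooling stands in for the patent's convolutional neural network CN(·), and all sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_feature_maps(image, p=4, r=8, s=8, k=3):
    """Return an (r, s, p) feature-map tensor fc for an (m, n, c) image."""
    m, n, c = image.shape
    kernels = rng.standard_normal((p, k, k, c)) * 0.1   # random stand-in for CN(.)
    conv = np.zeros((m - k + 1, n - k + 1, p))
    for i in range(conv.shape[0]):                      # valid convolution, no padding
        for j in range(conv.shape[1]):
            patch = image[i:i + k, j:j + k, :]
            conv[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    # Average-pool each map down to the predetermined height r and width s.
    fc = np.zeros((r, s, p))
    bh, bw = conv.shape[0] // r, conv.shape[1] // s
    for i in range(r):
        for j in range(s):
            fc[i, j] = conv[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].mean(axis=(0, 1))
    return fc

image = rng.standard_normal((34, 34, 3))   # m=34, n=34, c=3
fc = extract_feature_maps(image)
print(fc.shape)                            # (8, 8, 4): an r*s*p tensor
```

In practice the transforming function would be a trained deep CNN; the sketch only demonstrates that each of the p maps covers the whole image at the predetermined r*s resolution.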
In the generation step S404, a text description corresponding to the input image can be generated based on the extracted set of feature maps using a trained text description model. Generating the text description corresponding to the input image using the trained text description model may include: calculating the center and size of the focus window on the set of feature maps based on the set of feature maps and the previous state vector of the recurrent neural network model.
As an example, the trained text description model can be used to generate a corresponding text description from the input image, based on the extracted set of feature maps. As an example, the center and size of the focus window on the set of feature maps may be calculated based on the set of feature maps and the previous state vector of the recurrent neural network model.
As an example, let Ht denote the current state vector of the recurrent neural network model at time t, and Ht-1 denote the previous state vector of the recurrent neural network model at time t-1. The initial state vector H0 of the recurrent neural network model is initialized to zero, i.e., H0 = zeros(hd), where zeros(·) denotes the all-zeros function and hd is the dimension of the state vector of the recurrent neural network model.
Preferably, calculating the center and size of the focus window on the set of feature maps includes the following. The set of feature maps can be converted into a vector by applying a first neural network model with the sigmoid function as activation function. As an example, the extracted feature maps can be merged into one vector, and a fully connected layer of a neural network with the sigmoid function as activation function (an example of the first neural network model) applies a nonlinear transformation to the merged vector to obtain f1(fc), where the nonlinear transformation is f1(fc) = σ(W1*fc + b1), σ(·) denotes the sigmoid function, and W1 and b1 are a parameter matrix and a bias parameter vector, respectively.
The converted vector can be merged with the previous state vector of the recurrent neural network model, and a second neural network model with the sigmoid function as activation function applied to the merged vector. As an example, the converted vector and the previous state vector Ht-1 of the recurrent neural network model can be merged into the vector [f1(fc), Ht-1], which is then passed through a fully connected layer of a neural network with the sigmoid function as activation function (an example of the second neural network model) to perform the nonlinear transformation f2([f1(fc), Ht-1]) = σ(W2*[f1(fc), Ht-1] + b2), where W2 and b2 are a parameter matrix and a bias parameter vector, respectively.
A third neural network model with the tanh function as activation function can further be applied to the vector obtained from the second neural network model, i.e., tanh(f2([f1(fc), Ht-1])).
The vector obtained from the third neural network model can then be combined in an operation with a parameter used for comparison. As an example, the vector obtained from the third neural network model can be element-wise multiplied with a vector V (an example of the parameter used for comparison) and the result normalized by the σ function, giving σ(tanh(f2([f1(fc), Ht-1])) ⊙ V).
Finally, the center and size of the focus window can be calculated according to the result of the operation together with the predetermined width and predetermined height. As an example, the window position and size can be scaled according to the result of the operation as (cs', cr', s', r') = (s, r, s, r) ⊙ σ(tanh(f2([f1(fc), Ht-1])) ⊙ V), where cs' and cr' respectively denote the center of the focus window in the width and height directions on the feature maps fc, and s' and r' respectively denote the width and height of the focus window.
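The focus-window computation above can be sketched in numpy as follows. All weights (W1, b1, W2, b2) and the comparison vector V are random stand-ins for trained parameters, and the hidden layer widths are assumptions chosen only to make the shapes work; the output layer has four units so that scaling by (s, r, s, r) yields the window center and size directly.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

r, s, p, hd, d1 = 8, 8, 4, 16, 32              # map size, state dim, hidden dim
W1 = rng.standard_normal((d1, r * s * p)) * 0.01
b1 = np.zeros(d1)
W2 = rng.standard_normal((4, d1 + hd)) * 0.01  # 4 outputs: cs', cr', s', r'
b2 = np.zeros(4)
V = rng.standard_normal(4)                     # parameter used for comparison

def focus_window(fc, H_prev):
    f1 = sigmoid(W1 @ fc.reshape(-1) + b1)               # f1(fc) = σ(W1·fc + b1)
    f2 = sigmoid(W2 @ np.concatenate([f1, H_prev]) + b2) # f2([f1(fc), H_{t-1}])
    g = sigmoid(np.tanh(f2) * V)                         # σ(tanh(f2) ⊙ V)
    cs, cr, sw, rh = np.array([s, r, s, r]) * g          # (s, r, s, r) ⊙ g
    return cs, cr, sw, rh                                # center and size

fc = rng.standard_normal((r, s, p))
H0 = np.zeros(hd)                                        # initial RNN state
cs, cr, sw, rh = focus_window(fc, H0)
print(0 < sw <= s and 0 < rh <= r)                       # window fits on the maps
```

Because the final sigmoid keeps every output in (0, 1), the scaled center and size are always confined to the predetermined width s and height r of the feature maps.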
Preferably, generating the text description corresponding to the input image using the trained text description model may further include: obtaining an attention feature vector of the set of feature maps based on the center and size of the focus window. As an example, the attention feature vector of the set of feature maps may be obtained based on the focus window center and size obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to the portion of the set of feature maps corresponding to the focus window, so as to convert that portion into a vector, and taking this vector as the attention feature vector.
As an example, suppose att is a mask tensor with the same size as the feature maps fc, in which the values at positions corresponding to the focus window are 1 and the values at positions outside the focus window are 0. Then fc ⊙ att extracts only the content of the feature maps fc inside the focus window; that is, fc ⊙ att represents the portion of the set of feature maps corresponding to the focus window. Furthermore, a fully connected layer of a neural network (an example of the fourth neural network model) can transform fc ⊙ att into Xt = f(fc ⊙ att), and the transformed vector Xt is taken as the attention feature vector, where f(·) is a transforming function. The attention feature vector Xt can serve as the input to the recurrent neural network model at time t.
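The masked read-out Xt = f(fc ⊙ att) can be sketched as follows; the window coordinates, the tanh read-out layer, and the weights Wx are illustrative assumptions standing in for the trained fourth neural network model.

```python
import numpy as np

rng = np.random.default_rng(2)

r, s, p, xd = 8, 8, 4, 16
fc = rng.standard_normal((r, s, p))
Wx = rng.standard_normal((xd, r * s * p)) * 0.01  # fourth-network weights (made up)

# Focus window assumed at center (4, 4) with height 4 and width 4:
# att is 1 inside the window and 0 everywhere else, same size as fc.
att = np.zeros((r, s, p))
att[2:6, 2:6, :] = 1.0

masked = fc * att                                 # fc ⊙ att: window content only
X_t = np.tanh(Wx @ masked.reshape(-1))            # X_t = f(fc ⊙ att)
print(X_t.shape)                                  # (16,): RNN input at time t
```

Everything outside the window is zeroed before the fully connected layer, so the recurrent network at time t sees only the attended region of the feature maps.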
Preferably, generating the text description corresponding to the input image using the trained text description model further includes: calculating the current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
As an example, the current state vector of the recurrent neural network model at the current time t can be calculated from the attention feature vector Xt and the previous state vector Ht-1 at time t-1 as Ht = tanh(Wh*Ht-1 + Wi*Xt + B), where Wh and Wi are parameter matrices and B is a bias parameter vector. The text description corresponding to the focus window can then be obtained based on the current state vector.
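The state update Ht = tanh(Wh*Ht-1 + Wi*Xt + B) is a plain recurrent step; a minimal sketch with random stand-in weights and an assumed state dimension:

```python
import numpy as np

rng = np.random.default_rng(3)

hd, xd = 16, 16                        # state and input dimensions (assumed)
Wh = rng.standard_normal((hd, hd)) * 0.1
Wi = rng.standard_normal((hd, xd)) * 0.1
B = np.zeros(hd)

def rnn_step(H_prev, X_t):
    """H_t = tanh(Wh·H_{t-1} + Wi·X_t + B)."""
    return np.tanh(Wh @ H_prev + Wi @ X_t + B)

H_prev = np.zeros(hd)                  # H_0 = zeros(hd)
X_t = rng.standard_normal(xd)          # attention feature vector at time t
H_t = rnn_step(H_prev, X_t)
print(H_t.shape)                       # (16,)
```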
Preferably, obtaining the text description corresponding to the focus window may include: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
As an example, a neural network model with the softmax function as activation function (an example of the fifth neural network model) can be applied to the current state vector of the recurrent neural network model to calculate the occurrence probability of each word Yt in the predetermined dictionary, P(Yt) = softmax(σ(Wp*Ht + bp)), where Wp and bp are a parameter matrix and a bias parameter vector, respectively. The word with the highest occurrence probability is determined as the text description corresponding to the focus window.
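The word read-out P(Yt) = softmax(σ(Wp*Ht + bp)) followed by an argmax can be sketched as below; the five-word dictionary and the weights Wp, bp are made-up stand-ins for the trained fifth neural network model.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

dictionary = ["girl", "horse", "and", "standing", "."]
hd = 16
Wp = rng.standard_normal((len(dictionary), hd)) * 0.5
bp = np.zeros(len(dictionary))

H_t = rng.standard_normal(hd)                # current RNN state vector
probs = softmax(sigmoid(Wp @ H_t + bp))      # P(Y_t) over the dictionary
word = dictionary[int(np.argmax(probs))]     # highest-probability word wins
print(abs(probs.sum() - 1.0) < 1e-9, word in dictionary)
```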
Preferably, the recurrent neural network model may be a long short-term memory (LSTM) model.
As an example, in the case where the recurrent neural network model is an LSTM model, initializing the LSTM model requires initializing both the state vector H0 and the cell state vector C0 of the LSTM model, i.e., H0 = zeros(hd) and C0 = zeros(hd), where hd is the dimension of the state.
In the case where the recurrent neural network model is an LSTM model, calculating the position and size of the focus window and calculating the input Xt of the LSTM model at time t are the same as described above for the general recurrent neural network model.
The method of calculating the current state vector Ht of the LSTM model at time t is described in detail below. The current state vector Ht of the LSTM model at time t depends on the previous state vector Ht-1 of the previous time, the cell state vector Ct-1 of the previous time, and the input Xt of the current time. First, three gate state vectors are calculated based on the previous state vector Ht-1 and the current input vector Xt: the input gate state vector it = σ(Wi*[Ht-1, Xt] + bi), the output gate state vector ot = σ(Wo*[Ht-1, Xt] + bo), and the forget gate state vector ft = σ(Wf*[Ht-1, Xt] + bf), where Wi, Wo, and Wf are parameter matrices and bi, bo, and bf are bias parameter vectors, respectively. Then the current cell state vector Ct and state vector Ht are calculated as Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wc*[Ht-1, Xt] + bc) and Ht = ot ⊙ tanh(Ct), where Wc and bc are a parameter matrix and a bias parameter vector, respectively. With the current state vector Ht of the LSTM model at time t calculated, the method of obtaining the text description corresponding to the focus window based on the current state vector Ht is the same as described above for the general recurrent neural network model.
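The LSTM step above translates directly into code. The sketch below implements exactly the gate equations given in the text; the weight values and dimensions are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

hd, xd = 16, 16
d = hd + xd                                    # size of [H_{t-1}, X_t]
Wi, Wo, Wf, Wc = (rng.standard_normal((hd, d)) * 0.1 for _ in range(4))
bi = bo = bf = bc = np.zeros(hd)

def lstm_step(H_prev, C_prev, X_t):
    z = np.concatenate([H_prev, X_t])          # [H_{t-1}, X_t]
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    C_t = f_t * C_prev + i_t * np.tanh(Wc @ z + bc)   # new cell state
    H_t = o_t * np.tanh(C_t)                   # new state vector
    return H_t, C_t

H0, C0 = np.zeros(hd), np.zeros(hd)            # H_0 = C_0 = zeros(hd)
X_t = rng.standard_normal(xd)
H1, C1 = lstm_step(H0, C0, X_t)
print(H1.shape, C1.shape)                      # (16,) (16,)
```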
Taking the current time t as an example, the above describes how the center and size of the focus window on the set of feature maps are calculated based on the feature maps fc and the previous state vector Ht-1 of the recurrent neural network model, and how the current state vector Ht at time t is then calculated to obtain the text description corresponding to the focus window. Similarly, the state vectors of the recurrent neural network model at times t+1, t+2, ... can be calculated to obtain the text descriptions corresponding to the focus windows at times t+1, t+2, ..., respectively.
Preferably, generating the text description corresponding to the input image using the trained text description model may further include: terminating the generation of the text description corresponding to the input image when the text description determined for the focus window is a full stop.
As an example, when the text description corresponding to the focus window is determined to be a full stop, the generation of the text description corresponding to the input image is terminated.
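The per-timestep pieces can be assembled into a generation loop: at each step, compute a focus window from (fc, Ht-1), read out the attention feature Xt, update the RNN state, predict a word, and stop at the full stop "." or after a maximum number of steps. Every weight, dimension, and the dictionary below are illustrative stand-ins (the first fully connected layer f1 is folded away for brevity), not trained model parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

r = s = 8; p = 4; hd = 16
dictionary = ["girl", "horse", "and", "standing", "aside", "."]
W2 = rng.standard_normal((4, r * s * p + hd)) * 0.01   # window predictor
V = rng.standard_normal(4)                             # comparison vector
Wx = rng.standard_normal((hd, r * s * p)) * 0.01       # attention read-out
Wh = rng.standard_normal((hd, hd)) * 0.1               # recurrent weights
Wi = rng.standard_normal((hd, hd)) * 0.1               # input weights
Wp = rng.standard_normal((len(dictionary), hd)) * 0.5  # word predictor

def generate(fc, max_steps=10):
    H, words = np.zeros(hd), []
    for _ in range(max_steps):
        g = sigmoid(np.tanh(sigmoid(W2 @ np.concatenate([fc.reshape(-1), H]))) * V)
        cs, cr, sw, rh = np.array([s, r, s, r]) * g    # window center and size
        att = np.zeros((r, s, p))                      # 0/1 window mask
        att[int(max(cr - rh / 2, 0)):int(cr + rh / 2) + 1,
            int(max(cs - sw / 2, 0)):int(cs + sw / 2) + 1, :] = 1.0
        X = np.tanh(Wx @ (fc * att).reshape(-1))       # attention feature X_t
        H = np.tanh(Wh @ H + Wi @ X)                   # RNN state update
        word = dictionary[int(np.argmax(softmax(sigmoid(Wp @ H))))]
        words.append(word)
        if word == ".":                                # full stop terminates
            break
    return words

caption = generate(rng.standard_normal((r, s, p)))
print(len(caption) <= 10 and all(w in dictionary for w in caption))
```

With trained parameters, each step would attend to a different region and emit the corresponding word, ending at the full stop; with the random weights here the loop merely demonstrates the control flow and tensor shapes.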
Preferably, the parameters of the trained text description model may include the parameters of the convolutional neural network model, the parameters of the first, second, third, fourth, and fifth neural network models, the parameters of the recurrent neural network model, and the parameter used for comparison. These parameters of the trained text description model can be determined by the information processing method according to the embodiment of the present disclosure.
Fig. 5 is a diagram showing an example of an input image and its corresponding text description according to an embodiment of the present disclosure. The leftmost image in Fig. 5 is the input image. The middle images of Fig. 5 schematically show the images in the input image related to the focus windows; for example, they respectively include the image of the "girl", the image of the "horse standing aside", and so on. The rightmost part of Fig. 5 is the text description corresponding to the input image, namely "a girl and a horse standing aside".
In conclusion information detecting method 400 according to an embodiment of the present disclosure considers the position of the focus window in image It sets and size, and the content based on focus window generates corresponding verbal description.Due to can dynamically be found based on historical information The image-region that current character needs to pay close attention to is generated, therefore more suitable verbal description can be generated.
With above- mentioned information detection method embodiment correspondingly, the disclosure additionally provides the implementation of following information detector Example.
Fig. 6 is a block diagram showing a functional configuration example of an information detecting apparatus 600 according to an embodiment of the present disclosure.
As shown in Fig. 6, the information detecting apparatus 600 according to an embodiment of the present disclosure may include an extraction unit 602 and a generation unit 604. Functional configuration examples of the extraction unit 602 and the generation unit 604 are described below.
The extraction unit 602 can extract, from an input image, a set of feature maps with a predetermined width and a predetermined height, wherein the feature maps in the set correspond to different image features, respectively.
For examples of the feature maps fc, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, extracting the set of feature maps from the input image may include extracting the set of feature maps from the input image using a convolutional neural network model.
As an example, a convolutional neural network can be used to extract from the input image the set of p feature maps fc with predetermined width s and predetermined height r.
The generation unit 604 can generate, based on the extracted set of feature maps, a text description corresponding to the input image using a trained text description model. Generating the text description corresponding to the input image using the trained text description model may include: calculating the center and size of the focus window on the set of feature maps based on the set of feature maps and the previous state vector of the recurrent neural network model.
As an example, the trained text description model can be used to generate a corresponding text description from the input image, based on the extracted set of feature maps. As an example, the center and size of the focus window on the set of feature maps may be calculated based on the set of feature maps and the previous state vector of the recurrent neural network model.
Preferably, calculating the center and size of the focus window on the set of feature maps includes: applying to the set of feature maps a first neural network model with the sigmoid function as activation function, so as to convert the set of feature maps into a vector; merging the converted vector with the previous state vector of the recurrent neural network model, and applying to the merged vector a second neural network model with the sigmoid function as activation function; further applying, to the vector obtained from the second neural network model, a third neural network model with the tanh function as activation function; performing an operation between the vector obtained from the third neural network model and a parameter used for comparison; and calculating the center and size of the focus window according to the result of the operation together with the predetermined width and predetermined height.
For examples of calculating the center and size of the focus window on the set of feature maps, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, generating the text description corresponding to the input image using the trained text description model may further include: obtaining an attention feature vector of the set of feature maps based on the center and size of the focus window. As an example, the attention feature vector may be obtained based on the focus window center and size obtained as described above.
Preferably, obtaining the attention feature vector may include: applying a fourth neural network model to the portion of the set of feature maps corresponding to the focus window, so as to convert that portion into a vector, and taking this vector as the attention feature vector.
For examples of obtaining the attention feature vector, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, generating the text description corresponding to the input image using the trained text description model further includes: calculating the current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
For examples of calculating the current state vector of the recurrent neural network model, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, obtaining the text description corresponding to the focus window may include: applying a fifth neural network model to the current state vector to calculate the occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
For examples of determining the text description corresponding to the focus window, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, the recurrent neural network model may be a long short-term memory (LSTM) model.
For examples of the LSTM model, refer to the description at the corresponding position in the method embodiment above; the details are not repeated here.
Preferably, generating the text description corresponding to the input image using the trained text description model may further include: terminating the generation of the text description corresponding to the input image when the text description determined for the focus window is a full stop.
As an example, when the text description corresponding to the focus window is determined to be a full stop, the generation of the text description corresponding to the input image is terminated.
Preferably, the parameters of the trained text description model may include the parameters of the convolutional neural network model, the parameters of the first, second, third, fourth, and fifth neural network models, the parameters of the recurrent neural network model, and the parameter used for comparison. These parameters of the trained text description model can be determined by the information processing method according to the embodiment of the present disclosure.
In conclusion information detector 600 according to an embodiment of the present disclosure considers the position of the focus window in image It sets and size, and the content based on focus window generates corresponding verbal description.Due to can dynamically be found based on historical information The image-region that current character needs to pay close attention to is generated, therefore more suitable verbal description can be generated.
It is noted that although the foregoing describe the functional configuration of information detector according to an embodiment of the present disclosure, This is only exemplary rather than limitation, and those skilled in the art can modify to above embodiments according to the principle of the disclosure, Such as the functional module in each embodiment can be added, deleted or be combined, and such modification each falls within this In scope of disclosure.
It is furthermore to be noted that Installation practice here is corresponding to the above method embodiment, therefore in device reality Applying the content being not described in detail in example can be found in the description of corresponding position in embodiment of the method, be not repeated to describe herein.
It should be understood that the instruction that the machine in storage medium and program product according to an embodiment of the present disclosure can be performed may be used also To be configured to execute above- mentioned information detection method, the content that therefore not described in detail here can refer to retouching for previous corresponding position It states, is not repeated to be described herein.
Correspondingly, this is also included within for carrying the storage medium of the program product of the above-mentioned instruction that can be performed including machine In the disclosure of invention.The storage medium includes but is not limited to floppy disk, CD, magneto-optic disk, storage card, memory stick etc..
In addition, it should also be noted that the above series of processes and apparatuses can also be implemented by software and/or firmware. In the case of implementation by software and/or firmware, a program constituting the software is installed, from a storage medium or a network, into a computer having a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 7, which, when installed with various programs, can perform various functions and the like.
In Fig. 7, a central processing unit (CPU) 701 executes various processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, data required when the CPU 701 executes the various processing.
The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706, including a keyboard, a mouse, and the like; an output section 707, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, and the like; a storage section 708, including a hard disk and the like; and a communication section 709, including a network interface card such as a LAN card, a modem, and the like. The communication section 709 executes communication processing via a network such as the Internet.
A drive 710 is also connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
In the case where the above series of processes is implemented by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 711 shown in Fig. 7, which stores the program and is distributed separately from the device so as to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a mini-disc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage section 708, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
Preferred embodiment of the present disclosure is described above by reference to attached drawing, but the disclosure is certainly not limited to above example.This Field technical staff can obtain various changes and modifications within the scope of the appended claims, and should be understood that these changes and repair Changing nature will fall into scope of the presently disclosed technology.
For example, can be realized in the embodiment above by the device separated including multiple functions in a unit. As an alternative, the multiple functions of being realized in the embodiment above by multiple units can be realized by the device separated respectively.In addition, with One of upper function can be realized by multiple units.Needless to say, such configuration includes in scope of the presently disclosed technology.
In this specification, described in flow chart the step of not only includes the place executed in temporal sequence with the sequence Reason, and including concurrently or individually rather than the processing that must execute in temporal sequence.In addition, even in temporal sequence In the step of processing, needless to say, the sequence can also be suitably changed.
In addition, can also be configured as follows according to the technology of the disclosure.
Note 1. An information processing method, comprising:
extracting, from each of a plurality of sample images, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
training a text description model based on the extracted group of feature maps and text descriptions labelled for the plurality of sample images, the text description model being used to generate a corresponding text description from an input image, wherein training the text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
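Purely as an illustration of the supervision in note 1 — comparing the model's per-step word outputs with the labelled text description — the training signal can be sketched as an averaged cross-entropy over the labelled words. The patent does not name its training objective, so the choice of cross-entropy, the dictionary size and all names below are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption_loss(logits_per_step, label_ids):
    """Average negative log-probability of each labelled word under the
    model's per-step word scores (an assumed cross-entropy objective)."""
    loss = 0.0
    for logits, y in zip(logits_per_step, label_ids):
        loss += -np.log(softmax(logits)[y])
    return loss / len(label_ids)

rng = np.random.default_rng(5)
logits = [rng.standard_normal(4) for _ in range(3)]  # 3 steps, 4-word dictionary
label_ids = [1, 2, 3]                                # e.g. "cat sits ."
loss = caption_loss(logits, label_ids)
print(loss > 0.0)  # True: every labelled word has probability below 1
```

In training, such a loss would be minimised jointly over the model parameters enumerated in note 9.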
Note 2. The information processing method according to note 1, wherein extracting the group of feature maps from each sample image comprises: extracting the group of feature maps from each sample image using a convolutional neural network model.
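A minimal sketch of the extraction stage of note 2, with plain convolution plus max-pooling standing in for a trained convolutional neural network model; the kernel bank, image size and pooling factor are illustrative assumptions:

```python
import numpy as np

def extract_feature_maps(image, kernels, pool=2):
    """Convolve an image with a bank of kernels (one per image feature) and
    max-pool, so each kernel yields one feature map with a fixed,
    predetermined width and height."""
    k = kernels.shape[-1]
    H, W = image.shape
    out_h, out_w = H - k + 1, W - k + 1
    maps = np.empty((len(kernels), out_h, out_w))
    for i, kern in enumerate(kernels):
        for r in range(out_h):
            for c in range(out_w):
                maps[i, r, c] = np.sum(image[r:r + k, c:c + k] * kern)
    # max-pool down to the predetermined width and height
    ph, pw = out_h // pool, out_w // pool
    return maps[:, :ph * pool, :pw * pool].reshape(
        len(kernels), ph, pool, pw, pool).max(axis=(2, 4))

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))      # toy single-channel sample image
kernels = rng.standard_normal((8, 3, 3))   # 8 kernels -> 8 distinct features
fmaps = extract_feature_maps(image, kernels)
print(fmaps.shape)  # (8, 7, 7): one 7x7 map per image feature
```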
Note 3. The information processing method according to note 1, wherein calculating the center and the size of the focus window on the group of feature maps comprises:
applying, to the group of feature maps, a first neural network model with a sigmoid function as its activation function so as to convert the group of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model, and applying, to the merged vector, a second neural network model with a sigmoid function as its activation function;
further applying, to the vector obtained through the second neural network model, a third neural network model with a tanh function as its activation function;
performing an operation on the vector obtained through the third neural network model with parameters used for comparison; and
calculating the center and the size of the focus window according to a result of the operation and the predetermined width and predetermined height.
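The five steps of note 3 can be sketched as follows. The patent fixes neither the layer sizes, nor how the comparison operation is defined, nor how its result is scaled, so the single-layer weights, the dot product with comparison parameters `C`, and the sigmoid rescaling to the predetermined width and height are all assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focus_window(fmaps, h_prev, params):
    """Map the feature-map group and the recurrent model's previous state
    to a focus-window center and size (illustrative parameterisation)."""
    v = sigmoid(params["W1"] @ fmaps.ravel())   # 1st network, sigmoid activation
    merged = np.concatenate([v, h_prev])        # merge with previous state vector
    u = sigmoid(params["W2"] @ merged)          # 2nd network, sigmoid activation
    t = np.tanh(params["W3"] @ u)               # 3rd network, tanh activation
    scores = params["C"] @ t                    # operation with comparison params
    n_maps, height, width = fmaps.shape
    cx = sigmoid(scores[0]) * width             # scale by predetermined width
    cy = sigmoid(scores[1]) * height            # scale by predetermined height
    size = sigmoid(scores[2]) * min(width, height)
    return cx, cy, size

rng = np.random.default_rng(1)
fmaps = rng.standard_normal((8, 7, 7))
h_prev = rng.standard_normal(16)
d = 8 * 7 * 7
params = {"W1": rng.standard_normal((32, d)) * 0.1,
          "W2": rng.standard_normal((24, 32 + 16)) * 0.1,
          "W3": rng.standard_normal((24, 24)) * 0.1,
          "C":  rng.standard_normal((3, 24)) * 0.1}
cx, cy, size = focus_window(fmaps, h_prev, params)
print(0 <= cx <= 7 and 0 <= cy <= 7 and 0 <= size <= 7)  # True
```

Because every output passes through a sigmoid before scaling, the window center always falls inside the feature maps and the window size never exceeds them.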
Note 4. The information processing method according to note 1, wherein training the text description model further comprises: obtaining an attention feature vector of the group of feature maps based on the center and the size of the focus window.
Note 5. The information processing method according to note 4, wherein obtaining the attention feature vector comprises: applying a fourth neural network model to a portion, corresponding to the focus window, of the group of feature maps so as to convert the portion into a vector, and taking the vector as the attention feature vector.
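Notes 4–5 crop the focus window out of every feature map and flatten the crop through the fourth neural network model into one attention feature vector. A sketch in which the crop geometry, the zero-padding to a fixed input size, and the single tanh layer are all illustrative assumptions:

```python
import numpy as np

def attend(fmaps, cx, cy, size, W4):
    """Crop the focus-window portion of each feature map, pad it to a fixed
    shape, and convert it into one attention feature vector through a
    (hypothetical) fourth network with tanh activation."""
    n, H, W = fmaps.shape
    half = max(1, int(round(size / 2)))
    r0, r1 = max(0, int(cy) - half), min(H, int(cy) + half)
    c0, c1 = max(0, int(cx) - half), min(W, int(cx) + half)
    patch = fmaps[:, r0:r1, c0:c1]
    # zero-pad so W4 always sees a constant input size
    fixed = np.zeros((n, 2 * half, 2 * half))
    fixed[:, :patch.shape[1], :patch.shape[2]] = patch
    return np.tanh(W4 @ fixed.ravel())

rng = np.random.default_rng(2)
fmaps = rng.standard_normal((8, 7, 7))
W4 = rng.standard_normal((16, 8 * 4 * 4)) * 0.1   # assumes a 4x4 crop
vec = attend(fmaps, cx=3.5, cy=3.5, size=4.0, W4=W4)
print(vec.shape)  # (16,)
```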
Note 6. The information processing method according to note 4, wherein training the text description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
Note 7. The information processing method according to note 6, wherein obtaining the text description corresponding to the focus window comprises: applying a fifth neural network model to the current state vector so as to calculate an occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
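Notes 6–7 can be sketched as one decoding step: merge the attention feature vector with the previous state vector to obtain the current state, then let a fifth network score every word in the predetermined dictionary and keep the most probable one. The plain tanh recurrence, the softmax normalisation and the weight shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(attn_vec, h_prev, params, vocab):
    """One decoding step: update the recurrent state from the attention
    feature vector, score every dictionary word with a (hypothetical) fifth
    network, and emit the word with the highest occurrence probability."""
    h = np.tanh(params["Wh"] @ np.concatenate([attn_vec, h_prev]))  # new state
    probs = softmax(params["W5"] @ h)                               # word probs
    return h, vocab[int(np.argmax(probs))], probs

vocab = ["a", "cat", "sits", "."]
rng = np.random.default_rng(3)
params = {"Wh": rng.standard_normal((16, 32)) * 0.1,
          "W5": rng.standard_normal((4, 16)) * 0.1}
h, word, probs = step(rng.standard_normal(16), rng.standard_normal(16),
                      params, vocab)
print(word in vocab, abs(probs.sum() - 1.0) < 1e-9)  # True True
```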
Note 8. The information processing method according to note 6, wherein training the text description model further comprises: for one sample image among the plurality of sample images, terminating the training performed based on that sample image when it is determined that the text description corresponding to the focus window is a full stop.
Note 9. The information processing method according to note 7, wherein parameters of the text description model include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, parameters of the recurrent neural network model, and the parameters used for comparison.
Note 10. The information processing method according to note 1, wherein the recurrent neural network model includes a long short-term memory network.
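Tying notes 6–8 together, description generation is a loop that repeats the state update and word selection until the emitted word is a full stop. Here a plain tanh recurrence stands in for the recurrent (possibly long short-term memory) network model of note 10, and the attention stages are collapsed into a stub; everything below is an illustrative assumption:

```python
import numpy as np

vocab = ["a", "cat", "sits", "."]

def fake_attend(fmaps, h):
    """Stub standing in for the attention stages of notes 3-5."""
    return np.tanh(fmaps.mean(axis=(1, 2)) + h[: fmaps.shape[0]])

def generate(fmaps, Wh, W5, max_len=10):
    """Generation loop of notes 6-8: update the state, emit the most
    probable word, and stop on the full stop (max_len is a safeguard)."""
    h = np.zeros(Wh.shape[0])
    words = []
    for _ in range(max_len):
        a = fake_attend(fmaps, h)
        h = np.tanh(Wh @ np.concatenate([a, h]))
        word = vocab[int(np.argmax(W5 @ h))]
        words.append(word)
        if word == ".":   # note 8: a full stop terminates the description
            break
    return words

rng = np.random.default_rng(4)
fmaps = rng.standard_normal((8, 7, 7))
Wh = rng.standard_normal((16, 8 + 16)) * 0.1
W5 = rng.standard_normal((4, 16)) * 0.1
sent = generate(fmaps, Wh, W5)
print(1 <= len(sent) <= 10 and all(w in vocab for w in sent))  # True
```

The same loop, run with trained parameters on a new input image, corresponds to the detection method of notes 12–20.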
Note 11. An information processing apparatus, comprising:
an extraction unit configured to extract, from each of a plurality of sample images, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
a training unit configured to train a text description model based on the extracted group of feature maps and text descriptions labelled for the plurality of sample images, the text description model being used to generate a corresponding text description from an input image, wherein training the text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
Note 12. An information detection method, comprising:
extracting, from an input image, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
generating a corresponding text description of the input image using a trained text description model based on the extracted group of feature maps, wherein generating the corresponding text description of the input image using the trained text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
Note 13. The information detection method according to note 12, wherein extracting the group of feature maps from the input image comprises: extracting the group of feature maps from the input image using a convolutional neural network model.
Note 14. The information detection method according to note 12, wherein calculating the center and the size of the focus window on the group of feature maps comprises:
applying, to the group of feature maps, a first neural network model with a sigmoid function as its activation function so as to convert the group of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model, and applying, to the merged vector, a second neural network model with a sigmoid function as its activation function;
further applying, to the vector obtained through the second neural network model, a third neural network model with a tanh function as its activation function;
performing an operation on the vector obtained through the third neural network model with parameters used for comparison; and
calculating the center and the size of the focus window according to a result of the operation and the predetermined width and predetermined height.
Note 15. The information detection method according to note 12, wherein generating the corresponding text description of the input image using the trained text description model further comprises: obtaining an attention feature vector of the group of feature maps based on the center and the size of the focus window.
Note 16. The information detection method according to note 15, wherein obtaining the attention feature vector comprises: applying a fourth neural network model to a portion, corresponding to the focus window, of the group of feature maps so as to convert the portion into a vector, and taking the vector as the attention feature vector.
Note 17. The information detection method according to note 15, wherein generating the corresponding text description of the input image using the trained text description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
Note 18. The information detection method according to note 17, wherein obtaining the text description corresponding to the focus window comprises: applying a fifth neural network model to the current state vector so as to calculate an occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
Note 19. The information detection method according to note 17, wherein generating the corresponding text description of the input image using the trained text description model further comprises: terminating the generation of the corresponding text description of the input image when it is determined that the text description corresponding to the focus window is a full stop.
Note 20. The information detection method according to note 18, wherein parameters of the trained text description model include parameters of the convolutional neural network model, parameters of the first neural network model, parameters of the second neural network model, parameters of the third neural network model, parameters of the fourth neural network model, parameters of the fifth neural network model, parameters of the recurrent neural network model, and the parameters used for comparison.

Claims (10)

1. An information processing method, comprising:
extracting, from each of a plurality of sample images, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
training a text description model based on the extracted group of feature maps and text descriptions labelled for the plurality of sample images, the text description model being used to generate a corresponding text description from an input image, wherein training the text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
2. The information processing method according to claim 1, wherein extracting the group of feature maps from each sample image comprises: extracting the group of feature maps from each sample image using a convolutional neural network model.
3. The information processing method according to claim 1, wherein calculating the center and the size of the focus window on the group of feature maps comprises:
applying, to the group of feature maps, a first neural network model with a sigmoid function as its activation function so as to convert the group of feature maps into a vector;
merging the converted vector with the previous state vector of the recurrent neural network model, and applying, to the merged vector, a second neural network model with a sigmoid function as its activation function;
further applying, to the vector obtained through the second neural network model, a third neural network model with a tanh function as its activation function;
performing an operation on the vector obtained through the third neural network model with parameters used for comparison; and
calculating the center and the size of the focus window according to a result of the operation and the predetermined width and predetermined height.
4. The information processing method according to claim 1, wherein training the text description model further comprises: obtaining an attention feature vector of the group of feature maps based on the center and the size of the focus window.
5. The information processing method according to claim 4, wherein obtaining the attention feature vector comprises: applying a fourth neural network model to a portion, corresponding to the focus window, of the group of feature maps so as to convert the portion into a vector, and taking the vector as the attention feature vector.
6. The information processing method according to claim 4, wherein training the text description model further comprises: calculating a current state vector of the recurrent neural network model based on the attention feature vector and the previous state vector of the recurrent neural network model, and obtaining a text description corresponding to the focus window based on the current state vector.
7. The information processing method according to claim 6, wherein obtaining the text description corresponding to the focus window comprises: applying a fifth neural network model to the current state vector so as to calculate an occurrence probability of each word in a predetermined dictionary, and determining the word with the highest occurrence probability as the text description corresponding to the focus window.
8. The information processing method according to claim 6, wherein training the text description model further comprises: for one sample image among the plurality of sample images, terminating the training performed based on that sample image when it is determined that the text description corresponding to the focus window is a full stop.
9. An information processing apparatus, comprising:
an extraction unit configured to extract, from each of a plurality of sample images, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
a training unit configured to train a text description model based on the extracted group of feature maps and text descriptions labelled for the plurality of sample images, the text description model being used to generate a corresponding text description from an input image, wherein training the text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
10. An information detection method, comprising:
extracting, from an input image, a group of feature maps having a predetermined width and a predetermined height, wherein the feature maps in the group of feature maps respectively correspond to different image features; and
generating a corresponding text description of the input image using a trained text description model based on the extracted group of feature maps, wherein generating the corresponding text description of the input image using the trained text description model comprises: calculating a center and a size of a focus window on the group of feature maps based on the group of feature maps and a previous state vector of a recurrent neural network model.
CN201710320880.4A 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device Active CN108875758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710320880.4A CN108875758B (en) 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device


Publications (2)

Publication Number Publication Date
CN108875758A true CN108875758A (en) 2018-11-23
CN108875758B CN108875758B (en) 2022-01-11

Family

ID=64287118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710320880.4A Active CN108875758B (en) 2017-05-09 2017-05-09 Information processing method and device, and information detection method and device

Country Status (1)

Country Link
CN (1) CN108875758B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321918A (en) * 2019-04-28 2019-10-11 厦门大学 The method of public opinion robot system sentiment analysis and image labeling based on microblogging
US11745727B2 (en) * 2018-01-08 2023-09-05 STEER-Tech, LLC Methods and systems for mapping a parking area for autonomous parking

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120033874A1 (en) * 2010-08-05 2012-02-09 Xerox Corporation Learning weights of fonts for typed samples in handwritten keyword spotting
CN104765728A (en) * 2014-01-08 2015-07-08 富士通株式会社 Method and device for training neural network and method for determining sparse feature vector
CN105809201A (en) * 2016-03-11 2016-07-27 中国科学院自动化研究所 Identification method and device for autonomously extracting image meaning concepts in biologically-inspired mode
CN105989341A (en) * 2015-02-17 2016-10-05 富士通株式会社 Character recognition method and device
CN106198749A (en) * 2015-05-08 2016-12-07 中国科学院声学研究所 A kind of data fusion method of multiple sensor based on Metal Crack monitoring
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN106484139A (en) * 2016-10-19 2017-03-08 北京新美互通科技有限公司 Emoticon recommends method and device
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ORIOL VINYALS et al.: "Show and tell: A neural image caption generator", 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
R BOWDEN et al.: "A Linguistic Feature Vector for the Visual Interpretation of Sign Language", Computer Vision *
ZHANG Shuye: "Deep Models and Their Application in Visual Text Analysis", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
LI Yuejie et al.: "Research and Simulation on Optimized Recognition of Specific Text Images in Natural Scenes", Computer Simulation *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant