CN108665506B - Image processing method, image processing device, computer storage medium and server


Info

Publication number
CN108665506B
Authority
CN
China
Prior art keywords
image
representation information
image representation
features
processed
Prior art date
Legal status
Active
Application number
CN201810442810.0A
Other languages
Chinese (zh)
Other versions
CN108665506A (en)
Inventor
姜文浩 (Wenhao Jiang)
马林 (Lin Ma)
刘威 (Wei Liu)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810442810.0A
Publication of CN108665506A
Application granted
Publication of CN108665506B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the invention discloses an image processing method and device, a computer storage medium and a server. The method is applied to a fusion device and comprises the following steps: the fusion device obtains M groups of image features of an image to be processed from the encoder; first image representation information corresponding to each group of image features in the M groups of image features is obtained; M image representation information sets are generated according to each group of image features and the first image representation information corresponding to that group, where the image representation information set generated for one group of image features comprises at least one piece of second image representation information; and the second image representation information included in the M image representation information sets is fused to obtain target image representation information, which is output to the decoder. By adopting the embodiment of the invention, the accuracy of the natural-sentence description of an image can be improved, and the quality of the image content understanding service can be optimized.

Description

Image processing method, image processing device, computer storage medium and server
Technical Field
The present invention relates to the field of internet technologies, in particular to the field of image processing technologies, and specifically to an image processing method, an image processing apparatus, a computer storage medium, and a server.
Background
In order to facilitate quick understanding of the main contents of images, image content understanding services have been developed. The image content understanding service is a service for converting image content into a description using one natural sentence, and thus image content understanding can also be understood as image content description. In other words, image content understanding can be seen as a translation problem, i.e. translating image content into a natural sentence description. One important factor for measuring the quality of the image content understanding service is the description accuracy of natural sentences used for describing image content.
In the prior art, an image processing flow is generally divided into an encoding stage and a decoding stage. Encoding stage: the image features of each frame of the original image are extracted by an encoder. Decoding stage: natural sentences used for describing the image content are predicted by a decoder according to the image features extracted by the encoder. Although the prior art realizes the image content understanding service, it only obtains natural sentences for describing the image content through an encoder and a decoder, and does not pay attention to describing the image from multiple angles, so the quality of the image content understanding service cannot be guaranteed.
Disclosure of Invention
Embodiments of the present invention provide an image processing method, an image processing apparatus, a computer storage medium, and a server, which can improve description accuracy in describing image content using natural sentences, improve quality of image content understanding service, and further improve user experience of the image content understanding service.
In a first aspect, an embodiment of the present invention provides an image processing method, where the method is applied to an image processing system, where the image processing system includes an encoder, a fuser, and a decoder, and the method includes:
the fusion device obtains M groups of image characteristics of the image to be processed from the encoder, wherein M is an integer not less than 2;
the fusion device acquires first image representation information corresponding to each group of image features in the M groups of image features;
the fusion device generates M image representation information sets according to each group of image features and the first image representation information corresponding to each group of image features, wherein one image representation information set is generated corresponding to one group of image features, and one image representation information set comprises at least one piece of second image representation information;
the fusion device fuses second image representation information included in the M image representation information sets to obtain target image representation information, and outputs the target image representation information to a decoder;
the target image representation information is used for a decoder to decode the image to be processed to obtain the image description of the image to be processed.
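To make this flow concrete, the following is a minimal PyTorch-style sketch of the encoder-fuser-decoder pipeline described above; the module names (ImageProcessingSystem, the encoders, fuser and decoder arguments) and shapes are illustrative assumptions and not components defined by this disclosure.

```python
import torch

# Hypothetical components standing in for the encoder(s), fuser and decoder of the
# first aspect; only the data flow of the M groups of image features is illustrated.
class ImageProcessingSystem(torch.nn.Module):
    def __init__(self, encoders, fuser, decoder):
        super().__init__()
        self.encoders = torch.nn.ModuleList(encoders)  # M encoders (or M encoding channels)
        self.fuser = fuser
        self.decoder = decoder

    def forward(self, image):
        # M groups of image features of the image to be processed (M >= 2)
        feature_groups = [enc(image) for enc in self.encoders]
        # Fuser: per-group first image representation information -> M image
        # representation information sets -> fused target image representation information
        target_repr = self.fuser(feature_groups)
        # Decoder: target representation -> natural-sentence image description
        return self.decoder(target_repr)
```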
In some possible embodiments, the obtaining, by the fuser, M sets of image features of the image to be processed from the encoder includes:
the fusion device obtains M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters, and one encoder outputs one group of image characteristics; or
The fusion device obtains M groups of image characteristics of an image to be processed from M coding channels of an encoder of the image processing system, wherein one coding channel of the encoder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the acquiring, by the fusion device, first image representation information corresponding to each of the M sets of image features includes:
the fusion device generates first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and a specified linear transformation matrix.
In some possible embodiments, the first image representation information corresponding to the sets of image features may be a first hidden state corresponding to the sets of image features;
the fusion device generates M image representation information sets according to the image features and the first image representation information corresponding to the image features, and comprises:
the fusion device learns any group of image features and first image representation information corresponding to the image features on the basis of a first long-short term memory (LSTM) unit to obtain image representation information A corresponding to the image features;
the fusion device learns any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one group of image features to obtain an image representation information set i corresponding to that group of image features, wherein the image representation information A and the image representation information B are the second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some feasible embodiments, the image representation information provided in the embodiment of the present invention may include a hidden state, the first image representation information may be a first hidden state, the second image representation information may be a second hidden state, and the hidden state A and the hidden state B may be the image representation information A and the image representation information B, which will not be repeated below.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the fusion device learning any group of image features and the first image representation information corresponding to that group of image features based on a first LSTM unit to obtain the image representation information A corresponding to that group of image features comprises:
the fusion device learns the local image features of the sub-regions in the group of image features and the first image representation information corresponding to the group of image features based on the attention model in the first LSTM unit, and outputs the context vector corresponding to the group of image features;
the fusion device learns the context vector corresponding to the group of image features and the first image representation information corresponding to the group of image features based on the first LSTM unit to obtain the image representation information A corresponding to the group of image features.
In some possible embodiments, the fusing, by the fuser, the second image representation information included in the M image representation information sets to obtain the target image representation information includes:
the fusion device determines third image representation information according to the image representation information B included in each image representation information set in the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and processing the target vector matrix and the third image representation information based on a third LSTM unit to generate target image representation information.
In some possible embodiments, the third LSTM unit at least includes LSTM1 and LSTM2, and the generating the target image representation information based on the third LSTM unit according to the target vector matrix and the third image representation information includes:
learning the M context vectors included in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
In some possible embodiments, the method further comprises:
the fusion device obtains the image description of the image to be processed from the decoder, and determines the discrimination supervision loss function of the image processing according to the image description of the image to be processed;
the fusion device combines the discrimination supervision loss function to construct a loss function of an image processing system according to the M image representation information sets of the images to be processed and the target image representation information;
and the fusion device modifies the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
In some possible embodiments, the above-described loss function may also be used to modify the network parameters of the LSTM unit employed by the decoder.
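As a rough illustration of this training setup, the sketch below combines a hypothetical caption loss with a discrimination-supervision term and updates the LSTM parameters of the fuser (and optionally the decoder); the optimizer, weighting factor and loss forms are assumptions made only for illustration.

```python
import torch

# Illustrative only: assumes fuser and decoder are nn.Modules and that
# caption_loss and ds_loss are scalar tensors computed elsewhere.
def build_optimizer(fuser, decoder, lr=1e-4, tune_decoder=True):
    params = list(fuser.parameters())
    if tune_decoder:                      # the loss may also correct the decoder's LSTM
        params += list(decoder.parameters())
    return torch.optim.Adam(params, lr=lr)

def training_step(optimizer, caption_loss, ds_loss, lam=1.0):
    total = caption_loss + lam * ds_loss  # lam: assumed weighting of the DS term
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total)
```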
In a second aspect, an embodiment of the present invention provides an image processing apparatus, where the image processing apparatus is applied to an image processing system, the image processing system includes an encoder, a fuser and a decoder, and the apparatus may be the fuser. The apparatus includes:
an obtaining unit, configured to obtain M sets of image features of an image to be processed from the encoder, where M is an integer not less than 2;
the obtaining unit is further configured to obtain first image representation information corresponding to each group of image features in the M groups of image features, where the first image representation information may be a first hidden state;
a first fusion unit, configured to generate M image representation information sets according to the groups of image features acquired by the acquisition unit and first image representation information corresponding to the groups of image features, where a group of image features corresponds to a generated image representation information set, and each image representation information set includes at least one piece of second image representation information;
a second fusion unit configured to fuse second image representation information included in the M image representation information sets obtained by the first fusion unit and learn the second image representation information to obtain target image representation information;
an output unit configured to output the target image representation information obtained by the second fusion unit to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In some possible embodiments, the obtaining unit is configured to:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
Acquiring M groups of image features of an image to be processed from M encoding channels of an encoder of the image processing system, wherein one encoding channel of the encoder corresponds to one group of encoding parameters, and one encoding channel outputs one group of image features.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the acquisition unit is configured to:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In some possible embodiments, the first fusion unit is configured to:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one group of image features to obtain an image representation information set i corresponding to that group of image features, wherein the image representation information A and the image representation information B are the second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the first fusing unit is configured to:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any one group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any one group of image features.
In some possible embodiments, the second fusion unit is configured to:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning the M context vectors in the target vector matrix and the third image representation information based on a third LSTM unit to obtain target image representation information.
In some possible embodiments, the third LSTM unit employed by the second fusion unit includes at least LSTM1 and LSTM2, and the second fusion unit is configured to:
learning the M context vectors in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D included in the target image representation information set as target image representation information.
In some possible embodiments, the image processing apparatus further includes an optimization unit configured to:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and correcting, according to the loss function, the network parameters of the LSTM units adopted by the image processing apparatus.
In a third aspect, the present invention provides a computer storage medium applied in an image processing system, the image processing system including an encoder, a fuser and a decoder, the computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the fuser and to perform the method provided by any feasible implementation manner of the first aspect and the first aspect.
In a fourth aspect, an embodiment of the present invention provides a server, where the server includes an image processing system, where the image processing system includes an encoder, a fuser and a decoder, and the fuser further includes:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the method as provided by any of the possible embodiments of the first aspect and the first aspect described above.
Through the fusion device, the method and the device can learn and fuse multiple groups of image features obtained from the encoder and the image representation information corresponding to those image features to obtain multiple image representation information sets, further fuse the image representation information sets to obtain target image representation information, and output the target image representation information to the decoder; the decoder decodes the image to be processed in combination with the fused target image representation information of the image to be processed to obtain a natural sentence corresponding to the image to be processed, and the natural sentence is used for the image description of the image to be processed. Therefore, the fusion device fuses multiple groups of image features processed by the encoder to obtain an image representation with a richer amount of data, and this richer representation is provided to the decoder for decoding, so that the description accuracy of the natural sentence is improved and the quality of the image content understanding service is optimized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of an image processing system according to an embodiment of the present invention;
FIG. 2 is a block diagram of another embodiment of an image processing system;
FIG. 3 is a schematic diagram of an application scenario of image processing provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of image feature fusion provided in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 6 is a schematic view of another flowchart of an image processing method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In order to facilitate quick understanding of the main contents of images, image content understanding services have been developed. The image content understanding service is a service that generates a natural-language expression for describing image content by performing a series of encoding and decoding processes on an image, including extracting features of the image, understanding the image content, and the like. In other words, the image content understanding service is a service that converts image content into a description in one natural sentence. The image content understanding service may be applied in a variety of internet scenarios. For example, the image content understanding service can be applied to an image classification scenario of an image website, where images are classified based on their natural-sentence descriptions. For another example, the image content understanding service can also be applied to an image retrieval scenario of an image website, where images are retrieved based on their natural-sentence descriptions. It can also be combined with a voice retrieval system, converting the natural sentence of an image into speech output so as to assist visually impaired users in retrieving images. The image content understanding service can also be applied to a target detection scenario, for example finding objects such as a target person in a surveillance video. Further uses can be determined according to the actual application scenario, which is not limited herein.
One important factor for measuring the quality of the image content understanding service is the description accuracy of natural sentences used for describing image content. If the description accuracy of the natural sentence is higher, which indicates that the matching degree of the content described by the natural sentence and the image content is higher, the image content understanding service quality is better, and the user use experience is better. Conversely, the lower the description accuracy of the natural language sentence is, the lower the matching degree between the content described by the natural language sentence and the image content is, the worse the image content understanding service quality is, and the worse the user experience is.
Referring to fig. 1, fig. 1 is a schematic diagram of the framework of an image processing system according to an embodiment of the present invention. As shown in fig. 1, the framework of the image processing system is mainly composed of an encoder and a decoder. Correspondingly, the flow of image processing performed by the image processing system is divided into two stages: an encoding stage and a decoding stage. Encoding stage: image feature extraction is performed on each frame of the original image by the encoder. Decoding stage: the frame features extracted in the encoding stage are transformed by means of a mean feature mechanism, an attention mechanism or the like, and a natural sentence for describing the image content is then predicted by the decoder according to the transformed image features. The attention mechanism is taken as an example to illustrate the embodiment of the present invention, and is not described in detail here. As can be seen from the image processing scheme corresponding to the image processing system shown in fig. 1, the decoder generates natural sentences using the image features output by the encoder, so whether the image features output by the encoder are rich directly affects the accuracy with which the natural sentences obtained by the decoder describe the image content. In the image processing scheme shown in fig. 1, the vectors output by the encoder are used directly by the decoder for decoding without any further processing; the amount of image information carried by the vectors output by a single encoder is small, and the decoder can only obtain natural sentences for describing the image content from those vectors, so the description accuracy of the natural sentences and hence the quality of the image content understanding service cannot be guaranteed, and the applicability is poor.
Based on this, an embodiment of the present invention proposes an image processing method in which a fuser is added to the image processing system, so that the image processing system includes not only an encoder and a decoder but also the fuser. The fuser fuses image data such as the image features extracted by a plurality of encoders and the image representation information corresponding to those image features, or fuses image data such as the image features output by a plurality of encoding channels of one encoder and the corresponding image representation information, and outputs the fused image representation to the decoder. The decoder decodes the image to be processed using the richer image data obtained by the fuser and obtains an image description such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the quality of the image content understanding service for the image to be processed, and can enhance the user experience of the image content understanding service.
Based on this principle, the image processing system according to the embodiment of the present invention introduces, on top of the system architecture shown in fig. 1, a fuser for fusing the image features output by the encoder. Please refer to fig. 2, which is another schematic diagram of the framework of an image processing system according to an embodiment of the present invention. The image processing system of the embodiment of the invention includes an encoder, a fuser and a decoder. Based on the image processing system shown in fig. 2, the image processing flow of the embodiment of the present invention is mainly divided into three stages: an encoding stage at the encoder side, a fusion stage at the fuser side, and a decoding stage at the decoder side. The implementation of each of these three stages is described as follows:
First, the encoding stage:
an original image (i.e., an image to be processed, which will be described below by taking the image to be processed as an example for convenience of description) is input into an encoder, and feature extraction is performed on each frame of the image to be processed by the encoder to obtain a frame feature sequence. Generally, the encoder may perform feature extraction based on a Convolutional Neural Network (CNN). The encoder may perform image encoding on an image to be processed through the CNN, and may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image.
Second, the fusion stage:
In the embodiment of the present invention, a fuser is added between the encoder and the decoder. The fuser can combine the multiple image representations of the image to be processed output by the encoder into a final image representation of the image to be processed; the combined final image representation can be input into the decoder, and the decoder outputs the natural sentence used for the image description of the image to be processed. For example, referring to fig. 3, fig. 3 is a schematic view of an application scenario of image processing according to an embodiment of the present invention. As shown in fig. 3, in the image processing method provided by the embodiment of the present invention, the image to be processed may be input into the encoder, and multiple groups of image features of the image to be processed (which may be regarded as multiple image representations of the image to be processed) are output through multiple CNNs at the encoder side. The fuser side can acquire the multiple groups of image features from the encoder side, fuse them to obtain the final image representation of the image to be processed, and output it to the decoder side. The decoder side can decode the final image representation output by the fuser side to obtain the image description of the image to be processed, for example outputting a natural sentence describing the image to be processed as an image of a pedestrian.
In some possible embodiments, the image processing system may perform image encoding on the image to be processed through a plurality of encoders to output a plurality of sets of image features of the image to be processed. One encoder may correspondingly adopt one CNN, and the network parameters adopted by each CNN are different, that is, the encoding parameters adopted by each encoder in the plurality of encoders are different. Each group of image features of the multiple groups of image features output by the multiple encoders comprises a global image feature and a group of local image features.
Optionally, in some possible embodiments, a plurality of CNNs may also be included in one encoder, and one CNN is one encoding channel, that is, a plurality of encoding channels may be included in the encoder for image encoding the image to be processed to output a plurality of sets of image features of the image to be processed. The network parameters adopted by each CNN in the multiple CNNs are different, that is, one coding channel in the multiple coding channels of the encoder corresponds to one group of coding parameters, and then the image to be processed is coded by the multiple coding channels corresponding to the multiple CNNs, so that multiple groups of different image features can be output. The plurality of different sets of image features may be a plurality of sets of image representations of the image to be processed. Similarly, each of the image features in the plurality of sets of image features includes a global image feature and a set of local image features.
In some possible embodiments, it is assumed that the number of encoders is M, where one encoder corresponds to one CNN, or that the number of encoding channels in one encoder is M, where one encoding channel corresponds to one CNN. The network parameters adopted by different CNNs are different, and therefore the M CNNs corresponding to the M encoders, or to the M encoding channels of one encoder, may be M different CNNs. For convenience of description, the extraction of image features from the image to be processed by M CNNs is taken as an example below. Assume that the global image feature obtained by the m-th CNN of the M CNNs from the image to be processed is denoted $\bar{a}^{(m)}$, and that the local image features corresponding to the sub-regions of the image to be processed are denoted $A^{(m)} = \{a_1^{(m)}, a_2^{(m)}, \dots, a_k^{(m)}\}$, where m may be any integer from 1 to M, which is not limited herein. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where $\bar{a}^{(m)}$ and $A^{(m)}$ together form the group of image features corresponding to the m-th CNN.
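Continuing the encoder sketch given earlier, the M groups of image features could be produced by M independently parameterized copies of such a CNN encoder; this is only one illustrative assumption about how the M encoders or encoding channels might be realized.

```python
import torch

M = 3  # number of encoders / encoding channels (M >= 2)
# CNNEncoder is the sketch class defined above; each copy has its own parameters.
encoders = torch.nn.ModuleList(CNNEncoder() for _ in range(M))

def extract_feature_groups(image):
    # Returns [(a_bar_m, A_m)] for m = 1..M: the global feature a_bar^(m) and the
    # local (sub-region) feature set A^(m) output by each of the M CNNs.
    return [enc(image) for enc in encoders]
```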
In some possible embodiments, the fuser may obtain M sets of image features obtained by extracting image features of the image to be processed from the M CNNs obtained by the encoder, and further fuse the M sets of image features to obtain a final image representation of the image to be processed, and output the final image representation to the decoder. The fusion of the M sets of image features obtained from the encoder by the fusion device may include a fusion stage 1 and a fusion stage 2, each of which may include one or more image feature processing steps. In the embodiment of the present invention, the image feature processing steps in the fusion stage 1 and the fusion stage 2 occur at different times, so for convenience of description, the image feature processing steps can also be described by taking time steps as an example. Referring to fig. 4, fig. 4 is a schematic diagram of image feature fusion according to an embodiment of the present invention.
In the decoding stage, the decoder may generally predict the natural sentence using a Recurrent Neural Network (RNN), and the RNN may be implemented using Long Short-Term Memory (LSTM) units. The embodiment of the present invention therefore takes an RNN based on LSTM units as an example, and takes processing the image features of the image to be processed using a temporal attention mechanism as an example. In the fusion stage, the fuser may also adopt LSTM units to process the image features acquired from the encoder, so that the processed image representation output to the decoder for the decoding stage can meet the requirements of the decoder, the accuracy of the image description output after decoding the image to be processed can be guaranteed, and the processing quality of the image processing can be improved. Correspondingly, in the embodiment of the present invention, in the process of processing the image feature data at the fuser side and the decoder side, the image representation information of the image to be processed also includes the hidden state corresponding to the image features of the image to be processed. For ease of understanding, the LSTM unit is briefly described below:
In some possible embodiments, the LSTM unit adopted by the fuser of the embodiment of the present invention may be an LSTM with an attention model, which is essentially a function with states and may be abstractly represented as $h_t = \mathrm{LSTM}(H_t, f_{att}(A, h_{t-1}))$, where

$$H_t = \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}$$

The vector $x_t$ is the input of the image feature processing step corresponding to time t (which may be referred to simply as the t-th time step, or time step t, for convenience of description), and $h_{t-1}$ is the hidden state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). $A = [a_1, a_2, \dots, a_k]$ is a set of annotation vectors; that is, A represents a set whose elements are vectors, and these vectors are called annotation vectors, for example the local image features $A^{(m)}$ of the image to be processed mentioned above, which is not limited herein. The above $f_{att}(A, h_{t-1})$ is an attention model, and the output of $f_{att}(A, h_{t-1})$ at the t-th time step is denoted as the vector $z_t$, where $z_t$ is a context vector. Inside the LSTM unit, a linear transformation is denoted by T, and the image feature processing procedure of the LSTM unit with the attention model can be expressed as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T \begin{pmatrix} H_t \\ z_t \end{pmatrix} \quad (1)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (2)$$

$$h_t = o_t \odot \tanh(c_t) \quad (3)$$

where $i_t$, $f_t$, $g_t$ and $o_t$ are respectively called the input gate, forget gate, memory gate and output gate of the LSTM unit, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\sigma$ is a common neural-network activation function, such as the sigmoid function.

An Attention Model, also called an Attention Mechanism, is widely applied to various deep learning tasks of different types, such as natural language processing, image recognition and speech recognition, and greatly improves the performance of these tasks. For convenience of description, the term attention model is used below.

The attention model $f_{att}(\cdot)$ can be used to determine which region of the image was attended to at the previous time step; that is, a weight value is calculated for each vector in the annotation vector set A, and the image region corresponding to a vector with a higher weight represents the region being attended to. When the attention model calculates the weight value corresponding to any vector $a_i$ in the annotation vector set, the calculation may be performed using a multi-layer perceptron (MLP). The similarity $e_i$ between $a_i$ and $h_{t-1}$ can be calculated by the MLP, and the weight value $w_i$ corresponding to $a_i$ is then calculated, where $w_i$ satisfies:

$$w_i = \frac{\exp(e_i)}{\sum_{j=1}^{k} \exp(e_j)} \quad (4)$$

The context vector $z_t$ corresponding to the annotation vector set input at time step t can be generated using the weight value corresponding to each vector in the annotation vector set A, where $z_t$ satisfies:

$$z_t = \sum_{i} w_i a_i \quad (5)$$

The above $z_t$ may be used in the image feature processing procedure of the LSTM unit at time step t.
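The following is a minimal PyTorch sketch of such an attention LSTM unit following formulas (1)-(5) above; the dimensions, the additive MLP scoring function, and the use of a single linear map T are illustrative assumptions. The input block H_t is assembled by the caller (for example as the concatenation of x_t and the previous hidden state).

```python
import torch
import torch.nn as nn

class AttentionLSTMCell(nn.Module):
    """LSTM cell with an attention model f_att over a set of annotation vectors A (sketch)."""
    def __init__(self, H_dim, a_dim, h_dim):
        super().__init__()
        self.T = nn.Linear(H_dim + a_dim, 4 * h_dim)   # linear transformation T in formula (1)
        self.att_a = nn.Linear(a_dim, h_dim)           # pieces of the MLP scoring a_i against h_{t-1}
        self.att_h = nn.Linear(h_dim, h_dim)
        self.att_e = nn.Linear(h_dim, 1)

    def f_att(self, A, h_prev):                        # A: (B, k, a_dim), h_prev: (B, h_dim)
        # e_i = MLP(a_i, h_{t-1}); w_i = softmax over i  -> formula (4)
        e = self.att_e(torch.tanh(self.att_a(A) + self.att_h(h_prev).unsqueeze(1)))
        w = torch.softmax(e, dim=1)                    # (B, k, 1)
        return (w * A).sum(dim=1)                      # z_t = sum_i w_i a_i  -> formula (5)

    def forward(self, H_t, A, h_prev, c_prev):
        z_t = self.f_att(A, h_prev)                    # context vector
        gates = self.T(torch.cat([H_t, z_t], dim=-1))  # formula (1)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c_t = f * c_prev + i * g                       # formula (2)
        h_t = o * torch.tanh(c_t)                      # formula (3)
        return h_t, c_t
```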
In some possible embodiments, as shown in fig. 4, it is assumed that the number of image feature processing steps included in the fusion stage 1 is T1, and the number of image feature processing steps included in the fusion stage 2 is T2. For convenience of description, the image feature processing procedure of each step in the fusion stage 1 and the fusion stage 2 is described by taking M = 3, T1 = 2 (including t = 1 and t = 2), and T2 = 3 (including t = T1+1, t = T1+2, and t = T1+3) as an example.
Fusion stage 1:
In some possible embodiments, as shown in fig. 4, the image representation output by the m-th CNN is input to the m-th row of the fusion stage 1, where the image representation output by the m-th CNN may include the global image feature $\bar{a}^{(m)}$ and the local image features $A^{(m)}$. Assume that in the fusion stage 1, at time step t, the hidden state and memory cell of the LSTM unit of the m-th row are denoted $h_t^{(m)}$ and $c_t^{(m)}$. At the initial time step (before time step t), the hidden states and memory cells of the LSTM units of each row are initialized as follows; taking the hidden state $h_0^{(m)}$ and memory cell $c_0^{(m)}$ of the LSTM units of the m-th row as an example, $h_0^{(m)}$ and $c_0^{(m)}$ satisfy:

$$h_0^{(m)} = c_0^{(m)} = W^{(m)} \bar{a}^{(m)} \quad (6)$$

where $W^{(m)}$ is a linear transformation matrix used for the linear transformation inside the LSTM units of the m-th row, and $\bar{a}^{(m)}$ is the global image feature of the image to be processed output by the CNN of the m-th row. The above $h_0^{(m)}$ is the initial hidden state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the hidden state corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first hidden state. Here, the first hidden state may be first image representation information of the image to be processed, which is used by the LSTM unit to learn image data such as the image features of the image to be processed. Similarly, the subsequent second hidden state and the like may be second image representation information of the image to be processed; the first hidden state, the second hidden state and so on are merely hidden states (i.e., image representation information) generated at different time nodes and marked differently, without specific limitation, which will not be repeated below.
In some possible embodiments, in the fusion stage 1, when the fusion device fuses the M groups of image features obtained from the encoder at time step t, the hidden states of the LSTM unit in the fusion stage 1
Figure GDA00032085021700001413
And a memory cell
Figure GDA00032085021700001414
Satisfies the following conditions:
Figure GDA00032085021700001415
wherein HtSatisfies the following conditions:
Figure GDA0003208502170000151
wherein the content of the first and second substances,
Figure GDA0003208502170000152
is the attention model of the m-th line in the fusion stage 1 (or the fusion stage I), and the local image features of each sub-region of the image to be processed in the m-th group of image features output according to the m-th CNN and the m-th group of image features (namely A) can be obtained through the attention model of the m-th line(m)) Corresponding first implicit State (i.e. the first implicit State)
Figure GDA0003208502170000153
Assuming t is 1, then
Figure GDA0003208502170000154
(may be)
Figure GDA0003208502170000155
) Outputting context vector z corresponding to mth group of image featuresm. Wherein z is as defined abovemSatisfies the following conditions:
Figure GDA0003208502170000156
similarly, the context vectors corresponding to the image features of the M-1 th group can be output according to the attention models of the M-1 th group except the M-1 th group and the first hidden states corresponding to the image features of the M-1 th group.
In the above formula (8)
Figure GDA0003208502170000157
Is the LSTM cell for line m at time step t, by
Figure GDA0003208502170000158
Context vector z corresponding to the mth group of image featuresmAnd learning the implicit state (such as a first implicit state) corresponding to the last time step t-1 of each group of image features output by the encoder and outputting the implicit state corresponding to the mth group of image features.
H in the above formula (8)tIs a vector obtained by overlapping (or merging) the hidden states corresponding to each set of image features (e.g. the first hidden states corresponding to each set of image features) in the previous time step of time step t (i.e. time step t-1). For example, for time step t, assuming time step t is 1, then H1Satisfies the following conditions:
Figure GDA0003208502170000159
similarly, for time step t +1, there is Ht+1(e.g. H)2) Satisfies the following conditions:
Figure GDA00032085021700001510
in an embodiment of the invention, if the two time steps are different, for example time step t1And time step t2And t is1≠t2Or different CNN extracted image features are input into two different lines, e.g. m1And m2And m is1≠m2Then, then
Figure GDA00032085021700001511
And
Figure GDA00032085021700001512
also, the network parameters of (1) are different, so there is M T in the convergence stage 11An LSTM cell. For example, as shown in fig. 4, in the fusion stage 1, when M ═ 3, T1When the number of LSTM units is 2, the number of LSTM units is 6, for example, LSTM11, LSTM12, and LSTM13 corresponding to time step T ═ 1, and LSTM21, LSTM22, and LSTM23 corresponding to time step T ═ T1 ═ 2.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the hidden states output after the LSTM units of each row at different time steps process the image features of the image to be processed may be fused into one hidden state set, and correspondingly, one hidden state set may also be represented as an image representation information set corresponding to the image to be processed. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. Wherein, the implicit state set correspondingly output by the LSTM unit in the mth row satisfies:
Figure GDA0003208502170000161
wherein, the above
Figure GDA0003208502170000162
Can be represented in the m-th row respectively when the time steps T are 1, 2, …, T1The implicit state of each LSTM unit output. For convenience of description, in the fusion stage 1, at time step t1(e.g., t ═ 1), the hidden state output by the LSTM unit of each row may be represented by hidden state a, which may represent image representation information a of the image to be processed. Time step t after time step t12(e.g., t ═ 1), the hidden state output by the LSTM unit of each row may be represented by hidden state B, and similarly, hidden state B may represent image representation information B of the image to be processed. In the embodiment of the present invention, the hidden state a and the hidden state B may be respectively used to represent hidden states (i.e., image representation information) generated in different time steps in the fusion stage 1, and may specifically be represented in other more forms according to requirements of an actual application scenario, which is not limited herein. That is, after the fusion stage 1, one of the M groups of image features output by the M CNNs corresponds to one hidden state set, and one hidden state set includes at least one hidden state (e.g., hidden state a and hidden state B).
In some possible embodiments, each of the M sets of hidden states may be used in the fusion stage 2, or in the optimization stage of the fusion device.
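A rough sketch of fusion stage 1 under the same assumptions as the attention-LSTM sketch above: one independently parameterized attention-LSTM cell per row m and time step t (M x T1 cells in total), each row attending over its own local features A^(m) while sharing the concatenated previous hidden states H_t of formula (8), and each row collecting its hidden states into a set as in formula (12). Class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class FusionStage1(nn.Module):
    """Sketch: turns M groups of image features into M hidden-state sets (formula (12))."""
    def __init__(self, M, T1, g_dim, a_dim, h_dim):
        super().__init__()
        self.M, self.T1 = M, T1
        # M * T1 attention-LSTM cells; AttentionLSTMCell is the sketch given earlier,
        # with the input block H_t of size M * h_dim (formula (8)).
        self.cells = nn.ModuleList(
            nn.ModuleList(AttentionLSTMCell(M * h_dim, a_dim, h_dim) for _ in range(M))
            for _ in range(T1))
        self.W = nn.ModuleList(nn.Linear(g_dim, h_dim) for _ in range(M))  # W^(m) in formula (6)

    def forward(self, feature_groups):            # feature_groups[m] = (a_bar_m, A_m), A_m: (B, k, a_dim)
        h = [self.W[m](a_bar) for m, (a_bar, _) in enumerate(feature_groups)]
        c = [hm.clone() for hm in h]              # h_0^(m) = c_0^(m)  (formula (6))
        sets = [[] for _ in range(self.M)]
        for t in range(self.T1):
            H_t = torch.cat(h, dim=-1)            # concatenated previous hidden states (formula (8))
            h_new, c_new = [], []
            for m in range(self.M):
                A_m = feature_groups[m][1]
                hm, cm = self.cells[t][m](H_t, A_m, h[m], c[m])   # formula (7)
                h_new.append(hm)
                c_new.append(cm)
                sets[m].append(hm)
            h, c = h_new, c_new
        return sets, h, c                         # M hidden-state sets, plus h_T1^(m) and c_T1^(m)
```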
Fusion stage 2:

In some possible embodiments, the fusion stage 2 includes T2 = 3 time steps; for convenience of description, assume that the fusion stage 2 includes time steps t1 (e.g., t = T1+1), t2 (e.g., t = T1+2) and t3 (e.g., t = T1+T2). As shown in fig. 4, in the fusion stage 2, different time steps may include different LSTM units, and each LSTM unit may be configured to fuse again the M hidden state sets obtained in the fusion stage 1 so as to obtain a final hidden state.

As shown in fig. 4, at the initial time step of the fusion stage 2 (assumed to be time step T0), the initial hidden state of the fusion stage 2 may be determined according to the hidden state of the last time step of the fusion stage 1 (e.g., t = T1, i.e., hidden state B) included in each hidden state set obtained in the fusion stage 1. For convenience of description, this initial hidden state is referred to as a third hidden state; the third hidden state may be another piece of image representation information corresponding to the image to be processed, and may also be referred to as third image representation information. The third hidden state (taking $h_{T_1}$ as an example) and the initial memory cell of the fusion stage 2 (taking $c_{T_1}$ as an example) satisfy:

$$h_{T_1} = \frac{1}{M}\sum_{m=1}^{M} h_{T_1}^{(m)} \quad (13)$$

$$c_{T_1} = \frac{1}{M}\sum_{m=1}^{M} c_{T_1}^{(m)} \quad (14)$$

As shown in formula (13), in the fusion stage 2, the hidden state (taking $h_{T_1}$ as an example) may be initialized by averaging the hidden states output by the LSTM units of each row (the M LSTM units corresponding to the M rows) at the last time step t = T1 of the fusion stage 1. Similarly, as shown in formula (14), in the fusion stage 2, the memory cell (taking $c_{T_1}$ as an example) may be initialized by averaging the memory cells output by the LSTM units of each row (the M LSTM units corresponding to the M rows) at the last time step t = T1 of the fusion stage 1.

In some possible embodiments, in the fusion stage 2, for the hidden states included in the hidden state set output by the LSTM units of any row in the fusion stage 1, the following operations may be performed to obtain the context vector corresponding to each hidden state set.

For each time step (taking time step t as an example), the hidden state $h_t$ and memory cell $c_t$ of the LSTM unit at time step t satisfy:

$$\left[h_t, c_t\right] = \mathrm{LSTM}_t\!\left(h_{t-1}, \left[\hat{z}_t^{(1)}, \hat{z}_t^{(2)}, \dots, \hat{z}_t^{(M)}\right]\right) \quad (15)$$

where in formula (15), $\mathrm{LSTM}_t(\cdot)$ is the LSTM unit of time step t, which will not be repeated below, and $\hat{z}_t^{(m)}$ satisfies:

$$\hat{z}_t^{(m)} = \hat{f}_{att}^{(m)}\!\left(\mathcal{H}^{(m)}, h_{t-1}\right) \quad (16)$$

where in formula (16), $\hat{f}_{att}^{(m)}$ is an attention model of the fusion stage 2 (or fusion stage II), and $\hat{f}_{att}^{(m)}$ is different for different m. Thus, as shown in fig. 4, there are M attention models and T2 LSTM units in the fusion stage 2. One attention model in the fusion stage 2 performs image feature processing on one hidden state set obtained in the fusion stage 1, and based on one of the M attention models, the context vector corresponding to the hidden state set output by one row of LSTM units in the fusion stage 1 can be output. For example, the context vector corresponding to the hidden state set output by the m-th row in the fusion stage 1 can be output by the m-th attention model in the fusion stage 2. At any time step t, a target hidden state can be learned, based on an LSTM unit, from the M context vectors obtained by the M attention models and the hidden state output at time step t-1. Thus, in the fusion stage 2, T2 target hidden states are obtained by processing over the T2 time steps. Here, a target hidden state may also represent target image representation information of the image to be processed. The output of the fusion stage 2 is a set whose elements are hidden states (i.e., image representation information); for convenience of description, it is referred to as the target hidden state set. The target hidden state set satisfies:

$$C = \left\{h_{T_1+1}, h_{T_1+2}, \dots, h_{T_1+T_2}\right\} \quad (17)$$

The target hidden states in the target hidden state set shown in formula (17) may be used in the attention model of the decoder, where each target hidden state is used by the decoder to decode the image to be processed to obtain the image description of the image to be processed.
Third, the decoding stage:
The decoder decodes the final image representation (i.e., the target image representation information, such as the target implicit states) output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18), i.e., it is the memory cell produced at the last time step t = T1 + T2 of fusion stage 2.
in the decoding stage, at any time step t, the decoding of the decoder can be expressed as:
[h_t, c_t] = LSTM_dec(H_t, f_att-dec(C, h_{t-1}))  (19)
where LSTM_dec(·) denotes the LSTM unit of the decoder; in the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. H_t in equation (19) satisfies equation (20). f_att-dec(·) denotes the attention model used in the decoding stage, and C is the set of implicit states output by the fuser, i.e., the set of implicit states obtained by equation (17) above.
In general, let S be a natural sentence generated by a decoder to describe the image content of an image to be processed, and the length of the natural sentence S is n (n is a positive integer), and the value of n can be set according to actual needs. For example: setting n to 30, which means that the natural sentence S has a length of 30 words; the following steps are repeated: when n is set to 25, the natural sentence S is 25 words long. Since the natural sentence S has a length of n, it means that the decoder performs the decoding process n times in total in the decoding stage, that is, the decoder needs to perform the decoding process n time steps, and each decoding process needs to predict one word. I.e. the time step (or decoding time) t of the decoder in the decoding stage1Predicting the word s1At decoding time t2Predicting the word s2By analogy, at decoding time tnPredicting the word sn. That is, in the decoding phase, the decoder is at any decoding time tk(k is a positive integer, and k is more than or equal to 1 and less than or equal to n) predicting to obtain the word skThen, the decoder predicts the natural sentence S ═ S1,s2,...sk,...,sn}。
Optionally, the image processing flow of the embodiment of the present invention may further include an optimization stage, and an implementation manner of the optimization stage is described below, specifically as follows:
Fourthly, the optimization stage:
In some possible embodiments, after the encoder, the fuser and the decoder in the image processing system have encoded and decoded the image to be processed to obtain the natural sentence used for its image description, the fuser may obtain the image description of the image to be processed from the decoder and determine the discrimination supervision loss function of the image processing according to that image description. The fuser can then determine the loss function of the image processing, combining the discrimination supervision loss function with the M implicit state sets of the image to be processed obtained in fusion stage 1 and the target implicit states of the image to be processed obtained in fusion stage 2, and revise the network parameters of the LSTM units adopted in the fuser according to the loss function, so as to optimize the fuser's ability to process any image and output its implicit state sets and target implicit states.
Optionally, the above loss function may also be used to revise the network parameters of the LSTM unit in the decoder, so as to optimize the decoder's ability to process any image and output the natural sentence that describes it.
In some possible embodiments, the fuser may use a Discrimination Supervision (DS) image processing mechanism to further improve its own image processing performance. For example, at any time step the fuser may obtain from the encoder the M groups of image features corresponding to the image to be processed, where any group of image features includes the global image features and the local image features of the image to be processed. The global image features and the local image features in each group of image features output by the encoder can be arranged into a matrix, which, for convenience of description, is denoted V. According to the matrix V and a linear transformation matrix W, the quantity S used for discrimination supervision of the image description corresponding to the image features acquired at this time step is determined, where S satisfies:

S = Row_Max_Pool(WV)  (21)
where W is a linear transformation matrix and Row_Max_Pool(·) is the max-pooling operation along the row vectors of the matrix, i.e., taking the maximum value within each row vector. Denote the i-th element of S as s_i; a discrimination supervision loss function ℓ_DS is then defined, and ℓ_DS satisfies equation (22), in which the index runs over the frequent words appearing in the natural sentence description used for the image description of the image to be processed. Optionally, the frequent words may be taken as the first 1000 words with the highest occurrence probability in the natural sentences describing images. The choice of the first 1000 words is only an example and may be determined according to the actual application scenario, which is not limited herein.
With the discrimination supervision loss function ℓ_DS defined by equation (22), it can further be obtained that, for one <image, description> pair, the loss function ℓ of the image processing system provided by the embodiment of the present invention satisfies a form such as:

ℓ = -Σ_t log p(y_{t+1} | y_t) + λ · ℓ_DS  (23)

where λ is an empirical parameter used to balance the influence of the fuser's loss on the whole image processing system, and its value can be set according to practical experience; y_t is a word, and p(y_{t+1} | y_t) is calculated by applying a linear transformation and a SoftMax operation to the t-th implicit state output by the decoder at the t-th decoding time of the decoding stage. The fuser can correct the network parameters of its LSTM units with this loss function, so that the final image representation of the image to be processed output by the fuser is more accurate and the image processing accuracy of the image processing system is higher.
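A rough sketch of how such a combined loss could be computed is given below: a word-level cross-entropy term for the caption plus a λ-weighted discrimination supervision term built from the max-pooled scores of equation (21). The multi-label binary cross-entropy used here for ℓ_DS is an assumed stand-in, since the exact form of equations (22) and (23) is given only in the original filing.

```python
import torch
import torch.nn.functional as F

def discrimination_supervision_loss(V, W, frequent_targets):
    """V: (batch, d, k) matrix of stacked image features; W: (num_frequent, d)
    linear transformation; frequent_targets: (batch, num_frequent) float 0/1 labels
    marking which frequent words occur in the ground-truth description.
    Multi-label BCE is an assumed stand-in for equation (22)."""
    scores = torch.matmul(W, V)                 # (batch, num_frequent, k)
    s = scores.max(dim=2).values                # Row_Max_Pool(WV), equation (21)
    return F.binary_cross_entropy_with_logits(s, frequent_targets)

def total_loss(word_logits, word_targets, V, W, frequent_targets, lam=0.1):
    """Caption cross-entropy plus a lambda-weighted discrimination supervision term
    (a sketch of the combination described around equation (23))."""
    caption_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)), word_targets.reshape(-1))
    return caption_loss + lam * discrimination_supervision_loss(V, W, frequent_targets)
```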
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed.
Referring to fig. 5, fig. 5 is a flow chart illustrating an image processing method according to an embodiment of the invention. The image processing method provided by the embodiment of the invention can comprise the following steps of S101-S104:
S101, the fusion device obtains M groups of image characteristics of the image to be processed from the encoder, and obtains first image representation information corresponding to each group of image characteristics in the M groups of image characteristics.
In some possible embodiments, the encoder may perform image encoding on the image to be processed through the CNN, may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and may output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image. The global image features and the local image features output by one CNN at the encoder end can be combined to obtain a group of image features of the image to be processed, and M CNNs at the encoder end can correspondingly output M groups of image features of the image to be processed.
Assume that the global image feature obtained by the m-th CNN of the M CNNs through feature extraction on the image to be processed is denoted ā^(m), and that the local image features corresponding to the sub-regions of the image to be processed are denoted A^(m) = {a_1^(m), ..., a_k^(m)}, where m may be any value from 1 to M, which is not limited herein. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where ā^(m) and A^(m) together form the group of image features corresponding to the m-th CNN. In the fusion stage, the fuser can acquire the M groups of image features from the encoder, fuse them to obtain the final image representation of the image to be processed, and output it to the decoder.
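The split of each group into one global feature vector and a set of sub-region (local) feature vectors can be pictured with the small CNN below. The toy network, its layer sizes, and the names are assumptions; any CNN whose fully-connected output gives ā^(m) and whose last convolutional feature map gives A^(m) matches the description.

```python
import torch
import torch.nn as nn

class ToyEncoderCNN(nn.Module):
    """Minimal CNN sketch: the last conv layer provides the k local (sub-region)
    feature vectors A^(m), and a fully-connected layer provides the global feature."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(feat_dim, feat_dim)   # last fully-connected layer

    def forward(self, images):
        fmap = self.backbone(images)                       # (batch, feat_dim, H', W')
        b, d, h, w = fmap.shape
        local = fmap.view(b, d, h * w).permute(0, 2, 1)    # A^(m): k = H'*W' region vectors
        global_feat = self.fc(self.pool(fmap).flatten(1))  # global image feature
        return global_feat, local

# one group of image features from one CNN; M different CNNs would give M groups
encoder = ToyEncoderCNN()
g, A = encoder(torch.randn(2, 3, 64, 64))   # g: (2, 512), A: (2, 256, 512)
```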
In some possible embodiments, the fuser may process the image features obtained from the encoder using an LSTM unit, where the LSTM unit used by the fuser may be an LSTM with an attention model, which is essentially a function with state and can be abstractly represented as h_t = LSTM(H_t, f_att(A, h_{t-1})). Here H_t is constructed from the vector x_t and the implicit state h_{t-1}: x_t is the input of the image feature processing step corresponding to time t (which may be referred to as the t-th time step, or time step t for convenience of description), and h_{t-1} is the implicit state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). A = [a_1, a_2, ..., a_k] is a set of annotation vectors, i.e., A denotes a set whose elements are annotation vectors, for example the local image features A^(m) of the image to be processed mentioned above, which is not limited herein. f_att(A, h_{t-1}) is an attention model; denote its output at the t-th time step as the vector z_t, where z_t is a context vector. In the LSTM unit, the linear transformation is represented by T, and the image feature processing process of the LSTM unit with the attention model may refer to the implementations provided by equations (1) to (3) in fusion stage 1, which are not repeated here.
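The abstract form h_t = LSTM(H_t, f_att(A, h_{t-1})) can be pictured with the following sketch of a soft attention function f_att over the annotation vectors; the additive scoring form and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """f_att(A, h_prev): weight the annotation vectors a_1..a_k by their relevance
    to the previous implicit state and return the context vector z_t (a sketch)."""

    def __init__(self, feat_dim, hidden_dim, att_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, att_dim)
        self.proj_h = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, A, h_prev):
        # A: (batch, k, feat_dim), h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(A) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights over the k regions
        z = (alpha * A).sum(dim=1)               # context vector z_t
        return z
```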
In some possible embodiments, in fusion stage 1, assume that at time step t the implicit state and the memory cell of the m-th row of LSTM units are recorded as h_t^(m) and c_t^(m). The implicit state and the memory cell of each row of LSTM units are initialized before time step t; taking the implicit state h_0^(m) and the memory cell c_0^(m) of the m-th row of LSTM units as an example, h_0^(m) and c_0^(m) satisfy equation (6) above, which is not repeated here. The above h_0^(m) is the initial implicit state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the implicit state corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first implicit state. Therefore, through this implementation, the first implicit state, i.e., the first image representation information, corresponding to each of the M groups of image features can be obtained.
S102, the fusion device generates M image representation information sets according to the image features and the first image representation information corresponding to the image features.
In some possible embodiments, one group of the M groups of image features output by the M CNNs corresponds to one implicit state set (i.e., one image representation information set), and one implicit state set includes at least one implicit state, which, for convenience of description, is referred to as a second implicit state (i.e., second image representation information). At time step t, when the fuser fuses the M groups of image features obtained from the encoder, the implicit state h_t^(m) and the memory cell c_t^(m) of the m-th row of LSTM units in fusion stage 1 satisfy the implementations provided by equations (7) and (8) above, which are not repeated here. Similarly, the context vectors corresponding to the other M-1 groups of image features can be output according to the attention models of the rows other than the m-th row and the first implicit states corresponding to those groups of image features.

In equation (8), LSTM_t^(m)(·) is the LSTM unit of the m-th row at time step t; it learns from the context vector z_m corresponding to the m-th group of image features and the implicit states (e.g., the first implicit states) corresponding to the previous time step t-1 of each group of image features output by the encoder, and outputs the implicit state corresponding to the m-th group of image features.

H_t in equation (8) is a vector obtained by stacking the implicit states corresponding to each group of image features (e.g., the first implicit state corresponding to each group of image features) at the time step before time step t (i.e., time step t-1). For example, assuming time step t = 1, H_1 satisfies the implementation provided by equation (10) above, which is not repeated here.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the output implicit states after the LSTM units of each row at different time steps process the image features of the image to be processed can be fused into one set of implicit states. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. The implicit state set output by the LSTM unit in the mth row correspondingly satisfies the above equation (12), which is not described herein again.
For convenience of description, in fusion stage 1, at time step t1 the implicit state output by the LSTM unit of each row can be represented by implicit state A, and at time step t2, which follows time step t1, the implicit state output by the LSTM unit of each row can be represented by implicit state B. In the embodiment of the present invention, implicit state A and implicit state B are used to represent the implicit states (i.e., image representation information) generated at different time steps in fusion stage 1; they may also be represented in other forms according to the requirements of the actual application scenario, which is not limited herein.
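The row-wise processing of fusion stage 1 described above can be sketched as follows: each of the M rows owns its own attention model and LSTM cell, and the implicit states it emits at successive time steps (implicit state A, implicit state B, and so on) are collected into that row's implicit state set. The concatenated form of H_t, the additive attention, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionStage1(nn.Module):
    """M rows of attention LSTM units: row m attends over the local features of the
    m-th image-feature group and updates its own implicit state. The states emitted
    over the T1 time steps form that row's implicit state set (a sketch only)."""

    def __init__(self, M, feat_dim, hidden_dim):
        super().__init__()
        self.M = M
        self.proj_a = nn.ModuleList([nn.Linear(feat_dim, hidden_dim) for _ in range(M)])
        self.proj_h = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(M)])
        self.score = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(M)])
        # input of row m: its context vector z_m plus the stacked previous states H_t
        self.cells = nn.ModuleList(
            [nn.LSTMCell(feat_dim + M * hidden_dim, hidden_dim) for _ in range(M)])

    def forward(self, locals_, h, c, T1=2):
        # locals_: list of M tensors (batch, k, feat_dim); h, c: lists of M (batch, hidden_dim)
        h, c = list(h), list(c)
        state_sets = [[] for _ in range(self.M)]
        for _ in range(T1):
            H_t = torch.cat(h, dim=1)                      # stacked previous implicit states
            new_h, new_c = [], []
            for m in range(self.M):
                e = self.score[m](torch.tanh(
                    self.proj_a[m](locals_[m]) + self.proj_h[m](h[m]).unsqueeze(1)))
                alpha = torch.softmax(e, dim=1)
                z_m = (alpha * locals_[m]).sum(dim=1)      # context vector of row m
                hm, cm = self.cells[m](torch.cat([z_m, H_t], dim=1), (h[m], c[m]))
                new_h.append(hm)
                new_c.append(cm)
                state_sets[m].append(hm)                   # implicit state A, then B, ...
            h, c = new_h, new_c
        # M implicit state sets, each (batch, T1, hidden_dim)
        return [torch.stack(s, dim=1) for s in state_sets], h, c
```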
S103, the fusion device fuses the second image representation information included in the M image representation information sets to obtain target image representation information.
In some possible embodiments, as shown in fig. 4, in the fusion stage 2, different LSTM units may be included at different time steps, and each LSTM unit may be configured to re-fuse M hidden state sets (i.e., M image representation information sets) obtained by processing in the fusion stage 1 to obtain a final hidden state, which may be illustrated by taking a target hidden state as an example for convenience of description. Here, the target hidden state may be used to represent final image representation information of the image to be processed. Optionally, the final image representation information of the image to be processed may also be represented by information in other representation forms besides the hidden state, which may be specifically determined according to the actual application scenario, and is not limited herein.
As shown in fig. 4, at the initial time step of fusion stage 2 (assumed to be time step T0), the initial implicit state of fusion stage 2 may be determined according to the implicit state of the last time step of fusion stage 1 (e.g., implicit state B when T1 = 2 in fusion stage 1) included in each implicit state set obtained in fusion stage 1; for convenience of description, this implicit state is referred to as the third implicit state. The initialization of the initial implicit state of fusion stage 2 and of its initial memory cell satisfies equations (13) and (14) above, which are not repeated here. As shown in equation (13), in fusion stage 2 the initial implicit state can be obtained by averaging the implicit states output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1. Similarly, as shown in equation (14), in fusion stage 2 the initial memory cell can be obtained by averaging the memory cells output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1.
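Under the averaging initialization just described for equations (13) and (14), the initial state of fusion stage 2 could be computed as in the short sketch below; the tensor layout is an assumption.

```python
import torch

def init_fusion_stage2(last_h, last_c):
    """last_h, last_c: lists of M tensors (batch, hidden_dim) holding the implicit
    states and memory cells output by the M rows at the last time step t = T1 of
    fusion stage 1. Returns their element-wise averages (equations (13) and (14))."""
    h0 = torch.stack(last_h, dim=0).mean(dim=0)
    c0 = torch.stack(last_c, dim=0).mean(dim=0)
    return h0, c0
```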
In some possible embodiments, in fusion stage 2, the following operations may be performed on each implicit state set output by a row of LSTM units in fusion stage 1 to obtain the context vector corresponding to that implicit state set. For each time step (time step t is taken as an example), the implicit state h_t and the memory cell c_t of the LSTM unit at time step t satisfy equations (15) and (16) above, which are not repeated here. As shown in fig. 4, fusion stage 2 has M attention models and T2 LSTM units. One attention model in fusion stage 2 performs image feature processing on one implicit state set obtained in fusion stage 1: based on one of the M attention models, the context vector corresponding to the implicit state set output by one row of LSTM units in fusion stage 1 can be output. For example, the m-th attention model in fusion stage 2 can output the context vector corresponding to the implicit state set m output by the m-th row of fusion stage 1. At any time step t, the M context vectors obtained from the M attention models, together with the implicit state output at time step t-1, are learned by an LSTM unit to obtain a target implicit state. Thus, in fusion stage 2, processing over T2 time steps yields T2 target implicit states. The output of fusion stage 2 is a set whose elements are implicit states; for convenience of description, it is referred to as the target implicit state set. The target implicit state set satisfies equation (17) above, which is not repeated here. The implicit states shown in equation (17) may be used in the attention model of the decoder, where each target implicit state in the target implicit state set is used by the decoder to decode the image to be processed and obtain the image description of the image to be processed.
And S104, outputting the target image representation information to a decoder.
In some possible embodiments, after the fuser obtains the target implicit state (i.e., the target image representation information) corresponding to the image to be processed, the target implicit state may be output to the decoder. In the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. The decoder decodes the final image representation output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18) above, which is not repeated here. In the decoding stage, at any time step t, the decoding of the decoder can be represented by the expression provided by equation (19) above, which is not repeated here.
In general, let S be the natural sentence generated by the decoder to describe the image content of the image to be processed, and let the length of S be n (n is a positive integer), where the value of n can be set according to actual needs. For example, setting n = 30 means that S is 30 words long; likewise, setting n = 25 means that S is 25 words long. Since S has length n, the decoder performs the decoding process n times in the decoding stage, that is, the decoder needs n decoding time steps, and each decoding step predicts one word. That is, at decoding time step t_1 the decoder predicts the word s_1, at decoding time t_2 it predicts the word s_2, and so on, until at decoding time t_n it predicts the word s_n. In other words, in the decoding stage, the decoder predicts the word s_k at any decoding time t_k (k is a positive integer, 1 ≤ k ≤ n), and the natural sentence predicted by the decoder is S = {s_1, s_2, ..., s_k, ..., s_n}.
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed.
Referring to fig. 6, fig. 6 is another schematic flow chart of the image processing method according to an embodiment of the present invention. The image processing method provided by the embodiment of the present invention may include the following steps S201 to S211:
S201, the encoder outputs M groups of image characteristics of the image to be processed to the fusion device.
In some possible embodiments, an original image (i.e., an image to be processed) is input into an encoder, and feature extraction is performed on each frame of the image to be processed by the encoder, so as to obtain a frame feature sequence. The encoder may perform image encoding on an image to be processed through the CNN, and may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image. The global image features and the local image features output by one CNN at the encoder end can be combined to obtain a group of image features of the image to be processed. Assume that the number of CNNs employed by the encoder side is M, where one encoder corresponds to one CNN, or the number of encoding channels in one encoder is M, where one encoding channel corresponds to one CNN. The network parameters adopted by different CNNs are different, and therefore, the M CNNs corresponding to the M encoders, or the M CNNs corresponding to the M encoding channels of one encoder, may be M different CNNs. The M CNNs at the encoder end can correspondingly output M groups of image characteristics of the image to be processed. Assume that the global image feature obtained by feature extraction of the mth CNN of the M CNNs on the image to be processed is represented as
ā^(m), and that the local image features corresponding to the sub-regions of the image to be processed are expressed as A^(m) = {a_1^(m), ..., a_k^(m)}. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where ā^(m) and A^(m) together form the group of image features corresponding to the m-th CNN.
S202, the fusion device generates first image representation information corresponding to each group of image features according to the global image features and the specified linear transformation matrix in each group of image features in the M groups of image features acquired from the encoder.
In some possible embodiments, the LSTM unit used by the fuser can be an LSTM with an attention model, which is essentially a function with state and can be abstractly represented as h_t = LSTM(H_t, f_att(A, h_{t-1})). Here H_t is constructed from the vector x_t and the implicit state h_{t-1}: x_t is the input of the image feature processing step corresponding to time t (which may be referred to as the t-th time step, or time step t for convenience of description), and h_{t-1} is the implicit state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). Assume that at time step t, the image representation output by the m-th CNN at the encoder side is input to the m-th row of fusion stage 1, as shown in fig. 4. In fusion stage 1, assume that at time step t the implicit state and the memory cell of the m-th row of LSTM units are recorded as h_t^(m) and c_t^(m). The implicit state and the memory cell of each row of LSTM units are initialized before time step t; taking the implicit state h_0^(m) and the memory cell c_0^(m) of the m-th row of LSTM units as an example, h_0^(m) and c_0^(m) satisfy equation (6) above, which is not repeated here. The above h_0^(m) is the initial implicit state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the implicit state (i.e., image representation information) corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first implicit state (i.e., the first image representation information).
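Step S202 derives the first image representation information from the global image feature through a specified linear transformation matrix. A minimal sketch of such a mapping is given below; the tanh nonlinearity, the separate matrices for the implicit state and the memory cell, and the names are assumptions.

```python
import torch
import torch.nn as nn

class InitialStateFromGlobal(nn.Module):
    """Map the global image feature of the m-th group to the initial implicit state
    (first image representation information) and initial memory cell of the m-th
    row, via specified linear transformation matrices (a sketch only)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.W_h = nn.Linear(feat_dim, hidden_dim)   # specified linear transformation
        self.W_c = nn.Linear(feat_dim, hidden_dim)

    def forward(self, global_feat):
        # global_feat: (batch, feat_dim), e.g. the feature ā^(m) of the m-th CNN
        h0 = torch.tanh(self.W_h(global_feat))       # first implicit state h_0^(m)
        c0 = torch.tanh(self.W_c(global_feat))       # initial memory cell c_0^(m)
        return h0, c0
```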
S203, the fusion device learns any group of image features and first image representation information corresponding to each group of image features based on the first LSTM unit to obtain image representation information A corresponding to any group of image features.
S204, the fusion device learns any group of image characteristics and the image representation information A based on the second LSTM unit to obtain image representation information B corresponding to any group of image characteristics.
In some possible embodiments, any one of the M sets of image features output by the encoder end includes both the global image feature of the image to be processed and the sub-region local image feature of the image to be processed. The fusion device learns the sub-region local image features in any one group of image features in the M groups of image features output by the encoder and the first implicit state (i.e. the first image representation information) corresponding to any one group of image features based on the attention model in the LSTM unit (for convenience of description, the first LSTM can be taken as an example) in the fusion stage 1, and outputs the context vector corresponding to any one group of image features. The fuser can learn the context vector and the first hidden state corresponding to any group of image features based on the first LSTM to obtain the hidden state A corresponding to any group of image features.
In some possible embodiments, one group of the M groups of image features output by the M CNNs corresponds to one implicit state set, and one implicit state set includes at least one implicit state, which, for convenience of description, is referred to as a second implicit state. At time step t, when the fuser fuses the M groups of image features obtained from the encoder, the implicit state h_t^(m) and the memory cell c_t^(m) of the m-th row of LSTM units in fusion stage 1 satisfy the implementations provided by equations (7) and (8) above, which are not repeated here. The attention model in the m-th row of fusion stage 1 (or fusion stage I), i.e., the attention model carried in the LSTM unit (e.g., the first LSTM unit) of the m-th row, can output the context vector z_m corresponding to the m-th group of image features according to the sub-region local image features in the m-th group of image features output by the m-th CNN (i.e., A^(m)) and the first implicit state corresponding to the m-th group of image features, where z_m satisfies equation (9) above, which is not repeated here. Similarly, the context vectors corresponding to the other M-1 groups of image features can be output according to the attention models of the rows other than the m-th row and the first implicit states corresponding to those groups of image features.
In equation (8), LSTM_t^(m)(·) is the LSTM unit of the m-th row at time step t; it learns from the context vector z_m corresponding to the m-th group of image features and the implicit states (e.g., the first implicit states) corresponding to the previous time step t-1 of each group of image features output by the encoder, and outputs the implicit state corresponding to the m-th group of image features.

H_t in equation (8) is a vector obtained by stacking the implicit states corresponding to each group of image features (e.g., the first implicit state corresponding to each group of image features) at the time step before time step t (i.e., time step t-1), for example as in the implementations provided by equations (10) and (11) above, which are not repeated here.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the output implicit states after the LSTM units of each row at different time steps process the image features of the image to be processed can be fused into one set of implicit states. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. The implicit state set output by the LSTM unit in the mth row correspondingly satisfies the condition (12), which is not described herein again.
For convenience of description, in fusion stage 1, at time step t1 (e.g., t = 1), the implicit state output by the LSTM unit of each row can be represented by implicit state A. At time step t2 (e.g., t = 2 = T1) following time step t1, the implicit state output by the LSTM unit of each row can be represented by implicit state B. In the embodiment of the present invention, implicit state A and implicit state B are used to represent the implicit states generated at different time steps in fusion stage 1; they may also be represented in other forms according to the requirements of the actual application scenario, which is not limited herein.
In some possible embodiments, in the fusion stage 1, the fuser may combine the hidden state a and the hidden state B corresponding to any group of image features to obtain a set of hidden states corresponding to the group of image features. For convenience of description, an implicit state set i may represent an implicit state set corresponding to any group of image features, where the implicit state a and the implicit state B are second implicit states included in the implicit state set i. In the fusion stage 1, the fusion device can obtain the hidden state sets corresponding to each group of image features through the LSTM units of each row, and then obtain M hidden state sets corresponding to the M groups of image features. Each hidden state set in the M hidden state sets can be used in the fusion stage 2 to further fuse and obtain a target hidden state of the image to be processed.
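As a small usage-style illustration of the combination just described, the snippet below stacks the implicit state A and implicit state B produced by each row into that row's implicit state set i and gathers the M sets; all shapes and the random placeholder tensors are assumptions.

```python
import torch

batch, hidden_dim, M = 2, 512, 3

# implicit state A (time step t1) and implicit state B (time step t2) for each row
state_A = [torch.randn(batch, hidden_dim) for _ in range(M)]
state_B = [torch.randn(batch, hidden_dim) for _ in range(M)]

# implicit state set i for row m: its states stacked along the time axis
state_sets = [torch.stack([state_A[m], state_B[m]], dim=1) for m in range(M)]

print(len(state_sets), state_sets[0].shape)   # M sets, each (batch, 2, hidden_dim)
```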
S205, the fuser determines third image representing information from the image representing information B included in each set of image representing information.
In some possible embodiments, as shown in fig. 4, fusion stage 2 may include different LSTM units at different time steps, and each LSTM unit may be configured to re-fuse the M implicit state sets obtained in fusion stage 1 to obtain the final implicit state of the image to be processed. As shown in fig. 4, at the initial time step of fusion stage 2 (assumed to be time step T0), the initial implicit state of fusion stage 2 may be determined according to the implicit state of the last time step of fusion stage 1 (e.g., implicit state B at time step t2 = T1 of fusion stage 1) included in each implicit state set obtained in fusion stage 1; for convenience of description, this implicit state is referred to as the third implicit state. The initialization of the initial implicit state of fusion stage 2 and of its initial memory cell satisfies the implementations provided by equations (13) and (14) above, which are not repeated here. As shown in equation (13), in fusion stage 2 the initial implicit state can be obtained by averaging the implicit states output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1. Similarly, as shown in equation (14), in fusion stage 2 the initial memory cell can be obtained by averaging the memory cells output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1.
S206, the third image representing information and the second image representing information in each image representing information set are learned based on the attention model, and a context vector corresponding to each image representing information set is output.
And S207, generating target image representation information according to the target vector matrix formed by the M context vectors and the third image representation information based on the third LSTM unit.
In some possible embodiments, during the processing of fusion stage 2, one implicit state set corresponds to one attention model, and the M context vectors corresponding to the M implicit state sets can be obtained based on M LSTM units carrying attention models. In a specific implementation, in fusion stage 2, for each implicit state set output by a row of LSTM units in fusion stage 1, the following operations may be performed to obtain the context vector corresponding to that implicit state set:

for each time step (time step t is taken as an example), the implicit state h_t and the memory cell c_t of the LSTM unit at time step t satisfy:

[h_t, c_t] = LSTM_t(h_{t-1}, Z_t)  (15)

where LSTM_t(·) in equation (15) is the LSTM unit for time step t, and Z_t = [z_t^(1), ..., z_t^(M)] is the target vector matrix formed by the M context vectors. Each context vector z_t^(m) satisfies:

z_t^(m) = f_att-fus2^(m)(H^(m), h_{t-1})  (16)

where f_att-fus2^(m) in equation (16) is the m-th attention model of fusion stage 2 (or fusion stage II), H^(m) is the m-th implicit state set output by fusion stage 1, and different values of m correspond to different attention models. Each attention model outputs one context vector, so the M attention models output M context vectors, from which the target vector matrix Z_t is obtained.

In some possible embodiments, at time step t1 of fusion stage 2, the fuser may learn the target vector matrix formed by the M context vectors together with the third implicit state based on the LSTM unit corresponding to time step t1 (for convenience of description, LSTM1), to obtain the implicit state C corresponding to the M context vectors at time step t1. Further, at the next time step t2 after time step t1, the fuser may learn the second implicit states included in the M implicit state sets together with the implicit state obtained at time step t1 (for convenience of description, implicit state C) based on the LSTM unit corresponding to time step t2 (for convenience of description, LSTM2), to obtain the implicit state corresponding to the M context vectors at time step t2 (for convenience of description, implicit state D). In fusion stage 2, the LSTM unit corresponding to any later time step (e.g., time step t3) after time step t2 may likewise learn the second implicit states included in the M implicit state sets together with the implicit state output at the previous time step (e.g., time step t2) to obtain the implicit state corresponding to time step t3 (for convenience of description, implicit state E), and so on, until the last time step of fusion stage 2, at which the LSTM unit corresponding to that time step outputs the last implicit state of the image to be processed in fusion stage 2. For convenience of description, the embodiment of the present invention is illustrated by taking two time steps (time step t1 and time step t2) included in fusion stage 2 as an example. The fuser can combine the implicit state C output by LSTM1 and the implicit state D output by LSTM2 in fusion stage 2 to obtain the target implicit state set, and determine implicit state C and implicit state D as the target implicit states of the image to be processed.
For example, as shown in fig. 4, fusion stage 2 has M attention models and T2 LSTM units. One attention model in fusion stage 2 performs image feature processing on one implicit state set obtained in fusion stage 1: based on one of the M attention models, the context vector corresponding to the implicit state set output by one row of LSTM units in fusion stage 1 can be output. For example, the m-th attention model in fusion stage 2 can output the context vector corresponding to the implicit state set m output by the m-th row of fusion stage 1. At time step t, the M context vectors obtained from the M attention models, together with the implicit state output at time step t-1, are learned by an LSTM unit to obtain a target implicit state. Thus, in fusion stage 2, processing over T2 time steps yields T2 target implicit states. The output of fusion stage 2 is a set whose elements are implicit states; for convenience of description, it is referred to as the target implicit state set. The target implicit state set satisfies equation (17) above, which is not repeated here. The implicit state set shown in equation (17) can be used in the attention model of the decoder, where each target implicit state is used by the decoder to decode the image to be processed and obtain the image description of the image to be processed.
In a specific implementation, the fusion of the M groups of image features obtained from the encoder by the fusion device may include a fusion stage 1 and a fusion stage 2, each stage may include a plurality of image feature processing steps, and more implementation manners provided in the fusion stage 1 and the fusion stage 2 may be specifically referred to, and are not described herein again.
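Taking T2 = 2 as in the example above, the usage-style sketch below runs two fusion stage 2 steps (LSTM1 and LSTM2) over randomly generated implicit state sets and collects implicit state C and implicit state D as the target implicit state set, together with the final memory cell handed to the decoder. The inline attention form, the zero initial memory cell, and all dimensions are assumptions rather than the filed design.

```python
import torch
import torch.nn as nn

batch, hidden_dim, M, T1, T2 = 2, 512, 3, 2, 2

# M implicit state sets from fusion stage 1, each (batch, T1, hidden_dim)
state_sets = [torch.randn(batch, T1, hidden_dim) for _ in range(M)]

# initialization: average the last-step states of the M rows (the third implicit state)
h = torch.stack([s[:, -1] for s in state_sets]).mean(dim=0)
c = torch.zeros(batch, hidden_dim)   # assumed zero initial memory, for brevity only

# one attention scorer per implicit state set, and one LSTM cell per time step
scorers = [nn.Linear(2 * hidden_dim, 1) for _ in range(M)]
cells = [nn.LSTMCell(M * hidden_dim, hidden_dim) for _ in range(T2)]   # LSTM1, LSTM2

targets = []
for step in range(T2):
    contexts = []
    for m in range(M):
        query = h.unsqueeze(1).expand(-1, T1, -1)
        e = scorers[m](torch.cat([state_sets[m], query], dim=2))   # (batch, T1, 1)
        alpha = torch.softmax(e, dim=1)
        contexts.append((alpha * state_sets[m]).sum(dim=1))        # context vector m
    h, c = cells[step](torch.cat(contexts, dim=1), (h, c))
    targets.append(h)                      # implicit state C, then implicit state D

target_set = torch.stack(targets, dim=1)   # target implicit state set, (batch, T2, hidden_dim)
final_memory = c                           # memory cell handed on to the decoder
```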
S208, the target image representation information is output to the decoder.
In some possible embodiments, after the fuser obtains the target implicit state (i.e., the target image representation information) corresponding to the image to be processed, the target implicit state may be output to the decoder. In the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. The decoder decodes the final image representation output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18) above, which is not repeated here. In the decoding stage, at any time step t, the decoding of the decoder can be represented by the expression provided by equation (19) above, which is not repeated here.
In general, let S be the natural sentence generated by the decoder to describe the image content of the image to be processed, and let the length of S be n (n is a positive integer), where the value of n can be set according to actual needs. For example, setting n = 30 means that S is 30 words long; likewise, setting n = 25 means that S is 25 words long. Since S has length n, the decoder performs the decoding process n times in the decoding stage, that is, the decoder needs n decoding time steps, and each decoding step predicts one word. That is, at decoding time step t_1 the decoder predicts the word s_1, at decoding time t_2 it predicts the word s_2, and so on, until at decoding time t_n it predicts the word s_n. In other words, in the decoding stage, the decoder predicts the word s_k at any decoding time t_k (k is a positive integer, 1 ≤ k ≤ n), and the natural sentence predicted by the decoder is S = {s_1, s_2, ..., s_k, ..., s_n}.
S209, the fusion device obtains the image description of the image to be processed from the decoder, and determines the discrimination supervision loss function of the image processing according to the image description of the image to be processed.
S210, the fusion device constructs a loss function of the image processing system by combining the discrimination supervision loss function according to the M image representation information sets and the target image representation information of the image to be processed.
S211, the fusion device corrects the network parameters of the LSTM unit according to the loss function, and the image processing performance of the image processing system is optimized.
In some possible embodiments, after the encoder, the fuser and the decoder in the image processing system have encoded and decoded the image to be processed to obtain the natural sentence used for its image description, the fuser may obtain the image description of the image to be processed from the decoder and determine the discrimination supervision loss function of the image processing system according to that image description. The fuser can then determine the loss function of the image processing, combining the discrimination supervision loss function with the M implicit state sets of the image to be processed obtained in fusion stage 1 and the target implicit states of the image to be processed obtained in fusion stage 2, and revise the network parameters of the LSTM units adopted in the fuser according to the loss function, so as to optimize the fuser's ability to process any image and output its implicit state sets and target implicit states.
Optionally, the above loss function may also be used to revise the network parameters of the LSTM unit in the decoder, so as to optimize the decoder's ability to process any image and output the natural sentence that describes it.
In some possible embodiments, the fuser may use a discrimination supervision image processing mechanism to further improve its own image processing performance. For example, at any time step the fuser may obtain from the encoder the M groups of image features corresponding to the image to be processed, where any group of image features includes the global image features and the local image features of the image to be processed. The global image features and the local image features in each group of image features output by the encoder can be arranged into a matrix, which, for convenience of description, is denoted V. According to the matrix V and the linear transformation matrix W, the quantity S used for discrimination supervision of the image description corresponding to the image features acquired at this time step is determined, where S satisfies equation (21) above, which is not repeated here.
In some possible embodiments, denote the i-th element of S as s_i; a discrimination supervision loss function ℓ_DS is then defined, and ℓ_DS satisfies equation (22) above, in which the index runs over the frequent words appearing in the natural sentence description used for the image description of the image to be processed. Optionally, the frequent words may be taken as the first 1000 words with the highest occurrence probability in the natural sentences describing images. The choice of the first 1000 words is only an example and may be determined according to the actual application scenario, which is not limited herein.
With the discrimination supervision loss function ℓ_DS defined by equation (22), it can further be obtained that, for one <image, description> pair, the loss function ℓ of the image processing system provided by the embodiment of the present invention satisfies a form such as:

ℓ = -Σ_t log p(y_{t+1} | y_t) + λ · ℓ_DS  (23)

where λ is an empirical parameter used to balance the influence of the fuser's loss on the whole image processing system, and its value can be set according to practical experience; y_t is a word, and p(y_{t+1} | y_t) is calculated by applying a linear transformation and a SoftMax operation to the t-th implicit state output by the decoder at the t-th decoding time of the decoding stage. The fuser can correct the network parameters of its LSTM units with this loss function, so that the final image representation of the image to be processed output by the fuser is more accurate and the image processing accuracy of the image processing system is higher.
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed. In addition, the image processing method provided by the embodiment of the present invention can also construct a loss function from the output data of the fuser and the decoder, modify the network parameters of the LSTM units in the fuser and/or the decoder through this loss function, and optimize the performance of the fuser and the decoder, thereby further improving the image processing performance of the image processing system and enhancing the user stickiness of the image processing system.
Based on the description of the embodiments of the image processing system and the image processing method, the embodiment of the invention also discloses an image processing apparatus, which can be a computer program (including a program code) running in a server, and the image processing apparatus can be applied to the image processing methods of the embodiments shown in fig. 5-6 for executing the steps in the image processing methods. Referring to fig. 7, the image processing apparatus operates as follows:
an obtaining unit 61, configured to obtain M sets of image features of the image to be processed from the encoder, where M is an integer not less than 2.
The obtaining unit 61 is further configured to obtain a first hidden state corresponding to each of the M groups of image features.
A first fusion unit 62, configured to generate M sets of image representation information according to the sets of image features acquired by the acquisition unit 61 and the first image representation information corresponding to the sets of image features.
The image representation information set is generated by a group of image features correspondingly, and the image representation information set comprises at least one piece of second image representation information.
A second fusion unit 63 configured to learn second image representation information included in the M image representation information sets obtained by the first fusion unit 62 to obtain target image representation information;
an output unit 64 for outputting the target image representation information obtained by the second fusion unit 63 to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In some possible embodiments, the obtaining unit 61 is configured to:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
Acquiring M groups of image features of an image to be processed from M encoding channels of an encoder of the image processing system, wherein one encoding channel of the encoder corresponds to one group of encoding parameters, and one encoding channel outputs one group of image features.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the above-mentioned acquisition unit 61 is configured to:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In some possible embodiments, the first fusing unit 62 is configured to:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one set of image features to obtain an image representation information set i corresponding to any one set of image features, wherein the image representation information A and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the first fusing unit 62 is configured to:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any one group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any one group of image features.
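Where each group also carries sub-region local image features, the attention step inside the first LSTM unit can be sketched as follows, assuming an additive attention form and the stated layer sizes: the sub-regions are weighted against the group's first image representation information to give a context vector, which the first LSTM unit then consumes.

```python
import torch
import torch.nn as nn

region_dim, hidden_dim = 2048, 512
att_region = nn.Linear(region_dim, hidden_dim)   # projects sub-region local features
att_hidden = nn.Linear(hidden_dim, hidden_dim)   # projects first image representation info
att_score = nn.Linear(hidden_dim, 1)
lstm1 = nn.LSTMCell(region_dim, hidden_dim)      # first LSTM unit, fed the context vector

def attend_and_encode(region_feats, first_rep):
    """region_feats: (batch, num_regions, region_dim); first_rep: (batch, hidden_dim)."""
    # Additive attention scores of each sub-region against first_rep (an assumption).
    scores = att_score(torch.tanh(att_region(region_feats)
                                  + att_hidden(first_rep).unsqueeze(1)))   # (B, R, 1)
    weights = torch.softmax(scores, dim=1)
    context = (weights * region_feats).sum(dim=1)     # context vector, (B, region_dim)
    # The first LSTM unit then learns the context vector with first_rep as prior state.
    h_a, _ = lstm1(context, (first_rep, torch.zeros_like(first_rep)))
    return context, h_a                               # h_a is image representation info A
```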
In some possible embodiments, the second fusing unit 63 is configured to:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning based on the third LSTM unit according to the M context vectors in the target vector matrix and the third image representation information to obtain target image representation information.
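A sketch of how the second fusion unit might form the M context vectors and the target vector matrix, under assumed shapes: the third image representation information queries each image representation information set through that set's own attention model, and the resulting M context vectors are stacked into the target vector matrix. The additive attention form is an assumption of this sketch.

```python
import torch
import torch.nn as nn

hidden_dim, M = 512, 2

class SetAttention(nn.Module):
    """One attention model per image representation information set."""
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, set_reps, third_rep):
        # set_reps: (batch, set_size, dim) second image representation information.
        s = self.score(torch.tanh(self.key(set_reps)
                                  + self.query(third_rep).unsqueeze(1)))   # (B, S, 1)
        w = torch.softmax(s, dim=1)
        return (w * set_reps).sum(dim=1)            # context vector, (batch, dim)

set_attentions = nn.ModuleList(SetAttention(hidden_dim) for _ in range(M))

def target_vector_matrix(rep_sets, third_rep):
    """rep_sets: list of M tensors (batch, set_size, hidden_dim), e.g. A and B stacked."""
    contexts = [att(reps, third_rep) for att, reps in zip(set_attentions, rep_sets)]
    return torch.stack(contexts, dim=1)             # target vector matrix, (batch, M, dim)
```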
In some possible embodiments, the third LSTM unit used in the second fusion unit 63 at least includes LSTM1 and LSTM2, and the second fusion unit 63 is configured to:
learning the M context vectors in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D included in the target image representation information set as target image representation information.
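The two-stage third LSTM unit can be sketched, under assumptions, as follows: LSTM1 digests the M context vectors of the target vector matrix with the third image representation information as its prior state to give image representation information C, and LSTM2 then digests the per-set second image representation information with C as its prior state to give D; C and D together form the target image representation information. Flattening by concatenation is an assumption of this sketch.

```python
import torch
import torch.nn as nn

hidden_dim, M = 512, 2
lstm_f1 = nn.LSTMCell(M * hidden_dim, hidden_dim)        # "LSTM1" of the third LSTM unit
lstm_f2 = nn.LSTMCell(2 * M * hidden_dim, hidden_dim)    # "LSTM2" of the third LSTM unit

def third_lstm(target_matrix, third_rep, rep_sets):
    """target_matrix: (batch, M, hidden_dim); third_rep: (batch, hidden_dim);
    rep_sets: list of M dicts holding tensors A and B of shape (batch, hidden_dim)."""
    batch = target_matrix.size(0)
    zeros = torch.zeros_like(third_rep)
    # C: LSTM1 learns the flattened context vectors with third_rep as its prior state.
    h_c, c_c = lstm_f1(target_matrix.reshape(batch, -1), (third_rep, zeros))
    # D: LSTM2 learns all second image representation information with C as its prior state.
    seconds = torch.cat([torch.cat([s["A"], s["B"]], dim=1) for s in rep_sets], dim=1)
    h_d, _ = lstm_f2(seconds, (h_c, c_c))
    # The target image representation information set contains C and D.
    return {"C": h_c, "D": h_d}
```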
In some possible embodiments, the image processing apparatus further includes an optimization unit 65 configured to:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and modifying, according to the loss function, the network parameters of the LSTM unit adopted by the fusion device.
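To make the optimization step concrete, here is a rough sketch under loudly stated assumptions: the discrimination supervision loss is interpreted as a word-level cross-entropy over the decoder's predicted description, and the term built from the M image representation information sets and the target image representation information is a placeholder regularizer; neither the weighting nor the auxiliary form is the patent's actual loss construction.

```python
import torch
import torch.nn.functional as F

def image_processing_loss(word_logits, target_words, rep_sets, target_rep, aux_weight=0.1):
    """word_logits: (batch, seq_len, vocab); target_words: (batch, seq_len) token ids;
    rep_sets: list of M tensors (batch, set_size, hidden_dim); target_rep: (batch, hidden_dim)."""
    # Discrimination supervision loss derived from the image description.
    ce = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                         target_words.reshape(-1))
    # Placeholder auxiliary term tying the M representation sets to the target representation.
    aux = sum(F.mse_loss(s.mean(dim=1), target_rep) for s in rep_sets) / len(rep_sets)
    return ce + aux_weight * aux

# loss.backward() and an optimizer step would then correct the fuser's LSTM parameters.
```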
According to an embodiment of the present invention, steps S101-S104 involved in the image processing method shown in fig. 5 may be performed by respective units in the image processing apparatus shown in fig. 7. For example, steps S101, S102, S103, S104 shown in fig. 5 may be performed by the acquisition unit 61, the first fusion unit 62, the second fusion unit 63, and the output unit 64 shown in fig. 7, respectively.
According to an embodiment of the present invention, steps S201 to S211 related to the image processing method shown in fig. 6 may be executed by each unit in the image processing apparatus shown in fig. 7, and specific reference may be made to implementation manners provided by each step in the embodiment corresponding to fig. 6, which are not described herein again.
According to another embodiment of the present invention, the units in the image processing apparatus shown in fig. 7 may be combined, individually or entirely, into one or several other units, or some unit(s) thereof may be further split into multiple functionally smaller units; either arrangement achieves the same operation without affecting the technical effects of the embodiments of the invention. The units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present invention, the image processing apparatus may also include other units, and in practical applications these functions may be implemented with the assistance of other units and through the cooperation of multiple units.
In the embodiment of the invention, the fusion device fuses, at different time steps, the image features output by each of the plurality of CNNs at the encoder end together with their corresponding hidden states to obtain a plurality of hidden state sets, and then fuses the richer hidden states and other image features contained in these hidden state sets to obtain the final hidden state corresponding to the image to be processed. Because more hidden states of the image to be processed are fused when the final hidden state is acquired, image features with richer content can be obtained; the final hidden state of the image to be processed is derived from these richer image features and is output to the decoder. The decoder decodes the image to be processed by using the final hidden state obtained by the fusion device, and thereby obtains an image description, such as a natural sentence, describing the image to be processed. This improves the accuracy of the image description, improves the quality of the content understanding service for the image to be processed, and can enhance the user experience of that service. In addition, the image processing method provided by the embodiment of the invention can also construct a loss function from the output data of the fusion device and the decoder, and use the loss function to modify the network parameters of the LSTM units in the fusion device and/or the decoder, which optimizes the performance of the fusion device and the decoder, further improves the image processing performance of the image processing system, and enhances the user stickiness of the image processing system.
Based on the image processing system and the image processing method in the embodiments, the embodiment of the invention also provides a server. Referring to fig. 8, the internal structure of the server at least includes the image processing system shown in fig. 2, that is, includes an encoder, a fuser and a decoder, and further, the server also includes a processor, a communication interface and a computer storage medium. The processor, the communication interface and the computer storage medium in the server may be connected by a bus or other means, and fig. 8 shows an example of the communication bus connection according to the embodiment of the present invention.
The communication interface is the medium through which the server interacts and exchanges information with external devices (such as terminal devices). The processor (or Central Processing Unit, CPU) is the computing core and control core of the server; it is understood that the processor herein may also be a processor integrated in the fusion device, and is adapted to implement one or more instructions, specifically to load and execute one or more instructions so as to implement the corresponding method flow or function. The computer storage medium (Memory) is a memory device in the server for storing programs and data. It is understood that the computer storage medium herein may include both the built-in storage medium of the server and, of course, any extended storage medium supported by the server. The computer storage medium provides storage space that stores the operating system of the server. One or more instructions, which may be one or more computer programs (including program code), are also stored in this storage space and are adapted to be loaded and executed by the processor. The computer storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one computer storage medium located remotely from the aforementioned processor.
In the embodiment of the present invention, the processor loads and executes one or more instructions stored in the computer storage medium to implement the corresponding steps in the method flows shown in fig. 5 to 6; in a specific implementation, one or more instructions in a computer storage medium are loaded by a processor and perform the following steps:
acquiring M groups of image characteristics of an image to be processed from an encoder, wherein M is an integer not less than 2;
acquiring first image representation information corresponding to each group of image features in the M groups of image features;
generating M image representation information sets according to the image characteristics of each group and first image representation information corresponding to the image characteristics of each group, wherein one image representation information set generated corresponding to one image characteristic of each group comprises at least one piece of second image representation information;
fusing second image representation information included in the M image representation information sets to obtain target image representation information, and outputting the target image representation information to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In one embodiment, in the process of the processor loading one or more instructions in a computer storage medium to execute the step of acquiring M groups of image features of the image to be processed from the encoder, the following steps are specifically executed:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
The method comprises the steps of obtaining M groups of image characteristics of an image to be processed from M coding channels of an encoder of the image processing system, wherein one coding channel of the encoder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
In another embodiment, each of the M groups of image features includes a global image feature of the image to be processed; in the process of the processor loading one or more instructions in the computer storage medium to execute the step of obtaining the first image representation information corresponding to each group of image features in the M groups of image features, the following steps are specifically executed:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In another embodiment, in the process of the processor loading one or more instructions in the computer storage medium and executing the step of generating M image representation information sets for the respective sets of image features and the first image representation information corresponding to the respective sets of image features, the following steps are specifically executed:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information a and the image representation information B corresponding to any one set of image features to obtain an image representation information set i corresponding to any one set of image features, wherein the image representation information a and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In another embodiment, each of the M groups of image features further includes a sub-region local image feature of the image to be processed, and the processor loads one or more instructions in a computer storage medium to perform the step of learning, based on the first LSTM unit, any group of image features and the first image representation information corresponding to the group of image features to obtain the image representation information a corresponding to the group of image features, specifically performing the following steps:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any group of image features.
In another embodiment, in the process that the processor loads one or more instructions in the computer storage medium to perform the step of fusing the second image representation information included in the M image representation information sets to obtain the target image representation information, the following steps are specifically performed:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
target image representation information is generated based on a third LSTM unit from the target vector matrix and the third image representation information.
In yet another embodiment, the third LSTM unit includes at least LSTM1 and LSTM2, and the following steps are specifically performed in the process of the processor loading one or more instructions in the computer storage medium to execute the step of generating the target image representation information based on the third LSTM unit according to the target vector matrix and the third image representation information:
learning the M context vectors included in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
In yet another embodiment, the processor loads one or more instructions in the computer storage medium to perform the following steps:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and correcting the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
In the embodiment of the invention, the image features output by each of the plurality of CNNs at the encoder end and their corresponding hidden states are fused at different time steps to obtain a plurality of hidden state sets, and the richer hidden states and other image features contained in these hidden state sets are then fused to obtain the final hidden state corresponding to the image to be processed. Because more hidden states of the image to be processed are fused when the final hidden state is acquired, image features with richer content can be obtained; the final hidden state of the image to be processed is derived from these richer image features and is output to the decoder. The decoder decodes the image to be processed by using the final hidden state obtained through fusion, and thereby obtains an image description, such as a natural sentence, describing the image to be processed. This improves the accuracy of the image description, improves the quality of the content understanding service for the image to be processed, and can enhance the user experience of that service. In addition, the image processing method provided by the embodiment of the invention can also construct a loss function from the output data of the fusion device and the decoder, and use the loss function to modify the network parameters of the LSTM units in the fusion device and/or the decoder, which optimizes the performance of the fusion device and the decoder, further improves the image processing performance of the image processing system, and enhances the user stickiness of the image processing system.
The above disclosure describes only preferred embodiments of the present invention and, of course, cannot be taken to limit the scope of the claims of the invention; equivalent variations made in accordance with the claims of the present invention still fall within the scope of the invention.

Claims (10)

1. An image processing method applied to an image processing system, the image processing system comprising an encoder and a decoder, the image processing system further comprising a fuser, the method comprising:
the fusion device obtains M groups of image characteristics of the image to be processed from the encoder;
the fusion device acquires first image representation information corresponding to each group of image features in the M groups of image features;
the fusion device generates M image representation information sets according to each group of image features and the first image representation information corresponding to each group of image features, wherein one group of image features correspondingly generates one image representation information set, and one image representation information set comprises at least one piece of second image representation information;
the fusion device fuses second image representation information included in the M image representation information sets to obtain target image representation information, and the target image representation information is output to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed;
wherein the fusing device fuses second image representation information included in the M image representation information sets to obtain target image representation information includes:
the fusion device determines third image representation information according to image representation information B included in each image representation information set in the M image representation information sets, wherein the image representation information B included in each image representation information set is image representation information obtained last in each image representation information set;
executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and processing the target vector matrix and the third image representation information based on a third long short-term memory (LSTM) unit to generate the target image representation information.
2. The method of claim 1, wherein the fuser acquiring M sets of image features of the image to be processed from the encoder comprises:
the fusion device acquires M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters, and one encoder outputs one group of image characteristics; or
The fusion device obtains M groups of image characteristics of the image to be processed from M coding channels of the coder of the image processing system, wherein one coding channel of the coder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
3. The method according to claim 1 or 2, wherein each of the M sets of image features includes a global image feature of the image to be processed;
the acquiring, by the fusion device, first image representation information corresponding to each of the M sets of image features includes:
and the fusion device generates first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
4. The method of claim 3, wherein the fuser generating M sets of image representation information based on the sets of image features and first image representation information corresponding to the sets of image features comprises:
the fusion device learns any group of image features and the first image representation information corresponding to each group of image features based on a first long short-term memory (LSTM) unit to obtain image representation information A corresponding to any group of image features;
the fusion device learns any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any group of image features to obtain an image representation information set i corresponding to any group of image features, wherein the image representation information A and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
5. The method of claim 4, wherein each of the M sets of image features further includes a subregion local image feature of the image to be processed;
the fusion device learns any group of image features and first image representation information corresponding to each group of image features based on a first LSTM unit, and the obtaining of the image representation information A corresponding to any group of image features comprises:
the fusion device learns the partial image features of the sub-region in any group of image features and the first image representation information corresponding to any group of image features based on the attention model in the first LSTM unit and outputs context vectors corresponding to any group of image features;
the fusion device learns the context vector corresponding to any group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any group of image features.
6. The method of claim 5, wherein the third LSTM unit comprises at least LSTM1 and LSTM2, and the generating of the target image representation information from the target vector matrix and the third image representation information based on the third LSTM unit comprises:
learning the M context vectors and the third image representation information included in the target vector matrix based on the LSTM1 to obtain image representation information C;
learning second image representation information and the image representation information C included in the M image representation information sets based on the LSTM2 to obtain image representation information D;
and combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
7. The method of claim 5 or 6, further comprising:
the fusion device acquires the image description of the image to be processed from the decoder, and determines a discrimination supervision loss function of image processing according to the image description of the image to be processed;
the fusion device constructs a loss function of an image processing system by combining the discrimination supervision loss function according to the M image representation information sets of the image to be processed and the target image representation information;
and the fusion device corrects the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
8. An image processing apparatus applied to an image processing system including an encoder and a decoder, wherein the image processing system further includes a fuser, the apparatus is the fuser, and the apparatus includes:
an acquisition unit for acquiring M sets of image features of an image to be processed from the encoder;
the acquisition unit is further configured to acquire first image representation information corresponding to each of the M groups of image features;
a first fusion unit, configured to generate M image representation information sets according to the groups of image features acquired by the acquisition unit and the first image representation information corresponding to the groups of image features, wherein one group of image features corresponds to one generated image representation information set, and one image representation information set comprises at least one piece of second image representation information;
a second fusion unit, configured to fuse second image representation information included in the M image representation information sets obtained by the first fusion unit to obtain target image representation information;
an output unit configured to output the target image representation information obtained by the second fusion unit to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed;
wherein the second fusion unit is configured to:
determining third image representation information according to image representation information B included in each image representation information set in the M image representation information sets, wherein the image representation information B included in each image representation information set is image representation information obtained last in each image representation information set;
executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning the M context vectors in the target vector matrix and the third image representation information based on a third LSTM unit to obtain target image representation information.
9. A computer storage medium having one or more instructions stored thereon, the one or more instructions adapted to be loaded by a processor and to perform the image processing method of any of claims 1-7.
10. A server comprising an image processing system including an encoder and a decoder, characterized in that the image processing system further comprises a fuser comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the image processing method of any of claims 1-7.
CN201810442810.0A 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server Active CN108665506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810442810.0A CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810442810.0A CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Publications (2)

Publication Number Publication Date
CN108665506A CN108665506A (en) 2018-10-16
CN108665506B true CN108665506B (en) 2021-09-28

Family

ID=63778945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810442810.0A Active CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Country Status (1)

Country Link
CN (1) CN108665506B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859210B (en) * 2018-12-25 2021-08-06 上海联影智能医疗科技有限公司 Medical data processing device and method
CN109902723A (en) * 2019-01-31 2019-06-18 北京市商汤科技开发有限公司 Image processing method and device
CN109919888B (en) * 2019-02-26 2023-09-19 腾讯科技(深圳)有限公司 Image fusion method, model training method and related device
CN110069994B (en) * 2019-03-18 2021-03-23 中国科学院自动化研究所 Face attribute recognition system and method based on face multiple regions
CN110084128B (en) * 2019-03-29 2021-12-14 安徽艾睿思智能科技有限公司 Scene graph generation method based on semantic space constraint and attention mechanism
CN110310253B (en) * 2019-05-09 2021-10-12 杭州迪英加科技有限公司 Digital slice classification method and device
CN110458242A (en) * 2019-08-16 2019-11-15 广东工业大学 A kind of iamge description generation method, device, equipment and readable storage medium storing program for executing
CN110309839B (en) * 2019-08-27 2019-12-03 北京金山数字娱乐科技有限公司 A kind of method and device of iamge description
CN111191791B (en) * 2019-12-02 2023-09-29 腾讯云计算(北京)有限责任公司 Picture classification method, device and equipment based on machine learning model
US20210279386A1 (en) * 2020-03-05 2021-09-09 International Business Machines Corporation Multi-modal deep learning based surrogate model for high-fidelity simulation
CN113763232A (en) * 2020-08-10 2021-12-07 北京沃东天骏信息技术有限公司 Image processing method, device, equipment and computer readable storage medium
US11775617B1 (en) * 2021-03-15 2023-10-03 Amazon Technologies, Inc. Class-agnostic object detection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830709B2 (en) * 2016-03-11 2017-11-28 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 A kind of answering method of knowledge based collection of illustrative plates
CN107665356A (en) * 2017-10-18 2018-02-06 北京信息科技大学 A kind of image labeling method
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
CN107979764A (en) * 2017-12-06 2018-05-01 中国石油大学(华东) Video caption generation method based on semantic segmentation and multilayer notice frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning Multimodal Attention LSTM Networks for Video Captioning; Jun Xu et al.; Proceedings of the 25th ACM International Conference on Multimedia; 2017-10-31; Section 3 and Fig. 2, pages 539-541 *

Also Published As

Publication number Publication date
CN108665506A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
CN108665506B (en) Image processing method, image processing device, computer storage medium and server
US10762305B2 (en) Method for generating chatting data based on artificial intelligence, computer device and computer-readable storage medium
US11853709B2 (en) Text translation method and apparatus, storage medium, and computer device
CN109271646B (en) Text translation method and device, readable storage medium and computer equipment
CN109034378B (en) Network representation generation method and device of neural network, storage medium and equipment
CN110475129B (en) Video processing method, medium, and server
KR20180001889A (en) Language processing method and apparatus
CN110134971B (en) Method and device for machine translation and computer readable storage medium
CN108776832B (en) Information processing method, information processing device, computer equipment and storage medium
CN110083702B (en) Aspect level text emotion conversion method based on multi-task learning
CN116415654A (en) Data processing method and related equipment
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN110543561A (en) Method and device for emotion analysis of text
CN110765733A (en) Text normalization method, device, equipment and storage medium
CN115186147B (en) Dialogue content generation method and device, storage medium and terminal
CN113396429A (en) Regularization of recursive machine learning architectures
CN112364650A (en) Entity relationship joint extraction method, terminal and storage medium
CN112819050A (en) Knowledge distillation and image processing method, device, electronic equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN109979461B (en) Voice translation method and device
CN111563391A (en) Machine translation method and device and electronic equipment
CN114359592A (en) Model training and image processing method, device, equipment and storage medium
CN111259673B (en) Legal decision prediction method and system based on feedback sequence multitask learning
WO2023017568A1 (en) Learning device, inference device, learning method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant