CN108665506B - Image processing method, image processing device, computer storage medium and server


Info

Publication number
CN108665506B
Authority
CN
China
Prior art keywords
image
representation information
image representation
features
processed
Prior art date
Legal status
Active
Application number
CN201810442810.0A
Other languages
Chinese (zh)
Other versions
CN108665506A (en)
Inventor
姜文浩 (Wenhao Jiang)
马林 (Lin Ma)
刘威 (Wei Liu)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810442810.0A
Publication of CN108665506A
Application granted
Publication of CN108665506B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the invention discloses an image processing method and device, a computer storage medium and a server. The method is applied to a fusion device and comprises the following steps: the fusion device obtains M groups of image features of an image to be processed from the encoder; first image representation information corresponding to each group of image features in the M groups of image features is obtained; M image representation information sets are generated according to each group of image features and the first image representation information corresponding to that group, where the image representation information set generated for one group of image features comprises at least one piece of second image representation information; and the second image representation information included in the M image representation information sets is fused to obtain target image representation information, which is output to the decoder. By adopting the embodiment of the invention, the accuracy of the natural-sentence description of an image can be improved, and the quality of the image content understanding service can be optimized.

Description

Image processing method, image processing device, computer storage medium and server
Technical Field
The present invention relates to the field of internet technologies, in particular to the field of image processing technologies, and specifically to an image processing method, an image processing apparatus, a computer storage medium, and a server.
Background
In order to facilitate quick understanding of the main contents of images, image content understanding services have been developed. The image content understanding service is a service for converting image content into a description using one natural sentence, and thus image content understanding can also be understood as image content description. In other words, image content understanding can be seen as a translation problem, i.e. translating image content into a natural sentence description. One important factor for measuring the quality of the image content understanding service is the description accuracy of natural sentences used for describing image content.
In the prior art, an image processing flow is generally divided into an encoding stage and a decoding stage. Encoding stage: the image features of each frame of the original image are extracted by an encoder. Decoding stage: natural sentences used for describing the image content are predicted by a decoder according to the image features extracted by the encoder. Although the prior art realizes the image content understanding service, it only obtains natural sentences for describing the image content through an encoder and a decoder, and does not pay attention to describing the image from multiple angles, so the quality of the image content understanding service cannot be guaranteed.
Disclosure of Invention
Embodiments of the present invention provide an image processing method, an image processing apparatus, a computer storage medium, and a server, which can improve description accuracy in describing image content using natural sentences, improve quality of image content understanding service, and further improve user experience of the image content understanding service.
In a first aspect, an embodiment of the present invention provides an image processing method, where the method is applied to an image processing system, where the image processing system includes an encoder, a fuser, and a decoder, and the method includes:
the fusion device obtains M groups of image characteristics of the image to be processed from the encoder, wherein M is an integer not less than 2;
the fusion device acquires first image representation information corresponding to each group of image features in the M groups of image features;
the fusion device generates M image representation information sets according to each group of image features and the first image representation information corresponding to each group of image features, wherein one image representation information set is generated corresponding to one group of image features, and one image representation information set comprises at least one piece of second image representation information;
the fusion device fuses second image representation information included in the M image representation information sets to obtain target image representation information, and outputs the target image representation information to a decoder;
the target image representation information is used for a decoder to decode the image to be processed to obtain the image description of the image to be processed.
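To make this flow concrete, the following is a minimal PyTorch-style sketch of the encoder-fuser-decoder pipeline described above; the module names (ImageProcessingSystem, the encoders, fuser and decoder arguments) and shapes are illustrative assumptions and not components defined by this disclosure.

```python
import torch

# Hypothetical components standing in for the encoder(s), fuser and decoder of the
# first aspect; only the data flow of the M groups of image features is illustrated.
class ImageProcessingSystem(torch.nn.Module):
    def __init__(self, encoders, fuser, decoder):
        super().__init__()
        self.encoders = torch.nn.ModuleList(encoders)  # M encoders (or M encoding channels)
        self.fuser = fuser
        self.decoder = decoder

    def forward(self, image):
        # M groups of image features of the image to be processed (M >= 2)
        feature_groups = [enc(image) for enc in self.encoders]
        # Fuser: per-group first image representation information -> M image
        # representation information sets -> fused target image representation information
        target_repr = self.fuser(feature_groups)
        # Decoder: target representation -> natural-sentence image description
        return self.decoder(target_repr)
```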
In some possible embodiments, the obtaining, by the fuser, M sets of image features of the image to be processed from the encoder includes:
the fusion device obtains M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters, and one encoder outputs one group of image characteristics; or
The fusion device obtains M groups of image characteristics of an image to be processed from M coding channels of an encoder of the image processing system, wherein one coding channel of the encoder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the acquiring, by the fusion device, first image representation information corresponding to each of the M sets of image features includes:
the fusion device generates first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and a specified linear transformation matrix.
In some possible embodiments, the first image representation information corresponding to the sets of image features may be a first hidden state corresponding to the sets of image features;
the fusion device generates M image representation information sets according to the image features and the first image representation information corresponding to the image features, and comprises:
the fusion device learns any group of image features and first image representation information corresponding to the image features on the basis of a first long-short term memory (LSTM) unit to obtain image representation information A corresponding to the image features;
the fusion device learns any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one group of image features to obtain an image representation information set i corresponding to that group of image features, wherein the image representation information A and the image representation information B are the second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some feasible embodiments, the image representation information provided in the embodiment of the present invention may include a hidden state, the first image representation information may be a first hidden state, the second image representation information may be a second hidden state, and the hidden state A and the hidden state B may be the image representation information A and the image representation information B, which will not be repeated below.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the fusion device learning any group of image features and the first image representation information corresponding to that group of image features based on a first LSTM unit to obtain the image representation information A corresponding to that group of image features comprises:
the fusion device learns the local image features of the sub-regions in the group of image features and the first image representation information corresponding to the group of image features based on the attention model in the first LSTM unit, and outputs the context vector corresponding to the group of image features;
the fusion device learns the context vector corresponding to the group of image features and the first image representation information corresponding to the group of image features based on the first LSTM unit to obtain the image representation information A corresponding to the group of image features.
In some possible embodiments, the fusing, by the fuser, the second image representation information included in the M image representation information sets to obtain the target image representation information includes:
the fusion device determines third image representation information according to the image representation information B included in each image representation information set in the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and processing the target vector matrix and the third image representation information based on a third LSTM unit to generate target image representation information.
In some possible embodiments, the third LSTM unit at least includes LSTM1 and LSTM2, and the generating the target image representation information based on the third LSTM unit according to the target vector matrix and the third image representation information includes:
learning the M context vectors included in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
In some possible embodiments, the method further comprises:
the fusion device obtains the image description of the image to be processed from the decoder, and determines the discrimination supervision loss function of the image processing according to the image description of the image to be processed;
the fusion device combines the discrimination supervision loss function to construct a loss function of an image processing system according to the M image representation information sets of the images to be processed and the target image representation information;
and the fusion device modifies the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
In some possible embodiments, the above-described loss function may also be used to modify the network parameters of the LSTM unit employed by the decoder.
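As a rough illustration of this training setup, the sketch below combines a hypothetical caption loss with a discrimination-supervision term and updates the LSTM parameters of the fuser (and optionally the decoder); the optimizer, weighting factor and loss forms are assumptions made only for illustration.

```python
import torch

# Illustrative only: assumes fuser and decoder are nn.Modules and that
# caption_loss and ds_loss are scalar tensors computed elsewhere.
def build_optimizer(fuser, decoder, lr=1e-4, tune_decoder=True):
    params = list(fuser.parameters())
    if tune_decoder:                      # the loss may also correct the decoder's LSTM
        params += list(decoder.parameters())
    return torch.optim.Adam(params, lr=lr)

def training_step(optimizer, caption_loss, ds_loss, lam=1.0):
    total = caption_loss + lam * ds_loss  # lam: assumed weighting of the DS term
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total)
```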
In a second aspect, an embodiment of the present invention provides an image processing apparatus, where the image processing apparatus is applied to an image processing system, the image processing system includes an encoder, a fuser and a decoder, and the apparatus may be the fuser. The apparatus includes:
an obtaining unit, configured to obtain M sets of image features of an image to be processed from the encoder, where M is an integer not less than 2;
the obtaining unit is further configured to obtain first image representation information corresponding to each group of image features in the M groups of image features, where the first image representation information may be a first hidden state;
a first fusion unit, configured to generate M image representation information sets according to the groups of image features acquired by the acquisition unit and first image representation information corresponding to the groups of image features, where a group of image features corresponds to a generated image representation information set, and each image representation information set includes at least one piece of second image representation information;
a second fusion unit configured to fuse second image representation information included in the M image representation information sets obtained by the first fusion unit and learn the second image representation information to obtain target image representation information;
an output unit configured to output the target image representation information obtained by the second fusion unit to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In some possible embodiments, the obtaining unit is configured to:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
Acquiring M groups of image features of an image to be processed from M encoding channels of an encoder of the image processing system, wherein one encoding channel of the encoder corresponds to one group of encoding parameters, and one encoding channel outputs one group of image features.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the acquisition unit is configured to:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In some possible embodiments, the first fusion unit is configured to:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one group of image features to obtain an image representation information set i corresponding to that group of image features, wherein the image representation information A and the image representation information B are the second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the first fusing unit is configured to:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any one group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any one group of image features.
In some possible embodiments, the second fusion unit is configured to:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning the M context vectors in the target vector matrix and the third image representation information based on a third LSTM unit to obtain target image representation information.
In some possible embodiments, the third LSTM unit employed by the second fusion unit includes at least LSTM1 and LSTM2, and the second fusion unit is configured to:
learning the M context vectors in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D included in the target image representation information set as target image representation information.
In some possible embodiments, the image processing apparatus further includes an optimization unit configured to:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and correcting, according to the loss function, the network parameters of the LSTM units adopted by the image processing apparatus.
In a third aspect, the present invention provides a computer storage medium applied in an image processing system, the image processing system including an encoder, a fuser and a decoder, the computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the fuser and to perform the method provided by any feasible implementation manner of the first aspect and the first aspect.
In a fourth aspect, an embodiment of the present invention provides a server, where the server includes an image processing system, where the image processing system includes an encoder, a fuser and a decoder, and the fuser further includes:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the method as provided by any of the possible embodiments of the first aspect and the first aspect described above.
Through the fusion device, the method and the device can learn and fuse multiple groups of image features obtained from the encoder and the image representation information corresponding to those image features to obtain multiple image representation information sets, further fuse the image representation information sets to obtain target image representation information, and output the target image representation information to the decoder; the decoder decodes the image to be processed in combination with the fused target image representation information of the image to be processed to obtain a natural sentence corresponding to the image to be processed, and the natural sentence is used for the image description of the image to be processed. Therefore, the fusion device fuses multiple groups of image features processed by the encoder to obtain an image representation with a richer amount of data, and this richer representation is provided to the decoder for decoding, so that the description accuracy of the natural sentence is improved and the quality of the image content understanding service is optimized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of an image processing system according to an embodiment of the present invention;
FIG. 2 is a block diagram of another embodiment of an image processing system;
FIG. 3 is a schematic diagram of an application scenario of image processing provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of image feature fusion provided in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 6 is a schematic view of another flowchart of an image processing method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In order to facilitate quick understanding of the main contents of images, image content understanding services have been developed. The image content understanding service is a service that generates a natural-language expression for describing image content by performing a series of encoding and decoding processes on an image, including extracting features of the image, understanding the image content, and the like. In other words, the image content understanding service is a service that converts image content into a description in one natural sentence. The image content understanding service may be applied in a variety of internet scenarios. For example, the image content understanding service can be applied to an image classification scenario of an image website, where images are classified based on their natural-sentence descriptions. For another example, the image content understanding service can also be applied to an image retrieval scenario of an image website, where images are retrieved based on their natural-sentence descriptions. It can also be combined with a voice retrieval system, converting the natural sentence of an image into speech output so as to assist visually impaired users in retrieving images. The image content understanding service can also be applied to a target detection scenario, for example finding objects such as a target person in a surveillance video. Further uses can be determined according to the actual application scenario, which is not limited herein.
One important factor for measuring the quality of the image content understanding service is the description accuracy of natural sentences used for describing image content. If the description accuracy of the natural sentence is higher, which indicates that the matching degree of the content described by the natural sentence and the image content is higher, the image content understanding service quality is better, and the user use experience is better. Conversely, the lower the description accuracy of the natural language sentence is, the lower the matching degree between the content described by the natural language sentence and the image content is, the worse the image content understanding service quality is, and the worse the user experience is.
Referring to fig. 1, fig. 1 is a schematic diagram of the framework of an image processing system according to an embodiment of the present invention. As shown in fig. 1, the framework of the image processing system is mainly composed of an encoder and a decoder. Correspondingly, the flow of image processing performed by the image processing system is divided into two stages: an encoding stage and a decoding stage. Encoding stage: image feature extraction is performed on each frame of the original image by the encoder. Decoding stage: the frame features extracted in the encoding stage are transformed by means of a mean feature mechanism, an attention mechanism or the like, and a natural sentence for describing the image content is then predicted by the decoder according to the transformed image features. The attention mechanism is taken as an example to illustrate the embodiment of the present invention, and is not described in detail here. As can be seen from the image processing scheme corresponding to the image processing system shown in fig. 1, the decoder generates natural sentences using the image features output by the encoder, so whether the image features output by the encoder are rich directly affects the accuracy with which the natural sentences obtained by the decoder describe the image content. In the image processing scheme shown in fig. 1, the vectors output by the encoder are used directly by the decoder for decoding without any further processing; the amount of image information carried by the vectors output by a single encoder is small, and the decoder can only obtain natural sentences for describing the image content from those vectors, so the description accuracy of the natural sentences and hence the quality of the image content understanding service cannot be guaranteed, and the applicability is poor.
Based on this, an embodiment of the present invention proposes an image processing method in which a fuser is added to the image processing system, so that the image processing system includes not only an encoder and a decoder but also the fuser. The fuser fuses image data such as the image features extracted by a plurality of encoders and the image representation information corresponding to those image features, or fuses image data such as the image features output by a plurality of encoding channels of one encoder and the corresponding image representation information, and outputs the fused image representation to the decoder. The decoder decodes the image to be processed using the richer image data obtained by the fuser and obtains an image description such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the quality of the image content understanding service for the image to be processed, and can enhance the user experience of the image content understanding service.
Based on this principle, the image processing system according to the embodiment of the present invention introduces, on top of the system architecture shown in fig. 1, a fuser for fusing the image features output by the encoder. Please refer to fig. 2, which is another schematic diagram of the framework of an image processing system according to an embodiment of the present invention. The image processing system of the embodiment of the invention includes an encoder, a fuser and a decoder. Based on the image processing system shown in fig. 2, the image processing flow of the embodiment of the present invention is mainly divided into three stages: an encoding stage at the encoder side, a fusion stage at the fuser side, and a decoding stage at the decoder side. The implementation of each of these three stages is described as follows:
First, the encoding stage:
an original image (i.e., an image to be processed, which will be described below by taking the image to be processed as an example for convenience of description) is input into an encoder, and feature extraction is performed on each frame of the image to be processed by the encoder to obtain a frame feature sequence. Generally, the encoder may perform feature extraction based on a Convolutional Neural Network (CNN). The encoder may perform image encoding on an image to be processed through the CNN, and may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image.
Second, the fusion stage:
In the embodiment of the present invention, a fuser is added between the encoder and the decoder. The fuser can combine the multiple image representations of the image to be processed output by the encoder into a final image representation of the image to be processed; the combined final image representation can be input into the decoder, and the decoder outputs the natural sentence used for the image description of the image to be processed. For example, referring to fig. 3, fig. 3 is a schematic view of an application scenario of image processing according to an embodiment of the present invention. As shown in fig. 3, in the image processing method provided by the embodiment of the present invention, the image to be processed may be input into the encoder, and multiple groups of image features of the image to be processed (which may be regarded as multiple image representations of the image to be processed) are output through multiple CNNs at the encoder side. The fuser side can acquire the multiple groups of image features from the encoder side, fuse them to obtain the final image representation of the image to be processed, and output it to the decoder side. The decoder side can decode the final image representation output by the fuser side to obtain the image description of the image to be processed, for example outputting a natural sentence describing the image to be processed as an image of a pedestrian.
In some possible embodiments, the image processing system may perform image encoding on the image to be processed through a plurality of encoders to output a plurality of sets of image features of the image to be processed. One encoder may correspondingly adopt one CNN, and the network parameters adopted by each CNN are different, that is, the encoding parameters adopted by each encoder in the plurality of encoders are different. Each group of image features of the multiple groups of image features output by the multiple encoders comprises a global image feature and a group of local image features.
Optionally, in some possible embodiments, a plurality of CNNs may also be included in one encoder, and one CNN is one encoding channel, that is, a plurality of encoding channels may be included in the encoder for image encoding the image to be processed to output a plurality of sets of image features of the image to be processed. The network parameters adopted by each CNN in the multiple CNNs are different, that is, one coding channel in the multiple coding channels of the encoder corresponds to one group of coding parameters, and then the image to be processed is coded by the multiple coding channels corresponding to the multiple CNNs, so that multiple groups of different image features can be output. The plurality of different sets of image features may be a plurality of sets of image representations of the image to be processed. Similarly, each of the image features in the plurality of sets of image features includes a global image feature and a set of local image features.
In some possible embodiments, it is assumed that the number of encoders is M, where one encoder corresponds to one CNN, or that the number of encoding channels in one encoder is M, where one encoding channel corresponds to one CNN. The network parameters adopted by different CNNs are different, and therefore the M CNNs corresponding to the M encoders, or to the M encoding channels of one encoder, may be M different CNNs. For convenience of description, the extraction of image features from the image to be processed by M CNNs is taken as an example below. Assume that the global image feature obtained by the m-th CNN of the M CNNs from the image to be processed is denoted $\bar{a}^{(m)}$, and that the local image features corresponding to the sub-regions of the image to be processed are denoted $A^{(m)} = \{a_1^{(m)}, a_2^{(m)}, \dots, a_k^{(m)}\}$, where m may be any integer from 1 to M, which is not limited herein. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where $\bar{a}^{(m)}$ and $A^{(m)}$ together form the group of image features corresponding to the m-th CNN.
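Continuing the encoder sketch given earlier, the M groups of image features could be produced by M independently parameterized copies of such a CNN encoder; this is only one illustrative assumption about how the M encoders or encoding channels might be realized.

```python
import torch

M = 3  # number of encoders / encoding channels (M >= 2)
# CNNEncoder is the sketch class defined above; each copy has its own parameters.
encoders = torch.nn.ModuleList(CNNEncoder() for _ in range(M))

def extract_feature_groups(image):
    # Returns [(a_bar_m, A_m)] for m = 1..M: the global feature a_bar^(m) and the
    # local (sub-region) feature set A^(m) output by each of the M CNNs.
    return [enc(image) for enc in encoders]
```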
In some possible embodiments, the fuser may obtain M sets of image features obtained by extracting image features of the image to be processed from the M CNNs obtained by the encoder, and further fuse the M sets of image features to obtain a final image representation of the image to be processed, and output the final image representation to the decoder. The fusion of the M sets of image features obtained from the encoder by the fusion device may include a fusion stage 1 and a fusion stage 2, each of which may include one or more image feature processing steps. In the embodiment of the present invention, the image feature processing steps in the fusion stage 1 and the fusion stage 2 occur at different times, so for convenience of description, the image feature processing steps can also be described by taking time steps as an example. Referring to fig. 4, fig. 4 is a schematic diagram of image feature fusion according to an embodiment of the present invention.
In the decoding stage, the decoder may generally predict the natural sentence using a Recurrent Neural Network (RNN), and the RNN may be implemented using Long Short-Term Memory (LSTM) units. The embodiment of the present invention therefore takes an RNN based on LSTM units as an example, and takes processing the image features of the image to be processed using a temporal attention mechanism as an example. In the fusion stage, the fuser may also adopt LSTM units to process the image features acquired from the encoder, so that the processed image representation output to the decoder for the decoding stage can meet the requirements of the decoder, the accuracy of the image description output after decoding the image to be processed can be guaranteed, and the processing quality of the image processing can be improved. Correspondingly, in the embodiment of the present invention, in the process of processing the image feature data at the fuser side and the decoder side, the image representation information of the image to be processed also includes the hidden state corresponding to the image features of the image to be processed. For ease of understanding, the LSTM unit is briefly described below:
In some possible embodiments, the LSTM unit adopted by the fuser of the embodiment of the present invention may be an LSTM with an attention model, which is essentially a function with states and may be abstractly represented as $h_t = \mathrm{LSTM}(H_t, f_{att}(A, h_{t-1}))$, where

$$H_t = \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}$$

The vector $x_t$ is the input of the image feature processing step corresponding to time t (which may be referred to simply as the t-th time step, or time step t, for convenience of description), and $h_{t-1}$ is the hidden state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). $A = [a_1, a_2, \dots, a_k]$ is a set of annotation vectors; that is, A represents a set whose elements are vectors, and these vectors are called annotation vectors, for example the local image features $A^{(m)}$ of the image to be processed mentioned above, which is not limited herein. The above $f_{att}(A, h_{t-1})$ is an attention model, and the output of $f_{att}(A, h_{t-1})$ at the t-th time step is denoted as the vector $z_t$, where $z_t$ is a context vector. Inside the LSTM unit, a linear transformation is denoted by T, and the image feature processing procedure of the LSTM unit with the attention model can be expressed as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T \begin{pmatrix} H_t \\ z_t \end{pmatrix} \quad (1)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (2)$$

$$h_t = o_t \odot \tanh(c_t) \quad (3)$$

where $i_t$, $f_t$, $g_t$ and $o_t$ are respectively called the input gate, forget gate, memory gate and output gate of the LSTM unit, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\sigma$ is a common neural-network activation function, such as the sigmoid function.

An Attention Model, also called an Attention Mechanism, is widely applied to various deep learning tasks of different types, such as natural language processing, image recognition and speech recognition, and greatly improves the performance of these tasks. For convenience of description, the term attention model is used below.

The attention model $f_{att}(\cdot)$ can be used to determine which region of the image was attended to at the previous time step; that is, a weight value is calculated for each vector in the annotation vector set A, and the image region corresponding to a vector with a higher weight represents the region being attended to. When the attention model calculates the weight value corresponding to any vector $a_i$ in the annotation vector set, the calculation may be performed using a multi-layer perceptron (MLP). The similarity $e_i$ between $a_i$ and $h_{t-1}$ can be calculated by the MLP, and the weight value $w_i$ corresponding to $a_i$ is then calculated, where $w_i$ satisfies:

$$w_i = \frac{\exp(e_i)}{\sum_{j=1}^{k} \exp(e_j)} \quad (4)$$

The context vector $z_t$ corresponding to the annotation vector set input at time step t can be generated using the weight value corresponding to each vector in the annotation vector set A, where $z_t$ satisfies:

$$z_t = \sum_{i} w_i a_i \quad (5)$$

The above $z_t$ may be used in the image feature processing procedure of the LSTM unit at time step t.
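The following is a minimal PyTorch sketch of such an attention LSTM unit following formulas (1)-(5) above; the dimensions, the additive MLP scoring function, and the use of a single linear map T are illustrative assumptions. The input block H_t is assembled by the caller (for example as the concatenation of x_t and the previous hidden state).

```python
import torch
import torch.nn as nn

class AttentionLSTMCell(nn.Module):
    """LSTM cell with an attention model f_att over a set of annotation vectors A (sketch)."""
    def __init__(self, H_dim, a_dim, h_dim):
        super().__init__()
        self.T = nn.Linear(H_dim + a_dim, 4 * h_dim)   # linear transformation T in formula (1)
        self.att_a = nn.Linear(a_dim, h_dim)           # pieces of the MLP scoring a_i against h_{t-1}
        self.att_h = nn.Linear(h_dim, h_dim)
        self.att_e = nn.Linear(h_dim, 1)

    def f_att(self, A, h_prev):                        # A: (B, k, a_dim), h_prev: (B, h_dim)
        # e_i = MLP(a_i, h_{t-1}); w_i = softmax over i  -> formula (4)
        e = self.att_e(torch.tanh(self.att_a(A) + self.att_h(h_prev).unsqueeze(1)))
        w = torch.softmax(e, dim=1)                    # (B, k, 1)
        return (w * A).sum(dim=1)                      # z_t = sum_i w_i a_i  -> formula (5)

    def forward(self, H_t, A, h_prev, c_prev):
        z_t = self.f_att(A, h_prev)                    # context vector
        gates = self.T(torch.cat([H_t, z_t], dim=-1))  # formula (1)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c_t = f * c_prev + i * g                       # formula (2)
        h_t = o * torch.tanh(c_t)                      # formula (3)
        return h_t, c_t
```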
In some possible embodiments, as shown in fig. 4, it is assumed that the number of image feature processing steps included in the fusion stage 1 is T1, and the number of image feature processing steps included in the fusion stage 2 is T2. For convenience of description, the image feature processing procedure of each step in the fusion stage 1 and the fusion stage 2 is described by taking M = 3, T1 = 2 (including t = 1 and t = 2), and T2 = 3 (including t = T1+1, t = T1+2, and t = T1+3) as an example.
Fusion stage 1:
In some possible embodiments, as shown in fig. 4, the image representation output by the m-th CNN is input to the m-th row of the fusion stage 1, where the image representation output by the m-th CNN may include the global image feature $\bar{a}^{(m)}$ and the local image features $A^{(m)}$. Assume that in the fusion stage 1, at time step t, the hidden state and memory cell of the LSTM unit of the m-th row are denoted $h_t^{(m)}$ and $c_t^{(m)}$. At the initial time step (before time step t), the hidden states and memory cells of the LSTM units of each row are initialized as follows; taking the hidden state $h_0^{(m)}$ and memory cell $c_0^{(m)}$ of the LSTM units of the m-th row as an example, $h_0^{(m)}$ and $c_0^{(m)}$ satisfy:

$$h_0^{(m)} = c_0^{(m)} = W^{(m)} \bar{a}^{(m)} \quad (6)$$

where $W^{(m)}$ is a linear transformation matrix used for the linear transformation inside the LSTM units of the m-th row, and $\bar{a}^{(m)}$ is the global image feature of the image to be processed output by the CNN of the m-th row. The above $h_0^{(m)}$ is the initial hidden state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the hidden state corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first hidden state. Here, the first hidden state may be first image representation information of the image to be processed, which is used by the LSTM unit to learn image data such as the image features of the image to be processed. Similarly, the subsequent second hidden state and the like may be second image representation information of the image to be processed; the first hidden state, the second hidden state and so on are merely hidden states (i.e., image representation information) generated at different time nodes and marked differently, without specific limitation, which will not be repeated below.
In some possible embodiments, in the fusion stage 1, when the fusion device fuses the M groups of image features obtained from the encoder at time step t, the hidden states of the LSTM unit in the fusion stage 1
Figure GDA00032085021700001413
And a memory cell
Figure GDA00032085021700001414
Satisfies the following conditions:
Figure GDA00032085021700001415
wherein HtSatisfies the following conditions:
Figure GDA0003208502170000151
wherein the content of the first and second substances,
Figure GDA0003208502170000152
is the attention model of the m-th line in the fusion stage 1 (or the fusion stage I), and the local image features of each sub-region of the image to be processed in the m-th group of image features output according to the m-th CNN and the m-th group of image features (namely A) can be obtained through the attention model of the m-th line(m)) Corresponding first implicit State (i.e. the first implicit State)
Figure GDA0003208502170000153
Assuming t is 1, then
Figure GDA0003208502170000154
(may be)
Figure GDA0003208502170000155
) Outputting context vector z corresponding to mth group of image featuresm. Wherein z is as defined abovemSatisfies the following conditions:
Figure GDA0003208502170000156
similarly, the context vectors corresponding to the image features of the M-1 th group can be output according to the attention models of the M-1 th group except the M-1 th group and the first hidden states corresponding to the image features of the M-1 th group.
In the above formula (8)
Figure GDA0003208502170000157
Is the LSTM cell for line m at time step t, by
Figure GDA0003208502170000158
Context vector z corresponding to the mth group of image featuresmAnd learning the implicit state (such as a first implicit state) corresponding to the last time step t-1 of each group of image features output by the encoder and outputting the implicit state corresponding to the mth group of image features.
H in the above formula (8)tIs a vector obtained by overlapping (or merging) the hidden states corresponding to each set of image features (e.g. the first hidden states corresponding to each set of image features) in the previous time step of time step t (i.e. time step t-1). For example, for time step t, assuming time step t is 1, then H1Satisfies the following conditions:
Figure GDA0003208502170000159
similarly, for time step t +1, there is Ht+1(e.g. H)2) Satisfies the following conditions:
Figure GDA00032085021700001510
in an embodiment of the invention, if the two time steps are different, for example time step t1And time step t2And t is1≠t2Or different CNN extracted image features are input into two different lines, e.g. m1And m2And m is1≠m2Then, then
Figure GDA00032085021700001511
And
Figure GDA00032085021700001512
also, the network parameters of (1) are different, so there is M T in the convergence stage 11An LSTM cell. For example, as shown in fig. 4, in the fusion stage 1, when M ═ 3, T1When the number of LSTM units is 2, the number of LSTM units is 6, for example, LSTM11, LSTM12, and LSTM13 corresponding to time step T ═ 1, and LSTM21, LSTM22, and LSTM23 corresponding to time step T ═ T1 ═ 2.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the hidden states output after the LSTM units of each row at different time steps process the image features of the image to be processed may be fused into one hidden state set, and correspondingly, one hidden state set may also be represented as an image representation information set corresponding to the image to be processed. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. Wherein, the implicit state set correspondingly output by the LSTM unit in the mth row satisfies:
Figure GDA0003208502170000161
wherein, the above
Figure GDA0003208502170000162
Can be represented in the m-th row respectively when the time steps T are 1, 2, …, T1The implicit state of each LSTM unit output. For convenience of description, in the fusion stage 1, at time step t1(e.g., t ═ 1), the hidden state output by the LSTM unit of each row may be represented by hidden state a, which may represent image representation information a of the image to be processed. Time step t after time step t12(e.g., t ═ 1), the hidden state output by the LSTM unit of each row may be represented by hidden state B, and similarly, hidden state B may represent image representation information B of the image to be processed. In the embodiment of the present invention, the hidden state a and the hidden state B may be respectively used to represent hidden states (i.e., image representation information) generated in different time steps in the fusion stage 1, and may specifically be represented in other more forms according to requirements of an actual application scenario, which is not limited herein. That is, after the fusion stage 1, one of the M groups of image features output by the M CNNs corresponds to one hidden state set, and one hidden state set includes at least one hidden state (e.g., hidden state a and hidden state B).
In some possible embodiments, each of the M sets of hidden states may be used in the fusion stage 2, or in the optimization stage of the fusion device.
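A rough sketch of fusion stage 1 under the same assumptions as the attention-LSTM sketch above: one independently parameterized attention-LSTM cell per row m and time step t (M x T1 cells in total), each row attending over its own local features A^(m) while sharing the concatenated previous hidden states H_t of formula (8), and each row collecting its hidden states into a set as in formula (12). Class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class FusionStage1(nn.Module):
    """Sketch: turns M groups of image features into M hidden-state sets (formula (12))."""
    def __init__(self, M, T1, g_dim, a_dim, h_dim):
        super().__init__()
        self.M, self.T1 = M, T1
        # M * T1 attention-LSTM cells; AttentionLSTMCell is the sketch given earlier,
        # with the input block H_t of size M * h_dim (formula (8)).
        self.cells = nn.ModuleList(
            nn.ModuleList(AttentionLSTMCell(M * h_dim, a_dim, h_dim) for _ in range(M))
            for _ in range(T1))
        self.W = nn.ModuleList(nn.Linear(g_dim, h_dim) for _ in range(M))  # W^(m) in formula (6)

    def forward(self, feature_groups):            # feature_groups[m] = (a_bar_m, A_m), A_m: (B, k, a_dim)
        h = [self.W[m](a_bar) for m, (a_bar, _) in enumerate(feature_groups)]
        c = [hm.clone() for hm in h]              # h_0^(m) = c_0^(m)  (formula (6))
        sets = [[] for _ in range(self.M)]
        for t in range(self.T1):
            H_t = torch.cat(h, dim=-1)            # concatenated previous hidden states (formula (8))
            h_new, c_new = [], []
            for m in range(self.M):
                A_m = feature_groups[m][1]
                hm, cm = self.cells[t][m](H_t, A_m, h[m], c[m])   # formula (7)
                h_new.append(hm)
                c_new.append(cm)
                sets[m].append(hm)
            h, c = h_new, c_new
        return sets, h, c                         # M hidden-state sets, plus h_T1^(m) and c_T1^(m)
```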
Fusion stage 2:

In some possible embodiments, the fusion stage 2 includes T2 = 3 time steps; for convenience of description, assume that the fusion stage 2 includes time steps t1 (e.g., t = T1+1), t2 (e.g., t = T1+2) and t3 (e.g., t = T1+T2). As shown in fig. 4, in the fusion stage 2, different time steps may include different LSTM units, and each LSTM unit may be configured to fuse again the M hidden state sets obtained in the fusion stage 1 so as to obtain a final hidden state.

As shown in fig. 4, at the initial time step of the fusion stage 2 (assumed to be time step T0), the initial hidden state of the fusion stage 2 may be determined according to the hidden state of the last time step of the fusion stage 1 (e.g., t = T1, i.e., hidden state B) included in each hidden state set obtained in the fusion stage 1. For convenience of description, this initial hidden state is referred to as a third hidden state; the third hidden state may be another piece of image representation information corresponding to the image to be processed, and may also be referred to as third image representation information. The third hidden state (taking $h_{T_1}$ as an example) and the initial memory cell of the fusion stage 2 (taking $c_{T_1}$ as an example) satisfy:

$$h_{T_1} = \frac{1}{M}\sum_{m=1}^{M} h_{T_1}^{(m)} \quad (13)$$

$$c_{T_1} = \frac{1}{M}\sum_{m=1}^{M} c_{T_1}^{(m)} \quad (14)$$

As shown in formula (13), in the fusion stage 2, the hidden state (taking $h_{T_1}$ as an example) may be initialized by averaging the hidden states output by the LSTM units of each row (the M LSTM units corresponding to the M rows) at the last time step t = T1 of the fusion stage 1. Similarly, as shown in formula (14), in the fusion stage 2, the memory cell (taking $c_{T_1}$ as an example) may be initialized by averaging the memory cells output by the LSTM units of each row (the M LSTM units corresponding to the M rows) at the last time step t = T1 of the fusion stage 1.

In some possible embodiments, in the fusion stage 2, for the hidden states included in the hidden state set output by the LSTM units of any row in the fusion stage 1, the following operations may be performed to obtain the context vector corresponding to each hidden state set.

For each time step (taking time step t as an example), the hidden state $h_t$ and memory cell $c_t$ of the LSTM unit at time step t satisfy:

$$\left[h_t, c_t\right] = \mathrm{LSTM}_t\!\left(h_{t-1}, \left[\hat{z}_t^{(1)}, \hat{z}_t^{(2)}, \dots, \hat{z}_t^{(M)}\right]\right) \quad (15)$$

where in formula (15), $\mathrm{LSTM}_t(\cdot)$ is the LSTM unit of time step t, which will not be repeated below, and $\hat{z}_t^{(m)}$ satisfies:

$$\hat{z}_t^{(m)} = \hat{f}_{att}^{(m)}\!\left(\mathcal{H}^{(m)}, h_{t-1}\right) \quad (16)$$

where in formula (16), $\hat{f}_{att}^{(m)}$ is an attention model of the fusion stage 2 (or fusion stage II), and $\hat{f}_{att}^{(m)}$ is different for different m. Thus, as shown in fig. 4, there are M attention models and T2 LSTM units in the fusion stage 2. One attention model in the fusion stage 2 performs image feature processing on one hidden state set obtained in the fusion stage 1, and based on one of the M attention models, the context vector corresponding to the hidden state set output by one row of LSTM units in the fusion stage 1 can be output. For example, the context vector corresponding to the hidden state set output by the m-th row in the fusion stage 1 can be output by the m-th attention model in the fusion stage 2. At any time step t, a target hidden state can be learned, based on an LSTM unit, from the M context vectors obtained by the M attention models and the hidden state output at time step t-1. Thus, in the fusion stage 2, T2 target hidden states are obtained by processing over the T2 time steps. Here, a target hidden state may also represent target image representation information of the image to be processed. The output of the fusion stage 2 is a set whose elements are hidden states (i.e., image representation information); for convenience of description, it is referred to as the target hidden state set. The target hidden state set satisfies:

$$C = \left\{h_{T_1+1}, h_{T_1+2}, \dots, h_{T_1+T_2}\right\} \quad (17)$$

The target hidden states in the target hidden state set shown in formula (17) may be used in the attention model of the decoder, where each target hidden state is used by the decoder to decode the image to be processed to obtain the image description of the image to be processed.
Third, the decoding stage:
The decoder decodes the final image representation (i.e., the target image representation information, such as the target implicit states) output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18), i.e., it is the memory cell produced at the last time step t = T1 + T2 of fusion stage 2.
in the decoding stage, at any time step t, the decoding of the decoder can be expressed as:
[h_t, c_t] = LSTM_dec(H_t, f_att-dec(C, h_{t-1}))  (19)
where LSTM_dec(·) denotes the LSTM unit of the decoder; in the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. H_t in equation (19) satisfies equation (20). f_att-dec(·) denotes the attention model used in the decoding stage, and C is the set of implicit states output by the fuser, i.e., the set of implicit states obtained by equation (17) above.
In general, let S be a natural sentence generated by a decoder to describe the image content of an image to be processed, and the length of the natural sentence S is n (n is a positive integer), and the value of n can be set according to actual needs. For example: setting n to 30, which means that the natural sentence S has a length of 30 words; the following steps are repeated: when n is set to 25, the natural sentence S is 25 words long. Since the natural sentence S has a length of n, it means that the decoder performs the decoding process n times in total in the decoding stage, that is, the decoder needs to perform the decoding process n time steps, and each decoding process needs to predict one word. I.e. the time step (or decoding time) t of the decoder in the decoding stage1Predicting the word s1At decoding time t2Predicting the word s2By analogy, at decoding time tnPredicting the word sn. That is, in the decoding phase, the decoder is at any decoding time tk(k is a positive integer, and k is more than or equal to 1 and less than or equal to n) predicting to obtain the word skThen, the decoder predicts the natural sentence S ═ S1,s2,...sk,...,sn}。
Optionally, the image processing flow of the embodiment of the present invention may further include an optimization stage, and an implementation manner of the optimization stage is described below, specifically as follows:
Fourthly, the optimization stage:
In some possible embodiments, after the encoder, the fuser and the decoder in the image processing system have encoded and decoded the image to be processed to obtain the natural sentence used for its image description, the fuser may obtain the image description of the image to be processed from the decoder and determine the discrimination supervision loss function of the image processing according to that image description. The fuser can then determine the loss function of the image processing, combining the discrimination supervision loss function with the M implicit state sets of the image to be processed obtained in fusion stage 1 and the target implicit states of the image to be processed obtained in fusion stage 2, and revise the network parameters of the LSTM units adopted in the fuser according to the loss function, so as to optimize the fuser's ability to process any image and output its implicit state sets and target implicit states.
Optionally, the above loss function may also be used to revise the network parameters of the LSTM unit in the decoder, so as to optimize the decoder's ability to process any image and output the natural sentence that describes it.
In some possible embodiments, the fuser may use a Discrimination Supervision (DS) image processing mechanism to further improve its own image processing performance. For example, at any time step the fuser may obtain from the encoder the M groups of image features corresponding to the image to be processed, where any group of image features includes the global image features and the local image features of the image to be processed. The global image features and the local image features in each group of image features output by the encoder can be arranged into a matrix, which, for convenience of description, is denoted V. According to the matrix V and a linear transformation matrix W, the quantity S used for discrimination supervision of the image description corresponding to the image features acquired at this time step is determined, where S satisfies:

S = Row_Max_Pool(WV)  (21)
where W is a linear transformation matrix and Row_Max_Pool(·) is the max-pooling operation along the row vectors of the matrix, i.e., taking the maximum value within each row vector. Denote the i-th element of S as s_i; a discrimination supervision loss function ℓ_DS is then defined, and ℓ_DS satisfies equation (22), in which the index runs over the frequent words appearing in the natural sentence description used for the image description of the image to be processed. Optionally, the frequent words may be taken as the first 1000 words with the highest occurrence probability in the natural sentences describing images. The choice of the first 1000 words is only an example and may be determined according to the actual application scenario, which is not limited herein.
With the discrimination supervision loss function ℓ_DS defined by equation (22), it can further be obtained that, for one <image, description> pair, the loss function ℓ of the image processing system provided by the embodiment of the present invention satisfies a form such as:

ℓ = -Σ_t log p(y_{t+1} | y_t) + λ · ℓ_DS  (23)

where λ is an empirical parameter used to balance the influence of the fuser's loss on the whole image processing system, and its value can be set according to practical experience; y_t is a word, and p(y_{t+1} | y_t) is calculated by applying a linear transformation and a SoftMax operation to the t-th implicit state output by the decoder at the t-th decoding time of the decoding stage. The fuser can correct the network parameters of its LSTM units with this loss function, so that the final image representation of the image to be processed output by the fuser is more accurate and the image processing accuracy of the image processing system is higher.
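A rough sketch of how such a combined loss could be computed is given below: a word-level cross-entropy term for the caption plus a λ-weighted discrimination supervision term built from the max-pooled scores of equation (21). The multi-label binary cross-entropy used here for ℓ_DS is an assumed stand-in, since the exact form of equations (22) and (23) is given only in the original filing.

```python
import torch
import torch.nn.functional as F

def discrimination_supervision_loss(V, W, frequent_targets):
    """V: (batch, d, k) matrix of stacked image features; W: (num_frequent, d)
    linear transformation; frequent_targets: (batch, num_frequent) float 0/1 labels
    marking which frequent words occur in the ground-truth description.
    Multi-label BCE is an assumed stand-in for equation (22)."""
    scores = torch.matmul(W, V)                 # (batch, num_frequent, k)
    s = scores.max(dim=2).values                # Row_Max_Pool(WV), equation (21)
    return F.binary_cross_entropy_with_logits(s, frequent_targets)

def total_loss(word_logits, word_targets, V, W, frequent_targets, lam=0.1):
    """Caption cross-entropy plus a lambda-weighted discrimination supervision term
    (a sketch of the combination described around equation (23))."""
    caption_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)), word_targets.reshape(-1))
    return caption_loss + lam * discrimination_supervision_loss(V, W, frequent_targets)
```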
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed.
Referring to fig. 5, fig. 5 is a flow chart illustrating an image processing method according to an embodiment of the invention. The image processing method provided by the embodiment of the invention can comprise the following steps of S101-S104:
S101, the fusion device obtains M groups of image characteristics of the image to be processed from the encoder, and obtains first image representation information corresponding to each group of image characteristics in the M groups of image characteristics.
In some possible embodiments, the encoder may perform image encoding on the image to be processed through the CNN, may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and may output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image. The global image features and the local image features output by one CNN at the encoder end can be combined to obtain a group of image features of the image to be processed, and M CNNs at the encoder end can correspondingly output M groups of image features of the image to be processed.
Assume that the global image feature obtained by the m-th CNN of the M CNNs through feature extraction on the image to be processed is denoted ā^(m), and that the local image features corresponding to the sub-regions of the image to be processed are denoted A^(m) = {a_1^(m), ..., a_k^(m)}, where m may be any value from 1 to M, which is not limited herein. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where ā^(m) and A^(m) together form the group of image features corresponding to the m-th CNN. In the fusion stage, the fuser can acquire the M groups of image features from the encoder, fuse them to obtain the final image representation of the image to be processed, and output it to the decoder.
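The split of each group into one global feature vector and a set of sub-region (local) feature vectors can be pictured with the small CNN below. The toy network, its layer sizes, and the names are assumptions; any CNN whose fully-connected output gives ā^(m) and whose last convolutional feature map gives A^(m) matches the description.

```python
import torch
import torch.nn as nn

class ToyEncoderCNN(nn.Module):
    """Minimal CNN sketch: the last conv layer provides the k local (sub-region)
    feature vectors A^(m), and a fully-connected layer provides the global feature."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(feat_dim, feat_dim)   # last fully-connected layer

    def forward(self, images):
        fmap = self.backbone(images)                       # (batch, feat_dim, H', W')
        b, d, h, w = fmap.shape
        local = fmap.view(b, d, h * w).permute(0, 2, 1)    # A^(m): k = H'*W' region vectors
        global_feat = self.fc(self.pool(fmap).flatten(1))  # global image feature
        return global_feat, local

# one group of image features from one CNN; M different CNNs would give M groups
encoder = ToyEncoderCNN()
g, A = encoder(torch.randn(2, 3, 64, 64))   # g: (2, 512), A: (2, 256, 512)
```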
In some possible embodiments, the fuser may process the image features obtained from the encoder using an LSTM unit, where the LSTM unit used by the fuser may be an LSTM with an attention model, which is essentially a function with state and can be abstractly represented as h_t = LSTM(H_t, f_att(A, h_{t-1})). Here H_t is constructed from the vector x_t and the implicit state h_{t-1}: x_t is the input of the image feature processing step corresponding to time t (which may be referred to as the t-th time step, or time step t for convenience of description), and h_{t-1} is the implicit state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). A = [a_1, a_2, ..., a_k] is a set of annotation vectors, i.e., A denotes a set whose elements are annotation vectors, for example the local image features A^(m) of the image to be processed mentioned above, which is not limited herein. f_att(A, h_{t-1}) is an attention model; denote its output at the t-th time step as the vector z_t, where z_t is a context vector. In the LSTM unit, the linear transformation is represented by T, and the image feature processing process of the LSTM unit with the attention model may refer to the implementations provided by equations (1) to (3) in fusion stage 1, which are not repeated here.
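The abstract form h_t = LSTM(H_t, f_att(A, h_{t-1})) can be pictured with the following sketch of a soft attention function f_att over the annotation vectors; the additive scoring form and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """f_att(A, h_prev): weight the annotation vectors a_1..a_k by their relevance
    to the previous implicit state and return the context vector z_t (a sketch)."""

    def __init__(self, feat_dim, hidden_dim, att_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, att_dim)
        self.proj_h = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, A, h_prev):
        # A: (batch, k, feat_dim), h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(A) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights over the k regions
        z = (alpha * A).sum(dim=1)               # context vector z_t
        return z
```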
In some possible embodiments, in fusion stage 1, assume that at time step t the implicit state and the memory cell of the m-th row of LSTM units are recorded as h_t^(m) and c_t^(m). The implicit state and the memory cell of each row of LSTM units are initialized before time step t; taking the implicit state h_0^(m) and the memory cell c_0^(m) of the m-th row of LSTM units as an example, h_0^(m) and c_0^(m) satisfy equation (6) above, which is not repeated here. The above h_0^(m) is the initial implicit state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the implicit state corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first implicit state. Therefore, through this implementation, the first implicit state, i.e., the first image representation information, corresponding to each of the M groups of image features can be obtained.
S102, the fusion device generates M image representation information sets according to the image features and the first image representation information corresponding to the image features.
In some possible embodiments, one group of the M groups of image features output by the M CNNs corresponds to one implicit state set (i.e., one image representation information set), and one implicit state set includes at least one implicit state, which, for convenience of description, is referred to as a second implicit state (i.e., second image representation information). At time step t, when the fuser fuses the M groups of image features obtained from the encoder, the implicit state h_t^(m) and the memory cell c_t^(m) of the m-th row of LSTM units in fusion stage 1 satisfy the implementations provided by equations (7) and (8) above, which are not repeated here. Similarly, the context vectors corresponding to the other M-1 groups of image features can be output according to the attention models of the rows other than the m-th row and the first implicit states corresponding to those groups of image features.

In equation (8), LSTM_t^(m)(·) is the LSTM unit of the m-th row at time step t; it learns from the context vector z_m corresponding to the m-th group of image features and the implicit states (e.g., the first implicit states) corresponding to the previous time step t-1 of each group of image features output by the encoder, and outputs the implicit state corresponding to the m-th group of image features.

H_t in equation (8) is a vector obtained by stacking the implicit states corresponding to each group of image features (e.g., the first implicit state corresponding to each group of image features) at the time step before time step t (i.e., time step t-1). For example, assuming time step t = 1, H_1 satisfies the implementation provided by equation (10) above, which is not repeated here.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the output implicit states after the LSTM units of each row at different time steps process the image features of the image to be processed can be fused into one set of implicit states. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. The implicit state set output by the LSTM unit in the mth row correspondingly satisfies the above equation (12), which is not described herein again.
For convenience of description, in fusion stage 1, at time step t1 the implicit state output by the LSTM unit of each row can be represented by implicit state A, and at time step t2, which follows time step t1, the implicit state output by the LSTM unit of each row can be represented by implicit state B. In the embodiment of the present invention, implicit state A and implicit state B are used to represent the implicit states (i.e., image representation information) generated at different time steps in fusion stage 1; they may also be represented in other forms according to the requirements of the actual application scenario, which is not limited herein.
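The row-wise processing of fusion stage 1 described above can be sketched as follows: each of the M rows owns its own attention model and LSTM cell, and the implicit states it emits at successive time steps (implicit state A, implicit state B, and so on) are collected into that row's implicit state set. The concatenated form of H_t, the additive attention, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionStage1(nn.Module):
    """M rows of attention LSTM units: row m attends over the local features of the
    m-th image-feature group and updates its own implicit state. The states emitted
    over the T1 time steps form that row's implicit state set (a sketch only)."""

    def __init__(self, M, feat_dim, hidden_dim):
        super().__init__()
        self.M = M
        self.proj_a = nn.ModuleList([nn.Linear(feat_dim, hidden_dim) for _ in range(M)])
        self.proj_h = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(M)])
        self.score = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(M)])
        # input of row m: its context vector z_m plus the stacked previous states H_t
        self.cells = nn.ModuleList(
            [nn.LSTMCell(feat_dim + M * hidden_dim, hidden_dim) for _ in range(M)])

    def forward(self, locals_, h, c, T1=2):
        # locals_: list of M tensors (batch, k, feat_dim); h, c: lists of M (batch, hidden_dim)
        h, c = list(h), list(c)
        state_sets = [[] for _ in range(self.M)]
        for _ in range(T1):
            H_t = torch.cat(h, dim=1)                      # stacked previous implicit states
            new_h, new_c = [], []
            for m in range(self.M):
                e = self.score[m](torch.tanh(
                    self.proj_a[m](locals_[m]) + self.proj_h[m](h[m]).unsqueeze(1)))
                alpha = torch.softmax(e, dim=1)
                z_m = (alpha * locals_[m]).sum(dim=1)      # context vector of row m
                hm, cm = self.cells[m](torch.cat([z_m, H_t], dim=1), (h[m], c[m]))
                new_h.append(hm)
                new_c.append(cm)
                state_sets[m].append(hm)                   # implicit state A, then B, ...
            h, c = new_h, new_c
        # M implicit state sets, each (batch, T1, hidden_dim)
        return [torch.stack(s, dim=1) for s in state_sets], h, c
```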
S103, the fusion device fuses the second image representation information included in the M image representation information sets to obtain target image representation information.
In some possible embodiments, as shown in fig. 4, in the fusion stage 2, different LSTM units may be included at different time steps, and each LSTM unit may be configured to re-fuse M hidden state sets (i.e., M image representation information sets) obtained by processing in the fusion stage 1 to obtain a final hidden state, which may be illustrated by taking a target hidden state as an example for convenience of description. Here, the target hidden state may be used to represent final image representation information of the image to be processed. Optionally, the final image representation information of the image to be processed may also be represented by information in other representation forms besides the hidden state, which may be specifically determined according to the actual application scenario, and is not limited herein.
As shown in fig. 4, at the initial time step of fusion stage 2 (assumed to be time step T0), the initial implicit state of fusion stage 2 may be determined according to the implicit state of the last time step of fusion stage 1 (e.g., implicit state B when T1 = 2 in fusion stage 1) included in each implicit state set obtained in fusion stage 1; for convenience of description, this implicit state is referred to as the third implicit state. The initialization of the initial implicit state of fusion stage 2 and of its initial memory cell satisfies equations (13) and (14) above, which are not repeated here. As shown in equation (13), in fusion stage 2 the initial implicit state can be obtained by averaging the implicit states output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1. Similarly, as shown in equation (14), in fusion stage 2 the initial memory cell can be obtained by averaging the memory cells output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1.
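Under the averaging initialization just described for equations (13) and (14), the initial state of fusion stage 2 could be computed as in the short sketch below; the tensor layout is an assumption.

```python
import torch

def init_fusion_stage2(last_h, last_c):
    """last_h, last_c: lists of M tensors (batch, hidden_dim) holding the implicit
    states and memory cells output by the M rows at the last time step t = T1 of
    fusion stage 1. Returns their element-wise averages (equations (13) and (14))."""
    h0 = torch.stack(last_h, dim=0).mean(dim=0)
    c0 = torch.stack(last_c, dim=0).mean(dim=0)
    return h0, c0
```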
In some possible embodiments, in fusion stage 2, the following operations may be performed on each implicit state set output by a row of LSTM units in fusion stage 1 to obtain the context vector corresponding to that implicit state set. For each time step (time step t is taken as an example), the implicit state h_t and the memory cell c_t of the LSTM unit at time step t satisfy equations (15) and (16) above, which are not repeated here. As shown in fig. 4, fusion stage 2 has M attention models and T2 LSTM units. One attention model in fusion stage 2 performs image feature processing on one implicit state set obtained in fusion stage 1: based on one of the M attention models, the context vector corresponding to the implicit state set output by one row of LSTM units in fusion stage 1 can be output. For example, the m-th attention model in fusion stage 2 can output the context vector corresponding to the implicit state set m output by the m-th row of fusion stage 1. At any time step t, the M context vectors obtained from the M attention models, together with the implicit state output at time step t-1, are learned by an LSTM unit to obtain a target implicit state. Thus, in fusion stage 2, processing over T2 time steps yields T2 target implicit states. The output of fusion stage 2 is a set whose elements are implicit states; for convenience of description, it is referred to as the target implicit state set. The target implicit state set satisfies equation (17) above, which is not repeated here. The implicit states shown in equation (17) may be used in the attention model of the decoder, where each target implicit state in the target implicit state set is used by the decoder to decode the image to be processed and obtain the image description of the image to be processed.
And S104, outputting the target image representation information to a decoder.
In some possible embodiments, after the fuser obtains the target implicit state (i.e., the target image representation information) corresponding to the image to be processed, the target implicit state may be output to the decoder. In the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. The decoder decodes the final image representation output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18) above, which is not repeated here. In the decoding stage, at any time step t, the decoding of the decoder can be represented by the expression provided by equation (19) above, which is not repeated here.
In general, let S be the natural sentence generated by the decoder to describe the image content of the image to be processed, and let the length of S be n (n is a positive integer), where the value of n can be set according to actual needs. For example, setting n = 30 means that S is 30 words long; likewise, setting n = 25 means that S is 25 words long. Since S has length n, the decoder performs the decoding process n times in the decoding stage, that is, the decoder needs n decoding time steps, and each decoding step predicts one word. That is, at decoding time step t_1 the decoder predicts the word s_1, at decoding time t_2 it predicts the word s_2, and so on, until at decoding time t_n it predicts the word s_n. In other words, in the decoding stage, the decoder predicts the word s_k at any decoding time t_k (k is a positive integer, 1 ≤ k ≤ n), and the natural sentence predicted by the decoder is S = {s_1, s_2, ..., s_k, ..., s_n}.
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed.
Referring to fig. 6, fig. 6 is another schematic flow chart of the image processing method according to an embodiment of the present invention. The image processing method provided by the embodiment of the present invention may include the following steps S201 to S211:
S201, the encoder outputs M groups of image characteristics of the image to be processed to the fusion device.
In some possible embodiments, an original image (i.e., an image to be processed) is input into an encoder, and feature extraction is performed on each frame of the image to be processed by the encoder, so as to obtain a frame feature sequence. The encoder may perform image encoding on an image to be processed through the CNN, and may output a global image feature for representing global information of the image through a last fully-connected layer of the CNN, and output a local image feature set for representing local information of the image through a last convolutional layer (conv layer) of the CNN. The global image feature for representing the global information of the image can be represented by a vector, and the local image feature set for representing the local information can be represented by a vector set. One vector in the set of vectors represents an image feature of a region of the image. The global image features and the local image features output by one CNN at the encoder end can be combined to obtain a group of image features of the image to be processed. Assume that the number of CNNs employed by the encoder side is M, where one encoder corresponds to one CNN, or the number of encoding channels in one encoder is M, where one encoding channel corresponds to one CNN. The network parameters adopted by different CNNs are different, and therefore, the M CNNs corresponding to the M encoders, or the M CNNs corresponding to the M encoding channels of one encoder, may be M different CNNs. The M CNNs at the encoder end can correspondingly output M groups of image characteristics of the image to be processed. Assume that the global image feature obtained by feature extraction of the mth CNN of the M CNNs on the image to be processed is represented as
ā^(m), and that the local image features corresponding to the sub-regions of the image to be processed are expressed as A^(m) = {a_1^(m), ..., a_k^(m)}. The M CNNs extract image features of the image to be processed to obtain M groups of image features, where ā^(m) and A^(m) together form the group of image features corresponding to the m-th CNN.
S202, the fusion device generates first image representation information corresponding to each group of image features according to the global image features and the specified linear transformation matrix in each group of image features in the M groups of image features acquired from the encoder.
In some possible embodiments, the LSTM unit used by the fuser can be an LSTM with an attention model, which is essentially a function with state and can be abstractly represented as h_t = LSTM(H_t, f_att(A, h_{t-1})). Here H_t is constructed from the vector x_t and the implicit state h_{t-1}: x_t is the input of the image feature processing step corresponding to time t (which may be referred to as the t-th time step, or time step t for convenience of description), and h_{t-1} is the implicit state after the previous time step (i.e., the (t-1)-th time step, or time step t-1). Assume that at time step t, the image representation output by the m-th CNN at the encoder side is input to the m-th row of fusion stage 1, as shown in fig. 4. In fusion stage 1, assume that at time step t the implicit state and the memory cell of the m-th row of LSTM units are recorded as h_t^(m) and c_t^(m). The implicit state and the memory cell of each row of LSTM units are initialized before time step t; taking the implicit state h_0^(m) and the memory cell c_0^(m) of the m-th row of LSTM units as an example, h_0^(m) and c_0^(m) satisfy equation (6) above, which is not repeated here. The above h_0^(m) is the initial implicit state corresponding to the m-th group of image features output by the m-th CNN, and may also be recorded as the implicit state (i.e., image representation information) corresponding to the m-th group of image features at the initial time step t0 before time step t; for convenience of description, it is referred to as the first implicit state (i.e., the first image representation information).
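Step S202 derives the first image representation information from the global image feature through a specified linear transformation matrix. A minimal sketch of such a mapping is given below; the tanh nonlinearity, the separate matrices for the implicit state and the memory cell, and the names are assumptions.

```python
import torch
import torch.nn as nn

class InitialStateFromGlobal(nn.Module):
    """Map the global image feature of the m-th group to the initial implicit state
    (first image representation information) and initial memory cell of the m-th
    row, via specified linear transformation matrices (a sketch only)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.W_h = nn.Linear(feat_dim, hidden_dim)   # specified linear transformation
        self.W_c = nn.Linear(feat_dim, hidden_dim)

    def forward(self, global_feat):
        # global_feat: (batch, feat_dim), e.g. the feature ā^(m) of the m-th CNN
        h0 = torch.tanh(self.W_h(global_feat))       # first implicit state h_0^(m)
        c0 = torch.tanh(self.W_c(global_feat))       # initial memory cell c_0^(m)
        return h0, c0
```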
S203, the fusion device learns any group of image features and first image representation information corresponding to each group of image features based on the first LSTM unit to obtain image representation information A corresponding to any group of image features.
S204, the fusion device learns any group of image characteristics and the image representation information A based on the second LSTM unit to obtain image representation information B corresponding to any group of image characteristics.
In some possible embodiments, any one of the M sets of image features output by the encoder end includes both the global image feature of the image to be processed and the sub-region local image feature of the image to be processed. The fusion device learns the sub-region local image features in any one group of image features in the M groups of image features output by the encoder and the first implicit state (i.e. the first image representation information) corresponding to any one group of image features based on the attention model in the LSTM unit (for convenience of description, the first LSTM can be taken as an example) in the fusion stage 1, and outputs the context vector corresponding to any one group of image features. The fuser can learn the context vector and the first hidden state corresponding to any group of image features based on the first LSTM to obtain the hidden state A corresponding to any group of image features.
In some possible embodiments, one group of the M groups of image features output by the M CNNs corresponds to one implicit state set, and one implicit state set includes at least one implicit state, which, for convenience of description, is referred to as a second implicit state. At time step t, when the fuser fuses the M groups of image features obtained from the encoder, the implicit state h_t^(m) and the memory cell c_t^(m) of the m-th row of LSTM units in fusion stage 1 satisfy the implementations provided by equations (7) and (8) above, which are not repeated here. The attention model in the m-th row of fusion stage 1 (or fusion stage I), i.e., the attention model carried in the LSTM unit (e.g., the first LSTM unit) of the m-th row, can output the context vector z_m corresponding to the m-th group of image features according to the sub-region local image features in the m-th group of image features output by the m-th CNN (i.e., A^(m)) and the first implicit state corresponding to the m-th group of image features, where z_m satisfies equation (9) above, which is not repeated here. Similarly, the context vectors corresponding to the other M-1 groups of image features can be output according to the attention models of the rows other than the m-th row and the first implicit states corresponding to those groups of image features.
In equation (8), LSTM_t^(m)(·) is the LSTM unit of the m-th row at time step t; it learns from the context vector z_m corresponding to the m-th group of image features and the implicit states (e.g., the first implicit states) corresponding to the previous time step t-1 of each group of image features output by the encoder, and outputs the implicit state corresponding to the m-th group of image features.

H_t in equation (8) is a vector obtained by stacking the implicit states corresponding to each group of image features (e.g., the first implicit state corresponding to each group of image features) at the time step before time step t (i.e., time step t-1), for example as in the implementations provided by equations (10) and (11) above, which are not repeated here.
In some possible embodiments, different sets of image features obtained by the fuser from different CNNs on the encoder side will be input into different rows of LSTM cells. In the fusion stage 1, the output implicit states after the LSTM units of each row at different time steps process the image features of the image to be processed can be fused into one set of implicit states. Thus, in stage 1 of fusion, M rows of LSTM cells of the fuser may output M different sets of implicit states. The implicit state set output by the LSTM unit in the mth row correspondingly satisfies the condition (12), which is not described herein again.
For convenience of description, in fusion stage 1, at time step t1 (e.g., t = 1), the implicit state output by the LSTM unit of each row can be represented by implicit state A. At time step t2 (e.g., t = 2 = T1) following time step t1, the implicit state output by the LSTM unit of each row can be represented by implicit state B. In the embodiment of the present invention, implicit state A and implicit state B are used to represent the implicit states generated at different time steps in fusion stage 1; they may also be represented in other forms according to the requirements of the actual application scenario, which is not limited herein.
In some possible embodiments, in the fusion stage 1, the fuser may combine the hidden state a and the hidden state B corresponding to any group of image features to obtain a set of hidden states corresponding to the group of image features. For convenience of description, an implicit state set i may represent an implicit state set corresponding to any group of image features, where the implicit state a and the implicit state B are second implicit states included in the implicit state set i. In the fusion stage 1, the fusion device can obtain the hidden state sets corresponding to each group of image features through the LSTM units of each row, and then obtain M hidden state sets corresponding to the M groups of image features. Each hidden state set in the M hidden state sets can be used in the fusion stage 2 to further fuse and obtain a target hidden state of the image to be processed.
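As a small usage-style illustration of the combination just described, the snippet below stacks the implicit state A and implicit state B produced by each row into that row's implicit state set i and gathers the M sets; all shapes and the random placeholder tensors are assumptions.

```python
import torch

batch, hidden_dim, M = 2, 512, 3

# implicit state A (time step t1) and implicit state B (time step t2) for each row
state_A = [torch.randn(batch, hidden_dim) for _ in range(M)]
state_B = [torch.randn(batch, hidden_dim) for _ in range(M)]

# implicit state set i for row m: its states stacked along the time axis
state_sets = [torch.stack([state_A[m], state_B[m]], dim=1) for m in range(M)]

print(len(state_sets), state_sets[0].shape)   # M sets, each (batch, 2, hidden_dim)
```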
S205, the fuser determines third image representing information from the image representing information B included in each set of image representing information.
In some possible embodiments, as shown in fig. 4, fusion stage 2 may include different LSTM units at different time steps, and each LSTM unit may be configured to re-fuse the M implicit state sets obtained in fusion stage 1 to obtain the final implicit state of the image to be processed. As shown in fig. 4, at the initial time step of fusion stage 2 (assumed to be time step T0), the initial implicit state of fusion stage 2 may be determined according to the implicit state of the last time step of fusion stage 1 (e.g., implicit state B at time step t2 = T1 of fusion stage 1) included in each implicit state set obtained in fusion stage 1; for convenience of description, this implicit state is referred to as the third implicit state. The initialization of the initial implicit state of fusion stage 2 and of its initial memory cell satisfies the implementations provided by equations (13) and (14) above, which are not repeated here. As shown in equation (13), in fusion stage 2 the initial implicit state can be obtained by averaging the implicit states output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1. Similarly, as shown in equation (14), in fusion stage 2 the initial memory cell can be obtained by averaging the memory cells output by the LSTM unit of each row (M LSTM units corresponding to the M rows) at the last time step (t = T1) of fusion stage 1.
S206, the third image representing information and the second image representing information in each image representing information set are learned based on the attention model, and a context vector corresponding to each image representing information set is output.
And S207, generating target image representation information according to the target vector matrix formed by the M context vectors and the third image representation information based on the third LSTM unit.
In some possible embodiments, during the processing of fusion stage 2, one implicit state set corresponds to one attention model, and the M context vectors corresponding to the M implicit state sets can be obtained based on M LSTM units carrying attention models. In a specific implementation, in fusion stage 2, for each implicit state set output by a row of LSTM units in fusion stage 1, the following operations may be performed to obtain the context vector corresponding to that implicit state set:

for each time step (time step t is taken as an example), the implicit state h_t and the memory cell c_t of the LSTM unit at time step t satisfy:

[h_t, c_t] = LSTM_t(h_{t-1}, Z_t)  (15)

where LSTM_t(·) in equation (15) is the LSTM unit for time step t, and Z_t = [z_t^(1), ..., z_t^(M)] is the target vector matrix formed by the M context vectors. Each context vector z_t^(m) satisfies:

z_t^(m) = f_att-fus2^(m)(H^(m), h_{t-1})  (16)

where f_att-fus2^(m) in equation (16) is the m-th attention model of fusion stage 2 (or fusion stage II), H^(m) is the m-th implicit state set output by fusion stage 1, and different values of m correspond to different attention models. Each attention model outputs one context vector, so the M attention models output M context vectors, from which the target vector matrix Z_t is obtained.

In some possible embodiments, at time step t1 of fusion stage 2, the fuser may learn the target vector matrix formed by the M context vectors together with the third implicit state based on the LSTM unit corresponding to time step t1 (for convenience of description, LSTM1), to obtain the implicit state C corresponding to the M context vectors at time step t1. Further, at the next time step t2 after time step t1, the fuser may learn the second implicit states included in the M implicit state sets together with the implicit state obtained at time step t1 (for convenience of description, implicit state C) based on the LSTM unit corresponding to time step t2 (for convenience of description, LSTM2), to obtain the implicit state corresponding to the M context vectors at time step t2 (for convenience of description, implicit state D). In fusion stage 2, the LSTM unit corresponding to any later time step (e.g., time step t3) after time step t2 may likewise learn the second implicit states included in the M implicit state sets together with the implicit state output at the previous time step (e.g., time step t2) to obtain the implicit state corresponding to time step t3 (for convenience of description, implicit state E), and so on, until the last time step of fusion stage 2, at which the LSTM unit corresponding to that time step outputs the last implicit state of the image to be processed in fusion stage 2. For convenience of description, the embodiment of the present invention is illustrated by taking two time steps (time step t1 and time step t2) included in fusion stage 2 as an example. The fuser can combine the implicit state C output by LSTM1 and the implicit state D output by LSTM2 in fusion stage 2 to obtain the target implicit state set, and determine implicit state C and implicit state D as the target implicit states of the image to be processed.
For example, as shown in fig. 4, fusion stage 2 has M attention models and T2 LSTM units. One attention model in fusion stage 2 performs image feature processing on one implicit state set obtained in fusion stage 1: based on one of the M attention models, the context vector corresponding to the implicit state set output by one row of LSTM units in fusion stage 1 can be output. For example, the m-th attention model in fusion stage 2 can output the context vector corresponding to the implicit state set m output by the m-th row of fusion stage 1. At time step t, the M context vectors obtained from the M attention models, together with the implicit state output at time step t-1, are learned by an LSTM unit to obtain a target implicit state. Thus, in fusion stage 2, processing over T2 time steps yields T2 target implicit states. The output of fusion stage 2 is a set whose elements are implicit states; for convenience of description, it is referred to as the target implicit state set. The target implicit state set satisfies equation (17) above, which is not repeated here. The implicit state set shown in equation (17) can be used in the attention model of the decoder, where each target implicit state is used by the decoder to decode the image to be processed and obtain the image description of the image to be processed.
In a specific implementation, the fusion of the M groups of image features obtained from the encoder by the fusion device may include a fusion stage 1 and a fusion stage 2, each stage may include a plurality of image feature processing steps, and more implementation manners provided in the fusion stage 1 and the fusion stage 2 may be specifically referred to, and are not described herein again.
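Taking T2 = 2 as in the example above, the usage-style sketch below runs two fusion stage 2 steps (LSTM1 and LSTM2) over randomly generated implicit state sets and collects implicit state C and implicit state D as the target implicit state set, together with the final memory cell handed to the decoder. The inline attention form, the zero initial memory cell, and all dimensions are assumptions rather than the filed design.

```python
import torch
import torch.nn as nn

batch, hidden_dim, M, T1, T2 = 2, 512, 3, 2, 2

# M implicit state sets from fusion stage 1, each (batch, T1, hidden_dim)
state_sets = [torch.randn(batch, T1, hidden_dim) for _ in range(M)]

# initialization: average the last-step states of the M rows (the third implicit state)
h = torch.stack([s[:, -1] for s in state_sets]).mean(dim=0)
c = torch.zeros(batch, hidden_dim)   # assumed zero initial memory, for brevity only

# one attention scorer per implicit state set, and one LSTM cell per time step
scorers = [nn.Linear(2 * hidden_dim, 1) for _ in range(M)]
cells = [nn.LSTMCell(M * hidden_dim, hidden_dim) for _ in range(T2)]   # LSTM1, LSTM2

targets = []
for step in range(T2):
    contexts = []
    for m in range(M):
        query = h.unsqueeze(1).expand(-1, T1, -1)
        e = scorers[m](torch.cat([state_sets[m], query], dim=2))   # (batch, T1, 1)
        alpha = torch.softmax(e, dim=1)
        contexts.append((alpha * state_sets[m]).sum(dim=1))        # context vector m
    h, c = cells[step](torch.cat(contexts, dim=1), (h, c))
    targets.append(h)                      # implicit state C, then implicit state D

target_set = torch.stack(targets, dim=1)   # target implicit state set, (batch, T2, hidden_dim)
final_memory = c                           # memory cell handed on to the decoder
```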
S208, the target image representation information is output to the decoder.
In some possible embodiments, after the fuser obtains the target implicit state (i.e., the target image representation information) corresponding to the image to be processed, the target implicit state may be output to the decoder. In the decoding stage of the decoder, the network parameters of the LSTM unit used for decoding are the same at every time step. The decoder decodes the final image representation output by the fuser based on an LSTM unit with an attention model, and at each time step outputs the vocabulary corresponding to the image representation obtained by the fuser from the image features output by the encoder, so that a natural sentence for describing the image to be processed can be obtained. The final image representation output by the fuser may include the implicit state set corresponding to equation (17) above and the memory cell c_{T1+T2} output by the last LSTM unit in fusion stage 2, where c_{T1+T2} satisfies equation (18) above, which is not repeated here. In the decoding stage, at any time step t, the decoding of the decoder can be represented by the expression provided by equation (19) above, which is not repeated here.
In general, let S be the natural sentence generated by the decoder to describe the image content of the image to be processed, and let the length of S be n (n is a positive integer), where the value of n can be set according to actual needs. For example, setting n = 30 means that S is 30 words long; likewise, setting n = 25 means that S is 25 words long. Since S has length n, the decoder performs the decoding process n times in the decoding stage, that is, the decoder needs n decoding time steps, and each decoding step predicts one word. That is, at decoding time step t_1 the decoder predicts the word s_1, at decoding time t_2 it predicts the word s_2, and so on, until at decoding time t_n it predicts the word s_n. In other words, in the decoding stage, the decoder predicts the word s_k at any decoding time t_k (k is a positive integer, 1 ≤ k ≤ n), and the natural sentence predicted by the decoder is S = {s_1, s_2, ..., s_k, ..., s_n}.
S209, the fusion device obtains the image description of the image to be processed from the decoder, and determines the discrimination supervision loss function of the image processing according to the image description of the image to be processed.
S210, the fusion device constructs a loss function of the image processing system by combining the discrimination supervision loss function according to the M image representation information sets and the target image representation information of the image to be processed.
S211, the fusion device corrects the network parameters of the LSTM unit according to the loss function, and the image processing performance of the image processing system is optimized.
In some possible embodiments, after the encoder, the fuser and the decoder in the image processing system have encoded and decoded the image to be processed to obtain the natural sentence used for its image description, the fuser may obtain the image description of the image to be processed from the decoder and determine the discrimination supervision loss function of the image processing system according to that image description. The fuser can then determine the loss function of the image processing, combining the discrimination supervision loss function with the M implicit state sets of the image to be processed obtained in fusion stage 1 and the target implicit states of the image to be processed obtained in fusion stage 2, and revise the network parameters of the LSTM units adopted in the fuser according to the loss function, so as to optimize the fuser's ability to process any image and output its implicit state sets and target implicit states.
Optionally, the above loss function may also be used to revise the network parameters of the LSTM unit in the decoder, so as to optimize the decoder's ability to process any image and output the natural sentence that describes it.
In some possible embodiments, the fuser may use a discrimination supervision image processing mechanism to further improve its own image processing performance. For example, at any time step the fuser may obtain from the encoder the M groups of image features corresponding to the image to be processed, where any group of image features includes the global image features and the local image features of the image to be processed. The global image features and the local image features in each group of image features output by the encoder can be arranged into a matrix, which, for convenience of description, is denoted V. According to the matrix V and the linear transformation matrix W, the quantity S used for discrimination supervision of the image description corresponding to the image features acquired at this time step is determined, where S satisfies equation (21) above, which is not repeated here.
In some possible embodiments, denote the i-th element of S as s_i; a discrimination supervision loss function ℓ_DS is then defined, and ℓ_DS satisfies equation (22) above, in which the index runs over the frequent words appearing in the natural sentence description used for the image description of the image to be processed. Optionally, the frequent words may be taken as the first 1000 words with the highest occurrence probability in the natural sentences describing images. The choice of the first 1000 words is only an example and may be determined according to the actual application scenario, which is not limited herein.
With the discrimination supervision loss function ℓ_DS defined by equation (22), it can further be obtained that, for one <image, description> pair, the loss function ℓ of the image processing system provided by the embodiment of the present invention satisfies a form such as:

ℓ = -Σ_t log p(y_{t+1} | y_t) + λ · ℓ_DS  (23)

where λ is an empirical parameter used to balance the influence of the fuser's loss on the whole image processing system, and its value can be set according to practical experience; y_t is a word, and p(y_{t+1} | y_t) is calculated by applying a linear transformation and a SoftMax operation to the t-th implicit state output by the decoder at the t-th decoding time of the decoding stage. The fuser can correct the network parameters of its LSTM units with this loss function, so that the final image representation of the image to be processed output by the fuser is more accurate and the image processing accuracy of the image processing system is higher.
In the embodiment of the present invention, the image features output by each of the multiple CNNs at the encoder side and their corresponding implicit states are fused by the fuser at different time steps to obtain multiple implicit state sets, and the richer implicit states and other image representation information contained in these implicit state sets are then fused to obtain the final implicit state corresponding to the image to be processed. Because obtaining the final implicit state fuses more implicit states and other image features of the image to be processed, image features with richer content can be acquired; the final implicit state of the image to be processed is obtained by processing these richer image features and is then output to the decoder. The decoder decodes the image to be processed using the final implicit state obtained by the fuser to obtain an image description, such as a natural sentence describing the image to be processed, which improves the accuracy of the image description of the image to be processed, improves the service quality of understanding the image content of the image to be processed, and can enhance the user experience of the image content understanding service for the image to be processed. In addition, the image processing method provided by the embodiment of the present invention can also construct a loss function from the output data of the fuser and the decoder, modify the network parameters of the LSTM units in the fuser and/or the decoder through this loss function, and optimize the performance of the fuser and the decoder, thereby further improving the image processing performance of the image processing system and enhancing the user stickiness of the image processing system.
Based on the description of the embodiments of the image processing system and the image processing method, the embodiment of the invention also discloses an image processing apparatus, which can be a computer program (including a program code) running in a server, and the image processing apparatus can be applied to the image processing methods of the embodiments shown in fig. 5-6 for executing the steps in the image processing methods. Referring to fig. 7, the image processing apparatus operates as follows:
an obtaining unit 61, configured to obtain M sets of image features of the image to be processed from the encoder, where M is an integer not less than 2.
The obtaining unit 61 is further configured to obtain a first hidden state corresponding to each of the M groups of image features.
A first fusion unit 62, configured to generate M sets of image representation information according to the sets of image features acquired by the acquisition unit 61 and the first image representation information corresponding to the sets of image features.
The image representation information set is generated by a group of image features correspondingly, and the image representation information set comprises at least one piece of second image representation information.
A second fusion unit 63 configured to learn second image representation information included in the M image representation information sets obtained by the first fusion unit 62 to obtain target image representation information;
an output unit 64 for outputting the target image representation information obtained by the second fusion unit 63 to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In some possible embodiments, the obtaining unit 61 is configured to:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
Acquiring M groups of image features of an image to be processed from M encoding channels of an encoder of the image processing system, wherein one encoding channel of the encoder corresponds to one group of encoding parameters, and one encoding channel outputs one group of image features.
In some possible embodiments, each of the M groups of image features includes a global image feature of the image to be processed;
the above-mentioned acquisition unit 61 is configured to:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In some possible embodiments, the first fusing unit 62 is configured to:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any one set of image features to obtain an image representation information set i corresponding to any one set of image features, wherein the image representation information A and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In some possible embodiments, each of the M groups of image features further includes a sub-region local image feature of the image to be processed;
the first fusing unit 62 is configured to:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any one group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any one group of image features.
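Where each group also carries sub-region local image features, the attention step inside the first LSTM unit can be sketched as follows, assuming an additive attention form and the stated layer sizes: the sub-regions are weighted against the group's first image representation information to give a context vector, which the first LSTM unit then consumes.

```python
import torch
import torch.nn as nn

region_dim, hidden_dim = 2048, 512
att_region = nn.Linear(region_dim, hidden_dim)   # projects sub-region local features
att_hidden = nn.Linear(hidden_dim, hidden_dim)   # projects first image representation info
att_score = nn.Linear(hidden_dim, 1)
lstm1 = nn.LSTMCell(region_dim, hidden_dim)      # first LSTM unit, fed the context vector

def attend_and_encode(region_feats, first_rep):
    """region_feats: (batch, num_regions, region_dim); first_rep: (batch, hidden_dim)."""
    # Additive attention scores of each sub-region against first_rep (an assumption).
    scores = att_score(torch.tanh(att_region(region_feats)
                                  + att_hidden(first_rep).unsqueeze(1)))   # (B, R, 1)
    weights = torch.softmax(scores, dim=1)
    context = (weights * region_feats).sum(dim=1)     # context vector, (B, region_dim)
    # The first LSTM unit then learns the context vector with first_rep as prior state.
    h_a, _ = lstm1(context, (first_rep, torch.zeros_like(first_rep)))
    return context, h_a                               # h_a is image representation info A
```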
In some possible embodiments, the second fusing unit 63 is configured to:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning based on the third LSTM unit according to the M context vectors in the target vector matrix and the third image representation information to obtain target image representation information.
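A sketch of how the second fusion unit might form the M context vectors and the target vector matrix, under assumed shapes: the third image representation information queries each image representation information set through that set's own attention model, and the resulting M context vectors are stacked into the target vector matrix. The additive attention form is an assumption of this sketch.

```python
import torch
import torch.nn as nn

hidden_dim, M = 512, 2

class SetAttention(nn.Module):
    """One attention model per image representation information set."""
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, set_reps, third_rep):
        # set_reps: (batch, set_size, dim) second image representation information.
        s = self.score(torch.tanh(self.key(set_reps)
                                  + self.query(third_rep).unsqueeze(1)))   # (B, S, 1)
        w = torch.softmax(s, dim=1)
        return (w * set_reps).sum(dim=1)            # context vector, (batch, dim)

set_attentions = nn.ModuleList(SetAttention(hidden_dim) for _ in range(M))

def target_vector_matrix(rep_sets, third_rep):
    """rep_sets: list of M tensors (batch, set_size, hidden_dim), e.g. A and B stacked."""
    contexts = [att(reps, third_rep) for att, reps in zip(set_attentions, rep_sets)]
    return torch.stack(contexts, dim=1)             # target vector matrix, (batch, M, dim)
```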
In some possible embodiments, the third LSTM unit used in the second fusion unit 63 at least includes LSTM1 and LSTM2, and the second fusion unit 63 is configured to:
learning the M context vectors in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D included in the target image representation information set as target image representation information.
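The two-stage third LSTM unit can be sketched, under assumptions, as follows: LSTM1 digests the M context vectors of the target vector matrix with the third image representation information as its prior state to give image representation information C, and LSTM2 then digests the per-set second image representation information with C as its prior state to give D; C and D together form the target image representation information. Flattening by concatenation is an assumption of this sketch.

```python
import torch
import torch.nn as nn

hidden_dim, M = 512, 2
lstm_f1 = nn.LSTMCell(M * hidden_dim, hidden_dim)        # "LSTM1" of the third LSTM unit
lstm_f2 = nn.LSTMCell(2 * M * hidden_dim, hidden_dim)    # "LSTM2" of the third LSTM unit

def third_lstm(target_matrix, third_rep, rep_sets):
    """target_matrix: (batch, M, hidden_dim); third_rep: (batch, hidden_dim);
    rep_sets: list of M dicts holding tensors A and B of shape (batch, hidden_dim)."""
    batch = target_matrix.size(0)
    zeros = torch.zeros_like(third_rep)
    # C: LSTM1 learns the flattened context vectors with third_rep as its prior state.
    h_c, c_c = lstm_f1(target_matrix.reshape(batch, -1), (third_rep, zeros))
    # D: LSTM2 learns all second image representation information with C as its prior state.
    seconds = torch.cat([torch.cat([s["A"], s["B"]], dim=1) for s in rep_sets], dim=1)
    h_d, _ = lstm_f2(seconds, (h_c, c_c))
    # The target image representation information set contains C and D.
    return {"C": h_c, "D": h_d}
```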
In some possible embodiments, the image processing apparatus further includes an optimization unit 65 configured to:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and modifying, according to the loss function, the network parameters of the LSTM unit adopted by the fusion device.
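To make the optimization step concrete, here is a rough sketch under loudly stated assumptions: the discrimination supervision loss is interpreted as a word-level cross-entropy over the decoder's predicted description, and the term built from the M image representation information sets and the target image representation information is a placeholder regularizer; neither the weighting nor the auxiliary form is the patent's actual loss construction.

```python
import torch
import torch.nn.functional as F

def image_processing_loss(word_logits, target_words, rep_sets, target_rep, aux_weight=0.1):
    """word_logits: (batch, seq_len, vocab); target_words: (batch, seq_len) token ids;
    rep_sets: list of M tensors (batch, set_size, hidden_dim); target_rep: (batch, hidden_dim)."""
    # Discrimination supervision loss derived from the image description.
    ce = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                         target_words.reshape(-1))
    # Placeholder auxiliary term tying the M representation sets to the target representation.
    aux = sum(F.mse_loss(s.mean(dim=1), target_rep) for s in rep_sets) / len(rep_sets)
    return ce + aux_weight * aux

# loss.backward() and an optimizer step would then correct the fuser's LSTM parameters.
```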
According to an embodiment of the present invention, steps S101-S104 involved in the image processing method shown in fig. 5 may be performed by respective units in the image processing apparatus shown in fig. 7. For example, steps S101, S102, S103, S104 shown in fig. 5 may be performed by the acquisition unit 61, the first fusion unit 62, the second fusion unit 63, and the output unit 64 shown in fig. 7, respectively.
According to an embodiment of the present invention, steps S201 to S211 related to the image processing method shown in fig. 6 may be executed by each unit in the image processing apparatus shown in fig. 7, and specific reference may be made to implementation manners provided by each step in the embodiment corresponding to fig. 6, which are not described herein again.
According to another embodiment of the present invention, the units in the image processing apparatus shown in fig. 7 may be combined, individually or entirely, into one or several other units, or some unit(s) thereof may be further split into multiple functionally smaller units; either arrangement achieves the same operation without affecting the technical effects of the embodiments of the invention. The units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present invention, the image processing apparatus may also include other units, and in practical applications these functions may be implemented with the assistance of other units and through the cooperation of multiple units.
In the embodiment of the invention, the fusion device fuses, at different time steps, the image features output by each of the plurality of CNNs at the encoder end together with their corresponding hidden states to obtain a plurality of hidden state sets, and then fuses the richer hidden states and other image features contained in these hidden state sets to obtain the final hidden state corresponding to the image to be processed. Because more hidden states of the image to be processed are fused when the final hidden state is acquired, image features with richer content can be obtained; the final hidden state of the image to be processed is derived from these richer image features and is output to the decoder. The decoder decodes the image to be processed by using the final hidden state obtained by the fusion device, and thereby obtains an image description, such as a natural sentence, describing the image to be processed. This improves the accuracy of the image description, improves the quality of the content understanding service for the image to be processed, and can enhance the user experience of that service. In addition, the image processing method provided by the embodiment of the invention can also construct a loss function from the output data of the fusion device and the decoder, and use the loss function to modify the network parameters of the LSTM units in the fusion device and/or the decoder, which optimizes the performance of the fusion device and the decoder, further improves the image processing performance of the image processing system, and enhances the user stickiness of the image processing system.
Based on the image processing system and the image processing method in the embodiments, the embodiment of the invention also provides a server. Referring to fig. 8, the internal structure of the server at least includes the image processing system shown in fig. 2, that is, includes an encoder, a fuser and a decoder, and further, the server also includes a processor, a communication interface and a computer storage medium. The processor, the communication interface and the computer storage medium in the server may be connected by a bus or other means, and fig. 8 shows an example of the communication bus connection according to the embodiment of the present invention.
The communication interface is the medium through which the server interacts and exchanges information with external devices (such as terminal devices). The processor (or Central Processing Unit, CPU) is the computing core and control core of the server; it is understood that the processor herein may also be a processor integrated in the fusion device, and is adapted to implement one or more instructions, specifically to load and execute one or more instructions so as to implement the corresponding method flow or function. The computer storage medium (Memory) is a memory device in the server for storing programs and data. It is understood that the computer storage medium herein may include both the built-in storage medium of the server and, of course, any extended storage medium supported by the server. The computer storage medium provides storage space that stores the operating system of the server. One or more instructions, which may be one or more computer programs (including program code), are also stored in this storage space and are adapted to be loaded and executed by the processor. The computer storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one computer storage medium located remotely from the aforementioned processor.
In the embodiment of the present invention, the processor loads and executes one or more instructions stored in the computer storage medium to implement the corresponding steps in the method flows shown in fig. 5 to 6; in a specific implementation, one or more instructions in a computer storage medium are loaded by a processor and perform the following steps:
acquiring M groups of image characteristics of an image to be processed from an encoder, wherein M is an integer not less than 2;
acquiring first image representation information corresponding to each group of image features in the M groups of image features;
generating M image representation information sets according to the image characteristics of each group and first image representation information corresponding to the image characteristics of each group, wherein one image representation information set generated corresponding to one image characteristic of each group comprises at least one piece of second image representation information;
fusing second image representation information included in the M image representation information sets to obtain target image representation information, and outputting the target image representation information to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed.
In one embodiment, in the process of the processor loading one or more instructions in a computer storage medium to execute the step of acquiring M groups of image features of the image to be processed from the encoder, the following steps are specifically executed:
acquiring M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters and one encoder outputs one group of image characteristics; or
The method comprises the steps of obtaining M groups of image characteristics of an image to be processed from M coding channels of an encoder of the image processing system, wherein one coding channel of the encoder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
In another embodiment, each of the M groups of image features includes a global image feature of the image to be processed; in the process of the processor loading one or more instructions in the computer storage medium to execute the step of obtaining the first image representation information corresponding to each group of image features in the M groups of image features, the following steps are specifically executed:
and generating first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
In another embodiment, in the process of the processor loading one or more instructions in the computer storage medium and executing the step of generating M image representation information sets for the respective sets of image features and the first image representation information corresponding to the respective sets of image features, the following steps are specifically executed:
learning any group of image features and first image representation information corresponding to the image features on the basis of a first LSTM unit to obtain image representation information A corresponding to the image features;
learning any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information a and the image representation information B corresponding to any one set of image features to obtain an image representation information set i corresponding to any one set of image features, wherein the image representation information a and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
In another embodiment, each of the M groups of image features further includes a sub-region local image feature of the image to be processed, and the processor loads one or more instructions in a computer storage medium to perform the step of learning, based on the first LSTM unit, any group of image features and the first image representation information corresponding to the group of image features to obtain the image representation information a corresponding to the group of image features, specifically performing the following steps:
learning the partial image features of the sub-region in any group of image features and first image representation information corresponding to any group of image features based on an attention model in a first LSTM unit and outputting context vectors corresponding to any group of image features;
and learning the context vector corresponding to any group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any group of image features.
In another embodiment, in the process that the processor loads one or more instructions in the computer storage medium to perform the step of fusing the second image representation information included in the M image representation information sets to obtain the target image representation information, the following steps are specifically performed:
determining third image representation information according to image representation information B included in each of the M image representation information sets;
and executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and the second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
target image representation information is generated based on a third LSTM unit from the target vector matrix and the third image representation information.
In yet another embodiment, the third LSTM unit includes at least LSTM1 and LSTM2, and the following steps are specifically performed in the process of the processor loading one or more instructions in the computer storage medium to execute the step of generating the target image representation information based on the third LSTM unit according to the target vector matrix and the third image representation information:
learning the M context vectors included in the target vector matrix and the third image representing information based on the LSTM1 to obtain image representing information C;
learning second image representing information and the image representing information C included in the M image representing information sets based on the LSTM2 to obtain image representing information D;
combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
In yet another embodiment, the processor loads one or more instructions in the computer storage medium to perform the following steps:
acquiring the image description of the image to be processed from the decoder, and determining a discrimination supervision loss function of image processing according to the image description of the image to be processed;
according to the M image representation information sets of the image to be processed and the target image representation information, constructing a loss function of an image processing system by combining the discrimination supervision loss function;
and correcting the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
In the embodiment of the invention, the image features output by each of the plurality of CNNs at the encoder end and their corresponding hidden states are fused at different time steps to obtain a plurality of hidden state sets, and the richer hidden states and other image features contained in these hidden state sets are then fused to obtain the final hidden state corresponding to the image to be processed. Because more hidden states of the image to be processed are fused when the final hidden state is acquired, image features with richer content can be obtained; the final hidden state of the image to be processed is derived from these richer image features and is output to the decoder. The decoder decodes the image to be processed by using the final hidden state obtained through fusion, and thereby obtains an image description, such as a natural sentence, describing the image to be processed. This improves the accuracy of the image description, improves the quality of the content understanding service for the image to be processed, and can enhance the user experience of that service. In addition, the image processing method provided by the embodiment of the invention can also construct a loss function from the output data of the fusion device and the decoder, and use the loss function to modify the network parameters of the LSTM units in the fusion device and/or the decoder, which optimizes the performance of the fusion device and the decoder, further improves the image processing performance of the image processing system, and enhances the user stickiness of the image processing system.
The above disclosure describes only preferred embodiments of the present invention and, of course, cannot be taken to limit the scope of the claims of the invention; equivalent variations made in accordance with the claims of the present invention still fall within the scope of the invention.

Claims (10)

1. An image processing method applied to an image processing system, the image processing system comprising an encoder and a decoder, the image processing system further comprising a fuser, the method comprising:
the fusion device obtains M groups of image characteristics of the image to be processed from the encoder;
the fusion device acquires first image representation information corresponding to each group of image features in the M groups of image features;
the fusion device generates M image representation information sets according to each group of image features and the first image representation information corresponding to each group of image features, wherein one group of image features correspondingly generates one image representation information set, and one image representation information set comprises at least one piece of second image representation information;
the fusion device fuses second image representation information included in the M image representation information sets to obtain target image representation information, and the target image representation information is output to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed;
wherein the fusing device fuses second image representation information included in the M image representation information sets to obtain target image representation information includes:
the fusion device determines third image representation information according to image representation information B included in each image representation information set in the M image representation information sets, wherein the image representation information B included in each image representation information set is image representation information obtained last in each image representation information set;
executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and processing the target vector matrix and the third image representation information based on a third long short-term memory (LSTM) unit to generate the target image representation information.
2. The method of claim 1, wherein the fuser acquiring M sets of image features of the image to be processed from the encoder comprises:
the fusion device acquires M groups of image characteristics of an image to be processed from M encoders included in the image processing system, wherein one encoder corresponds to one group of encoding parameters, and one encoder outputs one group of image characteristics; or
The fusion device obtains M groups of image characteristics of the image to be processed from M coding channels of the coder of the image processing system, wherein one coding channel of the coder corresponds to one group of coding parameters, and one coding channel outputs one group of image characteristics.
3. The method according to claim 1 or 2, wherein each of the M sets of image features includes a global image feature of the image to be processed;
the acquiring, by the fusion device, first image representation information corresponding to each of the M sets of image features includes:
and the fusion device generates first image representation information corresponding to each group of image features according to the global image features in each group of image features in the M groups of image features and the specified linear transformation matrix.
4. The method of claim 3, wherein the fuser generating M sets of image representation information based on the sets of image features and first image representation information corresponding to the sets of image features comprises:
the fusion device learns any group of image features and the first image representation information corresponding to each group of image features based on a first long short-term memory (LSTM) unit to obtain image representation information A corresponding to any group of image features;
the fusion device learns any group of image characteristics and the image representation information A based on a second LSTM unit to obtain image representation information B corresponding to any group of image characteristics;
combining the image representation information A and the image representation information B corresponding to any group of image features to obtain an image representation information set i corresponding to any group of image features, wherein the image representation information A and the image representation information B are second image representation information included in the image representation information set i;
and acquiring image representation information sets corresponding to all groups of image features to obtain M image representation information sets corresponding to the M groups of image features.
5. The method of claim 4, wherein each of the M sets of image features further includes a subregion local image feature of the image to be processed;
the fusion device learns any group of image features and first image representation information corresponding to each group of image features based on a first LSTM unit, and the obtaining of the image representation information A corresponding to any group of image features comprises:
the fusion device learns the partial image features of the sub-region in any group of image features and the first image representation information corresponding to any group of image features based on the attention model in the first LSTM unit and outputs context vectors corresponding to any group of image features;
the fusion device learns the context vector corresponding to any group of image features and the first image representation information corresponding to each group of image features based on the first LSTM unit to obtain the image representation information A corresponding to any group of image features.
6. The method of claim 5, wherein the third LSTM unit comprises at least LSTM1 and LSTM2, and the generating of the target image representation information from the target vector matrix and the third image representation information based on the third LSTM unit comprises:
learning the M context vectors and the third image representation information included in the target vector matrix based on the LSTM1 to obtain image representation information C;
learning second image representation information and the image representation information C included in the M image representation information sets based on the LSTM2 to obtain image representation information D;
and combining the image representation information C and the image representation information D to obtain a target image representation information set, and determining the image representation information C and the image representation information D in the target image representation information set as target image representation information.
7. The method of claim 5 or 6, further comprising:
the fusion device acquires the image description of the image to be processed from the decoder, and determines a discrimination supervision loss function of image processing according to the image description of the image to be processed;
the fusion device constructs a loss function of an image processing system by combining the discrimination supervision loss function according to the M image representation information sets of the image to be processed and the target image representation information;
and the fusion device corrects the network parameters of the LSTM unit adopted by the fusion device according to the loss function.
8. An image processing apparatus applied to an image processing system including an encoder and a decoder, wherein the image processing system further includes a fuser, the apparatus is the fuser, and the apparatus includes:
an acquisition unit for acquiring M sets of image features of an image to be processed from the encoder;
the acquisition unit is further configured to acquire first image representation information corresponding to each of the M groups of image features;
a first fusion unit, configured to generate M image representation information sets according to the groups of image features acquired by the acquisition unit and the first image representation information corresponding to the groups of image features, wherein one group of image features corresponds to one generated image representation information set, and one image representation information set comprises at least one piece of second image representation information;
a second fusion unit, configured to fuse second image representation information included in the M image representation information sets obtained by the first fusion unit to obtain target image representation information;
an output unit configured to output the target image representation information obtained by the second fusion unit to the decoder;
the target image representation information is used for the decoder to decode the image to be processed to obtain the image description of the image to be processed;
wherein the second fusion unit is configured to:
determining third image representation information according to image representation information B included in each image representation information set in the M image representation information sets, wherein the image representation information B included in each image representation information set is image representation information obtained last in each image representation information set;
executing the following operations on any image representation information set j in each image representation information set to obtain a context vector corresponding to the image representation information set j:
learning the third image representation information and second image representation information in the image representation information set j based on an attention model and outputting a context vector corresponding to the image representation information set j, wherein one image representation information set corresponds to one attention model;
acquiring M context vectors corresponding to the M image representation information sets, and obtaining a target vector matrix according to the M context vectors;
and learning the M context vectors in the target vector matrix and the third image representation information based on a third LSTM unit to obtain target image representation information.
9. A computer storage medium having one or more instructions stored thereon, the one or more instructions adapted to be loaded by a processor and to perform the image processing method of any of claims 1-7.
10. A server comprising an image processing system including an encoder and a decoder, characterized in that the image processing system further comprises a fuser comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the image processing method of any of claims 1-7.
CN201810442810.0A 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server Active CN108665506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810442810.0A CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810442810.0A CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Publications (2)

Publication Number Publication Date
CN108665506A CN108665506A (en) 2018-10-16
CN108665506B true CN108665506B (en) 2021-09-28

Family

ID=63778945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810442810.0A Active CN108665506B (en) 2018-05-10 2018-05-10 Image processing method, image processing device, computer storage medium and server

Country Status (1)

Country Link
CN (1) CN108665506B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859210B (en) * 2018-12-25 2021-08-06 上海联影智能医疗科技有限公司 Medical data processing device and method
CN109902723A (en) * 2019-01-31 2019-06-18 北京市商汤科技开发有限公司 Image processing method and device
CN109919888B (en) * 2019-02-26 2023-09-19 腾讯科技(深圳)有限公司 Image fusion method, model training method and related device
CN110069994B (en) * 2019-03-18 2021-03-23 中国科学院自动化研究所 Face attribute recognition system and method based on face multiple regions
CN110084128B (en) * 2019-03-29 2021-12-14 安徽艾睿思智能科技有限公司 Scene graph generation method based on semantic space constraint and attention mechanism
CN110310253B (en) * 2019-05-09 2021-10-12 杭州迪英加科技有限公司 Digital slice classification method and device
CN110458242A (en) * 2019-08-16 2019-11-15 广东工业大学 A kind of iamge description generation method, device, equipment and readable storage medium storing program for executing
CN110309839B (en) * 2019-08-27 2019-12-03 北京金山数字娱乐科技有限公司 A kind of method and device of iamge description
CN111191791B (en) * 2019-12-02 2023-09-29 腾讯云计算(北京)有限责任公司 Picture classification method, device and equipment based on machine learning model
US20210279386A1 (en) * 2020-03-05 2021-09-09 International Business Machines Corporation Multi-modal deep learning based surrogate model for high-fidelity simulation
CN113763232A (en) * 2020-08-10 2021-12-07 北京沃东天骏信息技术有限公司 Image processing method, device, equipment and computer readable storage medium
US11775617B1 (en) * 2021-03-15 2023-10-03 Amazon Technologies, Inc. Class-agnostic object detection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830709B2 (en) * 2016-03-11 2017-11-28 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 A kind of answering method of knowledge based collection of illustrative plates
CN107665356A (en) * 2017-10-18 2018-02-06 北京信息科技大学 A kind of image labeling method
CN107945282A (en) * 2017-12-05 2018-04-20 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) The synthesis of quick multi-view angle three-dimensional and methods of exhibiting and device based on confrontation network
CN107979764A (en) * 2017-12-06 2018-05-01 中国石油大学(华东) Video caption generation method based on semantic segmentation and multilayer notice frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning Multimodal Attention LSTM Networks for Video Captioning; Jun Xu et al.; Proceedings of the 25th ACM International Conference on Multimedia; 2017-10-31; Section 3 and Fig. 2, pages 539-541 *

Also Published As

Publication number Publication date
CN108665506A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
CN108665506B (en) Image processing method, image processing device, computer storage medium and server
US10762305B2 (en) Method for generating chatting data based on artificial intelligence, computer device and computer-readable storage medium
US11853709B2 (en) Text translation method and apparatus, storage medium, and computer device
CN109271646B (en) Text translation method and device, readable storage medium and computer equipment
CN109034378B (en) Network representation generation method and device of neural network, storage medium and equipment
CN110475129B (en) Video processing method, medium, and server
KR20180001889A (en) Language processing method and apparatus
CN110134971B (en) Method and device for machine translation and computer readable storage medium
CN108776832B (en) Information processing method, information processing device, computer equipment and storage medium
CN110083702B (en) Aspect level text emotion conversion method based on multi-task learning
CN116415654A (en) Data processing method and related equipment
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN110543561A (en) Method and device for emotion analysis of text
CN110765733A (en) Text normalization method, device, equipment and storage medium
CN115186147B (en) Dialogue content generation method and device, storage medium and terminal
CN113396429A (en) Regularization of recursive machine learning architectures
CN112364650A (en) Entity relationship joint extraction method, terminal and storage medium
CN112819050A (en) Knowledge distillation and image processing method, device, electronic equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN109979461B (en) Voice translation method and device
CN111563391A (en) Machine translation method and device and electronic equipment
CN114359592A (en) Model training and image processing method, device, equipment and storage medium
CN111259673B (en) Legal decision prediction method and system based on feedback sequence multitask learning
WO2023017568A1 (en) Learning device, inference device, learning method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant