WO2021037113A1 - Image description method and apparatus, computing device and storage medium - Google Patents

Image description method and apparatus, computing device and storage medium

Info

Publication number
WO2021037113A1
WO2021037113A1 PCT/CN2020/111602 CN2020111602W WO2021037113A1 WO 2021037113 A1 WO2021037113 A1 WO 2021037113A1 CN 2020111602 W CN2020111602 W CN 2020111602W WO 2021037113 A1 WO2021037113 A1 WO 2021037113A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
vector
decoding
coding
self
Prior art date
Application number
PCT/CN2020/111602
Other languages
English (en)
French (fr)
Inventor
宋振旗
李长亮
廖敏鹏
Original Assignee
北京金山数字娱乐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京金山数字娱乐科技有限公司
Priority to JP2022513610A priority Critical patent/JP2022546811A/ja
Priority to US17/753,304 priority patent/US20220351487A1/en
Priority to EP20856644.8A priority patent/EP4024274A4/en
Publication of WO2021037113A1 publication Critical patent/WO2021037113A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G06V10/424Syntactic representation, e.g. by using alphabets or grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/457Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by analysing connectivity, e.g. edge linking, connected component analysis or slices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • This application relates to the field of image processing technology, and in particular to an image description method and device, computing device and storage medium.
  • Image description refers to the automatic generation of a descriptive text based on the image, similar to "look at the picture and speak". For humans, image description is a simple and natural thing, but for machines, this task is full of challenges. The reason is that the machine must not only be able to detect the objects in the image, but also understand the relationship between the objects, and finally express it in a reasonable language.
  • In the prior art, image description requires a machine to extract local information and global information from a target image, input the global and local information into a translation model, and use the sentence output by the translation model as the description information of the image.
  • In current image description tasks, a single feature extraction model is mostly used to extract the global information of the target image.
  • In this case, the extraction of the global information depends on the performance of that feature extraction model itself.
  • Some feature extraction models attend to one type of information in the image while other feature extraction models attend to another type, so in the subsequent process the translation model often cannot take the complete global information of the image as a reference, and the output sentence is biased.
  • the embodiments of the present application provide a method and device for image description, a computing device, and a storage medium to solve the technical defects in the prior art.
  • In a first aspect, the embodiments of the present application provide an image description method, including: performing feature extraction on a target image using multiple first feature extraction models to obtain the image features generated by each first feature extraction model; fusing the image features generated by the multiple first feature extraction models to generate a global image feature corresponding to the target image; performing feature extraction on the target image using a second feature extraction model to obtain target detection features corresponding to the target image; and
  • inputting the global image feature and the target detection feature corresponding to the target image into a translation model, and using the generated translation sentence as the description sentence of the target image.
  • Optionally, performing fusion processing on the image features generated by the multiple first feature extraction models to generate the global image feature corresponding to the target image includes: performing feature extraction on the image features generated by the multiple first feature extraction models through the corresponding first self-attention layers to obtain multiple intermediate features; splicing the multiple intermediate features to generate an initial global feature; and
  • fusing the initial global feature through at least one second self-attention layer to generate the global image feature.
  • the translation model includes an encoder and a decoder
  • Inputting the global image feature and the target detection feature corresponding to the target image into a translation model, and using the generated translation sentence as the description sentence of the target image includes:
  • a corresponding translation sentence is generated according to the decoding vector output by the decoder, and the translation sentence is used as a description sentence of the target image.
  • the encoder includes N sequentially connected coding layers, where N is an integer greater than 1;
  • Inputting the target detection feature and the global image feature to the encoder of the translation model to generate the encoding vector output by the encoder includes:
  • the coding layer includes: a first coding self-attention layer, a second coding self-attention layer, and a first feedforward layer;
  • Inputting the target detection feature and the global image feature to the first coding layer to obtain the output vector of the first coding layer includes:
  • the second intermediate vector is processed through the first feedforward layer to obtain the output vector of the first coding layer.
  • the coding layer includes: a first coding self-attention layer, a second coding self-attention layer, and a first feedforward layer;
  • Inputting the output vector of the (i-1)-th coding layer and the global image feature into the i-th coding layer to obtain the output vector of the i-th coding layer includes: inputting the output vector of the (i-1)-th coding layer into the first coding self-attention layer to obtain a third intermediate vector; inputting the third intermediate vector and the global image feature into the second coding self-attention layer to obtain a fourth intermediate vector; and processing the fourth intermediate vector through the first feedforward layer to obtain the output vector of the i-th coding layer.
  • the decoder includes M sequentially connected decoding layers, where M is an integer greater than 1;
  • Inputting the encoding vector and the global image feature to a decoder to generate a decoding vector output by the decoder includes:
  • the decoding layer includes: a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer, and a second feedforward layer;
  • Inputting the reference decoding vector, the coding vector, and the global image feature to the first decoding layer to obtain the output vector of the first decoding layer includes:
  • the decoding layer includes: a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer, and a second feedforward layer;
  • Input the output vector, coding vector and global image feature of the j-1th decoding layer to the jth decoding layer to obtain the output vector of the jth decoding layer including:
  • the tenth intermediate vector is processed through the second feedforward layer to obtain the output vector of the j-th decoding layer.
  • an image description device including:
  • the feature extraction module is configured to perform feature extraction on the target image using a plurality of first feature extraction models to obtain image features generated by each of the first feature extraction models;
  • a global image feature extraction module configured to perform fusion processing on image features generated by the plurality of first feature extraction models to generate global image features corresponding to the target image
  • the target detection feature extraction module is configured to perform feature extraction on the target image using a second feature extraction model to obtain target detection features corresponding to the target image;
  • the translation module is configured to input the global image feature and the target detection feature corresponding to the target image into a translation model, and use the generated translation sentence as a description sentence of the target image.
  • The embodiments of the present application provide a computing device, including a memory, a processor, and computer instructions stored in the memory and executable on the processor.
  • When the processor executes the instructions, the steps of the above image description method are implemented.
  • an embodiment of the present application provides a computer-readable storage medium that stores computer instructions that, when executed by a processor, implement the steps of the image description method described above.
  • the embodiments of the present application provide a computer program product for implementing the steps of the image description method described above at runtime.
  • With the image description method and apparatus, computing device, and storage medium provided in this application, multiple first feature extraction models are used to perform feature extraction on a target image to obtain the image features generated by each first feature extraction model.
  • The image features generated by the multiple first feature extraction models are fused into the global image feature corresponding to the target image, which overcomes the defect that a single feature extraction model relies too heavily on its own performance.
  • This alleviates the one-sidedness of the image features extracted by a single feature extraction model, so that in the subsequent process of inputting the global image feature and the target detection feature corresponding to the target image into the translation model to generate the translation sentence, a global image feature carrying richer image information is available as a reference, making the output translation sentence more accurate.
  • Further, this application uses multiple first feature extraction models to perform feature extraction on the target image and splices the image features extracted by the multiple first feature extraction models to obtain the initial global feature, so that the initial global feature contains as complete a set of features of the target image as possible.
  • The initial global feature is then fused through multiple second self-attention layers to obtain the target regions that deserve attention, and more attention and computing resources are invested in these regions to obtain more detail information related to the target image while ignoring other irrelevant information.
  • In this way, limited attention computing resources can be used to quickly filter out high-value information from a large amount of information, yielding a global image feature that contains richer image information.
  • Further, this application inputs the target detection feature and the global image feature into the encoder, so that the global image feature containing rich image information serves as background information in the encoding process of each coding layer.
  • The encoding vector obtained by each coding layer therefore captures more of the image information, making the output translation sentence more accurate.
  • In addition, this application inputs the global image feature into each decoding layer of the decoder, so that the global image feature containing rich image information serves as background information in the decoding process of each decoding layer.
  • The decoded decoding vector therefore corresponds more closely to the image information, which makes the output translation sentence more accurate.
  • FIG. 1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of an image description method according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an image description method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the structure of the coding layer of the translation model according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a decoding layer of a translation model according to an embodiment of the present application.
  • Fig. 6 is a schematic diagram of an image description method according to another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an image description device according to another embodiment of the present application.
  • Although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, the first may also be referred to as the second and, similarly, the second may also be referred to as the first, depending on the context.
  • Image feature fusion: in the image feature input stage, features extracted by multiple pre-trained convolutional networks are fused and used instead of a single image feature, thereby providing richer feature input for the training network.
  • RNN (Recurrent Neural Network) model: a neural network with a feedback structure whose output depends not only on the current input and the network weights but also on previous inputs. The RNN model models time by adding self-connected hidden layers that span time steps; in other words, the feedback of the hidden layer enters not only the output but also the hidden layer at the next time step.
  • Transformer: a translation model whose architecture includes an encoder and a decoder. The encoder encodes the source sentence to be translated into a vector, and the decoder decodes the vector of the source sentence to generate the corresponding target sentence.
  • Image caption: a comprehensive problem combining computer vision, natural language processing, and machine learning, in which a natural language sentence describing the content of an image is produced from the image; in plain terms, it is translating a picture into a piece of descriptive text.
  • Self-attention computation: for example, when a sentence is input for self-attention computation, every word in it is attended against all the words in the sentence, in order to learn the word dependencies within the sentence and capture its internal structure. When self-attention is computed over input image features, every feature is attended against the other features, in order to learn the feature dependencies within the image.
  • Global image feature: all the features corresponding to the target image.
  • Target detection feature: the feature of a specific region in the target image.
  • Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present application.
  • the components of the computing device 100 include but are not limited to a memory 110 and a processor 120.
  • the processor 120 and the memory 110 are connected through a bus 130, and the database 150 is used to store data.
  • the computing device 100 also includes an access device 140 that enables the computing device 100 to communicate via one or more networks 160.
  • the computing device 100 can use the access device 140 to communicate with the database 150 via the network 160.
  • networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet, and the like.
  • The access device 140 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
  • the aforementioned components of the computing device 100 and other components not shown in FIG. 1 may also be connected to each other, for example, via a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 1 is only for the purpose of example, and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.
  • The computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (for example, a smart phone), a wearable computing device (for example, a smart watch, smart glasses, etc.) or another type of mobile device, or a stationary computing device such as a desktop computer or PC.
  • the computing device 100 may also be a mobile or stationary server.
  • FIG. 2 shows a schematic flowchart of an image description method according to an embodiment of the present application, including step 201 to step 204.
  • There may be multiple first feature extraction models.
  • In this application, multiple first feature extraction models are used to perform feature extraction on the target image.
  • The types of the first feature extraction model may include convolutional network models such as VGG (Visual Geometry Group Network), the Resnet model, the Densnet model, and the inceptionv3 model.
  • the image features extracted by the multiple first feature models have the same size.
  • the size of the image feature can be adjusted.
  • the number of channels of each image feature can also be the same.
  • the dimension of the extracted image feature can be expressed as 224*224*3, where 224*224 represents the height*width of the image feature, that is, the size of the image feature; 3 is the number of channels, that is, the number of image features.
  • the height and width of the input image are equal, and the size of the convolution kernel of the convolution layer can be set according to actual needs.
  • Commonly used convolution kernels are 1*1*1, 3*3*3, 5*5*5 , 7*7*7, etc.
  • the sizes of the image features generated by the multiple first feature models are all the same, but the number of image features (the number of channels) may be different from each other.
  • For example, the image features generated by the first first feature extraction model are P*Q*L1, that is, there are L1 image features of size P*Q.
  • The image features generated by the second first feature extraction model are P*Q*L2, that is, there are L2 image features of size P*Q.
  • Here P*Q is the height*width of an image feature, and L1 and L2 are the numbers of image features generated by the first and the second first feature extraction model, respectively.
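  • For instance, a minimal sketch of this multi-model extraction step, assuming PyTorch/torchvision backbones as stand-ins for the first feature extraction models (the pooling size and backbone choices below are illustrative assumptions, not prescribed by this application):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Several convolutional backbones play the role of the first feature extraction models.
resnet = models.resnet50()
backbones = [
    nn.Sequential(*list(resnet.children())[:-2]),  # ResNet without avgpool/fc
    models.vgg16().features,                       # VGG convolutional part
    models.densenet121().features,                 # DenseNet convolutional part
]
to_common_size = nn.AdaptiveAvgPool2d((7, 7))      # force the same spatial size P*Q for every model

image = torch.randn(1, 3, 224, 224)                # a 224*224*3 target image
with torch.no_grad():
    image_features = [to_common_size(b(image)) for b in backbones]

for f in image_features:
    # same P*Q = 7*7 for every model; channel counts (L1, L2, ...) may differ
    print(f.shape)
```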
  • the image features generated by each first feature extraction model can be fused by Poisson fusion method, weighted average method, feathering algorithm, Laplacian fusion algorithm, self-attention algorithm, etc., to obtain the global image feature corresponding to the target image .
  • step 202 includes:
  • S2021 Perform feature extraction on the image features generated by the multiple first feature extraction models through the corresponding first self-attention layer to obtain multiple intermediate features.
  • the first self-attention layer includes a multi-head self-attention layer and a feed-forward layer.
  • the number of first self-attention layers is the same as the number of first feature extraction models.
  • Each first feature extraction model may correspond to its own first self-attention layer. For example, taking five first feature extraction models as an example, the five first feature models all process the same image to generate the corresponding image features, and the image features generated by each first feature extraction model are then passed through the corresponding first self-attention layer.
  • The first self-attention layer performs feature extraction to obtain the generated intermediate features.
  • The splicing process can be implemented by calling the contact (concatenation) function.
  • the intermediate features generated by the first self-attention layer corresponding to the five first feature extraction models are spliced to generate one initial global feature.
  • For example, the first self-attention layer corresponding to the first first feature extraction model generates A1 intermediate features of size P*Q;
  • the first self-attention layer corresponding to the second first feature extraction model generates A2 intermediate features of size P*Q;
  • the first self-attention layer corresponding to the third first feature extraction model generates A3 intermediate features of size P*Q;
  • the first self-attention layer corresponding to the fourth first feature extraction model generates A4 intermediate features of size P*Q;
  • and the first self-attention layer corresponding to the fifth first feature extraction model generates A5 intermediate features of size P*Q.
  • The initial global feature after the splicing process then contains (A1+A2+A3+A4+A5) features.
  • this step is to splice multiple intermediate features without further fusion processing. Therefore, compared with the intermediate features, the relationship between the features in the generated initial global features has not changed. This means that the features of the initial global features will be partially repetitive, and such features will be further processed in subsequent steps.
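  • The splicing step itself amounts to a concatenation along the feature dimension; a small sketch under the same assumptions (the counts A1–A5 below are made-up numbers):

```python
import torch

P, Q = 7, 7
A1, A2, A3, A4, A5 = 64, 96, 128, 80, 112            # hypothetical feature counts
intermediate = [torch.randn(n, P, Q) for n in (A1, A2, A3, A4, A5)]

# the "contact"/concatenation step: stack all intermediate features into one tensor
initial_global = torch.cat(intermediate, dim=0)
print(initial_global.shape)                           # (A1+A2+A3+A4+A5, P, Q) = (480, 7, 7)
```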
  • S2023 Perform fusion processing on the initial global features through at least one second self-attention layer to generate global image features.
  • the second self-attention layer includes a multi-head self-attention layer and a feed-forward layer.
  • the number of second self-attention layers can be multiple, and the settings can be customized according to actual needs.
  • the structure of the second self-attention layer can be the same as the structure of the first self-attention layer, and its purpose is to perform self-attention processing on the input vector to extract the vector that needs to be processed in the subsequent steps.
  • The difference is that, in the case where there are multiple first self-attention layers and multiple second self-attention layers, the multiple first self-attention layers process the image features generated by each first feature extraction model in parallel,
  • whereas the second self-attention layers process the initial global feature serially, layer by layer.
  • The initial global feature generated by splicing the multiple intermediate features is fused by the second self-attention layer, which promotes the mutual fusion of different features.
  • For example, the initial global feature contains a feature C1 of class C and a feature C2 of class C, and the correlation between the two is relatively strong.
  • During the fusion processing, the second self-attention layer attends to the strongly correlated features C1 and C2 and merges them into a feature C1'.
  • As another example, the initial global feature contains multiple repeated features D1 of class D.
  • The second self-attention layer attends to the repeated features D1 and turns the repeated features D1 into a single feature D1 of class D.
  • For example, a key-value pair can be used to represent the input information, where Key represents the key and value represents the value corresponding to that key.
  • The "key" is used to calculate the attention distribution, and the "value" is used to calculate the aggregated information. Then n pieces of input information can be expressed as (K, V) = [(k1, v1), (k2, v2), ..., (kn, vn)].
  • Specifically, the similarity between the Query and each Key is first calculated according to formula (1): s_i = F(Q, k_i)  (1),
  • where s_i is the attention score,
  • Q is the Query, i.e., the query vector,
  • and k_i corresponds to each key vector.
  • Then the attention scores are numerically transformed with the softmax function according to formula (2): α_i = softmax(s_i) = exp(s_i) / Σ_j exp(s_j)  (2).
  • On the one hand this normalizes the scores into a probability distribution in which all weight coefficients sum to 1; on the other hand, the characteristics of the softmax function can be used to highlight the weights of the important elements.
  • Here α_i is the weight coefficient.
  • Finally, the values are weighted and summed according to the weight coefficients through formula (3): Attention((K, V), Q) = Σ_i α_i · v_i  (3),
  • where v_i is the value vector.
  • According to the self-attention computation, the initial global feature containing (A1+A2+A3+A4+A5) features is fused by the second self-attention layer to obtain a global image feature of A' features.
  • Generally, A' is less than or equal to (A1+A2+A3+A4+A5).
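  • A small numeric sketch of formulas (1)–(3), assuming a dot product as the scoring function F (the application does not fix a particular F):

```python
import torch
import torch.nn.functional as F

d = 4
query = torch.randn(d)        # Q, the query vector
keys = torch.randn(6, d)      # k_i, one key per input feature
values = torch.randn(6, d)    # v_i, one value per input feature

scores = keys @ query                                   # formula (1): s_i = F(Q, k_i)
weights = F.softmax(scores, dim=0)                      # formula (2): alpha_i, sums to 1
attended = (weights.unsqueeze(1) * values).sum(dim=0)   # formula (3): sum_i alpha_i * v_i
```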
  • the second feature model may be a target detection feature model, so as to extract local information of the target image.
  • For example, the second feature extraction model may be the Faster R-CNN (Faster Regions with CNN features) model, which identifies the regions of interest in the image and allows the interest frames corresponding to multiple regions of interest to overlap, so that the image content can be understood more effectively.
  • The main steps in which Faster R-CNN extracts the target detection features include:
  • Feature extraction: take the entire target image as input to obtain the feature layer of the target image.
  • Candidate regions: use methods such as Selective Search to extract regions of interest from the target image, and project the interest frames corresponding to these regions of interest onto the final feature layer one by one.
  • Region normalization: perform a pooling operation for the candidate frame of each candidate region on the feature layer to obtain a fixed-size feature representation.
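  • These three steps can be mimicked with off-the-shelf operators; a hedged sketch, assuming a torchvision backbone for the feature layer and hypothetical interest boxes (Faster R-CNN itself bundles region proposal and pooling internally):

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50()
feature_layer = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)
features = feature_layer(image)                    # feature layer of the whole image, (1, 2048, 7, 7)

# Hypothetical regions of interest in image coordinates (x1, y1, x2, y2)
boxes = [torch.tensor([[10.0, 20.0, 120.0, 200.0],
                       [60.0, 40.0, 220.0, 210.0]])]

# Project the interest frames onto the feature layer and pool each to a fixed size
region_features = torchvision.ops.roi_align(
    features, boxes, output_size=(7, 7), spatial_scale=7 / 224)
print(region_features.shape)                       # (num_boxes, 2048, 7, 7) target detection features
```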
  • the translation model includes an encoder and a decoder.
  • the Transformer model is preferably used, which can further make the output sentence more accurate.
  • the Transformer model does not require loops, but processes the global image features and target detection features corresponding to the input target image in parallel, and uses the self-attention mechanism to combine features.
  • the training speed of Transformer model is much faster than RNN, and its translation result is more accurate than that of RNN.
  • the translation sentence may include multiple translation words, and for the decoder, one translation word is obtained every time it is decoded.
  • For the first translation word of the translation sentence, the reference decoding vector is a preset initial decoding vector; for translation words other than the first translation word of the translation sentence, the reference decoding vector is the decoding vector corresponding to the previous translation word.
  • With the image description method provided in this application, multiple first feature extraction models are used to perform feature extraction on a target image to obtain the image features generated by each first feature extraction model, and the image features generated by the multiple first feature extraction models are fused into the global image feature corresponding to the target image. This overcomes the defect that a single feature extraction model relies too heavily on its own performance and, compared with using the image features of a single feature extraction model in the prior art, alleviates the one-sidedness of the features a single model extracts, so that in the subsequent process of inputting the global image feature and the target detection feature corresponding to the target image into the translation model to generate the translation sentence, a global image feature carrying richer image information is available as a reference, making the output translation sentence more accurate.
  • the image description method of an embodiment of the present application may also be as shown in FIG. 3, including:
  • Steps 301 to 303 are the same as steps 201 to 203 of the foregoing embodiment, and specific explanations can be referred to the foregoing embodiment, which will not be repeated here.
  • the encoder may include one coding layer or multiple coding layers.
  • the encoder includes N sequentially connected coding layers as an example for description, where N>1.
  • Step 304 includes the following steps S3041 to S3044:
  • the global image feature is input to each coding layer, so that the target detection feature is integrated into the global image feature in the processing of each coding layer, and the feature representation of the target detection feature is enhanced .
  • the coding layer includes: a first coding self-attention layer, a second coding self-attention layer, and a first feedforward layer;
  • Step S3041 includes: inputting the target detection feature to the first coded self-attention layer to obtain a first intermediate vector; inputting the first intermediate vector and global image features to the second coded self-attention layer to obtain a second intermediate vector ; Process the second intermediate vector through the first feedforward layer to obtain the output vector of the first coding layer.
  • Step S3042 includes: inputting the output vector of the i-1th coding layer to the first coding self-attention layer to obtain a third intermediate vector; inputting the third intermediate vector and global image features to the second coding self-attention layer Layer to obtain the fourth intermediate vector; and process the fourth intermediate vector through the first feedforward layer to obtain the output vector of the i-th encoding layer.
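  • As a rough illustration of such a coding layer, a simplified PyTorch sketch (the residual connections and layer normalisation of a full Transformer layer are omitted; dimensions and head counts are assumptions):

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    def __init__(self, d_model=512, heads=8):
        super().__init__()
        self.first_self_attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.second_attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x, global_feature):
        h, _ = self.first_self_attention(x, x, x)                         # first/third intermediate vector
        h, _ = self.second_attention(h, global_feature, global_feature)   # mixes in the global image feature
        return self.feed_forward(h)                                       # output vector of this coding layer

layer = CodingLayer()
detection_features = torch.randn(1, 36, 512)   # target detection features (or previous layer output)
global_feature = torch.randn(1, 49, 512)       # fused global image feature
out = layer(detection_features, global_feature)
```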
  • the decoder may include one decoding layer or multiple decoding layers.
  • the decoder includes M successively connected decoding layers as an example for description. Among them, M>1.
  • Step 305 includes the following steps S3051 to S3054:
  • For the first translation word of the translation sentence, the reference decoding vector is an initial decoding vector;
  • for the other translation words, the reference decoding vector is the decoding vector corresponding to the previous translation word.
  • The global image feature is input into each decoding layer of the decoder, so that the global image feature containing rich image information can be used as background information during the decoding process of each decoding layer.
  • The decoding vector thus obtained corresponds more closely to the image information, which makes the output translation sentence more accurate.
  • the decoding layer includes: a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer, and a second feedforward layer.
  • Step S3051 includes: processing the reference decoding vector through the first decoding self-attention layer to obtain a fifth intermediate vector; processing the fifth intermediate vector and the global image feature through the second decoding self-attention layer to obtain a sixth intermediate vector; processing the sixth intermediate vector and the encoding vector through the third decoding self-attention layer to obtain a seventh intermediate vector; and processing the seventh intermediate vector through the second feedforward layer to obtain the output vector of the first decoding layer.
  • Step S3052 includes: processing the output vector of the (j-1)-th decoding layer through the first decoding self-attention layer to obtain an eighth intermediate vector; processing the eighth intermediate vector and the global image feature through the second decoding self-attention layer to obtain a ninth intermediate vector; processing the ninth intermediate vector and the encoding vector through the third decoding self-attention layer to obtain a tenth intermediate vector; and processing the tenth intermediate vector through the second feedforward layer to obtain the output vector of the j-th decoding layer.
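  • A comparable sketch of one decoding layer under the same assumptions (masking of future positions, residual connections, and normalisation are omitted for brevity):

```python
import torch.nn as nn

class DecodingLayer(nn.Module):
    def __init__(self, d_model=512, heads=8):
        super().__init__()
        self.first_attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.second_attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.third_attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))

    def forward(self, reference_decoding, global_feature, encoding):
        h, _ = self.first_attention(reference_decoding, reference_decoding, reference_decoding)  # 5th/8th vector
        h, _ = self.second_attention(h, global_feature, global_feature)   # 6th/9th vector, global feature as background
        h, _ = self.third_attention(h, encoding, encoding)                # 7th/10th vector, attends to the encoder output
        return self.feed_forward(h)                                       # output vector of this decoding layer
```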
  • a corresponding translation term is generated according to the decoding vector output by the decoder, and a translation sentence is generated according to the translation term.
  • the translation sentence may include multiple translation words.
  • one translation word is obtained every time it is decoded.
  • For the first translation word of the translation sentence, the reference decoding vector is a preset initial decoding vector; for translation words other than the first translation word of the translation sentence, the reference decoding vector is the decoding vector corresponding to the previous translation word.
  • With the image description method provided in this application, multiple first feature extraction models are used to perform feature extraction on a target image to obtain the image features generated by each first feature extraction model, and the image features generated by the multiple first feature extraction models are fused into the global image feature corresponding to the target image. This overcomes the defect that a single feature extraction model relies too heavily on its own performance and alleviates the one-sidedness of the features a single model extracts, so that in the subsequent process of inputting the global image feature and the target detection feature corresponding to the target image into the translation model to generate the translation sentence, a global image feature carrying richer image information is available as a reference, making the output translation sentence more accurate.
  • Further, this embodiment uses multiple first feature extraction models to perform feature extraction on the target image and splices the image features extracted by the multiple first feature extraction models to obtain the initial global feature, so that the initial global feature contains as complete a set of features of the target image as possible. The initial global feature is then fused through multiple second self-attention layers to obtain the target regions that deserve attention, and more attention and computing resources are invested in these regions to obtain more detail information related to the target image while ignoring other irrelevant information.
  • In this way, limited attention computing resources can be used to quickly filter out high-value information from a large amount of information, and a global image feature containing richer image information is obtained.
  • In addition, this method inputs the global image feature into each decoding layer of the decoder, so that the global image feature containing rich image information can be used as background information during the decoding process of each decoding layer.
  • The decoded decoding vector therefore corresponds more closely to the image information, which makes the output translation sentence more accurate.
  • the image description method in this embodiment is applicable to the encoder-decoder machine translation model.
  • Referring to FIG. 6, the Transformer translation model is taken as an example for schematic description.
  • In this embodiment there are 4 first feature extraction models, namely VGG, Resnet, Densnet, and inceptionv3,
  • 4 corresponding first self-attention layers,
  • K second self-attention layers,
  • 1 second feature extraction model, and the Transformer translation model.
  • Contact refers to the contact (concatenation) function.
  • the image description method of this embodiment includes the following steps S61 to S68:
  • S61 Perform feature extraction on the target image using the four first feature extraction models to obtain image features generated by each first feature extraction model.
  • the image features generated by the four first feature extraction models are respectively processed through the corresponding first self-attention layer to obtain the generated intermediate features.
  • the image features generated by the first first feature extraction model are processed by the corresponding first self-attention layer to obtain A1 intermediate features, the size of the intermediate feature is P*Q; the second first feature extraction model is generated The image features of is processed by the corresponding first self-attention layer to obtain A2 intermediate features, the size of the intermediate feature is P*Q; the image features generated by the third first feature extraction model pass the corresponding first self-attention Layer processing to obtain A3 intermediate features, the size of the intermediate feature is P*Q; the image features generated by the fourth first feature extraction model are processed by the corresponding first self-attention layer, and A4 intermediate features are obtained.
  • the size of the feature is P*Q.
  • the 4 intermediate features are spliced to generate an initial global feature containing (A1+A2+A3+A4) features.
  • the initial global feature containing (A1+A2+A3+A4) features is fused to generate a global image feature containing A'features.
  • A' ⁇ (A1+A2+A3+A4).
  • S65 Perform feature extraction on the target image using the second feature extraction model to obtain target detection features corresponding to the target image.
  • Among them, the second feature extraction model is the Faster R-CNN (Faster Regions with CNN features) model.
  • the encoder includes N coding layers
  • the decoder includes M decoding layers.
  • The Transformer model can output description sentences in different languages, depending on how it has been trained.
  • The capability of the Transformer model is formed through training on sample sets.
  • A sample set may be, for example, a collection of "Chinese sentences to be translated + French translation sentences", a collection of "English sentences to be translated + Japanese translation sentences", or a collection of "image features + English translation sentences".
  • In this embodiment, a Transformer model trained to generate English translation sentences from the input image features is taken as an example.
  • The decoder outputs a decoding vector and obtains the first word "a"; the decoding vector corresponding to "a" is then used as the reference decoding vector, so that the decoder obtains the second word "boy" according to the reference decoding vector, the encoding vector, and the global image feature.
  • The vector corresponding to the second word "boy" is then used as the reference decoding vector, so that the decoder obtains the next word "play" according to the reference decoding vector, the encoding vector, and the global image feature... and so on, until the description sentence "A boy play football on football field" is obtained.
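  • This word-by-word generation corresponds to a simple autoregressive loop; an illustrative sketch in which `decoder`, `vocab` and the start/end tokens are hypothetical helpers, not names used by this application:

```python
def generate_caption(decoder, encoding, global_feature, vocab, max_len=20):
    words = []
    reference_decoding = vocab.embed("<start>")        # preset initial decoding vector (hypothetical helper)
    for _ in range(max_len):
        decoding = decoder(reference_decoding, global_feature, encoding)
        word = vocab.lookup(decoding)                  # most probable next word (hypothetical helper)
        if word == "<end>":
            break
        words.append(word)
        reference_decoding = decoding                  # previous word's decoding vector becomes the reference
    return " ".join(words)                             # e.g. "A boy play football on football field"
```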
  • An embodiment of the present application also provides an image description device, referring to FIG. 7, including:
  • the feature extraction module 701 is configured to perform feature extraction on a target image using multiple first feature extraction models to obtain image features generated by each first feature extraction model;
  • the global image feature extraction module 702 is configured to perform fusion processing on the image features generated by the multiple first feature extraction models to generate a global image feature corresponding to the target image;
  • the target detection feature extraction module 703 is configured to perform feature extraction on the target image by using the second feature extraction model to obtain target detection features corresponding to the target image;
  • the translation module 704 is configured to input global image features and target detection features corresponding to the target image into the translation model, and use the generated translation sentence as a description sentence of the target image.
  • the global image feature extraction module 702 is specifically configured to:
  • the initial global features are fused through at least one second self-attention layer to generate global image features.
  • the translation model includes an encoder and a decoder
  • the translation module 704 includes:
  • An encoding module configured to input the target detection feature and the global image feature to the encoder of the translation model, and generate an encoding vector output by the encoder;
  • a decoding module configured to input the encoding vector and the global image feature to a decoder, and generate a decoding vector output by the decoder;
  • the sentence generation module is configured to generate a corresponding translation sentence according to the decoding vector output by the decoder, and use the translation sentence as a description sentence of the target image.
  • the encoder includes N sequentially connected coding layers, where N is an integer greater than 1; the coding module includes:
  • the first processing unit is configured to input the target detection feature and the global image feature to the first coding layer to obtain an output vector of the first coding layer;
  • the second processing unit is configured to input the output vector of the i-1th coding layer and the global image feature to the i-th coding layer to obtain the output vector of the i-th coding layer, where 2 ⁇ i ⁇ N;
  • the first judging unit is configured to judge whether i is equal to N, if not, increment i by 1 and execute the second processing unit, and if so, execute the code vector generating unit;
  • the coding vector generating unit is configured to use the output vector of the Nth coding layer as the coding vector output by the encoder.
  • the coding layer includes: a first coding self-attention layer, a second coding self-attention layer, and a first feedforward layer; the first processing unit is specifically configured to: input the target detection feature to the first coding Self-attention layer to obtain a first intermediate vector; input the first intermediate vector and global image features to the second encoded self-attention layer to obtain a second intermediate vector; pass the second intermediate vector through the first feedforward layer Process to get the output vector of the first coding layer.
  • the coding layer includes: a first coding self-attention layer, a second coding self-attention layer, and a first feedforward layer; the second processing unit is specifically configured to: The output vector is input to the first coding self-attention layer to obtain a third intermediate vector; the third intermediate vector and global image features are input to the second coding self-attention layer to obtain a fourth intermediate vector; the fourth intermediate vector After processing by the first feedforward layer, the output vector of the i-th coding layer is obtained.
  • the decoder includes M sequentially connected decoding layers, where M is an integer greater than 1;
  • the decoding module includes:
  • the third processing unit is configured to input the reference decoding vector, the encoding vector, and the global image feature to the first decoding layer to obtain the output vector of the first decoding layer;
  • the fourth processing unit is configured to input the output vector, encoding vector, and global image feature of the j-1th decoding layer to the jth decoding layer to obtain the output vector of the jth decoding layer, where 2 ⁇ j ⁇ M;
  • the second judging unit is configured to judge whether j is equal to M, if not, increment j by 1 and execute the fourth processing unit, and if so, execute the decoding vector generating unit;
  • the decoding vector generating unit is configured to use the output vector of the M-th decoding layer as the decoding vector output by the decoder.
  • the decoding layer includes: a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer, and a second feedforward layer; the third processing unit is specifically configured as:
  • the seventh intermediate vector is processed through the second feedforward layer to obtain the output vector of the first decoding layer.
  • the decoding layer includes: a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer, and a second feedforward layer;
  • the fourth processing unit is specifically configured as:
  • the tenth intermediate vector is processed through the second feedforward layer to obtain the output vector of the j-th decoding layer.
  • An embodiment of the present application also provides a computer-readable storage medium that stores computer instructions which, when executed by a processor, implement the steps of the image description method described above.
  • An embodiment of the present application also provides a computer program product, which is used to implement the steps of the aforementioned image description method at runtime.
  • the computer instructions include computer program codes, and the computer program codes may be in the form of source code, object code, executable files, or some intermediate forms.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

This application provides an image description method and apparatus, a computing device and a storage medium. The method includes: performing feature extraction on a target image using multiple first feature extraction models to obtain the image features generated by each first feature extraction model; fusing the image features generated by the multiple first feature extraction models to generate a global image feature corresponding to the target image; performing feature extraction on the target image using a second feature extraction model to obtain target detection features corresponding to the target image; and inputting the global image feature and the target detection features corresponding to the target image into a translation model and using the generated translation sentence as the description sentence of the target image. In the subsequent process of inputting the global image feature and the target detection features into the translation model to generate the translation sentence, a global image feature carrying richer image information is therefore available as a reference, which makes the output translation sentence more accurate.

Description

Image description method and apparatus, computing device and storage medium
This application claims priority to the Chinese patent application No. 201910797332.X, entitled "Image description method and apparatus", filed with the Chinese Patent Office on August 27, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and in particular to an image description method and apparatus, a computing device and a storage medium.
Background
Image description refers to automatically generating a piece of descriptive text from an image, similar to "describing a picture in words". For humans, image description is simple and natural, but for machines this task is full of challenges, because the machine must not only detect the objects in the image but also understand the relationships between the objects and finally express them in reasonable language.
In the prior art, image description requires a machine to extract local information and global information from a target image, input the global and local information into a translation model, and use the sentence output by the translation model as the description information of the image. In current image description tasks, a single feature extraction model is mostly used to extract the global information of the target image. In this case, the extraction of the global information depends on the performance of that feature extraction model itself: some feature extraction models attend to one type of information in the image while others attend to another type, so in the subsequent process the translation model often cannot take the complete global information of the image as a reference, and the output sentence is biased.
Summary of the Invention
In view of this, the embodiments of the present application provide an image description method and apparatus, a computing device and a storage medium, so as to solve the technical defects in the prior art.
In a first aspect, an embodiment of the present application provides an image description method, including:
performing feature extraction on a target image using multiple first feature extraction models to obtain the image features generated by each of the first feature extraction models;
fusing the image features generated by the multiple first feature extraction models to generate a global image feature corresponding to the target image;
performing feature extraction on the target image using a second feature extraction model to obtain target detection features corresponding to the target image; and
inputting the global image feature and the target detection features corresponding to the target image into a translation model, and using the generated translation sentence as the description sentence of the target image.
Optionally, fusing the image features generated by the multiple first feature extraction models to generate the global image feature corresponding to the target image includes:
performing feature extraction on the image features generated by the multiple first feature extraction models through the corresponding first self-attention layers to obtain multiple intermediate features;
splicing the multiple intermediate features to generate an initial global feature; and
fusing the initial global feature through at least one second self-attention layer to generate the global image feature.
Optionally, the translation model includes an encoder and a decoder;
inputting the global image feature and the target detection features corresponding to the target image into the translation model and using the generated translation sentence as the description sentence of the target image includes:
inputting the target detection features and the global image feature into the encoder of the translation model to generate an encoding vector output by the encoder;
inputting the encoding vector and the global image feature into the decoder to generate a decoding vector output by the decoder; and
generating a corresponding translation sentence according to the decoding vector output by the decoder, and using the translation sentence as the description sentence of the target image.
Optionally, the encoder includes N sequentially connected coding layers, where N is an integer greater than 1;
inputting the target detection features and the global image feature into the encoder of the translation model to generate the encoding vector output by the encoder includes:
S11: inputting the target detection features and the global image feature into the first coding layer to obtain the output vector of the first coding layer;
S12: inputting the output vector of the (i-1)-th coding layer and the global image feature into the i-th coding layer to obtain the output vector of the i-th coding layer, where 2 ≤ i ≤ N;
S13: judging whether i is equal to N; if not, incrementing i by 1 and executing step S12; if so, executing step S14;
S14: using the output vector of the N-th coding layer as the encoding vector output by the encoder.
Optionally, the coding layer includes a first coding self-attention layer, a second coding self-attention layer and a first feedforward layer;
inputting the target detection features and the global image feature into the first coding layer to obtain the output vector of the first coding layer includes:
inputting the target detection features into the first coding self-attention layer to obtain a first intermediate vector;
inputting the first intermediate vector and the global image feature into the second coding self-attention layer to obtain a second intermediate vector; and
processing the second intermediate vector through the first feedforward layer to obtain the output vector of the first coding layer.
Optionally, the coding layer includes a first coding self-attention layer, a second coding self-attention layer and a first feedforward layer;
inputting the output vector of the (i-1)-th coding layer and the global image feature into the i-th coding layer to obtain the output vector of the i-th coding layer includes: inputting the output vector of the (i-1)-th coding layer into the first coding self-attention layer to obtain a third intermediate vector; inputting the third intermediate vector and the global image feature into the second coding self-attention layer to obtain a fourth intermediate vector; and processing the fourth intermediate vector through the first feedforward layer to obtain the output vector of the i-th coding layer.
Optionally, the decoder includes M sequentially connected decoding layers, where M is an integer greater than 1;
inputting the encoding vector and the global image feature into the decoder to generate the decoding vector output by the decoder includes:
S21: inputting a reference decoding vector, the encoding vector and the global image feature into the first decoding layer to obtain the output vector of the first decoding layer;
S22: inputting the output vector of the (j-1)-th decoding layer, the encoding vector and the global image feature into the j-th decoding layer to obtain the output vector of the j-th decoding layer, where 2 ≤ j ≤ M;
S23: judging whether j is equal to M; if not, incrementing j by 1 and executing step S22; if so, executing step S24;
S24: using the output vector of the M-th decoding layer as the decoding vector output by the decoder.
Optionally, the decoding layer includes a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer and a second feedforward layer;
inputting the reference decoding vector, the encoding vector and the global image feature into the first decoding layer to obtain the output vector of the first decoding layer includes:
processing the reference decoding vector through the first decoding self-attention layer to obtain a fifth intermediate vector; processing the fifth intermediate vector and the global image feature through the second decoding self-attention layer to obtain a sixth intermediate vector; processing the sixth intermediate vector and the encoding vector through the third decoding self-attention layer to obtain a seventh intermediate vector; and processing the seventh intermediate vector through the second feedforward layer to obtain the output vector of the first decoding layer.
Optionally, the decoding layer includes a first decoding self-attention layer, a second decoding self-attention layer, a third decoding self-attention layer and a second feedforward layer;
inputting the output vector of the (j-1)-th decoding layer, the encoding vector and the global image feature into the j-th decoding layer to obtain the output vector of the j-th decoding layer includes:
processing the output vector of the (j-1)-th decoding layer through the first decoding self-attention layer to obtain an eighth intermediate vector;
processing the eighth intermediate vector and the global image feature through the second decoding self-attention layer to obtain a ninth intermediate vector;
processing the ninth intermediate vector and the encoding vector through the third decoding self-attention layer to obtain a tenth intermediate vector; and
processing the tenth intermediate vector through the second feedforward layer to obtain the output vector of the j-th decoding layer.
In a second aspect, an embodiment of the present application provides an image description apparatus, including:
a feature extraction module configured to perform feature extraction on a target image using multiple first feature extraction models to obtain the image features generated by each of the first feature extraction models;
a global image feature extraction module configured to fuse the image features generated by the multiple first feature extraction models to generate a global image feature corresponding to the target image;
a target detection feature extraction module configured to perform feature extraction on the target image using a second feature extraction model to obtain target detection features corresponding to the target image; and
a translation module configured to input the global image feature and the target detection features corresponding to the target image into a translation model and use the generated translation sentence as the description sentence of the target image.
In a third aspect, an embodiment of the present application provides a computing device, including a memory, a processor and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the image description method described above when executing the instructions.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the image description method described above.
In a fifth aspect, an embodiment of the present application provides a computer program product for implementing the steps of the image description method described above at runtime.
With the image description method and apparatus, computing device and storage medium provided in this application, multiple first feature extraction models are used to perform feature extraction on the target image to obtain the image features generated by each first feature extraction model, and the image features generated by the multiple first feature extraction models are fused into the global image feature corresponding to the target image. This overcomes the defect that a single feature extraction model relies too heavily on its own performance and, compared with using the image features of a single feature extraction model in the prior art, alleviates the one-sidedness of the features a single model extracts, so that when the global image feature and the target detection features corresponding to the target image are subsequently input into the translation model to generate the translation sentence, a global image feature carrying richer image information is available as a reference, making the output translation sentence more accurate.
Secondly, this application performs feature extraction on the target image with multiple first feature extraction models and splices the image features extracted by the multiple first feature extraction models to obtain the initial global feature, so that the initial global feature contains as complete a set of features of the target image as possible. The initial global feature is then fused through multiple second self-attention layers to obtain the target regions that deserve attention, and more attention and computing resources are invested in these regions to obtain more detail information related to the target image while ignoring other irrelevant information. This mechanism allows limited attention computing resources to quickly filter out high-value information from a large amount of information and yields a global image feature containing richer image information.
Thirdly, this application inputs the target detection features and the global image feature into the encoder, so that the global image feature containing rich image information serves as background information in the encoding process of each coding layer, the encoding vector obtained by each coding layer captures more of the image information, and the output translation sentence becomes more accurate.
In addition, this application inputs the global image feature into every decoding layer of the decoder, so that the global image feature containing rich image information serves as background information in the decoding process of each decoding layer, the decoded decoding vector corresponds more closely to the image information, and the output translation sentence becomes more accurate.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application and of the prior art more clearly, the drawings required by the embodiments and the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a computing device according to an embodiment of this application;
FIG. 2 is a schematic flowchart of an image description method according to an embodiment of this application;
FIG. 3 is a schematic flowchart of an image description method according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of a coding layer of a translation model according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a decoding layer of a translation model according to an embodiment of this application;
FIG. 6 is a schematic diagram of an image description method according to another embodiment of this application;
FIG. 7 is a schematic structural diagram of an image description apparatus according to another embodiment of this application.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The terms used in one or more embodiments of this specification are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments of this specification. The singular forms "a", "the" and "this" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second" and, similarly, "second" may also be referred to as "first", depending on the context.
First, the terms involved in one or more embodiments of this application are explained.
Image feature fusion: in the image feature input stage, features extracted by multiple pre-trained convolutional networks are fused and used instead of a single image feature, thereby providing richer feature input for the training network.
RNN (Recurrent Neural Network) model: a neural network with a feedback structure whose output depends not only on the current input and the network weights but also on previous inputs. The RNN model models time by adding self-connected hidden layers that span time steps; in other words, the feedback of the hidden layer enters not only the output but also the hidden layer at the next time step.
Transformer: a translation model whose architecture includes an encoder and a decoder. The encoder encodes the source sentence to be translated into a vector, and the decoder decodes the vector of the source sentence to generate the corresponding target sentence.
Image caption: a comprehensive problem combining computer vision, natural language processing and machine learning, in which a natural language sentence describing the content of an image is produced from the image; in plain terms, it is translating a picture into a piece of descriptive text.
Self-attention computation: for example, when a sentence is input for self-attention computation, every word in it is attended against all the words in the sentence, in order to learn the word dependencies within the sentence and capture its internal structure. When self-attention is computed over input image features, every feature is attended against the other features, in order to learn the feature dependencies within the image.
Global image feature: all the features corresponding to the target image.
Target detection feature: the feature of a specific region in the target image.
This application provides an image description method and apparatus, a computing device and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
图1示出了根据本申请一实施例的计算设备100的结构框图。该计算设备100的部件包括但不限于存储器110和处理器120。处理器120与存储器110通过总线130相连接,数据库150用于保存数据。
计算设备100还包括接入设备140,接入设备140使得计算设备100能够经由一个或多个网络160通信。例如,计算设备100可以利用接入设备140,经由网络160与数据库150通信。这些网络的示例包括公用交换电话网(PSTN)、局域网(LAN)、广域网(WAN)、个域网(PAN)或诸如因特网的通信网络的组合等。接入设备140可以包括有线或无线的任何类型的网络接口(例如,网络接口卡(NIC))中的一个或多个,诸如IEEE802.11无线局域网(WLAN)无线接口、全球微波互联接入(Wi-MAX)接口、以太网接口、通用串行总线(USB)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC)接口,等等。
在本申请的一个实施例中,计算设备100的上述部件以及图1中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图1所示的计算设备结构框图仅仅是出于示例的目的,而不是对本说明书范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备100可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备100还可以是移动式或静止式的服务器。
其中,处理器120可以执行图2所示方法中的步骤。图2示出了根据本申请一实施例的图像描述方法的示意性流程图,包括步骤201至步骤204。
201、利用多个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征。
具体地，第一特征提取模型可以为多个，本申请中利用多个第一特征提取模型对目标图像进行特征提取，第一特征提取模型的类型可以包括：VGG(Visual Geometry Group Network,视觉几何组网络)、ResNet模型、DenseNet模型、Inception-v3模型等卷积网络模型。
一种可能的实施方式中,多个第一特征模型提取的图像特征的尺寸相同。通过设置第一特征模型的卷积层参数,可以调节图像特征的尺寸。除了尺寸相同外,各图像特征的通道数也可以相同。例如,提取的图像特征的维度可以表示为224*224*3,其中224*224表示图像特征的高度*宽度,即图像特征的尺寸;3是通道数,也即图像特征的个数。通常情况下,输入图像的高度和宽度相等,卷积层的卷积核大小可以根据实际需求而设置,常用的卷积核有1*1*1、3*3*3、5*5*5、7*7*7等。
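作为一个示意性的小例子（并非本申请限定的实现，其中的通道数、卷积核大小均为假设值），下面的PyTorch代码演示了如何通过卷积层参数控制输出特征图的尺寸与通道数：

```python
import torch
import torch.nn as nn

# 假设输入图像为 224*224*3（高度*宽度*通道数）
x = torch.randn(1, 3, 224, 224)

# 3*3卷积核配合padding=1可保持空间尺寸不变，仅把通道数从3调整为64（参数为假设值）
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv(x).shape)  # torch.Size([1, 64, 224, 224])
```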
一种可能的实施方式中,多个第一特征模型生成的图像特征的尺寸均相同,但是图像特征的个数(通道数)可以彼此不同。例如第1个第一特征提取模型生成的图像特征为P*Q*L1,也即图像特征为L1个,图像特征的尺寸为P*Q;第2个第一特征提取模型生成的图像特征为P*Q*L2,也即图像特征为L2个,图像特征的尺寸为P*Q,其中,P*Q是图像特征的高度*宽度,L1和L2分别为第1个第一特征模型和第2个第一特征模型生成的图像特征的个数。
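下面给出一个示意性的草图，说明“空间尺寸相同、通道数不同”的情况：用torchvision中的两个预训练网络结构（此处不加载权重）分别对同一图像提取特征，二者输出的P*Q相同而通道数L1、L2不同。模型选择与层的截取方式均为示例性假设，并非本申请限定的第一特征提取模型实现。

```python
import torch
import torch.nn as nn
import torchvision.models as models

image = torch.randn(1, 3, 224, 224)        # 假设已预处理好的目标图像

# ResNet-50：去掉池化与全连接层，保留卷积主干，输出约为 1 x 2048 x 7 x 7
resnet = models.resnet50()
resnet_backbone = nn.Sequential(*list(resnet.children())[:-2])
feat_a = resnet_backbone(image)            # 通道数 L1 = 2048

# VGG-16：features部分输出约为 1 x 512 x 7 x 7
vgg = models.vgg16()
feat_b = vgg.features(image)               # 通道数 L2 = 512

print(feat_a.shape, feat_b.shape)          # 空间尺寸P*Q相同(7*7)，通道数不同
```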
202、对多个第一特征提取模型生成的图像特征进行融合处理,生成目标图像对应的全局图像特征。
可以通过泊松融合方法、加权平均法、羽化算法、拉普拉斯融合算法、自注意力算法等,将各第一特征提取模型生成的图像特征进行融合处理,得到目标图像对应的全局图像特征。
一种可能的实施方式中,步骤202包括:
S2021、对多个第一特征提取模型生成的图像特征分别通过对应的第一自注意力层进行特征提取,得到多个中间特征。
其中,第一自注意力层包括多头自注意力层和前馈层。本步骤中第一自注意力层的个数与第一特征提取模型的个数相同。
各第一特征提取模型均可以对应有相应的第一自注意力层。例如,以5个第一特征提取模型为例,该5个第一特征模型均对同一图像进行处理生成对应的图像特征,然后将每个第一特征提取模型生成的图像特征通过对应的第一自注意力层进行特征提取,得到生成的中间特征。
S2022、对多个中间特征进行拼接,生成初始全局特征。
其中，拼接处理可以通过调用concat(拼接)函数来实现。
例如,仍以5个第一特征提取模型为例,将5个第一特征提取模型对应的第一自注意力层生成的中间特征进行拼接处理,生成1个初始全局特征。例如第1个第一特征提取模型对应的第一自注意力层生成A1个中间特征,中间特征的尺寸为P*Q,第2个第一特征提取模型对应的第一自注意力层生成A2个中间特征,中间特征的尺寸为P*Q,第3个第一特征提取模型对应的第一自注意力层生成A3个中间特征,中间特征的尺寸为P*Q,第4个第一特征提取模型对应的第一自注意力层生成A4个中间特征,中间特征的尺寸为P*Q,第5个第一特征提取模型对应的第一自注意力层生成A5个中间特征,中间特征的尺寸为P*Q。那么拼接处理后的初始全局特征包含(A1+A2+A3+A4+A5)个特征。
可以理解的是,本步骤为将多个中间特征进行拼接,并不进行进一步地融合处理,所以,相比于中间特征,生成的初始全局特征中特征之间的关系并未改变,这也就意味着初始全局特征的特征会有部分重复,此类特征会在后续步骤中进一步地进行处理。
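以下是步骤S2022的一个示意性草图（中间特征的个数与维度均为假设值）：先把每个第一自注意力层输出的中间特征展平成“个数×维度”的矩阵，再沿特征个数维度拼接得到初始全局特征。

```python
import torch

# 假设5个第一自注意力层分别输出 A1..A5 个中间特征，每个特征展平后的维度为512
intermediate = [torch.randn(a, 512) for a in (36, 36, 49, 49, 64)]

# 沿特征个数维度拼接，得到包含 A1+A2+A3+A4+A5 个特征的初始全局特征
initial_global = torch.cat(intermediate, dim=0)
print(initial_global.shape)   # torch.Size([234, 512])，即 36+36+49+49+64 个特征
```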
S2023、将初始全局特征通过至少一个第二自注意力层进行融合处理,生成全局图像特征。
其中,第二自注意力层包括多头自注意力层和前馈层。本步骤中第二自注意力层的个数可以为多个,可以根据实际需求自定义设置。
一种实施方式中,第二自注意力层的结构与第一自注意力层的结构可以相同,其目的均是对输入的向量进行自注意力处理,以提取后续步骤中需要进行处理的向量。但不同的是,在第一自注意力层和第二自注意力层均为多个的情形下,多个第一自注意力层为并行地对每个第一特征提取模型生成的图像特征进行处理,而第二自注意力层为串行地对初始全局特征进行逐层处理。
经过多个中间特征进行拼接生成的初始全局特征,经过第二自注意力层进行融合处理,会促使不同特征之间的相互融合。
例如，初始全局特征中包含C类的特征C1和特征C2，二者之间的关联性较强。在通过第二自注意力层进行融合处理的过程中，第二自注意力层会关注到关联性强的特征C1和C2，并根据特征C1和C2融合得到特征C1'。
又例如，初始全局特征中包含重复的多个D类的特征D1，在通过第二自注意力层进行融合处理的过程中，第二自注意力层会关注到这些重复的特征D1，并将它们融合为一个D类的特征D1。
本实施例中,特征融合的方法有很多,例如泊松融合方法、加权平均法、羽化算法、拉普拉斯融合算法、自注意力算法等,本实施例优选使用自注意力算法。
例如，可以用键值对(key-value)来表示输入信息，其中，key代表键，value代表该键对应的值。“键”用来计算注意力分布，“值”用来计算聚合信息。则n个输入信息就可以表示为(K,V)=[(k_1,v_1),(k_2,v_2),...,(k_n,v_n)]。
具体地,可以先根据公式(1),计算Query和Key的相似度:
S_i = F(Q, k_i)    (1)
其中，S_i为注意力得分；
Q为Query，即查询向量；
k_i为第i个key向量。
然后,通过公式(2)用softmax函数对注意力得分进行数值转换。一方面可以进行归一化,得到所有权重系数之和为1的概率分布,另一方面可以用softmax函数的特性突出重要元素的权重:
α_i = softmax(S_i) = exp(S_i) / Σ_j exp(S_j)    (2)
其中，α_i为权重系数。
最后,通过公式(3),根据权重系数对value进行加权求和:
Attention((K,V), Q) = Σ_i α_i · v_i    (3)
其中，v_i为value向量。
根据自注意力计算,将包含(A1+A2+A3+A4+A5)个特征的初始全局特征,经过第二自注意力层的融合处理,可以得到A'个特征的全局图像特征。一般地,A'小于等于(A1+A2+A3+A4+A5)。
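下面用NumPy给出公式(1)~(3)的一个最小示意实现（打分函数F取点积，向量维度为假设值），仅用于说明注意力加权求和的计算流程：

```python
import numpy as np

def attention(query, keys, values):
    scores = keys @ query                     # 公式(1)：S_i = F(Q, k_i)，此处F取点积
    weights = np.exp(scores - scores.max())   # 公式(2)：softmax归一化（减去最大值以保证数值稳定）
    weights = weights / weights.sum()
    return weights @ values                   # 公式(3)：按权重系数α_i对value加权求和

keys = np.random.randn(10, 64)     # 10个key向量
values = np.random.randn(10, 64)   # 对应的10个value向量
query = np.random.randn(64)        # 查询向量Q
print(attention(query, keys, values).shape)   # (64,)
```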
203、利用第二特征提取模型对目标图像进行特征提取,得到目标图像对应的目标检测特征。
本申请中,第二特征模型可以为目标检测特征模型,以实现对目标图像的局部信息的提取。
本步骤203中，第二特征提取模型可以选取Faster R-CNN(Faster Region-based Convolutional Neural Network,快速区域卷积神经网络)模型，用于识别出图像中的感兴趣区域，并通过设定的阈值允许多个感兴趣区域对应的兴趣框的重叠，这样可以更有效地理解图像内容。
Faster R-CNN提取目标检测特征的主要步骤包括：
1)特征提取:以整个目标图像为输入,得到目标图像的特征层。
2)候选区域：利用选择性搜索(Selective Search)等方法从目标图像中提取感兴趣区域，并把这些感兴趣区域对应的兴趣框一一投影到最后的特征层。
3)区域归一化:针对特征层上的每个候选区域候选框进行池化操作,得到固定大小的特征表示。
4)分类:通过两个全连接层,分别用Softmax多分类函数做目标识别,得到最终的目标检测特征。
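作为参考，下面给出一个假设性的示例：直接使用torchvision自带的Faster R-CNN检测模型获取感兴趣区域的兴趣框。它只用于说明“目标检测特征提取”这一环节的输入输出形式，并不代表本申请实际采用的第二特征提取模型或其训练方式（假设torchvision版本不低于0.13）。

```python
import torch
import torchvision

# 加载torchvision提供的Faster R-CNN结构（不加载任何预训练权重，仅作接口示意）
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)
model.eval()

image = torch.rand(3, 480, 640)          # 假设取值范围为[0,1]的目标图像
with torch.no_grad():
    outputs = model([image])             # 每张图像返回一个检测结果字典

boxes = outputs[0]["boxes"]              # 感兴趣区域对应的兴趣框，形状为 (num_boxes, 4)
scores = outputs[0]["scores"]            # 每个兴趣框的置信度
print(boxes.shape, scores.shape)         # 随机权重下可能只检出很少的框，仅作流程演示
```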
204、将目标图像对应的全局图像特征和目标检测特征输入至翻译模型,将生成的翻译语句作为目标图像的描述语句。
其中,翻译模型包括编码器和解码器。翻译模型有多种,例如Transformer模型、RNN模型等,本实施例优选使用Transformer模型,可以进一步使输出的句子更为准确。
与RNN模型相比，Transformer模型不需要循环，而是并行处理输入的目标图像对应的全局图像特征和目标检测特征，同时利用自注意力机制建模特征之间的依赖关系。Transformer模型的训练速度比RNN快很多，而且其翻译结果相比于RNN的翻译结果也更为准确。
一种实施方式中,翻译语句可以包括多个翻译词语,对于解码器来说,每次解码得到一个翻译词语。对于所述翻译语句的第一个翻译词语,所述参考解码向量为预设的初始解码向量;对于所述翻译语句的除去第一个翻译词语之外的其他翻译词语,其参考解码向量为上一个翻译词语对应的解码向量。
本申请提供的图像描述的方法,通过利用多个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征,将多个第一特征提取模型生成的图像特征融合生成目标图像对应的全局图像特征,克服了单一特征提取模型过于依赖模型自身性能的缺陷,相比于现有技术中利用单一特征提取模型的图像特征,能够减轻单一特征提取模型提取的图像特征性能单一的缺陷,从而在后续将目标图像对应的全局图像特征和目标检测特征输入至翻译模型生成翻译语句的过程中,有更为丰富图像信息的全局图像特征作为参考,使输出的翻译语句更加准确。
本申请一实施例的图像描述的方法还可以如图3所示,包括:
301、利用多个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征。
302、对多个第一特征提取模型生成的图像特征进行融合处理,生成目标图像对应的全局图像特征。
303、利用第二特征提取模型对目标图像进行特征提取,得到目标图像对应的目标检测特征。
对于步骤301~303,与前述实施例的步骤201~203相同,具体的解释可以参见前述实施例,此处不再赘述。
304、将目标检测特征和全局图像特征输入至翻译模型的编码器,生成编码器输出的编码向量。
可选地,编码器可以包括1个编码层,也可以包括多个编码层。本实施例以编码器包括N个依次连接的编码层为例进行说明,其中,N>1。步骤304包括下述步骤S3041~S3044:
S3041、将所述目标检测特征和全局图像特征输入至第一个编码层,得到第一个编码层的输出向量。
S3042、将第i-1个编码层的输出向量和全局图像特征输入至第i个编码层，得到第i个编码层的输出向量，其中，2≤i≤N。
S3043、判断i是否等于N,若否,将i自增1,执行步骤S3042,若是,执行步骤S3044。
S3044、将第N个编码层的输出向量作为编码器输出的编码向量。
将全局图像特征和第一个编码层的输出向量输入至第二个编码层,得到第二个编码层的输出向量;将全局图像特征和第二个编码层的输出向量输入至第三个编码层,得到第三个编码层的输出向量;继续下去,直至得到第N个编码层的输出向量。
在本申请实施例中,在编码层侧,将全局图像特征输入至每个编码层,使目标检测特征在每个编码层的处理中均融入了全局图像特征,增强了目标检测特征的特征表示。
一种可能的实施方式中,参见图4,编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;
步骤S3041包括:将所述目标检测特征输入至第一编码自注意力层,得到第一中间向量;将第一中间向量和全局图像特征输入至第二编码自注意力层,得到第二中间向量;将所述第二中间向量经过第一前馈层进行处理,得到第一个编码层的输出向量。
步骤S3042包括:将所述第i-1个编码层的输出向量输入至第一编码自注意力层,得到第三中间向量;将第三中间向量和全局图像特征输入至第二编码自注意力层,得到第四中间向量;将所述第四中间向量经过第一前馈层进行处理,得到第i个编码层的输出向量。
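结合图4，下面给出编码层的一个示意性PyTorch草图：第一编码自注意力层对输入做自注意力，第二编码自注意力层以全局图像特征为key/value进行交互，最后经第一前馈层输出。其中残差连接、层归一化以及各维度取值均为常见做法的假设，并非本申请限定的具体结构。

```python
import torch
import torch.nn as nn

class EncodeLayer(nn.Module):
    """示意性的编码层草图：第一编码自注意力层 + 第二编码自注意力层(引入全局图像特征) + 第一前馈层。"""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)    # 第一编码自注意力层
        self.global_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # 第二编码自注意力层
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))  # 第一前馈层
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, global_feat):
        h, _ = self.self_attn(x, x, x)                        # 得到第一/第三中间向量
        x = self.norm1(x + h)
        h, _ = self.global_attn(x, global_feat, global_feat)  # 与全局图像特征交互，得到第二/第四中间向量
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))                    # 经前馈层得到本编码层的输出向量

layer = EncodeLayer()
detect_feat = torch.randn(1, 36, 512)   # 目标检测特征（假设36个区域特征）
global_feat = torch.randn(1, 50, 512)   # 全局图像特征（假设A'=50个特征）
print(layer(detect_feat, global_feat).shape)   # torch.Size([1, 36, 512])
```

将N个这样的编码层依次连接，并把全局图像特征输入每一层，即可按步骤S3041~S3044得到编码器输出的编码向量。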
305、将编码向量以及全局图像特征输入至解码器,生成解码器输出的解码向量。
可选的,解码器可以包括1个解码层,也可以包括多个解码层。本实施例以解码器包括M个依次连接的解码层为例进行说明。其中,M>1。
步骤305包括下述步骤S3051~S3054:
S3051、将参考解码向量、编码向量和全局图像特征输入至第一个解码层,得到第一个解码层的输出向量。
对于所述翻译语句的第一个翻译词语,所述参考解码向量为初始解码向量;
对于所述翻译语句的其他翻译词语,所述参考解码向量为上一个翻译词语对应的解码向量。
S3052、将第j-1个解码层的输出向量、编码向量和全局图像特征输入至第j个解码层,得到第j个解码层的输出向量,其中,2≤j≤M。
S3053、判断j是否等于M,若否,将j自增1,执行步骤S3052,若是,执行步骤S3054。
S3054、将第M个解码层的输出向量作为解码器输出的解码向量。
将编码向量、全局图像特征和第一个解码层的输出向量输入至第二个解码层,得到第二个解码层的输出向量;将编码向量、全局图像特征和第二个解码层的输出向量输入至第三个解码层,得到第三个解码层的输出向量;继续下去,直至得到第M个解码层的输出向量。
在本申请实施例中，将全局图像特征输入至解码器的每个解码层，从而可以在每个解码层的解码过程中，将包含有丰富图像信息的全局图像特征作为背景信息，可以使解码得到的解码向量与图像信息的对应度更高，使输出的翻译语句更加准确。
一种可能的实施方式中,参见图5,解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层。
步骤S3051包括:将参考解码向量经过所述第一解码自注意力层进行处理,得到第五中间向量;将第五中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第六中间向量;将第六中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第七中间向量;将第七中间向量经过第二前馈层进行处理,得到第一个解码层的输出向量。
步骤S3052包括:将第j-1个解码层的输出向量经过所述第一解码自注意力层进行处理,得到第八中间向量;将第八中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第九中间向量;将第九中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第十中间向量;将第十中间向量经过第二前馈层进行处理,得到第j个解码层的输出向量。
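结合图5，解码层可以写成类似的示意性草图（残差连接与归一化方式同样为假设）：依次经过第一、第二、第三解码自注意力层和第二前馈层，其中第二层以全局图像特征为key/value，第三层以编码向量为key/value。

```python
import torch
import torch.nn as nn

class DecodeLayer(nn.Module):
    """示意性的解码层草图：第一/第二/第三解码自注意力层 + 第二前馈层。"""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # 第一解码自注意力层
        self.attn2 = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # 第二解码自注意力层(全局图像特征)
        self.attn3 = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # 第三解码自注意力层(编码向量)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))  # 第二前馈层
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, dec_in, enc_out, global_feat):
        h, _ = self.attn1(dec_in, dec_in, dec_in)              # 第五/第八中间向量
        x = self.norms[0](dec_in + h)
        h, _ = self.attn2(x, global_feat, global_feat)         # 第六/第九中间向量
        x = self.norms[1](x + h)
        h, _ = self.attn3(x, enc_out, enc_out)                 # 第七/第十中间向量
        x = self.norms[2](x + h)
        return self.norms[3](x + self.ffn(x))                  # 经第二前馈层得到本解码层的输出向量

layer = DecodeLayer()
out = layer(torch.randn(1, 5, 512), torch.randn(1, 36, 512), torch.randn(1, 50, 512))
print(out.shape)   # torch.Size([1, 5, 512])
```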
306、根据解码器输出的解码向量生成对应的翻译语句,并将翻译语句作为目标图像的描述语句。
一种可能的实施方式中,根据所述解码器输出的解码向量生成对应的翻译词语,并根据所述翻译词语生成翻译语句。
可选的,翻译语句可以包括多个翻译词语,对于解码器来说,每次解码得到一个翻译词语。对于所述翻译语句的第一个翻译词语,所述参考解码向量为预设的初始解码向量;对于所述翻译语句的除去第一个翻译词语之外的其他翻译词语,其参考解码向量为上一个翻译词语对应的解码向量。
本申请提供的图像描述的方法,通过利用多个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征,将多个第一特征提取模型生成的图像特征融合生成目标图像对应的全局图像特征,克服了单一特征提取模型过于依赖模型自身性能的缺陷,相比于现有技术中利用单一特征提取模型的图像特征,能够减轻单一特征提取模型提取的图像特征性能单一的缺陷,从而在后续将目标图像对应的全局图像特征和目标检测特征输入至翻译模型生成翻译语句的过程中,有更为丰富图像信息的全局图像特征作为参考,使输出的翻译语句更加准确。
其次，本实施例通过多个第一特征提取模型对目标图像进行特征提取，并把多个第一特征提取模型提取到的图像特征进行拼接得到初始全局特征，从而可以使初始全局特征尽可能地包含目标图像的更全的特征，然后再经过多个第二自注意力层进行融合，获取需要重点关注的目标区域，而后对这一区域投入更多的注意力计算资源，获取更多与目标有关的细节信息，而忽视其他无关信息。通过这种机制可以利用有限的注意力计算资源从大量信息中快速筛选出高价值的信息，得到包含更为丰富的图像信息的全局图像特征。
再次,本方法将全局图像特征输入至解码器的每个解码层,从而可以在每个解码层的解码过程中,将包含有丰富图像信息的全局图像特征作为背景信息,可以使解码得到的解码向量与图像信息的对应度更高,使输出的翻译语句更加准确。
本实施例的图像描述的方法适用于编码器—解码器的机器翻译模型。为了更清楚地对本申请的图像描述的方法进行说明，参见图6，以Transformer翻译模型为例进行示意性的说明。图6中，包括4个第一特征提取模型，即VGG、ResNet、DenseNet、Inception-v3；4个第一自注意力层；K个第二自注意力层；1个第二特征提取模型以及Transformer翻译模型。Concat指concat函数，用于将多个特征拼接在一起。
本实施例的图像描述的方法包括下述步骤S61~S68:
S61、利用4个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征。
S62、对4个第一特征提取模型生成的图像特征分别通过对应的第一自注意力层进行处理,得到生成的中间特征。
其中,第1个第一特征提取模型生成的图像特征通过对应的第一自注意力层进行处理,得到A1个中间特征,中间特征的尺寸为P*Q;第2个第一特征提取模型生成的图像特征通过对应的第一自注意力层进行处理,得到A2个中间特征,中间特征的尺寸为P*Q;第3个第一特征提取模型生成的图像特征通过对应的第一自注意力层进行处理,得到A3个中间特征,中间特征的尺寸为P*Q;第4个第一特征提取模型生成的图像特征通过对应的第一自注意力层进行处理,得到A4个中间特征,中间特征的尺寸为P*Q。
S63、对4个中间特征进行拼接,生成初始全局特征。
其中,对4个中间特征进行拼接,生成包含(A1+A2+A3+A4)个特征的初始全局特征。
S64、将初始全局特征通过K个第二自注意力层进行融合处理,生成全局图像特征。
本实施例中,K=3。
其中,对包含(A1+A2+A3+A4)个特征的初始全局特征进行融合处理,生成包含A'个特征的全局图像特征。一般地,A'≤(A1+A2+A3+A4)。
S65、利用第二特征提取模型对目标图像进行特征提取,得到目标图像对应的目标检测特征。
本实施例中，第二特征提取模型为Faster R-CNN(Faster Region-based Convolutional Neural Network,快速区域卷积神经网络)模型。
S66、将目标检测特征和全局图像特征输入至Transformer翻译模型的编码器,生成编码器输出的编码向量。
S67、将参考解码向量、编码向量以及全局图像特征输入至解码器，生成解码器输出的解码向量。
其中,编码器包括N个编码层,解码器包括M个解码层。
S68、根据解码器输出的解码向量生成对应的翻译语句,并将所述翻译语句作为所述目标图像的描述语句。
其中，根据Transformer模型的性能，可以输出不同语言的描述语句。其中，Transformer模型的性能可以通过样本集的训练而形成，例如样本集为“中文待翻译语句+法语翻译语句”的集合、“英语待翻译语句+日语翻译语句”的集合或者“图像特征+英语翻译语句”的集合。本实施例以Transformer模型的性能为根据输入的图像特征翻译生成英文翻译语句为例进行说明。
可选的，根据输入的初始参考解码向量、编码向量以及全局图像特征，解码器输出解码向量，并得到第1个词语“a”。将第1个词语“a”对应的向量作为参考解码向量，解码得到第2个词语“boy”。将第2个词语“boy”对应的向量作为参考解码向量，以使解码器根据参考解码向量、编码向量以及全局图像特征得到下一个词语“play”……依次类推，得到描述语句“A boy play football on football field”。
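下面用一个高度简化的草图示意这种“用上一个词语对应的向量作为参考解码向量”的逐词解码流程。其中的词表、嵌入层、线性映射和占位解码器均为假设的示例组件，仅用于说明解码循环的组织方式，不代表实际训练好的翻译模型：

```python
import torch
import torch.nn as nn

# 假设的词表、嵌入层与"解码向量 -> 词表分布"的线性映射
vocab = {"<start>": 0, "<end>": 1, "a": 2, "boy": 3, "play": 4}
embed = nn.Embedding(len(vocab), 512)
to_vocab = nn.Linear(512, len(vocab))

def decoder(ref, enc_out, global_feat):     # 占位解码器：实际应为M个依次连接的解码层
    return ref

enc_out, global_feat = torch.randn(1, 36, 512), torch.randn(1, 50, 512)
ref, words = embed(torch.tensor([[vocab["<start>"]]])), []   # 初始参考解码向量
for _ in range(10):
    dec = decoder(ref, enc_out, global_feat)                 # 解码器输出的解码向量
    word_id = to_vocab(dec[:, -1]).argmax(-1).item()         # 根据解码向量得到当前翻译词语
    if word_id == vocab["<end>"]:
        break
    words.append(word_id)
    ref = embed(torch.tensor([[word_id]]))                   # 上一个词语对应的向量作为下一步的参考解码向量
print(words)
```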
本申请一实施例还提供一种图像描述的装置,参见图7,包括:
特征提取模块701,被配置为利用多个第一特征提取模型对目标图像进行特征提取,得到每个第一特征提取模型生成的图像特征;
全局图像特征提取模块702,被配置为对所述多个第一特征提取模型生成的图像特征进行融合处理,生成所述目标图像对应的全局图像特征;
目标检测特征提取模块703,被配置为利用第二特征提取模型对目标图像进行特征提取,得到所述目标图像对应的目标检测特征;
翻译模块704,被配置为将所述目标图像对应的全局图像特征和目标检测特征输入至翻译模型,将生成的翻译语句作为所述目标图像的描述语句。
可选地,全局图像特征提取模块702具体被配置为:
对所述多个第一特征提取模型生成的图像特征分别通过对应的第一自注意力层进行特征提取,得到多个中间特征;
对多个中间特征进行拼接,生成初始全局特征;
将初始全局特征通过至少一个第二自注意力层进行融合处理,生成全局图像特征。
可选地,翻译模型包括编码器和解码器,所述翻译模块704包括:
编码模块,被配置为将所述目标检测特征和全局图像特征输入至所述翻译模型的编码器,生成所述编码器输出的编码向量;
解码模块,被配置为将所述编码向量以及所述全局图像特征输入至解码器,生成所述解码器输出的解码向量;
语句生成模块，被配置为根据所述解码器输出的解码向量生成对应的翻译语句，并将所述翻译语句作为所述目标图像的描述语句。
可选地,所述编码器包括N个依次连接的编码层,其中,N为大于1的整数;编码模块包括:
第一处理单元,被配置为将所述目标检测特征和全局图像特征输入至第一个编码层,得到第一个编码层的输出向量;
第二处理单元,被配置为将第i-1个编码层的输出向量和全局图像特征输入至第i个编码层,得到第i个编码层的输出向量,其中,2≤i≤N;
第一判断单元,被配置为判断i是否等于N,若否,将i自增1,执行第二处理单元,若是,执行编码向量生成单元;
编码向量生成单元,被配置为将第N个编码层的输出向量作为编码器输出的编码向量。
可选地,编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;第一处理单元具体被配置为:将所述目标检测特征输入至第一编码自注意力层,得到第一中间向量;将第一中间向量和全局图像特征输入至第二编码自注意力层,得到第二中间向量;将所述第二中间向量经过第一前馈层进行处理,得到第一个编码层的输出向量。
可选地,编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;第二处理单元具体被配置为:将所述第i-1个编码层的输出向量输入至第一编码自注意力层,得到第三中间向量;将第三中间向量和全局图像特征输入至第二编码自注意力层,得到第四中间向量;将所述第四中间向量经过第一前馈层进行处理,得到第i个编码层的输出向量。
可选地,解码器包括M个依次连接的解码层,其中,M为大于1的整数;
所述解码模块包括:
第三处理单元,被配置为将参考解码向量、编码向量和全局图像特征输入至第一个解码层,得到第一个解码层的输出向量;
第四处理单元,被配置为将第j-1个解码层的输出向量、编码向量和全局图像特征输入至第j个解码层,得到第j个解码层的输出向量,其中,2≤j≤M;
第二判断单元,被配置为判断j是否等于M,若否,将j自增1,执行第四处理单元,若是,执行解码向量生成单元;
解码向量生成单元,被配置为将第M个解码层的输出向量作为解码器输出的解码向量。
可选地,所述解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层;第三处理单元具体被配置为:
将参考解码向量经过第一解码自注意力层进行处理,得到第五中间向量;
将第五中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第六中间向量;
将第六中间向量和所述编码向量经过所述第三解码自注意力层进行处理，得到第七中间向量；
将第七中间向量经过第二前馈层进行处理,得到第一个解码层的输出向量。
可选地,解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层;第四处理单元具体被配置为:
将第j-1个解码层的输出向量经过所述第一解码自注意力层进行处理,得到第八中间向量;
将第八中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第九中间向量;
将第九中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第十中间向量;
将第十中间向量经过第二前馈层进行处理,得到第j个解码层的输出向量。
上述为本实施例的一种图像描述的装置的示意性方案。需要说明的是,该图像描述的装置的技术方案与上述的图像描述的方法的技术方案属于同一构思,图像描述的装置的技术方案未详细描述的细节内容,均可以参见上述图像描述的方法的技术方案的描述。
本申请一实施例还提供一种计算机可读存储介质,其存储有计算机指令,该指令被处理器执行时实现如前所述图像描述的方法的步骤。
上述为本实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的图像描述的方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述图像描述的方法的技术方案的描述。
本申请一实施例还提供了一种计算机程序产品,用于在运行时实现如前所述的图像描述的方法的步骤。
所述计算机指令包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本申请所必须的。
在上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述的部分，可以参见其它实施例的相关描述。
以上公开的本申请优选实施例只是用于帮助阐述本申请。可选实施例并没有详尽叙述所有的细节,也不限制该申请仅为所述的具体实施方式。显然,根据本说明书的内容,可作很多的修改和变化。本说明书选取并具体描述这些实施例,是为了更好地解释本申请的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本申请。本申请仅受权利要求书及其全部范围和等效物的限制。

Claims (20)

  1. 一种图像描述的方法,包括:
    利用多个第一特征提取模型对目标图像进行特征提取,得到每个所述第一特征提取模型生成的图像特征;
    对所述多个第一特征提取模型生成的图像特征进行融合处理,生成所述目标图像对应的全局图像特征;
    利用第二特征提取模型对所述目标图像进行特征提取,得到所述目标图像对应的目标检测特征;
    将所述目标图像对应的所述全局图像特征和所述目标检测特征输入至翻译模型,将生成的翻译语句作为所述目标图像的描述语句。
  2. 如权利要求1所述的方法,对所述多个第一特征提取模型生成的图像特征进行融合处理,生成所述目标图像对应的全局图像特征,包括:
    对所述多个第一特征提取模型生成的图像特征分别通过对应的第一自注意力层进行特征提取,得到多个中间特征;
    对所述多个中间特征进行拼接,生成初始全局特征;
    将所述初始全局特征通过至少一个第二自注意力层进行融合处理,生成全局图像特征。
  3. 如权利要求1或2所述的方法,所述翻译模型包括编码器和解码器;
    将所述目标图像对应的所述全局图像特征和所述目标检测特征输入至翻译模型,将生成的翻译语句作为所述目标图像的描述语句,包括:
    将所述目标检测特征和所述全局图像特征输入至所述翻译模型的编码器,生成所述编码器输出的编码向量;
    将所述编码向量以及所述全局图像特征输入至解码器,生成所述解码器输出的解码向量;
    根据所述解码器输出的解码向量生成对应的翻译语句,并将所述翻译语句作为所述目标图像的描述语句。
  4. 如权利要求3所述的方法,所述编码器包括N个依次连接的编码层,其中,N为大于1的整数;
    将所述目标检测特征和所述全局图像特征输入至所述翻译模型的编码器,生成所述编码器输出的编码向量,包括:
    S11、将所述目标检测特征和所述全局图像特征输入至第一个编码层,得到第一个编码层的输出向量;
    S12、将第i-1个编码层的输出向量和所述全局图像特征输入至第i个编码层,得到第i个编码层的输出向量,其中,2≤i≤N;
    S13、判断i是否等于N,若否,将i自增1,执行步骤S12,若是,执行步骤S14;
    S14、将第N个编码层的输出向量作为所述编码器输出的编码向量。
  5. 如权利要求4所述的方法,所述编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;
    将所述目标检测特征和所述全局图像特征输入至第一个编码层,得到第一个编码层的输出向量,包括:
    将所述目标检测特征输入至第一编码自注意力层,得到第一中间向量;
    将所述第一中间向量和所述全局图像特征输入至所述第二编码自注意力层,得到第二中间向量;
    将所述第二中间向量经过所述第一前馈层进行处理,得到第一个编码层的输出向量。
  6. 如权利要求4或5所述的方法,所述编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;
    将第i-1个编码层的输出向量和所述全局图像特征输入至第i个编码层,得到第i个编码层的输出向量,包括:
    将所述第i-1个编码层的输出向量输入至第一编码自注意力层,得到第三中间向量;
    将所述第三中间向量和所述全局图像特征输入至第二编码自注意力层,得到第四中间向量;
    将所述第四中间向量经过第一前馈层进行处理,得到第i个编码层的输出向量。
  7. 如权利要求3-6任一所述的方法,所述解码器包括M个依次连接的解码层,其中,M为大于1的整数;
    将所述编码向量以及所述全局图像特征输入至解码器,生成所述解码器输出的解码向量,包括:
    S21、将参考解码向量、所述编码向量和所述全局图像特征输入至第一个解码层,得到第一个解码层的输出向量;
    S22、将第j-1个解码层的输出向量、所述编码向量和所述全局图像特征输入至第j个解码层,得到第j个解码层的输出向量,其中,2≤j≤M;
    S23、判断j是否等于M,若否,将j自增1,执行步骤S22,若是,执行步骤S24;
    S24、将第M个解码层的输出向量作为所述解码器输出的解码向量。
  8. 如权利要求7所述的方法,所述解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层;
    将参考解码向量、所述编码向量和所述全局图像特征输入至第一个解码层,得到第一个解码层的输出向量,包括:
    将所述参考解码向量经过所述第一解码自注意力层进行处理,得到第五中间向量;
    将所述第五中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第六中间向量;
    将所述第六中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第七中间向量;
    将所述第七中间向量经过第二前馈层进行处理,得到第一个解码层的输出向量。
  9. 如权利要求7或8所述的方法,所述解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层;
    将第j-1个解码层的输出向量、编码向量和全局图像特征输入至第j个解码层,得到第j个解码层的输出向量,包括:
    将第j-1个解码层的输出向量经过所述第一解码自注意力层进行处理,得到第八中间向量;
    将所述第八中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第九中间向量;
    将所述第九中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第十中间向量;
    将所述第十中间向量经过所述第二前馈层进行处理,得到第j个解码层的输出向量。
  10. 一种图像描述的装置,包括:
    特征提取模块,被配置为利用多个第一特征提取模型对目标图像进行特征提取,得到每个所述第一特征提取模型生成的图像特征;
    全局图像特征提取模块,被配置为对所述多个第一特征提取模型生成的图像特征进行融合处理,生成所述目标图像对应的全局图像特征;
    目标检测特征提取模块,被配置为利用第二特征提取模型对所述目标图像进行特征提取,得到所述目标图像对应的目标检测特征;
    翻译模块,被配置为将所述目标图像对应的所述全局图像特征和所述目标检测特征输入至翻译模型,将生成的翻译语句作为所述目标图像的描述语句。
  11. 如权利要求10所述的装置,所述全局图像特征提取模块具体被配置为:
    对所述多个第一特征提取模型生成的图像特征分别通过对应的第一自注意力层进行特征提取,得到多个中间特征;
    对多个中间特征进行拼接,生成初始全局特征;
    将初始全局特征通过至少一个第二自注意力层进行融合处理,生成全局图像特征。
  12. 如权利要求10或11所述的装置,所述翻译模型包括编码器和解码器,所述翻译模块包括:
    编码模块,被配置为将所述目标检测特征和全局图像特征输入至所述翻译模型的编码器,生成所述编码器输出的编码向量;
    解码模块,被配置为将所述编码向量以及所述全局图像特征输入至解码器,生成所述解码器输出的解码向量;
    语句生成模块,被配置为根据所述解码器输出的解码向量生成对应的翻译语句,并将所述翻译语句作为所述目标图像的描述语句。
  13. 如权利要求12所述的装置,所述编码器包括N个依次连接的编码层,其中,N为大于1的整数;所述编码模块包括:
    第一处理单元,被配置为将所述目标检测特征和全局图像特征输入至第一个编码层,得到第一个编码层的输出向量;
    第二处理单元,被配置为将第i-1个编码层的输出向量和全局图像特征输入至第i个编码层,得到第i个编码层的输出向量,其中,2≤i≤N;
    第一判断单元,被配置为判断i是否等于N,若否,将i自增1,执行第二处理单元,若是,执行编码向量生成单元;
    编码向量生成单元,被配置为将第N个编码层的输出向量作为编码器输出的编码向量。
  14. 如权利要求13所述的装置,所述编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;所述第一处理单元具体被配置为:将所述目标检测特征输入至第一编码自注意力层,得到第一中间向量;将第一中间向量和全局图像特征输入至第二编码自注意力层,得到第二中间向量;将所述第二中间向量经过第一前馈层进行处理,得到第一个编码层的输出向量。
  15. 如权利要求13或14所述的装置,所述编码层包括:第一编码自注意力层、第二编码自注意力层和第一前馈层;所述第二处理单元具体被配置为:将所述第i-1个编码层的输出向量输入至第一编码自注意力层,得到第三中间向量;将第三中间向量和全局图像特征输入至第二编码自注意力层,得到第四中间向量;将所述第四中间向量经过第一前馈层进行处理,得到第i个编码层的输出向量。
  16. 如权利要求12-15任一所述的装置,所述解码器包括M个依次连接的解码层,其中,M为大于1的整数;
    所述解码模块包括:
    第三处理单元,被配置为将参考解码向量、编码向量和全局图像特征输入至第一个解码层,得到第一个解码层的输出向量;
    第四处理单元,被配置为将第j-1个解码层的输出向量、编码向量和全局图像特征输入至第j个解码层,得到第j个解码层的输出向量,其中,2≤j≤M;
    第二判断单元,被配置为判断j是否等于M,若否,将j自增1,执行第四处理单元,若是,执行解码向量生成单元;
    解码向量生成单元,被配置为将第M个解码层的输出向量作为解码器输出的解码向量。
  17. 如权利要求16所述的装置，所述解码层包括：第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层；所述第三处理单元具体被配置为：将参考解码向量经过第一解码自注意力层进行处理，得到第五中间向量；将第五中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理，得到第六中间向量；将第六中间向量和所述编码向量经过所述第三解码自注意力层进行处理，得到第七中间向量；将第七中间向量经过第二前馈层进行处理，得到第一个解码层的输出向量。
  18. 如权利要求16或17所述的装置,所述解码层包括:第一解码自注意力层、第二解码自注意力层、第三解码自注意力层和第二前馈层;所述第四处理单元具体被配置为:将第j-1个解码层的输出向量经过所述第一解码自注意力层进行处理,得到第八中间向量;将第八中间向量和所述全局图像特征经过所述第二解码自注意力层进行处理,得到第九中间向量;将第九中间向量和所述编码向量经过所述第三解码自注意力层进行处理,得到第十中间向量;将第十中间向量经过第二前馈层进行处理,得到第j个解码层的输出向量。
  19. 一种计算设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机指令,所述处理器执行所述指令时实现权利要求1-9任意一项所述方法的步骤。
  20. 一种计算机可读存储介质,其存储有计算机指令,该指令被处理器执行时实现权利要求1-9任意一项所述方法的步骤。
PCT/CN2020/111602 2019-08-27 2020-08-27 一种图像描述的方法及装置、计算设备和存储介质 WO2021037113A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022513610A JP2022546811A (ja) 2019-08-27 2020-08-27 画像キャプションの方法、装置、計算機器及び記憶媒体
US17/753,304 US20220351487A1 (en) 2019-08-27 2020-08-27 Image Description Method and Apparatus, Computing Device, and Storage Medium
EP20856644.8A EP4024274A4 (en) 2019-08-27 2020-08-27 IMAGE DESCRIPTION METHOD AND DEVICE, COMPUTER DEVICE AND STORAGE MEDIA

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910797332.XA CN110309839B (zh) 2019-08-27 2019-08-27 一种图像描述的方法及装置
CN201910797332.X 2019-08-27

Publications (1)

Publication Number Publication Date
WO2021037113A1 true WO2021037113A1 (zh) 2021-03-04

Family

ID=68083691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111602 WO2021037113A1 (zh) 2019-08-27 2020-08-27 一种图像描述的方法及装置、计算设备和存储介质

Country Status (5)

Country Link
US (1) US20220351487A1 (zh)
EP (1) EP4024274A4 (zh)
JP (1) JP2022546811A (zh)
CN (1) CN110309839B (zh)
WO (1) WO2021037113A1 (zh)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368476B (zh) * 2017-07-25 2020-11-03 深圳市腾讯计算机系统有限公司 一种翻译的方法、目标信息确定的方法及相关装置
CN110309839B (zh) * 2019-08-27 2019-12-03 北京金山数字娱乐科技有限公司 一种图像描述的方法及装置
CN111275110B (zh) * 2020-01-20 2023-06-09 北京百度网讯科技有限公司 图像描述的方法、装置、电子设备及存储介质
CN111611420B (zh) * 2020-05-26 2024-01-23 北京字节跳动网络技术有限公司 用于生成图像描述信息的方法和装置
CN111767727B (zh) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 数据处理方法及装置
CN111916050A (zh) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 语音合成方法、装置、存储介质和电子设备
CN112256902A (zh) * 2020-10-20 2021-01-22 广东三维家信息科技有限公司 图片的文案生成方法、装置、设备及存储介质
CN113269182A (zh) * 2021-04-21 2021-08-17 山东师范大学 一种基于变体transformer对小区域敏感的目标果实检测方法及系统
CN113378919B (zh) * 2021-06-09 2022-06-14 重庆师范大学 融合视觉常识和增强多层全局特征的图像描述生成方法
CN113673557A (zh) * 2021-07-12 2021-11-19 浙江大华技术股份有限公司 特征处理方法、动作定位方法及相关设备


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070098303A1 (en) * 2005-10-31 2007-05-03 Eastman Kodak Company Determining a particular person from a collection
CN105117688B (zh) * 2015-07-29 2018-08-28 重庆电子工程职业学院 基于纹理特征融合和svm的人脸识别方法
CN108875767A (zh) * 2017-12-07 2018-11-23 北京旷视科技有限公司 图像识别的方法、装置、系统及计算机存储介质
CN108510012B (zh) * 2018-05-04 2022-04-01 四川大学 一种基于多尺度特征图的目标快速检测方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9978119B2 (en) * 2015-10-22 2018-05-22 Korea Institute Of Science And Technology Method for automatic facial impression transformation, recording medium and device for performing the method
CN108665506A (zh) * 2018-05-10 2018-10-16 腾讯科技(深圳)有限公司 图像处理方法、装置、计算机存储介质及服务器
CN109726696A (zh) * 2019-01-03 2019-05-07 电子科技大学 基于推敲注意力机制的图像描述生成系统及方法
CN110210499A (zh) * 2019-06-03 2019-09-06 中国矿业大学 一种图像语义描述的自适应生成系统
CN110309839A (zh) * 2019-08-27 2019-10-08 北京金山数字娱乐科技有限公司 一种图像描述的方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI LINGHUI, TANG SHENG, ZHANG YONGDONG, DENG LIXI, TIAN QI: "GLA: Global–Local Attention for Image Description", IEEE TRANSACTIONS ON MULTIMEDIA., IEEE SERVICE CENTER, US, vol. 20, no. 3, 1 March 2018 (2018-03-01), US, pages 726 - 737, XP055785886, ISSN: 1520-9210, DOI: 10.1109/TMM.2017.2751140 *
See also references of EP4024274A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019142A (zh) * 2022-06-14 2022-09-06 辽宁工业大学 基于融合特征的图像标题生成方法、系统、电子设备
CN115019142B (zh) * 2022-06-14 2024-03-29 辽宁工业大学 基于融合特征的图像标题生成方法、系统、电子设备
CN116579352A (zh) * 2023-04-25 2023-08-11 无锡捷通数智科技有限公司 翻译模型训练方法、装置、移动终端及存储介质

Also Published As

Publication number Publication date
JP2022546811A (ja) 2022-11-09
US20220351487A1 (en) 2022-11-03
CN110309839A (zh) 2019-10-08
EP4024274A1 (en) 2022-07-06
EP4024274A4 (en) 2022-10-12
CN110309839B (zh) 2019-12-03

Similar Documents

Publication Publication Date Title
WO2021037113A1 (zh) 一种图像描述的方法及装置、计算设备和存储介质
CN108920622B (zh) 一种意图识别的训练方法、训练装置和识别装置
JP7193252B2 (ja) 画像の領域のキャプション付加
CN107066464B (zh) 语义自然语言向量空间
CN111368993B (zh) 一种数据处理方法及相关设备
WO2022095682A1 (zh) 文本分类模型的训练方法、文本分类方法、装置、设备、存储介质及计算机程序产品
CN111738251B (zh) 一种融合语言模型的光学字符识别方法、装置和电子设备
CN112435656B (zh) 模型训练方法、语音识别方法、装置、设备及存储介质
CN113255755A (zh) 一种基于异质融合网络的多模态情感分类方法
US20230325673A1 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
CN109919221B (zh) 基于双向双注意力机制图像描述方法
CN110083729B (zh) 一种图像搜索的方法及系统
CN110162766B (zh) 词向量更新方法和装置
CN107305543B (zh) 对实体词的语义关系进行分类的方法和装置
WO2023050708A1 (zh) 一种情感识别方法、装置、设备及可读存储介质
WO2023134083A1 (zh) 基于文本的情感分类方法和装置、计算机设备、存储介质
CN113111908A (zh) 一种基于模板序列或词序列的bert异常检测方法及设备
CN114528374A (zh) 一种基于图神经网络的电影评论情感分类方法及装置
CN115131613A (zh) 一种基于多向知识迁移的小样本图像分类方法
CN115130591A (zh) 一种基于交叉监督的多模态数据分类方法及装置
CN115964638A (zh) 多模态社交数据情感分类方法、系统、终端、设备及应用
CN116912642A (zh) 基于双模多粒度交互的多模态情感分析方法、设备及介质
Wang et al. Contrastive Predictive Coding of Audio with an Adversary.
CN117634459A (zh) 目标内容生成及模型训练方法、装置、系统、设备及介质
CN114970467A (zh) 基于人工智能的作文初稿生成方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20856644; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022513610; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020856644; Country of ref document: EP; Effective date: 20220328)