WO2022161298A1 - Information generation method, apparatus, device, storage medium and program product - Google Patents

Information generation method, apparatus, device, storage medium and program product

Info

Publication number
WO2022161298A1
Authority
WO
WIPO (PCT)
Prior art keywords
time step
attention
vector
visual
target image
Application number
PCT/CN2022/073372
Other languages
English (en)
French (fr)
Inventor
高俊
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to JP2023523236A (published as JP2023545543A, ja)
Publication of WO2022161298A1 (zh)
Priority to US18/071,481 (published as US20230103340A1, en)

Classifications

    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 40/30: Semantic analysis
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/464: Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present application relates to the technical field of image processing, and in particular, to an information generation method, apparatus, device, storage medium and program product.
  • In the related art, the computer device acquires the visual features of the image through an encoder and then uses a recurrent neural network to generate an overall description of the image.
  • Embodiments of the present application provide an information generation method, apparatus, device, storage medium, and program product.
  • the technical solution is as follows:
  • In one aspect, a method for generating information is provided, the method comprising: acquiring a target image; extracting a semantic feature set of the target image, and extracting a visual feature set of the target image; performing attention fusion on the semantic features of the target image and the visual features of the target image at n time steps to obtain descriptors at the n time steps, wherein the input of the attention fusion process at the t-th time step includes the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion process at the (t-1)-th time step; the output result at the (t-1)-th time step is used to indicate the descriptor at the (t-1)-th time step; the t-th time step is any one of the n time steps; 1 ≤ t ≤ n, and t and n are positive integers; and generating image description information of the target image based on the descriptors of the target image at the n time steps.
  • an apparatus for generating information comprising:
  • the image acquisition module is used to acquire the target image
  • a feature extraction module for extracting the semantic feature set of the target image, and extracting the visual feature set of the target image
  • a descriptor acquisition module, configured to perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps to obtain descriptors at the n time steps; wherein the input of the attention fusion process at the t-th time step includes the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion process at the (t-1)-th time step; the semantic attention vector at the t-th time step is obtained by performing attention mechanism processing on the semantic feature set at the t-th time step; the visual attention vector at the t-th time step is obtained by performing attention mechanism processing on the visual feature set at the t-th time step; the output result of the attention fusion process at the (t-1)-th time step is used to indicate the descriptor at the (t-1)-th time step; the t-th time step is any one of the n time steps; 1 ≤ t ≤ n, and both t and n are positive integers;
  • An information generation module configured to generate image description information of the target image based on the descriptors of the target image at the n time steps.
  • In another aspect, a computer device is provided, including a processor and a memory; the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the above information generation method.
  • a computer-readable storage medium is provided, and at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the above-mentioned information generation method.
  • a computer program product includes at least one computer program, and the computer program is loaded and executed by a processor to implement the information generation methods provided in the above-mentioned various optional implementation manners.
  • Through the above technical solution, attention fusion of semantic features and visual features is realized at n time steps; the descriptor of the target image at the current time step is generated under the combined effect of the visual features, the semantic features and the output result at the previous time step, and the image description information corresponding to the target image is then generated, so that in the process of generating the image description information, the advantage of visual features in generating visual vocabulary and the advantage of semantic features in generating non-visual vocabulary complement each other, which improves the accuracy of generating image description information.
  • FIG. 1 shows a schematic diagram of a system used by an information generation method provided by an exemplary embodiment of the present application
  • FIG. 2 shows a flowchart of an information generation method provided by an exemplary embodiment of the present application
  • FIG. 3 shows a schematic diagram of extracting word information in an image based on different attentions according to an exemplary embodiment of the present application
  • FIG. 4 shows a schematic diagram of selecting a corresponding target image in a video scene according to an exemplary embodiment of the present application
  • FIG. 5 is a frame diagram of a model training stage and an information generation stage according to an exemplary embodiment
  • FIG. 6 shows a flowchart of a training method for an information generation model provided by an exemplary embodiment of the present application
  • FIG. 7 shows a flowchart of a model training and information generation method provided by an exemplary embodiment of the present application
  • FIG. 8 shows a schematic diagram of a process of generating image description information according to an exemplary embodiment of the present application
  • FIG. 9 shows a schematic diagram of the input and output of the attention fusion network shown in an exemplary embodiment of the present application.
  • FIG. 10 shows a frame diagram illustrating an information generating apparatus provided by an exemplary embodiment of the present application.
  • FIG. 11 shows a structural block diagram of a computer device shown in an exemplary embodiment of the present application.
  • FIG. 12 shows a structural block diagram of a computer device according to an exemplary embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a system used by an information generation method provided by an exemplary embodiment of the present application.
  • the system includes a server 110 and a terminal 120 .
  • the above-mentioned server 110 may be an independent physical server, or may be a server cluster or a distributed system composed of multiple physical servers.
  • The above-mentioned terminal 120 may be a terminal device with a network connection function, an image display function and/or a video playback function; further, the terminal may be a terminal with a function of generating image description information. For example, the terminal 120 may be a smartphone, a tablet computer, an e-book reader, smart glasses, a smart watch, a smart TV, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, or the like.
  • the above system includes one or more servers 110 and multiple terminals 120 .
  • This embodiment of the present application does not limit the number of servers 110 and terminals 120 .
  • the terminal and the server can be connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • The computer device can obtain the target image; extract the semantic feature set and the visual feature set of the target image; and perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps to obtain the descriptors at the n time steps. The input of the attention fusion process at the t-th time step includes the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion process at the (t-1)-th time step; the semantic attention vector at the t-th time step is obtained by performing attention mechanism processing on the semantic feature set at the t-th time step; the visual attention vector at the t-th time step is obtained by performing attention mechanism processing on the visual feature set at the t-th time step; the output result of the attention fusion process at the (t-1)-th time step is used to indicate the descriptor at the (t-1)-th time step; the t-th time step is any one of the n time steps; 1 ≤ t ≤ n, and both t and n are positive integers. Image description information of the target image is generated based on the descriptors of the target image at the n time steps.
  • The computer device can perform attention fusion on the visual features and the semantic features of the target image at each time step in the process of generating the image description information, so that the advantage of visual features in generating visual vocabulary and the advantage of semantic features in generating non-visual vocabulary complement each other, thereby improving the accuracy of generating image description information.
  • FIG. 2 shows a flowchart of an information generation method provided by an exemplary embodiment of the present application. The method may be executed by a computer device, and the computer device may be implemented as a terminal or a server, where the terminal or the server may be the terminal or the server shown in FIG. 1. As shown in FIG. 2, the information generation method may include the following steps:
  • Step 210 acquiring a target image.
  • The target image may be a locally stored image; alternatively, the target image may be an image acquired in real time based on a specified operation of the target object. For example, the target image may be an image acquired in real time based on a screen capture operation of the target object; alternatively, the target image may be the image on the terminal screen collected in real time by the computer device when the target object triggers the generation of image description information by long-pressing a designated area on the screen; alternatively, the target image may be an image acquired in real time by the image acquisition component of the terminal.
  • the present application does not limit the acquisition method of the target image.
  • Step 220 extracting the semantic feature set of the target image, and extracting the visual feature set of the target image.
  • the semantic feature set of the target image is used to indicate the set of word vectors corresponding to the candidate descriptors describing the image information of the target image.
  • the visual feature set of the target image is used to indicate a set of image features obtained based on features such as RGB (red, green and blue) distribution of pixels of the target image.
  • Step 230 through the attention fusion network in the information generation model, perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps to obtain descriptors at n time steps.
  • The input of the attention fusion network at the t-th time step includes the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion network at the (t-1)-th time step. The semantic attention vector at the t-th time step is obtained by performing attention mechanism processing on the semantic feature set at the t-th time step; the visual attention vector at the t-th time step is obtained by performing attention mechanism processing on the visual feature set at the t-th time step; the output result of the attention fusion network at the (t-1)-th time step is used to indicate the descriptor at the (t-1)-th time step; the t-th time step is any one of the n time steps; 1 ≤ t ≤ n, and both t and n are positive integers.
  • the number of time steps n represents the number of time steps required to generate the image description information of the target image.
  • The attention mechanism (Attention Mechanism) is a mechanism in which a set of weight coefficients is learned autonomously by the network and, in a "dynamic weighting" manner, the regions of interest of the target object are emphasized while irrelevant background regions are suppressed. Attention mechanisms can be roughly divided into two categories: hard (strong) attention and soft attention.
  • An RNN (Recurrent Neural Network) with an attention mechanism, when processing the target image, focuses on the state preceding the current state and processes only part of the pixels of the target image instead of all of its pixels, which can reduce the processing complexity of the task.
  • When generating image description information, after generating a word, the computer device generates the next word based on the generated word; the time required to generate one word is called a time step (Time Step).
  • the number n of time steps may be a non-fixed value greater than 1; in response to the generated descriptor being a word or character used to indicate the end of the descriptor generation process, the computer device ends the descriptor generation process.
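  • As an illustration of this per-time-step loop (a sketch only; the function and token names such as fusion_step and eos_id are assumptions and do not appear in the disclosure), greedy generation with an end-of-sentence stop condition could look as follows:

```python
# Illustrative sketch of per-time-step descriptor generation with an EOS stop.
# `fusion_step` stands for one step of the attention fusion network described
# later in this document; all names here are hypothetical.
def generate_descriptors(fusion_step, semantic_features, visual_features,
                         h0, y_bos, eos_id, max_steps=20):
    h, y = h0, y_bos                       # hidden state and previous output (BOS)
    descriptors = []
    for _ in range(max_steps):             # n is not fixed in advance
        y, h = fusion_step(semantic_features, visual_features, y, h)
        word_id = int(y.argmax())          # greedy choice of the next descriptor
        if word_id == eos_id:              # stop once the end-of-sentence token appears
            break
        descriptors.append(word_id)
    return descriptors
```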
  • the information generation model in the embodiment of the present application is used to generate image description information of an image; the information generation model is generated by training a sample image and the image description information corresponding to the sample image; wherein, the image description information of the sample image may be text information.
  • the semantic attention vector can use multiple attributes to simultaneously strengthen the generation of visual descriptors and non-visual descriptors;
  • Visual descriptors refer to descriptors that can be directly extracted from the pixel information of the image, for example, descriptors in the image description information whose part of speech is a noun; non-visual descriptors refer to descriptors that have a low probability of being extracted from the pixel information of the image, or that cannot be directly extracted, for example, descriptors in the image description information whose part of speech is a verb or a preposition.
  • FIG. 3 shows a schematic diagram of extracting word information in an image based on different attentions according to an exemplary embodiment of the present application. As shown in FIG. 3, visual attention and semantic attention are combined, so that while the computer device is able to guide the generation of visual words and non-visual words more accurately, the interference of visual attention in generating non-visual words is reduced, and the generated image description is more complete and enriched.
  • Step 240 Generate image description information of the target image based on the descriptors of the target image at n time steps.
  • the descriptors on the n time steps are sorted in a specified order, such as sequential sorting, to generate image description information of the target image.
  • To sum up, in the information generation method provided by this embodiment of the present application, the semantic feature set and the visual feature set of the target image are extracted separately, and attention fusion of the semantic features and the visual features is realized through the attention fusion network in the information generation model, so that at each time step of generating the image description information, the computer device can generate the descriptor of the target image at the current time step based on the visual features and the semantic features of the target image combined with the output result at the previous time step, and then generate the image description information of the target image; in the process of generating the image description information, the advantage of visual features in generating visual vocabulary and the advantage of semantic features in generating non-visual vocabulary complement each other, thereby improving the accuracy of generating image description information.
  • The visual function of visually impaired persons cannot achieve normal vision due to reduced visual acuity or an impaired visual field, which affects their acquisition of visual information. When a visually impaired person uses a mobile phone to view pictures, text or videos, since the complete visual information cannot be obtained visually, the information in the image needs to be obtained through hearing; in this case, the image description information corresponding to the relevant area can be generated by the information generation method in this embodiment of the present application, and the image description information can be converted from text information into audio information for playback, thereby assisting visually impaired persons in obtaining complete image information.
  • FIG. 4 shows a schematic diagram of selecting a corresponding target image in a video scene shown in an exemplary embodiment of the present application.
  • Illustratively, the target image may be an image that the computer device acquires from the video being played based on a received specified operation on the playback image; alternatively, the target image may be a dynamic image displayed in a live broadcast preview interface, where the dynamic image is used to assist the target object in deciding whether to enter the live broadcast room for viewing by previewing the real-time content of the live broadcast room. As shown in FIG. 4, the target object can click (a specified operation) a certain area of the video image or dynamic image to determine the current image of that area (the image when the click operation is received) as the target image. Illustratively, the area selected based on the specified operation can be highlighted; as shown in FIG. 4, the area 410 is displayed in bold.
  • The information generation method shown in this application can also be used to describe the image information of images touched by children, so as to transmit information to the children through both vision and hearing, stimulate the children's interest in learning, and improve the effect of information transmission.
  • FIG. 5 is a frame diagram of a model training stage and an information generation stage according to an exemplary embodiment. As shown in FIG. 5, in the model training stage, the model training device 510 trains on preset training samples (including sample images and the image description information corresponding to the sample images; illustratively, the image description information may be sequentially arranged descriptors) to obtain a visual-semantic double attention (Visual-Semantic Double Attention, VSDA) model, that is, an information generation model;
  • the visual-semantic dual attention model includes semantic attention network, visual attention network and attention fusion network.
  • the information generation device 520 processes the input target image based on the visual-semantic dual attention model to obtain image description information corresponding to the target image.
  • The above-mentioned model training device 510 and information generating device 520 may be computer devices; for example, the computer devices may be fixed computer devices such as personal computers and servers, or may be mobile computer devices such as tablet computers and e-book readers.
  • the model training device 510 and the information generating device 520 may be the same device, or the model training device 510 and the information generating device 520 may also be different devices.
  • the model training device 510 and the information generating device 520 may be the same type of device, for example, the model training device 510 and the information generating device 520 may both be servers; or , the model training device 510 and the information generating device 520 may also be different types of devices, for example, the information generating device 520 may be a personal computer or a terminal, and the model training device 510 may be a server or the like.
  • the embodiments of the present application do not limit the specific types of the model training device 510 and the information generating device 520 .
  • Step 610 Obtain a sample image set, where the sample image set includes at least two image samples and image description information corresponding to the at least two image samples respectively.
  • Step 620 Perform training based on the sample image set to obtain an information generation model.
  • the information generation model can be a visual-semantic dual attention model, including a semantic attention network, a visual attention network and an attention fusion network; the semantic attention network is used to obtain a semantic attention vector based on a semantic feature set of an image, and the visual attention The attention network is used to obtain the visual attention vector based on the visual feature set of the image; the attention fusion network is used to fuse the semantic features and visual features of the image to obtain the descriptors that constitute the image description information corresponding to the image.
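  • As a hedged illustration of this three-network structure (class and attribute names are hypothetical; the scoring layers and the LSTM cell are one possible realization, not the patented implementation), a PyTorch-style skeleton could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSDAModel(nn.Module):
    """Sketch of a visual-semantic double attention captioning model."""

    def __init__(self, vocab_size, sem_dim, vis_dim, hidden_dim, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # semantic attention network: scores attribute-word vectors against h_{t-1}
        self.sem_score = nn.Linear(sem_dim + hidden_dim, 1)
        # visual attention network: scores sub-region features against h_{t-1}
        self.vis_score = nn.Linear(vis_dim + hidden_dim, 1)
        # attention fusion network (realized here as an LSTM cell)
        self.fusion = nn.LSTMCell(embed_dim + sem_dim + vis_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def attend(self, score_layer, features, h_prev):
        # features: (num_items, dim); h_prev: (hidden_dim,)
        h_rep = h_prev.unsqueeze(0).expand(features.size(0), -1)
        weights = F.softmax(score_layer(torch.cat([features, h_rep], dim=1)), dim=0)
        return (weights * features).sum(dim=0)          # weighted sum -> attention vector

    def step(self, sem_feats, vis_feats, prev_word, h_prev, c_prev):
        a_t = self.attend(self.sem_score, sem_feats, h_prev)   # semantic attention vector
        v_t = self.attend(self.vis_score, vis_feats, h_prev)   # visual attention vector
        x_t = torch.cat([self.embed(prev_word), a_t, v_t], dim=0)
        h_t, c_t = self.fusion(x_t.unsqueeze(0),
                               (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        logits = self.classifier(h_t.squeeze(0))                # scores over the vocabulary
        return logits, h_t.squeeze(0), c_t.squeeze(0)
```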
  • the training method of the information generation model obtains the information generation model including the semantic attention network, the visual attention network and the attention fusion network based on the training of the sample image set;
  • The above information generation model can be used to generate the descriptor of the target image at the current time step under the combined effect of the visual features and semantic features of the target image and the output result at the previous time step, and then generate the image description information corresponding to the target image, which makes it possible, in the process of generating the image description information, for the advantage of visual features in generating visual vocabulary and the advantage of semantic features in generating non-visual vocabulary to complement each other, thereby improving the accuracy of generating image description information.
  • The model training process may be performed by the server, and the image description information generation process may be performed by the server or the terminal; when the image description information generation process is performed by the terminal, the server sends the visual-semantic dual attention model to the terminal, so that the terminal can process the acquired target image based on the visual-semantic dual attention model to obtain the image description information of the target image.
  • In the following embodiments, the case where the model training process and the generation process of the image description information are both performed by the server is taken as an example for description.
  • FIG. 7 shows a flowchart of a model training and information generation method provided by an exemplary embodiment of the present application. The method can be executed by a computer device. As shown in FIG. 7 , the model training and information generation method can include the following steps:
  • Step 701 Obtain a sample image set, where the sample image set includes at least two image samples and image description information corresponding to the at least two image samples respectively.
  • Step 702 Perform training based on the sample image set to obtain an information generation model.
  • The information generation model is a visual-semantic dual attention model, including a semantic attention network, a visual attention network and an attention fusion network; the semantic attention network is used to obtain the semantic attention vector based on the semantic feature set of the target image; the visual attention network is used to obtain the visual attention vector based on the visual feature set of the target image; the attention fusion network is used to fuse the semantic features and the visual features of the target image to obtain the descriptors that constitute the image description information corresponding to the target image.
  • In a possible implementation, the information generation model further includes a semantic convolutional neural network and a visual convolutional neural network; the semantic convolutional neural network is used to process the target image to obtain the semantic feature vector of the target image, so as to obtain the candidate descriptor set corresponding to the target image; the visual convolutional neural network is used to process the target image to obtain the visual feature set corresponding to the target image.
  • In a possible implementation, the process of training the information generation model is implemented as follows: each sample image in the sample image set is input into the information generation model to obtain the predicted image description information corresponding to each sample image; and the parameters of the information generation model are updated based on the difference between the predicted image description information and the image description information corresponding to each sample image.
  • Since the output result of the information generation model for the sample image (that is, the predicted image description information) needs to be similar to the image description information corresponding to the sample image so that accurate image description information of the target image can be generated when the information generation model is applied, multiple rounds of training need to be performed during the training process to update the parameters of each network in the information generation model until the information generation model converges.
  • Let θ denote all the parameters involved in the information generation model, and let the target sequence (Ground Truth Sequence) {w_1, w_2, ..., w_T} be given, that is, the sequence of descriptors in the image description information of the sample image; the loss function used is the minimization of the cross-entropy (Cross Entropy) loss over this sequence.
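  • The loss formula is not reproduced in this extract; the standard sequence cross-entropy objective consistent with these definitions (an assumption, not a quotation from the disclosure) is:

```latex
% Hedged reconstruction: theta are the model parameters, w_1..w_T the ground-truth descriptors.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\left(w_t \mid w_1, \dots, w_{t-1};\, \theta\right)
```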
  • Step 703 acquiring a target image.
  • the target image may be an image obtained by the terminal and then sent to the server for obtaining the image description information, and correspondingly, the server receives the target image.
  • Step 704 acquiring the semantic feature vector of the target image.
  • the target image is input into the semantic convolutional neural network, and the semantic feature vector of the target image output by the semantic convolutional neural network is obtained.
  • The semantic convolutional neural network may be a fully convolutional network (Fully Convolutional Network, FCN), or may also be a convolutional neural network (Convolutional Neural Network, CNN). A CNN is a feedforward neural network with a one-way multilayer structure: there is no interconnection between neurons in the same layer, and information is transmitted between layers in only one direction; except for the input layer and the output layer, all intermediate layers are hidden layers, and there may be one or more hidden layers. A CNN can extract features from an image layer by layer, starting from the underlying pixel features; the CNN is the most commonly used implementation model for the encoder, which is responsible for encoding an image into a vector.
  • Through the semantic convolutional neural network, the computer device can obtain a rough image representation vector of the target image, that is, the semantic feature vector of the target image.
  • Step 705 based on the semantic feature vector, extract the semantic feature set of the target image.
  • Based on the acquired semantic feature vector used to indicate the attributes of the target image, the computer device can first screen the attribute words in the vocabulary database to obtain an attribute word set composed of the attribute words that may correspond to the target image, that is, a set of candidate descriptors, and then extract the semantic features of the attribute words in the candidate descriptor set to obtain the semantic feature set of the target image.
  • the computer device can extract the attribute word set corresponding to the target image from the vocabulary database based on the semantic feature vector; the attribute word set refers to the set of candidate descriptors describing the target image;
  • the word vector set corresponding to the attribute word set is obtained as the semantic feature set of the target image.
  • the word vector set includes word vectors corresponding to each candidate descriptor in the attribute word set.
  • the candidate descriptors in the attribute word set are attribute words corresponding to the context of the target image; the present application does not limit the number of candidate descriptors in the attribute word set.
  • the candidate descriptors may include different forms of the same word, such as: play, playing, plays and so on.
  • the matching probability of each vocabulary can be obtained, and candidate descriptors are selected from the vocabulary database based on the matching probability of each vocabulary to form a set of attribute words.
  • In a possible implementation, the process can be implemented as follows: based on the semantic feature vector, the matching probability of each vocabulary in the vocabulary database is obtained, where the matching probability refers to the probability that the vocabulary in the vocabulary database matches the target image; vocabularies whose matching probability is greater than a matching probability threshold are taken as candidate descriptors to form the attribute word set.
  • Illustratively, the probability that each attribute word appears in the image can be calculated by the noisy-OR method; in order to improve the accuracy of the acquired attribute words, the probability threshold can be set to 0.5. It should be noted that the setting of the probability threshold can be adjusted according to the actual situation, which is not limited in this application.
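  • The noisy-OR formula is not reproduced in this extract; its usual form in combination with multiple instance learning (an assumption, not a quotation from the disclosure) aggregates region-level probabilities into an image-level probability as:

```latex
% Hedged sketch: a_i is a candidate attribute word, I the target image, and r_b
% ranges over the sub-regions (instances) of the image.
p(a_i \mid I) = 1 - \prod_{b}\big(1 - p(a_i \mid r_b)\big)
```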
  • In a possible implementation, a vocabulary detector can be pre-trained, and the vocabulary detector is used to obtain attribute words from the vocabulary database based on the feature vector of the target image; therefore, the computer device can obtain the attribute words with the help of the trained vocabulary detector, namely: the semantic feature vector is input into the vocabulary detector, and the attribute word set extracted by the vocabulary detector from the vocabulary database based on the semantic feature vector is obtained. The vocabulary detector is a vocabulary detection model obtained by training with a weakly supervised multiple instance learning (Multiple Instance Learning, MIL) method.
  • Step 706 extracting the visual feature set of the target image.
  • the computer device may input the target image into the visual convolutional neural network, and obtain the visual feature set of the target image output by the visual convolutional neural network.
  • In a possible implementation, before extracting the visual feature set, the computer device may preprocess the target image, and the preprocessing process may include the following step: performing sub-region division on the target image to obtain at least one sub-region. Correspondingly, the process of extracting the visual feature set of the target image can be implemented as follows: the visual features of the at least one sub-region are respectively extracted to form the visual feature set.
  • Illustratively, the computer device may divide the target image at equal intervals to obtain the at least one sub-region; the division interval may be set by the computer device based on the image size of the target image, and the division intervals corresponding to different image sizes are different; the number of sub-regions and the size of the division interval are not limited in this application.
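  • As a hedged sketch of equally spaced sub-region division followed by per-region feature extraction (the backbone, grid size, input resolution and normalization constants are assumptions, not the patented implementation):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def extract_region_features(image_path, grid=3):
    """Split the image into a grid x grid set of sub-regions and extract one CNN feature per region."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone = torch.nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
    backbone.eval()
    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    features = []
    with torch.no_grad():
        for i in range(grid):
            for j in range(grid):
                box = (j * w // grid, i * h // grid,
                       (j + 1) * w // grid, (i + 1) * h // grid)
                region = preprocess(image.crop(box)).unsqueeze(0)
                features.append(backbone(region).flatten())    # one vector per sub-region
    return torch.stack(features)     # shape: (grid * grid, feature_dim)
```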
  • the process of extracting the semantic feature set of the target object and the process of extracting the visual feature set of the target object may be performed synchronously, that is, steps 704 to 705 and step 706 may be performed synchronously.
  • Step 707 perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps through the attention fusion network in the information generation model to obtain descriptors at n time steps.
  • the process of obtaining the descriptor on the t th time step can be implemented as:
  • The semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, the hidden layer vector at the (t-1)-th time step, and the output result of the attention fusion network at the (t-1)-th time step are input into the attention fusion network, and the output result of the attention fusion network at the t-th time step and the hidden layer vector at the t-th time step are obtained; alternatively, the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion network at the (t-1)-th time step are input into the attention fusion network to obtain the output result of the attention fusion network at the t-th time step and the hidden layer vector at the t-th time step.
  • That is to say, in one possible implementation, the semantic attention vector and the visual attention vector can be applied to the output result at the previous time step to obtain the output result at the current time step; or, in another possible implementation, the semantic attention vector, the visual attention vector and the hidden layer vector at the previous time step can be applied to the output result at the previous time step to obtain the output result at the current time step. The output result at the current time step is the word vector of the descriptor at the current time step.
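  • In compact notation (a paraphrase of the two implementations above, not notation taken from the disclosure), one step of the attention fusion can be written as:

```latex
% A_t / V_t: semantic / visual attention vectors; y_{t-1}: previous output result;
% h_{t-1}: previous hidden layer vector; f_fuse: the attention fusion network.
(y_t,\; h_t) = f_{\mathrm{fuse}}\big(A_t,\; V_t,\; y_{t-1},\; h_{t-1}\big)
```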
  • the attention vector includes the semantic attention vector and the visual attention vector.
  • When obtaining the semantic attention vector at the t-th time step: at the t-th time step, the semantic attention vector at the t-th time step is generated based on the hidden layer vector at the (t-1)-th time step and the semantic feature set of the target image.
  • The hidden layer vector indicates the intermediate content generated when a descriptor is generated; the hidden layer vector contains historical information or context information used to guide the generation of the next descriptor, so that the next descriptor generated at the next time step better conforms to the current context.
  • The t-th time step represents any one of the n time steps, where n represents the number of time steps required to generate the image description information; 1 ≤ t ≤ n, and both t and n are positive integers.
  • the information generation model can generate the semantic attention vector at the current time step based on the hidden layer vector at the previous time step and the semantic feature set of the target image.
  • In a possible implementation, the hidden layer vector output at the (t-1)-th time step and the semantic feature set of the target image can be input into the semantic attention network in the information generation model to obtain the semantic attention vector at the t-th time step output by the semantic attention network.
  • the semantic attention network is used to obtain the weight of each semantic feature in the semantic feature set at the t-1 time step based on the hidden layer vector at the t-1 time step and the semantic feature set of the target image;
  • the information generation model can generate a semantic attention vector at the t-th time step based on the weight of each semantic feature in the semantic feature set at the t-1 th time step and the semantic feature set of the target image.
  • Illustratively, the semantic attention vector at each time step is a weighted sum over the attribute words.
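  • The corresponding formula is not reproduced in this extract; a standard attention form consistent with that description (weights derived from the previous hidden layer vector, then a weighted sum over the attribute-word vectors; the scoring function is an assumption) would be:

```latex
% Hedged sketch: a_i are the attribute-word vectors, h_{t-1} the previous hidden
% layer vector, alpha_{t,i} the weights, and A_t the semantic attention vector.
\alpha_{t,i} = \frac{\exp\!\big(\mathrm{score}(h_{t-1}, a_i)\big)}{\sum_{j}\exp\!\big(\mathrm{score}(h_{t-1}, a_j)\big)},
\qquad
A_t = \sum_{i} \alpha_{t,i}\, a_i
```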
  • the information generation model can generate the visual attention vector at the current time step based on the hidden layer vector output at the previous time step and the visual feature set of the target image.
  • In a possible implementation, the hidden layer vector output at the (t-1)-th time step and the visual feature set of the target image can be input into the visual attention network in the information generation model to obtain the visual attention vector at the t-th time step output by the visual attention network.
  • the visual attention model is used to obtain the weight of each visual feature in the visual feature set at the t-1 th time step based on the hidden layer vector and the visual feature set at the t-1 th time step;
  • the information generation model can generate a visual attention vector at the t-th time step based on the weight of each visual feature in the visual feature set at the t-1 th time step and the visual feature set.
  • Illustratively, the visual attention vector at each time step is a weighted sum of the visual features of the sub-regions.
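  • The corresponding formula is likewise not reproduced here; a standard form consistent with that description (the scoring function is an assumption and could, for example, be realized by the element-wise multiplication strategy mentioned below) would be:

```latex
% Hedged sketch: v_b are the sub-region visual features, beta_{t,b} their weights
% at time step t, and V_t the visual attention vector.
\beta_{t,b} = \frac{\exp\!\big(\mathrm{score}(h_{t-1}, v_b)\big)}{\sum_{k}\exp\!\big(\mathrm{score}(h_{t-1}, v_k)\big)},
\qquad
V_t = \sum_{b} \beta_{t,b}\, v_b
```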
  • When calculating the weights corresponding to the visual features of each sub-region, the information generation model can use an element-wise multiplication strategy (Element-Wise Multiplication Strategy) to obtain better performance.
  • Since the attention model can capture more detailed image features of the sub-regions, when generating the description words for different objects, the soft attention mechanism can adaptively focus on the corresponding regions, which gives better performance.
  • The visual attention network and the semantic attention network calculate the weights of the corresponding feature vectors at each time step. Since the hidden layer vectors at different time steps are different, the weights of the feature vectors obtained at each time step are also different; therefore, at each time step, the information generation model can focus on the image regions and the attribute words that are more in line with the context at that time step when generating the image description.
  • the attention fusion network in the information generation model may be implemented as a sequence network, and the sequence network may include LSTM (Long Short Term Memory, long short-term memory network), Transformer network, and the like.
  • An LSTM is a recurrent neural network suitable for processing and predicting important events with relatively long intervals or delays in a time series; it is a special kind of RNN.
  • In this embodiment of the present application, the visual attention vector V and the semantic attention vector A are used as additional input parameters of the LSTM network, and these two attention features are merged into the cell nodes of the LSTM network to guide the generation of the image description information, guiding the information generation model to pay attention to the visual features and the semantic features of the image at the same time, so that the two feature vectors complement each other.
  • Illustratively, the BOS and EOS notations can be used to represent the beginning and the end of a sentence, respectively. Based on this, the LSTM network generates the descriptor based on the visual attention vector and the semantic attention vector through its gate computations, where σ denotes the sigmoid function, φ denotes the maxout nonlinear activation function with two units, i_t denotes the input gate, f_t denotes the forget gate, and o_t denotes the output gate; at each time step, the LSTM uses a softmax function to output the probability distribution of the next word.
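  • The gate formulas themselves are not reproduced in this extract. A hedged reconstruction consistent with the symbols defined above (a standard LSTM cell that additionally receives V_t and A_t as inputs; the exact placement of the maxout unit φ is an assumption) is:

```latex
% Hedged reconstruction, not quoted from the patent. x_t = [y_{t-1}; V_t; A_t]
% denotes the fused input at time step t, and \odot is element-wise multiplication.
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),
\\
c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_c x_t + U_c h_{t-1} + b_c), \qquad
h_t = o_t \odot \tanh(c_t), \qquad
p(w_t \mid w_{1:t-1}) = \mathrm{softmax}(W_p h_t + b_p)
```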
  • hyperparameters are set in the attention fusion network in the information generation model, and the hyperparameters are used to indicate the respective weights of the visual attention vector and the semantic attention vector in the attention fusion network.
  • the visual attention vector V will guide the model to pay attention to the relevant areas of the image
  • The semantic attention vector A will strengthen the generation of the most relevant attribute words; since these two attention vectors are complementary to each other, a hyperparameter can be set in the attention fusion network to determine the best combination between the two attention vectors.
  • After the hyperparameter is introduced, the LSTM network generates descriptors based on the visual attention vector and the semantic attention vector weighted by the hyperparameter, where z represents the hyperparameter, its value range is [0.1, 0.9], and it is used to represent the different weights assigned to the two attention vectors.
  • the numerical settings of the hyperparameters can be set according to the performance effects of the model under different weight assignments, and the application does not limit the numerical values of the hyperparameters.
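  • As a hedged illustration of such a weighted fusion (which attention vector receives z versus 1 - z, the cell type and all names are assumptions rather than the patented formulas):

```python
import torch
import torch.nn as nn

class WeightedAttentionFusion(nn.Module):
    """Sketch: fuse the visual and semantic attention vectors with a fixed hyperparameter z."""

    def __init__(self, attn_dim, embed_dim, hidden_dim, vocab_size, z=0.5):
        super().__init__()
        assert 0.1 <= z <= 0.9                      # value range stated in the text
        self.z = z
        self.lstm = nn.LSTMCell(embed_dim + attn_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev_embed, v_t, a_t, state):
        # weight the two complementary attention vectors before feeding the LSTM cell
        fused_attn = self.z * v_t + (1.0 - self.z) * a_t
        h_t, c_t = self.lstm(torch.cat([y_prev_embed, fused_attn], dim=-1), state)
        logits = self.out(h_t)                      # scores over the vocabulary
        return logits, (h_t, c_t)
```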
  • Step 708 Generate image description information of the target image based on the descriptors of the target image at n time steps.
  • the image description information generated by the information generation model is description information in a first language, for example, the first language may be English, or Chinese, or other languages.
  • In a possible implementation, the computer device may convert the generated description information in the first language into description information in a specified language. For example, if the image description information generated by the information generation model is English description information and the specified language required by the target object is Chinese, then after the information generation model generates the English image description information, the computer device can translate the English image description information into Chinese image description information and output it.
  • That is to say, the language type of the output image description information, namely the type of the specified language, can be set by the relevant object according to actual requirements; the present application does not limit the language type of the image description information.
  • In a possible implementation, the computer device may, based on TTS (Text-To-Speech, speech synthesis) technology, convert the text-type image description information into voice-type image description information, and transmit the image description information to the target object in the form of voice playback. Illustratively, the above process can be implemented as follows: the server converts the obtained text-type image description information into voice-type image description information through TTS technology and sends the voice-type image description information to the terminal, so that the terminal plays the image description information according to the obtained voice-type image description information; alternatively, the server can also send the text-type image description information to the terminal, and the terminal converts the text-type image description information into voice-type image description information through TTS technology and then performs voice playback.
  • To sum up, in the model training and information generation method provided by this embodiment of the present application, the semantic feature set and the visual feature set of the target image are extracted separately, and attention fusion of the semantic features and the visual features is performed through the attention fusion network in the information generation model, so that at each time step of generating the image description information, the descriptor of the target image at the current time step is generated under the combined effect of the visual features, the semantic features and the output result at the previous time step, and the image description information corresponding to the target image is then generated; in the process of generating the image description information, the advantage of visual features in generating visual vocabulary and the advantage of semantic features in generating non-visual vocabulary complement each other, thereby improving the accuracy of generating image description information. In addition, the vocabulary in the vocabulary database is screened based on the feature vector of the image, the attribute words related to the image are obtained as candidate descriptors, and the weights are calculated based on the candidate descriptors, thereby reducing the data processing volume of the semantic attention network and reducing the data processing pressure of the information generation model while ensuring the processing accuracy.
  • FIG. 8 shows a schematic diagram of the process of generating image description information according to an exemplary embodiment of the present application. As shown in FIG. 8, the computer device inputs the target image 810 into the information generation model 820; the information generation model 820 inputs the target image 810 into the semantic convolutional neural network 821 to obtain the semantic feature vector of the target image; after that, the vocabulary detector 822 screens the attribute words in the vocabulary database based on the semantic feature vector of the target image to obtain the semantic feature set of the target image; the information generation model 820 also inputs the target image 810 into the visual convolutional neural network 824 to obtain the visual feature set 825 corresponding to the target image. The semantic feature set is input into the semantic attention network 826, so that the semantic attention network 826 obtains the semantic attention vector A_t at the current time step according to the input hidden layer vector output at the previous time step, where t represents the current time step; when t = 1, the hidden layer vector output at the previous time step is a preset hidden layer vector. Correspondingly, the visual feature set is input into the visual attention network 827, so that the visual attention network 827 obtains the visual attention vector V_t at the current time step according to the input hidden layer vector output at the previous time step. The visual attention vector V_t, the semantic attention vector A_t, the hidden layer vector output at the previous time step and the output result at the previous time step are then input into the attention fusion network to obtain the descriptor at the current time step.
  • As shown in FIG. 9, the input of the attention fusion network 910 includes the hidden layer vector h_{t-1} at the (t-1)-th time step, the visual attention vector V_t at the t-th time step generated based on h_{t-1}, the semantic attention vector A_t generated based on h_{t-1}, and the representation vector of the descriptor output at the (t-1)-th time step (that is, the output vector y_{t-1} at the (t-1)-th time step); the output of the attention fusion network 910 includes the output vector y_t at the t-th time step and the hidden layer vector h_t at the t-th time step (used to generate the next descriptor).
  • The visual attention vector is calculated by the visual attention network 930 as a weighted sum of the visual features corresponding to the sub-regions, and the semantic attention vector is calculated by the semantic attention network as a weighted sum of the word vectors corresponding to the attribute words.
  • FIG. 10 shows a frame diagram of an information generating apparatus provided by an exemplary embodiment of the present application. As shown in FIG. 10 , the apparatus includes:
  • an image acquisition module 1010 configured to acquire a target image
  • a feature extraction module 1020 configured to extract the semantic feature set of the target image, and extract the visual feature set of the target image
  • a descriptor acquisition module 1030, configured to perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps to obtain descriptors at the n time steps; wherein the input of the attention fusion process at the t-th time step includes the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion process at the (t-1)-th time step; the semantic attention vector at the t-th time step is obtained by performing attention mechanism processing on the semantic feature set at the t-th time step; the visual attention vector at the t-th time step is obtained by performing attention mechanism processing on the visual feature set at the t-th time step; the output result at the (t-1)-th time step is used to indicate the descriptor at the (t-1)-th time step; the t-th time step is any one of the n time steps; 1 ≤ t ≤ n, and both t and n are positive integers;
  • the information generation module 1040 is configured to generate image description information of the target image based on the descriptors of the target image at the n time steps.
  • In a possible implementation, the descriptor acquisition module 1030 is configured to perform attention fusion on the semantic features of the target image and the visual features of the target image at n time steps through the attention fusion network in the information generation model, to obtain the descriptors at the n time steps.
  • the descriptor obtaining module 1030 is configured to:
  • input the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, the hidden layer vector at the (t-1)-th time step and the output result of the attention fusion network at the (t-1)-th time step into the attention fusion network to obtain the output result of the attention fusion network at the t-th time step and the hidden layer vector at the t-th time step; or, input the semantic attention vector at the t-th time step, the visual attention vector at the t-th time step, and the output result of the attention fusion network at the (t-1)-th time step into the attention fusion network to obtain the output result of the attention fusion network at the t-th time step and the hidden layer vector at the t-th time step.
  • In a possible implementation, a hyperparameter is set in the attention fusion network, and the hyperparameter is used to indicate the respective weights of the visual attention vector and the semantic attention vector in the attention fusion network.
  • the apparatus further includes:
  • a first generation module, configured to generate, at the t-th time step, the semantic attention vector at the t-th time step based on the hidden layer vector at the (t-1)-th time step and the semantic feature set.
  • the first generation module includes:
  • a first acquisition sub-module, configured to acquire, based on the hidden layer vector at the (t-1)-th time step and the semantic feature set, the weight of each semantic feature in the semantic feature set at the (t-1)-th time step;
  • a first generation sub-module, configured to generate the semantic attention vector at the t-th time step based on the weight of each semantic feature in the semantic feature set at the (t-1)-th time step and the semantic feature set.
  • the apparatus further includes:
  • a second generation module, configured to generate, at the t-th time step, the visual attention vector at the t-th time step based on the hidden layer vector at the (t-1)-th time step and the visual feature set.
  • the second generation module includes:
  • a second acquisition sub-module, configured to acquire, based on the hidden layer vector at the (t-1)-th time step and the visual feature set, the weight of each visual feature in the visual feature set at the (t-1)-th time step;
  • a second generation sub-module, configured to generate the visual attention vector at the t-th time step based on the weight of each visual feature in the visual feature set at the (t-1)-th time step and the visual feature set.
  • the feature extraction module 1020 includes:
  • the third acquisition sub-module is used to acquire the semantic feature vector of the target image
  • An extraction sub-module configured to extract the semantic feature set of the target image based on the semantic feature vector.
  • the extraction submodule includes:
  • an attribute word extraction unit configured to extract a set of attribute words corresponding to the target image from the vocabulary library based on the semantic feature vector; the set of attribute words refers to a set of candidate descriptors for describing the target image;
  • the semantic feature extraction unit is configured to obtain the set of word vectors corresponding to the set of attribute words as the set of semantic features of the target image.
  • the attribute word extraction unit is configured to obtain, based on the semantic feature vector, the matching probability of each word in the vocabulary library, where the matching probability refers to the probability that a word in the vocabulary library matches the target image; and to extract, from the vocabulary library, words whose matching probability is greater than a matching probability threshold as the candidate descriptors, to form the attribute word set;
  • the attribute word extraction unit is configured to input the semantic feature vector into a vocabulary detector, and obtain the attribute word set extracted by the vocabulary detector from the vocabulary library based on the semantic feature vector;
  • the vocabulary detector is a vocabulary detection model obtained through training with a weakly supervised multiple-instance learning method.
  • before the feature extraction module 1020 extracts the visual feature set of the target image, the apparatus further includes:
  • a sub-region dividing module, configured to perform sub-region division on the target image to obtain at least one sub-region;
  • the feature extraction module 1020 is configured to extract the visual features of the at least one sub-region respectively to form the visual feature set.
  • In summary, the information generating apparatus extracts the semantic feature set and the visual feature set of the target image respectively, and uses the attention fusion network in the information generation model to perform attention fusion on the semantic features and the visual features, so that at each time step of generating the image description information, the descriptor of the target image at the current time step is generated based on the combined effect of the visual features and semantic features of the target image and the output result at the previous time step, and the image description information of the target image is then generated. In this way, during the generation of the image description information, the advantage of visual features in generating visual words and the advantage of semantic features in generating non-visual words complement each other, thereby improving the accuracy of the generated image description information.
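For orientation only, here is a minimal Python sketch of how the modules summarized above could be wired together at inference time. Every callable and token name in it (extract_semantic, fusion_step, start_token, and so on) is a placeholder standing in for the corresponding module of the apparatus, not an identifier from the patent; the stopping condition simply mirrors the end-descriptor rule described later in the specification.

```python
def generate_description(image, extract_semantic, extract_visual,
                         semantic_attend, visual_attend, fusion_step,
                         start_token, end_token, init_state, max_steps=20):
    """Illustrative wiring of the apparatus modules; all names are placeholders."""
    semantic_feats = extract_semantic(image)     # semantic feature set (attribute word vectors)
    visual_feats = extract_visual(image)         # visual feature set (sub-region features)
    words, state, prev_word = [], init_state, start_token
    for _ in range(max_steps):                   # the n time steps
        h_prev = state[0]                        # hidden layer vector of the previous time step
        a_t = semantic_attend(semantic_feats, h_prev)   # semantic attention vector A_t
        v_t = visual_attend(visual_feats, h_prev)       # visual attention vector V_t
        prev_word, state = fusion_step(prev_word, v_t, a_t, state)
        if prev_word == end_token:               # stop once the end descriptor is produced
            break
        words.append(prev_word)
    return " ".join(words)                       # order the descriptors to form the description
```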
  • FIG. 11 shows a structural block diagram of a computer device 1100 according to an exemplary embodiment of the present application.
  • the computer device can be implemented as the server in the above solution of the present application.
  • The computer device 1100 includes a Central Processing Unit (CPU) 1101, a system memory 1104 including a Random Access Memory (RAM) 1102 and a Read-Only Memory (ROM) 1103, and a system bus 1105 that connects the system memory 1104 and the central processing unit 1101.
  • the computer device 1100 also includes a mass storage device 1106 for storing an operating system 1109 , application programs 1110 and other program modules 1111 .
  • the computer-readable media can include computer storage media and communication media.
  • Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • The memory also includes at least one instruction, at least one program, a code set or an instruction set, which is stored in the memory; the central processing unit 1101 executes the at least one instruction, the at least one program, the code set or the instruction set to implement all or part of the steps of the information generation methods shown in the above embodiments.
  • FIG. 12 shows a structural block diagram of a computer device 1200 provided by an exemplary embodiment of the present application.
  • The computer device 1200 may be implemented as the terminal in the above solution of the present application, such as a smart phone, a tablet computer, a laptop computer or a desktop computer.
  • Computer device 1200 may also be called a terminal device, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • computer device 1200 includes: processor 1201 and memory 1202 .
  • Processor 1201 may include one or more processing cores.
  • Memory 1202 may include one or more computer-readable storage media, which may be non-transitory.
  • The non-transitory computer-readable storage medium in the memory 1202 is used to store at least one instruction, and the at least one instruction is executed by the processor 1201 to implement the information generation method provided by the method embodiments in this application.
  • the computer device 1200 may also optionally include: a peripheral device interface 1203 and at least one peripheral device.
  • the processor 1201, the memory 1202 and the peripheral device interface 1203 can be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1203 through a bus, a signal line or a circuit board.
  • the peripheral equipment includes: at least one of a radio frequency circuit 1204, a display screen 1205, a camera assembly 1206, an audio circuit 1207 and a power supply 1208.
  • computer device 1200 also includes one or more sensors 1209 .
  • the one or more sensors 1209 include, but are not limited to, an acceleration sensor 1210 , a gyro sensor 1211 , a pressure sensor 1212 , an optical sensor 1213 , and a proximity sensor 1214 .
  • The structure shown in FIG. 12 does not constitute a limitation on the computer device 1200; the computer device may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • A computer-readable storage medium is also provided, in which at least one computer program is stored, and the computer program is loaded and executed by a processor to implement all or part of the steps of the above information generation method.
  • The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • A computer program product is also provided, which includes at least one computer program, and the computer program is loaded by a processor and executed to perform all or part of the steps of the method shown in any of the above embodiments of FIG. 2, FIG. 6, or FIG. 7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

An information generation method, apparatus, device, storage medium, and program product, relating to the field of image processing technologies. The method includes: acquiring a target image (210); extracting a semantic feature set of the target image, and extracting a visual feature set of the target image (220); performing attention fusion on the semantic features of the target image and the visual features of the target image over n time steps to obtain descriptors at the n time steps (230); and generating image description information of the target image based on the descriptors of the target image at the n time steps (240). With the above method, during the generation of the image description information, the advantage of visual features in generating visual words and the advantage of semantic features in generating non-visual words complement each other, thereby improving the accuracy of the generated image description information.

Description

信息生成方法、装置、设备、存储介质及程序产品
本申请要求于2021年01月29日提交的申请号为202110126753.7、发明名称为“图像描述信息生成方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,特别涉及一种信息生成方法、装置、设备、存储介质及程序产品。
背景技术
随着图像识别技术的发展,通过算法可以实现计算机的“看图说话”功能;也就是说,计算机设备通过图像描述(Image Caption),可以将图像中的内容信息转化为图像描述信息。
在相关技术中,往往专注于基于提取获得的图像的视觉特征来生成图像的图像描述信息,即,计算机设备在通过编码器获取图像的视觉特征之后,使用一个循环神经网络生成图像的整体描述。
发明内容
本申请实施例提供了一种信息生成方法、装置、设备、存储介质及程序产品。该技术方案如下:
一方面,提供了一种信息生成方法,所述方法包括:
获取目标图像;
提取所述目标图像的语义特征集合,以及,提取所述目标图像的视觉特征集合;
在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词;所述注意力融合的过程在第t个时间步上的输入包括所述第t个时间步上的语义注意力向量、所述第t个时间步上的视觉注意力向量、以及所述注意力融合的过程在第t-1个时间步上的输出结果;所述第t个时间步上的所述语义注意力向量是在所述第t个时间步上对所述语义特征集合进行注意力机制处理获得的;所述第t个时间步上的所述视觉注意力向量是在所述第t个时间步上对所述视觉特征集合进行注意力机制处理获得的;所述注意力融合的过程在所述第t-1个时间步上的所述输出结果用于指示所述第t-1个时间步上的描述词;所述第t个时间步是所述n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;
基于所述目标图像在所述n个时间步上的描述词,生成所述目标图像的图像描述信息。
另一方面,提供了一种信息生成装置,所述装置包括:
图像获取模块,用于获取目标图像;
特征提取模块,用于提取所述目标图像的语义特征集合,以及,提取所述目标图像的视觉特征集合;
描述词获取模块,用于在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词;所述注意力融合的过程在第t个时间步上的输入包括所述第t个时间步上的语义注意力向量、所述第t个时间步上的视觉注意力向量、以及所述注意力融合的过程在第t-1个时间步上的输出结果;所述第t个时间步上的所述语义注意力向量是在所述第t个时间步上对所述语义特征集合进行注意力机制处理获得的;所述 第t个时间步上的所述视觉注意力向量是在所述第t个时间步上对所述视觉特征集合进行注意力机制处理获得的;所述注意力融合的过程在所述第t-1个时间步上的所述输出结果用于指示所述第t-1个时间步上的描述词;所述第t个时间步是所述n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;
信息生成模块,用于基于所述目标图像在所述n个时间步上的描述词,生成所述目标图像的图像描述信息。
另一方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行以实现上述信息生成方法。
另一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载并执行以实现上述信息生成方法。
另一方面,提供了一种计算机程序产品,该计算机程序产品包括至少一条计算机程序,该计算机程序由处理器加载并执行以实现上述各种可选实现方式中提供的信息生成方法。
本申请提供的技术方案可以包括以下有益效果:
通过分别提取目标图像的语义特征集合和视觉特征集合,在n个时间步上实现了对语义特征和视觉特征的注意力融合;使得计算机设备在生成图像描述信息的各个时间步上,基于目标图像的视觉特征和语义特征在上一个时间步上的输出结果的综合作用,生成当前时间步上目标图像的描述词,进而生成目标图像对应的图像描述信息;使得在图像描述信息的生成过程中,将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势进行互补,从而提高了生成图像描述信息的准确性。
附图说明
图1示出了本申请一示例性实施例提供的信息生成方法所使用的系统的示意图;
图2示出了本申请一示例性实施例提供的信息生成方法的流程图;
图3示出了本申请一示例性实施例示出的基于不同的注意力提取图像中单词信息的示意图;
图4示出了本申请一示例性实施例示出的视频场景下对应的目标图像选择示意图;
图5是根据一示例性实施例示出的一种模型训练阶段和信息生成阶段的框架图;
图6示出了本申请一示例性实施例提供的信息生成模型的训练方法的流程图;
图7示出了本申请一示例性实施例提供的模型训练以及信息生成方法的流程图;
图8示出了本申请一示例性实施例示出的图像描述信息的生成过程的示意图;
图9示出了本申请一示例性实施例示出的注意力融合网络的输入输出示意图;
图10示出了示出了本申请一示例性实施例提供的信息生成装置的框架图;
图11示出了本申请一示例性实施例示出的计算机设备的结构框图;
图12示出了本申请一示例性实施例示出的计算机设备的结构框图。
具体实施方式
图1示出了本申请一示例性实施例提供的信息生成方法所使用的系统的示意图,如图1所示,该系统包括:服务器110以及终端120。
其中,上述服务器110可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统。
上述终端120可以是具有网络连接功能以及图像展示功能和/或视频播放功能的终端设备;进一步的,该终端可以是具有生成图像描述信息的功能的终端,比如,终端120可以是智能手机、平板电脑、电子书阅读器、智能眼镜、智能手表、智能电视、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。
可选的,上述系统中包含一个或者多个服务器110,以及多个终端120。本申请实施例对于服务器110和终端120的个数不做限制。
终端以及服务器可以通过通信网络相连。可选的,通信网络是有线网络或无线网络。
在本申请实施例中,计算机设备可以通过获取目标图像;提取目标图像的语义特征集合以及视觉特征集合;在n个时间步上对目标图像的语义特征和目标图像的视觉特征进行注意力融合,获取n个时间步上的描述词;该注意力融合的过程在第t个时间步上的输入包括第t个时间步上的语义注意力向量、第t个时间步上的视觉注意力向量、以及注意力融合的过程在第t-1个时间步上的输出结果;第t个时间步上的语义注意力向量是在第t个时间步上对语义特征集合进行注意力机制处理获得的;第t个时间步上的视觉注意力向量是在第t个时间步上对视觉特征集合进行注意力机制处理获得的;注意力融合的过程在第t-1个时间步上的输出结果用于指示第t-1个时间步上的描述词;第t个时间步是n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;基于目标图像在n个时间步上的描述词,生成目标图像的图像描述信息。通过上述方法,计算机设备可以在图像描述信息的生成过程中的各个时间步上,通过对目标图像的视觉特征和语义特征进行注意力融合,实现将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势的互补,从而提高了生成图像描述信息的准确性。
可选的,计算机设备可以通过信息生成模型中的注意力融合网络实现对目标图像的语义特征和视觉特征的注意力融合,以获得各个时间步上的描述词;基于此,图2示出了本申请一示例性实施例提供的信息生成方法的流程图,该方法可以由计算机设备执行,该计算机设备可以实现为终端或服务器,该终端或服务器可以是图1所示的终端或服务器;如图2所示,该信息生成方法可以包括以下步骤:
步骤210,获取目标图像。
在一种可能的实现方式中,该目标图像可以是本地存储的图像,或者,该目标图像也可以是基于目标对象指定操作实时获取的图像;比如,该目标图像可以是目标对象基于截屏操作实时获取的图像;或者,该目标图像也可以是目标对象通过长按屏幕中的指定区域触发生成图像描述信息时,计算机设备实时采集到的终端屏幕上的图像;或者,该目标图像也可以是基于终端的图像采集组件实时获取到的图像。本申请对目标图像的获取方式不进行限制。
步骤220,提取目标图像的语义特征集合,以及,提取目标图像的视觉特征集合。
目标图像的语义特征集合用于指示描述目标图像的图像信息的候选描述词对应的词向量的集合。
目标图像的视觉特征集合用于指示基于目标图像的像素点的RGB(红绿蓝)分布等特征获取到的图像特征的集合。
步骤230,通过信息生成模型中的注意力融合网络,在n个时间步上对目标图像的语义特征和目标图像的视觉特征进行注意力融合,获取n个时间步上的描述词。
对应于上述的注意力融合过程,该注意力融合网络在第t个时间步上的输入包括第t个时间步上的语义注意力向量、第t个时间步上的视觉注意力向量、以及注意力融合网络在第t-1个时间步上的输出结果;第t个时间步上的语义注意力向量是在第t个时间步上对语义特征集合进行注意力机制处理获得的;第t个时间步上的视觉注意力向量是在第t个时间步上对视觉特征集合 进行注意力机制处理获得的;注意力融合网络在第t-1个时间步上的输出结果用于指示第t-1个时间步上的描述词;第t个时间步是n个时间步中的任意一个;1≤t≤n,且t、n均为正整数。
其中,时间步的数量n表示生成目标图像的图像描述信息所需的时间步的数量。
注意力机制(Attention Mechanism)的本质是一种通过网络自主学习出的一组权重系数,并以“动态加权”的方式来强调目标对象感兴趣的区域,同时抑制不相关背景区域的机制。在计算机视觉领域中,注意力机制可以大致分为两大类:强注意力和软注意力。
注意力机制常被运用在RNN(Recurrent Neural Networks,循环神经网络)上;带有注意力机制的RNN,在每次处理目标图像的部分像素时,都会根据当前状态的前一个状态所关注的,目标图像的部分像素去处理,而不是根据目标图像的全部像素去处理,可以减少任务的处理复杂度。
本申请实施例中,在生成图像描述信息时,计算机设备在生成一个单词之后,基于生成的这个单词生成下一个单词;其中,生成一个单词所需要的时间称为一个时间步(Time Step)。可选的,时间步的个数n可以是大于1的非固定值;响应于生成的描述词为用于指示描述词的生成过程结束的词或字符,计算机设备结束描述词的生成过程。
本申请实施例中的信息生成模型用以生成图像的图像描述信息;该信息生成模型是通过样本图像,以及样本图像对应的图像描述信息训练生成的;其中,样本图像的图像描述信息可以是文本信息。
在本申请实施例中,语义注意力向量可以利用多种属性同时强化视觉描述词和非视觉描述词的生成;视觉描述词是指基于图像的像素信息可以直接提取到的描述词信息,比如,图像描述信息中,词性为名词的描述词等;而非视觉描述词则是指代基于图像的像素信息提取概率较低的描述词信息,或者无法直接提取到的描述词信息,比如,图像描述信息中,词性为动词,或者,介词的描述词等。
视觉注意力向量可以强化视觉描述词的生成,在提取图像中的视觉描述词上具有良好的表现。图3示出了本申请一示例性实施例示出的基于不同的注意力提取图像中单词信息的示意图,如图3所示,图3中的A部分示出了指定图像在语义注意力机制的作用下获取到的各个描述词的权重变化;图3中的B部分示出了同一指定图像在视觉注意力机制的作用下获取到的各个描述词的权重变化;以描述词为单词为例,对于“people”,“standing”和“table”这三个单词而言,在语义注意力机制下,在各个单词生成的时刻,各个单词对应的权重达到峰值,即语义注意力机制会关注与当前语境相关度最高的单词;在视觉注意力机制下,在生成三个单词中的视觉单词时,也就是说,在生成“people”和“table”时,视觉注意力会聚焦于指定图像中,与视觉单词相对应的图像区域中,示意性的,如图3所示,在生成“people”时,视觉注意力聚焦于指定图像中包含人脸的区域310;在生成三个单词中的非视觉单词时,也就是说,在生成“table”时,视觉注意力聚焦于指定图像中包含桌子的区域320;但在基于视觉注意力机制生成非视觉单词时,比如,在生成“standing”时,视觉注意力机制聚焦于无关的,有可能产生误导的图像区域330。
因此,为了结合视觉注意力机制在生成视觉词汇上的优势以及语义注意力机制在生成非视觉单词上的优势,在本申请实施例中,将视觉注意力和语义注意力相结合,使得计算机设备在能够更为精确地引导视觉单词和非视觉单词的生成的同时,降低了视觉注意力在生成非视觉单词上的干扰,使得生成的图像描述更为完整和充实。
步骤240,基于目标图像在n个时间步上的描述词,生成目标图像的图像描述信息。
在一种可能的实现方式中,按照指定顺序对n个时间步上的描述词进行排序,比如顺序排序,以生成目标图像的图像描述信息。
综上所述,本申请实施例提供的信息生成方法,通过分别提取目标图像的语义特征集合和视觉特征集合,利用信息生成模型中的注意力融合网络,实现了对语义特征和视觉特征的注意力融合,使得在生成图像描述信息的各个时间步上,计算机设备可以基于目标图像的视 觉特征和语义特征,结合在上一个时间步上的输出结果,生成当前时间步上目标图像的描述词,进而生成目标图像的图像描述信息;使得在图像描述信息的生成过程中,将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势进行互补,从而提高了生成图像描述信息的准确性。
示意性的,本申请实施例所述的方法可以应用且不限于以下场景中:
1、视障人士获取图像信息的场景;
视障人士(即具有视觉障碍的人士)的视觉功能由于视觉敏锐度降低或视野受损,无法达到正常视力,从而影响到视障人士对视觉信息的获取。比如,当视障人士使用手机查看图文或者视频时,由于无法通过视觉获取到完整的视觉信息内容,需要借助听觉对图像中的信息进行获取;一种可能的方式是,目标对象通过选中需要查看的内容的所在区域或者区域范围,通过本申请实施例中的信息生成方法,生成对应于该区域的图像描述信息,并将该图像描述信息由文字信息转化为音频信息进行播放,从而辅助视障人士获取到完整的图像信息。
图4示出了本申请一示例性实施例示出的视频场景下对应的目标图像选择示意图,如图4所示,该目标图像可以是计算机设备从播放中的视频中,基于接收到的对播放中的视频的指定操作获取到的图像;或者,也可以是计算机设备接收到的从直播预览界面中实时展示的直播间的动态影像中,基于接收到的对动态影像的指定操作获取到的图像;该直播预览界面中展示的动态影像用于辅助目标对象通过对直播间内的实时内容的预览,做出是否进入直播间进行观看的决策。
在一种可能的实现方式中,目标对象可以单击(指定操作)视频图像或者动态影像的某个区域,以确定将该区域中的当前图像(接收到单击操作时的图像)获取为目标图像。
为强化显示目标对象对目标图像的选择,可以将基于指定操作被选中的区域进行突出显示;比如高亮显示,或者,放大显示,或者边框加粗显示等等。如图4所示,为将区域410进行边框加粗显示。
2、早期教育场景;
在早期教育场景中,由于幼儿对物体或文字的认知范围有限,通过图像进行教学会有较好的教学效果;在此场景中,可以通过本申请所示的信息生成方法,对幼儿触控的图像进行图像信息描述,从而从视觉和听觉两个方向对幼儿进行信息传输,激发幼儿的学习兴趣,提高信息传输效果。
本申请涉及的方法包括模型训练阶段和信息生成阶段。图5是根据一示例性实施例示出的一种模型训练阶段和信息生成阶段的框架图;如图5所示,在模型训练阶段,模型训练设备510,通过预先设置好的训练样本(包括样本图像、样本图像对应的图像描述信息,示意性的,该图像描述信息可以是顺序排列的描述词),得到视觉-语义双重注意力(Visual-Semantic Double Attention,VSDA)模型,即信息生成模型;该视觉-语义双重注意力模型包括语义注意力网络,视觉注意力网络以及注意力融合网络。
在信息生成阶段,信息生成设备520基于该视觉-语义双重注意力模型,对输入的目标图像进行处理,获得目标图像对应的图像描述信息。
其中,上述模型训练设备510和信息生成设备520可以是计算机设备,比如,该计算机设备可以是个人电脑、服务器等固定式计算机设备,或者,该计算机设备也可以是平板电脑、电子书阅读器等移动式计算机设备。
可选的,上述模型训练设备510和信息生成设备520可以是同一个设备,或者,模型训练设备510和信息生成设备520也可以是不同的设备。并且,当模型训练设备510和信息生成设备520是不同的设备时,模型训练设备510和信息生成设备520可以是同一类型的设备,比如模型训练设备510和信息生成设备520可以都是服务器;或者,模型训练设备510和信 息生成设备520也可以是不同类型的设备,比如信息生成设备520可以是个人电脑或者终端,而模型训练设备510可以是服务器等。本申请实施例对于模型训练设备510和信息生成设备520的具体类型不做限定。
图6示出了本申请一示例性实施例提供的信息生成模型的训练方法的流程图,该方法可以由计算机设备执行,该计算机设备可以实现为终端或服务器,该终端或服务器可以是图1所示的终端或服务器,如图6所示,该信息生成模型的训练方法包括以下步骤:
步骤610,获取样本图像集,该样本图像集包括至少两个图像样本以及至少两个图像样本分别对应的图像描述信息。
步骤620,基于样本图像集进行训练,获得信息生成模型。
该信息生成模型可以是视觉-语义双重注意力模型,包括语义注意力网络、视觉注意力网络以及注意力融合网络;该语义注意网络用于基于图像的语义特征集合获得语义注意力向量,该视觉注意力网络用于基于图像的视觉特征集合获得视觉注意力向量;该注意力融合网络用于对图像的语义特征以及视觉特征进行注意力融合,获得组成图像对应的图像描述信息的描述词。
综上所述,本申请实施例提供的信息生成模型的训练方法,基于样本图像集的训练,获得包括语义注意力网络、视觉注意力网络以及注意力融合网络的信息生成模型;使得在生成图像描述信息的过程中,利用上述信息生成模型,能够基于目标图像的视觉特征和语义特征在上一个时间步上的输出结果的综合作用,生成当前时间步上目标图像的描述词,进而生成目标图像对应的图像描述信息,使得在图像描述信息的生成过程中,将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势进行互补,从而提高了生成图像描述信息的准确性。
在本申请实施例中,模型训练的过程可以由服务器执行,图像描述信息的生成过程可以由服务器或终端执行;当图像描述信息的生成过程由终端执行时,服务器将训练好的视觉-语义双注意力模型发送给终端,以使得终端可以基于视觉-语义双注意力模型对获取的目标图像进行处理,获得目标图像的图像描述信息。以下实施例以模型训练过程与图像描述信息的生成过程均由服务器执行为例进行说明。图7示出了本申请一示例性实施例提供的模型训练以及信息生成方法的流程图,该方法可以由计算机设备执行,如图7所示,该模型训练以及信息生成方法可以包括以下步骤:
步骤701,获取样本图像集,该样本图像集包括至少两个图像样本以及该至少两个图像样本分别对应的图像描述信息。
其中,各个样本图像对分别应的图像描述信息可以是由相关人员进行标注的。
步骤702,基于样本图像集进行训练,获得信息生成模型。
该信息生成模型为视觉-语义双注意力模型,包括语义注意力网络、视觉注意力网络以及注意力融合网络;该语义注意网络用于基于目标图像的语义特征集合获得语义注意力向量,该视觉注意力网络用于基于目标图像的视觉特征集合获得视觉注意力向量;该注意力融合网络用于对目标图像的语义特征以及视觉特征进行注意力融合,获得组成目标图像对应的图像描述信息的描述词。
在一种可能的实现方式中,该信息生成模型还包括语义卷积神经网络以及视觉卷积神经网络,其中,该语义卷积神经网络用于对目标图像进行处理,获得目标图像的语义特征向量,以获取该目标图像对应的描述词集合;该视觉卷积神经网络用于对目标图像进行处理,获得该目标图像对应的视觉特征集合。
在一种可能的实现方式中,对信息生成模型进行训练的过程实现为:
将样本图像集中的各个样本图像输入到信息生成模型中,获得各个样本图像对应的预测 图像描述信息;
基于各个样本图像对应的预测图像描述信息与各个样本图像对应的图像描述信息,计算损失函数值;
基于损失函数值,对信息生成模型进行参数更新。
由于需要使得信息生成模型基于样本图像的输出结果(即预测图像描述信息)与样本图像对应的图像描述信息相近,才可以保证信息生成模型在应用时生成目标图像的图像描述信息的准确性,因此需要在信息生成模型的训练过程中进行多次训练,更新信息生成模型中各个网络中的各个参数,直至信息生成模型收敛。
令θ表示信息生成模型中涉及的所有参数,给定目标序列(Ground Truth Sequence){w 1,w 2,...,w t},即样本图像的图像描述信息中的描述词序列,且损失函数为最小化交叉熵(Cross Entropy loss)函数,计算信息生成模型对应的损失函数值的公式可以表示为:
L(θ) = -∑_{t=1}^{T} log p_θ(w_t | w_1, …, w_{t-1})
上式中的p_θ(w_t | w_1, …, w_{t-1})表示信息生成模型输出的预测图像描述信息中各个描述词的概率。基于损失函数的计算结果对信息生成模型中的各个网络中的各个参数进行调节。
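A minimal sketch of this cross-entropy objective, assuming the model exposes per-time-step vocabulary logits; the tensor names and sizes below are illustrative and not taken from the filing.

```python
import torch
import torch.nn.functional as F

def caption_cross_entropy(caption_logits, target_ids):
    """Cross-entropy over a ground-truth descriptor sequence {w_1, ..., w_T}.

    caption_logits: (T, vocab_size) unnormalized scores s_t for each time step.
    target_ids:     (T,) indices of the ground-truth descriptors w_t.
    """
    # -sum_t log p_theta(w_t | w_1..w_{t-1}); use sum reduction to match the formula above.
    return F.cross_entropy(caption_logits, target_ids, reduction="sum")

# Illustrative usage with random numbers only.
logits = torch.randn(12, 5000)            # 12 time steps, 5000-word vocabulary
targets = torch.randint(0, 5000, (12,))   # ground-truth descriptor indices
loss = caption_cross_entropy(logits, targets)
```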
步骤703,获取目标图像。
响应于图像描述信息的生成过程由服务器执行,该目标图像可以是通过终端获取到目标图像后发送给服务器进行图像描述信息获取的图像,相应的,服务器接收该目标图像。
步骤704,获取目标图像的语义特征向量。
在一种可能的实现方式中,将目标图像输入语义卷积神经网络,获得语义卷积神经网络输出的目标图像的语义特征向量。
其中,该语义卷积神经网络可以是全卷积网络(Fully Conventional Network,FCN),或者,也可以是卷积神经网络(Convolutional Neural Networks,CNN);其中,CNN是一种前馈神经网络,是一种单向多层结构的神经网络。同一层神经元之间没有互相连接,层间信息传达只沿一个方向进行,除输入层,输出层之外,中间的全部为隐藏层,隐藏层为一层或多层;CNN可以直接从图像底层的像素特征开始,逐层对图像进行特征提取;CNN是编码器最常用的实现模型,负责将图像编码成向量。
计算机设备通过该语义卷积神经网络对目标图像的处理,可以获得该目标图像的粗略的图表示向量,即目标图像的语义特征向量。
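As an illustration of this step, the sketch below uses a torchvision ResNet-50 with its classification head removed as a stand-in semantic CNN; the patent does not name a particular backbone, so the choice of network and the 2048-dimensional output are assumptions of this example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in backbone; the patent only requires a CNN that maps the target image to a
# coarse global feature vector, it does not prescribe ResNet-50.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()   # drop the classification head, keep the pooled 2048-d feature
backbone.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)         # a preprocessed target image (illustrative)
    semantic_feature_vector = backbone(image)  # shape (1, 2048)
```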
步骤705,基于该语义特征向量,提取目标图像的语义特征集合。
在词汇库中,并不是所有的属性词都对应于该目标图像,若对词汇库中的所有的词都进行概率计算或验证,则会造成过多且不必要的数据处理,因此在进行描述词集合获取之前,计算机设备可以先基于获取的用于指示目标图像属性的语义特征向量,对词汇库中的属性词进行筛选,获取其中可能对应于目标图像的属性词组成的属性词集合,即候选描述词集合,之后提取候选描述词集合中的属性词的语义特征,以得到目标图像的语义特征集合。
在一种可能的实现方式中,计算机设备可以基于语义特征向量,从词汇库中提取目标图像对应的属性词集合;该属性词集合是指对目标图像进行描述的候选描述词的集合;
将属性词集合所对应的词向量集合获取为目标图像的语义特征集合。该词向量集合中包含属性词集合中各个候选描述词各自对应的词向量。
该属性词集合中的候选描述词即为与目标图像的语境相对应的属性词;本申请对属性词集合中的候选描述词的数量不进行限制。
其中,候选描述词中可以包括同一单词的不同形式,比如:play,playing,plays等等。
在一种可能的实现方式中,可以获取各个词汇的匹配概率,基于各个词汇的匹配概率从词汇库中筛选候选描述词,以组成属性词集合该过程可以实现为:
基于语义特征向量,获取词汇库中各个词汇的匹配概率,该匹配概率是指词汇库中的词汇与目标图像相匹配的概率;
提取词汇库中,匹配概率大于匹配概率阈值的词汇为候选描述词,以组成属性词集合。
在一种可能的实现方式中,可以通过Noisy-OR的方法来计算图像中每个属性单词的概率;为了提高获取到的属性词的精度,可以将该概率阈值设置为0.5;需要说明的是,该概率阈值的设定可以根据实际情况进行调节,本申请对此不进行限制。
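A small sketch of the Noisy-OR scoring and the 0.5 threshold described above, assuming per-sub-region word probabilities are already available (for example from a word detector); the vocabulary and the numbers are made up for illustration.

```python
import numpy as np

def noisy_or_attribute_probs(region_probs):
    """region_probs: (num_regions, vocab_size) per-region word probabilities.

    Noisy-OR: a word is present if at least one region supports it,
    p(w) = 1 - prod_j (1 - p_j(w)).
    """
    return 1.0 - np.prod(1.0 - region_probs, axis=0)

def select_attribute_words(region_probs, vocab, threshold=0.5):
    probs = noisy_or_attribute_probs(region_probs)
    return [word for word, p in zip(vocab, probs) if p > threshold]

# Illustrative usage with made-up numbers.
vocab = ["people", "table", "standing", "dog"]
region_probs = np.array([[0.6, 0.1, 0.1, 0.0],
                         [0.3, 0.7, 0.3, 0.1]])
attribute_words = select_attribute_words(region_probs, vocab)  # ['people', 'table']
```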
为了提高属性词获取的准确性,在一种可能的实现方式中,可以预先训练词汇检测器,该词汇检测器用于基于目标图像的特征向量,从词汇库中获取属性词;因此,计算机可以借助训练好的词汇检测器获取属性词,即:
将特征向量输入到词汇检测器中,以使得词汇检测器基于特征向量,从词汇库中提取属性词;
可选的,该词汇检测器是通过多示例学习(Multiple Instance Learning,MIL)的弱监督方法训练获得的词汇检测模型。
步骤706,提取目标图像的视觉特征集合。
在一种可能的实现方式中,计算机设备可以将目标图像输入到视觉卷积神经网络中,获取视觉卷积神经网络输出的目标图像的视觉特征集合。
为了提高获得的视觉特征集合的准确性,在一种可能的实现方式中,在提取目标图像的视觉特征集合之前,计算机设备可以先对目标图像进行预处理,该预处理过程可以包括以下步骤:
对目标图像进行子区域划分,获得至少一个子区域;
在此情况下,提取目标图像的视觉特征集合的过程可以实现为:
分别提取至少一个子区域的视觉特征,组成视觉特征集合。
其中,计算机设备可以对目标图像进行等间距划分,以获得至少一个子区域;划分间距可以是由计算机设备基于目标图像的图像尺寸进行设置的,不同的图像尺寸对应的划分间距不同;本申请对子区域的数量以及划分间距的大小不进行限制。
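A sketch of one possible equal-spacing division, assuming a fixed 3×3 grid; the patent leaves the number of sub-regions and the spacing to the implementation, so both are assumptions here.

```python
import torch

def split_into_subregions(image, grid=3):
    """Divide an image tensor (C, H, W) into an evenly spaced grid x grid of sub-regions."""
    c, h, w = image.shape
    step_h, step_w = h // grid, w // grid
    regions = []
    for i in range(grid):
        for j in range(grid):
            regions.append(image[:, i * step_h:(i + 1) * step_h,
                                    j * step_w:(j + 1) * step_w])
    return regions

image = torch.rand(3, 224, 224)
subregions = split_into_subregions(image)   # 9 crops of shape (3, 74, 74)
# Each crop would then be passed through the visual CNN to obtain a_1, ..., a_m.
```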
在本申请实施例中,提取目标对象的语义特征集合的过程与提取目标对象的视觉特征集合的过程可以同步进行,也就是说,步骤704至步骤705,与步骤706可以同步执行。
步骤707,通过信息生成模型中的注意力融合网络,在n个时间步上对目标图像的语义特征和目标图像的视觉特征进行注意力融合,获取n个时间步上的描述词。
以n个时间步中的第t个时间步为例,获取第t个时间步上的描述词的过程可以实现为:
在第t个时间步上,将第t个时间步上的语义注意力向量、第t个时间步上的视觉注意力向量、第t-1个时间步上的隐藏层向量、以及注意力融合网络在第t-1个时间步上的输出结果输入至注意力融合网络,获得注意力融合网络在第t个时间步上的输出结果,以及第t个时间步上的隐藏层向量;
或者,
在第t个时间步上,将第t个时间步上的语义注意力向量、第t个时间步上的视觉注意力向量、以及注意力融合网络在第t-1个时间步上的输出结果输入至注意力融合网络,获得注意力融合网络在第t个时间步上的输出结果,以及第t个时间步上的隐藏层向量。
也就是说,在一种可能的实现方式中,可以将语义注意力向量和视觉注意力向量作用于上一个时间步上的输出结果,获得当前时间步上的输出结果;或者在另一种可能的实现方式中,为了提高获得的各个时间步上的输出结果的准确性,可以将语义注意力向量、视觉注意力向量以及上一个时间步上的隐藏层向量作用于上一个时间步上的输出结果,获得当前时间步上的输出结果;当前时间步上的输出结果即为当前时间步上的描述词的词向量。
为获取目标图像在各个时间步上的描述词,需要获取目标图像在各个时间步上的注意力向量,该注意力向量包括语义注意力向量和视觉注意力向量。
以第t个时间步为例,在获取语义注意力向量时:在第t个时间步上,基于第t-1个时间步上的隐藏层向量,以及目标图像的语义特征集合,生成第t个时间步上的语义注意力向量。
其中,隐藏层向量指示在生成描述词时产生的中间内容,隐藏层向量中包含了用于指示生成下一个描述词的历史信息或者语境信息,从而使得在下一个时间步上生成的下一个描述词更加符合当前语境。
第t个时间步表示n个时间步中的任意时间步,n表示生成图像描述信息所需的时间步的个数,1≤t≤n,且t、n均为正整数。
在生成当前时间步上的语义注意力向量时,信息生成模型可以基于上一个时间步上的隐藏层向量,以及目标图像的语义特征集合,生成当前时间步上的语义注意力向量。
在一种可能的实现方式中,信息生成模型可以将第t-1个时间步上输出的隐藏层向量,以及目标图像的语义特征集合输入信息生成模型中的语义注意力网络,获得语义注意力网络输出的第t个时间步上的语义注意力向量。
该语义注意力网络用于基于第t-1个时间步上的隐藏层向量以及目标图像的语义特征集合,获取语义特征集合中的各个语义特征在第t-1个时间步上的权重;
信息生成模型可以基于语义特征集合中的各个语义特征在第t-1个时间步上的权重,以及目标图像的语义特征集合,生成第t个时间步上的语义注意力向量。
其中,各个时间步上的语义注意力向量为各个属性词的权重和,计算公式为:
c_t = b_i · h_{t-1}
β_t = softmax(c_t)
A_t = ∑_{i=1}^{L} β_t^i · b_i
b_i={b_1,…,b_L}表示从目标图像中获取到的属性；L表示属性的长度，即属性词的数量；此处b_i表示每个属性词的词向量；c_t表示长期记忆向量；h_{t-1}表示第t-1个时间步上的隐藏层向量；β_t表示在第t个时间步上的各个属性词各自的权重；A_t表示第t个时间步的语义注意力向量。
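The weighting just described can be sketched as follows, assuming the attribute-word vectors b_i are stacked into a matrix and h_{t-1} has the same dimensionality; the function name is illustrative.

```python
import torch

def semantic_attention(attr_embeddings, h_prev):
    """attr_embeddings: (L, d) word vectors b_i of the attribute words.
    h_prev: (d,) hidden layer vector h_{t-1} of the previous time step.

    Returns the semantic attention vector A_t as the weighted sum of the b_i.
    """
    scores = attr_embeddings @ h_prev        # c_t: one score per attribute word
    beta = torch.softmax(scores, dim=0)      # beta_t
    return beta @ attr_embeddings            # A_t = sum_i beta_t^i * b_i
```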
以第t个时间步为例,在获取视觉注意力向量时:在第t个时间步上,基于第t-1个时间步上的隐藏层向量,以及视觉特征集合,生成第t个时间步上的视觉注意力向量。
在生成当前时间步上的视觉注意力向量时,信息生成模型可以基于上一个时间步输出的隐藏层向量,以及目标图像的视觉特征集合,生成当前时间步上的视觉注意力向量。
在一种可能的实现方式中,信息生成模型可以将第t-1个时间步上输出的隐藏层向量,以及目标图像的视觉特征集合输入信息生成模型中的视觉注意力模型,获得视觉注意力模型输出的第t个时间步上的语义注意力向量。
该视觉注意力模型用于基于第t-1个时间步上的隐藏层向量以及视觉特征集合,获取视觉特征集合中的各个视觉特征在第t-1个时间步上的权重;
信息生成模型可以基于视觉特征集合中的各个视觉特征在第t-1个时间步上的权重,以及视觉特征集合,生成第t个时间步上的视觉注意力向量。
其中,各个时间步上的视觉注意力向量为各个子区域的视觉特征的权重和,计算公式为:
α_t = softmax(a_i · h_{t-1})
V_t = ∑_{i=1}^{m} α_t^i · a_i
a_i={a_1,…,a_m}表示各个子区域的视觉特征，用以指示目标图像的焦点区域；m表示子区域的个数，即提取到的视觉特征的个数；α_t表示各个视觉特征对应的权重；V_t表示第t个时间步的视觉注意力向量。
其中,在计算各个子区域的视觉特征对应的权重时,信息生成模型可以通过逐元素乘积策略(Element-Wise MultiplicationStrategy)进行计算,以获得更好的性能。
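A matching sketch for the visual branch, using the element-wise product of each sub-region feature with h_{t-1} (reduced by a sum) as the score, under the same dimensionality assumption as the semantic sketch above.

```python
import torch

def visual_attention(region_features, h_prev):
    """region_features: (m, d) visual features a_i of the sub-regions.
    h_prev: (d,) hidden layer vector h_{t-1}.
    """
    scores = (region_features * h_prev).sum(dim=-1)  # element-wise product with h_{t-1}, reduced per region
    alpha = torch.softmax(scores, dim=0)             # alpha_t
    return alpha @ region_features                   # V_t = sum_i alpha_t^i * a_i
```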
由于注意力模型能够捕获到更详细的子区域图像特征,在生成不同物体的描述词汇时,软注意力机制能够自适应地聚焦于对应的区域,性能更佳,因此在本申请实施例中采用基于软注意力机制构建的视觉注意力模型。
视觉注意力模型和语义注意力模型在每个时间步上都会计算对应的特征向量的权重,由于不同时间步上的隐藏层向量不同,每个时间步上获得的各个特征向量的权重也不相同,因此,在各个时间步上,信息生成模型可以关注与各个时间步上的语境更符合的图像焦点区域以及用于生成图像描述的特征词。
在一种可能的实现方式中,该信息生成模型中的注意力融合网络可以实现为序列网络,该序列网络可以包括LSTM(Long Short Term Memory,长短期记忆网络),Transformer网络等。其中,LSTM是一种时间递归神经网络,用于预测时间序列中间隔或者延迟相对较长时间的重要时间,是一种特殊的RNN。
以该序列网络为LSTM网络为例,在生成图像描述信息时,将视觉注意力向量V和语义注意力向量A作为LSTM网络的额外输入参数,将这两个注意力特征合并入LSTM网络的单元节点来引导图像描述信息的生成,引导信息生成模型同时关注图像的视觉特征和语义特征,以使得两个特征向量相互补足。
在本申请实施例中,可以使用BOS和EOS记号分别表示语句的开头和结尾;基于此,LSTM网络基于视觉注意力向量和语义注意力向量生成描述词的公式如下:
（公式图像：PCTCN2022073372-appb-000005）
i_t = σ(W_ix·x_t + W_ih·h_{t-1} + b_i)
f_t = σ(W_fx·x_t + W_fh·h_{t-1} + b_f)
o_t = σ(W_ox·x_t + W_oh·h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ φ(W_cx·x_t + W_ch·h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t)
s_t = W_s·h_t
其中，σ表示sigmoid函数；φ表示带有两个单元的maxout非线性激活函数；i_t表示input gate，f_t表示forget gate，o_t表示output gate。
LSTM使用一个softmax函数输出下一个单词的概率分布:
w_t ~ softmax(s_t)
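A compact sketch of one such decoding step built on torch.nn.LSTMCell, in which V_t and A_t are simply concatenated to the previous word embedding as extra inputs; this concatenation, and all layer names, are assumptions of the sketch rather than the exact cell-level fusion shown in the filing's formula images.

```python
import torch
import torch.nn as nn

class AttentionFusionStep(nn.Module):
    """One decoding step: V_t and A_t are fed to the LSTM cell together with the
    previous descriptor, as extra inputs. A sketch, not the patented network itself."""

    def __init__(self, embed_dim, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + 2 * feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)          # s_t = W_s h_t

    def forward(self, prev_word, v_t, a_t, state=None):
        x_t = torch.cat([self.embed(prev_word), v_t, a_t], dim=-1)
        h_t, c_t = self.cell(x_t, state)
        probs = torch.softmax(self.out(h_t), dim=-1)          # w_t ~ softmax(s_t)
        return probs, (h_t, c_t)

# Illustrative usage with random inputs.
step = AttentionFusionStep(embed_dim=256, feat_dim=512, hidden_dim=512, vocab_size=5000)
probs, state = step(torch.tensor([1]), torch.rand(1, 512), torch.rand(1, 512))
```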
在一种可能的实现方式中,信息生成模型中的注意力融合网络中设置有超参数,该超参数用以指示视觉注意力向量与语义注意力向量分别在注意力融合网络中的权重。
由于在图像描述信息的生成过程中,视觉注意力特征与语义注意力特征会在不同的方面对信息生成模型生成的图像描述信息造成影响,视觉注意力向量V会引导模型去关注图像的相关区域,语义注意力向量A会强化生成关联度最高的属性单词;鉴于这两个注意力向量是相互补足的,因此,可以通过在注意力融合网络中设置一个超参数以确定两个注意力向量之间的最佳组合方式。仍以该注意力融合网络为LSTM网络为例,更新后的LSTM网络基于视觉注意力向量和语义注意力向量生成描述词的公式如下:
（公式图像：PCTCN2022073372-appb-000008）
i_t = σ(W_ix·x_t + W_ih·h_{t-1} + b_i)
f_t = σ(W_fx·x_t + W_fh·h_{t-1} + b_f)
o_t = σ(W_ox·x_t + W_oh·h_{t-1} + b_o)
（公式图像：PCTCN2022073372-appb-000009）
h_t = o_t ⊙ tanh(c_t)
s_t = W_s·h_t
其中,z表示超参数,其取值范围为[0.1,0.9],用以代表两个注意力向量的不同权重,z越大,视觉特征在注意力引导中的权重越大,语义特征在注意力引导中的权重越小;反之,z越小,语义特征在注意力引导中的权重越大,视觉特征在注意力引导中的权重越小。
需要说明的是,超参数的数值设置可以根据模型在不同权重分配下的表现效果进行设置,本申请对超参数的数值大小不进行限制。
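The role of the hyperparameter z can be illustrated with a one-line weighted combination; note that the filing presents the exact place where z enters the LSTM only as formula images, so treating it as a simple convex combination of V_t and A_t is an assumption of this sketch.

```python
import torch

def fuse_attention(v_t, a_t, z=0.5):
    """Weight the two attention vectors with the hyperparameter z in [0.1, 0.9].

    Larger z -> visual attention dominates; smaller z -> semantic attention dominates.
    Feeding a single weighted sum into the cell is an assumption of this sketch.
    """
    return z * v_t + (1.0 - z) * a_t

# The value of z would be chosen by comparing model performance under different weightings,
# e.g. candidate_z = [0.1, 0.3, 0.5, 0.7, 0.9] evaluated on a validation set.
```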
步骤708,基于目标图像在n个时间步上的描述词,生成目标图像的图像描述信息。
在一种可能的实现方式中,信息生成模型生成的图像描述信息为第一语言的描述信息,比如,该第一语言可以为英文,或者,中文,或者其他语言。
为了使得图像描述信息更加适应于不同对象的使用需求,在一种可能的实现方式中,响应于生成的目标图像描述信息的语言为非指定语言,计算机设备可以将生成的第一语言的描述信息更改为指定语言的描述信息;比如,信息生成模型生成的图像描述信息为英文的描述信息,而目标对象需求的指定语言为中文,那么在信息生成模型生成英文的图像描述信息后,计算机设备可以将该英文的图像描述信息翻译为中文的图像描述信息后输出。
其中,输出的图像描述信息的语言类型,也就是说指定语言的类型可以由相关对象根据实际需求进行设置,本申请对图像描述信息的语言类型不进行限制。
在一种可能的实现方式中,由于生成的图像描述信息为文字信息,为了便于目标对象接收图像描述信息,计算机设备可以基于TTS(Text-To-Speech,语音合成)技术,将文字类型的图像描述信息转化为语音类型的图像描述信息,并通过语音播放的形式将图像描述信息传输给目标对象。
上述过程可以实现为,服务器将获取到的文字类型的图像描述信息通过TTS技术转化为语音类型的图像描述信息后,将语音类型的图像描述信息发送给终端,以使得终端根据获取到的语音类型的图像描述信息,并播放图像描述信息;或者,服务器也可以将文字类型的图像描述信息发送给终端,由终端通过TTS技术将文字类型的图像描述信息转化为语音类型的图像描述信息后,进行语音播放。
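As one possible realization of the text-to-speech step on the terminal side, the sketch below uses the pyttsx3 offline TTS engine; the patent does not prescribe any particular TTS library, so this choice is purely illustrative.

```python
import pyttsx3

def speak_description(description_text: str) -> None:
    """Read the generated image description aloud; pyttsx3 is just one possible TTS backend."""
    engine = pyttsx3.init()
    engine.say(description_text)
    engine.runAndWait()

speak_description("a group of people standing around a table")
```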
综上所述,本申请实施例提供的模型训练以及信息生成方法,通过分别提取目标图像的语义特征集合和视觉特征集合,利用信息生成模型中的注意力融合网络,实现了对语义特征和视觉特征的注意力融合,使得在生成图像描述信息的各个时间步上,基于目标图像的视觉特征和语义特征在上一个时间步上的输出结果的综合作用,生成当前时间步上目标图像的描述词,进而生成目标图像对应的图像描述信息;使得在图像描述信息的生成过程中,将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势进行互补,从而提高了生成图像描述信息的准确性;
同时,在语义注意力网络获取各个属性词的权重之前,通过基于图像的特征向量对词汇库中的词汇进行筛选,获取到与图像相关的属性词作为候选描述词,基于候选描述词进行权重计算,从而减少了语义注意力网络的数据处理量,在保证处理精度的同时,降低了信息生成模型的数据处理压力。
以注意力融合网络为LSTM网络,注意力融合网络的输入包括上一个时间步的隐藏层向量,上一个时间步的输出结果,当前时间步的视觉注意力向量,以及当前时间步的语义注意力向量为例,图8示出了本申请一示例性实施例示出的图像描述信息的生成过程的示意图,如图8所示,计算机设备在获取到目标图像810之后,将目标图像810输入到信息生成模型820;信息生成模型820将该目标图像810输入到语义卷积神经网络821中,获得目标图像的 语义特征向量;之后,词汇检测器822基于目标图像的语义特征向量对词汇库中的属性词进行筛选,获得目标图像对应的候选描述词823,进而获取到目标图像对应的语义特征集合;同时,信息生成模型820将目标图像810输入到视觉卷积神经网络824中,获得目标图像对应的视觉特征集合825;将语义特征集合输入到语义注意力网络826,以使得语义注意力网络826根据输入的上一个时间步输出的隐藏层向量获取当前时间步上的语义注意力向量A t,t表示当前时间步;其中,当t=1时,上一个时间步输出的隐藏层向量为预设的隐藏层向量;相应的,将视觉特征集合输入到视觉注意力网络827,以使得视觉注意力网络827根据输入的上一个时间步输出的隐藏层向量获取当前时间步上的视觉注意力向量V t;将视觉注意力向量V t,语义注意力向量A t,上一个时间步输出的隐藏层向量以及上一个时间步输出的描述词x t(即y t-1),输入到LSTM网络828中,获得LSTM网络828输出的当前时间步上的描述词y t;其中,当t=1时,上一个时间步输出的描述词为预设的起始词或字符;重复上述过程直至LSTM网络输出的描述词为终止词或终止字符;计算机设备将获得的各个描述词按照获取的先后顺序排列后,获得该目标图像的图像描述信息830。
其中,图9示出了本申请一示例性实施例示出的注意力融合网络的输入输出示意图,如图9所示,在第t个时间步时,注意力融合网络910的输入包括,第t-1个时间步上的隐藏层向量h t-1,基于h t-1生成的第t个时间步上的视觉注意力向量V t,基于h t-1生成的语义注意力向量A t,以及第t-1个时间步上输出的描述词的图表示向量(即t-1时间步的输出向量y t-1);注意力融合网络910的输出包括第t个时间步的输出向量(y t),以及第t个时间步的隐藏层向量(h t,用于生成下一个描述词)。其中,视觉注意力向量是通过视觉注意力网络930基于各个子区域对应的视觉特征的加权和计算得到的,语义注意力向量是通过语义注意力网络920基于各个属性词的加权和计算得到的。
可以理解的是,在本申请的具体实施方式中,涉及到目标图像等用户相关数据,当本申请以上实施运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
图10示出了本申请一示例性实施例提供的信息生成装置的框架图,如图10所示,该装置包括:
图像获取模块1010,用于获取目标图像;
特征提取模块1020,用于提取所述目标图像的语义特征集合,以及,提取所述目标图像的视觉特征集合;
描述词获取模块1030,用于在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词;所述注意力融合的过程在第t个时间步上的输入包括所述第t个时间步上的语义注意力向量、所述第t个时间步上的视觉注意力向量、以及所述注意力融合的过程在第t-1个时间步上的输出结果;所述第t个时间步上的所述语义注意力向量是在所述第t个时间步上对所述语义特征集合进行注意力机制处理获得的;所述第t个时间步上的所述视觉注意力向量是在所述第t个时间步上对所述视觉特征集合进行注意力机制处理获得的;所述注意力融合的过程在所述第t-1个时间步上的所述输出结果用于指示所述第t-1个时间步上的描述词;所述第t个时间步是所述n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;
信息生成模块1040,用于基于所述目标图像在所述n个时间步上的描述词,生成所述目标图像的图像描述信息。
在一种可能的实现方式中,所述描述词获取模块1030,用于通过信息生成模型中的注意力融合网络,在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词。
在一种可能的实现方式中,所述描述词获取模块1030,用于,
在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、所述第t-1个时间步上的隐藏层向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量;
或者,
在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量。
在一种可能的实现方式中,所述注意力融合网络中设置有超参数,所述超参数用以指示所述视觉注意力向量与所述语义注意力向量在所述注意力融合网络中的权重。
在一种可能的实现方式中,所述装置还包括:
第一生成模块,用于在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述语义特征集合,生成所述第t个时间步上的所述语义注意力向量。
在一种可能的实现方式中,所述第一生成模块,包括:
第一获取子模块,用于基于所述第t-1个时间步上的所述隐藏层向量以及所述语义特征集合,获取所述语义特征集合中的各个语义特征在所述第t-1个时间步上的权重;
第一生成子模块,用于基于所述语义特征集合中的各个语义特征在所述第t-1个时间步上的权重,以及所述语义特征集合,生成所述第t个时间步上的所述语义注意力向量。
在一种可能的实现方式中,所述装置还包括:
第二生成模块,用于在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述视觉特征集合,生成所述第t个时间步上的所述视觉注意力向量。
在一种可能的实现方式中,所述第二生成模块,包括:
第二获取子模块,用于基于所述第t-1个时间步上的所述隐藏层向量以及所述视觉特征集合,获取所述视觉特征集合中的各个视觉特征在所述第t-1个时间步上的权重;
第二生成子模块,用于基于所述视觉特征集合中的各个视觉特征在所述第t-1个时间步上的权重,以及所述视觉特征集合,生成所述第t个时间步上的所述视觉注意力向量。
在一种可能的实现方式中,所述特征提取模块1020,包括:
第三获取子模块,用于获取所述目标图像的语义特征向量;
提取子模块,用于基于所述语义特征向量,提取所述目标图像的所述语义特征集合。
在一种可能的实现方式中,所述提取子模块,包括:
属性词提取单元,用于基于所述语义特征向量,从词汇库中提取所述目标图像对应的属性词集合;所述属性词集合是指对所述目标图像进行描述的候选描述词的集合;
语义特征提取单元,用于将所述属性词集合所对应的词向量集合,获取为所述目标图像的所述语义特征集合。
在一种可能的实现方式中,所述属性词提取单元,用于基于所述语义特征向量,获取所述词汇库中各个词汇的匹配概率;所述匹配概率是指所述词汇库中的词汇与所述目标图像相匹配的概率;
提取所述词汇库中,所述匹配概率大于匹配概率阈值的词汇,作为所述候选描述词,以组成所述属性词集合。
在一种可能的实现方式中,所述属性词提取单元,用于将所述语义特征向量输入到词汇检测器中,获得所述词汇检测器基于所述语义特征向量从所述词汇库中提取到的所述属性词集合;
其中,所述词汇检测器是通过多示例学习的弱监督方法训练获得的词汇检测模型。
在一种可能的实现方式中,在所述特征提取模块1020提取所述目标图像的视觉特征集合之前,所述装置还包括:
子区域划分模块,用于对所述目标图像进行子区域划分,获得至少一个子区域;
所述特征提取模块1020,用于分别提取所述至少一个子区域的视觉特征,组成所述视觉特征集合。
综上所述,本申请实施例提供的信息生成装置,通过分别提取目标图像的语义特征集合和视觉特征集合,利用信息生成模型中的注意力融合网络,实现了对语义特征和视觉特征的注意力融合,使得在生成图像描述信息的各个时间步上,基于目标图像的视觉特征和语义特征在上一个时间步上的输出结果的综合作用,生成当前时间步上目标图像的描述词,进而生成目标图像对应的图像描述信息,使得在图像描述信息的生成过程中,将视觉特征在生成视觉词汇上的优势与语义特征在生成非视觉特征的优势进行互补,从而提高了生成图像描述信息的准确性。
图11示出了本申请一示例性实施例示出的计算机设备1100的结构框图。该计算机设备可以实现为本申请上述方案中的服务器。所述计算机设备1100包括中央处理单元(Central Processing Unit,CPU)1101、包括随机存取存储器(Random Access Memory,RAM)1102和只读存储器(Read-Only Memory,ROM)1103的系统存储器1104,以及连接系统存储器1104和中央处理单元1101的系统总线1105。所述计算机设备1100还包括用于存储操作系统1109、应用程序1110和其他程序模块1111的大容量存储设备1106。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括RAM、ROM、可擦除可编程只读寄存器(Erasable Programmable Read Only Memory,EPROM)、电子抹除式可复写只读存储器(Electrically-Erasable Programmable Read-Only Memory,EEPROM)闪存或其他固态存储其技术,CD-ROM、数字多功能光盘(Digital Versatile Disc,DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1104和大容量存储设备1106可以统称为存储器。
所述存储器还包括至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、至少一段程序、代码集或指令集存储于存储器中,中央处理器1101通过执行该至少一条指令、至少一段程序、代码集或指令集来实现上述各个实施例所示的信息生成方法中的全部或者部分步骤。
图12示出了本申请一个示例性实施例提供的计算机设备1200的结构框图。该计算机设备1200可以实现为上述的人脸质量评估设备和/或质量评估模型训练设备,比如:智能手机、平板电脑、笔记本电脑或台式电脑。计算机设备1200还可能被称为终端设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,计算机设备1200包括有:处理器1201和存储器1202。
处理器1201可以包括一个或多个处理核心。
存储器1202可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。在一些实施例中,存储器1202中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1201所执行以实现本申请中方法实施例提供的信息生成方法。
在一些实施例中,计算机设备1200还可选包括有:外围设备接口1203和至少一个外围设备。处理器1201、存储器1202和外围设备接口1203之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1203相连。具体地,外围设备包括:射频电路1204、显示屏1205、摄像头组件1206、音频电路1207和电源1208中的至少 一种。
在一些实施例中,计算机设备1200还包括有一个或多个传感器1209。该一个或多个传感器1209包括但不限于:加速度传感器1210、陀螺仪传感器1211、压力传感器1212、光学传感器1213以及接近传感器1214。
本领域技术人员可以理解,图12中示出的结构并不构成对计算机设备1200的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在一示例性实施例中,还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条计算机程序,该计算机程序由处理器加载并执行以实现上述信息生成方法中的全部或部分步骤。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在一示例性实施例中,还提供了一种计算机程序产品,该计算机程序产包括至少一条计算机程序,该计算机程序由处理器加载并执行上述图2、图6或图7任一实施例所示方法的全部或部分步骤。

Claims (20)

  1. 一种信息生成方法,所述方法由计算机设备执行,所述方法包括:
    获取目标图像;
    提取所述目标图像的语义特征集合,以及,提取所述目标图像的视觉特征集合;
    在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词;所述注意力融合的过程在第t个时间步上的输入包括所述第t个时间步上的语义注意力向量、所述第t个时间步上的视觉注意力向量、以及所述注意力融合的过程在第t-1个时间步上的输出结果;所述第t个时间步上的所述语义注意力向量是在所述第t个时间步上对所述语义特征集合进行注意力机制处理获得的;所述第t个时间步上的所述视觉注意力向量是在所述第t个时间步上对所述视觉特征集合进行注意力机制处理获得的;所述注意力融合的过程在所述第t-1个时间步上的所述输出结果用于指示所述第t-1个时间步上的描述词;所述第t个时间步是所述n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;
    基于所述目标图像在所述n个时间步上的描述词,生成所述目标图像的图像描述信息。
  2. 据权利要求1所述的方法,所述在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词,包括:
    通过信息生成模型中的注意力融合网络,在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词。
  3. 根据权利要求2所述的方法,所述通过信息生成模型中的注意力融合网络,在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词,包括:
    在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、所述第t-1个时间步上的隐藏层向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量;
    或者,
    在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量。
  4. 根据权利要求2或3所述的方法,所述注意力融合网络中设置有超参数,所述超参数用以指示所述视觉注意力向量与所述语义注意力向量分别在所述注意力融合网络中的权重。
  5. 根据权利要求3所述的方法,所述方法还包括:
    在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述语义特征集合,生成所述第t个时间步上的所述语义注意力向量。
  6. 根据权利要求5所述的方法,所述在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述语义特征集合,生成所述第t个时间步上的所述语义注意力向量,包括:
    基于所述第t-1个时间步上的所述隐藏层向量以及所述语义特征集合,获取所述语义特征 集合中的各个语义特征在所述第t-1个时间步上的权重;
    基于所述语义特征集合中的各个语义特征在所述第t-1个时间步上的权重,以及所述语义特征集合,生成所述第t个时间步上的所述语义注意力向量。
  7. 根据权利要求3所述的方法,所述方法还包括:
    在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述视觉特征集合,生成所述第t个时间步上的所述视觉注意力向量。
  8. 根据权利要求7所述的方法,所述在所述第t个时间步上,基于所述第t-1个时间步上的所述隐藏层向量,以及所述视觉特征集合,生成所述第t个时间步上的所述视觉注意力向量,包括:
    基于所述第t-1个时间步上的所述隐藏层向量以及所述视觉特征集合,获取所述视觉特征集合中的各个视觉特征在所述第t-1个时间步上的权重;
    基于所述视觉特征集合中的各个视觉特征在所述第t-1个时间步上的权重,以及所述视觉特征集合,生成所述第t个时间步上的所述视觉注意力向量。
  9. 根据权利要求1所述的方法,所述提取所述目标图像的语义特征集合,包括:
    获取所述目标图像的语义特征向量;
    基于所述语义特征向量,提取所述目标图像的所述语义特征集合。
  10. 根据权利要求9所述的方法,所述基于所述语义特征向量,提取所述目标图像的所述语义特征集合,包括:
    基于所述语义特征向量,从词汇库中提取所述目标图像对应的属性词集合;所述属性词集合是指对所述目标图像进行描述的候选描述词的集合;
    将所述属性词集合所对应的词向量集合,获取为所述目标图像的所述语义特征集合。
  11. 根据权利要求10所述的方法,所述基于所述语义特征向量,从词汇库中提取所述目标图像对应的属性词集合,包括:
    基于所述语义特征向量,获取所述词汇库中各个词汇的匹配概率;所述匹配概率是指所述词汇库中的词汇与所述目标图像相匹配的概率;
    提取所述词汇库中,所述匹配概率大于匹配概率阈值的词汇为所述候选描述词,以组成所述属性词集合。
  12. 根据权利要求10所述的方法,所述基于所述语义特征向量,从词汇库中提取所述目标图像对应的属性词集合,包括:
    将所述语义特征向量输入到词汇检测器中,获得所述词汇检测器基于所述语义特征向量从所述词汇库中提取到的所述属性词集合;
    其中,所述词汇检测器是通过多示例学习的弱监督方法训练获得的词汇检测模型。
  13. 根据权利要求1所述的方法,在提取所述目标图像的视觉特征集合之前,所述方法还包括:
    对所述目标图像进行子区域划分,获得至少一个子区域;
    所述提取所述目标图像的视觉特征集合,包括:
    分别提取所述至少一个子区域的视觉特征,组成所述视觉特征集合。
  14. 一种信息生成装置,所述装置包括:
    图像获取模块,用于获取目标图像;
    特征提取模块,用于提取所述目标图像的语义特征集合,以及,提取所述目标图像的视觉特征集合;
    描述词获取模块,用于在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词;所述注意力融合的过程在第t个时间步上的输入包括所述第t个时间步上的语义注意力向量、所述第t个时间步上的视觉注意力向量、以及所述注意力融合的过程在第t-1个时间步上的输出结果;所述第t个时间步上的所述语义注意力向量是在所述第t个时间步上对所述语义特征集合进行注意力机制处理获得的;所述第t个时间步上的所述视觉注意力向量是在所述第t个时间步上对所述视觉特征集合进行注意力机制处理获得的;所述注意力融合的过程在所述第t-1个时间步上的所述输出结果用于指示所述第t-1个时间步上的描述词;所述第t个时间步是所述n个时间步中的任意一个;1≤t≤n,且t、n均为正整数;
    信息生成模块,用于基于所述目标图像在所述n个时间步上的描述词,生成所述目标图像的图像描述信息。
  15. 根据权利要求14所述的装置,所述描述词获取模块,用于通过信息生成模型中的注意力融合网络,在n个时间步上对所述目标图像的语义特征和所述目标图像的视觉特征进行注意力融合,获取所述n个时间步上的描述词。
  16. 根据权利要求15所述的装置,所述描述词获取模块,用于,
    在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、所述第t-1个时间步上的隐藏层向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量;
    或者,
    在所述第t个时间步上,将所述第t个时间步上的所述语义注意力向量、所述第t个时间步上的所述视觉注意力向量、以及所述注意力融合网络在第t-1个时间步上的输出结果输入至所述注意力融合网络,获得所述注意力融合网络在所述第t个时间步上的所述输出结果,以及所述第t个时间步上的所述隐藏层向量。
  17. 根据权利要求15或16所述的装置,所述注意力融合网络中设置有超参数,所述超参数用以指示所述视觉注意力向量与所述语义注意力向量分别在所述注意力融合网络中的权重。
  18. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行以实现如权利要求1至13任一所述的信息生成方法。
  19. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至13任一所述的信息生成方法。
  20. 一种计算机程序产品,所述计算机程序产品包括至少一条计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至13任一所述的信息生成方法。
PCT/CN2022/073372 2021-01-29 2022-01-24 信息生成方法、装置、设备、存储介质及程序产品 WO2022161298A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023523236A JP2023545543A (ja) 2021-01-29 2022-01-24 情報生成方法、装置、コンピュータ機器、記憶媒体及びコンピュータプログラム
US18/071,481 US20230103340A1 (en) 2021-01-29 2022-11-29 Information generating method and apparatus, device, storage medium, and program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110126753.7 2021-01-29
CN202110126753.7A CN113569892A (zh) 2021-01-29 2021-01-29 图像描述信息生成方法、装置、计算机设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/071,481 Continuation US20230103340A1 (en) 2021-01-29 2022-11-29 Information generating method and apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2022161298A1 true WO2022161298A1 (zh) 2022-08-04

Family

ID=78161062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073372 WO2022161298A1 (zh) 2021-01-29 2022-01-24 信息生成方法、装置、设备、存储介质及程序产品

Country Status (4)

Country Link
US (1) US20230103340A1 (zh)
JP (1) JP2023545543A (zh)
CN (1) CN113569892A (zh)
WO (1) WO2022161298A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687674A (zh) * 2022-12-20 2023-02-03 昆明勤砖晟信息科技有限公司 服务于智慧云服务平台的大数据需求分析方法及系统
CN116416440A (zh) * 2023-01-13 2023-07-11 北京百度网讯科技有限公司 目标识别方法、模型训练方法、装置、介质和电子设备
CN117742546A (zh) * 2023-12-29 2024-03-22 广东福临门世家智能家居有限公司 基于悬浮窗的智能家居控制方法及系统

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569892A (zh) * 2021-01-29 2021-10-29 腾讯科技(深圳)有限公司 图像描述信息生成方法、装置、计算机设备及存储介质
CN114627353B (zh) * 2022-03-21 2023-12-12 北京有竹居网络技术有限公司 一种图像描述生成方法、装置、设备、介质及产品
CN114693790B (zh) * 2022-04-02 2022-11-18 江西财经大学 基于混合注意力机制的自动图像描述方法与系统
CN117237834A (zh) * 2022-06-08 2023-12-15 华为技术有限公司 图像描述方法、电子设备及计算机可读存储介质
CN115238111B (zh) * 2022-06-15 2023-11-14 荣耀终端有限公司 一种图片显示方法及电子设备
CN116453120B (zh) * 2023-04-19 2024-04-05 浪潮智慧科技有限公司 基于时序场景图注意力机制的图像描述方法、设备及介质
CN116388184B (zh) * 2023-06-05 2023-08-15 南京信息工程大学 一种基于风速日波动特征的超短期风速修订方法、系统
CN117454016B (zh) * 2023-12-21 2024-03-15 深圳须弥云图空间科技有限公司 基于改进点击预测模型的对象推荐方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563498A (zh) * 2017-09-08 2018-01-09 中国石油大学(华东) 基于视觉与语义注意力相结合策略的图像描述方法及系统
CN107608943A (zh) * 2017-09-08 2018-01-19 中国石油大学(华东) 融合视觉注意力和语义注意力的图像字幕生成方法及系统
CN110472642A (zh) * 2019-08-19 2019-11-19 齐鲁工业大学 基于多级注意力的细粒度图像描述方法及系统
US20200193245A1 (en) * 2018-12-17 2020-06-18 Sri International Aligning symbols and objects using co-attention for understanding visual content
CN113569892A (zh) * 2021-01-29 2021-10-29 腾讯科技(深圳)有限公司 图像描述信息生成方法、装置、计算机设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563498A (zh) * 2017-09-08 2018-01-09 中国石油大学(华东) 基于视觉与语义注意力相结合策略的图像描述方法及系统
CN107608943A (zh) * 2017-09-08 2018-01-19 中国石油大学(华东) 融合视觉注意力和语义注意力的图像字幕生成方法及系统
US20200193245A1 (en) * 2018-12-17 2020-06-18 Sri International Aligning symbols and objects using co-attention for understanding visual content
CN110472642A (zh) * 2019-08-19 2019-11-19 齐鲁工业大学 基于多级注意力的细粒度图像描述方法及系统
CN113569892A (zh) * 2021-01-29 2021-10-29 腾讯科技(深圳)有限公司 图像描述信息生成方法、装置、计算机设备及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687674A (zh) * 2022-12-20 2023-02-03 昆明勤砖晟信息科技有限公司 服务于智慧云服务平台的大数据需求分析方法及系统
CN116416440A (zh) * 2023-01-13 2023-07-11 北京百度网讯科技有限公司 目标识别方法、模型训练方法、装置、介质和电子设备
CN116416440B (zh) * 2023-01-13 2024-02-06 北京百度网讯科技有限公司 目标识别方法、模型训练方法、装置、介质和电子设备
CN117742546A (zh) * 2023-12-29 2024-03-22 广东福临门世家智能家居有限公司 基于悬浮窗的智能家居控制方法及系统

Also Published As

Publication number Publication date
JP2023545543A (ja) 2023-10-30
US20230103340A1 (en) 2023-04-06
CN113569892A (zh) 2021-10-29

Similar Documents

Publication Publication Date Title
WO2022161298A1 (zh) 信息生成方法、装置、设备、存储介质及程序产品
JP7179183B2 (ja) ビデオキャプションの生成方法、装置、デバイスおよびコンピュータプログラム
EP3951617A1 (en) Video description information generation method, video processing method, and corresponding devices
CN110234018B (zh) 多媒体内容描述生成方法、训练方法、装置、设备及介质
US20230082605A1 (en) Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium
CN112949622B (zh) 融合文本与图像的双模态性格分类方法及装置
EP3937072A1 (en) Video sequence selection method, computer device and storage medium
US11868738B2 (en) Method and apparatus for generating natural language description information
CN114339450B (zh) 视频评论生成方法、系统、设备及存储介质
CN113380271B (zh) 情绪识别方法、系统、设备及介质
CN115329779A (zh) 一种多人对话情感识别方法
CN116050496A (zh) 图片描述信息生成模型的确定方法及装置、介质、设备
US11216497B2 (en) Method for processing language information and electronic device therefor
JP2022075668A (ja) ビデオ処理方法、装置、デバイスおよび記憶媒体
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
JP7483532B2 (ja) キーワード抽出装置、キーワード抽出方法及びキーワード抽出プログラム
Mishra et al. Environment descriptor for the visually impaired
Guo et al. Attention-based visual-audio fusion for video caption generation
CN116913278B (zh) 语音处理方法、装置、设备和存储介质
WO2023238722A1 (ja) 情報作成方法、情報作成装置、及び動画ファイル
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN116702094B (zh) 一种群体应用偏好特征表示方法
Li et al. Video-based multimodal personality analysis
Varma et al. Light weight Real Time Indian Sign Language Symbol Recognition with Captioning and Speech Output
Ralhan et al. Qualitative content analysis in visual question answering-based datasets and algorithms

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22745173

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023523236

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.12.2023)