WO2022206094A1 - Method and apparatus for generating a subtitler and outputting subtitles - Google Patents

Method and apparatus for generating a subtitler and outputting subtitles

Info

Publication number
WO2022206094A1
WO2022206094A1 (PCT/CN2022/070476; CN2022070476W)
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
image
sample
subtitler
objects
Prior art date
Application number
PCT/CN2022/070476
Other languages
English (en)
French (fr)
Inventor
潘滢炜
李业豪
姚霆
梅涛
Original Assignee
京东科技控股股份有限公司
Priority date
Filing date
Publication date
Application filed by 京东科技控股股份有限公司
Priority to JP2023559796A (published as JP2024512628A)
Priority to US18/284,225 (published as US20240177506A1)
Publication of WO2022206094A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/30 Writer recognition; Reading and verifying signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 Spoof detection, e.g. liveness detection

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular to methods and apparatuses for generating a subtitler and for outputting subtitles.
  • Image captioning is an emerging and rapidly growing research topic: the technique of automatically describing images with natural-language sentences.
  • Embodiments of the present disclosure propose methods and apparatuses for generating a subtitler and methods and apparatuses for outputting subtitles.
  • An embodiment of the present disclosure provides a method for generating a subtitler, including: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator and outputting an object set; grouping the object set into a first object set and a second object set, wherein the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator and, in the decoding step, performing beam search with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and training the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the subtitler.
  • In some embodiments, the method further includes optimizing the subtitler in at least one of the following ways: performing adversarial training of the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
  • In some embodiments, performing adversarial training of the subtitler with the sentence discriminator includes: extracting a preset first sample set, wherein each first sample includes an image and a corresponding real sentence; and extracting a pre-established generative adversarial network, wherein the generative adversarial network includes the subtitler and a sentence discriminator.
  • The subtitler encodes an input image and then decodes it into a sentence to obtain a pseudo sentence.
  • The sentence discriminator determines whether an input sentence is a pseudo sentence output by the subtitler. Based on a machine learning method, a first sample is selected from the first sample set and the following first training step is performed: inputting the image of the selected first sample into the subtitler and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence of the selected first sample into the sentence discriminator and outputting a discrimination result; computing the accuracy of the sentence discriminator from the output discrimination results; and, if the accuracy reaches a preset value, determining that training of the subtitler is complete.
  • In some embodiments, the method further includes: if the accuracy does not reach the preset value, calculating the adversarial loss of the sentence discriminator, adjusting the relevant parameters of the sentence discriminator to reduce the adversarial loss, reselecting a first sample from the first sample set, and continuing the first training step.
  • In some embodiments, the method further includes: if the accuracy does not reach the preset value, calculating the adversarial reward of the subtitler, adjusting the relevant parameters of the subtitler to increase the adversarial reward, reselecting a first sample from the first sample set, and continuing the first training step.
  • In some embodiments, optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs includes: extracting a preset second sample set, wherein each second sample includes an image; and, based on a machine learning method, selecting samples from the second sample set and performing the following second training step: inputting the image of the selected second sample into the image encoder of the subtitler and outputting a sample object set; inputting the sample object set into the sentence decoder of the subtitler and outputting a pseudo sentence; computing the average confidence score of the sample objects of the sample object set that are contained in the pseudo sentence as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches a preset inclusion reward threshold, determining that training of the subtitler is complete.
  • In some embodiments, the method further includes: if the object inclusion reward does not reach the preset inclusion reward threshold, adjusting the relevant parameters of the subtitler to increase the object inclusion reward, reselecting a second sample from the second sample set, and continuing the second training step.
  • In some embodiments, optimizing the subtitler by the semantic correlation between image triplets and the correspondingly generated sentences includes: extracting a preset third sample set, wherein each third sample includes a query image, a positive image and a negative image, the positive image sharing at least two objects with the query image and the negative image having no object in common with the query image; and, based on a machine learning method, selecting a third sample from the third sample set and performing the following third training step: inputting the query image, positive image and negative image of the selected third sample into the subtitler and outputting a query sentence, a positive sentence and a negative sentence; calculating a first semantic similarity between the query sentence and the positive sentence and a second semantic similarity between the query sentence and the negative sentence; calculating a self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determining that training of the subtitler is complete.
  • In some embodiments, the method further includes: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjusting the relevant parameters of the subtitler to reduce the self-supervised triplet loss, reselecting a third sample from the third sample set, and continuing the third training step.
  • In some embodiments, calculating the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence includes: for the query sentence, the positive sentence and the negative sentence, respectively computing the object-based probability distribution of each word in the sentence and performing a max-pooling operation to obtain a query sentence feature, a positive sentence feature and a negative sentence feature; and calculating the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
  • In some embodiments, the method further includes: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjusting the relevant parameters of the subtitler so that the weighted sum decreases.
  • In some embodiments, the sentence decoder includes a two-layer LSTM with a region-level attention mechanism, where the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, and the second-layer LSTM is a language model for generating sentences.
  • Embodiments of the present disclosure also provide a method for outputting subtitles, including: acquiring an image to be processed; and inputting the image into a subtitler generated by the above-mentioned method for generating a subtitler and outputting the subtitle corresponding to the image.
  • Embodiments of the present disclosure also provide an apparatus for generating a subtitler, comprising: an acquisition unit configured to acquire a sample image set; an encoding unit configured to input the sample image set into an image encoder of a sentence generator and output an object set; a grouping unit configured to group the object set into a first object set and a second object set, wherein the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; a decoding unit configured to input the object set output by the image encoder into a sentence decoder of the sentence generator and, in the decoding step, perform beam search with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and a training unit configured to train the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the subtitler.
  • In some embodiments, the apparatus further includes an optimization unit configured to optimize the subtitler in at least one of the following ways: performing adversarial training of the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
  • In some embodiments, the optimization unit is further configured to: extract a preset first sample set, wherein each first sample includes an image and a corresponding real sentence; and extract a pre-established generative adversarial network, wherein the generative adversarial network includes the subtitler and a sentence discriminator.
  • The subtitler encodes an input image and then decodes it into a sentence to obtain a pseudo sentence, and the sentence discriminator determines whether an input sentence is a pseudo sentence output by the subtitler.
  • Based on the machine learning device, a first sample is selected from the first sample set and the following first training step is performed: inputting the image of the selected first sample into the subtitler and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence of the selected first sample into the sentence discriminator and outputting a discrimination result; computing the accuracy of the sentence discriminator from the output discrimination results; and, if the accuracy reaches a preset value, determining that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the accuracy does not reach the preset value, calculate the adversarial loss of the sentence discriminator, adjust the relevant parameters of the sentence discriminator to reduce the adversarial loss, reselect a first sample from the first sample set, and continue the first training step.
  • In some embodiments, the optimization unit is further configured to: if the accuracy does not reach the preset value, calculate the adversarial reward of the subtitler, adjust the relevant parameters of the subtitler to increase the adversarial reward, reselect a first sample from the first sample set, and continue the first training step.
  • In some embodiments, the optimization unit is further configured to: extract a preset second sample set, wherein each second sample includes an image; and, based on the machine learning device, select samples from the second sample set and perform the following second training step: input the image of the selected second sample into the image encoder of the subtitler and output a sample object set; input the sample object set into the sentence decoder of the subtitler and output a pseudo sentence; compute the average confidence score of the sample objects of the sample object set that are contained in the pseudo sentence as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches a preset inclusion reward threshold, determine that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the object inclusion reward does not reach the preset inclusion reward threshold, adjust the relevant parameters of the subtitler to increase the object inclusion reward, reselect a second sample from the second sample set, and continue the second training step.
  • In some embodiments, the optimization unit is further configured to: extract a preset third sample set, wherein each third sample includes a query image, a positive image and a negative image, the positive image sharing at least two objects with the query image while the negative image has no object in common with the query image; and, based on the machine learning device, select a third sample from the third sample set and perform the following third training step: input the query image, positive image and negative image of the selected third sample into the subtitler and output a query sentence, a positive sentence and a negative sentence; calculate a first semantic similarity between the query sentence and the positive sentence and a second semantic similarity between the query sentence and the negative sentence; calculate a self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determine that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjust the relevant parameters of the subtitler to reduce the self-supervised triplet loss, reselect a third sample from the third sample set, and continue the third training step.
  • In some embodiments, the optimization unit is further configured to: for the query sentence, the positive sentence and the negative sentence, respectively compute the object-based probability distribution of each word in the sentence and perform a max-pooling operation to obtain a query sentence feature, a positive sentence feature and a negative sentence feature; and calculate the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
  • In some embodiments, the optimization unit is further configured to: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjust the relevant parameters of the subtitler so that the weighted sum decreases.
  • In some embodiments, the sentence decoder includes a two-layer LSTM with a region-level attention mechanism, where the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, and the second-layer LSTM is a language model for generating sentences.
  • An embodiment of the present disclosure provides an apparatus for outputting subtitles, including: an acquisition unit configured to acquire an image to be processed; and an output unit configured to input the image into a subtitler generated by the above-mentioned method for generating a subtitler and output the subtitle corresponding to the image.
  • Embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device on which one or more computer programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method for generating a subtitler as described above.
  • Embodiments of the present disclosure provide a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating a subtitler as described above.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;
  • FIG. 2 is a flowchart of one embodiment of a method for generating a subtitler according to the present disclosure
  • FIG. 3 is a schematic diagram of an application scenario of the method for generating a subtitler according to the present disclosure;
  • FIG. 4 is a flowchart of one embodiment of a method for outputting subtitles according to the present disclosure;
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a subtitler according to the present disclosure;
  • FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for outputting subtitles according to the present disclosure
  • FIG. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.
  • FIG. 1 illustrates an exemplary system architecture 100 to which the method for generating a subtitler, the apparatus for generating a subtitler, the method for outputting subtitles, or the apparatus for outputting subtitles of embodiments of the present disclosure may be applied.
  • the system architecture 100 may include terminals 101 and 102 , a network 103 , a database server 104 and a server 105 .
  • the network 103 is used as a medium for providing a communication link between the terminals 101 , 102 , the database server 104 and the server 105 .
  • the network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user 110 can use the terminals 101, 102 to interact with the server 105 through the network 103 to receive or send messages and the like.
  • Various client applications may be installed on the terminals 101 and 102 , such as model training applications, subtitle generation applications, image processing applications, shopping applications, payment applications, web browsers, and instant messaging tools.
  • the terminals 101 and 102 here may be hardware or software.
  • When the terminals 101 and 102 are hardware, they can be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, laptop computers, desktop computers, and the like.
  • When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
  • an image acquisition device may also be installed thereon.
  • the image capturing device may be various devices that can realize the function of capturing images, such as a camera, a sensor, and the like.
  • the user 110 can use the image capture devices on the terminals 101 and 102 to capture images of various scenes.
  • Database server 104 may be a database server that provides various services.
  • a sample set may be stored in a database server.
  • the sample set contains a large number of samples.
  • the samples may include sample images and sentences corresponding to the sample images.
  • the user 110 can also select samples from the sample set stored in the database server 104 through the terminals 101 and 102 .
  • the server 105 may also be a server that provides various services, such as a background server that provides support for various applications displayed on the terminals 101 and 102 .
  • the background server can use the samples in the sample set sent by the terminals 101 and 102 to train the initial model, and can send the training results (such as the generated captioners) to the terminals 101 and 102 . In this way, the user can apply the generated subtitler to generate subtitles for the images.
  • the database server 104 and the server 105 here can also be hardware or software. When they are hardware, they can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When they are software, they can be implemented as multiple software or software modules (eg, to provide distributed services), or as a single software or software module. There is no specific limitation here.
  • It should be noted that the method for generating a subtitler or the method for outputting subtitles provided by the embodiments of the present disclosure is generally executed by the server 105. Accordingly, the apparatus for generating a subtitler or the apparatus for outputting subtitles is generally also provided in the server 105.
  • When the server 105 can implement the relevant functions of the database server 104, the database server 104 may be omitted from the system architecture 100.
  • terminals, networks, database servers and servers in FIG. 1 are merely illustrative. There can be any number of terminals, networks, database servers, and servers according to implementation needs.
  • the method for generating a subtitler may include the following steps:
  • Step 201 acquiring a sample image set.
  • the execution body of the method for generating a subtitler may acquire a pre-stored set of sample images from a database server.
  • the image captured by the terminal can also be acquired from the terminal as a sample image.
  • Step 202 input the sample image set into the image encoder of the sentence generator, and output the object set.
  • the sentence generator is the initial subtitler, which is a neural network that converts input images into sentences.
  • the sentence generator may include an image encoder and a sentence decoder.
  • the image encoder generates an intermediate representation for each input image.
  • In this embodiment, the present disclosure uses a common object detection model (Faster R-CNN) as the image encoder to detect objects in the image; other image encoders may also be used in practical applications. The encoder encodes each image I_i into a set of salient image regions containing K detected objects, such as people, flowers, grass, trees, chairs and dogs.
  • Step 203 Group the object set into a first object set and a second object set.
  • the first object set is an object set included in the predetermined object set
  • the second object set is an object set excluded from the predetermined object set.
  • The identified objects are regrouped, based on the 80 most common detected object categories in the COCO dataset, into the set of objects that must be included and the set of objects to be excluded. For example, if the predetermined object set includes houses, cars, people, flowers, grass and trees, and the detected object set includes people, flowers, grass, trees, chairs and dogs, then the first object set (objects included in the predetermined object set) includes people, flowers, grass and trees, while the second object set (the excluded object set) includes chairs and dogs.
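  • As an illustration (not part of the original disclosure), this grouping step can be sketched in Python roughly as follows; the variable names and the contents of the predetermined object set are assumptions for the example.

```python
# Hedged sketch of step 203: splitting detected objects into the first (included)
# and second (excluded) object sets. Names and sets are illustrative only.
predetermined_objects = {"house", "car", "person", "flower", "grass", "tree"}
detected_objects = ["person", "flower", "grass", "tree", "chair", "dog"]

# First object set: detected objects that appear in the predetermined object set.
first_object_set = [o for o in detected_objects if o in predetermined_objects]
# Second object set: detected objects excluded from the predetermined object set.
second_object_set = [o for o in detected_objects if o not in predetermined_objects]

print(first_object_set)   # ['person', 'flower', 'grass', 'tree']
print(second_object_set)  # ['chair', 'dog']
```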
  • Step 204 Input the object set output by the image encoder into the sentence decoder of the sentence generator.
  • the first object set and the second object set are used as constraints to perform beam search to generate a pseudo-image sentence pair set.
  • the sentence decoder is used to decode the output sentence word by word.
  • the sentence decoder can be implemented as a two-layer LSTM (Long Short-Term Memory) with a region-level attention mechanism.
  • The first-layer LSTM (LSTM1) acts as a top-down attention module that computes object-level attention based on contextual information.
  • The second-layer LSTM (LSTM2) is a language model for generating sentences.
  • The hidden state of the second-layer LSTM, the mean of the encoded image features, and the input word w_{t-1} are treated as context information and fed into the first-layer LSTM, from which the hidden state of the first-layer LSTM is obtained.
  • The k-th element of the attention vector at step t represents the attention probability of the image region v_k.
  • W_E is a linear embedding matrix that projects the hidden state into the vocabulary space for word prediction.
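  • The formulas are not reproduced in this text; the following LaTeX sketch gives the standard two-layer top-down attention decoder that this description appears to follow. The symbols W_emb, W_v, W_h, w_a and λ_{t,k} are assumed notation, not taken from the original.

```latex
% Hedged reconstruction of the two-layer attention decoder described above.
h^{1}_{t} = \mathrm{LSTM}_{1}\!\left(\left[\,h^{2}_{t-1},\ \bar{v},\ W_{emb}\,w_{t-1}\right],\ h^{1}_{t-1}\right),
\qquad \bar{v} = \frac{1}{K}\sum_{k=1}^{K} v_{k}

\lambda_{t,k} = \mathrm{softmax}_{k}\!\left(w_{a}^{\top}\tanh\!\left(W_{v} v_{k} + W_{h} h^{1}_{t}\right)\right),
\qquad \hat{v}_{t} = \sum_{k=1}^{K} \lambda_{t,k}\, v_{k}

h^{2}_{t} = \mathrm{LSTM}_{2}\!\left(\left[\,\hat{v}_{t},\ h^{1}_{t}\right],\ h^{2}_{t-1}\right),
\qquad p\!\left(w_{t} \mid w_{1:t-1}\right) = \mathrm{softmax}\!\left(W_{E}\, h^{2}_{t}\right)
```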
  • A natural way to generate pseudo image-sentence pairs with a pretrained subtitler is to employ beam search, a heuristic search algorithm that maintains a beam B_t of the b most probable partial sentences at each decoding step.
  • However, the semantic correlation between input images and output sentences is not fully exploited for sentence generation at inference time.
  • the present disclosure devises a semantically constrained beam search that restructures the beam search to ensure that identified objects are included and extraneous objects are excluded.
  • A set of recognized objects is output by the object detection model (e.g., Faster R-CNN), where O_k is the recognized object with the highest confidence score in the k-th image region, and the corresponding confidence score is retained for each object.
  • To this end, a finite state machine is constructed, and a search beam is maintained for each state a ∈ A of the finite state machine.
  • At each decoding step, the b most probable partial word sequences in the candidate set are kept in order to update each beam, where V is the dictionary and w_{1:t-1} denotes an output sentence of length t-1.
  • The state transition function of the finite state machine only looks up words from the vocabulary, excluding unrelated objects, to expand the current partial word sequence. The finite state machine is therefore designed so that a word sequence in an accepting state must satisfy the inclusion condition while excluding all extraneous objects.
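  • A minimal Python sketch of this semantically constrained beam search is given below. It assumes a scoring function log_prob and represents FSM states simply as the subset of required objects already generated; the disclosure's exact formulation may differ.

```python
# Hedged sketch of semantically constrained beam search: beams are kept per FSM
# state, excluded objects are never expanded, and only sequences that reach the
# accepting state (all required objects mentioned) are returned.
def constrained_beam_search(log_prob, vocab, include, exclude, beam_size=3, max_len=16):
    """log_prob(prefix, word) -> log-probability of extending `prefix` with `word`."""
    include, exclude = set(include), set(exclude)
    allowed = [w for w in vocab if w not in exclude]      # excluded objects never appear
    beams = {frozenset(): [((), 0.0)]}                    # state -> [(prefix, score), ...]
    for _ in range(max_len):
        candidates = {}
        for state, partials in beams.items():
            for prefix, score in partials:
                for w in allowed:
                    nxt = frozenset(state | {w}) if w in include else state   # FSM transition
                    candidates.setdefault(nxt, []).append(
                        (prefix + (w,), score + log_prob(prefix, w)))
        # keep the b most probable partial sentences for each FSM state
        beams = {s: sorted(c, key=lambda x: -x[1])[:beam_size] for s, c in candidates.items()}
    return beams.get(frozenset(include), [])              # accepting state only
```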
  • Each pseudo image-sentence pair in the pseudo image-sentence pair set includes an image and a sentence; the image and the sentence do not need to come from a manually paired dataset.
  • Step 205 using the pseudo-image sentence pair set as a sample set to train a sentence generator to obtain a subtitler.
  • The resulting collection of pseudo image-sentence pairs, containing the generated pseudo sentences, can be used to train the subtitler directly with a cross-entropy loss over the parameters of the sentence decoder.
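  • A hedged sketch of that cross-entropy objective, with assumed notation (pseudo sentence \hat{S}_i = (\hat{w}_1, ..., \hat{w}_T) for image I_i and decoder parameters θ):

```latex
% Cross-entropy over pseudo image-sentence pairs (notation assumed, not from the original).
\mathcal{L}_{XE}(\theta) \;=\; -\sum_{i}\sum_{t=1}^{T}
  \log p_{\theta}\!\left(\hat{w}_{t}\mid \hat{w}_{1:t-1},\, I_{i}\right)
```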
  • In some embodiments, the method further includes optimizing the subtitler in at least one of the following ways: performing adversarial training of the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the identified objects are included in the sentences output by the subtitler; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
  • The subtitler can be optimized by any one of the above ways, by a combination of any two of them, or by all three ways together.
  • In some optional implementations, performing adversarial training on the subtitler with a sentence discriminator to optimize the subtitler includes: extracting a preset first sample set, wherein each first sample includes an image and a corresponding real sentence; extracting a pre-established generative adversarial network, wherein the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler encodes the input image and then decodes it into a pseudo sentence, and the sentence discriminator determines whether an input sentence is a pseudo sentence output by the subtitler; and, based on a machine learning method, selecting a first sample from the first sample set and performing the following first training step: inputting the image of the selected first sample into the subtitler and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence of the selected first sample into the sentence discriminator and outputting a discrimination result; computing the accuracy of the sentence discriminator from the output discrimination results; and, if the accuracy reaches a preset value, determining that training of the subtitler is complete.
  • If the accuracy does not reach the preset value, the adversarial loss of the sentence discriminator is calculated, the relevant parameters of the sentence discriminator are adjusted to reduce the adversarial loss, a first sample is reselected from the first sample set, and the first training step is continued.
  • If the accuracy does not reach the preset value, the adversarial reward of the subtitler is calculated, the relevant parameters of the subtitler are adjusted to increase the adversarial reward, a first sample is reselected from the first sample set, and the first training step is continued.
  • the structure of the sentence discriminator is shown in Figure 3.
  • the sentence discriminator and subtitler (including the image encoder and sentence decoder) make up the generative adversarial network.
  • the sentence discriminator is used to distinguish whether the input sentence is a real sentence in the unpaired sentence dataset or a pseudo sentence generated by the subtitler.
  • The sentence discriminator can be implemented as a recurrent neural network (RNN); for example, an LSTM can contextually encode the word sequence into a sentence-level representation, which is then passed through a fully connected layer, whose embedding matrix is W_FC, to identify real versus generated sentences.
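  • A minimal PyTorch sketch of such an LSTM sentence discriminator is shown below; the layer sizes and class name are illustrative assumptions, not the disclosure's exact architecture.

```python
# Hedged sketch of an LSTM-based sentence discriminator: encode the word sequence
# into a sentence-level representation, then map it to a real/generated probability.
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)              # plays the role of W_FC

    def forward(self, word_ids):                        # word_ids: (batch, seq_len) token indices
        emb = self.embed(word_ids)
        _, (h_n, _) = self.lstm(emb)                    # final hidden state = sentence-level representation
        return torch.sigmoid(self.fc(h_n.squeeze(0)))   # probability that the sentence is real

# Usage sketch: disc = SentenceDiscriminator(vocab_size=10000); p_real = disc(token_batch)
```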
  • The sentence discriminator judges whether an input sentence is a real sentence or a pseudo sentence generated by the subtitler, and the correctness of its judgments is tallied. If the accuracy reaches a preset value (for example, 0.5), the subtitler forges sentences well enough to fool the sentence discriminator, and training ends. Otherwise, the network parameters of the subtitler and the sentence discriminator need further training: the parameters of the subtitler can first be fixed while the parameters of the sentence discriminator are adjusted, and then the parameters of the sentence discriminator are fixed while the parameters of the subtitler are adjusted. The parameters of the sentence discriminator and the subtitler are adjusted alternately until both are trained; the component used in the actual application is the subtitler.
  • Adversarial reward: to generate sentences that are indistinguishable from human-written subtitles, the present disclosure employs adversarial training and a sentence-level adversarial reward to match the distribution of generated sentences with the distribution of human-written sentences.
  • The image captioner is treated as a sentence generator that captures the data distribution in order to generate sentences. The sentence discriminator D takes as input a sentence randomly selected from the real sentences or from the generated sentences, and produces a probability distribution over the two sentence sources (i.e., generated or real).
  • the image captioner and sentence discriminator are trained in a two-player game.
  • The sentence discriminator D is optimized by correctly distinguishing real sentences {S_i} from generated sentences, i.e., by minimizing the adversarial loss.
  • The image captioner learns by maximizing the adversarial reward r_adv, aiming to fool the sentence discriminator with its generated sentences.
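  • In standard GAN notation (a hedged reconstruction, since the original formulas are not reproduced here), these two objectives can be written as:

```latex
% Discriminator loss and generator (captioner) adversarial reward, standard GAN form.
\mathcal{L}_{adv}(D) \;=\; -\,\mathbb{E}_{S\sim\text{real}}\!\left[\log D(S)\right]
                      \;-\; \mathbb{E}_{\hat{S}\sim G_{\theta}}\!\left[\log\!\left(1 - D(\hat{S})\right)\right]

r_{adv}(\hat{S}) \;=\; \log D(\hat{S})
```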
  • the accuracy of the subtitler can be improved by generative adversarial networks.
  • In some optional implementations, optimizing the subtitler according to the degree to which the objects identified by the subtitler are included in the sentences it outputs includes: extracting a preset second sample set, wherein each second sample includes an image.
  • Samples are selected from the second sample set, and the following second training step is performed: inputting the image of the selected second sample into the image encoder of the subtitler and outputting a sample object set; inputting the sample object set into the sentence decoder of the subtitler and outputting a pseudo sentence; computing the average confidence score of the sample objects of the sample object set that are contained in the pseudo sentence as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches a preset inclusion reward threshold, determining that training of the subtitler is complete.
  • If the object inclusion reward does not reach the preset inclusion reward threshold, the relevant parameters of the subtitler are adjusted to increase the object inclusion reward, a second sample is reselected from the second sample set, and the second training step is continued.
  • The present disclosure further regards the degree to which the identified objects are included in the output sentence as an additional self-supervised objective, i.e., the object inclusion reward, to encourage the subtitler to generate sentences that contain the identified objects. In this way the semantic correlation between the image and the sentence is emphasized, enhancing the quality of the generated subtitles.
  • Specifically, the present disclosure constructs a set of objects to be included from all of the identified objects. Given a generated sentence, the object inclusion reward is constructed from the mean confidence score of the objects of this set that are contained in the sentence: an indicator function tests whether each recognized object appears in the sentence, and each object is weighted by its corresponding confidence score, so objects with low confidence contribute correspondingly less.
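  • A hedged sketch of such a reward, with assumed notation (O_k is the k-th identified object, s_k its confidence score, and \hat{S} the generated sentence):

```latex
% Object inclusion reward: confidence-weighted mean over identified objects that
% appear in the generated sentence (notation assumed, not from the original).
r_{obj}(\hat{S}) \;=\; \frac{1}{K}\sum_{k=1}^{K} s_{k}\,\mathbb{1}\!\left[\,O_{k} \in \hat{S}\,\right]
```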
  • In some optional implementations, optimizing the subtitler through the semantic correlation between image triplets and the correspondingly generated sentences includes: extracting a preset third sample set, wherein each third sample includes a query image, a positive image and a negative image, the positive image sharing at least two objects with the query image while the negative image has no object in common with the query image; and, based on the machine learning method, selecting a third sample from the third sample set and performing the following third training step: inputting the query image, positive image and negative image of the selected third sample into the subtitler and outputting a query sentence, a positive sentence and a negative sentence; calculating a first semantic similarity between the query sentence and the positive sentence and a second semantic similarity between the query sentence and the negative sentence; calculating a self-supervised triplet loss from the two similarities; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determining that training of the subtitler is complete.
  • If the self-supervised triplet loss is not less than the predetermined loss threshold, the relevant parameters of the subtitler are adjusted to reduce the self-supervised triplet loss, a third sample is reselected from the third sample set, and the third training step is continued.
  • Calculating the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence includes: for the query sentence, the positive sentence and the negative sentence, computing the object-based probability distribution of each word in the sentence and performing a max-pooling operation to obtain a query sentence feature, a positive sentence feature and a negative sentence feature; and then calculating the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
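  • The following NumPy sketch illustrates this feature construction; the word-to-object probability model object_prob and the use of cosine similarity are assumptions made for the example.

```python
# Hedged sketch: build a sentence feature by max-pooling per-word object
# distributions, then compare sentences with cosine similarity.
import numpy as np

def sentence_feature(words, object_prob):
    """object_prob(word) -> probability distribution over the object vocabulary."""
    dists = np.stack([object_prob(w) for w in words])   # (num_words, num_objects)
    return dists.max(axis=0)                             # element-wise max pooling

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# first / second semantic similarity between the query sentence and the positive /
# negative sentences (word lists and object_prob are placeholders):
# sim_pos = cosine(sentence_feature(query_words, object_prob), sentence_feature(pos_words, object_prob))
# sim_neg = cosine(sentence_feature(query_words, object_prob), sentence_feature(neg_words, object_prob))
```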
  • Self-supervised triplet loss: optimization with the object inclusion reward exploits only the relation between each individual image and its corresponding generated sentence, and ignores the semantic relevance, i.e., the similarity or dissimilarity, between images.
  • the present disclosure designs a self-supervised triplet loss to semantically constrain the learning of the subtitler in a triplet manner, aiming to preserve the relative semantic order between sentences.
  • Each image triplet (consisting of a query image, a positive image, and a negative image) is constructed based on the visual objects identified in the images. The positive image shares at least two objects with the query image, while the negative image and the query image have no objects in common. Given such an image triplet, the subtitler is optimized so that the sentence generated for the query image is more similar to the sentence generated for the positive image than to the sentence generated for the negative image.
  • Formally, each triplet contains a query image I_i, a positive image and a negative image, and the self-supervised triplet loss is calculated from the semantic similarities between the sentences generated for the three images, as sketched below.
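  • A hedged margin-based form of this loss (the margin m, the feature notation f and the similarity function are assumptions, not the original formula):

```latex
% Self-supervised triplet loss over sentence features of the query (q), positive (+)
% and negative (-) sentences; margin m assumed.
\mathcal{L}_{tri} \;=\; \max\!\Big(0,\; m \;-\; \mathrm{sim}\!\left(f_{q}, f_{+}\right)
                                   \;+\; \mathrm{sim}\!\left(f_{q}, f_{-}\right)\Big)
```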
  • The final training of the entire model can combine the adversarial reward, the object inclusion reward and the self-supervised triplet loss in self-critical sequence training.
  • The gradient of the overall objective is approximated as sketched below, where the rewards are evaluated on the sampled sentence and b denotes the baseline obtained by combining the adversarial and object inclusion rewards.
  • λ1, λ2 and λ3 represent the weights of the adversarial reward, the object inclusion reward and the self-supervised triplet loss, respectively, and any of the weights may be 0.
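  • A hedged sketch of the combined self-critical gradient (with \hat{S}^{s} the sampled sentence, b the reward baseline, and the λ weights as above):

```latex
% Combined objective gradient in self-critical sequence training (reconstruction).
\nabla_{\theta}\mathcal{L} \;\approx\;
  -\left(\lambda_{1}\, r_{adv}(\hat{S}^{s}) + \lambda_{2}\, r_{obj}(\hat{S}^{s}) - b\right)
   \nabla_{\theta}\log p_{\theta}(\hat{S}^{s})
  \;+\; \lambda_{3}\,\nabla_{\theta}\mathcal{L}_{tri}
```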
  • FIG. 3 is a schematic diagram of an application scenario of the method for generating a subtitler according to this embodiment.
  • The query image, positive image, and negative image are input into the image encoder (Faster R-CNN) of the subtitler to obtain the object set {tree, man, bench, grass, dog, ...}.
  • The sentence decoder of the subtitler (the lower two-layer LSTM structure in Figure 3) performs beam search decoding constrained by the semantics of the object set and generates pseudo sentences such as "a man sitting on a bench near a tree".
  • pseudo-sentences and corresponding images are used as a set of pseudo-image-sentence pairs for training the subtitler (the upper two-layer LSTM structure in Figure 3 represents the sentence decoder).
  • During this training, the parameters of the image encoder can be fixed and only the sentence decoder trained; alternatively, the image encoder can be trained after training of the sentence decoder is completed, or the image encoder and the sentence decoder can be trained alternately, to obtain the best-performing subtitler.
  • Cross-entropy loss is used during this training.
  • The obtained parameters of the upper two-layer LSTM structure can be shared with the lower two-layer LSTM structure.
  • Adversarial reward optimization: the real sentence "a cow stands in the back of a large truck" is input into the sentence discriminator together with a pseudo sentence generated by the subtitler. If the discrimination accuracy does not reach 0.5, the parameters of the sentence discriminator are adjusted in the direction that minimizes the adversarial loss, and then the parameters of the subtitler are adjusted in the direction that maximizes the adversarial reward.
  • the subtitler can be optimized by alternately training (adjusting) the sentence discriminator and subtitler.
  • Object inclusion reward optimization: the degree to which the identified objects are included in the pseudo sentences generated by the subtitler is calculated.
  • the recognized objects include tree, man, and bench. If sentence 1 includes tree (confidence of 0.9) and sentence 2 includes tree (confidence of 0.8) and man (confidence of 0.7), the object inclusion reward of sentence 2 is higher than that of sentence 1.
  • the purpose of training is to maximize the object inclusion reward, and each adjustment of the parameters increases the object inclusion reward.
  • Self-supervised triplet loss optimization: the input samples in Figure 3 can be triplets of a query image, a positive image, and a negative image. Different images generate different pseudo sentences, and the self-supervised triplet loss is determined by comparing the semantic similarities among the query sentence, the positive sentence, and the negative sentence. The purpose of training is to reduce the self-supervised triplet loss, so that the positive sentence is semantically closer to the query sentence and the negative sentence is semantically unrelated to the query sentence.
  • The present disclosure adopts a self-learning mode and optimizes the entire model by alternately performing the two processes of pseudo image-sentence pair generation and subtitler retraining, so as to iteratively and cyclically improve the subtitler.
  • The present disclosure proposes a self-learning framework based on semantic constraints and studies in depth the self-learning idea for unpaired image captioning. The problem is approached by establishing a loop of pseudo-sentence generation and iterative optimization, gradually improving the quality of sentence generation. Furthermore, semantic constraints are integrated into the model, which fully utilizes the semantics of the objects in the image to guide the training of the captioner, resulting in an advanced unsupervised captioning technique.
  • the process 400 of the method for outputting subtitles includes the following steps:
  • Step 401 acquiring an image to be processed.
  • In this embodiment, the electronic device on which the method for outputting subtitles runs may receive the image to be processed, through a wired or wireless connection, from the terminal on which the user performs subtitle editing.
  • The image to be processed may be an individual image or a video file.
  • If it is a video file, the server splits the video into frames to obtain the images to be processed.
  • Step 402 Input the image into the subtitler, and output the subtitle corresponding to the image.
  • the subtitler is obtained by training according to the method of steps 201-205.
  • Subtitles can be automatically added to images through the subtitler.
  • The subtitles can be output directly onto the image, or a separate subtitle file can be generated and returned to the terminal; the terminal can then format the subtitles according to the user's needs and render them onto the image.
  • The subtitler not only outputs subtitles but can also output the objects recognized by the image encoder, which can be used as semantic constraints during training.
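  • For illustration only, the flow of steps 401-402 can be sketched as follows; the Subtitler object and its encode/decode methods are hypothetical names standing in for the trained model.

```python
# Hedged sketch of the subtitle-output flow (steps 401-402).
from PIL import Image

def output_subtitle(subtitler, image_path):
    image = Image.open(image_path).convert("RGB")   # step 401: acquire the image to be processed
    objects = subtitler.encode(image)               # image encoder also exposes the recognized objects
    subtitle = subtitler.decode(objects)            # step 402: decode the subtitle for the image
    return subtitle, objects                        # objects can be reused as semantic constraints

# subtitle, objects = output_subtitle(trained_subtitler, "frame_0001.jpg")
```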
  • Steps 401-402 may be performed alternately with steps 201-205.
  • the subtitles generated in steps 401-402 can be used as training samples in steps 201-205.
  • The process 400 of the method for outputting subtitles in this embodiment embodies the application steps of the subtitler. The solution described in this embodiment can therefore generate training samples with the subtitler and then use them to train it, alternately generating subtitles and retraining the subtitler, thereby optimizing the subtitler and improving the accuracy of the generated subtitles.
  • The present disclosure provides an embodiment of an apparatus for generating a subtitler; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • the apparatus 500 for generating a subtitler in this embodiment includes: an obtaining unit 501 , an encoding unit 502 , a grouping unit 503 , a decoding unit 504 and a training unit 505 .
  • The obtaining unit 501 is configured to obtain a sample image set; the encoding unit 502 is configured to input the sample image set into the image encoder of the sentence generator and output an object set; the grouping unit 503 is configured to group the object set into a first object set and a second object set, wherein the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; the decoding unit 504 is configured to input the object set output by the image encoder into the sentence decoder of the sentence generator and, in the decoding step, perform beam search with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and the training unit 505 is configured to train the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the subtitler.
  • In some embodiments, the apparatus further includes an optimization unit (not shown in the drawings) configured to optimize the subtitler in at least one of the following ways: performing adversarial training of the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
  • In some embodiments, the optimization unit is further configured to: extract a preset first sample set, wherein each first sample includes an image and a corresponding real sentence; and extract a pre-established generative adversarial network, wherein the generative adversarial network includes the subtitler and a sentence discriminator.
  • The subtitler encodes the input image and then decodes it into a sentence to obtain a pseudo sentence, and the sentence discriminator determines whether an input sentence is a pseudo sentence output by the subtitler.
  • Based on the machine learning device, a first sample is selected from the first sample set and the following first training step is performed: inputting the image of the selected first sample into the subtitler and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence of the selected first sample into the sentence discriminator and outputting a discrimination result; computing the accuracy of the sentence discriminator from the output discrimination results; and, if the accuracy reaches a preset value, determining that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the accuracy does not reach the preset value, calculate the adversarial loss of the sentence discriminator, adjust the relevant parameters of the sentence discriminator to reduce the adversarial loss, reselect a first sample from the first sample set, and continue the first training step.
  • In some embodiments, the optimization unit is further configured to: if the accuracy does not reach the preset value, calculate the adversarial reward of the subtitler, adjust the relevant parameters of the subtitler to increase the adversarial reward, reselect a first sample from the first sample set, and continue the first training step.
  • In some embodiments, the optimization unit is further configured to: extract a preset second sample set, wherein each second sample includes an image; and, based on the machine learning device, select samples from the second sample set and perform the following second training step: input the image of the selected second sample into the image encoder of the subtitler and output a sample object set; input the sample object set into the sentence decoder of the subtitler and output a pseudo sentence; compute the average confidence score of the sample objects of the sample object set that are contained in the pseudo sentence as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches a preset inclusion reward threshold, determine that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the object inclusion reward does not reach the preset inclusion reward threshold, adjust the relevant parameters of the subtitler to increase the object inclusion reward, reselect a second sample from the second sample set, and continue the second training step.
  • In some embodiments, the optimization unit is further configured to: extract a preset third sample set, wherein each third sample includes a query image, a positive image and a negative image, the positive image sharing at least two objects with the query image while the negative image has no object in common with the query image; and, based on the machine learning device, select a third sample from the third sample set and perform the following third training step: input the query image, positive image and negative image of the selected third sample into the subtitler and output a query sentence, a positive sentence and a negative sentence; calculate a first semantic similarity between the query sentence and the positive sentence and a second semantic similarity between the query sentence and the negative sentence; calculate a self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determine that training of the subtitler is complete.
  • In some embodiments, the optimization unit is further configured to: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjust the relevant parameters of the subtitler to reduce the self-supervised triplet loss, reselect a third sample from the third sample set, and continue the third training step.
  • In some embodiments, the optimization unit is further configured to: for the query sentence, the positive sentence and the negative sentence, respectively compute the object-based probability distribution of each word in the sentence and perform a max-pooling operation to obtain a query sentence feature, a positive sentence feature and a negative sentence feature; and calculate the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
  • In some embodiments, the optimization unit is further configured to: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjust the relevant parameters of the subtitler so that the weighted sum decreases.
  • In some embodiments, the sentence decoder includes a two-layer LSTM with a region-level attention mechanism, wherein the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, and the second-layer LSTM is a language model for generating sentences.
  • the present disclosure provides an embodiment of an apparatus for outputting subtitles.
  • the apparatus embodiment corresponds to the method embodiment shown in FIG. 4 .
  • the device can be specifically applied to various electronic devices.
  • The apparatus 600 for outputting subtitles in this embodiment includes: an acquiring unit 601 configured to acquire an image to be processed, and an output unit configured to input the image into the subtitler generated as described above and output the subtitle corresponding to the image.
  • the present disclosure also provides an electronic device and a readable storage medium.
  • FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 700 includes a computing unit 701, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700.
  • the computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
  • An input/output (I/O) interface 705 is also connected to bus 704 .
  • Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • Computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like.
  • the computing unit 701 performs the various methods and processes described above, for example, the method for generating a subtitler.
  • In some embodiments, the method for generating a subtitler may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708.
  • In some embodiments, part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709.
  • When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method for generating a subtitler described above may be performed.
  • Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the method for generating a subtitler.
  • the method and apparatus for generating a subtitler and the method and apparatus for outputting subtitles provided in the embodiments of the present disclosure aim to provide an unsupervised solution for image captioning. Unlike existing image captioning methods, which rely heavily on a large number of image-sentence pairs for training, the present disclosure eliminates this dependence by learning an image captioner in a self-learning manner. The captioner can be trained with unpaired image and sentence data, to pursue more realistic scenarios.
  • Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor.
  • the programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user may interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
  • a computer system can include clients and servers. Clients and servers are generally remote from each other and usually interact through a communication network. The relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • the server can be a distributed system server, or a server combined with a blockchain.
  • the server can also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for generating a captioner, and a method and apparatus for outputting captions. The method for generating a captioner includes: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator and outputting an object set; grouping the object set into a first object set and a second object set, where the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing beam search in the decoding step with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and training the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the captioner.

Description

用于生成字幕器以及输出字幕的方法和装置
相关申请的交叉引用
本公开要求于2021年3月30日提交的申请号为202110338045.X、发明名称为“用于生成字幕器以及输出字幕的方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本公开的实施例涉及计算机技术领域,具体涉及用于生成字幕器以及输出字幕的方法和装置。
背景技术
图像字幕是一个新兴且发展迅速的研究主题,它是一种用自然语言句子自动描述图像的技术。
相关技术中,大部分在带注释的图像-句子对上训练字幕器,他们都遵循先利用卷积神经网络对输入图像进行编码,然后利用循环神经网络对句子进行解码的编解码器范式。一系列的工作都在升级图像字幕的注意机制,以增强视觉内容和自然句子之间的跨域基础。
相关技术严重依赖于大量的训练图像句子对,一方面,这些训练图像句子对的获取是极为昂贵且耗时的。另一方面,过分依赖训练图像句子对,阻碍了字幕器的广泛应用。
发明内容
本公开的实施例提出了用于生成字幕器的方法和装置以及用于输出字幕的方法和装置。
本公开的实施例提供了一种用于生成字幕器的方法,包括:获取样本图像集;将样本图像集输入句子生成器的图像编码器,输出对象集;将对象集分组成第一对象集和第二对象集,其中,第一对象集为被包含在预定对象集内的对象集,第二对象集为被排除在预定对象集外的对象集;将图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以第一对象集、第二对象集为约束条件进行波束搜索,生成伪图像句子对集;将伪图像句子 对集作为样本集训练句子生成器,得到字幕器。
在一些实施例中,该方法还包括:通过以下至少一种方式优化字幕器:通过句子鉴别器对字幕器进行对抗式训练来优化字幕器;通过字幕器识别出的对象在字幕器输出的句子中的包含程度优化字幕器;通过图像三元组与相应生成的句子之间的语义相关性优化字幕器,其中,图像三元组包括查询图像,正图像和负图像。
在一些实施例中,通过句子鉴别器对字幕器进行对抗式训练来优化字幕器,包括:提取预置的第一样本集,其中,每个第一样本包括图像和对应的真句子;提取预先建立的生成式对抗网络,其中,生成式对抗网络包括字幕器和句子鉴别器,字幕器用于对所输入的图像进行图像编码后再进行句子解码,得到伪句子,句子鉴别器用于确定所输入的句子是否为字幕器所输出的伪句子;基于机器学习方法,从第一样本集中选取第一样本,以及执行以下第一训练步骤:将选取的第一样本中的图像输入字幕器,输出伪句子;将伪句子和选取的第一样本中的真句子输入句子鉴别器,输入鉴别结果;根据输出的鉴别结果统计句子鉴别器的准确率;若准确率达到预设数值,则确定出字幕器训练完成。
在一些实施例中,该方法还包括:若准确率未达到预设数值,则计算句子鉴别器的对抗性损失,调整句子鉴别器的相关参数使得对抗性损失减小,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在一些实施例中,该方法还包括:若准确率未达到预设数值,则计算字幕器的对抗性奖励,调整字幕器的相关参数使得对抗性奖励增大,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在一些实施例中,通过字幕器识别出的对象在字幕器输出的句子中的包含程度优化字幕器,包括:提取预置的第二样本集,其中,每个第二样本包括图像;基于机器学习方法,从第二样本集中选取样本,以及执行以下第二训练步骤:将选取的第二样本中的图像输入字幕器的图像编码器,输出样本对象集;将样本对象集输入字幕器的句子解码器,输出伪句子;计算伪句子中包含样本对象集中的样本对象的置信度均值分数,作为伪句子的对象包含奖励;若对象包含奖励达到预设包含奖励阈值,则确定出字幕器训练完成。
在一些实施例中,该方法还包括:若对象包含奖励未达到预设包含奖励阈值,则调整字幕器的相关参数使得对象包含奖励增大,以及从第二样本集 中重新选取第二样本,继续执行第二训练步骤。
在一些实施例中,通过图像三元组与相应生成的句子之间的语义相关性优化字幕器,包括:提取预置的第三样本集,其中,每个第三样本包括查询图像、正图像和负图像,正图像与查询图像共享至少两个对象,而负图像和查询图像没有任何共同的对象;基于机器学习方法,从第三样本集中选取第三样本,以及执行以下第三训练步骤:将选取的第三样本中的查询图像、正图像和负图像分别输入字幕器,输出查询句子、正句子和负句子;计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度;根据第一语义相似度和第二语义相似度计算自监督三重态损失;若自监督三重态损失小于预定损失阈值,则确定出字幕器训练完成。
在一些实施例中,该方法还包括:若自监督三重态损失不小于预定损失阈值,则调整字幕器的相关参数使得自监督三重态损失减小,以及从第三样本集中重新选取第三样本,继续执行第三训练步骤。
在一些实施例中,计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度,包括:对于查询句子、正句子和负句子,分别计算句子中每一个词的基于对象的概率分布,进行最大池化操作,分别得到查询句子特征、正句子特征和负句子特征;计算查询句子特征和正句子特征的第一语义相似度以及计算查询句子特征和负句子特征的第二语义相似度。
在一些实施例中,该方法还包括:若对抗性奖励、对象包含奖励和自监督三重态损失的加权和大于预定目标值,则调整字幕器的相关参数使得加权和减小。
在一些实施例中,图像编码器包括具有区域级别注意机制的两层LSTM,其中,第一层LSTM充当自上而下的注意模块,根据上下文信息计算对象级别的注意,而第二层LSTM是用于生成句子的语言模型。
本公开的实施例还提供了一种用于输出字幕的方法,包括:获取待处理的图像;将图像输入采用如上所述的用于生成字幕器的方法生成的字幕器中,输出图像对应的字幕。
本公开的实施例还提供了一种用于生成字幕器的装置,包括:获取单元,被配置成获取样本图像集;编码单元,被配置成将样本图像集输入句子生成器的图像编码器,输出对象集;分组单元,被配置成将对象集分组成第一对 象集和第二对象集,其中,第一对象集为被包含在预定对象集内的对象集,第二对象集为被排除在预定对象集外的对象集;解码单元,被配置成将图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以第一对象集、第二对象集为约束条件进行波束搜索,生成伪图像句子对集;训练单元,被配置成将伪图像句子对集作为样本集训练句子生成器,得到字幕器。
在一些实施例中,该装置还包括优化单元,被配置成:通过以下至少一种方式优化字幕器:通过句子鉴别器对字幕器进行对抗式训练来优化字幕器;通过字幕器识别出的对象在字幕器输出的句子中的包含程度优化字幕器;通过图像三元组与相应生成的句子之间的语义相关性优化字幕器,其中,图像三元组包括查询图像,正图像和负图像。
在一些实施例中,优化单元进一步被配置成:提取预置的第一样本集,其中,每个第一样本包括图像和对应的真句子;提取预先建立的生成式对抗网络,其中,生成式对抗网络包括字幕器和句子鉴别器,字幕器用于对所输入的图像进行图像编码后再进行句子解码,得到伪句子,句子鉴别器用于确定所输入的句子是否为字幕器所输出的伪句子;基于机器学习装置,从第一样本集中选取第一样本,以及执行以下第一训练步骤:将选取的第一样本中的图像输入字幕器,输出伪句子;将伪句子和选取的第一样本中的真句子输入句子鉴别器,输入鉴别结果;根据输出的鉴别结果统计句子鉴别器的准确率;若准确率达到预设数值,则确定出字幕器训练完成。
在一些实施例中,优化单元进一步被配置成:若准确率未达到预设数值,则计算句子鉴别器的对抗性损失,调整句子鉴别器的相关参数使得对抗性损失减小,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在一些实施例中,优化单元进一步被配置成:若准确率未达到预设数值,则计算字幕器的对抗性奖励,调整字幕器的相关参数使得对抗性奖励增大,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在一些实施例中,优化单元进一步被配置成:提取预置的第二样本集,其中,每个第二样本包括图像;基于机器学习装置,从第二样本集中选取样本,以及执行以下第二训练步骤:将选取的第二样本中的图像输入字幕器的图像编码器,输出样本对象集;将样本对象集输入字幕器的句子解码器,输出伪句子;计算伪句子中包含样本对象集中的样本对象的置信度均值分数,作为伪句子的对象包含奖励;若对象包含奖励达到预设包含奖励阈值,则确 定出字幕器训练完成。
在一些实施例中,优化单元进一步被配置成:若对象包含奖励未达到预设包含奖励阈值,则调整字幕器的相关参数使得对象包含奖励增大,以及从第二样本集中重新选取第二样本,继续执行第二训练步骤。
在一些实施例中,优化单元进一步被配置成:提取预置的第三样本集,其中,每个第三样本包括查询图像、正图像和负图像,正图像与查询图像共享至少两个对象,而负图像和查询图像没有任何共同的对象;基于机器学习装置,从第三样本集中选取第三样本,以及执行以下第三训练步骤:将选取的第三样本中的查询图像、正图像和负图像分别输入字幕器,输出查询句子、正句子和负句子;计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度;根据第一语义相似度和第二语义相似度计算自监督三重态损失;若自监督三重态损失小于预定损失阈值,则确定出字幕器训练完成。
在一些实施例中,优化单元进一步被配置成:若自监督三重态损失不小于预定损失阈值,则调整字幕器的相关参数使得自监督三重态损失减小,以及从第三样本集中重新选取第三样本,继续执行第三训练步骤。
在一些实施例中,优化单元进一步被配置成:对于查询句子、正句子和负句子,分别计算句子中每一个词的基于对象的概率分布,进行最大池化操作,分别得到查询句子特征、正句子特征和负句子特征;计算查询句子特征和正句子特征的第一语义相似度以及计算查询句子特征和负句子特征的第二语义相似度。
在一些实施例中,优化单元进一步被配置成:若对抗性奖励、对象包含奖励和自监督三重态损失的加权和大于预定目标值,则调整字幕器的相关参数使得加权和减小。
在一些实施例中,图像编码器包括具有区域级别注意机制的两层LSTM,其中,第一层LSTM充当自上而下的注意模块,根据上下文信息计算对象级别的注意,而第二层LSTM是用于生成句子的语言模型。
本公开的实施例提供了一种用于输出字幕的装置,包括:获取单元,被配置成获取待处理的图像;输出单元,被配置成将图像输入采用如上所述的用于生成字幕器的方法生成的字幕器中,输出图像对应的字幕。
本公开的实施例提供了一种电子设备,包括:一个或多个处理器;存储 装置,其上存储有一个或多个计算机程序,当一个或多个计算机程序被一个或多个处理器执行,使得一个或多个处理器实现如上所述的用于生成字幕器的方法。
本公开的实施例提供了一种计算机可读介质,其上存储有计算机程序,其中,计算机程序被处理器执行时实现如上所述的用于生成字幕器的方法。
附图说明
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本公开的其它特征、目的和优点将会变得更明显:
图1是本公开的一个实施例可以应用于其中的示例性系统架构图;
图2是根据本公开的用于生成字幕器的方法的一个实施例的流程图;
图3是根据本公开的用于生成字幕器的方法的一个应用场景的示意图;
图4是根据本公开的用于输出字幕的方法的一个实施例的流程图;
图5是根据本公开的用于生成字幕器的装置的一个实施例的结构示意图;
图6是根据本公开的用于输出字幕的装置的一个实施例的结构示意图;
图7是适于用来实现本公开的实施例的电子设备的计算机系统的结构示意图。
具体实施方式
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释相关方案,而非对该方案的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与有关方案相关的部分。
需要说明的是,在不冲突的情况下,本公开中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本公开。
图1示出了可以应用本公开实施例的用于生成字幕器的方法、用于生成字幕器的装置、用于输出字幕的方法或用于输出字幕的装置的示例性系统架构100。
如图1所示,系统架构100可以包括终端101、102,网络103、数据库服务器104和服务器105。网络103用以在终端101、102,数据库服务器104与服务器105之间提供通信链路的介质。网络103可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
用户110可以使用终端101、102通过网络103与服务器105进行交互,以接收或发送消息等。终端101、102上可以安装有各种客户端应用,例如模型训练类应用、字幕生成类应用、图像处理类应用、购物类应用、支付类应用、网页浏览器和即时通讯工具等。
这里的终端101、102可以是硬件,也可以是软件。当终端101、102为硬件时,可以是具有显示屏的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、膝上型便携计算机和台式计算机等等。当终端101、102为软件时,可以安装在上述所列举的电子设备中。其可以实现成多个软件或软件模块(例如用来提供分布式服务),也可以实现成单个软件或软件模块。在此不做具体限定。
当终端101、102为硬件时,其上还可以安装有图像采集设备。图像采集设备可以是各种能实现采集图像功能的设备,如摄像头、传感器等等。用户110可以利用终端101、102上的图像采集设备,来采集各种场景的图像。
数据库服务器104可以是提供各种服务的数据库服务器。例如数据库服务器中可以存储有样本集。样本集中包含有大量的样本。其中,样本可以包括样本图像以及与样本图像对应的句子。这样,用户110也可以通过终端101、102,从数据库服务器104所存储的样本集中选取样本。
服务器105也可以是提供各种服务的服务器,例如对终端101、102上显示的各种应用提供支持的后台服务器。后台服务器可以利用终端101、102发送的样本集中的样本,对初始模型进行训练,并可以将训练结果(如生成的字幕器)发送给终端101、102。这样,用户可以应用生成的字幕器为图像生成字幕。
这里的数据库服务器104和服务器105同样可以是硬件,也可以是软件。当它们为硬件时,可以实现成多个服务器组成的分布式服务器集群,也可以实现成单个服务器。当它们为软件时,可以实现成多个软件或软件模块(例如用来提供分布式服务),也可以实现成单个软件或软件模块。在此不做具体限定。
需要说明的是,本公开实施例所提供的用于生成字幕器的方法或用于输出字幕的方法一般由服务器105执行。相应地,用于生成字幕器的装置或用于输出字幕的装置一般也设置于服务器105中。
需要指出的是,在服务器105可以实现数据库服务器104的相关功能的情况下,系统架构100中可以不设置数据库服务器104。
应该理解,图1中的终端、网络、数据库服务器和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端、网络、数据库服务器和服务器。
继续参见图2,其示出了根据本公开的用于生成字幕器的方法的一个实施例的流程200。该用于生成字幕器的方法可以包括以下步骤:
步骤201,获取样本图像集。
在本实施例中,用于生成字幕器的方法的执行主体(例如图1所示的服务器)可以从数据库服务器获取预先存储的样本图像集。也可从终端获取终端拍摄的图像作为样本图像。
步骤202,将样本图像集输入句子生成器的图像编码器,输出对象集。
在本实施例中,句子生成器是初始的字幕器,是一种将输入的图像转换成句子的神经网络。句子生成器可包括图像编码器和句子解码器。
图像编码器为每个输入图像生成中间表示，本公开以最常见的对象检测模型（Faster R-CNN）作为图像编码器来检测图像中的对象，实际应用中还可使用其它的图像编码器。将每个图像 $I_i$ 编码成一组显著图像区域 $\{v_k\}_{k=1}^{K}$，其中包含 $K$ 个检测到的对象，例如，人、花、草、树、椅子、狗等。
步骤203,将对象集分组成第一对象集和第二对象集。
在本实施例中，第一对象集为被包含在预定对象集内的对象集，第二对象集为被排除在预定对象集外的对象集。从技术上讲，给定输入图像 $I_i$，通过对象检测模型（例如Faster R-CNN）输出识别对象集 $\{(o_k, s_k)\}_{k=1}^{K}$，其中 $o_k$ 是第 $k$ 个图像区域中具有最高置信度得分的被识别对象，$s_k$ 是相应的置信度得分。以COCO数据集中最常见的80个检测对象为基准，将识别到的对象重新分组成需要被包含在内的对象集 $\mathcal{O}_i^{inc}$ 和被排除在外的对象集 $\mathcal{O}_i^{exc}$。
例如,预定对象集内包括房屋、汽车、人、花、草、树,而对象集包括人、花、草、树、椅子、狗,则第一对象集(被包含在预定对象集内的对象集)包括人、花、草、树,而第二对象集(被排除在外的对象集)包括椅子、狗。
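为了更具体地说明步骤203中的分组操作，下面给出一个与上述示例对应的最小Python草图（仅为示意，并非本公开的原始实现；检测结果的数据格式以及 `coco80` 变量中列出的类别子集均为说明性假设，并非完整的COCO 80类列表）：

```python
def group_detected_objects(detections, predetermined_objects):
    """Split the detector output into the objects to include and the objects to exclude.

    detections: iterable of (object_word, confidence_score) pairs from the image encoder.
    predetermined_objects: the reference vocabulary, e.g. the 80 most common COCO categories.
    """
    include, exclude = {}, {}
    for word, score in detections:
        (include if word in predetermined_objects else exclude)[word] = score
    return include, exclude

# Example matching the text: chair and dog fall outside the predetermined set.
coco80 = {"person", "flower", "grass", "tree", "house", "car"}   # illustrative subset only
inc, exc = group_detected_objects(
    [("person", 0.95), ("flower", 0.8), ("grass", 0.7), ("tree", 0.9), ("chair", 0.6), ("dog", 0.85)],
    coco80,
)
print(sorted(inc))  # ['flower', 'grass', 'person', 'tree']
print(sorted(exc))  # ['chair', 'dog']
```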
步骤204,将图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以第一对象集、第二对象集为约束条件进行波束搜索,生成伪图像句子对集。
在本实施例中，给定图像编码器产生的中间表示，利用句子解码器对输出的句子逐字进行解码。参考自下而上的注意力模型（Bottom-up and Top-Down），可将句子解码器实现为具有区域级别注意机制的两层LSTM（Long Short-Term Memory，长短期记忆网络）。第一层LSTM（LSTM 1）充当自上而下的注意模块，根据上下文信息计算对象级别的注意，而第二层LSTM（LSTM 2）是用于生成句子的语言模型。在每个解码步骤 $t$ 处，第二层LSTM的隐藏状态 $h_{t-1}^{2}$、编码图像特征的均值 $\bar{v}=\frac{1}{K}\sum_{k=1}^{K}v_k$ 以及输入字 $w_{t-1}$ 被视为上下文信息，并馈入第一层LSTM，由此得到第一层LSTM的隐藏状态：

$$h_t^{1}=\mathrm{LSTM}_1\big(\big[h_{t-1}^{2};\,\bar{v};\,W_{\mu}w_{t-1}\big],\,h_{t-1}^{1}\big)$$

其中 $W_{\mu}$ 是单词嵌入矩阵，$w_{t-1}$ 是词编码。基于隐藏状态 $h_t^{1}$，测量所有 $K$ 个图像区域的注意力分布如下：

$$a_{t,k}=W_a\tanh\big(W_{va}v_k+W_{ha}h_t^{1}\big),\qquad \lambda_t=\mathrm{softmax}(a_t)$$

其中 $a_{t,k}$ 表示 $a_t$ 的第 $k$ 个元素，$W_{va}$、$W_{ha}$ 和 $W_a$ 是变换矩阵。利用注意力分布加权得到图像特征：

$$\hat{v}_t=\sum_{k=1}^{K}\lambda_{t,k}\,v_k$$

其中 $\lambda_{t,k}$ 是 $\lambda_t$ 的第 $k$ 个元素，代表了图像区域 $v_k$ 的注意力概率。接着串联图像特征 $\hat{v}_t$ 和隐藏状态 $h_t^{1}$ 做为第二层LSTM的输入，得到第二层LSTM的隐藏状态以及下一个词的概率分布：

$$h_t^{2}=\mathrm{LSTM}_2\big(\big[\hat{v}_t;\,h_t^{1}\big],\,h_{t-1}^{2}\big),\qquad P\big(w_t\mid w_{1:t-1}\big)=\mathrm{softmax}\big(W_E\,h_t^{2}\big)$$

其中 $W_E$ 是一个线性嵌入矩阵，它将 $h_t^{2}$ 投影到词汇空间以进行单词预测。
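为了更直观地说明上述两层注意力LSTM解码器的单步计算流程，下面给出一个最小化的PyTorch风格草图（仅为示意，并非本公开的原始实现；模块名 `TwoLayerAttnDecoderStep`、各维度取值等均为说明性假设）：

```python
import torch
import torch.nn as nn

class TwoLayerAttnDecoderStep(nn.Module):
    """One decoding step of the two-layer LSTM with region-level attention."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                          # W_mu
        self.lstm1 = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)   # top-down attention LSTM
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)               # language LSTM
        self.w_va = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w_ha = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)
        self.w_e = nn.Linear(hidden_dim, vocab_size)                              # W_E

    def forward(self, v, w_prev, state1, state2):
        # v: (B, K, feat_dim) region features; w_prev: (B,) previous word ids
        v_mean = v.mean(dim=1)                                                    # mean-pooled image feature
        x1 = torch.cat([state2[0], v_mean, self.embed(w_prev)], dim=1)
        h1, c1 = self.lstm1(x1, state1)
        # region-level attention over the K detected regions
        a = self.w_a(torch.tanh(self.w_va(v) + self.w_ha(h1).unsqueeze(1))).squeeze(-1)  # (B, K)
        lam = torch.softmax(a, dim=1)
        v_hat = (lam.unsqueeze(-1) * v).sum(dim=1)                                # attended image feature
        h2, c2 = self.lstm2(torch.cat([v_hat, h1], dim=1), state2)
        logits = self.w_e(h2)            # softmax over these logits gives P(w_t | w_{1:t-1})
        return logits, (h1, c1), (h2, c2)
```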
通过预训练的字幕器生成伪图像句子对,一种自然的方法是采用波束搜索(beam search),它是一种启发式搜索算法,它在每个解码步骤保持波束B t,包含b个最可能的部分句子。但是,输入图像和输出句子之间的语义相关性并未完全用于推理时的句子生成。为了缓解这个问题,本公开设计了语义约束波束搜索,重新构造了波束搜索以确保包含识别的对象并排除无关的对象。
从技术上讲，给定输入图像 $I_i$，通过对象检测模型（例如Faster R-CNN）输出识别对象集 $\{(o_k, s_k)\}_{k=1}^{K}$，其中 $o_k$ 是第 $k$ 个图像区域中具有最高置信度得分的被识别对象，$s_k$ 是相应的置信度得分。以COCO数据集中最常见的80个检测对象为基准，将识别到的对象重新分组成需要被包含在内的对象集 $\mathcal{O}_i^{inc}$ 和被排除在外的对象集 $\mathcal{O}_i^{exc}$。将识别到 $\mathcal{O}_i^{inc}$ 中的对象和排除掉 $\mathcal{O}_i^{exc}$ 中的对象做为一种约束条件，并且用有限状态机（Finite-state machine）来执行这种约束，这样有限状态机就可以识别出满足所有对象包含约束的词序列，然后将波束搜索算法与有限状态机相结合。具体地说，在有限状态机中为每个状态 $a\in A$ 保持搜索波束 $\mathcal{B}_t^{a}$，在每个解码步骤 $t$，通过在候选集 $\mathcal{C}_t^{a}$ 中保留 $b$ 个最可能的部分字序列来更新每个波束 $\mathcal{B}_t^{a}$：

$$\mathcal{C}_t^{a}=\bigcup_{a'\in A:\;\delta(a',\,w_t)=a}\big\{\,(w_{1:t-1},w_t)\;:\;w_{1:t-1}\in\mathcal{B}_{t-1}^{a'},\;w_t\in V\,\big\},\qquad \mathcal{B}_t^{a}=\operatorname*{top\text{-}b}_{\,w_{1:t}\in\mathcal{C}_t^{a}}\;\log P\big(w_{1:t}\mid I_i\big)$$

其中 $w_{1:t-1}$ 表示输出句子长度为 $t-1$，$V$ 是词典，$\delta(\cdot,\cdot)$ 是有限状态机中的状态转移函数。这里只从词汇表中查找单词（没有 $\mathcal{O}_i^{exc}$ 中的无关对象），以扩展当前的部分单词序列。因此，有限状态机的设计要求接受状态的词序列必须满足包含条件，同时排除所有无关对象。
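下面给出一个将波束搜索与有限状态约束相结合的简化Python草图（仅为示意，并非本公开的原始实现）：有限状态机的状态用"已生成的必含对象集合"表示，被排除的对象直接禁止扩展，只有覆盖全部必含对象的状态才被视为接受状态；打分函数 `log_prob_next` 是对上文句子解码器的占位假设。

```python
import heapq

def constrained_beam_search(log_prob_next, include, exclude, beam_size=3, max_len=16, eos="<eos>"):
    """Beam search whose FSM state records which required objects the partial sentence already contains.

    log_prob_next(prefix) -> dict of word -> log-probability (stand-in for the sentence decoder).
    include / exclude: sets of object words that must / must not appear in the output sentence.
    """
    include, exclude = frozenset(include), frozenset(exclude)
    beams = {frozenset(): [(0.0, [])]}                 # one beam per FSM state: (score, prefix)

    for _ in range(max_len):
        candidates = {}
        for state, hyps in beams.items():
            for score, prefix in hyps:
                if prefix and prefix[-1] == eos:       # finished hypotheses are carried over unchanged
                    candidates.setdefault(state, []).append((score, prefix))
                    continue
                for word, lp in log_prob_next(prefix).items():
                    if word in exclude:                # hard constraint: never expand with excluded objects
                        continue
                    new_state = state | {word} if word in include else state
                    candidates.setdefault(new_state, []).append((score + lp, prefix + [word]))
        # keep the b most probable partial sequences per FSM state
        beams = {s: heapq.nlargest(beam_size, hyps, key=lambda x: x[0]) for s, hyps in candidates.items()}

    # only the accepting state (all required objects included) yields valid sentences
    finished = beams.get(include, [])
    return max(finished, key=lambda x: x[0])[1] if finished else None
```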
伪图像句子对集中每个伪图像句子对包括图像和句子,图像和句子可以是非配对的。
步骤205,将伪图像句子对集作为样本集训练句子生成器,得到字幕器。
在本实施例中，可用 $\{(I_i,\hat{S}_i)\}$ 表示伪图像句子对的集合，其中 $\hat{S}_i=\{\hat{w}_1,\hat{w}_2,\dots,\hat{w}_T\}$ 代表产生的伪句子。利用这些伪图像句子对，可以直接训练一个具有以下交叉熵损失的字幕器：

$$L_{XE}(\theta)=-\sum_{t=1}^{T}\log P_{\theta}\big(\hat{w}_t\mid \hat{w}_{1:t-1},\,I_i\big)$$

这里 $\theta$ 表示句子解码器中的参数。
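结合前文的解码器草图，这一交叉熵训练步骤可以示意如下（仅为示意；批处理方式、优化器等均为说明性假设）：

```python
import torch
import torch.nn.functional as F

def xe_training_step(decoder, optimizer, regions, caption_ids, pad_id=0):
    """One teacher-forced cross-entropy update on a batch of pseudo image-sentence pairs.

    regions: (B, K, feat_dim) region features from the image encoder.
    caption_ids: (B, T) token ids of the pseudo sentences, starting with <bos>.
    """
    B, T = caption_ids.shape
    h = regions.new_zeros(B, decoder.lstm1.hidden_size)
    state1, state2 = (h.clone(), h.clone()), (h.clone(), h.clone())
    loss = 0.0
    for t in range(T - 1):
        logits, state1, state2 = decoder(regions, caption_ids[:, t], state1, state2)
        loss = loss + F.cross_entropy(logits, caption_ids[:, t + 1], ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```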
在本实施例的一些可选的实现方式中,该方法还包括:通过以下至少一种方式优化字幕器:通过句子鉴别器对字幕器进行对抗式训练来优化字幕器;通过字幕器识别出的对象在字幕器输出的句子中的包含程度优化字幕器;通过图像三元组与相应生成的句子之间的语义相关性优化字幕器,其中,图像三元组包括查询图像,正图像和负图像。
可通过上述任一方式优化字幕器,也可任意两种方式的组合进行优化。还可将三种方式结合在一起优化字幕器。
在本实施例的一些可选的实现方式中,通过句子鉴别器对所述字幕器进行对抗式训练来优化所述字幕器,包括:提取预置的第一样本集,其中,每个第一样本包括图像和对应的真句子;提取预先建立的生成式对抗网络,其 中,所述生成式对抗网络包括字幕器和句子鉴别器,所述字幕器用于对所输入的图像进行图像编码后再进行句子解码,得到伪句子,所述句子鉴别器用于确定所输入的句子是否为所述字幕器所输出的伪句子;基于机器学习方法,从所述第一样本集中选取第一样本,以及执行以下第一训练步骤:将选取的第一样本中的图像输入所述字幕器,输出伪句子;将所述伪句子和选取的第一样本中的真句子输入所述句子鉴别器,输入鉴别结果;根据输出的鉴别结果统计所述句子鉴别器的准确率;若所述准确率达到预设数值,则确定出所述字幕器训练完成。
若所述准确率未达到预设数值,则计算所述句子鉴别器的对抗性损失,调整所述句子鉴别器的相关参数使得所述对抗性损失减小,以及从所述第一样本集中重新选取第一样本,继续执行所述第一训练步骤。
若所述准确率未达到预设数值,则计算所述字幕器的对抗性奖励,调整所述字幕器的相关参数使得所述对抗性奖励增大,以及从所述第一样本集中重新选取第一样本,继续执行所述第一训练步骤。
句子鉴别器的结构如图3所示。句子鉴别器和字幕器（包括图像编码器和句子解码器）组成了生成式对抗网络。句子鉴别器用于区分输入句子是未配对句子数据集中的真实句子还是由字幕器生成的伪句子。基于递归神经网络（RNN）的句子建模，可利用LSTM在上下文上将单词序列编码为句子级别的表示形式，以识别真实/生成的句子。从技术上讲，给定包含 $T$ 个单词的句子 $S=\{w_1,w_2,\dots,w_T\}$，句子鉴别器中的LSTM以自然顺序读取输入的单词序列：

$$h_t^{D}=\mathrm{LSTM}\big(W_s\,w_t,\;h_{t-1}^{D}\big)$$

其中 $h_t^{D}$ 是 $t$ 时刻的输出隐藏状态，$W_s$ 是词嵌入矩阵。之后，将最终的输出隐藏状态 $h_T^{D}$ 作为句子级别的表示。利用 $h_T^{D}$，通过以下sigmoid函数来产生识别真实句子性的概率：

$$D(S)=\sigma\big(W_{FC}\,h_T^{D}\big)$$

其中 $W_{FC}$ 是全连层的嵌入矩阵。
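下面是这种LSTM句子鉴别器的一个最小PyTorch风格草图（仅为示意；各层维度与模块名 `SentenceDiscriminator` 为说明性假设）：

```python
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    """LSTM that encodes a word sequence and scores how likely it is a real (human-written) sentence."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # W_s
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)                   # W_FC

    def forward(self, word_ids):
        # word_ids: (B, T) token ids of real or generated sentences
        _, (h_last, _) = self.lstm(self.embed(word_ids))     # h_last: (1, B, hidden_dim)
        return torch.sigmoid(self.fc(h_last.squeeze(0)))     # probability of being a real sentence, (B, 1)
```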
每次训练过程中,句子鉴别器判断输入的句子是真句子还是字幕器生成的伪句子。统计鉴别结果是否正确,如果准确率达到预设数值(例如0.5),则说明字幕器的伪造句子的效果很好,能够骗得过句子鉴别器,则结束训练。否则需要调整字幕器和句子鉴别器的网络参数,再进行训练。可先固定字幕器的参数,调整句子鉴别器的参数进行训练,然后固定句子鉴别器的参数, 调整字幕器的参数进行训练。交替调整句子鉴别器和字幕器的参数,最后训练完成了句子鉴别器和字幕器。而实际应用的是字幕器。
对抗性奖励（Adversarial Reward），为了生成与人类书写的字幕不可区分的句子，本公开采用对抗式训练和句子级别的对抗式奖励，使生成的句子分布与手动描述的句子分布相匹配。从技术上讲，图像字幕器 $G$ 被视为句子生成器，捕获数据分布以生成句子。句子鉴别器 $D$ 以从实际句子或 $G$ 生成的句子中随机选择的句子作为输入，并且在两个句子源（即生成的句子或真实的句子）上产生概率分布 $D(S)$。通过对抗性训练之后，在双人博弈中训练图像字幕器和句子鉴别器。具体来说，句子鉴别器 $D$ 通过正确区分真实句子 $\{S_i\}$ 和生成句子 $\{\hat{S}_i\}$ 来优化，即最小化对抗性损失：

$$L_{D}=-\mathbb{E}_{S_i\sim p_{data}}\big[\log D(S_i)\big]-\mathbb{E}_{\hat{S}_i\sim G}\big[\log\big(1-D(\hat{S}_i)\big)\big]$$

同时，图像字幕器 $G$ 通过最大化对抗性奖励 $r_{adv}$ 来学习，旨在用生成的句子愚弄句子鉴别器：

$$r_{adv}\big(\hat{S}_i\big)=D\big(\hat{S}_i\big)$$
通过生成式对抗网络可以提高字幕器的准确率。
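交替进行的对抗训练更新可以示意如下（复用上面的 `SentenceDiscriminator` 草图；将鉴别器对生成句子给出的概率作为字幕器的句子级奖励与上文描述一致，其余细节均为说明性假设）：

```python
import torch

def discriminator_step(disc, disc_opt, real_ids, fake_ids):
    """Update D to tell real sentences from captioner-generated ones (minimize the adversarial loss)."""
    p_real = disc(real_ids).clamp(1e-6, 1 - 1e-6)
    p_fake = disc(fake_ids).clamp(1e-6, 1 - 1e-6)
    loss_d = -(torch.log(p_real).mean() + torch.log(1 - p_fake).mean())
    disc_opt.zero_grad()
    loss_d.backward()
    disc_opt.step()
    return loss_d.item()

def adversarial_reward(disc, fake_ids):
    """Sentence-level reward for the captioner: the probability D assigns to the generated sentence being real."""
    with torch.no_grad():
        return disc(fake_ids).squeeze(-1)    # (B,) rewards, fed into the policy-gradient update below
```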
在本实施例的一些可选的实现方式中,通过所述字幕器识别出的对象在所述字幕器输出的句子中的包含程度优化所述字幕器,包括:
提取预置的第二样本集,其中,每个第二样本包括图像;
基于机器学习方法,从所述第二样本集中选取样本,以及执行以下第二训练步骤:将选取的第二样本中的图像输入所述字幕器的图像编码器,输出样本对象集;将所述样本对象集输入字幕器的句子解码器,输出伪句子;计算所述伪句子中包含所述样本对象集中的样本对象的置信度均值分数,作为所述伪句子的对象包含奖励;若所述对象包含奖励达到预设包含奖励阈值,则确定出所述字幕器训练完成。
若所述对象包含奖励未达到预设包含奖励阈值,则调整所述字幕器的相关参数使得所述对象包含奖励增大,以及从所述第二样本集中重新选取第二样本,继续执行所述第二训练步骤。
对象包含奖励（Object Inclusion Reward），对抗性奖励只强化字幕器以产生更逼真的句子，而没有明确地刻画图像内容和生成句子之间的语义关联。因此，本公开进一步将识别出的目标在输出句子中的包含程度作为一个额外的自监督目标，即对象包含奖励，以鼓励字幕器生成的句子包含识别出的目标。通过这种方式，强调了两者之间的语义相关性，加强了生成的字幕的质量。具体地，本公开用所有识别出的对象构建一个包含对象集合 $\mathcal{O}_i$。给定生成的句子 $\hat{S}_i$，通过计算句子中包含的在上述集合 $\mathcal{O}_i$ 中的对象的置信度均值分数来构建对象包含奖励：

$$r_{obj}\big(\hat{S}_i\big)=\frac{\sum_{k=1}^{K}s_k\,\mathbb{1}\big(o_k\in\hat{S}_i\big)}{\sum_{k=1}^{K}\mathbb{1}\big(o_k\in\hat{S}_i\big)}$$

其中 $\mathbb{1}(\cdot)$ 是指标函数，$s_k$ 代表识别出来的物体 $o_k$ 相应的置信度得分，置信度不高的物体在奖励中的权重相应降低。
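对象包含奖励的计算可以示意如下（仅为示意；检测结果以（对象词，置信度）二元组给出、句子以词列表给出，以及没有任何对象被包含时返回0，均为说明性假设）：

```python
def object_inclusion_reward(detections, sentence_tokens):
    """Mean confidence of the recognized objects that actually appear in the generated sentence.

    detections: iterable of (object_word, confidence_score) pairs from the image encoder.
    sentence_tokens: list of words in the generated pseudo sentence.
    """
    tokens = set(sentence_tokens)
    included = [score for word, score in detections if word in tokens]
    return sum(included) / len(included) if included else 0.0

# Example: two of the three detected objects appear in the sentence.
reward = object_inclusion_reward(
    [("man", 0.9), ("bench", 0.7), ("dog", 0.8)],
    "a man sitting on a bench near a tree".split(),
)
print(round(reward, 2))  # 0.8
```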
在本实施例的一些可选的实现方式中,通过图像三元组与相应生成的句子之间的语义相关性优化所述字幕器,包括:提取预置的第三样本集,其中,每个第三样本包括查询图像、正图像和负图像,正图像与查询图像共享至少两个对象,而负图像和查询图像没有任何共同的对象;基于机器学习方法,从所述第三样本集中选取第三样本,以及执行以下第三训练步骤:将选取的第三样本中的查询图像、正图像和负图像分别输入所述字幕器,输出查询句子、正句子和负句子;计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度;根据所述第一语义相似度和所述第二语义相似度计算自监督三重态损失;若所述自监督三重态损失小于预定损失阈值,则确定出所述字幕器训练完成。
若所述自监督三重态损失不小于预定损失阈值,则调整所述字幕器的相关参数使得所述自监督三重态损失减小,以及从所述第三样本集中重新选取第三样本,继续执行所述第三训练步骤。
在本实施例的一些可选的实现方式中,所述计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度,包括:对于查询句子、正句子和负句子,分别计算句子中每一个词的基于对象的概率分布,进行最大池化操作,分别得到查询句子特征、正句子特征和负句子特征;计算查询句子特征和正句子特征的第一语义相似度以及计算查询句子特征和负句子特征的第二语义相似度。
自监督的三重态损失（Self-supervised Triplet Loss），在具有对象包含奖励的优化中，无论图像之间相似或不相似的关系如何，都独立地利用每个图像与相应生成的句子之间的语义相关性。从探索相对关系的想法出发，本公开设计了一种自监督的三元组损失，以三元组的方式在语义上约束字幕器的学习，旨在保留句子之间的相对语义顺序。基于图像中识别的视觉对象构造每个图像三元组（由查询图像，正图像和负图像组成）。正图像与查询图像共享至少两个对象，而负图像和查询图像没有任何共同的对象。给定这样的图像三元组，优化字幕器以使查询图像的生成语句比负图像的生成语句更类似于正图像的生成语句。具体来说，假设有一组三元组集 $\{\langle I_i,\,I_i^{+},\,I_i^{-}\rangle\}$，其中每个三元组 $\langle I_i,\,I_i^{+},\,I_i^{-}\rangle$ 包含一个查询图像 $I_i$、一个正图像 $I_i^{+}$ 和一个负图像 $I_i^{-}$。这样的三元组输入字幕器生成对应的句子三元组 $\langle S_i,\,S_i^{+},\,S_i^{-}\rangle$，旨在使得字幕器生成的句子三元组中 $S_i$ 在语义上更趋近 $S_i^{+}$ 而疏远 $S_i^{-}$。因此，自监督的三重态损失计算公式如下：

$$L_{tri}=\max\Big(0,\;\alpha-s\big(f(S_i),\,f(S_i^{+})\big)+s\big(f(S_i),\,f(S_i^{-})\big)\Big)$$

其中 $\alpha$ 表示边距，$s(\cdot,\cdot)$ 表示语义相似度，$f(S)$ 表示句子 $S$ 的基于对象的句子特征。具体来说，在解码阶段，通过仅保留1600个对象的概率，预测的单词分布被进一步转换为基于对象的分布。接下来，沿着 $S$ 的解码过程累积所有基于对象的分布，并对它们执行最大池化，以产生相应的基于对象的句子特征。
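基于对象的句子特征与三元组目标可以示意如下（仅为示意；这里以余弦相似度作为语义相似度 $s(\cdot,\cdot)$ 的一种具体选择，1600个对象的规模来自上文描述，其余函数名、边距取值等为说明性假设）：

```python
import torch
import torch.nn.functional as F

def object_based_sentence_feature(word_distributions, object_index):
    """Max-pool the per-step object probabilities into a single sentence-level feature.

    word_distributions: (T, vocab_size) predicted word distributions along the decoding of one sentence.
    object_index: LongTensor with the vocabulary indices of the ~1600 object words to keep.
    """
    object_probs = word_distributions[:, object_index]   # (T, num_objects) object-based distributions
    return object_probs.max(dim=0).values                # (num_objects,) sentence feature

def self_supervised_triplet_loss(f_query, f_pos, f_neg, margin=0.2):
    """Keep the query sentence semantically closer to the positive sentence than to the negative one."""
    sim_pos = F.cosine_similarity(f_query, f_pos, dim=0)  # first semantic similarity
    sim_neg = F.cosine_similarity(f_query, f_neg, dim=0)  # second semantic similarity
    return torch.clamp(margin - sim_pos + sim_neg, min=0.0)
```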
可选地，最终整个模型的训练可以在自批判序列训练（self-critical sequence training）中综合对抗性奖励（Adversarial Reward）、对象包含奖励（Object Inclusion Reward）和自监督三重态损失（Self-supervised Triplet Loss），总体的目标梯度公式近似为：

$$\nabla_{\theta}L(\theta)\approx-\big(r(\hat{S}_i^{T})-b\big)\,\nabla_{\theta}\log P_{\theta}\big(\hat{S}_i^{T}\big),\qquad r\big(\hat{S}_i^{T}\big)=\lambda_1\,r_{adv}\big(\hat{S}_i^{T}\big)+\lambda_2\,r_{obj}\big(\hat{S}_i^{T}\big)-\lambda_3\,L_{tri}$$

其中 $\hat{S}_i^{T}$ 表示抽样句子，$b$ 表示获得的对抗性奖励和对象包含奖励的组合（作为基线）。$\lambda_1$、$\lambda_2$、$\lambda_3$ 分别代表对抗性奖励、对象包含奖励和自监督三重态损失的权重，权重可以为0。
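三种信号在自批判训练中的组合方式可以示意如下（仅为示意；这里以贪心解码句子的奖励作为基线 $b$，各权重与函数名均为说明性假设，输入的奖励值均为已脱离计算图的标量）：

```python
def self_critical_loss(log_prob_sampled, r_adv, r_obj, l_tri, r_adv_base, r_obj_base,
                       w_adv=1.0, w_obj=1.0, w_tri=1.0):
    """Policy-gradient loss for one sampled sentence.

    log_prob_sampled: sum of log-probabilities of the sampled sentence under the captioner (a tensor).
    r_adv, r_obj, l_tri: adversarial reward, object inclusion reward, triplet loss of the sampled sentence (floats).
    r_adv_base, r_obj_base: the same rewards for the greedily decoded baseline sentence (floats).
    """
    reward = w_adv * r_adv + w_obj * r_obj - w_tri * l_tri
    baseline = w_adv * r_adv_base + w_obj * r_obj_base
    return -(reward - baseline) * log_prob_sampled   # maximize the advantage-weighted log-likelihood
```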
继续参见图3,图3是根据本实施例的用于生成字幕器的方法的应用场景的一个示意图。在图3的应用场景中,将查询图像、正图像、负图像输入字幕器的图像编码器Faster R-CNN,得到对象集{tree(树)、man(人)、bench(长椅)、grass(草)、dog(狗)…}。将对象集根据预定对象集分组后,输入字幕器的句子解码器(图3中下面的两层LSTM结构),基于对象集的语义进行波束搜索解码,生成伪句子“a man sitting on a bench near a tree”等。将这些伪句子和对应的图像作为伪图像句子对集,用于训练字幕器(图3中上面的两层LSTM结构代表了句子解码器)。为了简便起见,可固定图像编码器的参数,仅训练句子解码器,也可在句子解码器训练完成后再训练图像编码 器,交替训练图像编码器和句子解码器,得到性能最佳的字幕器。训练过程中使用交叉熵的方式。得到的上面的两层LSTM结构的参数可以共享给下面的两层LSTM结构。
为了进一步优化字幕器,可引入对抗性奖励、对象包含奖励和自监督三重态损失。
1、对抗性奖励优化:将真句子“a cow stands in the back of a large truck”和字幕器生成的伪句子一起输入句子鉴别器进行判别。如果判别的准确率未达到0.5,则按最小化对抗性损失的方向调整句子鉴别器的参数,然后再按最大化对抗性奖励的方向调整字幕器的参数。交替训练(调整)句子鉴别器和字幕器可以对字幕器进行优化。
2、对象包含奖励优化:计算识别出的对象在字幕器生成的伪句子中的包含程度。例如,识别出的对象包括tree,man,bench。如果句子1包括tree(置信度为0.9),句子2包括tree(置信度为0.8)和man(置信度为0.7),则句子2的对象包含奖励要高于句子1。训练的目的是尽量提高对象包含奖励,每次调整参数都能提高对象包含奖励。
3、自监督三重态损失优化:图3中输入的样本可以是三元组:查询图像、正图像和负图像。不同图像能生成不同的伪句子,通过比较查询句子、正句子和负句子之间的语义相似性,来确定自监督三重态损失。训练的目的是降低自监督三重态损失,使得正句子与查询句子语义更接近,负句子与查询句子的语义不相关。
在训练阶段,本公开采用自学习模式,通过交替进行伪字幕对生成和字幕器再训练这两个过程来优化整个模型,以达到循环迭代地改进字幕器的目的。
本公开提出了基于语义约束的自学习框架,深入研究了非配对图像字幕器的自学习思想。从建立一个伪句子生成和迭代优化的角度研究了这个问题,逐步提高了句子生成的质量。此外,将语义约束很好地集成到模型中,充分利用图像中物体的语义指导字幕器的训练,得到了先进的无监督字幕技术。
进一步参考图4,其示出了用于输出字幕的方法的又一个实施例的流程400。该用于输出字幕的方法的流程400,包括以下步骤:
步骤401,获取待处理的图像。
在本实施例中,用于输出字幕的方法运行于其上的电子设备(例如图1 所示的服务器)可以通过有线连接方式或者无线连接方式从用户利用其进行字幕编辑的终端接收待处理的图像。待处理的图像可以是单独的图像,也可以是视频文件,服务器将视频分帧后得到待处理的图像。
步骤402,将图像输入字幕器中,输出图像对应的字幕。
在本实施例中,字幕器为根据步骤201-205的方法训练得到的。通过字幕器能够为图像自动配上字幕。可将字幕直接输出到图像上,也可生成单独的文件,返回给终端,由终端根据用户的需求设置字幕的格式,再输出到图像上。字幕器不仅可以输入字幕,还可输出图像编码器识别出的对象,可用于训练过程中的语义约束。
步骤401-402可与步骤201-205交替执行。步骤401-402生成的字幕可作为步骤201-205的训练样本。
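作为训练得到的字幕器在上述输出流程中的一个使用示意（`load_captioner`、`captioner.generate` 等均为假设性的辅助接口，并非本公开定义的接口）：

```python
from pathlib import Path

def caption_images(image_paths, captioner):
    """Feed each image to be processed into the trained captioner and collect the output captions."""
    return {Path(p).name: captioner.generate(p) for p in image_paths}

# 假设性用法：生成的字幕既可以直接叠加到图像上，也可以作为单独文件返回给终端。
# captioner = load_captioner("captioner.pt")
# print(caption_images(["demo.jpg"], captioner))
```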
从图4中可以看出,与图2对应的实施例相比,本实施例中的用于输出字幕的方法的流程400体现了字幕器的应用步骤。由此,本实施例描述的方案可以通过字幕器生成训练样本,再用于字幕器的训练,交替地生成字幕和字幕器再训练,可以优化字幕器,提高生成字幕的准确性。
进一步参考图5,作为对上述各图所示方法的实现,本公开提供了一种用于生成字幕器的装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图5所示,本实施例的用于生成字幕器的装置500包括:获取单元501、编码单元502、分组单元503、解码单元504和训练单元505。其中,获取单元501,被配置成获取样本图像集;编码单元502,被配置成将所述样本图像集输入句子生成器的图像编码器,输出对象集;分组单元503,被配置成将所述对象集分组成第一对象集和第二对象集,其中,所述第一对象集为被包含在预定对象集内的对象集,所述第二对象集为被排除在预定对象集外的对象集;解码单元504,被配置成将所述图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以所述第一对象集、所述第二对象集为约束条件进行波束搜索,生成伪图像句子对集;训练单元505,被配置成将所述伪图像句子对集作为样本集训练所述句子生成器,得到字幕器。
在本实施例中,用于生成字幕器的装置500的获取单元501、编码单元502、分组单元503、解码单元504和训练单元505的具体处理可以参考图2对应实施例中的步骤201、步骤202、步骤203、步骤204和步骤205。
在本实施例的一些可选的实现方式中,该装置还包括优化单元(附图中未示出),被配置成:通过以下至少一种方式优化字幕器:通过句子鉴别器对字幕器进行对抗式训练来优化字幕器;通过字幕器识别出的对象在字幕器输出的句子中的包含程度优化字幕器;通过图像三元组与相应生成的句子之间的语义相关性优化字幕器,其中,图像三元组包括查询图像,正图像和负图像。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:提取预置的第一样本集,其中,每个第一样本包括图像和对应的真句子;提取预先建立的生成式对抗网络,其中,生成式对抗网络包括字幕器和句子鉴别器,字幕器用于对所输入的图像进行图像编码后再进行句子解码,得到伪句子,句子鉴别器用于确定所输入的句子是否为字幕器所输出的伪句子;基于机器学习装置,从第一样本集中选取第一样本,以及执行以下第一训练步骤:将选取的第一样本中的图像输入字幕器,输出伪句子;将伪句子和选取的第一样本中的真句子输入句子鉴别器,输入鉴别结果;根据输出的鉴别结果统计句子鉴别器的准确率;若准确率达到预设数值,则确定出字幕器训练完成。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:若准确率未达到预设数值,则计算句子鉴别器的对抗性损失,调整句子鉴别器的相关参数使得对抗性损失减小,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:若准确率未达到预设数值,则计算字幕器的对抗性奖励,调整字幕器的相关参数使得对抗性奖励增大,以及从第一样本集中重新选取第一样本,继续执行第一训练步骤。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:提取预置的第二样本集,其中,每个第二样本包括图像;基于机器学习装置,从第二样本集中选取样本,以及执行以下第二训练步骤:将选取的第二样本中的图像输入字幕器的图像编码器,输出样本对象集;将样本对象集输入字幕器的句子解码器,输出伪句子;计算伪句子中包含样本对象集中的样本对象的置信度均值分数,作为伪句子的对象包含奖励;若对象包含奖励达到预设包含奖励阈值,则确定出字幕器训练完成。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:若对 象包含奖励未达到预设包含奖励阈值,则调整字幕器的相关参数使得对象包含奖励增大,以及从第二样本集中重新选取第二样本,继续执行第二训练步骤。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:提取预置的第三样本集,其中,每个第三样本包括查询图像、正图像和负图像,正图像与查询图像共享至少两个对象,而负图像和查询图像没有任何共同的对象;基于机器学习装置,从第三样本集中选取第三样本,以及执行以下第三训练步骤:将选取的第三样本中的查询图像、正图像和负图像分别输入字幕器,输出查询句子、正句子和负句子;计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度;根据第一语义相似度和第二语义相似度计算自监督三重态损失;若自监督三重态损失小于预定损失阈值,则确定出字幕器训练完成。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:若自监督三重态损失不小于预定损失阈值,则调整字幕器的相关参数使得自监督三重态损失减小,以及从第三样本集中重新选取第三样本,继续执行第三训练步骤。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:对于查询句子、正句子和负句子,分别计算句子中每一个词的基于对象的概率分布,进行最大池化操作,分别得到查询句子特征、正句子特征和负句子特征;计算查询句子特征和正句子特征的第一语义相似度以及计算查询句子特征和负句子特征的第二语义相似度。
在本实施例的一些可选的实现方式中,优化单元进一步被配置成:若对抗性奖励、对象包含奖励和自监督三重态损失的加权和大于预定目标值,则调整字幕器的相关参数使得加权和减小。
在本实施例的一些可选的实现方式中,图像编码器包括具有区域级别注意机制的两层LSTM,其中,第一层LSTM充当自上而下的注意模块,根据上下文信息计算对象级别的注意,而第二层LSTM是用于生成句子的语言模型。
进一步参考图6,作为对上述各图所示方法的实现,本公开提供了一种用于输出字幕的装置的一个实施例,该装置实施例与图4所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图6所示,本实施例的用于输出字幕的装置600包括:获取单元601,被配置成获取待处理的图像;输出单元602,被配置成将图像输入采用如装置500生成的字幕器中,输出图像对应的字幕。
根据本公开的实施例,本公开还提供了一种电子设备、一种可读存储介质。
图7示出了可以用来实施本公开的实施例的示例电子设备700的示意性框图。电子设备旨在表示各种形式的数字计算机,诸如,膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置,诸如,个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例,并且不意在限制本文中描述的和/或要求的本公开的实现。
如图7所示,设备700包括计算单元701,其可以根据存储在只读存储器(ROM)702中的计算机程序或者从存储单元708加载到随机访问存储器(RAM)703中的计算机程序,来执行各种适当的动作和处理。在RAM 703中,还可存储设备700操作所需的各种程序和数据。计算单元701、ROM 702以及RAM 703通过总线704彼此相连。输入/输出(I/O)接口705也连接至总线704。
设备700中的多个部件连接至I/O接口705,包括:输入单元706,例如键盘、鼠标等;输出单元707,例如各种类型的显示器、扬声器等;存储单元708,例如磁盘、光盘等;以及通信单元709,例如网卡、调制解调器、无线通信收发机等。通信单元709允许设备700通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。
计算单元701可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元701的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的计算单元、数字信号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。计算单元701执行上文所描述的各个方法和处理,例如方法用于生成字幕器。例如,在一些实施例中,方法用于生成字幕器可被实现为计算机软件程序,其被有形地包含于机器可读介质,例如存储单元708。在一些实施例中,计算机程序的部分或者全部可以经由ROM 702和/或通信单元 709而被载入和/或安装到设备700上。当计算机程序加载到RAM 703并由计算单元701执行时,可以执行上文描述的方法用于生成字幕器的一个或多个步骤。备选地,在其他实施例中,计算单元701可以通过其他任何适当的方式(例如,借助于固件)而被配置为执行方法用于生成字幕器。
本公开的实施例提供的用于生成字幕器的方法和装置以及用于输出字幕的方法和装置,旨在为图像字幕提供无监督的解决方案。与现有的图像字幕方法严重依赖大量的图像句子对进行训练不同,本公开通过在自学方式中学习图像字幕器来消除这种依赖性。可以用非配对的图像和句子数据训练字幕器,追求更多真实的场景。
本文中以上描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、芯片上系统的系统(SOC)、负载可编程逻辑设备(CPLD)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括:实施在一个或者多个计算机程序中,该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释,该可编程处理器可以是专用或者通用可编程处理器,可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令,并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。
用于实施本公开的方法的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器,使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可以完全在机器上执行、部分地在机器上执行,作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程 只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
为了提供与用户的交互,可以在计算机上实施此处描述的系统和技术,该计算机具有:用于向用户显示信息的显示装置(例如,CRT(阴极射线管)或者LCD(液晶显示器)监视器);以及键盘和指向装置(例如,鼠标或者轨迹球),用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互;例如,提供给用户的反馈可以是任何形式的传感反馈(例如,视觉反馈、听觉反馈、或者触觉反馈);并且可以用任何形式(包括声输入、语音输入、或者触觉输入)来接收来自用户的输入。
可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如,作为数据服务器)、或者包括中间件部件的计算系统(例如,应用服务器)、或者包括前端部件的计算系统(例如,具有图形用户界面或者网络浏览器的用户计算机,用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如,通信网络)来将系统的部件相互连接。通信网络的示例包括:局域网(LAN)、广域网(WAN)和互联网。
计算机系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以为分布式系统的服务器,或者是结合了区块链的服务器。服务器也可以是云服务器,或者是带人工智能技术的智能云计算服务器或智能云主机。服务器可以为分布式系统的服务器,或者是结合了区块链的服务器。服务器也可以是云服务器,或者是带人工智能技术的智能云计算服务器或智能云主机。
应该理解,可以使用上面所示的各种形式的流程,重新排序、增加或删除步骤。例如,本公开中记载的各步骤可以并行地执行也可以顺序地执行也可以以不同的次序执行,只要能够实现本公开公开的技术方案所期望的结果,本文在此不进行限制。
上述具体实施方式,并不构成对本公开保护范围的限制。本领域技术人员应该明白的是,根据设计要求和其他因素,可以进行各种修改、组合、子组合和替代。任何在本公开的精神和原则之内所作的修改、等同替换和改进 等,均应包含在本公开保护范围之内。

Claims (17)

  1. 一种用于生成字幕器的方法,包括:
    获取样本图像集;
    将所述样本图像集输入句子生成器的图像编码器,输出对象集;
    将所述对象集分组成第一对象集和第二对象集,其中,所述第一对象集为被包含在预定对象集内的对象集,所述第二对象集为被排除在预定对象集外的对象集;
    将所述图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以所述第一对象集、所述第二对象集为约束条件进行波束搜索,生成伪图像句子对集;
    将所述伪图像句子对集作为样本集训练所述句子生成器,得到字幕器。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    通过以下至少一种方式优化所述字幕器:
    通过句子鉴别器对所述字幕器进行对抗式训练来优化所述字幕器;
    通过所述字幕器识别出的对象在所述字幕器输出的句子中的包含程度优化所述字幕器;
    通过图像三元组与相应生成的句子之间的语义相关性优化所述字幕器,其中,图像三元组包括查询图像,正图像和负图像。
  3. 根据权利要求2所述的方法,其中,所述通过句子鉴别器对所述字幕器进行对抗式训练来优化所述字幕器,包括:
    提取预置的第一样本集,其中,每个第一样本包括图像和对应的真句子;
    提取预先建立的生成式对抗网络,其中,所述生成式对抗网络包括字幕器和句子鉴别器,所述字幕器用于对所输入的图像进行图像编码后再进行句子解码,得到伪句子,所述句子鉴别器用于确定所输入的句子是否为所述字幕器所输出的伪句子;
    基于机器学习方法,从所述第一样本集中选取第一样本,以及执行以下第一训练步骤:将选取的第一样本中的图像输入所述字幕器,输出伪句子;将所述伪句子和选取的第一样本中的真句子输入所述句子鉴别器,输入鉴别结果;根据输出的鉴别结果统计所述句子鉴别器的准确率;若所述准确率达 到预设数值,则确定出所述字幕器训练完成。
  4. 根据权利要求3所述的方法,其中,所述方法还包括:
    若所述准确率未达到所述预设数值,则计算所述句子鉴别器的对抗性损失,调整所述句子鉴别器的相关参数使得所述对抗性损失减小,以及从所述第一样本集中重新选取第一样本,继续执行所述第一训练步骤。
  5. 根据权利要求3所述的方法,其中,所述方法还包括:
    若所述准确率未达到所述预设数值,则计算所述字幕器的对抗性奖励,调整所述字幕器的相关参数使得所述对抗性奖励增大,以及从所述第一样本集中重新选取第一样本,继续执行所述第一训练步骤。
  6. 根据权利要求2所述的方法,其中,所述通过所述字幕器识别出的对象在所述字幕器输出的句子中的包含程度优化所述字幕器,包括:
    提取预置的第二样本集,其中,每个第二样本包括图像;
    基于机器学习方法,从所述第二样本集中选取样本,以及执行以下第二训练步骤:将选取的第二样本中的图像输入所述字幕器的图像编码器,输出样本对象集;将所述样本对象集输入字幕器的句子解码器,输出伪句子;计算所述伪句子中包含所述样本对象集中的样本对象的置信度均值分数,作为所述伪句子的对象包含奖励;若所述对象包含奖励达到预设包含奖励阈值,则确定出所述字幕器训练完成。
  7. 根据权利要求6所述的方法,其中,所述方法还包括:
    若所述对象包含奖励未达到所述预设包含奖励阈值,则调整所述字幕器的相关参数使得所述对象包含奖励增大,以及从所述第二样本集中重新选取第二样本,继续执行所述第二训练步骤。
  8. 根据权利要求2所述的方法,其中,所述通过图像三元组与相应生成的句子之间的语义相关性优化所述字幕器,包括:
    提取预置的第三样本集,其中,每个第三样本包括查询图像、正图像和负图像,正图像与查询图像共享至少两个对象,而负图像和查询图像没有任 何共同的对象;
    基于机器学习方法,从所述第三样本集中选取第三样本,以及执行以下第三训练步骤:将选取的第三样本中的查询图像、正图像和负图像分别输入所述字幕器,输出查询句子、正句子和负句子;计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度;根据所述第一语义相似度和所述第二语义相似度计算自监督三重态损失;若所述自监督三重态损失小于预定损失阈值,则确定出所述字幕器训练完成。
  9. 根据权利要求8所述的方法,其中,所述方法还包括:
    若所述自监督三重态损失不小于所述预定损失阈值,则调整所述字幕器的相关参数使得所述自监督三重态损失减小,以及从所述第三样本集中重新选取第三样本,继续执行所述第三训练步骤。
  10. 根据权利要求8所述的方法,其中,所述计算查询句子和正句子的第一语义相似度以及计算查询句子和负句子的第二语义相似度,包括:
    对于查询句子、正句子和负句子,分别计算句子中每一个词的基于对象的概率分布,进行最大池化操作,分别得到查询句子特征、正句子特征和负句子特征;
    计算查询句子特征和正句子特征的第一语义相似度以及计算查询句子特征和负句子特征的第二语义相似度。
  11. 根据权利要求2-10中任一项所述的方法,其中,所述方法还包括:
    若对抗性奖励、对象包含奖励和自监督三重态损失的加权和大于预定目标值,则调整字幕器的相关参数使得所述加权和减小。
  12. 根据权利要求1-10中任一项所述的方法,其中,所述图像编码器包括具有区域级别注意机制的两层LSTM,其中,第一层LSTM充当自上而下的注意模块,根据上下文信息计算对象级别的注意,而第二层LSTM是用于生成句子的语言模型。
  13. 一种用于输出字幕的方法,包括:
    获取待处理的图像;
    将所述图像输入采用如权利要求1-12中任一项所述的方法生成的字幕器中,输出所述图像对应的字幕。
  14. 一种用于生成字幕器的装置,包括:
    获取单元,被配置成获取样本图像集;
    编码单元,被配置成将所述样本图像集输入句子生成器的图像编码器,输出对象集;
    分组单元,被配置成将所述对象集分组成第一对象集和第二对象集,其中,所述第一对象集为被包含在预定对象集内的对象集,所述第二对象集为被排除在预定对象集外的对象集;
    解码单元,被配置成将所述图像编码器输出的对象集输入句子生成器的句子解码器,在解码步骤中以所述第一对象集、所述第二对象集为约束条件进行波束搜索,生成伪图像句子对集;
    训练单元,被配置成将所述伪图像句子对集作为样本集训练所述句子生成器,得到字幕器。
  15. 一种用于输出字幕的装置,包括:
    获取单元,被配置成获取待处理的图像;
    输出单元,被配置成将所述图像输入采用如权利要求1-12中任一项所述的方法生成的字幕器中,输出所述图像对应的字幕。
  16. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,其上存储有一个或多个计算机程序,
    当所述一个或多个计算机程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-13中任一项所述的方法。
  17. 一种计算机可读介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1-13中任一项所述的方法。
PCT/CN2022/070476 2021-03-30 2022-01-06 用于生成字幕器以及输出字幕的方法和装置 WO2022206094A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023559796A JP2024512628A (ja) 2021-03-30 2022-01-06 キャプション生成器を生成するための方法および装置、並びにキャプションを出力するための方法および装置
US18/284,225 US20240177506A1 (en) 2021-03-30 2022-01-06 Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110338045.X 2021-03-30
CN202110338045.XA CN113052090B (zh) 2021-03-30 2021-03-30 用于生成字幕器以及输出字幕的方法和装置

Publications (1)

Publication Number Publication Date
WO2022206094A1 true WO2022206094A1 (zh) 2022-10-06

Family

ID=76516172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070476 WO2022206094A1 (zh) 2021-03-30 2022-01-06 用于生成字幕器以及输出字幕的方法和装置

Country Status (4)

Country Link
US (1) US20240177506A1 (zh)
JP (1) JP2024512628A (zh)
CN (1) CN113052090B (zh)
WO (1) WO2022206094A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052090B (zh) * 2021-03-30 2024-03-05 京东科技控股股份有限公司 用于生成字幕器以及输出字幕的方法和装置
CN113628288B (zh) * 2021-07-06 2024-05-31 上海电力大学 一种基于编-解码器结构的可控图像字幕生成优化方法
CN114821271B (zh) * 2022-05-19 2022-09-16 平安科技(深圳)有限公司 模型训练方法、图像描述生成方法、装置及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN107608943A (zh) * 2017-09-08 2018-01-19 中国石油大学(华东) 融合视觉注意力和语义注意力的图像字幕生成方法及系统
CN112084841A (zh) * 2020-07-27 2020-12-15 齐鲁工业大学 跨模态的图像多风格字幕生成方法及系统
CN112508048A (zh) * 2020-10-22 2021-03-16 复旦大学 图像描述的生成方法和装置
CN113052090A (zh) * 2021-03-30 2021-06-29 京东数字科技控股股份有限公司 用于生成字幕器以及输出字幕的方法和装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8548794B2 (en) * 2003-07-02 2013-10-01 University Of Southern California Statistical noun phrase translation
US9183466B2 (en) * 2013-06-15 2015-11-10 Purdue Research Foundation Correlating videos and sentences
US9811765B2 (en) * 2016-01-13 2017-11-07 Adobe Systems Incorporated Image captioning with weak supervision
WO2018212822A1 (en) * 2017-05-16 2018-11-22 Google Inc. Suggested actions for images
US11113599B2 (en) * 2017-06-22 2021-09-07 Adobe Inc. Image captioning utilizing semantic text modeling and adversarial learning
US11514252B2 (en) * 2018-06-10 2022-11-29 Adobe Inc. Discriminative caption generation
CN110135567A (zh) * 2019-05-27 2019-08-16 中国石油大学(华东) 基于多注意力生成对抗网络的图像字幕生成方法
CN111126479A (zh) * 2019-12-20 2020-05-08 山东浪潮人工智能研究院有限公司 一种基于无监督独特性优化的图像描述生成方法及系统
CN111612103B (zh) * 2020-06-23 2023-07-11 中国人民解放军国防科技大学 结合抽象语义表示的图像描述生成方法、系统及介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN107608943A (zh) * 2017-09-08 2018-01-19 中国石油大学(华东) 融合视觉注意力和语义注意力的图像字幕生成方法及系统
CN112084841A (zh) * 2020-07-27 2020-12-15 齐鲁工业大学 跨模态的图像多风格字幕生成方法及系统
CN112508048A (zh) * 2020-10-22 2021-03-16 复旦大学 图像描述的生成方法和装置
CN113052090A (zh) * 2021-03-30 2021-06-29 京东数字科技控股股份有限公司 用于生成字幕器以及输出字幕的方法和装置

Also Published As

Publication number Publication date
CN113052090B (zh) 2024-03-05
US20240177506A1 (en) 2024-05-30
CN113052090A (zh) 2021-06-29
JP2024512628A (ja) 2024-03-19

Similar Documents

Publication Publication Date Title
WO2022206094A1 (zh) 用于生成字幕器以及输出字幕的方法和装置
CN108829757B (zh) 一种聊天机器人的智能服务方法、服务器及存储介质
WO2022155994A1 (zh) 基于注意力的深度跨模态哈希检索方法、装置及相关设备
US20220245347A1 (en) Entity recognition method, apparatus, electronic device and computer readable storage medium
CN107273458B (zh) 深度模型训练方法及装置、图像检索方法及装置
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
CN115359383B (zh) 跨模态特征提取、检索以及模型的训练方法、装置及介质
WO2018196718A1 (zh) 图像消歧方法、装置、存储介质和电子设备
CN110390363A (zh) 一种图像描述方法
CN111444968A (zh) 一种基于注意力融合的图像描述生成方法
CN114663915B (zh) 基于Transformer模型的图像人-物交互定位方法及系统
Hani et al. Image caption generation using a deep architecture
CN111259197B (zh) 一种基于预编码语义特征的视频描述生成方法
CN110263218B (zh) 视频描述文本生成方法、装置、设备和介质
CN114550057A (zh) 一种基于多模态表示学习的视频情绪识别方法
CN114492646A (zh) 一种基于跨模态互注意力机制的图文匹配方法
CN116166827B (zh) 语义标签抽取模型的训练和语义标签的抽取方法及其装置
CN114548274A (zh) 一种基于多模态交互的谣言检测方法及系统
CN116304042A (zh) 一种基于多模态特征自适应融合的虚假新闻检测方法
CN114973229B (zh) 文本识别模型训练、文本识别方法、装置、设备及介质
CN116258147A (zh) 一种基于异构图卷积的多模态评论情感分析方法及系统
CN112084788B (zh) 一种影像字幕隐式情感倾向自动标注方法及系统
CN113792537A (zh) 一种动作生成方法以及装置
WO2023168997A9 (zh) 一种跨模态搜索方法及相关设备
CN114386412B (zh) 一种基于不确定性感知的多模态命名实体识别方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778283

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18284225

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2023559796

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 11202306944T

Country of ref document: SG

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/02/2024)