WO2022206094A1 - Method and apparatus for generating a subtitler and outputting subtitles - Google Patents
Method and apparatus for generating a subtitler and outputting subtitles
- Publication number: WO2022206094A1 (application PCT/CN2022/070476)
- Authority: WIPO (PCT)
- Prior art keywords: sentence, image, sample, subtitler, objects
- Prior art date
Classifications
- G06V20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F40/30: Semantic analysis
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/00: Scenes; Scene-specific elements
- G06V40/30: Writer recognition; Reading and verifying signatures
- G06V40/40: Spoof detection, e.g. liveness detection
Description
- Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for generating a subtitler and for outputting subtitles.
- Image captioning is an emerging and rapidly growing research topic, which is a technique to automatically describe images in natural language sentences.
- Embodiments of the present disclosure propose methods and apparatuses for generating a subtitler and methods and apparatuses for outputting subtitles.
- An embodiment of the present disclosure provides a method for generating a subtitler, including: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator and outputting an object set; grouping the object set into a first object set and a second object set, where the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, which performs beam search with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and training the sentence generator with the pseudo image-sentence pair set as a sample set to obtain the subtitler.
- The method further includes optimizing the subtitler in at least one of the following ways: performing adversarial training on the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
- Performing adversarial training on the subtitler with the sentence discriminator to optimize the subtitler includes: extracting a preset first sample set, where each first sample includes an image and a corresponding real sentence; extracting a pre-established generative adversarial network, where the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler is used to encode the input image and then decode it into a sentence to obtain a pseudo sentence, and the sentence discriminator is used to determine whether an input sentence is a pseudo sentence output by the subtitler; and, based on a machine learning method, selecting a first sample from the first sample set and performing the following first training step: inputting the image in the selected first sample into the subtitler and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence in the selected first sample into the sentence discriminator and outputting a discrimination result; computing the accuracy rate of the sentence discriminator according to the output discrimination result; and, if the accuracy rate reaches a preset value, determining that the subtitler training is completed.
- The method further includes: if the accuracy rate does not reach the preset value, calculating the adversarial loss of the sentence discriminator, adjusting the relevant parameters of the sentence discriminator to reduce the adversarial loss, reselecting a first sample from the first sample set, and continuing to perform the first training step.
- The method further includes: if the accuracy rate does not reach the preset value, calculating the adversarial reward of the subtitler, adjusting the relevant parameters of the subtitler to increase the adversarial reward, reselecting a first sample from the first sample set, and continuing to perform the first training step.
- Optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences output by the subtitler includes: extracting a preset second sample set, where each second sample includes an image; and, based on a machine learning method, selecting samples from the second sample set and performing the following second training step: inputting the image in the selected second sample into the image encoder of the subtitler and outputting a sample object set; inputting the sample object set into the sentence decoder of the subtitler and outputting a pseudo sentence; computing the average confidence score of the sample objects from the sample object set that are contained in the pseudo sentence, as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches a preset inclusion reward threshold, determining that the subtitler training is completed.
- The method further includes: if the object inclusion reward does not reach the preset inclusion reward threshold, adjusting the relevant parameters of the subtitler to increase the object inclusion reward, reselecting a second sample from the second sample set, and continuing to perform the second training step.
- Optimizing the subtitler by the semantic correlation between image triplets and the correspondingly generated sentences includes: extracting a preset third sample set, where each third sample includes a query image, a positive image and a negative image, the positive image shares at least two objects with the query image, and the negative image and the query image have no objects in common; and, based on a machine learning method, selecting a third sample from the third sample set and performing the following third training step: inputting the query image, positive image and negative image of the selected third sample into the subtitler respectively and outputting a query sentence, a positive sentence and a negative sentence; computing the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence; computing the self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determining that the subtitler training is completed.
- The method further includes: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjusting the relevant parameters of the subtitler to reduce the self-supervised triplet loss, reselecting a third sample from the third sample set, and continuing to perform the third training step.
- Calculating the first semantic similarity between the query sentence and the positive sentence and calculating the second semantic similarity between the query sentence and the negative sentence includes: for the query sentence, the positive sentence and the negative sentence, computing the object-based probability distribution of each word in the sentence and performing a max pooling operation to obtain the query sentence feature, the positive sentence feature and the negative sentence feature respectively; then computing the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
- the method further includes: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjusting the relevant parameters of the captioner such that the weighted sum decreases.
- The image encoder includes a two-layer LSTM with a region-level attention mechanism, where the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, and the second-layer LSTM is a language model for generating sentences.
- Embodiments of the present disclosure also provide a method for outputting subtitles, including: acquiring an image to be processed; and inputting the image into a subtitler generated by the above method for generating a subtitler, and outputting the subtitle corresponding to the image.
- Embodiments of the present disclosure also provide an apparatus for generating a subtitler, comprising: an acquisition unit configured to acquire a sample image set; an encoding unit configured to input the sample image set into an image encoder of a sentence generator and output an object set; a grouping unit configured to group the object set into a first object set and a second object set, where the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; a decoding unit configured to input the object set output by the image encoder into a sentence decoder of the sentence generator, which performs beam search with the first object set and the second object set as constraints in the decoding step to generate a set of pseudo image-sentence pairs; and a training unit configured to train the sentence generator with the pseudo image-sentence pair set as a sample set to obtain the subtitler.
- The apparatus further includes an optimization unit configured to optimize the subtitler in at least one of the following ways: performing adversarial training on the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
- The optimization unit is further configured to: extract a preset first sample set, where each first sample includes an image and a corresponding real sentence; extract a pre-established generative adversarial network, where the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler is used to encode the input image and then decode it into a sentence to obtain a pseudo sentence, and the sentence discriminator is used to determine whether an input sentence is a pseudo sentence output by the subtitler; and, based on a machine learning method, select a first sample from the first sample set and perform the following first training step: input the image in the selected first sample into the subtitler and output a pseudo sentence; input the pseudo sentence and the real sentence in the selected first sample into the sentence discriminator and output a discrimination result; compute the accuracy rate of the sentence discriminator according to the output discrimination result; and, if the accuracy rate reaches the preset value, determine that the subtitler training is completed.
- The optimization unit is further configured to: if the accuracy rate does not reach the preset value, calculate the adversarial loss of the sentence discriminator, adjust the relevant parameters of the sentence discriminator so that the adversarial loss is reduced, reselect a first sample from the first sample set, and continue to perform the first training step.
- The optimization unit is further configured to: if the accuracy rate does not reach the preset value, calculate the adversarial reward of the subtitler, adjust the relevant parameters of the subtitler to increase the adversarial reward, reselect a first sample from the first sample set, and continue to perform the first training step.
- The optimization unit is further configured to: extract a preset second sample set, where each second sample includes an image; and, based on a machine learning method, select samples from the second sample set and perform the following second training step: input the image in the selected second sample into the image encoder of the subtitler and output a sample object set; input the sample object set into the sentence decoder of the subtitler and output a pseudo sentence; compute the average confidence score of the sample objects from the sample object set that are contained in the pseudo sentence, as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches the preset inclusion reward threshold, determine that the subtitler training is completed.
- The optimization unit is further configured to: if the object inclusion reward does not reach the preset inclusion reward threshold, adjust the relevant parameters of the subtitler to increase the object inclusion reward, reselect a second sample from the second sample set, and continue to perform the second training step.
- The optimization unit is further configured to: extract a preset third sample set, where each third sample includes a query image, a positive image and a negative image, the positive image shares at least two objects with the query image, and the negative image and the query image have no objects in common; and, based on a machine learning method, select a third sample from the third sample set and perform the following third training step: input the query image, positive image and negative image of the selected third sample into the subtitler respectively and output a query sentence, a positive sentence and a negative sentence; compute the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence; compute the self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determine that the subtitler training is completed.
- The optimization unit is further configured to: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjust the relevant parameters of the subtitler so that the self-supervised triplet loss is reduced, reselect a third sample from the third sample set, and continue to perform the third training step.
- The optimization unit is further configured to: for the query sentence, the positive sentence and the negative sentence, respectively compute the object-based probability distribution of each word in the sentence and perform a max pooling operation to obtain the query sentence feature, the positive sentence feature and the negative sentence feature respectively; and compute the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
- the optimization unit is further configured to: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjust the relevant parameters of the captioner such that the weighted sum decreases.
- The image encoder includes a two-layer LSTM with a region-level attention mechanism, where the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, and the second-layer LSTM is a language model for generating sentences.
- An embodiment of the present disclosure provides an apparatus for outputting subtitles, including: an acquisition unit configured to acquire an image to be processed; and an output unit configured to input the image into a subtitler generated by the above method for generating a subtitler and to output the subtitle corresponding to the image.
- Embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device on which one or more computer programs are stored, where the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method for generating a subtitler as described above.
- Embodiments of the present disclosure provide a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating a subtitler as described above.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;
- FIG. 2 is a flowchart of one embodiment of a method for generating a subtitler according to the present disclosure
- FIG. 3 is a schematic diagram of an application scenario of the method for generating a subtitler according to the present disclosure
- FIG. 4 is a flowchart of one embodiment of a method for outputting subtitles according to the present disclosure;
- FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a subtitler according to the present disclosure;
- FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for outputting subtitles according to the present disclosure
- FIG. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.
- FIG. 1 illustrates an exemplary system architecture 100 to which a method for generating a subtitler, an apparatus for generating a subtitler, a method for outputting subtitles, or an apparatus for outputting subtitles of embodiments of the present disclosure may be applied.
- the system architecture 100 may include terminals 101 and 102 , a network 103 , a database server 104 and a server 105 .
- the network 103 is used as a medium for providing a communication link between the terminals 101 , 102 , the database server 104 and the server 105 .
- the network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
- the user 110 can use the terminals 101, 102 to interact with the server 105 through the network 103 to receive or send messages and the like.
- Various client applications may be installed on the terminals 101 and 102 , such as model training applications, subtitle generation applications, image processing applications, shopping applications, payment applications, web browsers, and instant messaging tools.
- the terminals 101 and 102 here may be hardware or software.
- When the terminals 101 and 102 are hardware, they can be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, laptop computers and desktop computers, etc.
- When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. There is no specific limitation here.
- an image acquisition device may also be installed thereon.
- the image capturing device may be various devices that can realize the function of capturing images, such as a camera, a sensor, and the like.
- the user 110 can use the image capture devices on the terminals 101 and 102 to capture images of various scenes.
- Database server 104 may be a database server that provides various services.
- a sample set may be stored in a database server.
- the sample set contains a large number of samples.
- the samples may include sample images and sentences corresponding to the sample images.
- the user 110 can also select samples from the sample set stored in the database server 104 through the terminals 101 and 102 .
- the server 105 may also be a server that provides various services, such as a background server that provides support for various applications displayed on the terminals 101 and 102 .
- the background server can use the samples in the sample set sent by the terminals 101 and 102 to train the initial model, and can send the training results (such as the generated captioners) to the terminals 101 and 102 . In this way, the user can apply the generated subtitler to generate subtitles for the images.
- the database server 104 and the server 105 here can also be hardware or software. When they are hardware, they can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When they are software, they can be implemented as multiple software or software modules (eg, to provide distributed services), or as a single software or software module. There is no specific limitation here.
- The method for generating a subtitler or the method for outputting subtitles provided by the embodiments of the present disclosure is generally executed by the server 105. Accordingly, the apparatus for generating a subtitler or the apparatus for outputting subtitles is generally also provided in the server 105.
- When the server 105 can implement the relevant functions of the database server 104, the database server 104 may not be provided in the system architecture 100.
- terminals, networks, database servers and servers in FIG. 1 are merely illustrative. There can be any number of terminals, networks, database servers, and servers according to implementation needs.
- the method for generating a subtitler may include the following steps:
- Step 201 acquiring a sample image set.
- the execution body of the method for generating a subtitler may acquire a pre-stored set of sample images from a database server.
- the image captured by the terminal can also be acquired from the terminal as a sample image.
- Step 202 input the sample image set into the image encoder of the sentence generator, and output the object set.
- the sentence generator is the initial subtitler, which is a neural network that converts input images into sentences.
- the sentence generator may include an image encoder and a sentence decoder.
- the image encoder generates an intermediate representation for each input image.
- The present disclosure uses a common object detection model (Faster R-CNN) as the image encoder to detect objects in the image; other image encoders may also be used in practical applications. Each image $I_i$ is encoded into a set of salient image regions containing K detected objects, such as people, flowers, grass, trees, chairs and dogs.
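- As a concrete illustration of this encoding step, the sketch below uses the torchvision Faster R-CNN detector to obtain per-region labels and confidence scores; the specific model, the image path and the choice of K are illustrative assumptions, since the disclosure only requires some object detection model.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Hypothetical encoding step: detect up to K salient objects in one image.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = to_tensor(Image.open("sample.jpg").convert("RGB"))  # assumed input file
with torch.no_grad():
    pred = detector([image])[0]  # dict with 'boxes', 'labels', 'scores' (sorted by score)

K = 36  # number of salient regions to keep (illustrative)
objects = [(int(label), float(score))
           for label, score in zip(pred["labels"][:K], pred["scores"][:K])]
# `objects` plays the role of the detected object set {(o_k, s_k)} used below.
```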
- Step 203 Group the object set into a first object set and a second object set.
- the first object set is an object set included in the predetermined object set
- the second object set is an object set excluded from the predetermined object set.
- Based on the 80 most common detected objects in the COCO dataset, the identified objects are regrouped into the set of objects that need to be included and the set of objects to be excluded, as illustrated in the sketch below. For example, if the predetermined object set includes houses, cars, people, flowers, grass and trees, and the identified object set includes people, flowers, grass, trees, chairs and dogs, then the first object set (the objects included in the predetermined object set) includes people, flowers, grass and trees, while the second object set (the excluded object set) includes chairs and dogs.
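- A minimal sketch of this grouping step, reusing the example objects above; the object names and confidence scores are illustrative.

```python
# Illustrative step 203: split the detected objects into the first object set
# (to be included) and the second object set (to be excluded).
detected = {"person": 0.95, "flower": 0.88, "grass": 0.80,
            "tree": 0.76, "chair": 0.65, "dog": 0.91}
predetermined = {"house", "car", "person", "flower", "grass", "tree"}

first_object_set = {o for o in detected if o in predetermined}       # include
second_object_set = {o for o in detected if o not in predetermined}  # exclude

print(first_object_set)   # person, flower, grass, tree (in some order)
print(second_object_set)  # chair, dog (in some order)
```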
- Step 204 Input the object set output by the image encoder into the sentence decoder of the sentence generator.
- the first object set and the second object set are used as constraints to perform beam search to generate a pseudo-image sentence pair set.
- the sentence decoder is used to decode the output sentence word by word.
- the sentence decoder can be implemented as a two-layer LSTM (Long Short-Term Memory) with a region-level attention mechanism.
- the first layer LSTM (LSTM 1 ) acts as a top-down attention module that computes object-level attention based on contextual information
- the second layer LSTM (LSTM 2 ) is a language model for generating sentences.
- The hidden state of the second-layer LSTM at the previous step, $h^{2}_{t-1}$, the mean of the encoded image features, $\bar{v}=\frac{1}{K}\sum_{k} v_{k}$, and the input word $w_{t-1}$ are regarded as context information and fed into the first-layer LSTM, from which the hidden state $h^{1}_{t}$ of the first-layer LSTM is obtained.
- $\alpha_{t,k}$ is the $k$-th element of $\alpha_{t}$ and represents the attention probability of the image region $v_{k}$.
- $W_{E}$ is a linear embedding matrix that projects the hidden state into the vocabulary space for word prediction.
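- The PyTorch sketch below shows one way such a two-layer LSTM decoder step with region-level attention could be organized; the layer sizes, module names and feature dimensions are assumptions for illustration rather than the exact implementation of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownDecoder(nn.Module):
    """Sketch of a two-layer LSTM decoder with region-level attention."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM1 (top-down attention): context = [h2_{t-1}, mean image feature, w_{t-1}]
        self.lstm1 = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # region-level attention over the K encoded regions v_k
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        # LSTM2 (language model): input = [attended image feature, h1_t]
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.W_E = nn.Linear(hidden_dim, vocab_size)  # projection into vocabulary space

    def step(self, w_prev, regions, state1, state2):
        # regions: (B, K, feat_dim); w_prev: (B,) previous word ids
        v_mean = regions.mean(dim=1)
        x1 = torch.cat([state2[0], v_mean, self.embed(w_prev)], dim=1)
        h1, c1 = self.lstm1(x1, state1)
        # attention probabilities alpha_{t,k} over image regions v_k
        scores = self.att_out(torch.tanh(self.att_v(regions) + self.att_h(h1).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)       # (B, K, 1)
        v_att = (alpha * regions).sum(dim=1)   # attended image feature
        h2, c2 = self.lstm2(torch.cat([v_att, h1], dim=1), state2)
        logits = self.W_E(h2)                  # word prediction over the vocabulary
        return logits, (h1, c1), (h2, c2)
```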
- A natural way to generate pseudo image-sentence pairs via a pretrained subtitler is to employ beam search, a heuristic search algorithm that maintains a beam $B_t$ at each decoding step containing the $b$ most probable partial sentences.
- the semantic correlation between input images and output sentences is not fully utilized for sentence generation at inference time.
- the present disclosure devises a semantically constrained beam search that restructures the beam search to ensure that identified objects are included and extraneous objects are excluded.
- A set of recognized objects is output by the object detection model (e.g., Faster R-CNN), where $o_{k}$ is the recognized object with the highest confidence score in the $k$-th image region and $s_{k}$ is the corresponding confidence score.
- A finite-state machine is used to organize the search: a search beam $B_{t}^{a}$ is maintained for each state $a \in A$ of the finite-state machine.
- At each decoding step the candidate set is formed and the $b$ most probable partial word sequences are kept to update each beam, where $V$ is the dictionary and $w_{1:t-1}$ denotes the partial output sentence of length $t-1$.
- The state-transition function of the finite-state machine only looks up words from the vocabulary (with no unrelated objects) to expand the current partial word sequence; the design of the finite-state machine therefore requires that a word sequence in an accepting state satisfies the inclusion condition while excluding all extraneous objects.
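- The following sketch illustrates one way such a semantically constrained beam search could be realized; the `step_fn` scoring callback, the treatment of objects as single vocabulary words and the acceptance rule are simplifying assumptions made for illustration.

```python
def constrained_beam_search(step_fn, include, exclude, beam_size=3, max_len=20,
                            bos="<bos>", eos="<eos>"):
    """Keep one beam per finite-state-machine state (the subset of `include`
    already generated); never expand with words in `exclude`; accept only
    hypotheses whose state covers all of `include`.

    step_fn(prefix) -> {word: log_prob} is an assumed callback, in practice
    backed by the sentence decoder."""
    full = frozenset(include)
    beams = {frozenset(): [(0.0, [bos])]}          # state -> [(log_prob, prefix)]
    for _ in range(max_len):
        candidates = {}
        for state, hyps in beams.items():
            for lp, prefix in hyps:
                if prefix[-1] == eos:              # finished hypotheses stay in place
                    candidates.setdefault(state, []).append((lp, prefix))
                    continue
                for word, wlp in step_fn(prefix).items():
                    if word in exclude:            # extraneous objects are excluded
                        continue
                    nstate = state | {word} if word in full else state
                    candidates.setdefault(nstate, []).append((lp + wlp, prefix + [word]))
        # keep the b most probable partial word sequences in each state's beam
        beams = {s: sorted(h, key=lambda x: -x[0])[:beam_size]
                 for s, h in candidates.items()}
    accepted = [h for h in beams.get(full, []) if h[1][-1] == eos] or beams.get(full, [])
    return max(accepted, key=lambda x: x[0])[1] if accepted else None
```

- With `include` set to the first object set and `exclude` set to the second object set, an accepted hypothesis mentions every required object and none of the excluded ones.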
- Each pseudo-image-sentence pair in the pseudo-image-sentence pair set includes an image and a sentence, and the image and sentence can be unpaired.
- Step 205 using the pseudo-image sentence pair set as a sample set to train a sentence generator to obtain a subtitler.
- Let $\{(I_i, \hat{S}_i)\}$ denote the collection of pseudo image-sentence pairs, where $\hat{S}_i$ denotes the generated pseudo sentence; with these pseudo image-sentence pairs, one can directly train a subtitler with the following cross-entropy loss:
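- The loss formula itself is not reproduced in this text; a standard cross-entropy objective consistent with the surrounding description (notation assumed) is:

```latex
\mathcal{L}_{XE}(\theta) \;=\; -\sum_{i}\sum_{t=1}^{T_i}
\log P_{\theta}\!\left(\hat{w}_{i,t} \,\middle|\, \hat{w}_{i,1:t-1},\, I_i\right),
```

- where $\hat{w}_{i,t}$ is the $t$-th word of the generated pseudo sentence $\hat{S}_i$ of length $T_i$.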
- ⁇ denotes the parameters in the sentence decoder.
- The method further includes optimizing the subtitler in at least one of the following ways: optimizing the subtitler by performing adversarial training on the subtitler with the sentence discriminator; optimizing the subtitler by the degree to which the identified objects are included in the sentences output by the subtitler; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
- The subtitler can be optimized by any one of the above ways, by a combination of any two of them, or by all three ways combined.
- Performing adversarial training on the subtitler with a sentence discriminator to optimize the subtitler includes: extracting a preset first sample set, where each first sample includes an image and a corresponding real sentence; extracting a pre-established generative adversarial network, where the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler is used to encode the input image and then decode it into a sentence to obtain a pseudo sentence, and the sentence discriminator is used to determine whether an input sentence is a pseudo sentence output by the captioner; and, based on a machine learning method, selecting a first sample from the first sample set and performing the following first training step: inputting the image in the selected first sample into the captioner and outputting a pseudo sentence; inputting the pseudo sentence and the real sentence in the selected first sample into the sentence discriminator and outputting a discrimination result; counting the accuracy rate of the sentence discriminator according to the output discrimination result; and, if the accuracy rate reaches a preset value, determining that the captioner training is completed.
- If the accuracy rate does not reach the preset value, calculate the adversarial loss of the sentence discriminator, adjust the relevant parameters of the sentence discriminator so that the adversarial loss is reduced, reselect a first sample from the first sample set, and continue to perform the first training step.
- If the accuracy rate does not reach the preset value, calculate the adversarial reward of the captioner, adjust the relevant parameters of the captioner to increase the adversarial reward, reselect a first sample from the first sample set, and continue to perform the first training step.
- the structure of the sentence discriminator is shown in Figure 3.
- the sentence discriminator and subtitler (including the image encoder and sentence decoder) make up the generative adversarial network.
- the sentence discriminator is used to distinguish whether the input sentence is a real sentence in the unpaired sentence dataset or a pseudo sentence generated by the subtitler.
- The sentence discriminator can be implemented with a recurrent neural network (RNN); for example, LSTMs can be used to contextually encode the word sequence into a sentence-level representation in order to identify real/generated sentences.
- $W_{FC}$ is the embedding matrix of the fully connected layer.
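- A minimal PyTorch sketch of such an LSTM-based sentence discriminator follows; the embedding and hidden sizes and the single fully connected output layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    """Sketch: encode a word sequence with an LSTM into a sentence-level
    representation, then score whether the sentence is real or generated."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.w_fc = nn.Linear(hidden_dim, 1)   # fully connected layer (W_FC)

    def forward(self, word_ids):               # word_ids: (B, T)
        _, (h_n, _) = self.lstm(self.embed(word_ids))
        return self.w_fc(h_n[-1]).squeeze(-1)  # logit that the sentence is real
```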
- The sentence discriminator judges whether an input sentence is a real sentence or a pseudo sentence generated by the subtitler, and it is recorded whether each discrimination result is correct. If the accuracy rate reaches a preset value (for example, 0.5), the subtitler forges sentences well enough to fool the sentence discriminator, and training ends. Otherwise, the network parameters of the subtitler and the sentence discriminator are adjusted and training continues: first fix the parameters of the subtitler and adjust the parameters of the sentence discriminator, then fix the parameters of the sentence discriminator and adjust the parameters of the subtitler. The parameters of the sentence discriminator and the subtitler are adjusted alternately until both are trained; the subtitler is what is used in the actual application.
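- A sketch of this alternating optimization is shown below; `captioner`, `discriminator`, their optimizers, the data loader and the `generate`/`sample` interfaces are assumed placeholders, and the loss terms follow the usual GAN recipe rather than any exact formulation from the disclosure.

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()

for images, real_sentences in loader:                        # assumed data loader
    # 1) Fix the captioner, update the sentence discriminator (minimize adversarial loss).
    with torch.no_grad():
        fake_sentences = captioner.generate(images)          # assumed interface
    d_real = discriminator(real_sentences)
    d_fake = discriminator(fake_sentences)
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_optimizer.zero_grad(); d_loss.backward(); d_optimizer.step()

    # 2) Fix the discriminator, update the captioner (maximize adversarial reward).
    fake_sentences, log_probs = captioner.sample(images)     # per-word log-probs, (B, T)
    reward = torch.sigmoid(discriminator(fake_sentences)).detach()
    g_loss = -(reward * log_probs.sum(dim=1)).mean()         # REINFORCE-style update
    g_optimizer.zero_grad(); g_loss.backward(); g_optimizer.step()
```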
- Adversarial reward: to generate sentences that are indistinguishable from human-written captions, the present disclosure employs adversarial training and a sentence-level adversarial reward to match the distribution of generated sentences with the distribution of manually written sentences.
- The image captioner is treated as a sentence generator that captures the data distribution in order to generate sentences.
- The sentence discriminator D takes as input a sentence randomly selected from the real sentences or from the generated sentences, and produces a probability distribution over the two sentence sources (i.e., generated sentence or real sentence).
- the image captioner and sentence discriminator are trained in a two-player game.
- The sentence discriminator D is optimized by correctly distinguishing between the real sentences $\{S_{i}\}$ and the generated sentences, i.e., by minimizing an adversarial loss, while the image captioner learns by maximizing the adversarial reward $r_{adv}$, which aims to fool the sentence discriminator with the generated sentences.
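- Neither formula is reproduced in this text; a standard GAN-style formulation consistent with the description (notation assumed) would be:

```latex
\mathcal{L}_{adv}(D) \;=\;
-\,\mathbb{E}_{S \sim p_{\text{data}}}\big[\log D(S)\big]
\;-\; \mathbb{E}_{\hat{S} \sim G_{\theta}}\big[\log\big(1 - D(\hat{S})\big)\big],
\qquad
r_{adv}(\hat{S}) \;=\; \log D(\hat{S}),
```

- where $D(\cdot)$ is the probability the sentence discriminator assigns to the input sentence being real and $G_{\theta}$ is the captioner.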
- the accuracy of the subtitler can be improved by generative adversarial networks.
- optimizing the subtitler according to the inclusion degree of the object identified by the subtitler in the sentence output by the subtitler includes:
- each second sample includes an image
- Based on a machine learning method, samples are selected from the second sample set and the following second training step is performed: the image in the selected second sample is input into the image encoder of the captioner, and a sample object set is output;
- the sample object set is input into the sentence decoder of the subtitler, and a pseudo sentence is output; the average confidence score of the sample objects from the sample object set that are contained in the pseudo sentence is computed as the object inclusion reward of the pseudo sentence; if the object inclusion reward reaches the preset inclusion reward threshold, it is determined that the captioner training is completed.
- If the object inclusion reward does not reach the preset inclusion reward threshold, the relevant parameters of the subtitler are adjusted to increase the object inclusion reward, a second sample is reselected from the second sample set, and the second training step continues to be performed.
- The present disclosure further treats the degree to which the identified objects are included in the output sentence as an additional self-supervised objective, i.e., the object inclusion reward, to encourage the subtitler to generate sentences that contain the identified objects. In this way, the semantic correlation between the image and the sentence is emphasized, enhancing the quality of the generated captions.
- The present disclosure constructs a set containing all the identified objects. Given a generated sentence, the object inclusion reward is built from the mean confidence score of the objects from this set that are contained in the sentence:
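- The reward formula is not reproduced in this text; one reading consistent with the description, averaging the confidence scores $s_k$ of the identified objects $o_k$ that appear in the generated sentence (notation assumed), is:

```latex
r_{obj}(\hat{S}) \;=\; \frac{1}{K} \sum_{k=1}^{K} s_k \,\mathbb{1}\!\left[\, o_k \in \hat{S} \,\right]
```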
- The indicator function marks whether a recognized object appears in the sentence, and each object is weighted by its corresponding confidence score, so objects with low confidence contribute correspondingly less.
- Optimizing the subtitler through the semantic correlation between image triplets and the correspondingly generated sentences includes: extracting a preset third sample set, where each third sample includes a query image, a positive image and a negative image, the positive image shares at least two objects with the query image, and the negative image and the query image have no objects in common; based on a machine learning method, a third sample is selected from the third sample set, and the following third training step is performed: the query image, positive image and negative image in the selected third sample are input into the subtitler respectively, and a query sentence, a positive sentence and a negative sentence are output; the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence are computed; the self-supervised triplet loss is computed from the first and second semantic similarities; and, if the self-supervised triplet loss is less than a predetermined loss threshold, it is determined that the subtitler training is completed.
- If the self-supervised triplet loss is not less than the predetermined loss threshold, the relevant parameters of the subtitler are adjusted to reduce the self-supervised triplet loss, a third sample is reselected from the third sample set, and the third training step continues to be performed.
- Computing the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence includes: for the query sentence, the positive sentence and the negative sentence, computing the object-based probability distribution of each word in the sentence and performing a max pooling operation to obtain the query sentence feature, the positive sentence feature and the negative sentence feature respectively; then computing the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
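- The sketch below illustrates this computation; the per-word object probability tensors are assumed inputs (for example, vocabulary projections restricted to object words), and cosine similarity is used as the similarity measure.

```python
import torch
import torch.nn.functional as F

def sentence_feature(word_object_probs):
    """word_object_probs: (T, C) tensor holding the object-based probability
    distribution of each of the T words over C object classes. Max pooling
    over the word dimension yields one sentence-level feature."""
    return word_object_probs.max(dim=0).values               # (C,)

def triplet_similarities(query_probs, pos_probs, neg_probs):
    f_q, f_p, f_n = map(sentence_feature, (query_probs, pos_probs, neg_probs))
    sim_qp = F.cosine_similarity(f_q, f_p, dim=0)             # first semantic similarity
    sim_qn = F.cosine_similarity(f_q, f_n, dim=0)             # second semantic similarity
    return sim_qp, sim_qn
```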
- Self-supervised triplet loss: optimization with the adversarial reward and the object inclusion reward exploits each image and its corresponding generated sentence independently, without considering the semantic relevance between similar or dissimilar images.
- the present disclosure designs a self-supervised triplet loss to semantically constrain the learning of the subtitler in a triplet manner, aiming to preserve the relative semantic order between sentences.
- Each image triplet (consisting of a query image, a positive image and a negative image) is constructed based on the visual objects identified in the images. The positive image shares at least two objects with the query image, while the negative image and the query image have no objects in common. Given such an image triplet, the subtitler is optimized so that the generated sentence for the query image is more similar to that for the positive image than to that for the negative image.
- each triple contains a query image I i , a positive image and a negative image
- the self-supervised triplet loss is calculated as follows:
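- The formula is not reproduced in this text; a standard margin-based form consistent with the description, with margin $m$, sentence features $f(\cdot)$ obtained by max pooling as described above, and similarity $\mathrm{sim}(\cdot,\cdot)$ (notation assumed), is:

```latex
\mathcal{L}_{tri} \;=\; \max\!\Big(0,\; m
\;-\; \mathrm{sim}\big(f(\hat{S}_i),\, f(\hat{S}_i^{+})\big)
\;+\; \mathrm{sim}\big(f(\hat{S}_i),\, f(\hat{S}_i^{-})\big)\Big)
```

- where $\hat{S}_i$, $\hat{S}_i^{+}$ and $\hat{S}_i^{-}$ are the sentences generated for the query, positive and negative images.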
- The final training of the entire model can combine the adversarial reward, the object inclusion reward and the self-supervised triplet loss within self-critical sequence training.
- The gradient of the overall training objective is approximated as:
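- The gradient expression is not reproduced in this text; one reading consistent with self-critical sequence training and the weights defined below, writing $\hat{S}$ for the sampled sentence and $b$ for the reward baseline (notation assumed), is:

```latex
\nabla_{\theta}\mathcal{L}(\theta) \;\approx\;
-\big(\lambda_1\, r_{adv}(\hat{S}) + \lambda_2\, r_{obj}(\hat{S}) - b\big)\,
\nabla_{\theta}\log p_{\theta}(\hat{S})
\;+\; \lambda_3\, \nabla_{\theta}\mathcal{L}_{tri}
```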
- $\hat{S}$ denotes the sampled sentence, and $b$ denotes the baseline combination of the obtained adversarial and object inclusion rewards.
- ⁇ 1 , ⁇ 2 , and ⁇ 3 represent the weights of adversarial reward, object inclusion reward and self-supervised triplet loss, respectively, and the weight can be 0.
- FIG. 3 is a schematic diagram of an application scenario of the method for generating a subtitler according to this embodiment.
- The query image, positive image and negative image are input into the image encoder of the subtitler (Faster R-CNN) to obtain the object set {tree, man, bench, grass, dog, ...}.
- The sentence decoder of the subtitler (the lower two-layer LSTM structure in Figure 3) performs beam search decoding under the semantic constraints of the object set and generates a pseudo sentence such as "a man sitting on a bench near a tree".
- The pseudo sentences and the corresponding images are used as a set of pseudo image-sentence pairs for training the subtitler (the upper two-layer LSTM structure in Figure 3 represents the sentence decoder).
- The parameters of the image encoder can be fixed so that only the sentence decoder is trained, the image encoder can be trained after the sentence decoder training is completed, or the image encoder and the sentence decoder can be trained alternately to obtain the best-performing subtitler.
- Cross-entropy is used during training.
- the obtained parameters of the upper two-layer LSTM structure can be shared with the lower two-layer LSTM structure.
- Adversarial reward optimization: input the real sentence "a cow stands in the back of a large truck" together with the pseudo sentence generated by the subtitler into the sentence discriminator for discrimination. If the discrimination accuracy does not reach 0.5, adjust the parameters of the sentence discriminator in the direction of minimizing the adversarial loss, and then adjust the parameters of the subtitler in the direction of maximizing the adversarial reward.
- the subtitler can be optimized by alternately training (adjusting) the sentence discriminator and subtitler.
- Object inclusion reward optimization: calculate the inclusion degree of the identified objects in the pseudo sentences generated by the subtitler.
- the recognized objects include tree, man, and bench. If sentence 1 includes tree (confidence of 0.9) and sentence 2 includes tree (confidence of 0.8) and man (confidence of 0.7), the object inclusion reward of sentence 2 is higher than that of sentence 1.
- the purpose of training is to maximize the object inclusion reward, and each adjustment of the parameters increases the object inclusion reward.
- Self-supervised triplet loss optimization: the input samples in Figure 3 can be triplets of a query image, a positive image and a negative image. Different images generate different pseudo sentences, and the self-supervised triplet loss is determined by comparing the semantic similarity between the query sentence, the positive sentence and the negative sentence. The purpose of training is to reduce the self-supervised triplet loss, so that the positive sentence is semantically closer to the query sentence and the negative sentence is not semantically related to the query sentence.
- The present disclosure adopts a self-learning mode and optimizes the entire model by alternately performing the two processes of pseudo image-sentence pair generation and subtitler retraining, so as to cyclically and iteratively improve the subtitler.
- the present disclosure proposes a self-learning framework based on semantic constraints, and deeply studies the self-learning idea of unpaired image captioners. This problem is studied from the perspective of establishing a pseudo-sentence generation and iterative optimization, gradually improving the quality of sentence generation. Furthermore, semantic constraints are well integrated into the model, which fully utilizes the semantics of objects in the image to guide the training of the captioner, resulting in advanced unsupervised captioning techniques.
- the process 400 of the method for outputting subtitles includes the following steps:
- Step 401 acquiring an image to be processed.
- The electronic device on which the method for outputting subtitles runs may receive the to-be-processed image, through a wired or wireless connection, from the terminal on which the user performs subtitle editing.
- The image to be processed may be an individual image or a video file.
- If the image to be processed is a video file, the server divides the video into frames to obtain the images to be processed.
- Step 402 Input the image into the subtitle device, and output the subtitle corresponding to the image.
- the subtitler is obtained by training according to the method of steps 201-205.
- Subtitles can be automatically added to images through the subtitler.
- the subtitles can be directly output to the image, or a separate file can be generated and returned to the terminal, and the terminal can set the format of the subtitles according to the user's needs, and then output it to the image.
- The subtitler can not only output subtitles but also output the objects recognized by the image encoder, which can be used for semantic constraints during training.
- Steps 401-402 may be performed alternately with steps 201-205.
- the subtitles generated in steps 401-402 can be used as training samples in steps 201-205.
- the process 400 of the method for outputting subtitles in this embodiment embodies the application steps of the subtitler. Therefore, the solution described in this embodiment can generate training samples through the subtitler, and then use them for training the subtitler, alternately generate subtitles and retrain the subtitler, thereby optimizing the subtitler and improving the accuracy of generating subtitles.
- The present disclosure provides an embodiment of an apparatus for generating a subtitler; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be specifically applied to various electronic devices.
- the apparatus 500 for generating a subtitler in this embodiment includes: an obtaining unit 501 , an encoding unit 502 , a grouping unit 503 , a decoding unit 504 and a training unit 505 .
- The obtaining unit 501 is configured to obtain a sample image set; the encoding unit 502 is configured to input the sample image set into the image encoder of the sentence generator and output an object set; the grouping unit 503 is configured to group the object set into a first object set and a second object set, where the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set.
- The decoding unit 504 is configured to input the object set output by the image encoder into the sentence decoder of the sentence generator, which performs beam search with the first object set and the second object set as constraints in the decoding step to generate a set of pseudo image-sentence pairs; the training unit 505 is configured to train the sentence generator with the pseudo image-sentence pair set as a sample set to obtain the subtitler.
- In some embodiments, the apparatus further includes an optimization unit (not shown in the drawings) configured to optimize the subtitler in at least one of the following ways: performing adversarial training on the subtitler with a sentence discriminator; optimizing the subtitler by the degree to which the objects identified by the subtitler are included in the sentences it outputs; and optimizing the subtitler by the semantic correlation between an image triplet and the correspondingly generated sentences, where the image triplet includes a query image, a positive image and a negative image.
- The optimization unit is further configured to: extract a preset first sample set, where each first sample includes an image and a corresponding real sentence; extract a pre-established generative adversarial network, where the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler is used to encode the input image and then decode it into a sentence to obtain a pseudo sentence, and the sentence discriminator is used to determine whether an input sentence is a pseudo sentence output by the subtitler; and, based on a machine learning method, select a first sample from the first sample set and perform the following first training step: input the image in the selected first sample into the subtitler and output a pseudo sentence; input the pseudo sentence and the real sentence in the selected first sample into the sentence discriminator and output a discrimination result; compute the accuracy rate of the sentence discriminator according to the output discrimination result; and, if the accuracy rate reaches the preset value, determine that the subtitler training is completed.
- the optimization unit is further configured to: if the accuracy rate does not reach a preset value, calculate the adversarial loss of the sentence discriminator, and adjust the relevant parameters of the sentence discriminator to make the adversarial loss reduce, and re-select the first sample from the first sample set, and continue to perform the first training step.
- the optimization unit is further configured to: if the accuracy rate does not reach the preset value, calculate the adversarial reward of the subtitler, and adjust the relevant parameters of the subtitler to increase the adversarial reward , and reselect the first sample from the first sample set, and continue to perform the first training step.
- The optimization unit is further configured to: extract a preset second sample set, where each second sample includes an image; and, based on a machine learning method, select samples from the second sample set and perform the following second training step: input the image in the selected second sample into the image encoder of the subtitler and output a sample object set; input the sample object set into the sentence decoder of the subtitler and output a pseudo sentence; compute the average confidence score of the sample objects from the sample object set that are contained in the pseudo sentence, as the object inclusion reward of the pseudo sentence; and, if the object inclusion reward reaches the preset inclusion reward threshold, determine that the subtitler training is completed.
- The optimization unit is further configured to: if the object inclusion reward does not reach the preset inclusion reward threshold, adjust the relevant parameters of the subtitler so that the object inclusion reward increases, reselect a second sample from the second sample set, and continue to perform the second training step.
- The optimization unit is further configured to: extract a preset third sample set, where each third sample includes a query image, a positive image and a negative image, the positive image shares at least two objects with the query image, and the negative image and the query image have no objects in common; and, based on a machine learning method, select a third sample from the third sample set and perform the following third training step: input the query image, positive image and negative image of the selected third sample into the captioner respectively and output a query sentence, a positive sentence and a negative sentence; compute the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence; compute the self-supervised triplet loss from the first semantic similarity and the second semantic similarity; and, if the self-supervised triplet loss is less than a predetermined loss threshold, determine that the subtitler training is completed.
- The optimization unit is further configured to: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjust the relevant parameters of the captioner to reduce the self-supervised triplet loss, reselect a third sample from the third sample set, and continue to perform the third training step.
- the optimization unit is further configured to: for the query sentence, the positive sentence and the negative sentence, respectively calculate the object-based probability distribution of each word in the sentence, and perform a maximum pooling operation, Obtain the query sentence feature, positive sentence feature and negative sentence feature respectively; calculate the first semantic similarity between the query sentence feature and the positive sentence feature and calculate the second semantic similarity between the query sentence feature and the negative sentence feature.
- The optimization unit is further configured to: if the weighted sum of the adversarial reward, the object inclusion reward and the self-supervised triplet loss is greater than a predetermined target value, adjust the relevant parameters of the captioner so that the weighted sum decreases.
- the image encoder includes a two-layer LSTM with a region-level attention mechanism, wherein the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information , while the second layer LSTM is the language model used to generate sentences.
- the present disclosure provides an embodiment of an apparatus for outputting subtitles.
- the apparatus embodiment corresponds to the method embodiment shown in FIG. 4 .
- the device can be specifically applied to various electronic devices.
- the apparatus 600 for outputting subtitles in this embodiment includes: an acquiring unit 601, configured to acquire an image to be processed; and an output unit, configured to input the image into the subtitler generated as described above and output the subtitle corresponding to the image.
- the present disclosure also provides an electronic device and a readable storage medium.
- FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
- Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
- the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700.
- the computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
- An input/output (I/O) interface 705 is also connected to bus 704 .
- Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver.
- the communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
- Computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various specialized artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like.
- the computing unit 701 performs the various methods and processes described above, eg, methods for generating a subtitler.
- a method for generating a subtitler may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708 .
- part or all of the computer program may be loaded and/or installed on device 700 via ROM 702 and/or communication unit 709.
- When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method for generating a subtitler described above may be performed.
- the computing unit 701 may be configured by any other suitable means (eg, by means of firmware) to perform a method for generating a subtitler.
- the method and apparatus for generating a subtitler and the method and apparatus for outputting subtitles aim to provide an unsupervised solution for image captioning. Unlike existing image captioning methods, which rely heavily on a large number of image-sentence pairs for training, the present disclosure eliminates this dependence by learning an image captioner in a self-learning manner. The captioner can be trained with unpaired image and sentence data, which corresponds to more realistic scenarios.
- Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that, when executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
- the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
- the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which a user can provide input to the computer.
- Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic, voice, or tactile input).
- the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
- the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
- a computer system can include clients and servers. Clients and servers are generally remote from each other and usually interact through a communication network. The relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
- the server can be a distributed system server, or a server combined with a blockchain.
- the server can also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (17)
- A method for generating a subtitler, comprising: acquiring a sample image set; inputting the sample image set into an image encoder of a sentence generator and outputting an object set; grouping the object set into a first object set and a second object set, wherein the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; inputting the object set output by the image encoder into a sentence decoder of the sentence generator, and performing a beam search in the decoding step with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and training the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the subtitler.
- The method according to claim 1, wherein the method further comprises optimizing the subtitler in at least one of the following ways: optimizing the subtitler by adversarially training it with a sentence discriminator; optimizing the subtitler by the degree to which the objects recognized by the subtitler are included in the sentences output by the subtitler; optimizing the subtitler by the semantic relevance between image triplets and the correspondingly generated sentences, wherein an image triplet includes a query image, a positive image, and a negative image.
- The method according to claim 2, wherein optimizing the subtitler by adversarially training it with a sentence discriminator comprises: extracting a preset first sample set, wherein each first sample includes an image and a corresponding real sentence; extracting a pre-established generative adversarial network, wherein the generative adversarial network includes the subtitler and a sentence discriminator, the subtitler is used to perform image encoding on an input image and then sentence decoding to obtain a pseudo-sentence, and the sentence discriminator is used to determine whether an input sentence is a pseudo-sentence output by the subtitler; based on a machine learning method, selecting a first sample from the first sample set, and performing the following first training step: inputting the image in the selected first sample into the subtitler and outputting a pseudo-sentence; inputting the pseudo-sentence and the real sentence in the selected first sample into the sentence discriminator and outputting a discrimination result; calculating the accuracy of the sentence discriminator according to the output discrimination results; and if the accuracy reaches a preset value, determining that the training of the subtitler is completed.
- The method according to claim 3, wherein the method further comprises: if the accuracy does not reach the preset value, calculating an adversarial loss of the sentence discriminator, adjusting relevant parameters of the sentence discriminator so that the adversarial loss decreases, reselecting a first sample from the first sample set, and continuing the first training step.
- The method according to claim 3, wherein the method further comprises: if the accuracy does not reach the preset value, calculating an adversarial reward of the subtitler, adjusting relevant parameters of the subtitler so that the adversarial reward increases, reselecting a first sample from the first sample set, and continuing the first training step.
- The method according to claim 2, wherein optimizing the subtitler by the degree to which the objects recognized by the subtitler are included in the sentences output by the subtitler comprises: extracting a preset second sample set, wherein each second sample includes an image; based on a machine learning method, selecting a sample from the second sample set, and performing the following second training step: inputting the image in the selected second sample into the image encoder of the subtitler and outputting a sample object set; inputting the sample object set into the sentence decoder of the subtitler and outputting a pseudo-sentence; calculating the average confidence score of the sample objects from the sample object set that are contained in the pseudo-sentence, as the object inclusion reward of the pseudo-sentence; and if the object inclusion reward reaches a preset inclusion reward threshold, determining that the training of the subtitler is completed.
- The method according to claim 6, wherein the method further comprises: if the object inclusion reward does not reach the preset inclusion reward threshold, adjusting relevant parameters of the subtitler so that the object inclusion reward increases, reselecting a second sample from the second sample set, and continuing the second training step.
- The method according to claim 2, wherein optimizing the subtitler by the semantic relevance between image triplets and the correspondingly generated sentences comprises: extracting a preset third sample set, wherein each third sample includes a query image, a positive image, and a negative image, the positive image shares at least two objects with the query image, while the negative image and the query image have no objects in common; based on a machine learning method, selecting a third sample from the third sample set, and performing the following third training step: inputting the query image, the positive image, and the negative image in the selected third sample into the subtitler respectively, and outputting a query sentence, a positive sentence, and a negative sentence; calculating a first semantic similarity between the query sentence and the positive sentence and a second semantic similarity between the query sentence and the negative sentence; calculating a self-supervised triplet loss according to the first semantic similarity and the second semantic similarity; and if the self-supervised triplet loss is less than a predetermined loss threshold, determining that the training of the subtitler is completed.
- The method according to claim 8, wherein the method further comprises: if the self-supervised triplet loss is not less than the predetermined loss threshold, adjusting relevant parameters of the subtitler so that the self-supervised triplet loss decreases, reselecting a third sample from the third sample set, and continuing the third training step.
- The method according to claim 8, wherein calculating the first semantic similarity between the query sentence and the positive sentence and the second semantic similarity between the query sentence and the negative sentence comprises: for the query sentence, the positive sentence, and the negative sentence, respectively calculating the object-based probability distribution of each word in the sentence and performing a max-pooling operation to obtain a query sentence feature, a positive sentence feature, and a negative sentence feature respectively; and calculating the first semantic similarity between the query sentence feature and the positive sentence feature and the second semantic similarity between the query sentence feature and the negative sentence feature.
- The method according to any one of claims 2-10, wherein the method further comprises: if the weighted sum of the adversarial reward, the object inclusion reward, and the self-supervised triplet loss is greater than a predetermined target value, adjusting relevant parameters of the subtitler so that the weighted sum decreases.
- The method according to any one of claims 1-10, wherein the image encoder includes a two-layer LSTM with a region-level attention mechanism, wherein the first-layer LSTM acts as a top-down attention module that computes object-level attention based on contextual information, while the second-layer LSTM is a language model used to generate sentences.
- A method for outputting subtitles, comprising: acquiring an image to be processed; and inputting the image into a subtitler generated by the method according to any one of claims 1-12, and outputting the subtitle corresponding to the image.
- An apparatus for generating a subtitler, comprising: an acquiring unit configured to acquire a sample image set; an encoding unit configured to input the sample image set into an image encoder of a sentence generator and output an object set; a grouping unit configured to group the object set into a first object set and a second object set, wherein the first object set is the set of objects included in a predetermined object set and the second object set is the set of objects excluded from the predetermined object set; a decoding unit configured to input the object set output by the image encoder into a sentence decoder of the sentence generator, and perform a beam search in the decoding step with the first object set and the second object set as constraints to generate a set of pseudo image-sentence pairs; and a training unit configured to train the sentence generator with the set of pseudo image-sentence pairs as a sample set to obtain the subtitler.
- An apparatus for outputting subtitles, comprising: an acquiring unit configured to acquire an image to be processed; and an output unit configured to input the image into a subtitler generated by the method according to any one of claims 1-12 and output the subtitle corresponding to the image.
- An electronic device, comprising: one or more processors; and a storage apparatus storing one or more computer programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-13.
- A computer-readable medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-13.
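- Claim 1 above generates pseudo image-sentence pairs by running a beam search whose decoding steps are constrained by the first (included) and second (excluded) object sets. The toy sketch below shows one greatly simplified way such constraints could be enforced during beam expansion and completion; the fixed per-step word scores, data structures, and names are assumptions for illustration only.

```python
def constrained_beam_search(step_log_probs, include, exclude,
                            beam_size=3, max_len=12, eos="<eos>"):
    """Toy beam search over a fixed per-step word distribution.

    include: objects every finished sentence must mention (first object set).
    exclude: objects that may never appear (second object set).
    """
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for w, lp in step_log_probs.items():
                if w in exclude:          # hard constraint: excluded objects are never expanded
                    continue
                candidates.append((words + [w], score + lp))
        # Move hypotheses that just emitted <eos> to the finished pool,
        # keeping only those that cover all included objects.
        for words, score in candidates:
            if words[-1] == eos and include.issubset(words[:-1]):
                finished.append((words, score))
        # Keep the top `beam_size` unfinished hypotheses.
        beams = sorted((c for c in candidates if c[0][-1] != eos),
                       key=lambda b: b[1], reverse=True)[:beam_size]
        if not beams:
            break
    return max(finished, key=lambda b: b[1])[0] if finished else []


# Hypothetical usage: sentences must mention "dog" and "grass" and must avoid "cat".
toy_scores = {"dog": -0.1, "grass": -0.4, "on": -1.0, "a": -1.2, "cat": -1.5, "<eos>": -0.3}
print(constrained_beam_search(toy_scores, include={"dog", "grass"}, exclude={"cat"}))
# -> ['dog', 'grass', '<eos>'] with these toy scores
```

- In this toy example, candidate words from the excluded object set are never expanded, and a hypothesis is only accepted as finished when it mentions every object in the included set, which mirrors the constraints described in claim 1.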
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023559796A JP2024512628A (ja) | 2021-03-30 | 2022-01-06 | キャプション生成器を生成するための方法および装置、並びにキャプションを出力するための方法および装置 |
US18/284,225 US20240177506A1 (en) | 2021-03-30 | 2022-01-06 | Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110338045.X | 2021-03-30 | ||
CN202110338045.XA CN113052090B (zh) | 2021-03-30 | 2021-03-30 | 用于生成字幕器以及输出字幕的方法和装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022206094A1 true WO2022206094A1 (zh) | 2022-10-06 |
Family
ID=76516172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/070476 WO2022206094A1 (zh) | 2021-03-30 | 2022-01-06 | 用于生成字幕器以及输出字幕的方法和装置 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240177506A1 (zh) |
JP (1) | JP2024512628A (zh) |
CN (1) | CN113052090B (zh) |
WO (1) | WO2022206094A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113052090B (zh) * | 2021-03-30 | 2024-03-05 | 京东科技控股股份有限公司 | 用于生成字幕器以及输出字幕的方法和装置 |
CN113628288B (zh) * | 2021-07-06 | 2024-05-31 | 上海电力大学 | 一种基于编-解码器结构的可控图像字幕生成优化方法 |
CN114821271B (zh) * | 2022-05-19 | 2022-09-16 | 平安科技(深圳)有限公司 | 模型训练方法、图像描述生成方法、装置及存储介质 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8548794B2 (en) * | 2003-07-02 | 2013-10-01 | University Of Southern California | Statistical noun phrase translation |
US9183466B2 (en) * | 2013-06-15 | 2015-11-10 | Purdue Research Foundation | Correlating videos and sentences |
US9811765B2 (en) * | 2016-01-13 | 2017-11-07 | Adobe Systems Incorporated | Image captioning with weak supervision |
WO2018212822A1 (en) * | 2017-05-16 | 2018-11-22 | Google Inc. | Suggested actions for images |
US11113599B2 (en) * | 2017-06-22 | 2021-09-07 | Adobe Inc. | Image captioning utilizing semantic text modeling and adversarial learning |
US11514252B2 (en) * | 2018-06-10 | 2022-11-29 | Adobe Inc. | Discriminative caption generation |
CN110135567A (zh) * | 2019-05-27 | 2019-08-16 | 中国石油大学(华东) | 基于多注意力生成对抗网络的图像字幕生成方法 |
CN111126479A (zh) * | 2019-12-20 | 2020-05-08 | 山东浪潮人工智能研究院有限公司 | 一种基于无监督独特性优化的图像描述生成方法及系统 |
CN111612103B (zh) * | 2020-06-23 | 2023-07-11 | 中国人民解放军国防科技大学 | 结合抽象语义表示的图像描述生成方法、系统及介质 |
- 2021
- 2021-03-30 CN CN202110338045.XA patent/CN113052090B/zh active Active
- 2022
- 2022-01-06 JP JP2023559796A patent/JP2024512628A/ja active Pending
- 2022-01-06 US US18/284,225 patent/US20240177506A1/en active Pending
- 2022-01-06 WO PCT/CN2022/070476 patent/WO2022206094A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170061250A1 (en) * | 2015-08-28 | 2017-03-02 | Microsoft Technology Licensing, Llc | Discovery of semantic similarities between images and text |
CN107608943A (zh) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | 融合视觉注意力和语义注意力的图像字幕生成方法及系统 |
CN112084841A (zh) * | 2020-07-27 | 2020-12-15 | 齐鲁工业大学 | 跨模态的图像多风格字幕生成方法及系统 |
CN112508048A (zh) * | 2020-10-22 | 2021-03-16 | 复旦大学 | 图像描述的生成方法和装置 |
CN113052090A (zh) * | 2021-03-30 | 2021-06-29 | 京东数字科技控股股份有限公司 | 用于生成字幕器以及输出字幕的方法和装置 |
Also Published As
Publication number | Publication date |
---|---|
CN113052090B (zh) | 2024-03-05 |
US20240177506A1 (en) | 2024-05-30 |
CN113052090A (zh) | 2021-06-29 |
JP2024512628A (ja) | 2024-03-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22778283; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 18284225; Country of ref document: US |
| WWE | Wipo information: entry into national phase | Ref document number: 2023559796; Country of ref document: JP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWE | Wipo information: entry into national phase | Ref document number: 11202306944T; Country of ref document: SG |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/02/2024) |