WO2020215988A1 - Video description generation method, apparatus, device, and storage medium
- Publication number
- WO2020215988A1 (PCT/CN2020/081721)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- visual
- target
- candidate
- feature
- Prior art date
Classifications
- H04N21/4884—Data services, e.g. news ticker, for displaying subtitles
- H04N21/488—Data services, e.g. news ticker
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/253—Fusion techniques of extracted features
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; Using context analysis; Selection of dictionaries
- G06V10/772—Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/47—Detecting features for summarising video content
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
- H04N21/23418—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/2343—Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234336—Reformatting operations by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
- H04N21/235—Processing of additional data, e.g. scrambling of additional data or processing content descriptors
- H04N21/251—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H04N21/4402—Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/8549—Creating video summaries, e.g. movie trailer
- H04N5/278—Subtitling
Definitions
- The embodiments of the present application relate to the field of artificial intelligence technology and the field of video description, and in particular, to a video description generation method, apparatus, device, and storage medium.
- Video Captioning is a technology for generating content description information for videos.
- In the related art, video description generation models are usually used to automatically generate video descriptions for videos, and these models are mostly based on the Encoder-Decoder framework.
- In one approach, the video description generation model first extracts the visual features of the video through the encoder and then inputs the extracted visual features into the decoder; the decoder sequentially generates decoded words according to the visual features, and the generated decoded words are finally combined into a video description.
- the video description generation model in the related technology only focuses on the currently processed video.
- However, the same decoded word may be used in multiple videos whose semantics are similar but not identical, so a model that focuses only on the currently processed video has a limited attention perspective, which in turn affects the quality of the generated video description.
- A method for generating a video description, executed by a computer device, the method including:
- encoding a target video through the encoder of a video description generation model to obtain target visual features of the target video;
- decoding the target visual features through the basic decoder of the video description generation model by using an attention mechanism, to obtain the first selection probability corresponding to each candidate word;
- decoding the target visual features through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word, where the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated according to the related videos corresponding to the candidate word;
- determining decoded words among the candidate words according to the first selection probability and the second selection probability; and
- generating a video description corresponding to the target video according to the decoded words.
- A video description generating apparatus, provided in a computer device, the apparatus including:
- the encoding module is used to encode the target video through the encoder of the video description generation model to obtain the target visual features of the target video;
- the first decoding module is used to decode the target visual features through the basic decoder of the video description generation model by using an attention mechanism, to obtain the first selection probability corresponding to each candidate word;
- the second decoding module is used to decode the target visual features through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word, where the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated according to the related videos corresponding to the candidate word;
- the first determining module is configured to determine the decoded words in the candidate vocabulary according to the first selection probability and the second selection probability;
- the first generating module is used to generate a video description corresponding to the target video according to each decoded word.
- a computer device that includes one or more processors and a memory.
- the memory stores at least one computer-readable instruction, at least one program, code set, or computer-readable instruction set, which is loaded and executed by the one or more processors to implement the video description generation method described above.
- One or more computer-readable storage media store at least one computer-readable instruction, at least one program, code set, or computer-readable instruction set, which is loaded and executed by one or more processors to implement the video description generation method described above.
- A computer program product, which, when run on a computer, causes the computer to execute the video description generation method described above.
- FIG. 1 is a schematic diagram of the principle of using the SA-LSTM model to generate a video description in the related art;
- FIG. 2 is a schematic diagram of an implementation of the video description generation method in a video classification and retrieval scenario in an embodiment;
- FIG. 3 is a schematic diagram of an implementation of the video description generation method in an assistance scenario for the visually impaired in an embodiment;
- FIG. 4 is a schematic diagram of an implementation environment in an embodiment;
- FIG. 5 is a flowchart of a video description generation method in an embodiment;
- FIG. 6 shows video descriptions generated by video description generation models in an embodiment;
- FIG. 7 is a flowchart of a video description generation method in another embodiment;
- FIG. 8 is a schematic structural diagram of a video description generation model in an embodiment;
- FIG. 9 is a flowchart of the process in which the auxiliary decoder determines the candidate word selection probability in an embodiment;
- FIG. 10 shows video descriptions generated by the video description generation models of the related art and of an embodiment of this application;
- FIG. 11 is a flowchart of a process of generating reference visual context information corresponding to candidate words in an embodiment;
- FIG. 12 is a schematic diagram of an implementation of the process of generating reference visual context information in an embodiment;
- FIG. 13 is a structural block diagram of a video description generating apparatus in an embodiment;
- FIG. 14 is a schematic structural diagram of a computer device in an embodiment.
- In the related art, the video description generation model based on the encoding-decoding framework may be a Soft Attention Long Short-Term Memory (SA-LSTM) model.
- The process by which the SA-LSTM model generates a video description is shown in FIG. 1.
- The SA-LSTM model first performs feature extraction on the input video 11 to obtain the visual features 12 (v_1, v_2, ..., v_n) of the video 11. Then, using the soft attention mechanism, the SA-LSTM model calculates the weight 14 of each visual feature 12 for the current (that is, the t-th) decoding according to the previous hidden state 13 (the hidden state output during the (t-1)-th decoding) and the visual features 12, and performs a weighted sum over the visual features 12 and the weights 14 to obtain the context information 15 of the current decoding. Further, the SA-LSTM model outputs the current hidden state 17 according to the previous hidden state 13, the previous decoded word 16, and the context information 15, and then determines the current decoded word 18 according to the current hidden state 17.
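- For illustration, the soft attention step described above can be sketched in Python as follows. This is a minimal sketch rather than the patent's implementation: the additive (tanh) form of the scoring function and all dimensions are assumptions.

```python
import numpy as np

def soft_attention(h_prev, features, W_h, W_v, w_a):
    """Score each visual feature against the previous hidden state,
    normalize the scores with softmax, and return the weighted-sum context."""
    # score_i = w_a^T tanh(W_h h_{t-1} + W_v v_i) for each visual feature v_i
    scores = np.array([w_a @ np.tanh(W_h @ h_prev + W_v @ v) for v in features])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # attention weights
    context = (weights[:, None] * features).sum(axis=0)  # context information
    return context, weights

# Toy usage with assumed dimensions: 5 visual features of dim 64, hidden state of dim 128
rng = np.random.default_rng(0)
h_prev, feats = rng.normal(size=128), rng.normal(size=(5, 64))
W_h, W_v, w_a = rng.normal(size=(32, 128)), rng.normal(size=(32, 64)), rng.normal(size=32)
c_t, a_t = soft_attention(h_prev, feats, W_h, W_v, w_a)
```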
- the SA-LSTM model only pays attention to the visual features of the current video, and correspondingly, the determined decoded words are only related to the visual features of the current video.
- However, the same decoded word may appear in multiple video clips and express similar but not exactly the same meanings in different clips (that is, the decoded word may correspond to similar but not exactly the same visual features). As a result, the accuracy of the decoded words output by the SA-LSTM model is low, which in turn affects the quality of the final video description.
- the video description generation model in the embodiments of this application adopts the structure of "encoder + basic decoder + auxiliary decoder".
- a memory mechanism is introduced to store the relationship between each candidate vocabulary in the vocabulary and related videos in the memory structure, and the memory structure is added to the auxiliary decoder.
- In this way, the video description generation model provided by the embodiments of this application can focus not only on the current video (through the basic decoder) but also on other videos whose visual features are similar to those of the current video (through the auxiliary decoder), thereby avoiding the limited attention perspective caused by focusing only on the current video, improving the accuracy of the output decoded words, and improving the quality of the generated video description.
- the video description generation method provided in the embodiment of the present application can be used in any of the following scenarios.
- the video description generation model in the embodiment of the present application can be implemented as a video management application or a part of a video management application.
- After video clips that do not contain video descriptions are input into the video management application, the application extracts the visual features of each clip through the encoder of the video description generation model, decodes the visual features through the basic decoder and the auxiliary decoder, combines the decoding results of the two decoders to determine the decoded words, and then generates a video description for the clip based on the decoded words.
- In one embodiment, the video management application classifies the video clips based on the video descriptions (for example, through semantic recognition) and adds corresponding category tags to the clips. During subsequent video retrieval, the video management application can return the video clips that meet the retrieval conditions according to the retrieval conditions and the category tags of each clip.
- In a Visual Question Answering (VQA) scenario, the video description generation model in the embodiment of the present application can be implemented as an intelligent question answering application or a part of such an application.
- When the intelligent question answering application obtains a video and a question about the video, it generates the video description corresponding to the video through the video description generation model, performs semantic recognition on the question and the video description to generate the answer to the question, and then displays the answer.
- the video description generation model in the embodiment of the present application can be implemented as a voice prompt application or a part of the voice prompt application.
- When the voice prompt application (for example, running on auxiliary equipment used by the visually impaired) collects video of the environment around the visually impaired person through a camera, it encodes and decodes the environmental video through the video description generation model to generate the video description corresponding to the environmental video.
- the voice prompt application can convert the video description from text to voice and perform voice broadcast to help the visually impaired understand the surrounding environment.
- a camera 32 and a bone conduction earphone 33 are provided on glasses 31 worn by a visually impaired person.
- The camera 32 collects images of the environment ahead and captures a piece of environmental video 34.
- The glasses 31 generate a video description for the environmental video 34 through the processor, such as "a man is walking a dog in front", convert the video description from text to voice, and then play it through the bone conduction earphone 33, so that the visually impaired person can avoid obstacles according to the voice prompt.
- the method provided in the embodiment of the present application can also be applied to other scenarios where a video description needs to be generated for a video, and the embodiment of the present application does not limit specific application scenarios.
- the video description generation method provided in the embodiments of the present application can be applied to computer devices such as terminals or servers.
- The video description generation model in this embodiment of the application can be implemented as an application or a part of an application and installed in the terminal, so that the terminal has the function of generating video descriptions; alternatively, the video description generation model can be deployed on the background server of the application, so that the server provides the video description generation function for the application in the terminal.
- FIG. 4 shows a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application.
- This implementation environment includes a terminal 410 and a server 420.
- the terminal 410 and the server 420 communicate data through a communication network.
- In one embodiment, the communication network may be a wired network or a wireless network, and may be at least one of a local area network, a metropolitan area network, and a wide area network.
- In one embodiment, the application may be a video management application, an intelligent question and answer application, a voice prompt application, a subtitle generation application (adding commentary subtitles to the video picture), and so on, which is not limited in the embodiments of this application.
- In one embodiment, the terminal may be a mobile terminal such as a mobile phone, a tablet computer, a laptop portable computer, or an auxiliary device for the visually impaired, or a terminal such as a desktop computer or a projection computer, which is not limited in the embodiments of this application.
- the server 420 may be implemented as a server, or as a server cluster formed by a group of servers, and it may be a physical server or a cloud server. In one embodiment, the server 420 is a background server of the application in the terminal 410.
- a pre-trained video description generation model 421 is stored in the server 420.
- the application program transmits the target video to the server 420 through the terminal 410.
- After the server 420 receives the target video, it inputs the target video into the video description generation model 421.
- The video description generation model 421 extracts the features of the target video through the encoder 421A, decodes the extracted features through the basic decoder 421B and the auxiliary decoder 421C respectively, generates a video description according to the decoding results, and feeds the video description back to the terminal 410.
- the video description is displayed by the application in the terminal 410.
- In other possible implementations, when the video description generation model is deployed in the terminal 410, the terminal 410 can generate the video description of the target video locally without the server 420, which improves the speed at which the terminal obtains the video description and reduces the delay caused by interacting with the server.
- FIG. 5 shows a flowchart of a method for generating a video description provided by an exemplary embodiment of the present application.
- the method is used in a computer device as an example for description. The method includes the following steps.
- Step 501 Encode the target video through the encoder of the video description generation model to obtain the target visual features of the target video.
- The function of the encoder in the video description generation model is to extract the target visual features from the target video and input the extracted features into the decoders (including the basic decoder and the auxiliary decoder).
- the target visual feature is represented by a vector.
- In one embodiment, the video description generation model uses pre-trained deep Convolutional Neural Networks (CNNs) as the encoder for visual feature extraction, and the target video needs to be preprocessed before feature extraction so that the preprocessed target video meets the input requirements of the encoder.
- the encoder inputs the target visual features into the basic decoder and the auxiliary decoder respectively, and executes the following steps 502 and 503. It should be noted that there is no strict sequence between the following steps 502 and 503, that is, step 502 and step 503 can be executed synchronously, and this embodiment does not limit the execution order of the two.
- Step 502 Decode the target visual features through the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate word.
- The basic decoder is used to decode candidate words matching the target visual features by using the attention mechanism.
- the basic decoder pays attention to the target video, so as to perform decoding based on the target visual characteristics of the target video.
- In one embodiment, the basic decoder may be a Recurrent Neural Network (RNN) decoder that uses an attention mechanism.
- In one embodiment, the basic decoder uses the SA-LSTM model; at each decoding step, the basic decoder uses the attention mechanism to determine the first selection probability corresponding to each candidate word in the vocabulary according to the hidden state output by the previous decoding, the previous decoded word, and the target visual features.
- In other embodiments, the basic decoder may also adopt other RNN decoders based on the attention mechanism, which is not limited in the embodiments of the present application.
- the decoding process of the basic decoder is essentially a classification task, that is, the (first) selection probability of each candidate vocabulary in the vocabulary is calculated through the softmax function.
- The greater the first selection probability, the higher the degree of match between the candidate word and the context information of the video; that is, the meaning expressed by the candidate word matches the context more closely.
- Step 503 Decode the target visual features through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word.
- The memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated based on the related videos corresponding to the candidate word.
- Different from the basic decoder, the auxiliary decoder in this example focuses on the correlation between candidate words and related videos. Therefore, when the auxiliary decoder decodes the target visual features, it can capture the visual features of the same candidate word in different videos and match them with the target visual features of the target video, improving the accuracy of determining the decoded words.
- the association between candidate words and related videos is stored in the memory structure of the auxiliary decoder, and is reflected by the correspondence between the candidate words and the reference visual context information.
- The reference visual context information corresponding to a candidate word is used to indicate the visual context features of the related videos containing the candidate word, and is generated according to the sample videos related to the candidate word.
- a graph-based algorithm can also be used to construct the association between the candidate vocabulary and the related video, which is not limited in this application.
- the decoding process of the auxiliary decoder is essentially a classification task, that is, the (second) selection probability of each candidate vocabulary in the vocabulary is calculated through the softmax function.
- the basic decoder and the auxiliary decoder correspond to the same vocabulary, and the greater the second selection probability, the higher the matching degree between the candidate vocabulary and the context information of the video, that is, the meaning expressed by the candidate vocabulary matches the context more closely.
- Step 504 Determine the decoded word among the candidate words according to the first selection probability and the second selection probability.
- In one embodiment, the video description generation model integrates the first selection probability output by the basic decoder and the second selection probability output by the auxiliary decoder, and determines the decoded word obtained by this decoding from the candidate words in the vocabulary.
- Step 505 Generate a video description corresponding to the target video according to each decoded word.
- Since the video description is usually a natural language sentence composed of multiple decoded words, the above steps 502 to 504 need to be repeated at each decoding step to generate the decoded words of the video description in turn; the multiple decoded words are then connected to finally generate the video description.
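- A high-level sketch of this generation loop is shown below; the encoder and decoder objects and their methods are illustrative assumptions, not interfaces defined by the patent.

```python
def generate_description(encoder, basic_decoder, aux_decoder, video, vocab,
                         lam=0.15, max_len=20):
    """Greedy decoding: at every step, fuse the two decoders' probabilities
    and pick the candidate word with the highest fused probability."""
    feats = encoder(video)                            # target visual features
    word, h = "<bos>", basic_decoder.init_hidden()
    words = []
    for _ in range(max_len):
        p_b, h = basic_decoder.step(word, h, feats)   # first selection probs
        p_m = aux_decoder.step(word, h, feats)        # second selection probs
        p = (1 - lam) * p_b + lam * p_m               # target selection probs
        word = vocab[int(p.argmax())]
        if word == "<eos>":
            break
        words.append(word)
    return " ".join(words)
```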
- To sum up, in the embodiments of this application, the encoder of the video description generation model encodes the target video to obtain the target visual features; the target visual features are then decoded by both the attention-based basic decoder and the auxiliary decoder to obtain the first selection probability and the second selection probability of each candidate word; the two probabilities are combined to determine the decoded word among the candidate words, and the video description is generated based on the multiple decoded words.
- Since the memory structure of the auxiliary decoder contains the reference visual context information corresponding to each candidate word, which is generated based on the related videos of the candidate word, the auxiliary decoder can attend to the correlation between candidate words and videos other than the current video during decoding, improving the accuracy of decoded word selection and the quality of the subsequently generated video description.
- the video description generated by the video description generation model in the related technology is "a woman is mixing ingredients in a bowl.” (a woman is currently Mixing the materials in the bowl); and the video description generated by the video description generation model in the embodiment of this application is described as "a woman is pouring liquid into a bowl.” (a woman pours liquid into the bowl).
- the video description generation model in the related art cannot identify the "pouring” in the video 61, and in the embodiment of the present application, the memory structure of the auxiliary decoder contains the difference between "pouring" and the related video picture 62.
- the relevance of the video ie, refer to the visual context information), so it can accurately decode the decoding word "pouring", which improves the description quality of the video description.
- FIG. 7 shows a flowchart of a method for generating a video description provided by another exemplary embodiment of the present application.
- the method is used in computer equipment as an example for description. The method includes the following steps.
- Step 701 Encode the target video through the encoder to obtain the two-dimensional visual features and the three-dimensional visual features of the target video, where the two-dimensional visual features indicate the features of single frame images, and the three-dimensional visual features indicate the temporal features of continuous image frames.
- Compared with images, the visual features of a video include not only the image features of single frame images (that is, two-dimensional visual features) but also the temporal features of continuous image frames (that is, three-dimensional visual features). Therefore, both need to be extracted when the target video is encoded.
- the encoder includes a first sub-encoder for extracting two-dimensional visual features and a second sub-encoder for extracting three-dimensional visual features.
- In one embodiment, when the target video is encoded, the target video is divided into independent image frames, and the first sub-encoder performs feature extraction on each image frame to obtain the two-dimensional visual features; the target video is also divided into several video segments (each containing several continuous image frames), and the second sub-encoder performs feature extraction on each video segment to obtain the three-dimensional visual features.
- In one embodiment, the first sub-encoder adopts the ResNet-101 model (a residual network with a depth of 101) pre-trained on the ImageNet data set (a large-scale image database for visual object recognition research), and the second sub-encoder adopts the ResNeXt-101 model pre-trained on the Kinetics data set.
- the first sub-encoder and the second sub-encoder may also adopt other models, which are not limited in the embodiment of the present application.
- As shown in FIG. 8, the encoder 81 extracts the two-dimensional visual features 811 and the three-dimensional visual features 812.
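- A sketch of such a two-branch encoder built from off-the-shelf pre-trained models is shown below. Note one substitution: torchvision does not ship a Kinetics-pretrained ResNeXt-101, so its Kinetics-pretrained r3d_18 video model stands in for the patent's second sub-encoder.

```python
import torch
from torchvision.models import resnet101
from torchvision.models.video import r3d_18

# 2D sub-encoder: ImageNet-pretrained ResNet-101 with the classifier removed.
cnn2d = resnet101(weights="IMAGENET1K_V1")
cnn2d.fc = torch.nn.Identity()           # keep the 2048-d pooled frame feature

# 3D sub-encoder stand-in: Kinetics-pretrained r3d_18 (the patent uses ResNeXt-101).
cnn3d = r3d_18(weights="KINETICS400_V1")
cnn3d.fc = torch.nn.Identity()           # keep the 512-d clip feature

frames = torch.randn(16, 3, 224, 224)    # 16 sampled frames of the target video
clip = torch.randn(1, 3, 16, 112, 112)   # one 16-frame segment (N, C, T, H, W)
with torch.no_grad():
    f2d = cnn2d(frames)                  # per-frame 2D features: (16, 2048)
    f3d = cnn3d(clip)                    # per-segment 3D feature: (1, 512)
```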
- Step 702 Convert the two-dimensional visual features and the three-dimensional visual features to the same feature dimension to obtain the target visual features.
- In one embodiment, the video description generation model converts the two-dimensional visual features and the three-dimensional visual features into the same feature dimension of a hidden space to obtain the target visual features.
- For a two-dimensional visual feature f_l, the converted target visual feature is f'_l = M_f · f_l + b_f; for a three-dimensional visual feature v_n, the converted target visual feature is v'_n = M_v · v_n + b_v, where M_f and M_v are conversion matrices, and b_f and b_v are bias terms.
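- The conversion is a plain per-feature linear map; a minimal sketch with assumed dimensions:

```python
import numpy as np

def project(features, M, b):
    """Map features into the shared hidden dimension: x' = M x + b."""
    return features @ M.T + b

rng = np.random.default_rng(0)
d_hidden = 512                           # shared hidden dimension (assumed)
M_f, b_f = rng.normal(size=(d_hidden, 2048)), np.zeros(d_hidden)
M_v, b_v = rng.normal(size=(d_hidden, 512)), np.zeros(d_hidden)

f2d = rng.normal(size=(16, 2048))        # per-frame 2D features
f3d = rng.normal(size=(8, 512))          # per-segment 3D features
f2d_h = project(f2d, M_f, b_f)           # (16, 512)
f3d_h = project(f3d, M_v, b_v)           # (8, 512)
```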
- Step 703 Decode the target visual features through the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate word.
- The basic decoder uses the attention mechanism to decode candidate words matching the target visual features.
- In one embodiment, the video description generation model uses the Gated Recurrent Unit (GRU) as the skeleton of the basic decoder.
- As shown in FIG. 8, the basic decoder 82 includes the GRU 821, the GRU 822, and the GRU 823.
- When the basic decoder performs the t-th decoding, the following steps may be included.
- First, when performing the t-th decoding, the basic decoder obtains the (t-1)-th decoded word and the (t-1)-th hidden state obtained from the (t-1)-th decoding, where the (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
- At each decoding step, the basic decoder outputs a hidden state and subsequently determines the decoded word of this decoding based on the hidden state.
- Since the hidden state output during the previous decoding and the previous decoded word are needed for each decoding, the basic decoder needs to obtain the (t-1)-th decoded word and the (t-1)-th hidden state when performing the t-th decoding.
- As shown in FIG. 8, the basic decoder 82 obtains the (t-1)-th hidden state h_{t-1} output by the GRU 821 and the word vector e_{t-1} of the corresponding (t-1)-th decoded word w_{t-1}.
- In different decoding stages, the correlation between each visual feature and the current decoded word is different. Therefore, before calculating the first selection probability, the basic decoder also needs to process the target visual features output by the encoder with the attention mechanism (a weighted sum) to obtain the target visual context information of this decoding.
- In one embodiment, the basic decoder processes the two-dimensional visual features and the three-dimensional visual features respectively to obtain two-dimensional visual context information and three-dimensional visual context information, and fuses the two to obtain the target visual context information.
- For the two-dimensional visual features f'_i, the attention weights are a_{i,t} = f_att(h_{t-1}, f'_i), and the two-dimensional visual context information is the weighted sum c_{t,2D} = Σ_i a_{i,t} · f'_i, where h_{t-1} is the (t-1)-th hidden state (represented by a vector) and f_att is the attention function.
- For the three-dimensional visual features v'_i, the attention weights are a'_{i,t} = f_att(h_{t-1}, v'_i), and the three-dimensional visual context information is c_{t,3D} = Σ_i a'_{i,t} · v'_i.
- The same attention function f_att is used when processing the two-dimensional visual features and the three-dimensional visual features.
- As shown in FIG. 8, the attention mechanism (f_att in the figure) is used to process the two-dimensional visual features 811 and the three-dimensional visual features 812 respectively to obtain C_{t,2D} and C_{t,3D}, and the two processing results are fused to obtain the target visual context information C_t of the t-th decoding.
- Then, the GRU outputs the t-th hidden state of the t-th decoding according to the (t-1)-th decoded word, the (t-1)-th hidden state, and the target visual context information.
- In one embodiment, the way the GRU determines the t-th hidden state can be expressed as: h_t = GRU(h_{t-1}, [e_{t-1}; C_t]), where e_{t-1} is the word vector of the (t-1)-th decoded word and [;] denotes vector concatenation.
- Further, the basic decoder calculates the first selection probability corresponding to each candidate word in the vocabulary based on the t-th hidden state.
- In one embodiment, the calculation formula of the first selection probability is as follows: P_b(w_i | h_t) = exp(W_i · h_t + b_i) / Σ_{k=1}^{K} exp(W_k · h_t + b_k), where w_i is the i-th candidate word in the vocabulary, K is the total number of candidate words in the vocabulary, and W_i and b_i are the parameters used to linearly map the t-th hidden state to the score of the i-th candidate word.
- As shown in FIG. 8, the target visual context information C_t, the (t-1)-th hidden state h_{t-1} output by the GRU 821, and the word vector e_{t-1} of the (t-1)-th decoded word are input into the GRU 822, and the GRU 822 calculates the first selection probability P_b of each candidate word.
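- One decoding step of such a basic decoder can be sketched with a GRU cell as follows; the concatenated-input form and all dimensions are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class BasicDecoderStep(nn.Module):
    """One step of the basic decoder: a GRU over [e_{t-1}; c_t],
    then a linear layer and softmax over the vocabulary."""
    def __init__(self, embed_dim, ctx_dim, hidden_dim, vocab_size):
        super().__init__()
        self.gru = nn.GRUCell(embed_dim + ctx_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)   # the W_i and b_i per word

    def forward(self, e_prev, c_t, h_prev):
        h_t = self.gru(torch.cat([e_prev, c_t], dim=-1), h_prev)
        p_b = torch.softmax(self.out(h_t), dim=-1)     # first selection probs
        return p_b, h_t

# Toy usage with assumed dimensions
step = BasicDecoderStep(embed_dim=300, ctx_dim=512, hidden_dim=512, vocab_size=10000)
e_prev, c_t, h_prev = torch.randn(1, 300), torch.randn(1, 512), torch.randn(1, 512)
p_b, h_t = step(e_prev, c_t, h_prev)
```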
- Step 704 When performing the t-th decoding, obtain the (t-1)-th decoded word and the (t-1)-th hidden state obtained from the (t-1)-th decoding, where the (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
- Similar to the basic decoder, the auxiliary decoder also needs to use the previous decoded word and the hidden state output during the previous decoding. Therefore, when performing the t-th decoding, the auxiliary decoder obtains the (t-1)-th decoded word and the (t-1)-th hidden state, where the (t-1)-th hidden state is the hidden state output when the basic decoder performs the (t-1)-th decoding.
- Step 705 Determine the second selection probability of each candidate word through the auxiliary decoder according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual features, and the reference visual context information corresponding to the candidate word.
- Different from the basic decoder, the auxiliary decoder also needs to obtain the reference visual context information corresponding to each candidate word from the memory structure during decoding, so as to attend to the visual features of the candidate word in its related videos.
- In one embodiment, the memory structure includes at least the reference visual context information g_r and the word feature vector e_r corresponding to each candidate word.
- Correspondingly, during decoding, the auxiliary decoder calculates the matching degree between the target visual context information and the reference visual context information corresponding to each candidate word, as well as the matching degree between the word feature vector of the candidate word and the word feature vector of the previous decoded word, and then determines the second selection probability of the candidate word according to the two matching degrees.
- this step 705 may include the following steps.
- Step 705A Generate the target visual context information of the t-th decoding according to the target visual features and the (t-1)-th hidden state.
- the process of generating the target visual context information can refer to the above step 703, which will not be repeated in this embodiment.
- In other embodiments, the auxiliary decoder can obtain the target visual context information from the basic decoder without repeated calculation, which is not limited in this embodiment.
- Step 705B Determine the first matching degree of the candidate word according to the target visual context information and the reference visual context information.
- Since the reference visual context information corresponding to a candidate word is generated based on the related videos of the candidate word, it can reflect the visual features of related videos that use the candidate word as a decoded word.
- Correspondingly, the more closely the reference visual context information of a candidate word matches the target visual context information of this decoding, the better the candidate word matches the current visual context.
- In one embodiment, the auxiliary decoder determines the matching degree between the target visual context information and the reference visual context information as the first matching degree of the candidate word, and the first matching degree can be expressed as: W_c · c_t + W_g · g_i, where W_c and W_g are linear transformation matrices, c_t is the target visual context information, and g_i is the reference visual context information corresponding to the i-th candidate word.
- Step 705C Obtain the first word feature vector corresponding to the candidate word in the memory structure and the second word feature vector of the (t-1)-th decoded word.
- In addition to determining the matching degree of a candidate word according to the visual context information, the auxiliary decoder also determines the matching degree according to the meanings of the candidate word and the previous decoded word, improving the coherence between the decoded word obtained by this decoding and the previous decoded word.
- In one embodiment, the auxiliary decoder obtains the first word feature vector corresponding to the candidate word from the memory structure, and converts the (t-1)-th decoded word into the second word feature vector through a conversion matrix.
- Step 705D Determine the second matching degree of the candidate word according to the first word feature vector and the second word feature vector.
- In one embodiment, the auxiliary decoder determines the matching degree between the first word feature vector and the second word feature vector as the second matching degree of the candidate word, and the second matching degree can be expressed as: W'_e · e_{t-1} + W_e · e_i, where W'_e and W_e are linear transformation matrices, e_{t-1} is the word vector of the (t-1)-th decoded word, and e_i is the word feature vector corresponding to the i-th candidate word.
- It should be noted that there is no strict sequence between steps 705A-705B and steps 705C-705D; that is, steps 705A and 705B can be executed synchronously with steps 705C and 705D, which is not limited in the embodiments of the present application.
- Step 705E Determine the second selection probability of the candidate word according to the first matching degree and the second matching degree.
- the second selection probability has a positive correlation with the first matching degree and the second matching degree, that is, the higher the first matching degree and the second matching degree, the higher the second selection probability of the candidate word.
- In one embodiment, in addition to the reference visual context information g_r and the word feature vector e_r corresponding to each candidate word, the memory structure also includes auxiliary information u_r corresponding to the candidate word.
- In one embodiment, the auxiliary information may be the part of speech of the candidate word, the field of the candidate word, the video categories in which the candidate word is commonly used, and so on.
- Correspondingly, the auxiliary decoder determines the second selection probability of a candidate word according to the auxiliary information, the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual features, and the reference visual context information corresponding to the candidate word.
- In one embodiment, the second selection probability P_m of the candidate word w_k can be expressed as: P_m(w_k) = exp(q_k) / Σ_{i=1}^{K} exp(q_i), where q_k is the relevance score of the candidate word w_k and K is the total number of candidate words in the vocabulary.
- The calculation formula of the relevance score of the i-th candidate word is as follows: q_i = w^T · tanh(W_c · c_t + W_g · g_i + W'_e · e_{t-1} + W_e · e_i + W_h · h_{t-1} + W_u · u_i + b), where W_h and W_u are linear transformation matrices, u_i is the auxiliary information corresponding to the i-th candidate word, b is the bias term, and w is a projection vector that maps the result to a scalar score.
- As shown in FIG. 8, the memory structure 832 in the auxiliary decoder 83 includes the reference visual context information g_i, the word feature vector e_i, and the auxiliary information u_i corresponding to each candidate word w_i.
- The content of the memory structure 832, the target visual context information C_t, the (t-1)-th hidden state h_{t-1}, and the word vector e_{t-1} of the (t-1)-th decoded word are input into the decoding component 831, and the decoding component 831 outputs the second selection probability P_m of each candidate word.
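- A sketch of the auxiliary decoder's scoring over the memory structure, following the relevance-score formula reconstructed above (the score form and all shapes are assumptions):

```python
import numpy as np

def second_selection_probs(c_t, h_prev, e_prev, memory, params):
    """Score every candidate word against the memory structure and softmax.
    memory holds one (g_i, e_i, u_i) triple per candidate word."""
    Wc, Wg, Wep, We, Wh, Wu, b, w = params
    q = np.array([w @ np.tanh(Wc @ c_t + Wg @ g_i + Wep @ e_prev + We @ e_i
                              + Wh @ h_prev + Wu @ u_i + b)
                  for g_i, e_i, u_i in memory])        # relevance scores q_i
    p = np.exp(q - q.max())
    return p / p.sum()                                 # second selection probs

# Toy usage: 5 candidate words, assumed dimensions
rng = np.random.default_rng(0)
d, dh, dc, de, du = 64, 128, 96, 50, 10
params = (rng.normal(size=(d, dc)), rng.normal(size=(d, dc)),   # Wc, Wg
          rng.normal(size=(d, de)), rng.normal(size=(d, de)),   # W'e, We
          rng.normal(size=(d, dh)), rng.normal(size=(d, du)),   # Wh, Wu
          rng.normal(size=d), rng.normal(size=d))               # b, w
memory = [(rng.normal(size=dc), rng.normal(size=de), rng.normal(size=du))
          for _ in range(5)]
p_m = second_selection_probs(rng.normal(size=dc), rng.normal(size=dh),
                             rng.normal(size=de), memory, params)
```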
- Step 706 Calculate the target selection probability of each candidate word according to the first selection probability and its corresponding first weight, and the second selection probability and its corresponding second weight.
- After obtaining the first selection probability and the second selection probability corresponding to each candidate word, the video description generation model calculates the target selection probability of the candidate word as the weighted sum of the two probabilities according to their respective weights.
- The calculation formula of the target selection probability of the candidate word w_k is as follows: P(w_k) = (1 - λ) · P_b(w_k) + λ · P_m(w_k), where λ is the second weight and (1 - λ) is the first weight.
- the first weight and the second weight are hyperparameters obtained experimentally, and the first weight is greater than the second weight.
- In one embodiment, the value range of λ is (0.1, 0.2).
- Step 707 Determine the candidate word corresponding to the highest target selection probability as the decoded word.
- In one embodiment, after obtaining the target selection probability of each candidate word, the video description generation model determines the candidate word with the highest target selection probability as the decoded word obtained by this decoding.
- As shown in FIG. 8, the video description generation model calculates the target selection probability P according to the first selection probability P_b and the second selection probability P_m, and determines the t-th decoded word w_t based on P.
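- A worked toy example of this fusion shows how the auxiliary decoder can flip the choice that the basic decoder would make alone; the vocabulary and probability values are invented for illustration.

```python
import numpy as np

# Toy 3-word vocabulary: ["slicing", "spreading", "butter"]
p_b = np.array([0.48, 0.42, 0.10])   # basic decoder alone prefers "slicing"
p_m = np.array([0.05, 0.80, 0.15])   # auxiliary decoder prefers "spreading"
lam = 0.15                           # second weight, within the (0.1, 0.2) range

p = (1 - lam) * p_b + lam * p_m      # fused target selection probabilities
print(p)                             # [0.4155 0.477  0.1075]
print(int(p.argmax()))               # 1 -> "spreading" becomes the decoded word
```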
- Step 708 Generate a video description corresponding to the target video according to each decoded word.
- the video description generated by the video description generation model in the related technology is "a person is slicing bread” (a person is slicing bread); and
- the video description generated by the video description generation model in the embodiment of the present application is "a man is spreading butter on a bread” (a person is spreading butter on a bread). It can be seen that the video description generation model in the related art cannot recognize the "spreading" and "butter" in the video 1001.
- the memory structure of the auxiliary decoder contains "spreading” and "butter” and related The correlation between the video pictures 1002 (that is, referring to the visual context information), so that the decoded words “spreading” and “butter” can be accurately decoded, which improves the accuracy of the video description.
- In the embodiment of this application, the video description generation model uses the encoder to encode the target video to obtain two-dimensional visual features and three-dimensional visual features and maps them to the same feature dimension, which improves the comprehensiveness of visual feature extraction and avoids mutual contamination between the two-dimensional and three-dimensional visual features.
- In addition, the auxiliary decoder determines the selection probability of a candidate word according to the reference visual context information of the candidate word and the target visual context information of the current decoding, which helps improve the accuracy of the finally determined decoded word; at the same time, the auxiliary decoder determines the selection probability according to the word feature vectors of the candidate word and the previous decoded word, which helps improve the coherence between the finally determined decoded word and the previous decoded word.
- In one embodiment, the reference visual context information in the memory structure of the auxiliary decoder is generated in advance based on sample videos, and the generation process may include the following steps.
- Step 1101 For each candidate word, determine I related videos corresponding to the candidate word according to the sample video descriptions corresponding to the sample videos, where the sample video description of each related video includes the candidate word, and I is an integer greater than or equal to 1.
- In one embodiment, developers add sample video descriptions to the sample videos through manual annotation, or use an existing video description generation model to automatically generate sample video descriptions for the sample videos and manually filter out the sample video descriptions whose quality is lower than expected.
- When determining the related videos of each candidate word in the vocabulary, the computer device obtains the sample video description corresponding to each sample video, and determines the sample videos whose sample video descriptions contain the candidate word as the related videos of the candidate word.
- For example, if the sample video description of a sample video B contains the candidate word "walking", the computer device determines the sample video B as a related video corresponding to "walking".
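- Building this word-to-related-videos mapping is a simple inverted index over the sample video descriptions; a sketch (the whitespace tokenization and the input format are assumptions):

```python
from collections import defaultdict

def build_related_videos(sample_descriptions):
    """Map each word to the set of videos whose sample description contains it."""
    related = defaultdict(set)
    for video_id, description in sample_descriptions.items():
        for word in description.lower().split():
            related[word].add(video_id)
    return related

samples = {"A": "a woman is pouring liquid into a bowl",
           "B": "a man is walking a dog"}
related = build_related_videos(samples)
print(sorted(related["walking"]))    # ['B']
```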
- Step 1102 For each related video, determine k key visual features in the related video, where the matching degree between a key visual feature and the candidate word is higher than the matching degree between the non-key visual features in the related video and the candidate word, and k is an integer greater than or equal to 1.
- Among them, the non-key visual features are the visual features in each related video other than the key visual features.
- the following steps may be included when determining key visual features in related videos.
- the computer device first trains the basic decoder in the video description generation model, and uses the basic decoder (using the attention mechanism) to obtain and decode the candidate vocabulary.
- the feature weight is the basic decoder (using the attention mechanism) to obtain and decode the candidate vocabulary.
- the computer device uses the basic decoder to decode the visual features of the sample video; at the t-th decoding step, the feature weight a_i of each visual feature (v'_i or f'_i) with respect to the candidate word is calculated by the attention function f_att from the hidden state h_{t-1} output by the decoder at the (t-1)-th decoding step.
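The attention function f_att is not spelled out in this passage; one common additive form, given here purely as an assumption, scores each visual feature against the previous hidden state and normalizes the scores so the feature weights sum to 1:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Assumed form of f_att: additive attention over visual features."""
    def __init__(self, dim_h, dim_v, dim_a=256):
        super().__init__()
        self.W_h = nn.Linear(dim_h, dim_a)
        self.W_v = nn.Linear(dim_v, dim_a)
        self.u = nn.Linear(dim_a, 1)

    def forward(self, h_prev, visual_feats):
        # h_prev: (dim_h,) hidden state h_{t-1}; visual_feats: (n, dim_v)
        scores = self.u(torch.tanh(self.W_h(h_prev) + self.W_v(visual_feats)))
        return torch.softmax(scores.squeeze(-1), dim=0)  # weights a_i, summing to 1
```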
- the computer device can determine the visual features corresponding to the top k (Top-k) feature weights as the key visual features of the candidate word.
- the computer device separately extracts the two-dimensional visual features 1201 and the three-dimensional visual features 1202 of each related video, obtains, through the attention mechanism of the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, and selects the Top-k visual features as the key visual features 1203.
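The Top-k selection itself reduces to keeping the k features with the highest attention weights, as in this sketch (assuming PyTorch tensors of precomputed weights):

```python
import torch

def select_key_features(features, weights, k=3):
    """Keep the k visual features with the highest feature weights.
    features: (n, d) tensor; weights: (n,) attention weights summing to 1."""
    k = min(k, weights.numel())
    top_w, top_idx = torch.topk(weights, k)  # Top-k feature weights
    return features[top_idx], top_w          # key visual features and their weights
```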
- Step 1103 Generate the reference visual context information corresponding to the candidate word according to the key visual features corresponding to the I related videos.
- the computer device fuses the key visual features corresponding to each related video to generate reference visual context information corresponding to the candidate words.
- the reference visual context information g_r corresponding to the candidate word may be expressed, for example, as a weighted fusion of the key visual features: g_r = (1/I) Σ_{i=1..I} Σ_{j=1..k} ( a_{i,j} f'_{i,j} + a'_{i,j} v'_{i,j} ), where I is the number of related videos, k is the number of key visual features corresponding to each related video, a_{i,j} is the feature weight of the j-th two-dimensional key visual feature f'_{i,j} with respect to the candidate word, and a'_{i,j} is the feature weight of the j-th three-dimensional key visual feature v'_{i,j} with respect to the candidate word.
- the computer device fuses the key visual features 1203 corresponding to each related video to generate reference visual context information 1204.
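A hedged sketch of this fusion step, under the assumption that the weighted key features of the I related videos are summed and averaged (the exact fusion form is reconstructed from the variable definitions above):

```python
import torch

def reference_visual_context(key_2d, w_2d, key_3d, w_3d):
    """Fuse key visual features of I related videos into one vector g_r.
    Each list entry holds one related video's (k, d) key features
    and the matching (k,) feature weights."""
    I = len(key_2d)
    per_video = []
    for f, a, v, a3 in zip(key_2d, w_2d, key_3d, w_3d):
        # weighted sum of the 2D and 3D key features of one related video
        per_video.append((a.unsqueeze(1) * f).sum(0) + (a3.unsqueeze(1) * v).sum(0))
    return torch.stack(per_video).sum(0) / I  # g_r: (d,)
```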
- Step 1104 Store the reference visual context information corresponding to each candidate word in the memory structure.
- the computer device stores the reference visual context information corresponding to each candidate vocabulary into the memory structure of the auxiliary decoder for subsequent use.
- the computer device extracts the key visual features of the candidate word from the related videos corresponding to the candidate word, thereby generating the reference visual context information of the candidate word based on a large number of key visual features and storing it in the memory structure, which helps to improve the accuracy of the decoded words obtained in subsequent decoding.
- the video description generation model in the embodiment of the present application is at a leading level in the four evaluation indicators (BLEU-4, METEOR, ROUGE-L, and CIDEr).
- Fig. 13 is a structural block diagram of an apparatus for generating a video description provided by an exemplary embodiment of the present application.
- the apparatus may be set in the computer equipment described in the foregoing embodiment. As shown in Fig. 13, the apparatus includes:
- the encoding module 1301 is used to encode the target video through the encoder of the video description generation model to obtain the target visual feature of the target video.
- the first decoding module 1302 is used to decode the target visual features through the basic decoder of the video description generation model by using the attention mechanism, to obtain the first selection probability corresponding to each candidate word.
- the second decoding module 1303 is used to decode the target visual features by the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word.
- the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated based on the related videos corresponding to the candidate word.
- the first determining module 1304 is configured to determine the decoded words in the candidate vocabulary according to the first selection probability and the second selection probability.
- the first generating module 1305 is configured to generate a video description corresponding to the target video according to each decoded word.
- the second decoding module 1303 includes:
- the first obtaining unit is used to obtain, when the t-th decoding is performed, the (t-1)-th decoded word obtained by the (t-1)-th decoding and the (t-1)-th hidden state.
- the (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
- the first determining unit is configured to determine the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- the first determining unit is configured to:
- generate the target visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state; determine the first matching degree of the candidate word according to the target visual context information and the reference visual context information; obtain the first word feature vector corresponding to the candidate word in the memory structure and the second word feature vector of the (t-1)-th decoded word; determine the second matching degree of the candidate word according to the first word feature vector and the second word feature vector; and determine the second selection probability of the candidate word according to the first matching degree and the second matching degree.
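The following sketch illustrates one plausible form of this two-part scoring; the bilinear matching matrices W_v and W_w and the additive combination are assumptions, not the embodiment's exact formulas:

```python
import torch

def candidate_score(g_target, g_ref, e_word, e_prev, W_v, W_w):
    """Score one candidate word in the auxiliary decoder:
    g_target: target visual context at step t;  g_ref: stored reference context;
    e_word: candidate's first word feature vector; e_prev: (t-1)-th decoded
    word's second word feature vector.  Returns an unnormalized score."""
    match_visual = g_target @ W_v @ g_ref  # first matching degree
    match_word = e_word @ W_w @ e_prev     # second matching degree
    return match_visual + match_word

# Softmax-normalizing the scores of all candidate words would then yield
# the second selection probabilities.
```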
- the memory structure also includes auxiliary information corresponding to each candidate vocabulary.
- the first determining unit is used to determine the second selection probability of the candidate word according to the auxiliary information, the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- the device includes:
- the second determining module is used to determine, for each candidate word, I related videos corresponding to the candidate word according to the sample video descriptions corresponding to the sample videos, where the sample video description of a related video contains the candidate word, and I is an integer greater than or equal to 1.
- the third determining module is used to determine, for each related video, k key visual features in the related video, where the matching degree between a key visual feature and the candidate word is higher than the matching degree between the non-key visual features in the related video and the candidate word, and k is an integer greater than or equal to 1.
- the second generating module is used to generate the reference visual context information corresponding to the candidate word according to the key visual features corresponding to the I related videos.
- the storage module is used to store the reference visual context information corresponding to each candidate vocabulary in the memory structure.
- the third determining module includes:
- the acquiring unit is used to acquire, through the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, where the feature weights sum to 1.
- the second determining unit is used to determine the visual features corresponding to the top k feature weights as the key visual features.
- the first determining module 1304 includes:
- the calculation unit is configured to calculate the target selection probability of each candidate word according to the first selection probability and its corresponding first weight, and the second selection probability and its corresponding second weight.
- the third determining unit is used to determine the candidate vocabulary corresponding to the highest target selection probability as the decoded word.
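A minimal sketch of this weighted combination (the weights w1 and w2 are example values; how they are set is not specified in this passage):

```python
def decode_word(p1, p2, candidates, w1=0.7, w2=0.3):
    """Combine the basic decoder's first selection probabilities p1 and the
    auxiliary decoder's second selection probabilities p2 (dicts keyed by
    candidate word), then pick the candidate with the highest target probability."""
    target = {w: w1 * p1[w] + w2 * p2[w] for w in candidates}
    return max(target, key=target.get)
```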
- the encoding module 1301 includes:
- the encoding unit is used to encode the target video through the encoder to obtain the two-dimensional visual characteristics and three-dimensional visual characteristics of the target video.
- the two-dimensional visual features are used to indicate the features of a single image frame, and the three-dimensional visual features are used to indicate the temporal features of consecutive image frames.
- the conversion unit is used to convert the two-dimensional visual feature and the three-dimensional visual feature to the same feature dimension to obtain the target visual feature.
- the encoder of the video description generation model is used to encode the target video to obtain the target visual feature; the target visual feature is then decoded through the attention-based basic decoder and the auxiliary decoder to obtain the first selection probability and the second selection probability of each candidate word; the first selection probability and the second selection probability are combined to determine the decoded word from the candidate words; and the video description is then generated based on the multiple decoded words;
- the memory structure of the auxiliary decoder in the model contains the reference visual context information corresponding to the candidate words, and the reference visual context information is generated based on the related videos of the candidate words; therefore, when the auxiliary decoder is used for decoding, attention can be paid to the correlation between candidate words and videos other than the current video, thereby improving the accuracy of decoded-word selection and the quality of the subsequently generated video description.
- the video description generation device provided in the above embodiment is illustrated only by the division of the above functional modules; in practical applications, the above functions can be allocated to different functional modules or units as required, that is, the internal structure of the device can be divided into different functional modules or units to complete all or part of the functions described above.
- Each functional module or unit can be implemented in whole or in part by software, hardware or a combination thereof.
- the video description generating device provided in the foregoing embodiment and the video description generating method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
- the computer device 1400 includes a central processing unit (CPU) 1401, a system memory 1404 including a random access memory (RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
- the computer device 1400 also includes a basic input/output system (I/O system) 1406 that helps transfer information between the various components in the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
- the basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409 such as a mouse and a keyboard for the user to input information.
- the display 1408 and the input device 1409 are both connected to the central processing unit 1401 through the input and output controller 1410 connected to the system bus 1405.
- the basic input/output system 1406 may also include an input and output controller 1410 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus.
- the input and output controller 1410 also provides output to a display screen, a printer, or other types of output devices.
- the mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405.
- the mass storage device 1407 and its associated computer-readable medium provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
- Computer-readable media may include computer storage media and communication media.
- Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, disk storage or other magnetic storage devices.
- the memory stores one or more programs configured to be executed by the one or more central processing units 1401; the one or more programs contain computer-readable instructions for implementing the above methods, and the central processing unit 1401 executes the one or more programs to implement the methods provided in the foregoing method embodiments.
- the computer device 1400 may also operate through a remote computer connected to a network such as the Internet. That is, the computer device 1400 can be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405; the network interface unit 1411 can also be used to connect to other types of networks or remote computer systems (not shown).
- the memory also includes one or more programs, the one or more programs are stored in the memory, and the one or more programs include instructions for performing the steps executed by the computer device in the methods provided in the embodiments of the present application.
- the embodiments of the present application also provide one or more computer-readable storage media storing at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, and the at least one computer-readable instruction, the at least one program, the code set, or the computer-readable instruction set is loaded and executed by one or more processors to implement the video description generation method of any one of the above embodiments.
- the present application also provides a computer program product that, when run on a computer, causes the computer to execute the video description generation method provided by the foregoing method embodiments.
- the program can be stored in a computer-readable storage medium.
- the readable storage medium may be a computer-readable storage medium included in the memory in the foregoing embodiment; or may be a computer-readable storage medium that exists alone and is not assembled into the terminal.
- the computer-readable storage medium stores at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, and the at least one computer-readable instruction, the at least one program, the code set, or the computer-readable instruction set is loaded and executed by the processor to implement the video description generation method of any of the foregoing method embodiments.
- the computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a solid state drive (SSD, Solid State Drive), an optical disc, or the like.
- random access memory may include resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
- the program can be stored in a computer-readable storage medium.
- the aforementioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Claims (20)
- A video description generation method, executed by a computer device, the method comprising: encoding a target video through an encoder of a video description generation model to obtain a target visual feature of the target video; decoding the target visual feature through a basic decoder of the video description generation model by using an attention mechanism, to obtain a first selection probability corresponding to each candidate word; decoding the target visual feature through an auxiliary decoder of the video description generation model to obtain a second selection probability corresponding to each candidate word, a memory structure of the auxiliary decoder including reference visual context information corresponding to each candidate word, the reference visual context information being generated according to related videos corresponding to the candidate word; determining a decoded word among the candidate words according to the first selection probability and the second selection probability; and generating a video description corresponding to the target video according to each decoded word.
- The method according to claim 1, wherein the decoding the target visual feature through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word comprises: when performing the t-th decoding, obtaining the (t-1)-th decoded word obtained by the (t-1)-th decoding and the (t-1)-th hidden state, the (t-1)-th hidden state being the hidden state output by the basic decoder during the (t-1)-th decoding, and t being an integer greater than or equal to 2; and determining, through the auxiliary decoder, the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- The method according to claim 2, wherein the determining the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word comprises: generating target visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state; determining a first matching degree of the candidate word according to the target visual context information and the reference visual context information; obtaining a first word feature vector corresponding to the candidate word in the memory structure and a second word feature vector of the (t-1)-th decoded word; determining a second matching degree of the candidate word according to the first word feature vector and the second word feature vector; and determining the second selection probability of the candidate word according to the first matching degree and the second matching degree.
- The method according to claim 3, wherein the generating target visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state comprises: obtaining two-dimensional visual context information and three-dimensional visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state; and fusing the two-dimensional visual context information and the three-dimensional visual context information to obtain the target visual context information for the t-th decoding.
- The method according to claim 2, wherein the memory structure further includes auxiliary information corresponding to each candidate word; and the determining the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word comprises: determining the second selection probability of the candidate word according to the auxiliary information, the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- The method according to any one of claims 1 to 5, wherein the method comprises: for each candidate word, determining I related videos corresponding to the candidate word according to sample video descriptions corresponding to sample videos, the sample video description of a related video containing the candidate word, and I being an integer greater than or equal to 1; for each related video, determining k key visual features in the related video, the matching degree between a key visual feature and the candidate word being higher than the matching degree between non-key visual features in the related video and the candidate word, and k being an integer greater than or equal to 1; generating the reference visual context information corresponding to the candidate word according to the key visual features corresponding to the I related videos; and storing the reference visual context information corresponding to each candidate word into the memory structure.
- The method according to claim 6, wherein the determining k key visual features in the related video comprises: obtaining, through the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, the feature weights summing to 1; and determining the visual features corresponding to the top k feature weights as the key visual features.
- The method according to any one of claims 1 to 5, wherein the determining the decoded word among the candidate words according to the first selection probability and the second selection probability comprises: calculating a target selection probability of each candidate word according to the first selection probability and a first weight corresponding to the first selection probability, and the second selection probability and a second weight corresponding to the second selection probability; and determining the candidate word corresponding to the highest target selection probability as the decoded word.
- The method according to any one of claims 1 to 5, wherein the encoding the target video through the encoder of the video description generation model to obtain the target visual feature of the target video comprises: encoding the target video through the encoder to obtain two-dimensional visual features and three-dimensional visual features of the target video, the two-dimensional visual features being used to indicate features of a single image frame, and the three-dimensional visual features being used to indicate temporal features of consecutive image frames; and converting the two-dimensional visual features and the three-dimensional visual features to the same feature dimension to obtain the target visual feature.
- A video description generation apparatus, provided in a computer device, the apparatus comprising: an encoding module, configured to encode a target video through an encoder of a video description generation model to obtain a target visual feature of the target video; a first decoding module, configured to decode the target visual feature through a basic decoder of the video description generation model by using an attention mechanism, to obtain a first selection probability corresponding to each candidate word; a second decoding module, configured to decode the target visual feature through an auxiliary decoder of the video description generation model to obtain a second selection probability corresponding to each candidate word, a memory structure of the auxiliary decoder including reference visual context information corresponding to each candidate word, the reference visual context information being generated according to related videos corresponding to the candidate word; a first determining module, configured to determine a decoded word among the candidate words according to the first selection probability and the second selection probability; and a first generating module, configured to generate a video description corresponding to the target video according to each decoded word.
- The apparatus according to claim 10, wherein the second decoding module comprises: a first obtaining unit, configured to obtain, when the t-th decoding is performed, the (t-1)-th decoded word obtained by the (t-1)-th decoding and the (t-1)-th hidden state, the (t-1)-th hidden state being the hidden state output by the auxiliary decoder during the (t-1)-th decoding, and t being an integer greater than or equal to 2; and a first determining unit, configured to determine, through the auxiliary decoder, the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- The apparatus according to claim 11, wherein the first determining unit is configured to: generate target visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state; determine a first matching degree of the candidate word according to the target visual context information and the reference visual context information; obtain a first word feature vector corresponding to the candidate word in the memory structure and a second word feature vector of the (t-1)-th decoded word; determine a second matching degree of the candidate word according to the first word feature vector and the second word feature vector; and determine the second selection probability of the candidate word according to the first matching degree and the second matching degree.
- The apparatus according to claim 12, wherein the first determining unit is further configured to: obtain two-dimensional visual context information and three-dimensional visual context information for the t-th decoding according to the target visual feature and the (t-1)-th hidden state; and fuse the two-dimensional visual context information and the three-dimensional visual context information to obtain the target visual context information for the t-th decoding.
- The apparatus according to claim 11, wherein the memory structure further includes auxiliary information corresponding to each candidate word; and the first determining unit is configured to: determine the second selection probability of the candidate word according to the auxiliary information, the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
- The apparatus according to any one of claims 10 to 14, wherein the apparatus comprises: a second determining module, configured to determine, for each candidate word, I related videos corresponding to the candidate word according to sample video descriptions corresponding to sample videos, the sample video description of a related video containing the candidate word, and I being an integer greater than or equal to 1; a third determining module, configured to determine, for each related video, k key visual features in the related video, the matching degree between a key visual feature and the candidate word being higher than the matching degree between non-key visual features in the related video and the candidate word, and k being an integer greater than or equal to 1; a second generating module, configured to generate the reference visual context information corresponding to the candidate word according to the key visual features corresponding to the I related videos; and a storage module, configured to store the reference visual context information corresponding to each candidate word into the memory structure.
- The apparatus according to claim 15, wherein the third determining module comprises: an obtaining unit, configured to obtain, through the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, the feature weights summing to 1; and a second determining unit, configured to determine the visual features corresponding to the top k feature weights as the key visual features.
- The apparatus according to any one of claims 10 to 14, wherein the first determining module comprises: a calculation unit, configured to calculate a target selection probability of each candidate word according to the first selection probability and a first weight corresponding to the first selection probability, and the second selection probability and a second weight corresponding to the second selection probability; and a third determining unit, configured to determine the candidate word corresponding to the highest target selection probability as the decoded word.
- The apparatus according to any one of claims 10 to 14, wherein the encoding module comprises: an encoding unit, configured to encode the target video through the encoder to obtain two-dimensional visual features and three-dimensional visual features of the target video, the two-dimensional visual features being used to indicate features of a single image frame, and the three-dimensional visual features being used to indicate temporal features of consecutive image frames; and a conversion unit, configured to convert the two-dimensional visual features and the three-dimensional visual features to the same feature dimension to obtain the target visual feature.
- A computer device, comprising one or more processors and a memory, the memory storing at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, and the at least one computer-readable instruction, the at least one program, the code set, or the computer-readable instruction set being loaded and executed by the one or more processors to implement the video description generation method according to any one of claims 1 to 9.
- One or more computer-readable storage media, storing at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, the at least one computer-readable instruction, the at least one program, the code set, or the computer-readable instruction set being loaded and executed by one or more processors to implement the video description generation method according to any one of claims 1 to 9.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021531058A JP7179183B2 (ja) | 2019-04-22 | 2020-03-27 | ビデオキャプションの生成方法、装置、デバイスおよびコンピュータプログラム |
KR1020217020589A KR102477795B1 (ko) | 2019-04-22 | 2020-03-27 | 비디오 캡션 생성 방법, 디바이스 및 장치, 그리고 저장 매체 |
EP20795471.0A EP3962097A4 (en) | 2019-04-22 | 2020-03-27 | VIDEO SUBTITLE GENERATION METHOD, DEVICE AND APPARATUS AND STORAGE MEDIA |
US17/328,970 US11743551B2 (en) | 2019-04-22 | 2021-05-24 | Video caption generating method and apparatus, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910325193.0 | 2019-04-22 | ||
CN201910325193.0A CN109874029B (zh) | 2019-04-22 | 2019-04-22 | 视频描述生成方法、装置、设备及存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/328,970 Continuation US11743551B2 (en) | 2019-04-22 | 2021-05-24 | Video caption generating method and apparatus, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020215988A1 true WO2020215988A1 (zh) | 2020-10-29 |
Family
ID=66922965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/081721 WO2020215988A1 (zh) | 2019-04-22 | 2020-03-27 | 视频描述生成方法、装置、设备及存储介质 |
Country Status (6)
Country | Link |
---|---|
US (1) | US11743551B2 (zh) |
EP (1) | EP3962097A4 (zh) |
JP (1) | JP7179183B2 (zh) |
KR (1) | KR102477795B1 (zh) |
CN (1) | CN109874029B (zh) |
WO (1) | WO2020215988A1 (zh) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN109874029B (zh) * | 2019-04-22 | 2021-02-12 | Tencent Technology (Shenzhen) Co., Ltd. | Video description generation method, apparatus, device, and storage medium |
- CN110263218B (zh) * | 2019-06-21 | 2022-02-25 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Video description text generation method, apparatus, device, and medium |
- CN110891201B (zh) * | 2019-11-07 | 2022-11-01 | Tencent Technology (Shenzhen) Co., Ltd. | Text generation method, apparatus, server, and storage medium |
- CN111860597B (zh) * | 2020-06-17 | 2021-09-07 | Tencent Technology (Shenzhen) Co., Ltd. | Video information processing method, apparatus, electronic device, and storage medium |
- CN112528883A (zh) * | 2020-12-15 | 2021-03-19 | Hangzhou Yishun Technology Co., Ltd. | Teaching scene video description generation method based on a reflection network |
- CN112580570B (zh) * | 2020-12-25 | 2024-06-21 | Nantong University | Key point detection method for human posture images |
- CN113569068B (zh) * | 2021-01-19 | 2023-09-29 | Tencent Technology (Shenzhen) Co., Ltd. | Description content generation method, and visual content encoding and decoding method and apparatus |
- CN113099228B (zh) * | 2021-04-30 | 2024-04-05 | Central South University | Video encoding and decoding method and system |
- CN113596557B (zh) * | 2021-07-08 | 2023-03-21 | Dalian Santong Technology Development Co., Ltd. | Video generation method and apparatus |
- CN113792166B (zh) * | 2021-08-18 | 2023-04-07 | Beijing Dajia Internet Information Technology Co., Ltd. | Information acquisition method, apparatus, electronic device, and storage medium |
- CN113810730B (zh) * | 2021-09-17 | 2023-08-01 | MIGU Digital Media Co., Ltd. | Video-based real-time text generation method, apparatus, and computing device |
- CN114422841B (zh) * | 2021-12-17 | 2024-01-02 | Beijing Dajia Internet Information Technology Co., Ltd. | Subtitle generation method, apparatus, electronic device, and storage medium |
- CN114501064B (zh) * | 2022-01-29 | 2023-07-14 | Beijing Youzhuju Network Technology Co., Ltd. | Video generation method, apparatus, device, medium, and product |
- CN114612321A (zh) * | 2022-03-08 | 2022-06-10 | Beijing Dajia Internet Information Technology Co., Ltd. | Video processing method, apparatus, and device |
US20240163390A1 (en) * | 2022-11-14 | 2024-05-16 | Zoom Video Communications, Inc. | Providing Assistance to Impaired Users within a Conferencing System |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170357720A1 (en) * | 2016-06-10 | 2017-12-14 | Disney Enterprises, Inc. | Joint heterogeneous language-vision embeddings for video tagging and search |
- CN108024158A (zh) * | 2017-11-30 | 2018-05-11 | Tianjin University | Supervised video summarization extraction method using a visual attention mechanism |
- CN108388900A (zh) * | 2018-02-05 | 2018-08-10 | South China University of Technology | Video description method combining multi-feature fusion and a spatiotemporal attention mechanism |
- CN108419094A (zh) * | 2018-03-05 | 2018-08-17 | Tencent Technology (Shenzhen) Co., Ltd. | Video processing method, video retrieval method, apparatus, medium, and server |
- CN108509411A (zh) * | 2017-10-10 | 2018-09-07 | Tencent Technology (Shenzhen) Co., Ltd. | Semantic analysis method and apparatus |
- CN109344288A (zh) * | 2018-09-19 | 2019-02-15 | University of Electronic Science and Technology of China | Video description method based on multi-modal features combined with a multi-layer attention mechanism |
- CN109359214A (zh) * | 2018-10-15 | 2019-02-19 | Ping An Technology (Shenzhen) Co., Ltd. | Neural network-based video description generation method, storage medium, and terminal device |
- CN109874029A (zh) * | 2019-04-22 | 2019-06-11 | Tencent Technology (Shenzhen) Co., Ltd. | Video description generation method, apparatus, device, and storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN107809642B (zh) * | 2015-02-16 | 2020-06-16 | Huawei Technologies Co., Ltd. | Method for video image encoding and decoding, encoding device, and decoding device |
US10303768B2 (en) * | 2015-05-04 | 2019-05-28 | Sri International | Exploiting multi-modal affect and semantics to assess the persuasiveness of a video |
US10366292B2 (en) * | 2016-11-03 | 2019-07-30 | Nec Corporation | Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction |
- CN108062505B (zh) * | 2016-11-09 | 2022-03-18 | Microsoft Technology Licensing, LLC | Method and device for neural network-based action detection |
US10565305B2 (en) * | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
JP6867153B2 (ja) | 2016-12-21 | 2021-04-28 | ホーチキ株式会社 | 異常監視システム |
US10592751B2 (en) * | 2017-02-03 | 2020-03-17 | Fuji Xerox Co., Ltd. | Method and system to generate targeted captions and summarize long, continuous media files |
EP3399460B1 (en) | 2017-05-02 | 2019-07-17 | Dassault Systèmes | Captioning a region of an image |
US10909157B2 (en) * | 2018-05-22 | 2021-02-02 | Salesforce.Com, Inc. | Abstraction of text summarization |
US10831834B2 (en) * | 2018-11-27 | 2020-11-10 | Sap Se | Unsupervised document summarization by attention and reconstruction |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
- CN111836111A (zh) * | 2019-04-17 | 2020-10-27 | Microsoft Technology Licensing, LLC | Techniques for generating bullet comments |
- 2019-04-22: CN application CN201910325193.0A (patent CN109874029B, active)
- 2020-03-27: WO application PCT/CN2020/081721 (publication WO2020215988A1, status unknown)
- 2020-03-27: KR application KR1020217020589A (patent KR102477795B1, active, IP Right Grant)
- 2020-03-27: JP application JP2021531058A (patent JP7179183B2, active)
- 2020-03-27: EP application EP20795471.0A (publication EP3962097A4, pending)
- 2021-05-24: US application US17/328,970 (patent US11743551B2, active)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN113343986A (zh) * | 2021-06-29 | 2021-09-03 | Beijing QIYI Century Science & Technology Co., Ltd. | Subtitle time interval determination method, apparatus, electronic device, and readable storage medium |
- CN113343986B (zh) * | 2021-06-29 | 2023-08-25 | Beijing QIYI Century Science & Technology Co., Ltd. | Subtitle time interval determination method, apparatus, electronic device, and readable storage medium |
- CN113673376A (zh) * | 2021-08-03 | 2021-11-19 | Beijing QIYI Century Science & Technology Co., Ltd. | Bullet comment generation method, apparatus, computer device, and storage medium |
- CN113673376B (zh) * | 2021-08-03 | 2023-09-01 | Beijing QIYI Century Science & Technology Co., Ltd. | Bullet comment generation method, apparatus, computer device, and storage medium |
- CN116166827A (zh) * | 2023-04-24 | 2023-05-26 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training of a semantic label extraction model, semantic label extraction method, and apparatus therefor |
- CN116166827B (zh) * | 2023-04-24 | 2023-12-15 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training of a semantic label extraction model, semantic label extraction method, and apparatus therefor |
Also Published As
Publication number | Publication date |
---|---|
JP2022509299A (ja) | 2022-01-20 |
US11743551B2 (en) | 2023-08-29 |
JP7179183B2 (ja) | 2022-11-28 |
EP3962097A4 (en) | 2022-07-13 |
US20210281774A1 (en) | 2021-09-09 |
KR102477795B1 (ko) | 2022-12-14 |
KR20210095208A (ko) | 2021-07-30 |
CN109874029A (zh) | 2019-06-11 |
CN109874029B (zh) | 2021-02-12 |
EP3962097A1 (en) | 2022-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- WO2020215988A1 (zh) | Video description generation method, apparatus, device, and storage medium | |
US11409791B2 (en) | Joint heterogeneous language-vision embeddings for video tagging and search | |
CN108304439B (zh) | 一种语义模型优化方法、装置及智能设备、存储介质 | |
US12008810B2 (en) | Video sequence selection method, computer device, and storage medium | |
WO2022161298A1 (zh) | 信息生成方法、装置、设备、存储介质及程序产品 | |
US11868738B2 (en) | Method and apparatus for generating natural language description information | |
US20230077849A1 (en) | Content recognition method and apparatus, computer device, and storage medium | |
JP6361351B2 (ja) | 発話ワードをランク付けする方法、プログラム及び計算処理システム | |
CN111897939B (zh) | 视觉对话方法、视觉对话模型的训练方法、装置及设备 | |
WO2022033208A1 (zh) | 视觉对话方法、模型训练方法、装置、电子设备及计算机可读存储介质 | |
CN110234018B (zh) | 多媒体内容描述生成方法、训练方法、装置、设备及介质 | |
WO2023065619A1 (zh) | 多维度细粒度动态情感分析方法及系统 | |
CN111837142A (zh) | 用于表征视频内容的深度强化学习框架 | |
CN110263218B (zh) | 视频描述文本生成方法、装置、设备和介质 | |
CN113392265A (zh) | 多媒体处理方法、装置及设备 | |
CN116912642A (zh) | 基于双模多粒度交互的多模态情感分析方法、设备及介质 | |
Moctezuma et al. | Video captioning: a comparative review of where we are and which could be the route | |
CN114281948A (zh) | 一种纪要确定方法及其相关设备 | |
CN118051635A (zh) | 基于大语言模型的对话式图像检索方法和装置 | |
CN115130461A (zh) | 一种文本匹配方法、装置、电子设备及存储介质 | |
CN115982343B (zh) | 摘要生成方法、训练摘要生成模型的方法及装置 | |
CN116932922B (zh) | 搜索词条处理方法、装置、计算机设备和计算机存储介质 | |
CN114329053B (zh) | 特征提取模型训练、媒体数据检索方法和装置 | |
- WO2023238722A1 (ja) | Information creation method, information creation apparatus, and video file |
Kong | Final Year Project Final Report |
Legal Events
- 121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20795471; Country of ref document: EP; Kind code of ref document: A1)
- ENP: Entry into the national phase (Ref document number: 2021531058; Country of ref document: JP; Kind code of ref document: A)
- ENP: Entry into the national phase (Ref document number: 20217020589; Country of ref document: KR; Kind code of ref document: A)
- NENP: Non-entry into the national phase (Ref country code: DE)
- ENP: Entry into the national phase (Ref document number: 2020795471; Country of ref document: EP; Effective date: 20211122)