WO2020215988A1 - Video description generation method, apparatus, device, and storage medium


Info

Publication number
WO2020215988A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
visual
target
candidate
feature
Prior art date
Application number
PCT/CN2020/081721
Other languages
English (en)
French (fr)
Inventor
裴文杰
张记袁
柯磊
戴宇荣
沈小勇
贾佳亚
王向荣
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2021531058A (JP7179183B2)
Priority to KR1020217020589A (KR102477795B1)
Priority to EP20795471.0A (EP3962097A4)
Publication of WO2020215988A1
Priority to US17/328,970 (US11743551B2)

Classifications

    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services for displaying subtitles
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/253 Fusion techniques of extracted features
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/772 Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 Detecting features for summarising video content
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234336 Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/251 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/8549 Creating video summaries, e.g. movie trailer
    • H04N5/278 Subtitling

Definitions

  • The embodiments of the present application relate to the field of artificial intelligence technology and the field of video description, and in particular to a video description generation method, apparatus, device, and storage medium.
  • Video Captioning is a technology for generating content description information for videos.
  • video description generation models are usually used to automatically generate video descriptions for videos, and video description generation models are mostly based on the Encoder-Decoder framework.
  • The video description generation model first extracts the visual features of the video through the encoder, and then inputs the extracted visual features into the decoder; the decoder sequentially generates decoded words according to the visual features, and finally the generated decoded words are combined into a video description.
  • the video description generation model in the related technology only focuses on the currently processed video.
  • However, the same decoded word may be used in multiple videos with similar but not identical semantics, so the focus of the video description generation model is too limited, which in turn affects the quality of the generated video description.
  • A method for generating a video description, executed by a computer device, including: encoding a target video through the encoder of a video description generation model to obtain the target visual features of the target video;
  • decoding the target visual features with the attention mechanism through the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate word;
  • decoding the target visual features through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word, where the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated according to the related videos corresponding to the candidate words;
  • determining the decoded word in the candidate words according to the first selection probability and the second selection probability; and generating a video description corresponding to the target video according to each decoded word.
  • A video description generating apparatus, set in a computer device, the apparatus including:
  • the encoding module is used to encode the target video through the encoder of the video description generation model to obtain the target visual features of the target video;
  • the first decoding module, used to decode the target visual features with the attention mechanism through the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate word;
  • the second decoding module, used to decode the target visual features through the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word, where the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated according to the related videos corresponding to the candidate words;
  • the first determining module is configured to determine the decoded words in the candidate vocabulary according to the first selection probability and the second selection probability;
  • the first generating module is used to generate a video description corresponding to the target video according to each decoded word.
  • a computer device that includes one or more processors and a memory.
  • The memory stores at least one computer-readable instruction, program, code set, or computer-readable instruction set, which is loaded and executed by the one or more processors to implement the video description generation method described above.
  • One or more computer-readable storage media store at least one computer-readable instruction, program, code set, or computer-readable instruction set, which is loaded and executed by one or more processors to implement the video description generation method described above.
  • A computer program product, which, when run on a computer, causes the computer to execute the video description generation method described above.
  • FIG. 1 is a schematic diagram of the principle of using the SA-LSTM model to generate a video description in a related technology in an embodiment
  • Figure 2 is a schematic diagram of an implementation of a video description generation method in a video classification retrieval scenario in an embodiment
  • Fig. 3 is a schematic diagram of an implementation of a method for generating a video description in an assisted scene for the visually impaired in an embodiment
  • Figure 4 is a schematic diagram of an implementation environment in an embodiment
  • Fig. 5 is a flowchart of a method for generating a video description in an embodiment
  • Figure 6 is a video description generated by a video description generation model in an embodiment
  • FIG. 7 is a flowchart of a method for generating a video description in an embodiment
  • FIG. 8 is a schematic structural diagram of a video description generation model in an embodiment
  • FIG. 9 is a flowchart of the process of assisting the decoder to determine the candidate vocabulary selection probability in an embodiment
  • FIG. 10 is a video description generated by a video description generation model in an embodiment of the related technology and an embodiment of this application;
  • FIG. 11 is a flowchart of a process of generating reference visual context information corresponding to candidate words in an embodiment
  • FIG. 12 is a schematic diagram of an implementation of a process of generating reference visual context information in an embodiment
  • Figure 13 is a structural block diagram of a video description generating apparatus in an embodiment
  • Fig. 14 is a schematic structural diagram of a computer device in an embodiment.
  • The video description generation model based on the encoding-decoding framework can be a Soft Attention Long Short-Term Memory (SA-LSTM) model.
  • The process by which the SA-LSTM model generates a video description is shown in Figure 1.
  • The SA-LSTM model first performs feature extraction on the input video 11 to obtain the visual features 12 (v 1 , v 2 ,..., v n ) of the video 11. Then, according to the previous hidden state 13 (the hidden state output during the (t-1)-th decoding) and the visual features 12, the SA-LSTM model uses the soft attention mechanism to calculate the weight 14 of each visual feature 12 for the current (t-th) decoding, and performs a weighted sum over the visual features 12 and the weights 14 to obtain the context information 15 of the current decoding. Further, the SA-LSTM model outputs the current hidden state 17 according to the previous hidden state 13, the previous decoded word 16, and the context information 15, and then determines the current decoded word 18 according to the current hidden state 17.
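  • For illustration, the soft-attention step described above can be sketched as follows. This is a minimal NumPy sketch, not the SA-LSTM implementation itself; the additive attention form and the weight matrices W_h, W_v, w_a are assumptions introduced only for the example.

```python
import numpy as np

def soft_attention(h_prev, visual_feats, W_h, W_v, w_a):
    """Score each visual feature v_1..v_n against the previous hidden state,
    softmax the scores into weights, and return the weighted-sum context."""
    scores = np.array([w_a @ np.tanh(W_h @ h_prev + W_v @ v) for v in visual_feats])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # attention weights
    context = (weights[:, None] * visual_feats).sum(axis=0)   # weighted sum = context info
    return weights, context

# toy dimensions: 4 visual features of size 6, hidden state of size 5
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6))
h_prev = rng.normal(size=5)
W_h, W_v, w_a = rng.normal(size=(8, 5)), rng.normal(size=(8, 6)), rng.normal(size=8)
weights, context = soft_attention(h_prev, feats, W_h, W_v, w_a)
```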
  • the SA-LSTM model only pays attention to the visual features of the current video, and correspondingly, the determined decoded words are only related to the visual features of the current video.
  • However, the same decoded word may appear in multiple video clips and express similar but not exactly the same meaning in different video clips (that is, the decoded word may correspond to similar but not exactly the same visual features), so the accuracy of the decoded words output by the SA-LSTM model is low, which in turn affects the quality of the final video description.
  • the video description generation model in the embodiments of this application adopts the structure of "encoder + basic decoder + auxiliary decoder".
  • a memory mechanism is introduced to store the relationship between each candidate vocabulary in the vocabulary and related videos in the memory structure, and the memory structure is added to the auxiliary decoder.
  • Therefore, the video description generation model provided by the embodiments of this application can focus not only on the current video (basic decoder) but also on other videos with visual features similar to those of the current video (auxiliary decoder), thereby avoiding the limited angle of attention caused by focusing only on the current video, improving the accuracy of the output decoded words, and improving the quality of the generated video description.
  • the video description generation method provided in the embodiment of the present application can be used in any of the following scenarios.
  • the video description generation model in the embodiment of the present application can be implemented as a video management application or a part of a video management application.
  • After video clips that do not contain video descriptions are input into the video management application, the application extracts the visual features of the video clips through the encoder in the video description generation model, decodes the visual features with the basic decoder and the auxiliary decoder, combines the decoding results of the two decoders to determine the decoded words, and then generates a video description for each video clip based on the decoded words.
  • the video management application classifies the video clips based on the video description (for example, through semantic recognition), and adds corresponding category tags to the video clips. During subsequent video retrieval, the video management application can return video clips that meet the retrieval conditions according to the retrieval conditions and the category tags corresponding to each video segment.
  • In the intelligent question answering (Visual Question Answering, VQA) scenario, the video description generation model in the embodiment of the present application can be implemented as an intelligent question answering application or a part of the intelligent question answering application.
  • When the intelligent question answering application obtains a video and a question about the video, it generates a video description corresponding to the video through the video description generation model, semantically recognizes the question and the video description to generate the answer to the question, and then displays the answer.
  • the video description generation model in the embodiment of the present application can be implemented as a voice prompt application or a part of the voice prompt application.
  • When the voice prompt application (for example, running on auxiliary equipment used by the visually impaired) collects the video of the environment around the visually impaired person through a camera, it encodes and decodes the environmental video through the video description generation model to generate the video description corresponding to the environmental video.
  • the voice prompt application can convert the video description from text to voice and perform voice broadcast to help the visually impaired understand the surrounding environment.
  • a camera 32 and a bone conduction earphone 33 are provided on glasses 31 worn by a visually impaired person.
  • The camera 32 captures images of the environment ahead and collects a segment of environment video 34.
  • Through the processor, the glasses 31 generate a video description for the environment video 34, "a man is walking a dog in front", convert the video description from text to voice, and play it through the bone conduction earphone 33, so that the visually impaired person can avoid obstacles according to the voice prompt.
  • the method provided in the embodiment of the present application can also be applied to other scenarios where a video description needs to be generated for a video, and the embodiment of the present application does not limit specific application scenarios.
  • the video description generation method provided in the embodiments of the present application can be applied to computer devices such as terminals or servers.
  • The video description generation model in this embodiment of the application can be implemented as an application or a part of an application and installed in the terminal, so that the terminal has the function of generating video descriptions; alternatively, the video description generation model can be applied to the background server of the application, so that the server provides the video description generation function for the application in the terminal.
  • FIG. 4 shows a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application.
  • This implementation environment includes a terminal 410 and a server 420.
  • the terminal 410 and the server 420 communicate data through a communication network.
  • The communication network may be a wired network or a wireless network, and may be at least one of a local area network, a metropolitan area network, and a wide area network.
  • The application program may be a video management application, an intelligent question answering application, a voice prompt application, a subtitle generation application (adding commentary subtitles to the video picture), and so on, which is not limited in the embodiments of this application.
  • The terminal may be a mobile terminal such as a mobile phone, a tablet computer, a laptop computer, or an auxiliary device for the visually impaired, or a terminal such as a desktop computer or a projection computer, which is not limited in the embodiments of the application.
  • the server 420 may be implemented as a server, or as a server cluster formed by a group of servers, and it may be a physical server or a cloud server. In one embodiment, the server 420 is a background server of the application in the terminal 410.
  • a pre-trained video description generation model 421 is stored in the server 420.
  • the application program transmits the target video to the server 420 through the terminal 410.
  • the server 420 receives the target video, it inputs the target video into the video description generation model 421.
  • The video description generation model 421 extracts features of the target video through the encoder 421A, decodes the extracted features through the basic decoder 421B and the auxiliary decoder 421C respectively, generates a video description according to the decoding results, and feeds it back to the terminal 410.
  • the video description is displayed by the application in the terminal 410.
  • In other embodiments, the terminal 410 can generate the video description of the target video locally without the server 420, thereby increasing the speed at which the terminal obtains the video description and reducing the delay caused by interacting with the server.
  • FIG. 5 shows a flowchart of a method for generating a video description provided by an exemplary embodiment of the present application.
  • the method is used in a computer device as an example for description. The method includes the following steps.
  • step 501 the target video is encoded by the encoder of the video description generation model to obtain the target visual feature of the target video.
  • The function of the encoder in the video description generation model is to extract the target visual features from the target video and input the extracted target visual features into the decoders (including the basic decoder and the auxiliary decoder).
  • the target visual feature is represented by a vector.
  • The video description generation model uses pre-trained deep Convolutional Neural Networks (CNNs) as the encoder for visual feature extraction, and the target video needs to be preprocessed before the encoder is used for feature extraction, so that the preprocessed target video meets the input requirements of the encoder.
  • the encoder inputs the target visual features into the basic decoder and the auxiliary decoder respectively, and executes the following steps 502 and 503. It should be noted that there is no strict sequence between the following steps 502 and 503, that is, step 502 and step 503 can be executed synchronously, and this embodiment does not limit the execution order of the two.
  • step 502 the target visual feature is decoded by the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate vocabulary.
  • the basic decoder is used to use the attention mechanism to decode the candidate vocabulary matching the target visual feature.
  • the basic decoder pays attention to the target video, so as to perform decoding based on the target visual characteristics of the target video.
  • The basic decoder may be a Recurrent Neural Network (RNN) based decoder that uses an attention mechanism.
  • The basic decoder uses the SA-LSTM model; each time it decodes, the basic decoder uses the attention mechanism to determine the first selection probability corresponding to each candidate word in the vocabulary according to the hidden state output by the last decoding, the last decoded word, and the target visual features.
  • The basic decoder may also adopt other attention-based RNN decoders, which is not limited in the embodiments of the present application.
  • the decoding process of the basic decoder is essentially a classification task, that is, the (first) selection probability of each candidate vocabulary in the vocabulary is calculated through the softmax function.
  • The greater the first selection probability, the higher the degree of match between the candidate word and the context information of the video, that is, the more closely the meaning expressed by the candidate word matches the context.
  • Step 503 The auxiliary decoder of the video description generation model decodes the target visual features to obtain the second selection probability corresponding to each candidate vocabulary.
  • the memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate vocabulary.
  • The reference visual context information is generated based on the related videos corresponding to the candidate words.
  • The auxiliary decoder in this example focuses on the correlation between candidate words and related videos. Therefore, when the auxiliary decoder is used to decode the target visual features, it can capture the visual features of the same candidate word in different videos and match them with the target visual features of the target video, thereby improving the accuracy of determining the decoded word.
  • the association between candidate words and related videos is stored in the memory structure of the auxiliary decoder, and is reflected by the correspondence between the candidate words and the reference visual context information.
  • the reference visual context information corresponding to the candidate word is used to indicate the visual context feature of the related video containing the candidate word, and the reference visual context information is generated according to the related video related to the candidate word in the sample video.
  • a graph-based algorithm can also be used to construct the association between the candidate vocabulary and the related video, which is not limited in this application.
  • the decoding process of the auxiliary decoder is essentially a classification task, that is, the (second) selection probability of each candidate vocabulary in the vocabulary is calculated through the softmax function.
  • the basic decoder and the auxiliary decoder correspond to the same vocabulary, and the greater the second selection probability, the higher the matching degree between the candidate vocabulary and the context information of the video, that is, the meaning expressed by the candidate vocabulary matches the context more closely.
  • Step 504 Determine the decoded word in the candidate vocabulary according to the first selection probability and the second selection probability.
  • The video description generation model integrates the first selection probability output by the basic decoder and the second selection probability output by the auxiliary decoder, and determines the decoded word obtained by this decoding from the candidate words in the vocabulary.
  • Step 505 Generate a video description corresponding to the target video according to each decoded word.
  • Since the video description is usually a natural-language sentence composed of multiple decoded words, steps 502 to 504 above need to be repeated for each decoding to generate each decoded word of the video description in turn; the decoded words are then connected to finally generate the video description.
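  • The repetition of steps 502 to 504 can be sketched as a greedy decoding loop as follows. `basic_step`, `aux_step`, the fusion weight `lam`, and the start/end tokens are hypothetical placeholders for the components described above, not names taken from the embodiment.

```python
def generate_description(visual_feats, basic_step, aux_step, vocab,
                         lam=0.15, max_len=20, end_token="<eos>"):
    """Greedy decoding loop: at each step fuse the two decoders' probabilities,
    pick the candidate word with the highest fused probability, and stop at the
    end token or the maximum length."""
    words, prev_word, h = [], "<bos>", None
    for _ in range(max_len):
        p_b, h = basic_step(visual_feats, prev_word, h)   # first selection probabilities
        p_m = aux_step(visual_feats, prev_word, h)        # second selection probabilities
        p = [(1 - lam) * pb + lam * pm for pb, pm in zip(p_b, p_m)]
        prev_word = vocab[max(range(len(p)), key=p.__getitem__)]
        if prev_word == end_token:
            break
        words.append(prev_word)
    return " ".join(words)
```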
  • In summary, the encoder of the video description generation model encodes the target video to obtain the target visual features; the target visual features are then decoded by the attention-based basic decoder and by the auxiliary decoder to obtain the first selection probability and the second selection probability of each candidate word; the two probabilities are combined to determine the decoded word from the candidate words, and the video description is generated from the multiple decoded words.
  • Because the memory structure of the auxiliary decoder contains the reference visual context information corresponding to the candidate words, and the reference visual context information is generated from the related videos of the candidate words, the auxiliary decoder can pay attention to the correlation between candidate words and videos other than the current video when decoding, thereby improving the accuracy of decoded-word selection and, in turn, the quality of the generated video description.
  • For the video 61, the video description generated by the video description generation model in the related art is "a woman is mixing ingredients in a bowl", while the video description generated by the video description generation model in the embodiment of this application is "a woman is pouring liquid into a bowl".
  • The video description generation model in the related art cannot identify the action "pouring" in the video 61, whereas in the embodiment of the present application the memory structure of the auxiliary decoder contains the association between "pouring" and the related video pictures 62 (that is, the reference visual context information), so the decoded word "pouring" can be accurately decoded, which improves the quality of the video description.
  • FIG. 7 shows a flowchart of a method for generating a video description provided by another exemplary embodiment of the present application.
  • the method is used in computer equipment as an example for description. The method includes the following steps.
  • Step 701 Encode the target video by the encoder to obtain the two-dimensional visual feature and the three-dimensional visual feature of the target video.
  • The two-dimensional visual features are used to indicate the features of single image frames, and the three-dimensional visual features are used to indicate the temporal features of continuous image frames.
  • the visual features of the video include not only the image features of a single frame image (ie, two-dimensional visual features), but also the sequential features of continuous image frames (ie, three-dimensional visual features).
  • the encoder includes a first sub-encoder for extracting two-dimensional visual features and a second sub-encoder for extracting three-dimensional visual features.
  • When the target video is encoded, the target video is divided into individual image frames, and the first sub-encoder performs feature extraction on each image frame to obtain the two-dimensional visual features; the target video is also divided into several video segments (each video segment containing several continuous image frames), and the second sub-encoder extracts features from each video segment to obtain the three-dimensional visual features.
  • The first sub-encoder adopts the ResNet-101 model (a residual network with a depth of 101) pre-trained on the ImageNet data set (a large visual database for visual object recognition research), and the second sub-encoder adopts the ResNeXt-101 model pre-trained on the Kinetics data set.
  • the first sub-encoder and the second sub-encoder may also adopt other models, which are not limited in the embodiment of the present application.
  • the encoder 81 extracts a two-dimensional visual feature 811 and a three-dimensional visual feature 812.
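  • A hedged sketch of step 701, assuming there are enough frames for at least one segment; `extract_2d` and `extract_3d` are hypothetical stand-ins for the pre-trained ResNet-101 and ResNeXt-101 sub-encoders, and the segment length is an example value.

```python
import numpy as np

def encode_video(frames, extract_2d, extract_3d, segment_len=16):
    """Split the video into single frames for 2D features and into short
    contiguous segments for 3D (temporal) features, then run the two
    sub-encoders on them."""
    feats_2d = np.stack([extract_2d(f) for f in frames])         # one feature per frame
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames) - segment_len + 1, segment_len)]
    feats_3d = np.stack([extract_3d(seg) for seg in segments])   # one feature per segment
    return feats_2d, feats_3d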
  • Step 702 Convert the two-dimensional visual feature and the three-dimensional visual feature to the same feature dimension to obtain the target visual feature.
  • the video description generation model converts two-dimensional visual features and three-dimensional visual features into the same feature dimension of hidden space to obtain target visual features.
  • The converted two-dimensional visual feature is f'_l = M_f · f_l + b_f, and the converted three-dimensional visual feature is v'_n = M_v · v_n + b_v, where M_f and M_v are conversion matrices, and b_f and b_v are bias terms.
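  • The dimension conversion above can be sketched as a pair of linear projections; the feature sizes below are toy values chosen for the example, not values specified in the embodiment.

```python
import numpy as np

def to_hidden_space(feats_2d, feats_3d, M_f, b_f, M_v, b_v):
    """Project 2D features f_l and 3D features v_n into the same hidden
    dimension: f'_l = M_f f_l + b_f and v'_n = M_v v_n + b_v."""
    f_prime = feats_2d @ M_f.T + b_f
    v_prime = feats_3d @ M_v.T + b_v
    return f_prime, v_prime

# toy example: 2D features of size 2048, 3D features of size 1024, hidden size 512
rng = np.random.default_rng(1)
f = rng.normal(size=(20, 2048))   # 20 frames
v = rng.normal(size=(5, 1024))    # 5 segments
M_f, b_f = rng.normal(size=(512, 2048)) * 0.01, np.zeros(512)
M_v, b_v = rng.normal(size=(512, 1024)) * 0.01, np.zeros(512)
f_p, v_p = to_hidden_space(f, v, M_f, b_f, M_v, b_v)   # both now have feature size 512
```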
  • step 703 the target visual feature is decoded by the basic decoder of the video description generation model to obtain the first selection probability corresponding to each candidate vocabulary.
  • the basic decoder is used to decode the candidate vocabulary matching the target visual feature using the attention mechanism .
  • the video description generation model uses Gated Recurrent Unit (GRU) as the skeleton of the basic decoder.
  • the basic decoder 82 includes GRU 821, GRU 822, and GRU 823.
  • When the basic decoder performs the t-th decoding, the following steps may be included.
  • When performing the t-th decoding, obtain the (t-1)-th decoded word and the (t-1)-th hidden state obtained from the (t-1)-th decoding; the (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
  • the basic decoder outputs a hidden state, and subsequently determines the decoded word obtained from this decoding based on the hidden state.
  • When outputting a hidden state, the hidden state output during the last decoding and the last decoded word need to be used. Therefore, when the basic decoder performs the t-th decoding, it needs to obtain the (t-1)-th decoded word and the (t-1)-th hidden state.
  • The basic decoder 82 obtains the (t-1)-th hidden state h_{t-1} output by the GRU 821 and the word vector e_{t-1} corresponding to the (t-1)-th decoded word w_{t-1}.
  • In different decoding stages, the correlation between different visual features and the current decoded word is different. Therefore, before calculating the first selection probability, the basic decoder also needs to use the attention mechanism to process the target visual features output by the encoder (weighted summation) to obtain the target visual context information for this decoding.
  • The basic decoder processes the two-dimensional visual features and the three-dimensional visual features separately to obtain two-dimensional visual context information and three-dimensional visual context information, and fuses them to obtain the target visual context information.
  • For the two-dimensional visual feature f'_i, the attention weight used to obtain the two-dimensional visual context information is a_{i,t} = f_att(h_{t-1}, f'_i), where h_{t-1} is the (t-1)-th hidden state (represented as a vector) and f_att is the attention function.
  • For the three-dimensional visual feature v'_i, the attention weight used to obtain the three-dimensional visual context information is a'_{i,t} = f_att(h_{t-1}, v'_i), with h_{t-1} and f_att defined as above.
  • the same attention function is used when processing two-dimensional visual features and processing three-dimensional visual features.
  • The attention mechanism (f_att in the figure) is used to process the two-dimensional visual features 811 and the three-dimensional visual features 812 respectively to obtain C_{t,2D} and C_{t,3D}, and the two results are fused to obtain the target visual context information C_t for the t-th decoding.
  • The GRU outputs the t-th hidden state for the t-th decoding according to the (t-1)-th decoded word, the (t-1)-th hidden state, and the target visual context information.
  • The way the GRU determines the t-th hidden state can be expressed as h_t = GRU(h_{t-1}, e_{t-1}, C_t).
  • the basic decoder calculates the first selection probability corresponding to each candidate vocabulary in the vocabulary based on the t-th hidden state.
  • The calculation formula of the first selection probability is P_b(w_i) = exp(W_i · h_t + b_i) / Σ_{k=1}^{K} exp(W_k · h_t + b_k), where w_i is the i-th candidate word in the vocabulary, K is the total number of candidate words in the vocabulary, and W_i and b_i are the parameters used when linearly mapping the score of the i-th candidate word.
  • The target visual context information C_t, the (t-1)-th hidden state h_{t-1} output by the GRU 821, and the word vector e_{t-1} of the (t-1)-th decoded word are input into the GRU 822, and the GRU 822 calculates the first selection probability P_b of each candidate word.
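  • A hedged sketch of one basic-decoder step: the 2D and 3D attention contexts are fused (here simply summed, an assumption), fed together with the previous word vector into a GRU cell, and the resulting hidden state is linearly mapped and softmax-normalized into the first selection probabilities. `gru_cell` and the parameter shapes are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def basic_decoder_step(h_prev, e_prev, ctx_2d, ctx_3d, gru_cell, W, b):
    """One decoding step of the basic decoder.
    h_prev  : (t-1)-th hidden state
    e_prev  : word vector of the (t-1)-th decoded word
    ctx_2d, ctx_3d : attention contexts C_{t,2D} and C_{t,3D}
    W, b    : linear-mapping parameters, one row / entry per candidate word
    """
    c_t = ctx_2d + ctx_3d                  # fuse the two contexts (assumed: sum)
    x_t = np.concatenate([e_prev, c_t])    # GRU input: previous word vector + context
    h_t = gru_cell(x_t, h_prev)            # t-th hidden state
    p_b = softmax(W @ h_t + b)             # first selection probability per candidate word
    return p_b, h_t
```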
  • Step 704: When the t-th decoding is performed, obtain the (t-1)-th decoded word and the (t-1)-th hidden state obtained from the (t-1)-th decoding; the (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
  • Similar to the basic decoder, the auxiliary decoder also needs to use the last decoded word and the hidden state output during the last decoding. Therefore, when the t-th decoding is performed, the auxiliary decoder obtains the (t-1)-th decoded word and the (t-1)-th hidden state, where the (t-1)-th hidden state is the hidden state output when the basic decoder performs the (t-1)-th decoding.
  • Step 705 According to the t-1th decoded word, the t-1th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word, the auxiliary decoder determines the second selection probability of the candidate word.
  • the auxiliary decoder also needs to obtain the reference visual context information corresponding to each candidate vocabulary in the memory structure during the decoding process, so as to focus on the visual features of the candidate vocabulary in the related video during the decoding process.
  • The memory structure includes at least the word feature vector e_r of each candidate word and the reference visual context information g_r corresponding to each candidate word.
  • During decoding, the auxiliary decoder focuses on the matching degree between the target visual context information of this decoding and the reference visual context information of the candidate word, and on the matching degree between the word feature vector of the candidate word and the word feature vector of the previous decoded word; the second selection probability of the candidate word is then determined according to these two matching degrees.
  • this step 705 may include the following steps.
  • Step 705A according to the target visual feature and the t-1th hidden state, generate target visual context information for the tth decoding.
  • the process of generating the target visual context information can refer to the above step 703, which will not be repeated in this embodiment.
  • Alternatively, the auxiliary decoder can obtain the target visual context information from the basic decoder without repeating the calculation, which is not limited in this embodiment.
  • Step 705B Determine the first matching degree of the candidate vocabulary according to the target visual context information and the reference visual context information.
  • Since the reference visual context information corresponding to a candidate word is generated from the related videos corresponding to that candidate word, it can reflect the visual features of the related videos that use the candidate word as a decoded word.
  • Therefore, the higher the matching degree between the reference visual context information corresponding to the candidate word and the target visual context information of this decoding, the better the candidate word matches the current decoding context.
  • The auxiliary decoder determines the matching degree between the target visual context information and the reference visual context information as the first matching degree of the candidate word, which can be expressed as [W_c · c_t + W_g · g_i], where W_c and W_g are linear transformation matrices and g_i is the reference visual context information corresponding to the i-th candidate word.
  • Step 705C Obtain the first word feature vector corresponding to the candidate vocabulary in the memory structure and the second word feature vector of the t-1th decoded word.
  • In addition to determining the matching degree of the candidate word according to the visual context information, the auxiliary decoder also determines the matching degree of the candidate word according to the meanings of the candidate word and the previous decoded word, thereby improving the coherence between the decoded word obtained by this decoding and the previous decoded word.
  • the auxiliary decoder obtains the first word feature vector corresponding to the candidate vocabulary from the memory structure, and converts the t-1th decoded word into the second word feature vector through the conversion matrix.
  • Step 705D Determine the second matching degree of the candidate vocabulary according to the first word feature vector and the second word feature vector.
  • The auxiliary decoder determines the matching degree between the first word feature vector and the second word feature vector as the second matching degree of the candidate word, which can be expressed as [W'_e · e_{t-1} + W_e · e_i], where W'_e and W_e are linear transformation matrices and e_i is the word feature vector corresponding to the i-th candidate word.
  • There is no strict order between steps 705A-705B and steps 705C-705D; that is, steps 705A and 705B can be executed synchronously with steps 705C and 705D, which is not limited in the embodiment of the present application.
  • Step 705E Determine the second selection probability of the candidate vocabulary according to the first matching degree and the second matching degree.
  • the second selection probability has a positive correlation with the first matching degree and the second matching degree, that is, the higher the first matching degree and the second matching degree, the higher the second selection probability of the candidate word.
  • In addition to the reference visual context information g_r corresponding to the candidate word and the word feature vector e_r of the candidate word, the memory structure also includes auxiliary information u_r corresponding to the candidate word.
  • The auxiliary information may be the part of speech of the candidate word, the domain of the candidate word, the video category in which the candidate word is commonly used, and so on.
  • the auxiliary decoder determines the second selection probability of the candidate word according to the auxiliary information, the t-1th decoded word, the t-1th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
  • The second selection probability P_m of the candidate word w_k can be expressed as P_m(w_k) = exp(q_k) / Σ_{i=1}^{K} exp(q_i), where q_k is the relevance score of the candidate word w_k and K is the total number of candidate words in the vocabulary.
  • The relevance score of a candidate word is calculated from the first matching degree, the second matching degree, the hidden state, and the auxiliary information, where W_h and W_u are linear transformation matrices, u_i is the auxiliary information corresponding to the i-th candidate word, and b is a bias term.
  • The memory structure 832 in the auxiliary decoder 83 includes the reference visual context information g_i, the word feature vector e_i, and the auxiliary information u_i corresponding to each candidate word w_i.
  • The content of the memory structure 832, the target visual context information C_t, the (t-1)-th hidden state h_{t-1}, and the word vector e_{t-1} of the (t-1)-th decoded word are input to the decoding component 831, and the decoding component 831 outputs the second selection probability P_m of each candidate word.
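  • A sketch of how the decoding component of the auxiliary decoder might score the candidate words. Only the matching-degree forms [W_c · c_t + W_g · g_i] and [W'_e · e_{t-1} + W_e · e_i] and the softmax are stated above; summing these terms with a hidden-state term, an auxiliary-information term, and a bias, then reducing with a projection vector w, is an assumption made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def auxiliary_decoder_step(c_t, e_prev, h_prev, memory,
                           W_c, W_g, We_p, W_e, W_h, W_u, w, b):
    """Score each candidate word against its memory entry (g_i, e_i, u_i) and
    softmax the relevance scores into second selection probabilities P_m."""
    scores = []
    for g_i, e_i, u_i in memory:                  # one entry per candidate word
        visual_match = W_c @ c_t + W_g @ g_i      # first matching degree
        word_match = We_p @ e_prev + W_e @ e_i    # second matching degree
        q_i = w @ np.tanh(visual_match + word_match + W_h @ h_prev + W_u @ u_i + b)
        scores.append(q_i)
    return softmax(np.array(scores))
```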
  • Step 706 Calculate the target selection probability of each candidate vocabulary according to the first selection probability and the first weight corresponding to the first selection probability, and the second selection probability and the second weight corresponding to the second selection probability.
  • After obtaining the first selection probability and the second selection probability corresponding to each candidate word, the video description generation model calculates the target selection probability of the candidate word as the weighted sum of the first selection probability and the second selection probability according to the weight corresponding to each selection probability.
  • The calculation formula for the target selection probability of the candidate word w_k is P(w_k) = (1 - λ) · P_b(w_k) + λ · P_m(w_k), where λ is the second weight and (1 - λ) is the first weight.
  • the first weight and the second weight are hyperparameters obtained experimentally, and the first weight is greater than the second weight.
  • the value range of ⁇ is (0.1, 0.2).
  • Step 707 Determine the candidate word corresponding to the highest target selection probability as the decoded word.
  • the video description generation model obtains the target selection probability of each candidate vocabulary, and determines the candidate vocabulary corresponding to the highest target selection probability as the decoded word obtained by this decoding.
  • the video description generation model calculates the target selection probability P according to the first selection probability P b and the second selection probability P m , and determines the t-th decoded word w t based on the target selection probability P.
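  • A toy numeric illustration of steps 706 and 707; the vocabulary, probabilities, and λ value below are made up for the example.

```python
import numpy as np

vocab = ["pouring", "mixing", "bowl"]
p_b = np.array([0.30, 0.50, 0.20])       # first selection probabilities (basic decoder)
p_m = np.array([0.70, 0.10, 0.20])       # second selection probabilities (auxiliary decoder)
lam = 0.15
p = (1 - lam) * p_b + lam * p_m          # target selection probabilities: [0.36, 0.44, 0.20]
decoded_word = vocab[int(np.argmax(p))]  # "mixing" has the highest target probability here
```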
  • Step 708 Generate a video description corresponding to the target video according to each decoded word.
  • For the video 1001, the video description generated by the video description generation model in the related art is "a person is slicing bread", while the video description generated by the video description generation model in the embodiment of the present application is "a man is spreading butter on a bread". It can be seen that the video description generation model in the related art cannot recognize "spreading" and "butter" in the video 1001.
  • In the embodiment of the present application, the memory structure of the auxiliary decoder contains the association between "spreading"/"butter" and the related video pictures 1002 (that is, the reference visual context information), so the decoded words "spreading" and "butter" can be accurately decoded, which improves the accuracy of the video description.
  • In summary, the video description generation model uses the encoder to encode the target video to obtain two-dimensional visual features and three-dimensional visual features, and maps them to the same feature dimension, which improves the comprehensiveness of visual feature extraction and avoids mutual contamination between the two-dimensional and three-dimensional visual features.
  • In addition, the auxiliary decoder determines the selection probability of a candidate word according to the reference visual context information of the candidate word and the target visual context information of the current decoding, which helps to improve the accuracy of the finally determined decoded word; at the same time, the auxiliary decoder determines the selection probability of the candidate word according to the word feature vectors of the candidate word and of the last decoded word, which helps to improve the coherence between the finally determined decoded word and the last decoded word.
  • The process of generating the reference visual context information corresponding to the candidate words may include the following steps.
  • Step 1101 For each candidate vocabulary, determine I related videos corresponding to the candidate vocabulary according to the sample video description corresponding to the sample video.
  • the sample video description of the related video includes the candidate vocabulary, and I is an integer greater than or equal to 1.
  • The developer adds sample video descriptions to the sample videos through manual annotation, or uses an existing video description generation model to automatically generate sample video descriptions for the sample videos and then manually filters out sample video descriptions whose quality is lower than expected.
  • the computer device When determining the related video of each candidate vocabulary in the thesaurus, the computer device obtains the sample video description corresponding to each sample video, and determines the video containing the candidate vocabulary in the sample video description as the related video of the candidate vocabulary.
  • For example, if the sample video description of sample video B contains the candidate word "walking", the computer device determines sample video B as a related video corresponding to "walking".
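  • A simple sketch of step 1101, using made-up sample descriptions; matching by exact token is an assumption made for the example.

```python
def related_videos_for_vocab(sample_descriptions, vocab):
    """Map each candidate word to the sample videos whose description contains it.
    sample_descriptions: {video_id: description string}."""
    related = {w: [] for w in vocab}
    for video_id, desc in sample_descriptions.items():
        tokens = set(desc.lower().split())
        for w in vocab:
            if w in tokens:
                related[w].append(video_id)
    return related

samples = {"A": "a woman is mixing ingredients in a bowl",
           "B": "a man is walking a dog"}
print(related_videos_for_vocab(samples, ["walking", "mixing"]))
# {'walking': ['B'], 'mixing': ['A']}
```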
  • Step 1102 For each related video, determine k key visual features in the related video.
  • The matching degree between a key visual feature and the candidate word is higher than the matching degree between the non-key visual features in the related video and the candidate word, and k is an integer greater than or equal to 1.
  • non-key visual features are visual features other than key visual features in each related video.
  • the following steps may be included when determining key visual features in related videos.
  • The computer device first trains the basic decoder in the video description generation model, and then uses the trained basic decoder (with the attention mechanism) to obtain the feature weight of each visual feature with respect to the candidate word during decoding.
  • Specifically, the computer device uses the basic decoder to decode the visual features of the sample video and obtains, at the t-th decoding, the feature weight a_{i,t} of each visual feature (including v'_i or f'_i) for the candidate word, where a_{i,t} is calculated by the attention function f_att from the (t-1)-th hidden state h_{t-1} output by the basic decoder.
  • The computer device then determines the visual features corresponding to the top k (Top-k) feature weights as the key visual features of the candidate word.
  • The computer device separately extracts the two-dimensional visual features 1201 and the three-dimensional visual features 1202 of each related video, obtains, through the attention mechanism of the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, and selects the Top-k visual features as the key visual features 1203.
  • Step 1103 Generate reference visual context information corresponding to the candidate vocabulary according to each key visual feature corresponding to one related video.
  • the computer device fuses the key visual features corresponding to each related video to generate reference visual context information corresponding to the candidate words.
  • The reference visual context information g_r corresponding to a candidate word can be expressed as a weighted fusion of the key visual features over the I related videos, where I is the number of related videos, k is the number of key visual features corresponding to each related video, a_{i,j} is the feature weight of the j-th two-dimensional key visual feature f'_{i,j} with respect to the candidate word, and a'_{i,j} is the feature weight of the j-th three-dimensional key visual feature v'_{i,j} with respect to the candidate word.
  • Illustratively, as shown in Fig. 12, the computer device fuses the key visual features 1203 corresponding to the related videos to generate the reference visual context information 1204 (a hedged sketch of one possible fusion follows below).
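The exact fusion formula appears in the original description as an embedded image, so the sketch below is only one plausible reading under stated assumptions: the key visual features are combined by an attention-weighted sum and averaged over the I related videos.

```python
import numpy as np

def fuse_reference_context(key_feats_2d, key_feats_3d, w_2d, w_3d):
    """Fuse key visual features into reference visual context information g_r.

    key_feats_2d: list of I arrays, each (k, d) -- 2D key visual features f'_{i,j}
    key_feats_3d: list of I arrays, each (k, d) -- 3D key visual features v'_{i,j}
    w_2d, w_3d:   lists of I arrays, each (k,)  -- feature weights a_{i,j} and a'_{i,j}
    """
    per_video = []
    for f2d, f3d, a, a_prime in zip(key_feats_2d, key_feats_3d, w_2d, w_3d):
        fused = (a[:, None] * f2d).sum(axis=0) + (a_prime[:, None] * f3d).sum(axis=0)
        per_video.append(fused)
    return np.stack(per_video).sum(axis=0) / len(per_video)  # average over the I related videos
```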
  • Step 1104: Store the reference visual context information corresponding to each candidate word in the memory structure.
  • Further, the computer device stores the reference visual context information corresponding to each candidate word into the memory structure of the auxiliary decoder for subsequent use.
  • In this embodiment, the computer device extracts the key visual features of a candidate word from the related videos corresponding to that candidate word, generates the reference visual context information of the candidate word from a large number of key visual features, and stores it in the memory structure, which helps to improve the accuracy of the decoded words obtained in subsequent decoding (an illustrative layout of such a memory structure is sketched below).
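For illustration, such a memory structure could be laid out as a simple per-word record; the fields g_r, e_r, and u_r follow the description, while the data types and dimensions below are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    """One record in the auxiliary decoder's memory structure (illustrative fields)."""
    g_r: np.ndarray  # reference visual context information of the candidate word
    e_r: np.ndarray  # word feature vector (embedding) of the candidate word
    u_r: np.ndarray  # auxiliary information of the candidate word (e.g., encoded part of speech)

# The memory maps every candidate word in the vocabulary to its entry.
memory = {
    "walking": MemoryEntry(g_r=np.zeros(1024), e_r=np.zeros(300), u_r=np.zeros(16)),
}
```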
  • As the analysis results on the MSR-VTT and MSVD datasets show, the video description generation model in the embodiments of the present application is at a leading level on all four evaluation metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr).
  • Fig. 13 is a structural block diagram of an apparatus for generating a video description provided by an exemplary embodiment of the present application.
  • the apparatus may be set in the computer equipment described in the foregoing embodiment. As shown in Fig. 13, the apparatus includes:
  • the encoding module 1301 is used to encode the target video through the encoder of the video description generation model to obtain the target visual feature of the target video.
  • The first decoding module 1302 is used to decode the target visual feature through the basic decoder of the video description generation model using an attention mechanism, to obtain the first selection probability corresponding to each candidate word.
  • the second decoding module 1303 is used to decode the target visual features by the auxiliary decoder of the video description generation model to obtain the second selection probability corresponding to each candidate word.
  • The memory structure of the auxiliary decoder includes the reference visual context information corresponding to each candidate word, and the reference visual context information is generated based on the related videos corresponding to the candidate word.
  • the first determining module 1304 is configured to determine the decoded words in the candidate vocabulary according to the first selection probability and the second selection probability.
  • the first generating module 1305 is configured to generate a video description corresponding to the target video according to each decoded word.
  • the second decoding module 1303 includes:
  • The first obtaining unit is used to obtain, when the t-th decoding is performed, the (t-1)-th decoded word obtained by the (t-1)-th decoding and the (t-1)-th hidden state.
  • The (t-1)-th hidden state is the hidden state output by the basic decoder during the (t-1)-th decoding, and t is an integer greater than or equal to 2.
  • The first determining unit is configured to determine the second selection probability of the candidate word according to the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
  • In one embodiment, the first determining unit is configured to:
  • generate, according to the target visual feature and the (t-1)-th hidden state, the target visual context information for the t-th decoding; determine the first matching degree of the candidate word according to the target visual context information and the reference visual context information; obtain the first word feature vector corresponding to the candidate word in the memory structure and the second word feature vector of the (t-1)-th decoded word; determine the second matching degree of the candidate word according to the first word feature vector and the second word feature vector; and determine the second selection probability of the candidate word according to the first matching degree and the second matching degree.
  • In one embodiment, the memory structure also includes auxiliary information corresponding to each candidate word.
  • The first determining unit is then used to:
  • determine the second selection probability of the candidate word according to the auxiliary information, the (t-1)-th decoded word, the (t-1)-th hidden state, the target visual feature, and the reference visual context information corresponding to the candidate word.
  • the device includes:
  • The second determining module is used to determine, for each candidate word, the I related videos corresponding to the candidate word according to the sample video descriptions of the sample videos.
  • The sample video description of each related video contains the candidate word, and I is an integer greater than or equal to 1.
  • The third determining module is used to determine, for each related video, k key visual features in the related video, where the matching degree between a key visual feature and the candidate word is higher than the matching degree between the non-key visual features in the related video and the candidate word, and k is an integer greater than or equal to 1.
  • The second generating module is used to generate the reference visual context information corresponding to the candidate word according to the key visual features corresponding to the I related videos.
  • The storage module is used to store the reference visual context information corresponding to each candidate word in the memory structure.
  • the third determining module includes:
  • The acquiring unit is used to obtain, through the basic decoder, the feature weight of each visual feature in the related video with respect to the candidate word, where the feature weights sum to 1.
  • The second determining unit is used to determine the visual features corresponding to the top k feature weights as the key visual features.
  • the first determining module 1304 includes:
  • The calculation unit is configured to calculate the target selection probability of each candidate word according to the first selection probability and its corresponding first weight, and the second selection probability and its corresponding second weight.
  • The third determining unit is used to determine the candidate word corresponding to the highest target selection probability as the decoded word (see the sketch below).
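A short sketch of this weighted combination and arg-max step follows; the value of the second weight λ is a tuned hyperparameter (the description suggests a range of roughly 0.1 to 0.2), and the toy vocabulary and probabilities are illustrative only.

```python
import numpy as np

def pick_decoded_word(p_basic, p_memory, vocabulary, lam=0.15):
    """Combine the two decoders' probabilities and pick the decoded word.

    p_basic:  first selection probabilities from the basic decoder,      shape (K,)
    p_memory: second selection probabilities from the auxiliary decoder, shape (K,)
    lam:      second weight; the first weight is (1 - lam)
    """
    p_target = (1.0 - lam) * p_basic + lam * p_memory
    return vocabulary[int(np.argmax(p_target))], p_target

# Illustrative usage with a toy three-word vocabulary.
vocab = ["walking", "pouring", "bowl"]
word, p = pick_decoded_word(np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.7, 0.2]), vocab)
# p == [0.44, 0.36, 0.20], so word == "walking"
```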
  • the encoding module 1301 includes:
  • The encoding unit is used to encode the target video through the encoder to obtain the two-dimensional visual features and three-dimensional visual features of the target video.
  • The two-dimensional visual features are used to indicate the features of single image frames, and the three-dimensional visual features are used to indicate the temporal features of consecutive image frames.
  • The conversion unit is used to convert the two-dimensional visual features and the three-dimensional visual features to the same feature dimension to obtain the target visual feature (a sketch of this projection is given below).
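The original description specifies this conversion as learned linear mappings of the form f'_l = M_f * f_l + b_f and v'_n = M_v * v_n + b_v; the PyTorch-style sketch below assumes that framework and uses illustrative dimensions only.

```python
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Project 2D and 3D visual features into one shared hidden dimension.

    Mirrors the linear maps f' = M_f * f + b_f and v' = M_v * v + b_v from the
    description; the concrete dimensions below are illustrative only.
    """
    def __init__(self, dim_2d=2048, dim_3d=2048, dim_hidden=512):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_hidden)   # plays the role of M_f, b_f
        self.proj_3d = nn.Linear(dim_3d, dim_hidden)   # plays the role of M_v, b_v

    def forward(self, feats_2d, feats_3d):
        # feats_2d: (L, dim_2d) per-frame features; feats_3d: (N, dim_3d) per-clip features
        return self.proj_2d(feats_2d), self.proj_3d(feats_3d)
```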
  • In summary, in the embodiments of the present application, the encoder of the video description generation model encodes the target video to obtain the target visual feature; the target visual feature is then decoded both by the basic decoder, which is based on the attention mechanism, and by the auxiliary decoder, yielding the first selection probability and the second selection probability of each candidate word; the decoded word is determined from the candidate words by combining the first selection probability and the second selection probability, and the video description is then generated from the decoded words.
  • Because the memory structure of the auxiliary decoder in the model contains the reference visual context information corresponding to the candidate words, and the reference visual context information is generated based on the related videos of the candidate words, decoding with the auxiliary decoder can attend to the correlation between a candidate word and videos other than the current video, thereby improving the accuracy of decoded-word selection and, in turn, the quality of the subsequently generated video description.
  • It should be noted that the video description generation apparatus provided in the above embodiment is illustrated only by the above division into functional modules; in practical applications, the above functions can be allocated to different functional modules or units as required, that is, the internal structure of the device can be divided into different functional modules or units to complete all or part of the functions described above.
  • Each functional module or unit can be implemented in whole or in part by software, hardware or a combination thereof.
  • the video description generating device provided in the foregoing embodiment and the video description generating method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • The computer device 1400 includes a central processing unit (CPU) 1401, a system memory 1404 including a random access memory (RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
  • The computer device 1400 also includes a basic input/output system (I/O system) 1406 that helps transfer information between the various components in the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
  • the basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409 such as a mouse and a keyboard for the user to input information.
  • the display 1408 and the input device 1409 are both connected to the central processing unit 1401 through the input and output controller 1410 connected to the system bus 1405.
  • the basic input/output system 1406 may also include an input and output controller 1410 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input and output controller 1410 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405.
  • The mass storage device 1407 and its associated computer-readable medium provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • Computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, disk storage or other magnetic storage devices.
  • The system memory 1404 and the mass storage device 1407 described above may be collectively referred to as the memory. The memory stores one or more programs, which are configured to be executed by the one or more central processing units 1401 and contain computer-readable instructions for implementing the above methods; the central processing unit 1401 executes the one or more programs to implement the methods provided in the foregoing method embodiments.
  • According to various embodiments of the present application, the computer device 1400 may also run by being connected, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1400 can be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or the network interface unit 1411 can be used to connect to other types of networks or remote computer systems (not shown).
  • The memory also includes one or more programs, which are stored in the memory and contain instructions for performing the steps executed by the computer device in the methods provided in the embodiments of the present application.
  • The embodiments of the present application also provide one or more computer-readable storage media. The readable storage medium stores at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, which is loaded and executed by one or more processors to implement the video description generation method of any one of the above embodiments.
  • the present application also provides a computer program product, which when the computer program product runs on a computer, causes the computer to execute the video description generation method provided by the foregoing method embodiments.
  • A person of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware through computer-readable instructions, and the program can be stored in a computer-readable storage medium.
  • the readable storage medium may be a computer-readable storage medium included in the memory in the foregoing embodiment; or may be a computer-readable storage medium that exists alone and is not assembled into the terminal.
  • The computer-readable storage medium stores at least one computer-readable instruction, at least one program, a code set, or a computer-readable instruction set, which is loaded and executed by the processor to implement the video description generation method of any of the foregoing method embodiments.
  • The computer-readable storage medium may include a read-only memory (ROM), a random access memory (RAM), a solid-state drive (SSD), an optical disc, or the like.
  • random access memory may include resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
  • A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware through computer-readable instructions; the program can be stored in a computer-readable storage medium, and the aforementioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

A video description generation method, comprising: encoding a target video by an encoder of a video description generation model to obtain a target visual feature of the target video; decoding the target visual feature by a basic decoder of the video description generation model using an attention mechanism to obtain a first selection probability corresponding to each candidate word; decoding the target visual feature by an auxiliary decoder of the video description generation model to obtain a second selection probability corresponding to each candidate word, where a memory structure of the auxiliary decoder includes reference visual context information corresponding to each candidate word, the reference visual context information being generated according to related videos corresponding to the candidate word; determining a decoded word among the candidate words according to the first selection probability and the second selection probability; and generating a video description according to the decoded words.

Description

视频描述生成方法、装置、设备及存储介质
本申请要求于2019年04月22日提交中国专利局,申请号为2019103251930,申请名称为“视频描述生成方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能技术领域和视频描述领域,特别涉及一种视频描述生成方法、装置、设备及存储介质。
背景技术
视频描述(Video Captioning)是一种为视频生成内容描述信息的技术。在人工智能领域,通常采用视频描述生成模型自动为视频生成视频描述,而视频描述生成模型大多基于编码-解码(Encoder-Decoder)框架。
在应用视频描述生成模型过程中,视频描述生成模型首先通过编码器提取视频中的视觉特征,然后将提取到的视觉特征输入解码器,由解码器根据视觉特征依次生成解码词,并最终将生成的各个解码词组合成视频描述。
相关技术中的视频描述生成模型仅关注当前处理的视频,而在实际应用中,同一解码词可能会用于语义相似但并不完全相同的多个视频中,导致视频描述生成模型的关注角度过于局限,进而影响生成的视频描述的质量。
发明内容
根据本申请提供的各种实施例,提供了一种视频描述生成方法、装置、设备及存储介质。所述技术方案如下:
一种视频描述生成方法,由计算机设备执行,该方法包括:
通过视频描述生成模型的编码器对目标视频进行编码,得到目标视频的目标视觉特征;
通过视频描述生成模型的基础解码器,采用注意力机制对目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率;
通过视频描述生成模型的辅助解码器对目标视觉特征进行解码,得到各个候选词汇对应的第二选取概率,辅助解码器的记忆结构中包括各个候选词汇对应的参考视觉上下文信息,参考视觉上下文信息根据候选词汇对应的相关视频生成;
根据第一选取概率和第二选取概率确定候选词汇中的解码词;
根据各个解码词生成目标视频对应的视频描述。
一种视频描述生成装置,设置于计算机设备中,装置包括:
编码模块,用于通过视频描述生成模型的编码器对目标视频进行编码,得到目标视频的目标视觉特征;
第一解码模块,用于通过视频描述生成模型的基础解码器,采用注意力机制对目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率;
第二解码模块,用于通过视频描述生成模型的辅助解码器对目标视觉特征进行解码,得到各个候选词汇对应的第二选取概率,辅助解码器的记忆结构中包括各个候选词汇对应的参考视觉上下文信息,参考视觉上下文信息根据候选词汇对应的相关视频生成;
第一确定模块,用于根据第一选取概率和第二选取概率确定候选词汇中的解码词;
第一生成模块,用于根据各个解码词生成目标视频对应的视频描述。
一种计算机设备,计算机设备包括一个或多个处理器和存储器,存储器中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集由一个或多个处理器加载并执行以实现如上述方面的视频描述生成方法。
一个或多个计算机可读存储介质,可读存储介质中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集由一个或多个处理器加载并执行以实现如上述方面的视频描述生成方法。
一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行如上述方面的视频描述生成方法。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。基于本申请的说明书、附图以及权利要求书,本申请的其它特征、目的和优点将变得更加明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本 申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是在一个实施例中相关技术中利用SA-LSTM模型生成视频描述的原理示意图;
图2是在一个实施例中视频分类检索场景下视频描述生成方法的实施示意图;
图3是在一个实施例中视障人士辅助场景下视频描述生成方法的实施示意图;
图4是在一个实施例中的实施环境的示意图;
图5是在一个实施例中的视频描述生成方法的流程图;
图6是在一个实施例中视频描述生成模型生成的视频描述;
图7是在一个实施例中的视频描述生成方法的流程图;
图8是在一个实施例中视频描述生成模型的结构示意图;
图9是在一个实施例中辅助解码器确定候选词汇选取概率过程的流程图;
图10是在一个实施例中相关技术与本申请实施例中视频描述生成模型生成的视频描述;
图11是在一个实施例中候选词汇对应参考视觉上下文信息生成过程的流程图;
图12是在一个实施例中生成参考视觉上下文信息过程的实施示意图;
图13是在一个实施例中的视频描述生成装置的结构框图;
图14是在一个实施例中的计算机设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。应当理解,此处所描述的具体实施方式仅仅用以解释本申请,并不用于限定本申请。
在视频描述领域,利用基于编码-解码框架构建的视频描述生成模型为视频自动生成视频描述是一种常规手段。其中,基于编码-解码框架的视频描述生成模型可以是软注意力-长短期记忆((Soft Attention Long Short-Term Memory,SA-LSTM)模型。在一个示意性的例子中,利用SA-LSTM模型 生成视频描述的过程如图1所示。
SA-LSTM模型首先对输入的视频11进行特征提取,得到视频11的视觉特征12(v 1,v 2,…,v n)。然后,SA-LSTM模型根据上一隐藏状态13(第t-1次解码过程中输出的隐藏状态)以及视觉特征12,采用软注意力机制计算各个视觉特征12对当前解码过程(即第t次解码)的权重14,从而对视觉特征12和权重14进行加权求和计算,得到当前解码过程的上下文信息15。进一步的,SA-LSTM模型根据上一隐藏状态13、上一个解码词16以及上下文信息15,输出当前隐藏状态17,进而根据当前隐藏状态17确定当前解码词18。
可见,利用相关技术中的SA-LSTM模型生成视频描述时,SA-LSTM模型仅关注当前视频中的视觉特征,相应的,确定出的解码词仅与当前视频的视觉特征相关。然而在实际情况中,同一解码词可能出现在多个视频片段中,且在不同视频片段中表达相似但不完全相同的含义(即解码词可能对应相似但不完全相同的视觉特征),导致SA-LSTM模型输出的解码词的准确度较低,进而影响最终生成的视频描述的质量。
为了提高视频描述的质量,不同于相关技术中的“单编码器+单解码器”结构,本申请实施例中视频描述生成模型采用“编码器+基础解码器+辅助解码器”的结构,创造性地引入了记忆机制,将词库中各个候选词汇与相关视频之间的关联关系存储在记忆结构中,并将记忆结构添加到在辅助解码器中。本申请实施例提供的视频描述生成模型既能够关注当前视频(基础解码器),又能够关注与当前视频的视觉特征相似的其它视频(辅助解码器),从而避免仅关注当前视频造成的关注角度局限性,进而提高输出的解码词的准确度,提高生成的视频描述的质量。
本申请实施例提供的视频描述生成方法可以用于如下任一场景。
1、视频分类/检索场景
应用于视频分类场景时,本申请实施例中的视频描述生成模型可以实现成为视频管理应用程序或视频管理应用程序的一部分。将不包含视频描述的视频片段输入视频管理应用程序后,视频管理应用程序即通过视频描述生成模型中的编码器提取视频片段中的视觉特征,并分别利用基础解码器和辅助解码器对视觉特征进行解码,从而综合基础解码器和辅助解码器的解码结果 确定解码词,进而根据解码词为视频片段生成视频描述。对于包含视频描述的视频片段,视频管理应用程序基于视频描述(比如通过语义识别)对视频片段进行分类,并为视频片段添加相应的类别标签。后续进行视频检索时,视频管理应用程序即可根据检索条件和各个视频片段对应的类别标签,返回符合该检索条件的视频片段。
在一个示意性的例子中,如图2所示,用户使用手机拍摄一段视频后,点击保存控件21将该视频存在手机中,由视频管理应用程序自动在后台为该视频生成视频描述“一个男人在公园里遛狗”,进而根据生成的视频描述为该视频添加类别标签“遛狗”。后续用户需要从手机中存储的大量视频中检索该视频时,可以在视频管理应用程序的视频检索界面22中输入关键词“遛狗”,由视频管理应用程序将该关键词与各个视频对应的视频类别进行匹配,从而将匹配到的视频23作为检索结果进行显示。
2、视觉问答(Visual Question Answer,VQA)场景
应用于视觉问答场景时,本申请实施例中的视频描述生成模型可以实现成为智能问答应用程序或智能问答应用程序的一部分。智能问答应用程序获取到一段视频以及针对该视频的提问后,通过视频描述生成模型生成该视频对应的视频描述,并对提问和视频描述进行语义识别,从而生成提问对应的答案,进而对该答案进行显示。
3、视障人士辅助场景
应用于视障人士辅助场景时,本申请实施例中的视频描述生成模型可以实现成为语音提示应用程序或语音提示应用程序的一部分。安装有语音提示应用程序的终端(比如视障人士使用的辅助设备)通过摄像头采集到视障人士周围的环境视频后,语音提示应用程序即通过视频描述生成模型对该环境视频进行编码解码,生成环境视频对应的视频描述。对于生成的视频描述,语音提示应用程序可以将该视频描述由文字转化为语音,并进行语音播报,帮助视障人士了解周侧环境情况。
在一个示意性的例子中,如图3所示,视障人士佩戴的眼镜31上设置有摄像头32以及骨传导耳机33。工作状态下,摄像头31对前方环境进行图像采集,并采集到一段环境视频34。眼镜31通过处理器为环境视频34生成视频描述“前方有个男人正在遛狗”,并将该视频描述由文字转化为语音,进而 通过骨传导耳机33播放,以便视障人士根据语音提示进行避让。
当然,除了应用于上述场景外,本申请实施例提供方法还可以应用于其他需要为视频生成视频描述的场景,本申请实施例并不对具体的应用场景进行限定。
本申请实施例提供的视频描述生成方法可以应用于终端或者服务器等计算机设备中。在一个实施例中,本申请实施例中的视频描述生成模型可以实现成为应用程序或应用程序的一部分,并被安装到终端中,使终端具备生成视频描述的功能;或者,该视频描述生成模型可以应用于应用程序的后台服务器中,从而由服务器为终端中的应用程序提供生成视频描述的功能。
请参考图4,其示出了本申请一个示例性实施例提供的实施环境的示意图。该实施环境中包括终端410和服务器420,其中,终端410与服务器420之间通过通信网络进行数据通信,在一个实施例中,通信网络可以是有线网络也可以是无线网络,且该通信网络可以是局域网、城域网以及广域网中的至少一种。
终端410中安装有具有视频描述需求的应用程序,该应用程序可以是视频管理应用程序、智能问答应用程序、语音提示应用程序、字幕生成应用程序(为视频画面添加解说字幕)等等,本申请实施例对此不做限定。在一个实施例中,终端可以是手机、平板电脑、膝上便携式笔记本电脑、视障人士辅助设备等移动终端,也可以是台式电脑、投影式电脑等终端,本申请实施例对此不做限定。
服务器420可以实现为一台服务器,也可以实现为一组服务器构成的服务器集群,其可以是物理服务器,也可以实现为云服务器。在一个实施例中,服务器420是终端410中应用程序的后台服务器。
如图4所示,本申请实施例中,服务器420中存储有预先训练得到的视频描述生成模型421。在一种可能的应用场景下,当需要自动为目标视频生成视频描述时,应用程序即通过终端410将目标视频传输至服务器420,服务器420接收到目标视频后,将目标视频输入视频描述生成模型421。视频描述生成模型421通过解码器421A对目标视频进行特征提取,并分别通过基础解码器421B和辅助解码器422C对提取到的特征进行解码,从而根据解码结果生成视频描述,并反馈给终端410,由终端410中的应用程序对视频 描述进行显示。
在其他可能的实施方式中,当视频描述生成模型421实现成为终端410中应用程序的一部分时,终端410可以在本地生成目标视频的视频描述,而无需借助服务器420,从而提高终端获取视频描述的速度,降低与服务器交互产生的延迟。
请参考图5,其示出了本申请一个示例性实施例提供的视频描述生成方法的流程图。本实施例以该方法用于计算机设备为例进行说明,该方法包括如下步骤。
步骤501,通过视频描述生成模型的编码器对目标视频进行编码,得到目标视频的目标视觉特征。
本申请实施例中,视频描述生成模型中编码器的作用是从目标视频中提取目标视觉特征(visual feature),并将提取到的目标视觉特征输入解码器(包括基础解码器和辅助解码器)。在一个实施例中,该目标视觉特征采用向量表示。
在一个实施例中,视频描述生成模型利用预训练的深层卷积神经网络(Convolutional Neural Networks,CNNs)作为编码器进行视觉特征提取,且利用编码器进行特征提取前,需要对目标视频进行预处理,使预处理后的目标视频符合编码器的输入要求。
对于提取到的目标视觉特征,编码器分别将目标视觉特征输入基础解码器和辅助解码器,并执行下述步骤502和503。需要说明的是,下述步骤502和503之间不存在严格的先后时序,即步骤502和步骤503可以同步执行,本实施例并不对两者的执行顺序进行限定。
步骤502,通过视频描述生成模型的基础解码器对目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率,基础解码器用于采用注意力机制解码出与目标视觉特征匹配的候选词汇。
在一个实施例中,该基础解码器关注目标视频,从而基于目标视频的目标视觉特征进行解码。在一个实施例中,该基础解码器可以是采用注意力机制的循环神经网络(Recurrent Neural Network,RNN)编码器。比如,该基础解码器采用SA-LSTM模型,且每次解码时,基础解码器采用注意力机制,根据上一次解码输出的隐藏状态、上一个解码词以及目标视觉特征确定词库 中各个候选词汇对应的第一选取概率。当然,除了采用SA-LSTM模型外,该基础解码器还可以采用其他基于注意力机制的RNN编码器,本申请实施例对此并不构成限定。
在一个实施例中,基础解码器进行解码的过程本质上是一种分类任务,即通过softmax函数计算词库中各个候选词汇的(第一)选取概率。其中,第一选取概率越大,表明候选词汇与视频的上下文信息匹配度越高,即该候选词汇所表达的含义与上下文更加匹配。
步骤503,通过视频描述生成模型的辅助解码器对目标视觉特征进行解码,得到各个候选词汇对应的第二选取概率,辅助解码器的记忆结构中包括各个候选词汇对应的参考视觉上下文信息,参考视觉上下文信息根据候选词汇对应的相关视频生成。
不同于基础解码器仅关注目标视频的目标视觉特征,本实例中的辅助解码器关注候选词汇与相关视频之间的关联性,因此利用辅助解码器对目标视觉特征进行解码时,能够抓取到同一候选词汇在不同视频中的视觉特征,并与目标视频的目标视觉特征进行匹配,以此提高确定解码词的准确性。
在一个实施例中,候选词汇与相关视频的关联性存储在辅助解码器的记忆结构(memory structure)中,并通过候选词汇与参考视觉上下文信息的对应关系进行体现。其中,候选词汇对应的参考视觉上下文信息用于表示包含该候选词汇的相关视频的视觉上下文特征,且该参考视觉上下文信息根据样本视频中与候选词汇相关的相关视频生成。下述实施例将对参考视觉上下文信息的生成方式进行详细说明。
需要说明的是,除了利用记忆结构构建候选词汇与相关视频的关联性外,还可以采用基于图(graph)的算法来构建候选词汇与相关视频的关联性,本申请并不对此进行限定。
在一个实施例中,与基础解码器类似的,辅助解码器进行解码的过程本质上也是一种分类任务,即通过softmax函数计算词库中各个候选词汇的(第二)选取概率。其中,基础解码器与辅助解码器对应的相同词库,且第二选取概率越大,表明候选词汇与视频的上下文信息匹配度越高,即该候选词汇所表达的含义与上下文更加匹配。
步骤504,根据第一选取概率和第二选取概率确定候选词汇中的解码词。
不同于相关技术中仅根据单一解码器的解码结果确定解码词,本申请实施例中,视频描述生成模型综合基础解码器输出的第一选取概率和辅助解码器输出的第二选取概率,从词库中的各个候选词汇中确定出本次解码得到的解码词。
步骤505,根据各个解码词生成目标视频对应的视频描述。
由于视频描述中通常是由多个解码词构成的自然语言,因此每次解码时都需要重复上述502至504,依次生成视频描述的各个解码词,从而对多个解码词进行连接,最终生成视频描述。
综上,本申请实施例中,利用视频描述生成模型的编码器对目标视频进行编码,得到目标视觉特征后,分别通过基于注意力机制的基础解码器以及包含辅助解码器对目标视觉特征进行解码,得到各个候选词汇的第一选取概率和第二选取概率,从而综合第一选取概率和第二选取概率从候选词汇中确定出解码词,进而根据多个解码词生成视频描述;由于视频描述生成模型中辅助解码器的记忆结构包含候选词汇对应的参考视觉上下文信息,且该参考视觉上下文信息是基于候选词汇的相关视频生成的,因此利用辅助解码器进行解码时,能够关注到候选词汇与除当前视频以外其他视频之间的关联性,从而提高解码词选取的准确性,进而提高了后续生成的视频描述的质量。
在一个示意性的例子中,如图6所示,对于同一视频61,利用相关技术中的视频描述生成模型所生成的视频描述为“a woman is mixing ingredients in a bowl.”(一位女士正在混合碗里的材料);而本申请实施例中的视频描述生成模型所生成的视频描述为“a woman is pouring liquid into a bowl.”(一位女士将液体倒入碗中)。可见,相关技术中的视频描述生成模型无法识别出视频61中的“pouring”(倒),而本申请实施例中,由于辅助解码器的记忆结构中包含“pouring”与相关视频画面62之间的关联性(即参考视觉上下文信息),因此能够准确解码出“pouring”这个解码词,提高了视频描述的描述质量。
上述实施例对视频描述生成模型的工作原理进行了简单说明,下面采用示意性的实施例并结合附图,对视频描述生成过程中涉及的编码以及解码过程进行更加细致的说明。
请参考图7,其示出了本申请另一个示例性实施例提供的视频描述生成方法的流程图。本实施例以该方法用于计算机设备为例进行说明,该方法包 括如下步骤。
步骤701,通过编码器对目标视频进行编码,得到目标视频的二维视觉特征和三维视觉特征,二维视觉特征用于指示单帧图像的特征,三维视觉特征用于指示连续图像帧的时序特征。
由于视频是由连续图像帧构成,因此视频的视觉特征中既包含单帧图像的图像特征(即二维视觉特征),又包含连续图像帧的时序特征(即三维视觉特征)。在一个实施例中,编码器中包括用于提取二维视觉特征的第一子编码器以及用于提取三维视觉特征的第二子编码器。
相应的,对目标视频进行编码时,将目标视频划分为独立的图像帧,利用第一子编码器对各个图像帧进行特征提取,得到二维视觉特征;将目标视频划分为若干视频片段(每个视频片段中包含若干连续图像帧),利用第二子编码器对各个视频片段进行特征提取,得到三维视觉特征。
在一个实施例中,第一子编码器采用在ImageNet(用于视觉对象识别软件研究的大型可视化数据库)数据集上预训练好的ResNet-101模型(深度为101的残差网络),而第二子编码器采用Kinetics数据集上预训练好的ResNeXt-101模型。当然,第一子编码器和第二子编码器也可以采用其它模型,本申请实施例并不对此构成限定。
在一个示意性的例子中,对于包含L个图像帧的目标视频,通过解码器对目标视频进行编码,得到二维视觉特征F 2D={f 1,f 2,...,f L}以及三维视觉特征F 3D={v 1,v 2,...,v N},其中,N=L/d,d为每个视频片段中图像帧的数量。
示意性的,如图8所示,编码器81提取得到二维视觉特征811以及三维视觉特征812。
步骤702,将二维视觉特征和三维视觉特征转换到同一特征维度,得到目标视觉特征。
由于提取到的二维视觉特征和三维视觉特征的特征维度(比如向量尺寸)可能不同,因此为了统一视觉特征的特征维度,并避免二维视觉特征和三维视觉特征相互污染,在一个实施例中,视频描述生成模型将二维视觉特征和三维视觉特征转换到隐藏空间(hidden space)同一特征维度,得到目标视觉特征。
在一个示意性的例子中,对于任意二维视觉特征f l,其转换得到的目标视觉特征f' l=M ff l+b f,对于任意三维视觉特征v n,其转换得到的目标视觉特征v' n=M vv n+b v,其中,M f和M v为转换矩阵,b f和b v为偏置项。
步骤703,通过视频描述生成模型的基础解码器对目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率,基础解码器是用于采用注意力机制解码出与目标视觉特征匹配的候选词汇。
在一个实施例中,视频描述生成模型采用门控循环单元(Gated Recurrent Unit,GRU)作为基础解码器的骨架。示意性的,如图8所示,基础解码器82中包括GRU 821、GRU 822以及GRU 823。
相应的,基础解码器进行第t次解码时,可以包括如下步骤。
一、当进行第t次解码时,获取第t-1次解码得到的第t-1解码词以及第t-1隐藏状态,第t-1隐藏状态是基础解码器进行第t-1次解码时输出的隐藏状态,t为大于或等于2的整数。
基础解码器每次解码过程中,都会输出一个隐藏状态,后续即基于该隐藏状态确定本次解码得到的解码词。本申请实施例中,由于使用GRU输出隐藏状态时,需要利用到上一次解码时输出的隐藏状态以及上一个解码词,因此,基础解码器在进行第t次解码时,需要获取第t-1解码词以及第t-1隐藏状态。
示意性的,如图8所示,进行第t次解码时,基础解码器82获取GRU 821输出的第t-1隐藏状态h t-1,以及第t-1解码词w t-1对应的词向量e t-1
二、根据第t-1解码词、第t-1隐藏状态以及目标视觉特征,确定候选词汇的第一选取概率。
在不同解码阶段,不同视觉特征与当前解码词的相关度存在差异,因此,在计算第一选取概率前,基础解码器还需要采用注意力机制对编码器输出的目标视觉特征进行处理(加权求和),得到本次解码的目标视觉上下文信息。
在一个实施例中,基础解码器分别对二维视觉特征和三维视觉特征进行处理,得到二维视觉上下文信息和三维视觉上下文信息,并对二维视觉上下文信息和三维视觉上下文信息进行融合,得到目标视觉上下文信息。
其中,对于二维视觉特征f' i,对其处理得到二维视觉上下文信息
Figure PCTCN2020081721-appb-000001
其中,a i,t=f att(h t-1,f' i),h t-1为第t-1隐藏状态(向量表示),f att为注意力函数。
对于三维视觉特征v' i,对其处理得到三维视觉上下文信息
Figure PCTCN2020081721-appb-000002
其中,a' i,t=f att(h t-1,v' i),h t-1为第t-1隐藏状态(向量表示),f att为注意力函数。在一个实施例中,处理二维视觉特征和处理三维视觉特征时采用同一注意力函数。
对二维视觉上下文信息和三维视觉上下文信息进行融合,得到目标视觉上下文信息c t=[c t,2D;c t,3D]。
示意性的,如图8所示,采用注意力机制(图中的f att)分别对二维视觉特征811和三维视觉特征812进行处理,得到C t,2D和C t,3D,对处理结果进行融合,得到第t次解码时的目标视觉上下文信息C t
GRU根据第t-1解码词、第t-1隐藏状态以及目标视觉上下文信息,输出第t次解码的第t隐藏状态。GRU确定第t隐藏状态的方式可以表示为:
h t=GRU(h t-1,c t,e t-1)
进一步的,基础解码器基于第t隐藏状态,计算词库中各个候选词汇对应的第一选取概率。第一选取概率的计算公式如下:
Figure PCTCN2020081721-appb-000003
其中,w i为词库中的第i个候选词汇,K为词库中候选词汇总数,W i和b i是计算第i个候选词汇的线性映射得分时使用的参数。
示意性的,如图8所示,目标视觉上下文信息C t、GRU 821输出的第t-1隐藏状态h t-1以及第t-1解码词的词向量e t-1输入GRU 822中,由GRU822计算各个候选词汇的第一选取概率P b
步骤704,当进行第t次解码时,获取第t-1次解码得到的第t-1解码词 以及第t-1隐藏状态,第t-1隐藏状态是基础解码器进行第t-1次解码时输出的隐藏状态,t为大于或等于2的整数。
在一个实施例中,与基础解码器相似的,辅助解码器在解码过程中,也需要使用上一个解码词以及上一次解码时输出的隐藏状态,因此,在进行第t次解码时,辅助解码器获取第t-1解码词以及第t-1隐藏状态;第t-1隐藏状态是基础解码器进行第t-1次解码时输出的隐藏状态。
步骤705,根据第t-1解码词、第t-1隐藏状态、目标视觉特征以及候选词汇对应的参考视觉上下文信息,通过辅助解码器确定候选词汇的第二选取概率。
与基础解码器不同的是,辅助解码器在解码过程中,还需要获取记忆结构中各个候选词汇对应的参考视觉上下文信息,从而在解码过程中关注与候选词汇在相关视频中的视觉特征。
在一个实施例中,记忆结构中至少包括各个候选词汇对应的参考视觉上下文信息g r以及候选词汇的词向量特征e r。相应的,辅助解码器在解码过程中,重点计算候选词汇对应的目标视觉上下文信息与参考视觉上下文信息之间的匹配度,以及候选词汇的词特征向量与上一个解码词的词特征向量之间的匹配度,进而根据两个匹配度确定候选词汇的第二选取概率。
在一个实施例中,如图9所示,本步骤705可以包括如下步骤。
步骤705A,根据目标视觉特征和第t-1隐藏状态,生成进行第t次解码时的目标视觉上下文信息。
其中,根据目标视觉特征和第t-1隐藏状态,生成目标视觉上下文信息的过程可以参考上述步骤703,本实施例在此不再赘述。
在一个实施例中,辅助编码器可以从基础编码器处获取目标视觉上下文信息,而无需重复计算,本实施例对此不做限定。
步骤705B,根据目标视觉上下文信息和参考视觉上下文信息,确定候选词汇的第一匹配度。
由于候选词汇对应的参考视觉上下文信息基于候选词汇对应的相关视频生成,因此该参考视觉上下文信息可以反映出以该候选词汇为解码词的相关视频的视觉特征。相应的,当候选词汇对应的参考视觉上下文信息与本次解码时的目标视觉上下文信息匹配度较高时,该候选词汇与目标视觉上下文信 息的匹配度也越高。
在一个实施例中,辅助解码器将目标视觉上下文信息和参考视觉上下文信息之间的匹配度确定为候选词汇的第一匹配度,该第一匹配度可以表示为:[W c·c t+W g·g i],其中,W c和W g是线性变换矩阵,g i是第i个候选词汇对应的参考视觉上下文信息。
步骤705C,获取记忆结构中候选词汇对应的第一词特征向量以及第t-1解码词的第二词特征向量。
除了根据视觉上下文信息确定候选词汇的匹配度外,辅助解码器还根据候选词汇与上一个解码词的词意确定候选词汇的匹配度,从而提高后续解码得到的解码词与前一个解码词之间的连贯性。
在一个实施例中,辅助解码器从记忆结构中获取候选词汇对应的第一词特征向量,并通过转换矩阵,将第t-1解码词转化为第二词特征向量。
步骤705D,根据第一词特征向量和第二词特征向量,确定候选词汇的第二匹配度。
在一个实施例中,辅助解码器将第一词特征向量和第二词特征向量之间的匹配度确定为候选词汇的第二匹配度,该第二匹配度可以表示为:[W' e·e t-1+W e·e i],其中,W' e和W e是线性变换矩阵,e i是第i个候选词汇对应的词向量特征。
需要说明的是,上述步骤705A和705B与步骤705C和705D之间并不存在严格的先后时序,即步骤705A和705B可以与步骤705C和705D同步执行,本申请实施例对此不做限定。
步骤705E,根据第一匹配度和第二匹配度,确定候选词汇的第二选取概率。
在一个实施例中,第二选取概率与第一匹配度以及第二匹配度呈正相关关系,即第一匹配度和第二匹配度越高,候选词汇的第二选取概率越高。
在一个实施例中,为了进一步提高解码的准确性,记忆结构中除了包含候选词汇对应的参考视觉上下文信息g r以及候选词汇的词向量特征e r外,还包含候选词汇对应的辅助信息u r。其中,该辅助信息可以是候选词汇的词性、 候选词汇所属领域、常用该候选词汇的视频类别等等。
相应的,辅助解码器即根据辅助信息、第t-1解码词、第t-1隐藏状态、目标视觉特征以及候选词汇对应的参考视觉上下文信息,确定候选词汇的第二选取概率。
在一个实施例中,候选词汇w k的第二选取概率P m可以表示为:
Figure PCTCN2020081721-appb-000004
其中,q k为候选词汇w k的相关性分数,K为词库中候选词汇总数。
在一个实施例中,候选词汇的相关性分数计算公式如下:
q i=v Τtanh([W c·c t+W g·g i]+[W' e·e t-1+W e·e i]
+W h·h t-1+W u·u i+b)
其中,W h和W u是线性变换矩阵,u i是第i个候选词汇对应的辅助信息,b是偏置项。
示意性的,如图8所示,辅助解码器83的记忆结构832中包含各个候选词汇(w i)对应的参考视觉上下文信息g i、词向量特征e i以及辅助信息u i。进行第t次解码时,记忆结构832中的内容、目标视觉上下文信息C t、第t-1隐藏状态h t-1以及第t-1解码词的词特征向量e t-1输入解码组件831中,由入解码组件831输出各个候选词汇的第二选取概率P m
步骤706,根据第一选取概率和第一选取概率对应的第一权重,以及第二选取概率和第二选取概率对应的第二权重,计算各个候选词汇的目标选取概率。
在一个实施例中,对于词库中的各个候选词汇,视频描述生成模型获取该候选词汇对应的第一选取概率以及第二选取概率,并根据各项选取概率各自对应的权重,加权计算得到该候选词汇的目标选取概率。
示意性的,候选词汇w k的目标选取概率的计算公式如下:
P(w k)=(1-λ)P b(w k)+λP m(w k)
其中,λ为第二权重,(1-λ)为第一权重。
在一个实施例中,第一权重和第二权重为实验得到的超参数,且第一权重大于第二权重。比如,λ的取值范围为(0.1,0.2)。
步骤707,将最高目标选取概率对应的候选词汇确定为解码词。
进一步的,视频描述生成模型获取各个候选词汇的目标选取概率,并将最高目标选取概率对应的候选词汇确定为本次解码得到的解码词。
示意性的,如图8所示,视频描述生成模型根据第一选取概率P b和第二选取概率P m计算得到目标选取概率P,并基于目标选取概率P确定第t解码词w t
步骤708,根据各个解码词生成目标视频对应的视频描述。
在一个示意性的例子中,如图10所示,对于同一视频1001,利用相关技术中的视频描述生成模型所生成的视频描述为“a person is slicing bread”(一个人正在切面包);而本申请实施例中的视频描述生成模型所生成的视频描述为“a man is  spreading  butter on a bread”(一个人正在面包上涂黄油)。可见,相关技术中的视频描述生成模型无法识别出视频1001中的“spreading”以及“butter”,而本申请实施例中,由于辅助解码器的记忆结构中包含“spreading”以及“butter”与相关视频画面1002之间的关联性(即参考视觉上下文信息),因此能够准确解码出“spreading”和“butter”这些解码词,提高了视频描述的准确性。
本实施例中,视频描述生成模型利用解码器对目标视频解码得到二维视觉特征和三维视觉特征,并将二维视觉特征和三维视觉特征映射到同一特征维度,提高了视觉特征提取的全面性,并避免二维视觉特征和三维视觉特征相互污染。
另外,本实施例中,辅助解码器根据候选词汇的参考视觉特征上下文信息和当前解码的目标视觉上下文信息确定候选词汇的选取概率,有助于提高最终确定出的解码词的准确性;同时,辅助解码器根据候选词汇以及上一个解码词的词向量特征确定候选词汇的选取概率,有助于提高最终确定出的解码词与上一解码词的连贯性。
针对上述实施例中候选词汇对应参考视觉上下文信息的生成过程,在一个实施例中,如图11所示,该生成过程可以包括如下步骤。
步骤1101,对于各个候选词汇,根据样本视频对应的样本视频描述,确定候选词汇对应的I条相关视频,相关视频的样本视频描述中包含候选词汇,I为大于或等于1的整数。
在一个实施例中,开发人员采用人工标注方式为样本视频生成添加样本视频描述,或者,使用已有的视频描述生成模型,自动为样本视频生成样本视频描述,并通过人工方式过滤质量低于预期的样本视频描述。
在确定词库中各个候选词汇的相关视频时,计算机设备即获取各个样本视频对应的样本视频描述,并将样本视频描述中包含该候选词汇的视频确定为候选词汇的相关视频。
在一个示意性的例子中,对于候选词汇“散步”,若样本视频A对应的视频描述为“一个男人牵着一条狗”,而样本视频B对应的视频描述为“一个男人和一个女人在公园散步”,计算机设备则将样本视频B确定为“散步”对应的相关视频。
步骤1102,对于各条相关视频,确定相关视频中的k个关键视觉特征,关键视觉特征与候选词汇的匹配度高于相关视频中非关键视觉特征与候选词汇的匹配度,k为大于或等于1的整数。
对各个候选词汇对应的各条相关视频,由于相关视频中并非所有图像帧(或视频片段)均与该候选词汇相关,因此,计算机设备需要确定出各条相关视频中与候选词汇相关的关键视觉特征。可以理解,非关键视觉特征,是各条相关视频中的除关键视觉特征以外的视觉特征。
在一个实施例中,确定相关视频中的关键视觉特征时可以包括如下步骤。
一、通过基础解码器,获取相关视频中各个视觉特征对候选词汇的特征权重,其中,各个特征权重之和为1。
在一个实施例中,计算机设备首先训练视频描述生成模型中的基础解码器,并利用该基础解码器(采用注意力机制),获取解码该候选词汇时,相关视频中各个视觉特征对该候选词汇的特征权重。
在一个示意性的例子中,候选词汇是样本视频对应样本视频描述中的第t个解码词时,计算机设备即利用基础解码器对样本视频的视觉特征进行解码,并获取第t次解码时,基础解码器输出的第t-1隐藏状态h t-1,从而通过注意力函数f att计算各个视觉特征(包括v’ i或者f’ i)对该候选词汇的特征权重a i,t
二、将前k个特征权重对应的视觉特征确定为关键视觉特征。
当视觉特征对候选词汇的特征权重越大时,表明该视觉特征与候选词汇 的相关度越高,因此,计算机设备可以将前k个(Top-k)特征权重对应的视觉特征确定为候选词汇的关键视觉特征。
示意性的,如图12所示,对于候选词汇对应的I个相关视频,计算机设备分别提取各个相关视频的二维视觉特征1201和三维视觉特征1202,并通过基础解码器的注意力机制,获取相关视频中各个视觉特征对候选词汇的特征权重,并从中选取Top-k的视觉特征作为关键视觉特征1203。
步骤1103,根据I条相关视频对应的各个关键视觉特征,生成候选词汇对应的参考视觉上下文信息。
进一步的,计算机设备将各条相关视频对应的关键视觉特征进行融合,从而生成候选词汇对应的参考视觉上下文信息。
其中,候选词汇对应的参考视觉上下文信息g r可以表示为:
Figure PCTCN2020081721-appb-000005
其中,I为相关视频的个数,k为每个相关视频对应的关键视觉特征的数量,a i,j为第j个二维关键视觉特征f' i,j对候选词汇的特征权重,a' i,j为第j个三维关键视觉特征v' i,j对候选词汇的特征权重。
示意性的,如图12所示,计算机设备将各个相关视频对应的关键视觉特征1203进行融合,生成参考视觉上下文信息1204。
步骤1104,将各个候选词汇对应的参考视觉上下文信息存储到记忆结构。
进一步的,计算机设备将各个候选词汇对应的参考视觉上下文信息存储到辅助解码器的记忆结构中,以便后续使用。
本实施例中,计算机设备从候选词汇对应的相关视频中提取候选词汇的关键视觉特征,从而根据大量关键视觉特征生成候选词汇的参考视觉上下文信息,并存储到记忆结构中,有助于提高后续解码所得解码词的准确度。
在MSR-VTT数据集上,对相关技术以及本申请实施例中视频描述生成模型的视频描述质量进行分析,得到的分析结果如表一所示。
表一
Figure PCTCN2020081721-appb-000006
Figure PCTCN2020081721-appb-000007
在MSVD数据集上,对相关技术以及本申请实施例中视频描述生成模型的视频描述质量进行分析,得到的分析结果如表二所示。
表二
Figure PCTCN2020081721-appb-000008
从上述分析结果可以看出,本申请实施例中的视频描述生成模型,在四个评估指标(BLEU-4,METEROR,ROUGE-L,CIDEr)上均处于领先水平。
应该理解的是,虽然上述各实施例的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述各实施例中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
图13是本申请一个示例性实施例提供的视频描述生成装置的结构框图,该装置可以设置于上述实施例所述的计算机设备,如图13所示,该装置包括:
编码模块1301,用于通过视频描述生成模型的编码器对目标视频进行编码,得到目标视频的目标视觉特征。
第一解码模块1302,用于通过视频描述生成模型的基础解码器,采用注意力机制对目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率。
第二解码模块1303,用于通过视频描述生成模型的辅助解码器对目标视觉特征进行解码,得到各个候选词汇对应的第二选取概率,辅助解码器的记忆结构中包括各个候选词汇对应的参考视觉上下文信息,参考视觉上下文信息根据候选词汇对应的相关视频生成。
第一确定模块1304,用于根据第一选取概率和第二选取概率确定候选词汇中的解码词。
第一生成模块1305,用于根据各个解码词生成目标视频对应的视频描述。
在一个实施例中,第二解码模块1303,包括:
第一获取单元,用于当进行第t次解码时,获取第t-1次解码得到的第t-1解码词以及第t-1隐藏状态,第t-1隐藏状态是基础解码器进行第t-1次解码时输出的隐藏状态,t为大于或等于2的整数。
第一确定单元,用于根据第t-1解码词、第t-1隐藏状态、目标视觉特征以及候选词汇对应的参考视觉上下文信息,确定候选词汇的第二选取概率。
在一个实施例中,第一确定单元,用于:
根据目标视觉特征和第t-1隐藏状态,生成进行第t次解码时的目标视觉上下文信息;根据目标视觉上下文信息和参考视觉上下文信息,确定候选词汇的第一匹配度;获取记忆结构中候选词汇对应的第一词特征向量以及第t-1解码词的第二词特征向量;根据第一词特征向量和第二词特征向量,确定候选词汇的第二匹配度;根据第一匹配度和第二匹配度,确定候选词汇的第二选取概率。
在一个实施例中,记忆结构中还包括各个候选词汇对应的辅助信息。第一确定单元,用于:
根据辅助信息、第t-1解码词、第t-1隐藏状态、目标视觉特征以及候选词汇对应的参考视觉上下文信息,确定候选词汇的第二选取概率。
在一个实施例中,装置包括:
第二确定模块,用于对于各个候选词汇,根据样本视频对应的样本视频描述,确定候选词汇对应的I条相关视频,相关视频的样本视频描述中包含候选词汇,I为大于或等于1的整数。
第三确定模块,用于对于各条相关视频,确定相关视频中的k个关键视觉特征,关键视觉特征与候选词汇的匹配度高于相关视频中非关键视觉特征与候选词汇的匹配度,k为大于或等于1的整数。
第二生成模块,用于根据I条相关视频对应的各个关键视觉特征,生成候选词汇对应的参考视觉上下文信息。
存储模块,用于将各个候选词汇对应的参考视觉上下文信息存储到记忆结构。
在一个实施例中,第三确定模块,包括:
获取单元,用于通过基础解码器,获取相关视频中各个视觉特征对候选词汇的特征权重,其中,各个特征权重之和为1。
第二确定单元,用于将前k个特征权重对应的视觉特征确定为关键视觉特征。
在一个实施例中,第一确定模块1304,包括:
计算单元,用于根据第一选取概率和第一选取概率对应的第一权重,以及第二选取概率和第二选取概率对应的第二权重,计算各个候选词汇的目标选取概率。
第三确定单元,用于将最高目标选取概率对应的候选词汇确定为解码词。
在一个实施例中,编码模块1301,包括:
编码单元,用于通过编码器对目标视频进行编码,得到目标视频的二维视觉特征和三维视觉特征,二维视觉特征用于指示单帧图像的特征,三维视觉特征用于指示连续图像帧的时序特征。
转换单元,用于将二维视觉特征和三维视觉特征转换到同一特征维度,得到目标视觉特征。
综上,本申请实施例中,利用视频描述生成模型的编码器对目标视频进行编码,得到目标视觉特征后,分别通过基于注意力机制的基础解码器以及包含辅助解码器对目标视觉特征进行解码,得到各个候选词汇的第一选取概率和第二选取概率,从而综合第一选取概率和第二选取概率从候选词汇中确定出解码词,进而根据多个解码词生成视频描述;由于视频描述生成模型中辅助解码器的记忆结构包含候选词汇对应的参考视觉上下文信息,且该参考视觉上下文信息是基于候选词汇的相关视频生成的,因此利用辅助解码器进行解码时,能够关注到候选词汇与除当前视频以外其他视频之间的关联性,从而提高解码词选取的准确性,进而提高了后续生成的视频描述的质量。
需要说明的是:上述实施例提供的视频描述生成装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块或单元完成,即,将设备的内部结构划分成不同的功能模块或单元,以完成以上描述的全部或者部分功能。每个功能模块或单元可全部或部分通过软件、硬件或其组合来实现。另外,上述实施例提供的视频描述生成装置与视频描述生成方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图14,其示出了本申请一个示例性实施例提供的计算机设备的结构示意图。具体来讲:计算机设备1400包括中央处理单元(CPU)1401、包括随机存取存储器(RAM)1402和只读存储器(ROM)1403的系统存储器1404,以及连接系统存储器1404和中央处理单元1401的系统总线1405。计算机设备1400还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1406,和用于存储操作系统1413、应用程序1414和其他程序模块1415的大容量存储设备1407。
基本输入/输出系统1406包括有用于显示信息的显示器1408和用于用户输入信息的诸如鼠标、键盘之类的输入设备1409。其中显示器1408和输入设备1409都通过连接到系统总线1405的输入输出控制器1410连接到中央处理单元1401。基本输入/输出系统1406还可以包括输入输出控制器1410以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1410还提供输出到显示屏、打印机或其他类型的输出设备。
大容量存储设备1407通过连接到系统总线1405的大容量存储控制器(未示出)连接到中央处理单元1401。大容量存储设备1407及其相关联的计算机可读介质为计算机设备1400提供非易失性存储。也就是说,大容量存储设备1407可以包括诸如硬盘或者CD-ROI驱动器之类的计算机可读介质(未示出)。
不失一般性,计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知计算机存储介质不局限于上述几种。上述的系统存储器1404和大容量存储设备1407可以统称为存储器。
存储器存储有一个或多个程序,一个或多个程序被配置成由一个或多个中央处理单元1401执行,一个或多个程序包含用于实现上述方法的计算机可读指令,中央处理单元1401执行该一个或多个程序实现上述各个方法实施例提供的方法。
根据本申请的各种实施例,计算机设备1400还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1400可以通过连接在系统总线1405上的网络接口单元1411连接到网络1412,或者说,也可以使用网络接口单元1411来连接到其他类型的网络或远程计算机系统(未示出)。
存储器还包括一个或者一个以上的程序,一个或者一个以上程序存储于存储器中,一个或者一个以上程序包含用于进行本申请实施例提供的方法中 由计算机设备所执行的步骤。
本申请实施例还提供一个或多个计算机可读存储介质,该可读存储介质中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集由一个或多个处理器加载并执行以实现上述任一实施例的视频描述生成方法。
本申请还提供了一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行上述各个方法实施例提供的视频描述生成方法。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来计算机可读指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,该计算机可读存储介质可以是上述实施例中的存储器中所包含的计算机可读存储介质;也可以是单独存在,未装配入终端中的计算机可读存储介质。该计算机可读存储介质中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集由处理器加载并执行以实现上述任一方法实施例的视频描述生成方法。
在一个实施例中,该计算机可读存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、固态硬盘(SSD,Solid State Drives)或光盘等。其中,随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来计算机可读指令相关的硬件完成,的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。

Claims (20)

  1. 一种视频描述生成方法,其特征在于,由计算机设备执行,所述方法包括:
    通过视频描述生成模型的编码器对目标视频进行编码,得到所述目标视频的目标视觉特征;
    通过所述视频描述生成模型的基础解码器,采用注意力机制对所述目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率;
    通过所述视频描述生成模型的辅助解码器对所述目标视觉特征进行解码,得到各个所述候选词汇对应的第二选取概率,所述辅助解码器的记忆结构中包括各个所述候选词汇对应的参考视觉上下文信息,所述参考视觉上下文信息根据所述候选词汇对应的相关视频生成;
    根据所述第一选取概率和所述第二选取概率确定所述候选词汇中的解码词;
    根据各个所述解码词生成所述目标视频对应的视频描述。
  2. 根据权利要求1所述的方法,其特征在于,所述通过所述视频描述生成模型的辅助解码器对所述目标视觉特征进行解码,得到各个所述候选词汇对应的第二选取概率,包括:
    当进行第t次解码时,获取第t-1次解码得到的第t-1解码词以及第t-1隐藏状态,所述第t-1隐藏状态是所述基础解码器进行第t-1次解码时输出的隐藏状态,t为大于或等于2的整数;
    根据所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,通过辅助解码器确定所述候选词汇的所述第二选取概率。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,确定所述候选词汇的所述第二选取概率,包括:
    根据所述目标视觉特征和所述第t-1隐藏状态,生成进行第t次解码时的目标视觉上下文信息;
    根据所述目标视觉上下文信息和所述参考视觉上下文信息,确定所述候选词汇的第一匹配度;
    获取所述记忆结构中所述候选词汇对应的第一词特征向量以及所述第t-1解码词的第二词特征向量;
    根据所述第一词特征向量和所述第二词特征向量,确定所述候选词汇的第二匹配度;
    根据所述第一匹配度和所述第二匹配度,确定所述候选词汇的所述第二选取概率。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述目标视觉特征和所述第t-1隐藏状态,生成进行第t次解码时的目标视觉上下文信息包括:根据所述目标视觉特征和所述第t-1隐藏状态,得到进行第t次解码时的二维视觉上下文信息和三维视觉上下文信息;
    对所述二维视觉上下文信息和所述三维视觉上下文信息进行融合,得到进行第t次解码时的目标视觉上下文信息。
  5. 根据权利要求2所述的方法,其特征在于,所述记忆结构中还包括各个所述候选词汇对应的辅助信息;
    所述根据所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,确定所述候选词汇的所述第二选取概率,包括:
    根据所述辅助信息、所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,确定所述候选词汇的所述第二选取概率。
  6. 根据权利要求1至5任一所述的方法,其特征在于,所述方法包括:
    对于各个所述候选词汇,根据样本视频对应的样本视频描述,确定所述候选词汇对应的I条所述相关视频,所述相关视频的所述样本视频描述中包含所述候选词汇,I为大于或等于1的整数;
    对于各条所述相关视频,确定所述相关视频中的k个关键视觉特征,所述关键视觉特征与所述候选词汇的匹配度高于所述相关视频中非关键视觉特征与所述候选词汇的匹配度,k为大于等于1的整数;
    根据I条所述相关视频对应的各个所述关键视觉特征,生成所述候选词汇对应的所述参考视觉上下文信息;
    将各个所述候选词汇对应的所述参考视觉上下文信息存储到所述记忆结 构。
  7. 根据权利要求6所述的方法,其特征在于,所述确定所述相关视频中的k个关键视觉特征,包括:
    通过所述基础解码器,获取所述相关视频中各个视觉特征对所述候选词汇的特征权重,其中,各个所述特征权重之和为1;
    将前k个所述特征权重对应的所述视觉特征确定为所述关键视觉特征。
  8. 根据权利要求1至5任一所述的方法,其特征在于,所述根据所述第一选取概率和所述第二选取概率确定所述候选词汇中的解码词,包括:
    根据所述第一选取概率和所述第一选取概率对应的第一权重,以及所述第二选取概率和所述第二选取概率对应的第二权重,计算各个所述候选词汇的目标选取概率;
    将最高目标选取概率对应的所述候选词汇确定为所述解码词。
  9. 根据权利要求1至5任一所述的方法,其特征在于,所述通过视频描述生成模型的编码器对目标视频进行编码,得到所述目标视频的目标视觉特征,包括:
    通过所述编码器对所述目标视频进行编码,得到所述目标视频的二维视觉特征和三维视觉特征,所述二维视觉特征用于指示单帧图像的特征,所述三维视觉特征用于指示连续图像帧的时序特征;
    将所述二维视觉特征和所述三维视觉特征转换到同一特征维度,得到所述目标视觉特征。
  10. 一种视频描述生成装置,其特征在于,设置于计算机设备中,所述装置包括:
    编码模块,用于通过视频描述生成模型的编码器对目标视频进行编码,得到所述目标视频的目标视觉特征;
    第一解码模块,用于通过所述视频描述生成模型的基础解码器,采用注意力机制对所述目标视觉特征进行解码,得到各个候选词汇对应的第一选取概率;
    第二解码模块,用于通过所述视频描述生成模型的辅助解码器对所述目标视觉特征进行解码,得到各个所述候选词汇对应的第二选取概率,所述辅助解码器的记忆结构中包括各个所述候选词汇对应的参考视觉上下文信息, 所述参考视觉上下文信息根据所述候选词汇对应的相关视频生成;
    第一确定模块,用于根据所述第一选取概率和所述第二选取概率确定所述候选词汇中的解码词;
    第一生成模块,用于根据各个所述解码词生成所述目标视频对应的视频描述。
  11. 根据权利要求10所述的装置,其特征在于,所述第二解码模块,包括:
    第一获取单元,用于当进行第t次解码时,获取第t-1次解码得到的第t-1解码词以及第t-1隐藏状态,所述第t-1隐藏状态是所述辅助解码器进行第t-1次解码时输出的隐藏状态,t为大于或等于2的整数;
    第一确定单元,用于根据所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,通过辅助解码器确定所述候选词汇的所述第二选取概率。
  12. 根据权利要求11所述的装置,其特征在于,所述第一确定单元,用于:
    根据所述目标视觉特征和所述第t-1隐藏状态,生成进行第t次解码时的目标视觉上下文信息;
    根据所述目标视觉上下文信息和所述参考视觉上下文信息,确定所述候选词汇的第一匹配度;
    获取所述记忆结构中所述候选词汇对应的第一词特征向量以及所述第t-1解码词的第二词特征向量;
    根据所述第一词特征向量和所述第二词特征向量,确定所述候选词汇的第二匹配度;
    根据所述第一匹配度和所述第二匹配度,确定所述候选词汇的所述第二选取概率。
  13. 根据权利要求12所述的装置,其特征在于,所述第一确定单元还用于:根据所述目标视觉特征和所述第t-1隐藏状态,得到进行第t次解码时的二维视觉上下文信息和三维视觉上下文信息;
    对所述二维视觉上下文信息和所述三维视觉上下文信息进行融合,得到进行第t次解码时的目标视觉上下文信息。
  14. 根据权利要求11所述的装置,其特征在于,所述记忆结构中还包括各个所述候选词汇对应的辅助信息;
    所述第一确定单元,用于:
    根据所述辅助信息、所述第t-1解码词、所述第t-1隐藏状态、所述目标视觉特征以及所述候选词汇对应的所述参考视觉上下文信息,确定所述候选词汇的所述第二选取概率。
  15. 根据权利要求10至14任一所述的装置,其特征在于,所述装置包括:
    第二确定模块,用于对于各个所述候选词汇,根据样本视频对应的样本视频描述,确定所述候选词汇对应的I条所述相关视频,所述相关视频的所述样本视频描述中包含所述候选词汇,I为大于或等于1的整数;
    第三确定模块,用于对于各条所述相关视频,确定所述相关视频中的k个关键视觉特征,所述关键视觉特征与所述候选词汇的匹配度高于所述相关视频中非关键视觉特征与所述候选词汇的匹配度,k为大于等于1的整数;
    第二生成模块,用于根据I条所述相关视频对应的各个所述关键视觉特征,生成所述候选词汇对应的所述参考视觉上下文信息;
    存储模块,用于将各个所述候选词汇对应的所述参考视觉上下文信息存储到所述记忆结构。
  16. 根据权利要求15所述的装置,其特征在于,所述第三确定模块,包括:
    获取单元,用于通过所述基础解码器,获取所述相关视频中各个视觉特征对所述候选词汇的特征权重,其中,各个所述特征权重之和为1;
    第二确定单元,用于将前k个所述特征权重对应的所述视觉特征确定为所述关键视觉特征。
  17. 根据权利要求10至14任一所述的装置,其特征在于,所述第一确定模块,包括:
    计算单元,用于根据所述第一选取概率和所述第一选取概率对应的第一权重,以及所述第二选取概率和所述第二选取概率对应的第二权重,计算各个所述候选词汇的目标选取概率;
    第三确定单元,用于将最高目标选取概率对应的所述候选词汇确定为所 述解码词。
  18. 根据权利要求10至14任一所述的装置,其特征在于,所述编码模块,包括:
    编码单元,用于通过所述编码器对所述目标视频进行编码,得到所述目标视频的二维视觉特征和三维视觉特征,所述二维视觉特征用于指示单帧图像的特征,所述三维视觉特征用于指示连续图像帧的时序特征;
    转换单元,用于将所述二维视觉特征和所述三维视觉特征转换到同一特征维度,得到所述目标视觉特征。
  19. 一种计算机设备,其特征在于,所述计算机设备包括一个或多个处理器和存储器,所述存储器中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,所述至少一条计算机可读指令、所述至少一段程序、所述代码集或计算机可读指令集由所述一个或多个处理器加载并执行以实现如权利要求1至9任一所述的视频描述生成方法。
  20. 一个或多个计算机可读存储介质,其特征在于,所述可读存储介质中存储有至少一条计算机可读指令、至少一段程序、代码集或计算机可读指令集,所述至少一条计算机可读指令、所述至少一段程序、所述代码集或计算机可读指令集由一个或多个处理器加载并执行以实现如权利要求1至9任一所述的视频描述生成方法。
PCT/CN2020/081721 2019-04-22 2020-03-27 视频描述生成方法、装置、设备及存储介质 WO2020215988A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2021531058A JP7179183B2 (ja) 2019-04-22 2020-03-27 ビデオキャプションの生成方法、装置、デバイスおよびコンピュータプログラム
KR1020217020589A KR102477795B1 (ko) 2019-04-22 2020-03-27 비디오 캡션 생성 방법, 디바이스 및 장치, 그리고 저장 매체
EP20795471.0A EP3962097A4 (en) 2019-04-22 2020-03-27 VIDEO SUBTITLE GENERATION METHOD, DEVICE AND APPARATUS AND STORAGE MEDIA
US17/328,970 US11743551B2 (en) 2019-04-22 2021-05-24 Video caption generating method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910325193.0A CN109874029B (zh) 2019-04-22 2019-04-22 视频描述生成方法、装置、设备及存储介质
CN201910325193.0 2019-04-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/328,970 Continuation US11743551B2 (en) 2019-04-22 2021-05-24 Video caption generating method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020215988A1 true WO2020215988A1 (zh) 2020-10-29

Family

ID=66922965

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081721 WO2020215988A1 (zh) 2019-04-22 2020-03-27 视频描述生成方法、装置、设备及存储介质

Country Status (6)

Country Link
US (1) US11743551B2 (zh)
EP (1) EP3962097A4 (zh)
JP (1) JP7179183B2 (zh)
KR (1) KR102477795B1 (zh)
CN (1) CN109874029B (zh)
WO (1) WO2020215988A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343986A (zh) * 2021-06-29 2021-09-03 北京奇艺世纪科技有限公司 字幕时间区间确定方法、装置、电子设备及可读存储介质
CN113673376A (zh) * 2021-08-03 2021-11-19 北京奇艺世纪科技有限公司 弹幕生成方法、装置、计算机设备和存储介质
CN116166827A (zh) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 语义标签抽取模型的训练和语义标签的抽取方法及其装置

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109874029B (zh) * 2019-04-22 2021-02-12 腾讯科技(深圳)有限公司 视频描述生成方法、装置、设备及存储介质
CN110263218B (zh) * 2019-06-21 2022-02-25 北京百度网讯科技有限公司 视频描述文本生成方法、装置、设备和介质
CN110891201B (zh) * 2019-11-07 2022-11-01 腾讯科技(深圳)有限公司 文本生成方法、装置、服务器和存储介质
CN111860597B (zh) * 2020-06-17 2021-09-07 腾讯科技(深圳)有限公司 一种视频信息处理方法、装置、电子设备及存储介质
CN112528883A (zh) * 2020-12-15 2021-03-19 杭州义顺科技有限公司 一种基于反思网络的教学场景视频描述生成方法
CN112580570B (zh) * 2020-12-25 2024-06-21 南通大学 人体姿态图像的关键点检测方法
CN113569068B (zh) * 2021-01-19 2023-09-29 腾讯科技(深圳)有限公司 描述内容生成方法、视觉内容的编码、解码方法、装置
CN113099228B (zh) * 2021-04-30 2024-04-05 中南大学 一种视频编解码方法及系统
CN113596557B (zh) * 2021-07-08 2023-03-21 大连三通科技发展有限公司 一种视频生成方法及装置
CN113792166B (zh) * 2021-08-18 2023-04-07 北京达佳互联信息技术有限公司 信息获取方法、装置、电子设备及存储介质
CN113810730B (zh) * 2021-09-17 2023-08-01 咪咕数字传媒有限公司 基于视频的实时文本生成方法、装置及计算设备
CN114422841B (zh) * 2021-12-17 2024-01-02 北京达佳互联信息技术有限公司 字幕生成方法、装置、电子设备及存储介质
CN114501064B (zh) * 2022-01-29 2023-07-14 北京有竹居网络技术有限公司 一种视频生成方法、装置、设备、介质及产品

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN108024158A (zh) * 2017-11-30 2018-05-11 天津大学 利用视觉注意力机制的有监督视频摘要提取方法
CN108388900A (zh) * 2018-02-05 2018-08-10 华南理工大学 基于多特征融合和时空注意力机制相结合的视频描述方法
CN108419094A (zh) * 2018-03-05 2018-08-17 腾讯科技(深圳)有限公司 视频处理方法、视频检索方法、装置、介质及服务器
CN108509411A (zh) * 2017-10-10 2018-09-07 腾讯科技(深圳)有限公司 语义分析方法和装置
CN109344288A (zh) * 2018-09-19 2019-02-15 电子科技大学 一种基于多模态特征结合多层注意力机制的结合视频描述方法
CN109359214A (zh) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 基于神经网络的视频描述生成方法、存储介质及终端设备
CN109874029A (zh) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 视频描述生成方法、装置、设备及存储介质

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107809642B (zh) * 2015-02-16 2020-06-16 华为技术有限公司 用于视频图像编码和解码的方法、编码设备和解码设备
US10303768B2 (en) * 2015-05-04 2019-05-28 Sri International Exploiting multi-modal affect and semantics to assess the persuasiveness of a video
US10929681B2 (en) * 2016-11-03 2021-02-23 Nec Corporation Surveillance system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation
CN108062505B (zh) * 2016-11-09 2022-03-18 微软技术许可有限责任公司 用于基于神经网络的动作检测的方法和设备
US10558750B2 (en) * 2016-11-18 2020-02-11 Salesforce.Com, Inc. Spatial attention model for image captioning
JP6867153B2 (ja) * 2016-12-21 2021-04-28 ホーチキ株式会社 異常監視システム
US10592751B2 (en) * 2017-02-03 2020-03-17 Fuji Xerox Co., Ltd. Method and system to generate targeted captions and summarize long, continuous media files
EP3399460B1 (en) * 2017-05-02 2019-07-17 Dassault Systèmes Captioning a region of an image
US10909157B2 (en) * 2018-05-22 2021-02-02 Salesforce.Com, Inc. Abstraction of text summarization
US10831834B2 (en) * 2018-11-27 2020-11-10 Sap Se Unsupervised document summarization by attention and reconstruction
EP3892005A4 (en) * 2019-03-21 2022-07-06 Samsung Electronics Co., Ltd. METHOD, DEVICE, DEVICE AND MEDIA FOR GENERATION OF SUBTITLING INFORMATION FROM MULTIMEDIA DATA
CN111836111A (zh) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 生成弹幕的技术

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN108509411A (zh) * 2017-10-10 2018-09-07 腾讯科技(深圳)有限公司 语义分析方法和装置
CN108024158A (zh) * 2017-11-30 2018-05-11 天津大学 利用视觉注意力机制的有监督视频摘要提取方法
CN108388900A (zh) * 2018-02-05 2018-08-10 华南理工大学 基于多特征融合和时空注意力机制相结合的视频描述方法
CN108419094A (zh) * 2018-03-05 2018-08-17 腾讯科技(深圳)有限公司 视频处理方法、视频检索方法、装置、介质及服务器
CN109344288A (zh) * 2018-09-19 2019-02-15 电子科技大学 一种基于多模态特征结合多层注意力机制的结合视频描述方法
CN109359214A (zh) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 基于神经网络的视频描述生成方法、存储介质及终端设备
CN109874029A (zh) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 视频描述生成方法、装置、设备及存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343986A (zh) * 2021-06-29 2021-09-03 北京奇艺世纪科技有限公司 字幕时间区间确定方法、装置、电子设备及可读存储介质
CN113343986B (zh) * 2021-06-29 2023-08-25 北京奇艺世纪科技有限公司 字幕时间区间确定方法、装置、电子设备及可读存储介质
CN113673376A (zh) * 2021-08-03 2021-11-19 北京奇艺世纪科技有限公司 弹幕生成方法、装置、计算机设备和存储介质
CN113673376B (zh) * 2021-08-03 2023-09-01 北京奇艺世纪科技有限公司 弹幕生成方法、装置、计算机设备和存储介质
CN116166827A (zh) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 语义标签抽取模型的训练和语义标签的抽取方法及其装置
CN116166827B (zh) * 2023-04-24 2023-12-15 北京百度网讯科技有限公司 语义标签抽取模型的训练和语义标签的抽取方法及其装置

Also Published As

Publication number Publication date
KR102477795B1 (ko) 2022-12-14
EP3962097A1 (en) 2022-03-02
US11743551B2 (en) 2023-08-29
KR20210095208A (ko) 2021-07-30
CN109874029B (zh) 2021-02-12
US20210281774A1 (en) 2021-09-09
JP2022509299A (ja) 2022-01-20
JP7179183B2 (ja) 2022-11-28
EP3962097A4 (en) 2022-07-13
CN109874029A (zh) 2019-06-11

Similar Documents

Publication Publication Date Title
WO2020215988A1 (zh) 视频描述生成方法、装置、设备及存储介质
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN108304439B (zh) 一种语义模型优化方法、装置及智能设备、存储介质
US20210224601A1 (en) Video sequence selection method, computer device, and storage medium
WO2022161298A1 (zh) 信息生成方法、装置、设备、存储介质及程序产品
US20230077849A1 (en) Content recognition method and apparatus, computer device, and storage medium
CN110234018B (zh) 多媒体内容描述生成方法、训练方法、装置、设备及介质
US11868738B2 (en) Method and apparatus for generating natural language description information
CN111897939B (zh) 视觉对话方法、视觉对话模型的训练方法、装置及设备
WO2023065619A1 (zh) 多维度细粒度动态情感分析方法及系统
WO2022033208A1 (zh) 视觉对话方法、模型训练方法、装置、电子设备及计算机可读存储介质
CN110704601A (zh) 利用问题-知识引导的渐进式时空注意力网络解决需要常识的视频问答任务的方法
JP2015162244A (ja) 発話ワードをランク付けする方法、プログラム及び計算処理システム
CN111837142A (zh) 用于表征视频内容的深度强化学习框架
CN110263218B (zh) 视频描述文本生成方法、装置、设备和介质
CN114339450B (zh) 视频评论生成方法、系统、设备及存储介质
CN113392265A (zh) 多媒体处理方法、装置及设备
CN116912642A (zh) 基于双模多粒度交互的多模态情感分析方法、设备及介质
CN114281948A (zh) 一种纪要确定方法及其相关设备
CN115130461A (zh) 一种文本匹配方法、装置、电子设备及存储介质
CN115982343B (zh) 摘要生成方法、训练摘要生成模型的方法及装置
CN116932922B (zh) 搜索词条处理方法、装置、计算机设备和计算机存储介质
CN116561350B (zh) 一种资源生成方法及相关装置
WO2023238722A1 (ja) 情報作成方法、情報作成装置、及び動画ファイル
Kong Final Year Project Final Report

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20795471

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021531058

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217020589

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020795471

Country of ref document: EP

Effective date: 20211122