WO2022134793A1 - 视频帧语义信息的提取方法、装置及计算机设备 - Google Patents

视频帧语义信息的提取方法、装置及计算机设备 Download PDF

Info

Publication number
WO2022134793A1
WO2022134793A1 PCT/CN2021/124889 CN2021124889W WO2022134793A1 WO 2022134793 A1 WO2022134793 A1 WO 2022134793A1 CN 2021124889 W CN2021124889 W CN 2021124889W WO 2022134793 A1 WO2022134793 A1 WO 2022134793A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
sequence
video
vector
frame sequence
Prior art date
Application number
PCT/CN2021/124889
Other languages
English (en)
French (fr)
Inventor
王德勋
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022134793A1 publication Critical patent/WO2022134793A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus and computer equipment for extracting semantic information of video frames.
  • convolutional neural networks can usually be used to extract semantic information in video frames.
  • however, this way of extracting video information with a convolutional neural network can obtain only limited semantic information.
  • higher-level semantic information, such as the correlation between any two video frames in a video, cannot be extracted, which is not conducive to the execution of downstream tasks such as video storage and retrieval.
  • the present application provides a method, device and computer equipment for extracting semantic information of video frames, mainly in that the correlation information between any two video frames in a video can be extracted, so that higher-level semantic information can be obtained, which facilitates downstream tasks such as video storage and retrieval.
  • a method for extracting semantic information of video frames, comprising:
  • acquiring a video frame sequence for which semantic information is to be extracted, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence;
  • determining, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector;
  • determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • a device for extracting semantic information of video frames, comprising:
  • an acquisition unit, configured to acquire a video frame sequence for which semantic information is to be extracted, and to perform video feature extraction on the video frame sequence to obtain a video frame feature sequence;
  • a computing unit, configured to determine, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and to calculate the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector;
  • a determining unit, configured to determine the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • acquiring a video frame sequence for which semantic information is to be extracted, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence;
  • determining, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector;
  • determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the program:
  • acquiring a video frame sequence for which semantic information is to be extracted, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence;
  • determining, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector;
  • determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • the present application provides a method, device and computer equipment for extracting semantic information of video frames.
  • compared with the current way of extracting semantic information from video frames only with a convolutional neural network, the present application can obtain a video frame sequence for which semantic information is to be extracted, perform video feature extraction on the video frame sequence to obtain a video frame feature sequence, determine, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, calculate the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector, and finally determine the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • by calculating the correlation between any two video frames in the video frame sequence, higher-level semantic information in a video can be obtained, which facilitates the execution of downstream tasks such as video storage and retrieval.
  • FIG. 1 shows a flowchart of a method for extracting semantic information of a video frame provided by an embodiment of the present application
  • FIG. 2 shows a flowchart of another method for extracting semantic information of video frames provided by an embodiment of the present application
  • FIG. 3 shows a schematic structural diagram of an apparatus for extracting semantic information of a video frame provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of another apparatus for extracting semantic information of a video frame provided by an embodiment of the present application
  • FIG. 5 shows a schematic diagram of an entity structure of a computer device provided by an embodiment of the present application.
  • convolutional neural networks can usually be used to extract semantic information in video frames.
  • however, this way of extracting video information with a convolutional neural network can obtain only limited semantic information.
  • higher-level semantic information, such as the correlation between any two video frames in a video, cannot be extracted, which is not conducive to the execution of downstream tasks such as video storage and retrieval.
  • an embodiment of the present application provides a method for extracting semantic information of a video frame, as shown in FIG. 1 , the method includes:
  • the video frame sequence is obtained by parsing the video from which semantic information is to be extracted.
  • to overcome the defect that the semantic information extracted with a convolutional neural network is limited and higher-level semantic information cannot be obtained, the embodiment of the present application uses the attention mechanism in a preset encoder to calculate the correlation between any two video frames in the video frame sequence, and higher-level semantic information in the video can then be obtained according to the calculated correlation.
  • the embodiment of the present application is mainly applicable to the extraction of semantic information in the video.
  • the execution subject of the embodiment of the present application is an apparatus or device capable of extracting semantic information from video, which may specifically be deployed on the server side or the client side.
  • the video frame sequence of a video can be extracted by installing the imageio library and the skimage library.
  • the video frame sequence of the video can also be extracted with the Adobe Premiere software; the manner of extracting the video frame sequence is not specifically limited in this embodiment of the present application.
  • a convolutional neural network can be used to extract the video frame feature sequence corresponding to the video frame sequence; it should be noted that the model used to extract the video frame feature sequence may be, but is not limited to, a convolutional neural network.
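  • as an illustrative sketch only (not part of the original disclosure), the frame extraction and feature extraction described above could look roughly as follows in Python; the use of torchvision, the ResNet-18 backbone, the 224x224 input size and the frame-sampling stride are assumptions, since the text above only names the imageio/skimage libraries and a convolutional neural network:

```python
import imageio.v2 as imageio
import numpy as np
import torch
from torchvision import models, transforms

# Read a video into a frame sequence (imageio is one of the libraries named above;
# the video path and the sampling stride are illustrative assumptions).
reader = imageio.get_reader("input.mp4")
frames = [frame for i, frame in enumerate(reader) if i % 10 == 0]

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Any image CNN could serve as the feature extractor; a ResNet-18 trunk is used only as an example.
cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()          # drop the classifier head, keep the 512-d feature
cnn.eval()

with torch.no_grad():
    batch = torch.stack([preprocess(np.asarray(f)) for f in frames])
    frame_features = cnn(batch)       # (num_frames, 512) video frame feature sequence
```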
  • the extracted video frame feature sequence is input to the attention mechanism layer in the preset encoder to calculate the similarity.
  • the video frame feature sequence is multiplied respectively by the three weight matrices of the preset encoder to obtain the query vector, key vector and value vector corresponding to each video frame in the video frame sequence; then, according to the query vector, key vector and value vector, the mutual influence score between any two video frames in the video frame sequence is calculated, and the correlation between any two video frames is determined according to that score.
  • the higher the mutual influence score between any two video frames in the video frame sequence, the higher the correlation between the two video frames; the lower the mutual influence score, the lower the correlation. Therefore, the correlation between any two video frames in the video frame sequence can be determined according to the query vector, key vector and value vector corresponding to each video frame in the video frame sequence.
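  • a minimal sketch of this projection step, assuming PyTorch and illustrative dimensions (the application does not fix the vector sizes or how the three weight matrices are obtained):

```python
import torch

torch.manual_seed(0)
d_model, d_k = 512, 64                      # illustrative dimensions (assumed)

# The three weight matrices of the attention layer in the preset encoder.
W_q = torch.randn(d_model, d_k) / d_model ** 0.5
W_k = torch.randn(d_model, d_k) / d_model ** 0.5
W_v = torch.randn(d_model, d_k) / d_model ** 0.5

def project_qkv(frame_features: torch.Tensor):
    """frame_features: (T, d_model) video frame feature sequence.

    Multiplies the feature sequence by the three weight matrices to obtain
    the query, key and value vectors for every frame."""
    Q = frame_features @ W_q                # (T, d_k) query vectors
    K = frame_features @ W_k                # (T, d_k) key vectors
    V = frame_features @ W_v                # (T, d_k) value vectors
    return Q, K, V
```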
  • the semantic information includes high-level semantic information, such as the correlation between any two video frames in the video frame sequence.
  • the semantic information also includes low-level semantic information.
  • specifically, a convolutional neural network performs feature extraction on the video frame sequence to obtain the video frame feature sequence corresponding to the video frame sequence.
  • the video frame feature sequence can reflect the low-level semantic information such as color and brightness of each video frame.
  • the attention mechanism in the preset encoder can calculate the correlation between any two video frames in the video frame sequence, and high-level semantic information in the video can then be obtained.
  • compared with the current way of extracting semantic information from video frames only with a convolutional neural network, the method for extracting semantic information of a video frame provided by the embodiment of the present application can obtain a video frame sequence for which semantic information is to be extracted, perform video feature extraction on the video frame sequence to obtain a video frame feature sequence, determine, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, calculate the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector, and finally determine the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • thus, by calculating the correlation between any two video frames in the video frame sequence, higher-level semantic information in a video can be obtained, which facilitates the execution of downstream tasks such as video storage and retrieval.
  • the embodiment of the present application provides another method for extracting semantic information of video frames, as shown in FIG. 2 .
  • the method includes:
  • before the preset encoder is used to calculate the similarity between any two video frames in the video frame sequence, the preset encoder needs to be constructed.
  • for the construction of the preset encoder, the method further includes: constructing a spliced sample video frame sequence, and labeling the spliced sample video frame sequence to obtain a labeled sample video frame sequence; performing feature extraction on the labeled sample video frame sequence with an initial convolutional neural network to obtain a sample video frame feature sequence corresponding to the labeled sample video frame sequence; inputting the sample video frame feature sequence into an initial encoder for correlation calculation to obtain the correlation between any two video frames in the labeled sample video frame sequence; and training the initial encoder and the initial convolutional neural network according to the correlation between any two video frames output by the initial encoder.
  • further, constructing the spliced sample video frame sequence and labeling it to obtain the labeled sample video frame sequence includes: acquiring the sample video frame sequence corresponding to each sample video, and splicing the sample video frame sequences corresponding to the respective sample videos to obtain a spliced sample video frame sequence; labeling the spliced sample video frame sequence according to whether it comes from the same sample video, to obtain the labeled sample video frame sequence.
  • a large number of sample videos of different topics are stored in the sample video library, such as sample videos related to art and criminal investigation topics.
  • the model structure mainly involved in the embodiment of the present application includes the convolutional neural network used for feature extraction and the encoder used for similarity calculation; before the correlation between any two video frames in the video frame sequence is determined, the convolutional neural network and the encoder need to be trained as a whole. Specifically, an initial convolutional neural network model pre-trained on ImageNet is obtained, and the weights of the first 3/4 of its layers are frozen so that they do not participate in learning updates; then two sample videos are randomly selected from the sample video library, trying to ensure that their content (such as categories, keywords and topics) is different, and K frames are randomly extracted from each of the two sample videos in chronological order.
  • it should be noted that the number of frames extracted from a video can be set according to business requirements, but to ensure the calculation accuracy of the correlation, the K value should not be set too small; the K video frames extracted from the two sample videos are then spliced to obtain the spliced sample video frame sequence.
  • for example, the spliced sample video frame sequence is (E_u1, E_u2, ..., E_uk, E_v1, E_v2, ..., E_vk), where E_uk represents the k-th frame extracted from the u-th sample video and E_vk represents the k-th frame extracted from the v-th sample video. If the u-th video and the v-th video come from the same video, the label corresponding to the spliced sample video frame sequence is determined to be 1; if they do not come from the same video, the label is determined to be 0. The labeled sample video frame sequence can thereby be obtained.
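  • a hedged sketch of how such spliced, labeled sample sequences might be built; the `video_library` format, the value of K and the 50/50 mix of same-video and different-video pairs are assumptions introduced only for illustration:

```python
import random

def build_spliced_sample(video_library, K=16, p_same=0.5):
    """Build one spliced training sequence, labeled 1 when both halves come from
    the same sample video and 0 otherwise.

    `video_library` is assumed to be a list of dicts like
    {"topic": "art", "frames": [...]}; each video must contain at least K frames."""
    vid_u = random.choice(video_library)
    if random.random() < p_same:
        vid_v = vid_u                                   # positive pair: same video
    else:
        # negative pair: a different video, preferably with a different topic
        candidates = [v for v in video_library if v["topic"] != vid_u["topic"]] or \
                     [v for v in video_library if v is not vid_u]
        vid_v = random.choice(candidates)

    def pick_frames(video):
        idx = sorted(random.sample(range(len(video["frames"])), K))   # chronological order
        return [video["frames"][i] for i in idx]

    spliced = pick_frames(vid_u) + pick_frames(vid_v)   # (E_u1..E_uK, E_v1..E_vK)
    label = 1 if vid_v is vid_u else 0
    return spliced, label
```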
  • further, the convolutional neural network is used to perform feature extraction on the labeled sample video frame sequence to obtain the sample video frame feature sequence corresponding to the labeled sample video frame sequence, and the sample video frame feature sequence is then input into the initial encoder for correlation calculation to obtain the correlation between any two video frames in the labeled sample video frame sequence, that is, the output of the initial encoder; the CLS feature is extracted from that output for a binary classification task.
  • the gradient computed from the binary classification loss function is used to update the entire network, including the last 1/4 of the layers of the initial convolutional neural network and the initial encoder, so that the convolutional neural network model and the preset encoder used for feature extraction and similarity calculation in the embodiment of the present application can be obtained.
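  • the following is only a rough, assumption-laden PyTorch sketch of this training setup: the ResNet-18 backbone, the use of `nn.TransformerEncoder` as the initial encoder, drawing the freezing boundary at `layer4`, the learned CLS token and the optimizer settings are all illustrative choices, not details taken from the application:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained CNN; freeze roughly the first 3/4 of its layers (the exact split is assumed).
cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
cnn.fc = nn.Identity()
for name, p in cnn.named_parameters():
    if not name.startswith("layer4"):        # keep only the last residual block trainable
        p.requires_grad = False

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)   # stands in for the "initial encoder"
cls_token = nn.Parameter(torch.zeros(1, 1, 512))               # CLS feature prepended to the sequence
classifier = nn.Linear(512, 2)                                 # binary task: same video or not

params = [p for p in cnn.parameters() if p.requires_grad]
params += list(encoder.parameters()) + [cls_token] + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(spliced_frames, label):
    """spliced_frames: (2K, 3, 224, 224) preprocessed frames of one spliced sample; label: 0 or 1."""
    feats = cnn(spliced_frames).unsqueeze(0)                   # (1, 2K, 512) sample feature sequence
    seq = torch.cat([cls_token, feats], dim=1)                 # prepend the CLS position
    logits = classifier(encoder(seq)[:, 0])                    # classify from the CLS output
    loss = loss_fn(logits, torch.tensor([label]))              # binary classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```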
  • further, the video frame sequence for which semantic information extraction is to be performed is obtained; specifically, the video frame sequence of a video can be extracted by installing the imageio library and the skimage library, and the video frame sequence of the video can also be extracted with the Adobe Premiere software; the manner of extracting the video frame sequence is not specifically limited in this embodiment of the present application. The trained convolutional neural network is then used to extract the video frame feature sequence corresponding to the video frame sequence.
  • in order to improve the accuracy of the similarity calculation between any two video frames in the video frame sequence, multiplying the video frame feature sequence by the weight matrices in the preset encoder to obtain the query vector, key vector and value vector corresponding to each video frame in the video frame sequence includes: determining the relative position information between any two video frames in the video frame sequence; introducing the relative position information into the video frame feature sequence, and multiplying the video frame feature sequence with the relative position information introduced by the weight matrices in the preset encoder to obtain the query vector, key vector and value vector corresponding to each video frame in the video frame sequence.
  • the specific formula for determining the relative position information between any two video frames is of the form index(i, j) = clip(i - j, -k, k), where index(i, j) is the index subscript jointly corresponding to video frame i and video frame j, W is the position matrix, and k is the preset truncation distance; the position matrix is queried according to the determined index subscript, and the relative position information between video frame i and video frame j is thereby determined.
  • further, the relative position information is introduced into the video frame feature sequence, and the video frame feature sequence with the relative position information introduced is multiplied by the weight matrices in the preset encoder to obtain the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, so that the correlation between any two video frames in the video frame sequence can be calculated according to the key vectors, query vectors and value vectors after the relative position information is introduced. It should be noted that, if the relative positional relationship between any two video frames in the video frame sequence is introduced, then when the overall model structure is trained, the model structure includes not only the convolutional neural network and the encoder but also the position matrix.
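  • since the exact formula is only available as an image in the published application, the following Python sketch implements the clipped-distance lookup that the surrounding text describes; the function name, the shape chosen for the position matrix W and the example values are assumptions:

```python
import torch

def relative_position_info(T: int, k: int, W: torch.Tensor) -> torch.Tensor:
    """Look up relative position information for every frame pair (i, j).

    W is the position matrix with 2k + 1 rows; the distance i - j is truncated
    (clipped) to [-k, k] and used as the row index into W."""
    idx = torch.arange(T).unsqueeze(1) - torch.arange(T).unsqueeze(0)   # i - j for every pair
    idx = idx.clamp(-k, k) + k                                          # shift into [0, 2k]
    return W[idx]                                                       # (T, T, d) relative position info

# Example: T = 5 frames, truncation distance k = 2, 8-dimensional position vectors (all assumed values).
W = torch.randn(2 * 2 + 1, 8)
rel = relative_position_info(5, 2, W)
```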
  • in order to calculate the correlation between any two video frames in the video frame sequence, calculating, according to the query vector, the key vector and the value vector, the mutual influence score of any two video frames in the video frame sequence in the encoding process of the preset encoder includes: multiplying the query vector corresponding to the first video frame by the key vector corresponding to the second video frame, and then multiplying the result by the value vector corresponding to the second video frame, to obtain the influence score of the second video frame on the first video frame in the process of the preset encoder encoding the first video frame; multiplying the query vector corresponding to the second video frame by the key vector corresponding to the first video frame, and then multiplying the result by the value vector corresponding to the first video frame, to obtain the influence score of the first video frame on the second video frame in the process of the preset encoder encoding the second video frame; and adding the influence score of the second video frame on the first video frame and the influence score of the first video frame on the second video frame to obtain the mutual influence score of the first video frame and the second video frame in the encoding process of the preset encoder.
  • determining the correlation between any two video frames in the video frame sequence according to the mutual influence score includes: calculating the average value of the mutual influence score, and determining, according to the average value, the correlation between any two video frames in the video frame sequence.
  • the first video frame and the second video frame may be any two video frames in the video frame sequence.
  • specifically, in order to determine the correlation between any two video frames in the video frame sequence, the mutual influence score of the two video frames can be calculated first, and the correlation between any two video frames in the video frame sequence can then be determined according to the mutual influence score.
  • assuming the first video frame and the second video frame are video frame i and video frame j respectively, the influence score of video frame j on video frame i when the encoder encodes video frame i and the influence score of video frame i on video frame j when the encoder encodes video frame j are calculated respectively; the specific formulas are as follows:
  • Vj*(Qi*kj)
  • Vi*(Qj*ki)
  • Vi and Vj are the value vectors corresponding to video frame i and video frame j respectively
  • ki and kj are the key vectors corresponding to video frame i and video frame j respectively
  • Qi and Qj are the query vectors corresponding to video frame i and video frame j respectively; thus, according to the above formulas, the influence score of video frame j on video frame i when the encoder encodes video frame i and the influence score of video frame i on video frame j when the encoder encodes video frame j can be calculated respectively.
  • further, the influence score of video frame j on video frame i and the influence score of video frame i on video frame j are added and averaged, and the correlation between the two video frames is determined according to the average value, so that the correlation between any two video frames in the video frame sequence can be determined in the above manner.
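  • a small sketch of the mutual influence score and its averaging as described above; reading the averaged vector as the scalar correlation is one interpretation of "calculating the average value of the mutual influence score", and the function signature is illustrative:

```python
import torch

def correlation(i: int, j: int, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> float:
    """Correlation between video frames i and j following the two formulas above.

    Influence of frame j on frame i while encoding frame i:  Vj * (Qi . Kj)
    Influence of frame i on frame j while encoding frame j:  Vi * (Qj . Ki)
    The two influence scores are added, and the average value gives the correlation."""
    infl_j_on_i = V[j] * torch.dot(Q[i], K[j])      # vector-valued influence score
    infl_i_on_j = V[i] * torch.dot(Q[j], K[i])
    mutual = infl_j_on_i + infl_i_on_j              # mutual influence score
    return mutual.mean().item()                     # average value used as the correlation
```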
  • the semantic information includes high-level semantic information, such as the correlation between any two video frames in the video frame sequence.
  • the semantic information also includes low-level semantic information.
  • specifically, a convolutional neural network performs feature extraction on the video frame sequence to obtain the video frame feature sequence corresponding to the video frame sequence.
  • the video frame feature sequence can reflect the low-level semantic information such as color and brightness of each video frame.
  • the attention mechanism in the preset encoder can calculate the correlation between any two video frames in the video frame sequence, and high-level semantic information in the video can then be obtained.
  • compared with the current way of extracting semantic information from video frames only with a convolutional neural network, another method for extracting semantic information of a video frame provided by the embodiment of the present application can obtain a video frame sequence for which semantic information extraction is to be performed, extract a video frame feature sequence from it, calculate the correlation between any two video frames according to the corresponding query, key and value vectors, and determine the semantic information corresponding to the video frame sequence according to that correlation, so that higher-level semantic information in a video can be obtained, which facilitates the execution of downstream tasks such as video storage and retrieval.
  • an embodiment of the present application provides an apparatus for extracting semantic information of video frames.
  • the apparatus includes: an acquisition unit 31 , a calculation unit 32 and a determination unit 33 .
  • the obtaining unit 31 may be configured to obtain a video frame sequence for which semantic information extraction is to be performed, and perform video feature extraction on the video frame sequence to obtain a video frame feature sequence.
  • the obtaining unit 31 is the main functional module in the apparatus for obtaining the video frame sequence for which semantic information is to be extracted and performing video feature extraction on the video frame sequence to obtain the video frame feature sequence.
  • the computing unit 32 may be configured to determine the query vector, key vector and value vector corresponding to each video frame in the video frame sequence according to the video frame feature sequence, and to calculate the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector.
  • the computing unit 32 is the main functional module, and also the core module, in the apparatus for determining the query vector, key vector and value vector corresponding to each video frame in the video frame sequence according to the video frame feature sequence and for calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector.
  • the determining unit 33 may be configured to determine the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • the determining unit 33 is the main functional module in the apparatus for determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • the calculation unit 32 includes: a determination module 321 and a multiplication module 322 .
  • the determining module 321 may be configured to determine relative position information between any two video frames in the video frame sequence.
  • the multiplication module 322 can be configured to introduce the relative position information into the video frame feature sequence, and compare the video frame feature sequence introduced with the relative position information with the weight matrix in the preset encoder. Multiply to obtain the query vector, key vector and value vector corresponding to each video frame in the video frame sequence.
  • the calculation unit 32 further includes: a calculation module 323 .
  • the calculation module 323 can be used to calculate, according to the query vector, the key vector and the value vector, the mutual influence scores of any two video frames in the video frame sequence in the encoding process of the preset encoder.
  • the determining module 321 may also be configured to determine the correlation between any two video frames in the video frame sequence according to the mutual influence score.
  • in a specific application scenario, the any two video frames are respectively the first video frame and the second video frame in the video frame sequence, and the calculation module 323 includes: a multiplication submodule and an addition submodule.
  • the multiplication sub-module can be used to multiply the query vector corresponding to the first video frame by the key vector corresponding to the second video frame, and then multiply the result by the value vector corresponding to the second video frame, to obtain the influence score of the second video frame on the first video frame in the process of the preset encoder encoding the first video frame.
  • the multiplication sub-module can also be used to multiply the query vector corresponding to the second video frame by the key vector corresponding to the first video frame, and then multiply the result by the value vector corresponding to the first video frame, to obtain the influence score of the first video frame on the second video frame in the process of the preset encoder encoding the second video frame.
  • the adding submodule may be configured to add the impact score of the second video frame on the first video frame and the impact score of the first video frame on the second video frame to obtain The mutual influence score of the first video frame and the second video frame in the encoding process of the preset encoder.
  • the determining module 321 may be specifically configured to calculate the average value of the mutual influence scores, and determine the correlation between any two video frames in the video frame sequence according to the average value.
  • the apparatus further includes: a labeling unit 34 , an extraction unit 35 and a training unit 36 .
  • the labeling unit 34 may be configured to construct a spliced sample video frame sequence, and annotate the spliced sample video frame sequence to obtain an annotated sample video frame sequence.
  • the extraction unit 35 may be configured to perform feature extraction on the labeled sample video frame sequence by using an initial convolutional neural network to obtain a sample video frame feature sequence corresponding to the labeled sample video frame sequence.
  • the computing unit 32 can also be used to input the sample video frame feature sequence into the initial encoder for correlation calculation, and obtain the correlation between any two video frames in the labeled sample video frame sequence.
  • the training unit 36 may be configured to train the initial encoder and the initial convolutional neural network according to the correlation between any two video frames output by the initial encoder.
  • the labeling unit 34 includes: a splicing module 341 and a labeling module 342 .
  • the splicing module 341 may be configured to obtain sample video frame sequences corresponding to each sample video, and splicing the sample video frame sequences corresponding to each sample video to obtain a spliced sample video frame sequence.
  • the labeling module 342 may be configured to label the spliced sample video frame sequences according to whether they come from the same sample video, and obtain the labelled sample video frame sequences.
  • an embodiment of the present application further provides a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the following steps are implemented: acquiring a video frame sequence for which semantic information is to be extracted, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence; determining, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector; and determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • an embodiment of the present application further provides a physical structure diagram of a computer device.
  • the computer device includes: a processor 41, a memory 42, and computer-readable instructions stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on a bus 43; the processor 41 implements the following steps when executing the program: acquiring a video frame sequence for which semantic information extraction is to be performed, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence; determining, according to the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector; and determining the semantic information corresponding to the video frame sequence according to the correlation between any two video frames in the video frame sequence.
  • through the technical solution of the present application, the video frame sequence for which semantic information extraction is to be performed can be obtained, video feature extraction can be performed on the video frame sequence to obtain the video frame feature sequence, the query vector, key vector and value vector corresponding to each video frame in the video frame sequence can be determined according to the video frame feature sequence, the correlation between any two video frames in the video frame sequence can be calculated according to these vectors, and the semantic information corresponding to the video frame sequence can finally be determined according to the correlation between any two video frames; thus, by calculating the correlation between any two video frames in the video frame sequence, higher-level semantic information in a video can be obtained, which facilitates the execution of downstream tasks such as video storage and retrieval.
  • the modules or steps of the present application can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network composed of multiple computing devices; optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the one here, or they may be fabricated separately into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module.
  • the present application is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus and computer device for extracting semantic information of video frames, relating to the field of artificial intelligence. The method comprises: acquiring a video frame sequence for which semantic information is to be extracted, and performing video feature extraction on the video frame sequence to obtain a video frame feature sequence (101); determining, according to the video frame feature sequence, a query vector, a key vector and a value vector corresponding to each video frame in the video frame sequence, and calculating the correlation between any two video frames in the video frame sequence according to the query vector, the key vector and the value vector (102); and determining, according to the correlation between any two video frames in the video frame sequence, the semantic information corresponding to the video frame sequence (103). The method is mainly applicable to the extraction of semantic information of video frames, can obtain higher-level semantic information, and facilitates the execution of downstream tasks such as video storage and retrieval.

Description

视频帧语义信息的提取方法、装置及计算机设备
本申请要求与2020年12月22日提交中国专利局、申请号为202011526812.1申请名称为“视频帧语义信息的提取方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在申请中。
技术领域
本申请涉及人工智能技术领域,尤其是涉及一种视频帧语义信息的提取方法、装置及计算机设备。
背景技术
随着信息技术的不断发展,每天都会产生大量各种不同题材的视频,为了方便不同题材视频的存储和检索,通常需要从视频中提取相应的视频信息,从而方便存储、检索和相似度计算等下游任务。
目前,在提取视频信息的过程中,通常可以利用卷积神经网络提取视频帧中的语义信息。然而,这种利用卷积神经网络提取视频信息的方式,能够获取的语音信息有限,针对较高层次的语音信息,如一段视频中任意两视频帧之间的相关度信息无法提取,从而不利于视频存储、检索等下游任务的执行。
发明内容
本申请提供了一种视频帧语义信息的提取方法、装置及计算机设备,主要在于能够提取一段视频中任意两视频帧之间的相关度信息,从而能够获得更高层次的语音信息,便于视频存储、检索等下游任务的执行。
根据本申请的第一个方面,提供一种视频帧语义信息的提取方法,包括:
获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。
根据本申请的第二个方面,提供一种视频帧语义信息的提取装置,包括:
获取单元,用于获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
计算单元,用于根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视 频帧序列中任意两视频帧之间的相关度;
确定单元,用于根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。
根据本申请的第三个方面,提供一种计算机可读存储介质,其上存储有计算机可读指令,该程序被处理器执行时实现以下步骤:
获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。
根据本申请的第四个方面,提供一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,所述处理器执行所述程序时实现以下步骤:
获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。
本申请提供的一种视频帧语义信息的提取方法、装置及计算机设备,与目前仅利用卷积神经网络提取视频帧中语义信息的方式相比,本申请能够获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列,同时根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,最终根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息,由此通过计算视频帧序列中任意两视频帧之间的相关度,能够获取一段视频中更高层次的语义信息,便于视频存储、检索等下游任务的执行。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1示出了本申请实施例提供的一种视频帧语义信息的提取方法流程图;
图2示出了本申请实施例提供的另一种视频帧语义信息的提取方法流程图;
图3示出了本申请实施例提供的一种视频帧语义信息的提取装置的结构示意图;
图4示出了本申请实施例提供的另一种视频帧语义信息的提取装置的结构示意图;
图5示出了本申请实施例提供的一种计算机设备的实体结构示意图。
具体实施方式
下文中将参考附图并结合实施例来详细说明本申请。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。
目前,在提取视频信息的过程中,通常可以利用卷积神经网络提取视频帧中的语义信息。然而,这种利用卷积神经网络提取视频信息的方式,能够获取的语音信息有限,针对较高层次的语音信息,如一段视频中任意两视频帧之间的相关度信息无法提取,从而不利于视频存储、检索等下游任务的执行。
为了解决上述问题,本申请实施例提供了一种视频帧语义信息的提取方法,如图1所示,所述方法包括:
101、获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列。
其中,视频帧序列为从待进行语义信息提取的视频中解析得到的。为了克服现有技术中利用卷积神经网络提取的语义信息有限,无法获取更高层次语义信息的缺陷,本申请实施例,利用预设编码器中的注意力层机制计算视频帧序列中任意两视频帧之间的相关度,进而根据计算的相关度能够获取视频中更高层次的语义信息,本申请实施例主要适用于视频中语义信息的提取,本申请实施例的执行主体为能够提取视频中语义信息的装置或设备,具体可以设置在服务器或者客户端一侧。
对于本申请实施例,可以通过安装imageio库和skimage库提取一段视频的视频帧序列,此外,还可以通过Adobe Premiere软件提取该段视频的视频帧序列,提取视频帧序列的方式,本申请实施例不做具体限定。进一步地,为了计算视频帧序列中任意两视频帧之间的相关度,需要对视频帧序列进行特征提取,得到该视频帧序列对应的视频帧特征序列,具体地,可以利用卷积神经网络提取该视频帧序列对应的视频帧特征序列,需要说明的是,提取视频帧特征序列的模型可以为但不局限于卷积神经网络。
102、根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度。
对于本申请实施例,将提取的视频帧特征序列输入至预设编码器中的注意力机制层进行相似度的计算,具体地,将视频帧特征序列分别与预设编码器的三个权重矩阵相乘,得到视频帧序列中各帧视频对应的查询向量、键向量和值向量,接着根据该查询向量、键向量和值向量,分别计算视频帧序列中任意两视频帧之间的相互影响分值,进而根据该相互影响分值,确定视频帧序列中任意两视频帧之间的相关度,视频帧序列中任意两视频帧之间的相互影响分值越高,任意两视频帧之间的相关度越高,视频帧序列中任意两视频帧之 间的相互影响分值越低,任意两视频帧之间的相关度越低。由此根据视频帧序列中各帧视频对应的查询向量、键向量和值向量,能够确定视频帧序列中任意两视频帧之间的相关度。
103、根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
其中,该语义信息包括高层的语义信息,如视频帧序列中任意两视频帧之间的相关度,对于本申请实施例,该语义信息还包括低层次的语义信息,具体地,利用卷积神经网络对视频帧序列进行特征提取,得到视频帧序列对应的视频帧特征序列,该视频帧特征序列能够反映各视频帧的颜色、亮度等低层次的语义信息,通过利用预设编码器中的注意力机制能够计算出视频帧序列中任意两视频帧之间的相关度,进而获取视频中的高层次语义信息。
本申请实施例提供的一种视频帧语义信息的提取方法,与目前仅利用卷积神经网络提取视频帧中语义信息的方式相比,本申请能够获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列,同时根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,最终根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息,由此通过计算视频帧序列中任意两视频帧之间的相关度,能够获取一段视频中更高层次的语义信息,便于视频存储、检索等下游任务的执行。
进一步的,为了更好的说明上述提取视频帧语义信息的过程,作为对上述实施例的细化和扩展,本申请实施例提供了另一种视频帧语义信息的提取方法,如图2所示,所述方法包括:
201、获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列。
对于本申请实施例,在利用预设编码器计算视频帧序列中任意两视频帧之间的相似度之前,需要构建预设编码器,针对预设编码器的构建过程,所述方法还包括:构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列;利用初始卷积神经网络对所述标注后的样本视频帧序列进行特征提取,得到所述标注后的样本视频帧序列对应的样本视频帧特征序列;将所述样本视频帧特征序列输入至初始编码器进行相关度计算,得到所述所述标注后的样本视频帧序列中任意两视频帧之间的相关度;根据所述初始编码器输出的任意两视频帧之间的相关度,对所述初始编码器和所述初始卷积神经网络进行训练。进一步地,所述构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列,包括:获取各个样本视频对应的样本视频帧序列,并将所述各个样本视频对应的样本视频帧序列进行拼接,得到拼接后的样本视频帧序列;根据所述拼接后的样本视频帧序列是否来自于同一段样本视频,对其进行标注,得到标注后的样本视频帧序列。其中,样本视频库中存储有大量不同主题的样本视频,如与艺术、刑侦主题相关的样本视频。
对于本申请实施例主要涉及的模型结构包括用于特征提取的卷积神经网络和相似度计算的编码器,在确定视频帧序列中任意两视频帧之间的相关度之前,需要将卷积神经网络和编码器作为一个整体进行训练,具体地,获取ImageNet预训练的初始卷积神经网络模型,并冻结前3/4层的权重不参与学习更新,之后从样本视频库随机抽取两段样本视频,并尽量保证两者的内容(如类目、关键词、主题)有差异,从两段样本视频中按照时间顺序随机抽取K帧视频,需要说明的是,从视频中抽取的帧数可以根据业务需求进行设定,但为了确保相关度的计算精度,K值不宜设定过小,之后将从两段样本视频中抽取的K视频帧进行拼接,得到拼接后的样本视频帧序列,进一步地,根据拼接后的样本视频帧序列是否来自于同一段样本视频,对其进行标注,例如,拼接后的样本视频帧序列为(E u1,E u2….E uk,E v1,E v2..E vk),其中,E uk代表从第u段样本视频中抽取的第k帧视频,Evk代表从第v段样本视频中抽取的第k帧视频,如果第u段视频和第v段视频来自同一段视频,则确定拼接后的样本视频帧序列对应的标签为1,如果第u段视频和第v段视频不是来自同一段视频,则确定拼接后的样本视频帧序列对应的标签为0,由此能够得到标注后的样本视频帧序列。
进一步地,利用卷积神经网络对标注后的样本视频帧序列进行特征提取,得到标注后的样本视频帧序列对应的样本视频帧特征序列,之后将样本视频帧序列输入至初始编码器进行相关度计算,得到标注后的样本视频帧序列中任意两视频帧之间的相关度,即得到初始编码器输出的结果,并从该输出结果中提取CLS特征做二分类任务,通过二分类损失函数计算得到的梯度更新于整个网络,包括初始卷积神经网络的1/4层和初始编码,从而能够得到本申请实施例中用于特征提取和相似度计算的卷积神经网络模型和预设编码器。
进一步地,获取待进行语义信息提取的视频帧序列,具体地,可以通过安装imageio库和skimage库提取一段视频的视频帧序列,此外,还可以通过Adobe Premiere软件提取该段视频的视频帧序列,提取视频帧序列的方式,本申请实施例不做具体限定。之后利用训练好的卷积神经网络提取视频帧序列对应的视频帧特征序列。
202、将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
对于本申请实施例,为了提高视频帧序列中任意两视频帧之间的相似度计算精度,所述将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:确定所述视频帧序列中任意两视频帧之间的相对位置信息;在所述视频帧特征序列中引入所述相对位置信息,并将引入所述相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。确定任意两视频帧之间的相对位置信息的具体公式如下:
Figure PCTCN2021124889-appb-000001
频帧j共同对应的索引下标,W为位置矩阵,k为预设截断距离,根据确定的索引下标查询位置矩阵,确定视频帧i和视频帧j之间的相对位置信息,进一步地,将该相对位置信息引入视频帧特征序列,并将引入相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到视频帧序列中各视频帧对应的查询向量、键向量和值向量,以便根据引入相对位置信息后的键向量、查询向量和值向量,计算视频帧序列中任意两视频帧之间的相关度。需要说明的是,如果引入视频帧序列中任意两视频帧之间的相对位置关系,那么对整体模型结构进行训练时,该模型结构不仅包括卷积神经网络和编码器,还包括位置矩阵。
203、根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值,并根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度。
对于本申请实施例,为了计算视频帧序列中任意两视频帧之间的相关度,所述根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值,包括:将所述第一视频帧对应的查询向量和所述第二视频帧对应的键向量相乘,再将相乘结果与所述第二视频帧对应的值向量相乘,得到所述预设编码器在对第一视频帧编码的过程中所述第二视频帧对所述第一视频帧的影响分值;将所述第二视频帧对应的查询向量和所述第一视频帧对应的键向量相乘,再将相乘结果与所述第一视频帧对应的值向量相乘,得到所述预设编码器在对第二视频帧编码的过程中所述第一视频帧对所述第二视频帧的影响分值,将所述第二视频帧对所述第一视频帧的影响分值和所述第一视频帧对所述第二视频帧的影响分值相加,得到所述第一视频帧和所述第二视频帧在预设编码器编码过程中的相互影响分值,基于此,所述根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度,包括:计算所述相互影响分值的平均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度。其中,第一视频帧和第二视频帧可以为视频帧序列中的任意两视频帧。
具体地,为了确定视频帧序列中任意两视频帧之间的相关度,可以先计算两视频帧的相互影响分值,并根据该相互影响分值,确定视频帧序列中任意两视频帧之间的相关度,假设第一视频帧和第二视频帧分别为视频帧i和视频帧j,分别计算编码器在对视频帧i编码时,视频帧j对视频帧i的影响分值,以及编码器在对视频帧j编码时,视频帧i对视频帧j的影响分值,具体公式如下:
Vj*(Qi*kj)
Vi*(Qj*ki)
其中,Vi和Vj分别为视频帧i和视频帧j对应的值向量,ki和kj分别为视频帧i和视频帧j对应的键向量,Qi和Qj分别为视频帧i和视频帧j对应的查询向量,由此按照上述公式能够分别计算编码器在对视频帧i编码时,视频帧j对视频帧i的影响分值,以及编 码器在对视频帧j编码时,视频帧i对视频帧j的影响力分值,进一步地,将视频帧j对视频帧i的影响力分值和视频帧i对视频帧j的影响力分值相加取均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度,由此按照上述方式能够确定视频帧序列中任意两视频帧之间的相关度。
204、根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
其中,该语义信息包括高层的语义信息,如视频帧序列中任意两视频帧之间的相关度,对于本申请实施例,该语义信息还包括低层次的语义信息,具体地,利用卷积神经网络对视频帧序列进行特征提取,得到视频帧序列对应的视频帧特征序列,该视频帧特征序列能够反映各视频帧的颜色、亮度等低层次的语义信息,通过利用预设编码器中的注意力机制能够计算出视频帧序列中任意两视频帧之间的相关度,进而获取视频中的高层次语义信息。
本申请实施例提供的另一种视频帧语义信息的提取方法,与目前仅利用卷积神经网络提取视频帧中语义信息的方式相比,本申请能够获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列,同时根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,最终根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息,由此通过计算视频帧序列中任意两视频帧之间的相关度,能够获取一段视频中更高层次的语义信息,便于视频存储、检索等下游任务的执行。
进一步地,作为图1的具体实现,本申请实施例提供了一种视频帧语义信息的提取装置,如图3所示,所述装置包括:获取单元31、计算单元32和确定单元33。
所述获取单元31,可以用于获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列。所述获取单元31是本装置中获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列的主要功能模块。
所述计算单元32,可以用于根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度。所述计算单元32是本装置中根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度的主要功能模块,也是核心模块。
所述确定单元33,可以用于根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。所述确定单元33是本装置中根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息的主要功能模块。
在具体应用场景中,为了确定所述视频帧序列中各视频帧对应的查询向量、键向量和 值向量,如图4所示,所述计算单元32,可以具体用于将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
在具体应用场景中,提高视频帧序列中任意两视频帧之间的相似度计算精度,所述计算单元32,包括:确定模块321和相乘模块322。
所述确定模块321,可以用于确定所述视频帧序列中任意两视频帧之间的相对位置信息。
所述相乘模块322,可以用于在所述视频帧特征序列中引入所述相对位置信息,并将引入所述相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
进一步地,为了计算所述视频帧序列中任意两视频帧之间的相关度,如图4所示,所述计算单元32,还包括:计算模块323。
所述计算模块323,可以用于根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值。
所述确定模块321,还可以用于根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度。
在具体应用场景中,所述任意两视频帧分别为所述视频帧序列中的第一视频帧和第二视频帧,所述计算模块323,包括:相乘子模块和相加子模块。
所述相乘子模块,可以用于将所述第一视频帧对应的查询向量和所述第二视频帧对应的键向量相乘,再将相乘结果与所述第二视频帧对应的值向量相乘,得到所述预设编码器在对第一视频帧编码的过程中所述第二视频帧对所述第一视频帧的影响分值。
所述相乘子模块,还可以用于将所述第二视频帧对应的查询向量和所述第一视频帧对应的键向量相乘,再将相乘结果与所述第一视频帧对应的值向量相乘,得到所述预设编码器在对第二视频帧编码的过程中所述第一视频帧对所述第二视频帧的影响分值。
所述相加子模块,可以用于将所述第二视频帧对所述第一视频帧的影响分值和所述第一视频帧对所述第二视频帧的影响分值相加,得到所述第一视频帧和所述第二视频帧在预设编码器编码过程中的相互影响分值。
所述确定模块321,具体可以用于计算所述相互影响分值的平均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度。
进一步地,为了训练编码器和卷积神经网络,所述装置还包括:标注单元34、提取单元35和训练单元36。
所述标注单元34,可以用于构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列。
所述提取单元35,可以用于利用初始卷积神经网络对所述标注后的样本视频帧序列进行特征提取,得到所述标注后的样本视频帧序列对应的样本视频帧特征序列。
所述计算单元32,还可以用于将所述样本视频帧特征序列输入至初始编码器进行相 关度计算,得到所述所述标注后的样本视频帧序列中任意两视频帧之间的相关度。
所述训练单元36,可以用于根据所述初始编码器输出的任意两视频帧之间的相关度,对所述初始编码器和所述初始卷积神经网络进行训练。
在具体应用场景中,所述标注单元34,包括:拼接模块341和标注模块342。
所述拼接模块341,可以用于获取各个样本视频对应的样本视频帧序列,并将所述各个样本视频对应的样本视频帧序列进行拼接,得到拼接后的样本视频帧序列。
所述标注模块342,可以用于根据所述拼接后的样本视频帧序列是否来自于同一段样本视频,对其进行标注,得到标注后的样本视频帧序列。
需要说明的是,本申请实施例提供的一种视频帧语义信息的提取装置所涉及各功能模块的其他相应描述,可以参考图1所示方法的对应描述,在此不再赘述。
基于上述如图1所示方法,相应的,本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机可读指令,该程序被处理器执行时实现以下步骤:获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
基于上述如图1所示方法和如图3所示装置的实施例,本申请实施例还提供了一种计算机设备的实体结构图,如图5所示,该计算机设备包括:处理器41、存储器42、及存储在存储器42上并可在处理器上运行的计算机可读指令,其中存储器42和处理器41均设置在总线43上所述处理器41执行所述程序时实现以下步骤:获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
通过本申请的技术方案,本申请能够获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列,同时根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,最终根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息,由此通过计算视频帧序列中任意两视频帧之间的相关度,能够获取一段视频中更高层次的语义信息,便于视频存储、检索等下游任务的执行。
显然,本领域的技术人员应该明白,上述的本申请的各模块或各步骤可以用通用的计算装置来实现,它们可以集中在单个的计算装置上,或者分布在多个计算装置所组成的网 络上,可选地,它们可以用计算装置可执行的程序代码来实现,从而,可以将它们存储在存储装置中由计算装置来执行,并且在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤,或者将它们分别制作成各个集成电路模块,或者将它们中的多个模块或步骤制作成单个集成电路模块来实现。这样,本申请不限制于任何特定的硬件和软件结合。
以上所述仅为本申请的优选实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包括在本申请的保护范围之内。

Claims (20)

  1. 一种视频帧语义信息的提取方法,其中,包括:
    获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
    根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
    根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
  2. 根据权利要求1所述的方法,其中,所述根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
  3. 根据权利要求2所述的方法,其中,所述将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    确定所述视频帧序列中任意两视频帧之间的相对位置信息;
    在所述视频帧特征序列中引入所述相对位置信息,并将引入所述相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
  4. 根据权利要求1所述的方法,其中,所述根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,包括:
    根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在预设编码器编码过程中的相互影响分值;
    根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度。
  5. 根据权利要求4所述的方法,其中,所述任意两视频帧分别为所述视频帧序列中的第一视频帧和第二视频帧,所述根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值,包括:
    将所述第一视频帧对应的查询向量和所述第二视频帧对应的键向量相乘,再将相乘结果与所述第二视频帧对应的值向量相乘,得到所述预设编码器在对第一视频帧编码的过程中所述第二视频帧对所述第一视频帧的影响分值;
    将所述第二视频帧对应的查询向量和所述第一视频帧对应的键向量相乘,再将相乘结果与所述第一视频帧对应的值向量相乘,得到所述预设编码器在对第二视频帧编码的过程中所述第一视频帧对所述第二视频帧的影响分值;
    将所述第二视频帧对所述第一视频帧的影响分值和所述第一视频帧对所述第二视频 帧的影响分值相加,得到所述第一视频帧和所述第二视频帧在预设编码器编码过程中的相互影响分值;
    所述根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度,包括:
    计算所述相互影响分值的平均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度。
  6. 根据权利要求1所述的方法,其中,在所述获取待进行语义信息提取的视频帧序列之前,所述方法还包括:
    构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列;
    利用初始卷积神经网络对所述标注后的样本视频帧序列进行特征提取,得到所述标注后的样本视频帧序列对应的样本视频帧特征序列;
    将所述样本视频帧特征序列输入至初始编码器进行相关度计算,得到所述所述标注后的样本视频帧序列中任意两视频帧之间的相关度;
    根据所述初始编码器输出的任意两视频帧之间的相关度,对所述初始编码器和所述初始卷积神经网络进行训练。
  7. 根据权利要求6所述的方法,其中,所述构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列,包括:
    获取各个样本视频对应的样本视频帧序列,并将所述各个样本视频对应的样本视频帧序列进行拼接,得到拼接后的样本视频帧序列;
    根据所述拼接后的样本视频帧序列是否来自于同一段样本视频,对其进行标注,得到标注后的样本视频帧序列。
  8. 一种视频帧语义信息的提取装置,其中,包括:
    获取单元,用于获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
    计算单元,用于根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
    确定单元,用于根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语音信息。
  9. 一种计算机可读存储介质,其上存储有计算机可读指令,其中,所述计算机可读指令被处理器执行时实现视频帧语义信息的提取方法,包括:
    获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
    根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量 和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
    根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
  10. 根据权利要求9所述的计算机可读存储介质,其中,所述计算机可读指令被处理器执行时实现所述根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
  11. 根据权利要求10所述的计算机可读存储介质,其中,所述计算机可读指令被处理器执行时实现所述将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    确定所述视频帧序列中任意两视频帧之间的相对位置信息;
    在所述视频帧特征序列中引入所述相对位置信息,并将引入所述相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
  12. 根据权利要求9所述的计算机可读存储介质,其中,所述计算机可读指令被处理器执行时实现所述根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,包括:
    根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在预设编码器编码过程中的相互影响分值;
    根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度。
  13. 根据权利要求12所述的计算机可读存储介质,其中,所述任意两视频帧分别为所述视频帧序列中的第一视频帧和第二视频帧,所述计算机可读指令被处理器执行时实现所述根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值,包括:
    将所述第一视频帧对应的查询向量和所述第二视频帧对应的键向量相乘,再将相乘结果与所述第二视频帧对应的值向量相乘,得到所述预设编码器在对第一视频帧编码的过程中所述第二视频帧对所述第一视频帧的影响分值;
    将所述第二视频帧对应的查询向量和所述第一视频帧对应的键向量相乘,再将相乘结果与所述第一视频帧对应的值向量相乘,得到所述预设编码器在对第二视频帧编码的过程中所述第一视频帧对所述第二视频帧的影响分值;
    将所述第二视频帧对所述第一视频帧的影响分值和所述第一视频帧对所述第二视频帧的影响分值相加,得到所述第一视频帧和所述第二视频帧在预设编码器编码过程中的相互影响分值;
    所述根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度,包括:
    计算所述相互影响分值的平均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度。
  14. 根据权利要求9所述的计算机可读存储介质,其中,所述计算机可读指令被处理器执行时实现在所述获取待进行语义信息提取的视频帧序列之前,所述方法还包括:
    构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列;
    利用初始卷积神经网络对所述标注后的样本视频帧序列进行特征提取,得到所述标注后的样本视频帧序列对应的样本视频帧特征序列;
    将所述样本视频帧特征序列输入至初始编码器进行相关度计算,得到所述所述标注后的样本视频帧序列中任意两视频帧之间的相关度;
    根据所述初始编码器输出的任意两视频帧之间的相关度,对所述初始编码器和所述初始卷积神经网络进行训练。
  15. 一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,其中,所述计算机可读指令被处理器执行时实现视频帧语义信息的提取方法,包括:
    获取待进行语义信息提取的视频帧序列,并对所述视频帧序列进行视频特征提取,得到视频帧特征序列;
    根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,并根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度;
    根据所述视频帧序列中任意两视频帧之间的相关度,确定所述视频帧序列对应的语义信息。
  16. 根据权利要求15所述的计算机设备,其中,所述计算机可读指令被处理器执行时实现所述根据所述视频帧特征序列,确定所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量。
  17. 根据权利要求16所述的计算机设备,其中,所述计算机可读指令被处理器执行时实现所述将所述视频帧特征序列与预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的查询向量、键向量和值向量,包括:
    确定所述视频帧序列中任意两视频帧之间的相对位置信息;
    在所述视频帧特征序列中引入所述相对位置信息,并将引入所述相对位置信息的视频帧特征序列与所述预设编码器中的权重矩阵相乘,得到所述视频帧序列中各视频帧对应的 查询向量、键向量和值向量。
  18. 根据权利要求15所述的计算机设备,其中,所述计算机可读指令被处理器执行时实现所述根据所述查询向量、所述键向量和所述值向量,计算所述视频帧序列中任意两视频帧之间的相关度,包括:
    根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在预设编码器编码过程中的相互影响分值;
    根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度。
  19. 根据权利要求18所述的计算机设备,其中,所述任意两视频帧分别为所述视频帧序列中的第一视频帧和第二视频帧,所述计算机可读指令被处理器执行时实现所述根据所述查询向量、所述键向量和所述值向量,分别计算所述视频帧序列中任意两视频帧在所述预设编码器编码过程中的相互影响分值,包括:
    将所述第一视频帧对应的查询向量和所述第二视频帧对应的键向量相乘,再将相乘结果与所述第二视频帧对应的值向量相乘,得到所述预设编码器在对第一视频帧编码的过程中所述第二视频帧对所述第一视频帧的影响分值;
    将所述第二视频帧对应的查询向量和所述第一视频帧对应的键向量相乘,再将相乘结果与所述第一视频帧对应的值向量相乘,得到所述预设编码器在对第二视频帧编码的过程中所述第一视频帧对所述第二视频帧的影响分值;
    将所述第二视频帧对所述第一视频帧的影响分值和所述第一视频帧对所述第二视频帧的影响分值相加,得到所述第一视频帧和所述第二视频帧在预设编码器编码过程中的相互影响分值;
    所述根据所述相互影响分值,确定所述视频帧序列中任意两视频帧之间的相关度,包括:
    计算所述相互影响分值的平均值,根据所述平均值确定所述视频帧序列中任意两视频帧之间的相关度。
  20. 根据权利要求15所述的计算机设备,其中,所述计算机可读指令被处理器执行时实现在所述获取待进行语义信息提取的视频帧序列之前,所述方法还包括:
    构建拼接后的样本视频帧序列,并对所述拼接后的样本视频帧序列进行标注,得到标注后的样本视频帧序列;
    利用初始卷积神经网络对所述标注后的样本视频帧序列进行特征提取,得到所述标注后的样本视频帧序列对应的样本视频帧特征序列;
    将所述样本视频帧特征序列输入至初始编码器进行相关度计算,得到所述所述标注后的样本视频帧序列中任意两视频帧之间的相关度;
    根据所述初始编码器输出的任意两视频帧之间的相关度,对所述初始编码器和所述初始卷积神经网络进行训练。
PCT/CN2021/124889 2020-12-22 2021-10-20 视频帧语义信息的提取方法、装置及计算机设备 WO2022134793A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011526812.1A CN112651324A (zh) 2020-12-22 2020-12-22 视频帧语义信息的提取方法、装置及计算机设备
CN202011526812.1 2020-12-22

Publications (1)

Publication Number Publication Date
WO2022134793A1 true WO2022134793A1 (zh) 2022-06-30

Family

ID=75358948

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124889 WO2022134793A1 (zh) 2020-12-22 2021-10-20 视频帧语义信息的提取方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN112651324A (zh)
WO (1) WO2022134793A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116437057A (zh) * 2023-06-13 2023-07-14 博纯材料股份有限公司 乙硼烷生产监控系统的系统优化方法及系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651324A (zh) * 2020-12-22 2021-04-13 深圳壹账通智能科技有限公司 视频帧语义信息的提取方法、装置及计算机设备
CN113435594B (zh) * 2021-06-30 2022-08-02 平安科技(深圳)有限公司 安防检测模型训练方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293246A1 (en) * 2015-05-13 2018-10-11 Beijing Zhigu Rui Tuo Tech Co., Ltd. Video retrieval methods and apparatuses
CN110175580A (zh) * 2019-05-29 2019-08-27 复旦大学 一种基于时序因果卷积网络的视频行为识别方法
CN110378269A (zh) * 2019-07-10 2019-10-25 浙江大学 通过影像查询定位视频中未预习的活动的方法
CN111523462A (zh) * 2020-04-22 2020-08-11 南京工程学院 基于自注意增强cnn的视频序列表情识别系统及方法
CN112651324A (zh) * 2020-12-22 2021-04-13 深圳壹账通智能科技有限公司 视频帧语义信息的提取方法、装置及计算机设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293246A1 (en) * 2015-05-13 2018-10-11 Beijing Zhigu Rui Tuo Tech Co., Ltd. Video retrieval methods and apparatuses
CN110175580A (zh) * 2019-05-29 2019-08-27 复旦大学 一种基于时序因果卷积网络的视频行为识别方法
CN110378269A (zh) * 2019-07-10 2019-10-25 浙江大学 通过影像查询定位视频中未预习的活动的方法
CN111523462A (zh) * 2020-04-22 2020-08-11 南京工程学院 基于自注意增强cnn的视频序列表情识别系统及方法
CN112651324A (zh) * 2020-12-22 2021-04-13 深圳壹账通智能科技有限公司 视频帧语义信息的提取方法、装置及计算机设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116437057A (zh) * 2023-06-13 2023-07-14 博纯材料股份有限公司 乙硼烷生产监控系统的系统优化方法及系统
CN116437057B (zh) * 2023-06-13 2023-09-19 博纯材料股份有限公司 乙硼烷生产监控系统的系统优化方法及系统

Also Published As

Publication number Publication date
CN112651324A (zh) 2021-04-13

Similar Documents

Publication Publication Date Title
CN111737476B (zh) 文本处理方法、装置、计算机可读存储介质及电子设备
WO2022134793A1 (zh) 视频帧语义信息的提取方法、装置及计算机设备
Qiao et al. Deep co-training for semi-supervised image recognition
CN112131366B (zh) 训练文本分类模型及文本分类的方法、装置及存储介质
CN108334487B (zh) 缺失语意信息补全方法、装置、计算机设备和存储介质
CN110705301B (zh) 实体关系抽取方法及装置、存储介质、电子设备
US11768869B2 (en) Knowledge-derived search suggestion
CN112100332A (zh) 词嵌入表示学习方法及装置、文本召回方法及装置
CN112131881B (zh) 信息抽取方法及装置、电子设备、存储介质
CN112560502B (zh) 一种语义相似度匹配方法、装置及存储介质
US20210326383A1 (en) Search method and device, and storage medium
US20230008897A1 (en) Information search method and device, electronic device, and storage medium
CN113704460A (zh) 一种文本分类方法、装置、电子设备和存储介质
CN110597956A (zh) 一种搜索方法、装置及存储介质
CN116029305A (zh) 一种基于多任务学习的中文属性级情感分析方法、系统、设备及介质
Nedelchev et al. End-to-end entity linking and disambiguation leveraging word and knowledge graph embeddings
Aina et al. What do entity-centric models learn? insights from entity linking in multi-party dialogue
Ishmam et al. From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
CN114282001A (zh) 基于文本的任务处理方法、装置、计算机设备及存储介质
CN116992886A (zh) 一种基于bert的热点新闻事件脉络生成方法及装置
WO2023040545A1 (zh) 一种数据处理方法、装置、设备、存储介质和程序产品
WO2023168818A1 (zh) 视频和文本相似度确定方法、装置、电子设备、存储介质
CN112749554B (zh) 确定文本匹配度的方法、装置、设备及存储介质
CN114065769A (zh) 情感原因对抽取模型的训练方法、装置、设备及介质
CN114492450A (zh) 文本匹配方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908794

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21908794

Country of ref document: EP

Kind code of ref document: A1