CN118692014A - Video tag identification method, device, equipment, medium and product

Info

Publication number
CN118692014A
Authority
CN
China
Prior art keywords
video
feature
text
sample
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411181885.XA
Other languages
Chinese (zh)
Other versions
CN118692014B (en)
Inventor
陈世哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202411181885.XA priority Critical patent/CN118692014B/en
Publication of CN118692014A publication Critical patent/CN118692014A/en
Application granted granted Critical
Publication of CN118692014B publication Critical patent/CN118692014B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a video tag identification method, device, equipment, medium, and product. The method includes: obtaining a video to be identified, and performing multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features respectively corresponding to multiple modalities; retrieving, according to the first multimodal feature, multiple reference videos similar to the video to be identified, each reference video carrying a video tag; constructing in-context learning information according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified; and identifying the video tag of the video to be identified according to the in-context learning information. The technical solution of the embodiments of the present application can quickly and accurately identify the video tag of a video.

Description

Video tag identification method, device, equipment, medium and product

Technical Field

The present application relates to the technical field of video tags, and in particular to a video tag identification method, a video tag identification device, an electronic device, a computer-readable storage medium, and a computer program product.

Background Art

Video tag identification is an important part of video content characterization. Automatically generating tags for massive amounts of user-generated content (UGC) video provides downstream content distribution links (such as recommendation systems and content operations) with video content features of different granularities. However, owing to the diversity of UGC video content, the tag libraries used in real business scenarios often contain hundreds of thousands or even millions of tags, and accurately matching each video with its corresponding video tags poses a huge challenge.

Summary of the Invention

The embodiments of the present application provide a video tag identification method, a video tag identification device, an electronic device, a computer-readable storage medium, and a computer program product, which can quickly and accurately identify the video tag of a video.

Other features and advantages of the present application will become apparent from the following detailed description, or may be learned in part through practice of the present application.

According to one aspect of the embodiments of the present application, a video tag identification method is provided, including: obtaining a video to be identified, and performing multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features respectively corresponding to multiple modalities; retrieving, according to the first multimodal feature, multiple reference videos similar to the video to be identified, each reference video carrying a video tag; constructing in-context learning information according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified; and identifying the video tag of the video to be identified according to the in-context learning information.

According to one aspect of the embodiments of the present application, a video tag identification device is provided, including: an acquisition module, configured to obtain a video to be identified and perform multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features respectively corresponding to multiple modalities; a retrieval module, configured to retrieve, according to the first multimodal feature, multiple reference videos similar to the video to be identified, each reference video carrying a video tag; a construction module, configured to construct in-context learning information according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified; and an identification module, configured to identify the video tag of the video to be identified according to the in-context learning information.

In one embodiment of the present application, the first multimodal feature includes first video frame features, and the second multimodal feature includes second video frame features. The construction module is further configured to: perform feature conversion processing on the first video frame features to obtain a first visual text token sequence, and perform feature conversion processing on the second video frame features corresponding to each reference video to obtain a second visual text token sequence corresponding to each reference video; obtain first text information corresponding to the video to be identified and second text information corresponding to each reference video; and construct the in-context learning information according to the second visual text token sequence, second text information, and video tag corresponding to each reference video, as well as the first visual text token sequence and the first text information.

In one embodiment of the present application, there are multiple first video frame features. The construction module is further configured to: perform feature fusion processing on the multiple first video frame features to obtain a target video feature; and align the target video feature to a preset text feature space through a pre-trained feature alignment module to obtain the first visual text token sequence.

In one embodiment of the present application, the construction module is further configured to: construct a learning example according to a first text token used to separate different pieces of information of the same video, together with the second visual text token sequence, second text information, and video tag corresponding to the same reference video, so as to obtain multiple learning examples corresponding to the multiple reference videos; concatenate the multiple learning examples to obtain learning information, where the multiple learning examples in the learning information are separated by a second text token used to separate the information of different videos; construct a recognition example according to the first text token, the first visual text token sequence, and the first text information; and generate the in-context learning information according to the learning information and the recognition example.

In one embodiment of the present application, the acquisition module is further configured to: obtain the video title of the video to be identified and the recognized text converted from the audio information of the video to be identified; generate a contextual description of the video to be identified according to the video scene of the video to be identified; and generate the first text information according to the video title, the recognized text, and the contextual description.

In one embodiment of the present application, the first multimodal feature includes a target video feature and a first text feature. The retrieval module is configured to: obtain a pre-established video feature retrieval library and a pre-established text feature retrieval library, where the video feature retrieval library includes the mapping relationship between each candidate video and its video feature, and the text feature retrieval library includes the mapping relationship between each candidate video and its text feature; search both the video feature retrieval library and the text feature retrieval library according to the target video feature, and search both the video feature retrieval library and the text feature retrieval library according to the first text feature, to obtain multiple target candidate videos similar to the video to be identified; and select the reference videos from the multiple target candidate videos according to the video similarity between the multiple target candidate videos and the video to be identified.

In one embodiment of the present application, the multiple target candidate videos include the multiple target candidate videos retrieved by each retrieval route. The retrieval module is further configured to: for the multiple target candidate videos retrieved by each retrieval route, calculate the similarity between the video to be identified and each target candidate video; for each target candidate video, calculate its average similarity to the video to be identified across the retrieval routes; and select the reference videos from the multiple target candidate videos according to the average similarity.
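
A minimal sketch of this reranking step, assuming each retrieval route has already produced similarity scores for its candidates (the function and variable names are illustrative, not taken from the patent):

    def rerank_by_mean_similarity(route_scores, top_k=4):
        """route_scores: dict route_name -> dict video_id -> similarity to the query.
        Average each candidate's similarity over the routes that retrieved it,
        then keep the top_k candidates as reference videos."""
        totals, counts = {}, {}
        for scores in route_scores.values():
            for vid, s in scores.items():
                totals[vid] = totals.get(vid, 0.0) + s
                counts[vid] = counts.get(vid, 0) + 1
        mean = {vid: totals[vid] / counts[vid] for vid in totals}
        return sorted(mean, key=mean.get, reverse=True)[:top_k]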

In one embodiment of the present application, the acquisition module is further configured to: extract multiple video frames from the video to be identified and divide the multiple video frames into multiple segments; extract a target video frame from each of the multiple segments, and perform video feature extraction on the target video frames to obtain the first video frame features; obtain the first text information corresponding to the video to be identified, and perform text feature extraction on the first text information to obtain the first text feature; and obtain the first multimodal feature according to the first video frame features and the first text feature.

In one embodiment of the present application, the identification module is further configured to: perform sequence feature conversion on the in-context learning information to obtain an in-context learning sequence, where the sequence features are input features supported by a tag generation model; input the in-context learning sequence into the tag generation model, where the tag generation model is obtained by keeping the original model parameters of a preset language model frozen and adjusting newly added model parameters of the language model according to sample in-context learning sequences, the newly added model parameters being associated with a low-rank adaptation (LoRA) module introduced into the language model; and obtain the target video tag of the video to be identified output by the tag generation model.

In one embodiment of the present application, the device further includes a training module, configured to: obtain a first sample video feature, first sample text information, and a first sample video tag corresponding to a first sample video, as well as a second sample video feature, second sample text information, and a second sample video tag corresponding to a second sample video; perform feature alignment processing on the first sample video feature and the second sample video feature respectively through a pre-trained initial feature alignment module to obtain a first sample visual token sequence and a second sample visual token sequence, and construct sample in-context learning information according to the first sample visual token sequence, the first sample text information, and the first sample video tag, together with the second sample visual token sequence and the second sample text information; perform sequence feature conversion on the sample in-context learning information to obtain the sample in-context learning sequence; introduce the low-rank adaptation module into the language model, so as to introduce the newly added model parameters into the language model through the low-rank adaptation module; and keep the original model parameters of the language model frozen while adjusting the newly added model parameters of the language model according to the sample in-context learning sequence and the second sample video tag, to obtain the tag generation model.
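
A minimal PyTorch sketch of the parameter-efficient tuning described here: the base weights stay frozen and only a low-rank update is trained. The class name and hyperparameters are illustrative assumptions, not taken from the patent.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen base linear layer plus a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False            # original parameters stay frozen
            # A and B are the newly added, trainable parameters.
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale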

In one embodiment of the present application, the training module is further configured to: obtain a sample image and a sample description text corresponding to the sample image; perform feature extraction on the sample image to obtain a sample visual feature, and input the sample visual feature into the module to be trained, so that the module to be trained aligns the sample visual feature to the input text feature space of the language model; obtain a sample text feature corresponding to an alignment description instruction, and input the sample text feature and the target sample visual text feature output by the feature alignment module into the language model, where the model parameters of the language model remain frozen; and obtain the sample predicted description text output by the language model, and train the module to be trained according to the sample description text and the sample predicted description text to obtain the initial feature alignment module.

In one embodiment of the present application, the training module is further configured to: input the sample in-context learning sequence into the language model into which the low-rank adaptation module has been introduced, and obtain the predicted sample tag output by the language model; and adjust the newly added model parameters of the language model according to the difference between the predicted sample tag and the second sample video tag, to obtain the tag generation model. The training module is further configured to adjust the module parameters of the initial feature alignment module according to the difference between the predicted sample tag and the second sample video tag.

According to one aspect of the embodiments of the present application, an electronic device is provided, including one or more processors, and a storage device configured to store one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the video tag identification method described above.

According to one aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor of an electronic device, the electronic device performs the video tag identification method described above.

According to one aspect of the embodiments of the present application, a computer program product is provided, including a computer program stored in a computer-readable storage medium; a processor of an electronic device reads and executes the computer program from the computer-readable storage medium, so that the electronic device performs the video tag identification method described above.

In the technical solutions provided by the embodiments of the present application, a video to be identified is obtained, and multimodal feature extraction is performed on it to generate a first multimodal feature containing multiple modal features; multiple reference videos similar to the video to be identified are then retrieved according to the first multimodal feature, each reference video carrying a video tag. The reference videos not only provide rich comparative information but also provide an important contextual basis for identifying the tag of the video to be identified. In-context learning information is constructed according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified. With the in-context learning information, tag identification is not limited to feature analysis of a single modality; instead, the tags and multimodal features of multiple reference videos are used for contextual comparison and learning, providing a reference for identifying the video tag of the video to be identified, so that the video tag can be identified quickly and accurately and the identified video tag is more precise and reliable.

It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the specification, serve to explain the principles of the present application. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a schematic diagram of an implementation environment involved in the present application.

FIG. 2 is a flowchart of a video tag identification method according to an exemplary embodiment of the present application.

FIG. 3 is a schematic diagram of another video tag identification method according to an exemplary embodiment of the present application.

FIG. 4-1 is a schematic diagram of in-context learning information according to an exemplary embodiment of the present application.

FIG. 4-2 is a schematic diagram of other in-context learning information according to an exemplary embodiment of the present application.

FIG. 5 is a flowchart of another video tag identification method according to an exemplary embodiment of the present application.

FIG. 6 is a flowchart of another video tag identification method according to an exemplary embodiment of the present application.

FIG. 7 is a flowchart of another video tag identification method according to an exemplary embodiment of the present application.

FIG. 8 is a flowchart of another video tag identification method according to an exemplary embodiment of the present application.

FIG. 9 is a schematic diagram of the training of a feature alignment module according to an exemplary embodiment of the present application.

FIG. 10 is a schematic diagram of an application chain of the video tag identification method according to an exemplary embodiment of the present application.

FIG. 11 is a schematic flowchart of the overall structure of video tag identification according to an exemplary embodiment of the present application.

FIG. 12 is a structural block diagram of a video tag identification device according to an exemplary embodiment of the present application.

FIG. 13 is a schematic structural diagram of a computer system of an electronic device suitable for implementing the embodiments of the present application.

DETAILED DESCRIPTION

Exemplary embodiments will now be described in detail, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities; that is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the accompanying drawings are merely illustrative; they do not necessarily include all of the contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be further divided, while others may be combined or partially combined, so the actual order of execution may change according to the actual situation.

It should also be noted that "multiple" in the present application means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.

The technical solutions of the embodiments of the present application are described in detail below.

Please refer to FIG. 1, which is a schematic diagram of an implementation environment involved in the present application. The implementation environment includes a terminal 10 and a server 20.

The terminal 10 is configured to obtain the video to be identified and send the video to be identified to the server 20.

The server 20 is configured to perform multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features respectively corresponding to multiple modalities, and to retrieve, according to the first multimodal feature, multiple reference videos similar to the video to be identified, each reference video carrying a video tag. The server also performs multimodal feature extraction on each reference video to obtain the second multimodal feature corresponding to each reference video, then constructs in-context learning information according to the second multimodal feature and video tag corresponding to each reference video and the first multimodal feature of the video to be identified, and identifies the video tag of the video to be identified through the in-context learning information.

The server may also send the video tag of the video to be identified to the terminal, so that the terminal classifies the video or performs video recommendation and the like according to the video tag.

In some embodiments, the server 20 may also obtain the video to be identified by itself, then perform multimodal feature extraction, similar-video retrieval, and in-context learning information construction, and identify the video tag of the video to be identified according to the in-context learning information.

In some embodiments, the terminal 10 may also implement the video tag identification process on its own; that is, the terminal 10 obtains the video to be identified, performs multimodal feature extraction, similar-video retrieval, and in-context learning information construction, and then identifies the video tag of the video to be identified.

The aforementioned terminal 10 may be any electronic device capable of obtaining modal data of a target object, such as a smartphone, a tablet, a laptop, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, or an aircraft. The server 20 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited here.

The terminal 10 and the server 20 establish a communication connection through a network in advance, so that they can communicate with each other through the network. The network may be a wired network or a wireless network, which is likewise not limited here.

It should be noted that, in the specific implementations of the present application, at least one of the video to be identified and the reference videos involves object-related data. When the embodiments of the present application are applied to specific products or technologies, the permission or consent of the object needs to be obtained, and the collection, use, and processing of the relevant data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Various implementation details of the technical solutions of the embodiments of the present application are elaborated below.

As shown in FIG. 2, which is a flowchart of a video tag identification method according to an embodiment of the present application, the method can be applied to the implementation environment shown in FIG. 1 and can be executed by a terminal or a server, or by a terminal and a server together. In the embodiments of the present application, the method is described as being executed by a server. The video tag identification method may include S210 to S240, described in detail as follows.

S210: Obtain a video to be identified, and perform multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features respectively corresponding to multiple modalities.

In the embodiments of the present application, the video to be identified refers to a video whose video tag has not yet been determined. A video tag is a label used to identify and classify video content; it may be a specific keyword, a phrase, or any descriptive information related to the subject of the video.

The video to be identified is obtained, and multimodal feature extraction is performed on it. A modality can be understood as a distinct representation or acquisition method of data; for example, images and text belong to different modalities, and "multimodal" involves at least two modalities. Multimodal feature extraction is performed on the video to be identified, so that the resulting first multimodal feature includes multiple modal features obtained by performing feature extraction on multiple modalities respectively. For example, feature extraction may be performed on the video content of the video to be identified, on its text information, and also on its audio information; the first multimodal feature then includes the video feature corresponding to the video content, the text feature corresponding to the text information, and the audio feature corresponding to the audio information.
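
As a rough illustration of this step, the sketch below extracts per-frame visual features and a textual feature and groups them into one multimodal record. The encoders here are toy placeholders; a real system would plug in pretrained visual and text encoders.

    import numpy as np

    # Toy placeholder encoders; real systems would use pretrained models
    # (e.g., an image encoder for frames and a text encoder for titles/subtitles).
    def encode_frame(frame):                    # frame: H x W x 3 array
        return frame.mean(axis=(0, 1))          # crude 3-dim color statistic

    def encode_text(text):
        vec = np.zeros(16)
        for i, byte in enumerate(text.encode("utf-8")):
            vec[i % 16] += byte
        return vec / (np.linalg.norm(vec) + 1e-8)

    def extract_multimodal_features(frames, text_info):
        return {
            "video": np.stack([encode_frame(f) for f in frames]),  # one feature per frame
            "text": encode_text(text_info),                        # feature of accompanying text
        }

    # Example: 4 random frames plus a title.
    features = extract_multimodal_features(
        [np.random.rand(32, 32, 3) for _ in range(4)], "a sunny beach vlog")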

In one example, the video to be identified includes information of multiple modalities, and feature extraction may be performed on specific modalities of the video according to the business scenario to which the video belongs or according to instructions from the terminal. For example, if the business scenario is a short-video scenario that uses templates, and the template includes uniform background music, feature extraction may skip the audio information and instead be performed on the text and video content of the video.

In another example, feature extraction may be performed on specific modalities according to the video type of the video to be identified. For example, for a music video, feature extraction may be performed on the video content and sound; for a TV/movie clip, feature extraction may be performed on the video content, text, and sound.

S220: Retrieve, according to the first multimodal feature, multiple reference videos similar to the video to be identified, each reference video carrying a video tag.

In the embodiments of the present application, a video library is stored in advance. The video library includes multiple candidate videos carrying video tags; the video tags of the candidate videos may be obtained through steps S210 to S240 or determined by manual labeling, which is not limited here.

After the first multimodal feature of the video to be identified is obtained, multiple reference videos similar to the video to be identified are retrieved from the video library according to the first multimodal feature, where the similarity between each retrieved reference video and the video to be identified is greater than a preset similarity threshold.

In one example, the process of retrieving similar videos according to the first multimodal feature may compare each modal feature included in the first multimodal feature with the corresponding modal feature of the candidate videos in the video library, and then select the reference videos whose feature similarity is higher than the preset similarity threshold. For example: the video feature included in the first multimodal feature is compared with the video features of the candidate videos in the video library to select a first reference video set; the text feature included in the first multimodal feature is compared with the text features of the candidate videos to select a second reference video set; the video feature included in the first multimodal feature is compared with the text features of the candidate videos to select a third reference video set; and the text feature included in the first multimodal feature is compared with the video features of the candidate videos to select a fourth reference video set. The final reference videos similar to the video to be identified are obtained from the first to fourth reference video sets.

In another example, the process of retrieving similar videos according to the first multimodal feature may fuse the multiple modal features included in the first multimodal feature into a fused feature, and compare the fused feature with the fused features of the candidate videos in the video library to select a fifth reference video set.

In another example, the final reference videos similar to the video to be identified may also be obtained from the above first to fourth reference video sets together with the fifth reference video set; for example, the intersection of the first to fourth reference video sets and the fifth reference video set may be taken as the final reference videos, or all videos in the first to fifth reference video sets may be taken as the final reference videos.
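
A compact sketch of the multi-route retrieval described above, using cosine similarity over toy feature indexes (the names, toy data, and the 0.8 threshold are illustrative assumptions):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def retrieve(query_feat, index, threshold=0.8):
        """index: dict video_id -> feature vector; keep ids above the threshold."""
        return {vid for vid, feat in index.items() if cosine(query_feat, feat) >= threshold}

    # Toy indexes over three candidate videos (features would come from S210).
    rng = np.random.default_rng(0)
    video_index = {f"v{i}": rng.random(8) for i in range(3)}
    text_index = {f"v{i}": rng.random(8) for i in range(3)}
    q_video, q_text = rng.random(8), rng.random(8)

    set1 = retrieve(q_video, video_index)   # video feature vs. video features
    set2 = retrieve(q_text, text_index)     # text feature vs. text features
    set3 = retrieve(q_video, text_index)    # video feature vs. text features
    set4 = retrieve(q_text, video_index)    # text feature vs. video features
    references = set1 | set2 | set3 | set4  # or use the intersection, as described above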

S230: Construct in-context learning information according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified.

In the embodiments of the present application, multimodal feature extraction needs to be performed on each reference video to obtain its second multimodal feature. The multimodal feature extraction process is the same for the video to be identified and the reference videos, and the modal feature types included in the first multimodal feature are the same as those included in the second multimodal feature; for example, if the feature types of the first multimodal feature include a video feature and a text feature, the feature types of the second multimodal feature also include a video feature and a text feature.

It can be understood that there is a mapping relationship between the second multimodal feature of a reference video and its video tag; that is, the reference video is a learning example related to the video tag. In-context learning information is then constructed according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified. The in-context learning information includes the contextual information provided by the learning examples and serves as prompt information for learning by analogy.

In one example, the in-context learning information may be obtained by concatenating the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified.

S240: Identify the video tag of the video to be identified according to the in-context learning information.

As described above, the in-context learning information includes the contextual information provided by the learning examples, and the video tag of the video to be identified is then identified according to this information; the contextual information provided by the learning examples can guide the identification of the video tag of the video to be identified, yielding the target video tag of the video to be identified.

In one example, the in-context learning information is input into a pre-trained tag generation model, and the tag generation model generates the video tag of the video to be identified according to the in-context learning information.

In the embodiments of the present application, a video to be identified is obtained, and multimodal feature extraction is performed on it to generate a first multimodal feature containing multiple modal features; multiple reference videos similar to the video to be identified are then retrieved according to the first multimodal feature, each reference video carrying a video tag. The reference videos not only provide rich comparative information but also provide an important contextual basis for identifying the tag of the video to be identified. In-context learning information is constructed according to the second multimodal feature corresponding to each reference video, the video tag carried by each reference video, and the first multimodal feature of the video to be identified. With the in-context learning information, tag identification is not limited to feature analysis of a single modality; instead, the tags and multimodal features of multiple reference videos are used for contextual comparison and learning, providing a reference for identifying the video tag of the video to be identified. This reduces the deviation that may occur under a single modality, allows the video tag of the video to be identified to be identified quickly and accurately, and ensures that the identified video tag is more precise and reliable.

In one embodiment of the present application, another video tag identification method is provided. The method can be applied to the implementation environment shown in FIG. 1 and can be executed by a terminal or a server, or by both together; in the embodiments of the present application, it is described as being executed by a server. As shown in FIG. 3, on the basis of S210 to S240 shown in FIG. 2, this method expands S230 into S310 to S330, where the first multimodal feature includes first video frame features, the second multimodal feature includes second video frame features, and a video frame feature is obtained by performing feature extraction on the video frames of a video. S310 to S330 are described in detail as follows.

S310: Perform feature conversion processing on the first video frame features to obtain a first visual text token sequence, and perform feature conversion processing on the second video frame features corresponding to each reference video to obtain a second visual text token sequence corresponding to each reference video.

In the embodiments of the present application, a video tag is textual information, while video tags and video frame features come from different forms and modalities. To facilitate the subsequent construction of in-context learning information, the video frame features can be converted into visual text token sequences through feature conversion processing, where a visual text token sequence is the token sequence corresponding to the video frame features; the feature conversion processing is the same for the first video frame features and the second video frame features.

In one example, there are multiple first video frame features, and performing feature conversion processing on the first video frame features to obtain the first visual text token sequence includes: performing feature fusion processing on the multiple first video frame features to obtain a target video feature; and aligning the target video feature to a preset text feature space through a pre-trained feature alignment module to obtain the first visual text token sequence.

Since there are multiple first video frame features, feature fusion processing is performed on them to obtain a single target video feature, which can cover the video content more comprehensively. The feature fusion processing may be average pooling or weighted fusion, where the weight of each first video frame feature may be determined according to the position, within the video to be identified, of the video frame corresponding to that feature: the closer a video frame is to the middle of the video, the larger the weight of its video frame feature; the closer a video frame is to the start or end of the video, the smaller the weight of its video frame feature.
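
A small sketch of the position-weighted fusion just described; the triangular weighting peaking at the middle frame is one possible choice, and average pooling would simply be np.stack(frame_feats).mean(axis=0):

    import numpy as np

    def fuse_frame_features(frame_feats):
        """Fuse per-frame features into one target video feature. Frames near the
        middle of the video get larger weights; frames near the start/end get
        smaller ones, as described above."""
        feats = np.stack(frame_feats)
        n = len(feats)
        pos = np.arange(n)
        weights = 1.0 - np.abs(pos - (n - 1) / 2) / ((n - 1) / 2 + 1e-8)
        weights = weights / weights.sum()
        return np.tensordot(weights, feats, axes=1)   # weighted sum over frames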

In other embodiments of the present application, if there is only one first video frame feature, enhancement processing may be performed on it to expand the set of video frame features, and feature fusion processing may then be performed on the original and expanded video frame features to obtain the target video feature.

In the embodiments of the present application, the feature alignment module is pre-trained. The target video feature is input into the feature alignment module to align it to a preset text feature space, which may be the input text feature space of the pre-trained tag generation model; that is, non-text tokens are aligned to the space of the tag generation model's input text tokens, thereby "translating" non-text features into content the tag generation model can understand, so that the tag generation model can subsequently identify video tags according to the in-context learning information.
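
As a rough PyTorch sketch, such an alignment module can be a learned projection that maps one fused video feature into K pseudo text tokens in the language model's embedding space; the dimensions and the single-linear-layer design are illustrative assumptions:

    import torch
    import torch.nn as nn

    class FeatureAlignment(nn.Module):
        """Projects a fused video feature into num_tokens vectors in the
        language model's input text embedding space (dimension d_text)."""
        def __init__(self, d_video=768, d_text=4096, num_tokens=8):
            super().__init__()
            self.num_tokens, self.d_text = num_tokens, d_text
            self.proj = nn.Linear(d_video, num_tokens * d_text)

        def forward(self, video_feat):                          # (batch, d_video)
            out = self.proj(video_feat)                         # (batch, num_tokens * d_text)
            return out.view(-1, self.num_tokens, self.d_text)  # pseudo text tokens

    # Example: align a batch of two fused video features -> (2, 8, 4096).
    tokens = FeatureAlignment()(torch.randn(2, 768))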

类似的,对多个第二视频帧特征进行特征融合得到视频特征,通过预训练的特征对齐模块将视频特征对齐到预设文本特征空间,得到第二视觉文本标记序列。Similarly, feature fusion is performed on multiple second video frame features to obtain video features, and the video features are aligned to a preset text feature space through a pre-trained feature alignment module to obtain a second visual text tag sequence.

S320、获取待识别视频对应的第一文本信息和各个参考视频分别对应的第二文本信息。S320: Obtain first text information corresponding to the video to be identified and second text information corresponding to each reference video.

在本申请实施例中,可以根据待识别视频的附带文本信息得到第一文本信息,如附带文本信息包括视频标题、视频发布时间,视频发布平台、从视频中检测和识别的文字内容,如字幕、将人的语音转换得到的文本等;参考视频对应的第二文本信息所包含的文本类型与第一文本信息所包含的文本类型相同。In an embodiment of the present application, the first text information can be obtained based on the accompanying text information of the video to be identified, such as the accompanying text information including the video title, video release time, video release platform, text content detected and identified from the video, such as subtitles, text obtained by converting human voice, etc.; the text type contained in the second text information corresponding to the reference video is the same as the text type contained in the first text information.

可以理解的是,待识别视频的附带文本信息的信息内容较多、较杂,为了构建高效且精准上下文学习信息,可以对附带文本信息进行筛选得到第一文本信息。It is understandable that the accompanying text information of the video to be identified has more and more complex information content. In order to construct efficient and accurate context learning information, the accompanying text information can be screened to obtain the first text information.

在一示例,获取待识别视频对应的第一文本信息,包括:获取待识别视频的视频标题,以及待识别视频的音频信息转化得到的识别文本;根据待识别视频的视频场景生成待识别视频的情境描述;根据视频标题、识别文本和情境描述生成第一文本信息。In one example, obtaining first text information corresponding to a video to be identified includes: obtaining a video title of the video to be identified, and an identification text converted from audio information of the video to be identified; generating a context description of the video to be identified based on a video scene of the video to be identified; and generating the first text information based on the video title, the identification text, and the context description.

其中,可以从待识别视频的视频文件中解析元数据,以提取视频标题,若提取到的视频标题包括特殊字符或格式,需要进行清理和归一化处理,如去除特殊字符、空格处理、大小写归一化等;若提取的视频标题与视频标签的语言不同,还可以将所提取的视频标题转换为视频标签对应的语言。Among them, metadata can be parsed from the video file of the video to be identified to extract the video title. If the extracted video title includes special characters or formats, it needs to be cleaned and normalized, such as removing special characters, processing spaces, normalizing upper and lower cases, etc.; if the language of the extracted video title is different from that of the video tag, the extracted video title can also be converted into the language corresponding to the video tag.

可以理解的是,待识别视频包括音频信息,该音频信息非背景音频,而是待识别视频中包含的对象发出的音频,如音频对话,可以将音频信息通过自动语音识别技术转换为识别文本。It is understandable that the video to be identified includes audio information, which is not background audio but audio emitted by an object contained in the video to be identified, such as an audio conversation. The audio information can be converted into recognized text through automatic speech recognition technology.

视频标题和识别文本是视频所附带的基础文本信息,为了进一步添加描述性语言,以提供更多的情境背景,进一步丰富视频自身所带来的上下文信息,因此可根据待识别视频的视频场景确定情境描述,其中,可以通过对象检测分析视频中的场景内容,如通过识别视频中出现的物体确定视频对应的场景;进一步地分析待识别视频的时序信息,确定视频的场景变化信息,以帮助理解视频的上下文情境,进而根据视频的场景以及场景变化信息生成情境描述。例如,视频中检测到沙滩、太阳等元素,可以生成类似“在一个阳光明媚的沙滩上”的情境描述。The video title and recognition text are the basic text information attached to the video. In order to further add descriptive language to provide more contextual background and further enrich the contextual information brought by the video itself, the context description can be determined according to the video scene of the video to be recognized. Among them, the scene content in the video can be analyzed by object detection, such as determining the corresponding scene of the video by identifying the objects appearing in the video; further analyze the timing information of the video to be recognized, determine the scene change information of the video, so as to help understand the context of the video, and then generate a context description based on the scene and scene change information of the video. For example, if elements such as the beach and the sun are detected in the video, a context description similar to "on a sunny beach" can be generated.

In one example, the context description can be further refined according to keywords in the recognized text and the speech emotion (such as cheerful or serious). For example, if laughter or cheering is detected in the audio, it can be inferred that the video's context is likely a happy or celebratory scene.

After the video title, the recognized text, and the context description are obtained, they can be concatenated to obtain the first text information. In other words, meaningful text information is extracted from the video and assembled into first text information that carries contextual understanding, providing a basis for subsequent tag generation.

In one example, if repeated words appear across the video title, the recognized text, and the context description, the repeated words are removed before concatenation to obtain the first text information. In another example, if the first text information carries little content, for example if the number of words it contains is below a preset threshold, text data augmentation can be applied to the first text information, generating diversified text inputs through methods such as synonym replacement, sentence restructuring, and text expansion.
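The assembly just described can be sketched as follows; this is a minimal illustration, and the function names, the `MIN_WORDS` threshold, and the `augment` stub are hypothetical rather than values from the embodiment:

```python
MIN_WORDS = 10  # hypothetical preset word-count threshold


def dedupe_words(parts: list[str]) -> list[str]:
    """Drop words already seen in an earlier part before concatenation."""
    seen: set[str] = set()
    cleaned = []
    for part in parts:
        kept = [w for w in part.split() if w not in seen]
        seen.update(kept)
        cleaned.append(" ".join(kept))
    return cleaned


def augment(text: str) -> str:
    """Placeholder for text data augmentation (synonym replacement,
    sentence restructuring, text expansion)."""
    return text


def build_first_text(title: str, asr_text: str, scene_desc: str) -> str:
    """Concatenate title, recognized text, and context description."""
    parts = dedupe_words([title, asr_text, scene_desc])
    text = " ".join(p for p in parts if p)
    if len(text.split()) < MIN_WORDS:  # too little content: augment
        text = augment(text)
    return text
```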

The process of obtaining the second text information is the same as that of obtaining the first text information and is not repeated here.

S330: Construct context learning information according to the second visual text token sequence, the second text information, and the video tag corresponding to each reference video, together with the first visual text token sequence and the first text information.

In an embodiment of the present application, the second visual text token sequence, the second text information, and the video tag corresponding to each reference video can be concatenated with the first visual text token sequence and the first text information to obtain the context learning information. Specifically, for each reference video, the second visual text token sequence and the second text information serve as the input and the video tag serves as the output; concatenating the input and the output yields an input-output learning example, which provides contextual information. The learning examples corresponding to the multiple reference videos are then concatenated with the first visual text token sequence and the first text information of the video to be identified to obtain the context learning information.

In one example, constructing the context learning information according to the second visual text token sequence, the second text information, and the video tag corresponding to each reference video, together with the first visual text token sequence and the first text information, includes: constructing a learning example from a first text marker used to separate different pieces of information within the same video, together with the second visual text token sequence, the second text information, and the video tag corresponding to the same reference video, so as to obtain multiple learning examples corresponding to the multiple reference videos; concatenating the multiple learning examples to obtain learning information, where the learning examples within it are separated by a second text marker used to separate the information of different videos; constructing a recognition example from the first text marker, the first visual text token sequence, and the first text information; and generating the context learning information from the learning information and the recognition example.

For example, the first text marker and the second text marker are special tokens. The first text marker may be <pad>, used to separate different pieces of information within the same video. The learning example of reference video 1, i.e., the aforementioned input-output example, is: second visual text token sequence 1 <pad> second text information 1 <pad> video tag 1; the learning example of reference video 2 is: second visual text token sequence 2 <pad> second text information 2 <pad> video tag 2; and so on, yielding multiple learning examples for the multiple reference videos. The second text marker may be <eoc>, used to separate the information of different videos. Concatenating the learning examples yields the learning information: second visual text token sequence 1 <pad> second text information 1 <pad> video tag 1 <eoc> second visual text token sequence 2 <pad> second text information 2 <pad> video tag 2 <eoc>, and so on, where video tag 1 and second visual text token sequence 2 belong to different videos and are therefore separated by <eoc>.

In an embodiment of the present application, a recognition example is constructed in the form: first visual text token sequence <pad> first text information. The learning information and the recognition example are then concatenated to form context learning information carrying prompt information, where the prompt information can be the mapping relationship between visual text token sequences, text information, and video tags. Since the learning information and the recognition example also belong to different videos, they are joined through the second text marker, yielding the context learning information shown in Figure 4-1.
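A minimal sketch of this concatenation, assuming each visual text token sequence and text field has already been rendered as a string (the helper names are hypothetical; <pad> and <eoc> are the markers from this example):

```python
PAD, EOC = "<pad>", "<eoc>"  # first / second text markers


def learning_example(visual_tokens: str, text_info: str, video_tag: str) -> str:
    """Input-output example for one reference video."""
    return f"{visual_tokens}{PAD}{text_info}{PAD}{video_tag}"


def build_context_info(references: list[tuple[str, str, str]],
                       query_tokens: str, query_text: str) -> str:
    """Join reference examples with <eoc>, then append the recognition example."""
    learning_info = EOC.join(learning_example(*ref) for ref in references) + EOC
    recognition_example = f"{query_tokens}{PAD}{query_text}"
    return learning_info + recognition_example
```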

It can be understood that the forms of the first text marker and the second text marker can be adjusted flexibly according to actual conditions; <pad> and <eoc> are one example provided by the embodiments of this application.

In other embodiments of this application, when constructing the context learning information, in addition to combining multiple reference videos similar to the video to be identified, negative-example videos unrelated to the video to be identified can also be introduced for contrast. For example, the third visual text token sequence, the third text information, and the video tag of a negative-example video can be joined through the first text marker to obtain negative information, and the learning information, the negative information, and the recognition example can then be concatenated to obtain the context learning information shown in Figure 4-2.

It should be noted that for further details of S210-S220 and S240 shown in Figure 3, please refer to S210-S220 and S240 shown in Figure 2; they are not repeated here.

In the embodiments of this application, the multimodal features and video tags of the reference videos are listed first to construct learning examples, the multimodal features of the video to be identified are then listed to construct a recognition example, and the learning examples and the recognition example are concatenated. This builds an input containing the contextual information of similar videos and effectively exploits multimodal information for in-context learning, facilitating the subsequent accurate generation of content tags for the video to be identified.

An embodiment of this application provides another video tag identification method, which can be applied to the implementation environment shown in Figure 1. The method can be executed by a terminal or a server, or jointly by both; in this embodiment it is described as being executed by a server. As shown in Figure 5, this method builds on the method of Figure 2 by expanding S220 of Figure 2 into S510-S530. Here, the first multimodal feature includes a target video feature and a first text feature, where the target video feature can be obtained by feature fusion of multiple first video frame features and has the same feature dimension as the first text feature. S510-S530 are described in detail below.

S510: Obtain a pre-established video feature retrieval library and a pre-established text feature retrieval library, where the video feature retrieval library contains the mapping between each candidate video and its video features, and the text feature retrieval library contains the mapping between each candidate video and its text features.

In an embodiment of this application, a video feature retrieval library and a text feature retrieval library are pre-established: video feature retrieval library A contains the mapping between each candidate video and its video features, and text feature retrieval library B contains the mapping between each candidate video and its text features.

S520: Search the video feature retrieval library and the text feature retrieval library with the target video feature, and search both libraries with the first text feature, to obtain multiple target candidate videos similar to the video to be identified.

Searching retrieval library A and retrieval library B with the target video feature and with the first text feature gives four retrieval channels: 1) searching library A with the target video feature to determine candidate video set 1, the candidate videos whose video features are similar to the target video feature; 2) searching library B with the target video feature to determine candidate video set 2, the candidate videos whose text features are similar to the target video feature; 3) searching library A with the first text feature to determine candidate video set 3, the candidate videos whose video features are similar to the first text feature; and 4) searching library B with the first text feature to determine candidate video set 4, the candidate videos whose text features are similar to the first text feature.

During retrieval, the feature similarities between the target video feature and the stored video and text features are computed, as are the feature similarities between the first text feature and the stored video and text features. The feature similarity can be computed, for example, as the cosine similarity or the Euclidean distance between feature vectors. Multiple candidate videos with high similarity are then selected as a candidate video set, for example the K candidate videos with the top-K feature similarities; the number of candidate videos in each candidate video set may be the same or different.
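One retrieval channel under these definitions can be sketched with cosine similarity over in-memory feature matrices; this is an illustrative simplification (a production system would typically use an approximate-nearest-neighbor index), and the variable names are assumptions:

```python
import numpy as np


def top_k_similar(query: np.ndarray, library: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k library rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                 # cosine similarity per candidate video
    return np.argsort(-sims)[:k]   # top-K candidates

# The four channels reuse the same primitive:
#   set 1: top_k_similar(target_video_feat, library_A, K)
#   set 2: top_k_similar(target_video_feat, library_B, K)
#   set 3: top_k_similar(first_text_feat,  library_A, K)
#   set 4: top_k_similar(first_text_feat,  library_B, K)
```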

In one example, duplicate videos are removed from candidate video sets 1 through 4, the remaining candidates are sorted by similarity from high to low, and a specified number of candidate videos are selected as the multiple target candidate videos.

In another example, all videos in candidate video sets 1 through 4 may be used directly as the target candidate videos.

S530: Select reference videos from the multiple target candidate videos according to the video similarity between each target candidate video and the video to be identified.

In one example, the video similarity between each target candidate video and the video to be identified can be computed, and multiple reference videos are then selected from the target candidate videos according to their video similarity.

In another example, the multiple target candidate videos comprise those retrieved by each retrieval channel, i.e., the target candidate videos of candidate video sets 1 through 4. For the target candidate videos retrieved by each channel, the similarity between the video to be identified and each target candidate video is computed; for each target candidate video, its average similarity to the video to be identified across the retrieval channels is computed; and reference videos are selected from the target candidate videos according to the average similarity.

For the target candidate videos in candidate video set 1 of retrieval channel 1), the video similarity between each target candidate video and the video to be identified is computed; likewise for retrieval channels 2) through 4). Then, for each target candidate video, if it appears in multiple candidate video sets, its average similarity to the video to be identified across the candidate video sets is computed. For example, channel 1) retrieves four target candidate videos a, b, c, and d with similarities to the video to be identified of {a: 0.4, b: 0.2, c: 0.1, d: 0.7}; channel 2) retrieves a, b, d, and e with similarities {a: 0.3, b: 0.5, d: 0.6, e: 0.6}; channel 3) retrieves a, b, and c with similarities {a: 0.4, b: 0.5, c: 0.3}; and channel 4) retrieves a, b, e, and f with similarities {a: 0.6, b: 0.2, e: 0.4, f: 0.2}. For target candidate video a, the average similarity to the video to be identified is (0.4 + 0.3 + 0.4 + 0.6) / 4 = 0.425; for b it is 0.35. If a target candidate video does not appear in a given channel, its similarity in that channel is taken as 0: the average similarity of c is (0.1 + 0 + 0.3 + 0) / 4 = 0.1, of d is (0.7 + 0.6 + 0 + 0) / 4 = 0.325, of e is 0.25, and of f is 0.05.

When selecting reference videos from the target candidate videos according to the average similarity, the average similarities can be sorted from high to low and the top-N videos selected; for example, target candidate videos a, b, and d can be selected as reference videos, where N is smaller than K.
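A minimal sketch of this fusion step under the same assumptions (each channel returns a dict of candidate-video similarities; the names are hypothetical), reproducing the worked example above:

```python
def average_similarity(channels: list[dict[str, float]], n: int) -> list[str]:
    """Average each candidate's similarity over all channels (missing = 0)
    and return the top-N candidate video ids."""
    candidates = {vid for ch in channels for vid in ch}
    avg = {vid: sum(ch.get(vid, 0.0) for ch in channels) / len(channels)
           for vid in candidates}
    return sorted(avg, key=avg.get, reverse=True)[:n]


channels = [
    {"a": 0.4, "b": 0.2, "c": 0.1, "d": 0.7},
    {"a": 0.3, "b": 0.5, "d": 0.6, "e": 0.6},
    {"a": 0.4, "b": 0.5, "c": 0.3},
    {"a": 0.6, "b": 0.2, "e": 0.4, "f": 0.2},
]
print(average_similarity(channels, n=3))  # ['a', 'b', 'd']
```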

In other embodiments of this application, when computing the average similarity of each target candidate video, the four retrieval channels need not be given equal weight; a weight can be assigned to each channel dynamically. For example, the weights of channels 1) and 3) may be larger than those of channels 2) and 4); a weighted average similarity is then computed for each target candidate video, and reference videos are selected according to the weighted average similarity.

It should be noted that for details of S210 and S230-S240 shown in Figure 5, please refer to S210 and S230-S240 shown in Figure 2; they are not repeated here.

In the embodiments of this application, cross-retrieval with video features and text features over two different retrieval libraries exploits multimodal information to the greatest extent when retrieving target candidate videos similar to the video to be identified. Computing each target candidate video's average similarity to the video to be identified across the retrieval channels improves the accuracy and reliability of reference video selection, yielding the top-N videos most similar to the video to be identified together with their tag results.

An embodiment of this application further provides another video tag identification method, which can be applied to the implementation environment shown in Figure 1. The method can be executed by a terminal or a server, or jointly by both; in this embodiment it is described as being executed by a server. As shown in Figure 6, building on Figure 2, S210 of Figure 2 is expanded into S610-S640, which are described in detail below.

S610: Obtain the video to be identified, extract multiple video frames from it, and divide the multiple video frames into multiple segments.

S620: Extract target video frames from the multiple segments, and perform video feature extraction on the target video frames to obtain first video frame features.

S630: Obtain the first text information corresponding to the video to be identified, and perform text feature extraction on the first text information to obtain a first text feature.

S640: Obtain the first multimodal feature from the first video frame features and the first text feature.

In an embodiment of this application, multiple video frames can be extracted from the video to be identified according to its duration, for example by sampling one frame at a preset interval, where the longer the video, the longer the interval: for a 30 s video, one frame is sampled every 1 s; for a 1 min video, one frame is sampled every 2 s. Alternatively, multiple video frames can be sampled at random, or the video can be divided into three parts and frames sampled from them in different proportions.

After the multiple video frames are extracted, they can be divided into multiple segments: the frames are first ordered by time or at random and then cut into equal segments, after which target video frames are drawn from the segments, for example by sampling one or more frames at random from each segment to obtain multiple target video frames. Video feature extraction is performed on the target video frames with a pre-trained video feature encoder to obtain the first video frame features.
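A minimal sketch of this segment-and-sample step (the training/test sampling rule anticipates the scheme detailed later for Figure 11; the names are hypothetical):

```python
import random


def sample_target_frames(frames: list, m: int, training: bool = True) -> list:
    """Split the extracted frames into m equal segments and take one target
    frame per segment: a random frame in training, the middle frame otherwise."""
    seg_len = max(1, len(frames) // m)
    targets = []
    for i in range(m):
        segment = frames[i * seg_len:(i + 1) * seg_len]
        if not segment:
            break
        targets.append(random.choice(segment) if training
                       else segment[len(segment) // 2])
    return targets
```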

For obtaining the first text information corresponding to the video to be identified, refer to the embodiment shown in Figure 3; it is not repeated here. Text feature extraction is performed on the first text information with a pre-trained text feature encoder to obtain the first text feature.

The first video frame features and the first text feature are used as the first multimodal feature.

In one example, the text feature encoder and the video feature encoder are trained with the CLIP (Contrastive Language-Image Pre-training) method, so that the text features extracted by the text feature encoder are close to the video features extracted by the video feature encoder.

It should be noted that for further details of S220-S240 shown in Figure 6, please refer to S220-S240 shown in Figure 2; they are not repeated here.

In the embodiments of this application, multiple video frames are extracted from the video, the frames are divided into segments, and target video frames are drawn from the segments for video feature extraction, making the resulting first video frame features representative while avoiding feature uniformity. Moreover, owing to the properties of the CLIP model, the video text feature extracted by the encoder is ideally close to each video frame feature, making the first multimodal feature more accurate and reliable.

In one embodiment of this application, another video tag identification method is provided, which can be applied to the implementation environment shown in Figure 1. The method can be executed by a terminal or a server, or jointly by both; in this embodiment it is described as being executed by a server. As shown in Figure 7, building on Figures 2 to 6, the video tag identification process of S240 in Figure 2 is expanded into S710-S730, which are described in detail below.

S710: Perform sequence feature conversion on the context learning information to obtain a context learning sequence, where the sequence features are input features supported by the tag generation model.

In an embodiment of this application, the context learning information is a continuous long text sequence. To make it easier for the tag generation model to understand and process, sequence feature conversion is applied to the context learning information, converting the continuous long text sequence into a token sequence that the model can understand and process. A token is the basic input unit of a language model and can be a single character, a word, or a subword, enabling the model to better capture the semantics and structure of the text.

In one example, the context learning information can be converted into the context learning sequence by the tokenizer of the language model.
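For illustration, the conversion with a Hugging Face tokenizer might look as follows; the checkpoint name is a placeholder, since the embodiment does not prescribe a specific model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-open-llm")  # placeholder name
# Register the separators used in the context learning information.
tokenizer.add_special_tokens({"additional_special_tokens": ["<pad>", "<eoc>"]})

context_learning_info = "visual tokens<pad>video text 1<pad>tag 1<eoc>..."
context_learning_sequence = tokenizer(context_learning_info).input_ids
```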

S720: Input the context learning sequence into the tag generation model, where the tag generation model is obtained by keeping the original model parameters of a preset language model frozen and adjusting newly added model parameters of the language model according to sample context learning sequences, the newly added model parameters being associated with a low-rank adaptation module introduced into the language model.

In an embodiment of this application, the tag generation model can be obtained by training a preset language model. The language model is obtained by pre-training and can therefore be applied in the field of natural language processing (NLP), for example to machine translation, speech recognition, and text generation; it includes, but is not limited to, large language models (LLMs), multimodal large language models (MLLMs), and other generative models. On this basis, to optimize the language model for the content-tagging application scenario, the language model is fine-tuned so as to better fit the needs of the business scenario. In the embodiments of this application, parameter-efficient fine-tuning of the language model is performed through a Low-Rank Adaptation (LoRA) module. LoRA introduces newly added model parameters, such as low-rank matrices, into the language model; these newly added parameters are trainable while the original parameters of the language model remain frozen. The newly added parameters are adjusted according to the sample context learning sequences, and the language model containing both the newly added and the original parameters is then used as the tag generation model.
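A sketch of this parameter-efficient setup with the peft library, for illustration only; the base checkpoint name and the LoRA hyperparameters are assumptions, not values from the embodiment:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-open-llm")  # placeholder

# Freeze every original parameter of the language model.
for param in base.parameters():
    param.requires_grad = False

# Inject trainable low-rank matrices; only these are updated during training.
lora_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
tag_model = get_peft_model(base, lora_config)
tag_model.print_trainable_parameters()  # only the LoRA parameters train
```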

S730: Obtain the target video tag of the video to be identified output by the tag generation model.

The tag generation model can learn from analogy, based on the contextual information provided by the learning examples in the context learning information, to identify the video tag of the video to be identified.

It should be noted that for further details of S210-S230 shown in Figure 7, please refer to S210-S230 shown in Figure 2; they are not repeated here.

In the embodiments of this application, the context learning information carries reference prompts for generating video tags from multimodal features and is converted into a token sequence that the language model can better understand and process. The tag generation model built on the language model can then follow the prompts to quickly generate tags for the video to be identified. Moreover, the tag generation model is obtained through parameter-efficient fine-tuning of the language model, making it better suited to the requirements of the video tagging task.

It is worth noting that one embodiment of this application provides yet another video tag identification method, which can be applied to the implementation environment shown in Figure 1. The method can be executed by a terminal or a server, or jointly by both; in this embodiment it is described as being executed by a server. As shown in Figure 8, this method adds the training process of the tag generation model, i.e., S810-S850, to the method shown in Figure 7. S810-S850 are described in detail below.

S810: Obtain the first sample video features, first sample text information, and first sample video tag corresponding to a first sample video, and the second sample video features, second sample text information, and second sample video tag corresponding to a second sample video.

In an embodiment of this application, the model is trained on a training set containing multiple sample videos, each carrying a video tag. The first sample video is the reference sample video during training, and the second sample video is the sample video to be identified during training; the first and second sample videos are similar videos. For the process of obtaining the sample video features, see Figure 6; for obtaining the sample text information, see Figure 3; they are not repeated here.

S820: Perform feature alignment processing on the first sample video features and the second sample video features respectively with a pre-trained initial feature alignment module to obtain a first sample visual token sequence and a second sample visual token sequence, and construct sample context learning information from the first sample visual token sequence, the first sample text information, and the first sample video tag, together with the second sample visual token sequence and the second sample text information.

In an embodiment of this application, an initial feature alignment module is pre-trained. This module can map video features into a space aligned with text features so that visual information can be better understood and used by the language model. The first and second sample video features are each passed through the initial feature alignment module for feature alignment processing, mapping them into the space aligned with the language model's text features and yielding the first and second sample visual token sequences. Sample context learning information is then constructed from the first sample visual token sequence, the first sample text information, and the first sample video tag, together with the second sample visual token sequence and the second sample text information; see Figure 3 for the construction process.

In one example, the initial feature alignment module can be trained on image-text description pairs jointly with the language model. The training steps include: obtaining a sample image and the sample description text corresponding to the sample image; performing feature extraction on the sample image to obtain sample visual features and inputting them into the module to be trained, so that the module to be trained aligns the sample visual features to the input text feature space of the language model; obtaining the sample text features corresponding to an alignment description instruction and inputting the sample text features, together with the target sample visual text features output by the feature alignment module, into the language model, whose model parameters remain frozen; and obtaining the sample predicted description text output by the language model and training the module to be trained according to the sample description text and the sample predicted description text to obtain the initial feature alignment module.

Here, the module to be trained can be a network with a single fully connected (FC) layer, a multi-layer perceptron (MLP), or another type of neural network structure such as a Q-Former; the sample description text describes the image content of the sample image. As shown in Figure 9, the sample image is input into the video feature encoder, which extracts the sample visual features containing the main visual information of the sample image, such as objects, background, and colors. An alignment description instruction is obtained at the same time; it instructs the language model to generate the description text corresponding to the image and can be a textual instruction, for example one of the following sentences: "Briefly describe the following picture.", "Provide a brief description of the given picture.", or "Concisely explain the provided picture.". To allow the language model to process the alignment description instruction, the instruction is converted into sample text features by the text tokenizer, and the sample text features are input into the language model.

Since visual features and text features come from different modalities, feeding them into the language model directly usually works poorly. The sample visual features therefore need to be input into the module to be trained, which learns to map visual features into a space consistent with the text features, producing the target sample visual text features so that they can be aligned with the text features in the same or a similar feature space.

It should be noted that while the module to be trained is being trained, the model parameters of the language model remain frozen. After the sample text features and the target sample visual text features are input into the language model, the language model combines them to generate the sample predicted description text. The parameters of the module to be trained can be adjusted according to the difference between the sample predicted description text and the sample description text of the sample image until the module converges, yielding the initial feature alignment module. Through this training, the initial feature alignment module can effectively align visual features into a space consistent with the text features, enabling the language model to perform well with multimodal input.
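A minimal PyTorch-style sketch of this stage, for illustration only; the adapter is the single-FC-layer variant mentioned above, and the dimensions, learning rate, and frozen-LLM loss interface are assumptions:

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Module to be trained: maps visual features into the LLM's
    input text feature space (single fully connected layer variant)."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)


adapter = Adapter(vision_dim=512, text_dim=4096)  # dimensions are assumptions
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)


def stage_one_step(frozen_llm, visual_feats, instruction_embeds, target_ids):
    """One update: the frozen LLM scores the generated description against
    the sample description text; gradients reach only the adapter.
    (Label padding/alignment details are elided for brevity.)"""
    visual_tokens = adapter(visual_feats)             # align to text space
    inputs = torch.cat([visual_tokens, instruction_embeds], dim=1)
    loss = frozen_llm(inputs_embeds=inputs, labels=target_ids).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```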

S830: Perform sequence feature conversion on the sample context learning information to obtain a sample context learning sequence.

The sample context learning information is converted into the sample context learning sequence by the tokenizer of the language model.

S840: Introduce the low-rank adaptation module into the language model, so as to introduce the newly added model parameters into the language model through the low-rank adaptation module.

S850: Keep the original model parameters of the language model frozen, and adjust the newly added model parameters of the language model according to the sample context learning sequence and the second sample video tag to obtain the tag generation model.

In the embodiments of this application, the LoRA module is introduced into the language model and introduces low-rank matrices into certain layers of the model. During forward propagation, the LoRA module transforms the input data through the low-rank matrices before passing it to the next layer of the model. This transformation is equivalent to adding an extra layer of parameter adjustment on top of the language model, i.e., introducing newly added model parameters into the language model without changing the original model's parameters. During training, only the newly added parameters are trained; the original model parameters of the language model remain frozen and are not updated. In this way, the model can be fine-tuned while the original model parameters remain unchanged.

In one example, the sample context learning sequence is input into the language model with the low-rank adaptation module introduced, and the predicted sample tag output by the language model is obtained; the language model can learn by analogy from the sample context learning sequence to generate the predicted sample tag corresponding to the second sample video. Since the second sample video carries the second sample video tag, the newly added model parameters of the language model can be adjusted according to the difference between the predicted sample tag and the second sample video tag to obtain the tag generation model. For example, a loss function, such as a contrastive loss or a mean squared error loss, is computed from the difference between the predicted sample tag and the second sample video tag, and the newly added model parameters are then optimized according to the loss function until the difference between the predicted sample tag and the second sample video tag is smaller than a preset threshold, yielding the tag generation model.

It should be noted that while the LoRA module is being trained, the initial feature alignment module can be further optimized: its module parameters are adjusted according to the difference between the predicted sample tag and the second sample video tag, for example by further optimizing them according to the loss function, until the difference between the predicted sample tag and the second sample video tag is smaller than the preset threshold, yielding the final feature alignment module.
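A schematic step of this joint optimization, for illustration only; `tag_model` (the LoRA-injected language model) and `adapter` follow the earlier sketches, and the loss interface is an assumption in the style of Hugging Face causal LMs:

```python
import torch

# Only the LoRA parameters of the LLM are trainable; the adapter keeps training too.
trainable = [p for p in tag_model.parameters() if p.requires_grad]
trainable += list(adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)


def stage_two_step(sample_sequence_ids: torch.Tensor,
                   label_ids: torch.Tensor) -> float:
    """One update: loss between the predicted and second sample video tags."""
    out = tag_model(input_ids=sample_sequence_ids, labels=label_ids)
    optimizer.zero_grad()
    out.loss.backward()   # gradients flow only into LoRA and adapter params
    optimizer.step()
    return out.loss.item()
```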

In the embodiments of this application, the feature alignment module is first preliminarily trained together with the language model while the language model's parameters are kept frozen; through this training, the feature alignment module learns to align visual features with text features, achieving effective conversion and fusion between multimodal features. LoRA is then introduced into the language model; through low-rank decomposition, LoRA achieves effective adjustment of the model's output with minimal parameter changes, adapting it to the requirements of the video tagging task. LoRA is trained while the adapter continues to be trained, so that the model can generate content tags for the video to be identified from the multimodal context examples and the feature alignment of the feature alignment module becomes more precise.

For ease of understanding, an embodiment of this application also provides a video tag identification method. As shown in Figure 10, a video enters the content-processing pipeline from the content-production stage, acquires the corresponding content features through human-machine collaboration, and moves on to the downstream content-distribution stage. The video tag identification method proposed in the embodiments of this application belongs to the machine-tagging stage.

The overall video tag identification process is shown in Figure 11 and comprises video multimodal feature extraction, similar video retrieval, in-context learning instruction construction, and model recognition. An LLM is used here as the example model; an open-source Chinese LLM such as chinese-llama or BLOOM can be chosen.

Video multimodal feature extraction extracts two modalities from the original video: video frames and video text. Video frames: the video is first sampled at one frame per second and then cut into M equal segments; in training mode one frame is taken at random from each segment, and in test mode the middle frame of each segment is taken, giving M frames in total. Video text: the title of the video and the text obtained from the video's ASR are extracted and concatenated into one sentence.

In the embodiments of this application, the original audio features are not used because, in this business scenario, much of the background sound in short videos is popular background music shared across videos and thus not very discriminative, and the important spoken information has already been extracted by ASR and merged into the video text information; the original audio signal is therefore not used as one of the inputs.

The embodiments of this application use a video feature encoder (cv encoder) and a text feature encoder (text encoder) trained with the CLIP method to extract the video multimodal features. Specifically, for the M video frame images, the video frame features are computed by the cv encoder and denoted F = {f_v^1, f_v^2, ..., f_v^M}; the video title and other text such as the ASR output are concatenated, and the video text feature, denoted f_t, is computed by the text encoder. Owing to the feature-alignment property of the CLIP model, the video text feature is ideally close to each video frame feature.

Before similar video retrieval can be performed, the retrieval libraries must be constructed. For each video, the method above yields the video frame features F = {f_v^1, f_v^2, ..., f_v^M} and the video text feature f_t. To simplify computation, average pooling is applied to the video frame features F, giving a single feature f_v with the same dimension as the text feature. Every video in the retrieval libraries has annotated tag results. Two retrieval libraries can then be built: retrieval library A: {f_v ==> video}, built from the visual features of the video frames; and retrieval library B: {f_t ==> video}, built from the video text features.

The above steps of constructing the retrieval libraries can be completed by offline computation.
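This offline build step can be sketched as follows, with plain in-memory matrices standing in for retrieval libraries A and B (the names are assumptions):

```python
import numpy as np


def build_libraries(frame_feats: dict[str, np.ndarray],
                    text_feats: dict[str, np.ndarray]):
    """frame_feats: video id -> (M, d) frame features;
    text_feats: video id -> (d,) text feature.
    Returns row-aligned library A (pooled visual) and library B (text)."""
    ids = list(frame_feats)
    lib_a = np.stack([frame_feats[v].mean(axis=0) for v in ids])  # avg pooling
    lib_b = np.stack([text_feats[v] for v in ids])
    return ids, lib_a, lib_b
```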

For the video to be identified, after the visual feature f_v and the text feature f_t are extracted, a cross search is performed over retrieval libraries A and B, returning the top-K results of the four retrieval channels. As introduced above, CLIP's features ideally align the visual and text modalities, so in addition to searching visual library A with the visual feature and text library B with the text feature, cross retrieval is also possible: searching text feature library B with the visual feature and visual feature library A with the text feature. The similarity between each video in the four channels' top-K results and the video to be identified is computed, and the average similarity of each retrieved video across the four channels is then computed (if a video does not appear in a channel's top-K, its similarity there is taken as 0). The top-N videos and their corresponding tag results are obtained by average similarity and used as the subsequent context reference samples. K can be set to a value larger than N, for example K = 2 * N.

The embodiments of this application draw on the basic idea of in-context learning (ICL), namely learning from analogy, and extend it to multimodal application scenarios. Using the similar video retrieval method described above, the top-N samples closest in content to the video to be identified are recalled, and the LLM input is constructed with the following template.

Specifically, the template first lists the multimodal features and corresponding tag results of the top-N videos, then gives the features of the video to be identified, and lets the LLM generate its content tags. Here <pad> and <eoc> are special tokens: <pad> separates the token sequences of different modalities and results within one video, and <eoc> stands for end of chunk and separates the token sequences of different videos. In the template, video text i is the information formed by concatenating the video's title with text such as the ASR output (the original text, not the text feature f_t); video tag i is the tag annotation text of the video; and [visual token_i] represents the token sequence of the visual features, obtained here by converting the video frame feature f_v through an adapter (the feature alignment module) that needs to be trained. The entire input sequence is turned into a token sequence (the aforementioned context learning sequence) by the LLM's tokenizer. In this way, an input resembling in-context demonstration examples is constructed for the LLM; see Figure 4-1.

In the model recognition process, the LLM receives the token sequence (the aforementioned context learning sequence) and generates tags from it.

It should be pointed out that, unlike the traditional approach of not training the LLM at all, the embodiments of this application consider that the content-tagging task is somewhat subjective, and appropriately fine-tuning the LLM can better fit the needs of the business scenario; a LoRA module is therefore used for parameter-efficient fine-tuning of the LLM. Throughout training, all LLM parameters remain frozen, and only the LoRA and adapter modules are trained.

The LLM training approach consists of two key stages.

Stage one: train the adapter to align visual features. In this stage only the adapter is trained, without LoRA, using image-text description pairs.

In this stage, no other information is provided to the LLM; it receives only visual tokens, which forces the adapter to successfully "translate" the visual information for the LLM before the correct image description text can be generated, allowing the adapter to train better; see Figure 9 for details.

Stage two: train LoRA. In this stage, LoRA is added to the LLM, and the adapter and LoRA are trained together so that the model generates tag results for the video to be identified from the context examples, as shown in Figure 11. The input token sequences of this stage are generated from the template by the method described above, without hand-designed instructions.

In the embodiments of this application, when the LLM generates tag results for the video to be identified during training, the selected combinations of context examples and the template structure can be optimized according to quality feedback on the generated tags. For example, if the model's generated tags are of lower quality for certain categories, combinations of context examples related to those categories can be preferentially selected for a second round of learning.

It can be understood that, once training is complete, the model can predict tags for new videos: after a video's visual and text features are extracted, similar videos are retrieved, the context-enhanced input token sequence is constructed from the template, and the LLM can directly generate the corresponding content tag results.

The video tag identification method provided by this application leverages the strong inherent reasoning ability of large models while accounting for differences in standards across business scenarios. It proposes a context-enhanced multimodal large model which, combining the idea of in-context learning, introduces a similar video retrieval step: the top-N recalled videos similar to the video to be identified, together with their tag results, are fed to the large model as reference samples for joint inference, enhancing the features input to the model and achieving the training objective with fewer training samples.

An apparatus embodiment of this application is introduced here; it can be used to execute the video tag identification method of the above embodiments of this application. For details not disclosed in the apparatus embodiment, please refer to the above embodiments of the video tag identification method of this application.

本申请实施例提供了一种视频标签识别装置，如图12所示，装置包括：An embodiment of the present application provides a video tag identification device. As shown in FIG. 12, the device includes:

获取模块1210,用于获取待识别视频,并对所述待识别视频进行多模态特征提取得到第一多模态特征,所述第一多模态特征包括多种模态分别对应的多种模态特征。The acquisition module 1210 is used to acquire a video to be identified, and perform multimodal feature extraction on the video to be identified to obtain a first multimodal feature, where the first multimodal feature includes multiple modal features corresponding to multiple modalities.

检索模块1220,用于根据所述第一多模态特征检索与所述待识别视频相似的多个参考视频,每个所述参考视频均携带有视频标签。The retrieval module 1220 is used to retrieve a plurality of reference videos similar to the video to be identified according to the first multimodal feature, each of the reference videos carrying a video tag.

构建模块1230,用于根据各个所述参考视频分别对应的第二多模态特征和各个所述参考视频携带的视频标签,以及所述待识别视频的第一多模态特征构建上下文学习信息。The construction module 1230 is used to construct context learning information according to the second multimodal features corresponding to each of the reference videos, the video tags carried by each of the reference videos, and the first multimodal features of the video to be identified.

识别模块1240,用于根据所述上下文学习信息,对所述待识别视频的视频标签进行识别。The identification module 1240 is used to identify the video tag of the video to be identified according to the context learning information.

在本申请的一个实施例中,基于前述方案,所述第一多模态特征包括第一视频帧特征,所述第二多模态特征包括第二视频帧特征;构建模块进一步用于对所述第一视频帧特征进行特征转换处理得到第一视觉文本标记序列,并对各个所述参考视频对应的所述第二视频帧特征分别进行特征转换处理得到各个所述参考视频对应的第二视觉文本标记序列;获取所述待识别视频对应的第一文本信息和各个所述参考视频分别对应的第二文本信息;根据各个所述参考视频对应的所述第二视觉文本标记序列、所述第二文本信息和所述视频标签,以及所述第一视觉文本标记序列和所述第一文本信息,构建所述上下文学习信息。In one embodiment of the present application, based on the aforementioned scheme, the first multimodal feature includes a first video frame feature, and the second multimodal feature includes a second video frame feature; the construction module is further used to perform feature conversion processing on the first video frame feature to obtain a first visual text mark sequence, and perform feature conversion processing on the second video frame features corresponding to each of the reference videos to obtain a second visual text mark sequence corresponding to each of the reference videos; obtain the first text information corresponding to the video to be identified and the second text information corresponding to each of the reference videos; construct the context learning information according to the second visual text mark sequence corresponding to each of the reference videos, the second text information and the video label, as well as the first visual text mark sequence and the first text information.

在本申请的一个实施例中,基于前述方案,所述第一视频帧特征的数量包括多个;所述构建模块进一步用于对多个所述第一视频帧特征进行特征融合处理,得到目标视频特征;通过预训练的特征对齐模块将所述目标视频特征对齐到预设文本特征空间,得到所述第一视觉文本标记序列。In one embodiment of the present application, based on the aforementioned scheme, the number of the first video frame features includes multiple; the construction module is further used to perform feature fusion processing on the multiple first video frame features to obtain target video features; and the target video features are aligned to the preset text feature space through a pre-trained feature alignment module to obtain the first visual text tag sequence.
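
One plausible reading of this step, shown as a sketch: average the sampled frame features into a single target video feature (the concrete fusion operator is not specified here, so mean pooling is an assumption), then let the pre-trained alignment module map it into the preset text feature space.

```python
import torch

def frames_to_visual_tokens(frame_feats: torch.Tensor, align_module):
    """frame_feats: (num_frames, vis_dim) features of the sampled frames."""
    target_video_feat = frame_feats.mean(dim=0, keepdim=True)  # feature fusion
    # The alignment module (e.g., the VisualAdapter sketched earlier)
    # maps the fused feature into the text feature space, yielding the
    # first visual text token sequence.
    return align_module(target_video_feat)      # (1, n_tokens, txt_dim)
```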

在本申请的一个实施例中,基于前述方案,所述构建模块进一步用于根据用于隔离同一视频下的不同信息的第一文本标记,以及同一参考视频对应的所述第二视觉文本标记序列、所述第二文本信息和所述视频标签构建学习示例,以得到多个参考视频对应的多个学习示例;将所述多个学习示例进行拼接,得到学习信息,其中,所述学习信息中的多个学习示例通过用于隔离不同视频的信息的第二文本标记进行分隔;根据所述第一文本标记、所述第一视觉文本标记序列和所述第一文本信息构建识别示例;根据所述学习信息和所述识别示例生成所述上下文学习信息。In one embodiment of the present application, based on the aforementioned scheme, the construction module is further used to construct a learning example based on a first text tag for isolating different information under the same video, and the second visual text tag sequence, the second text information and the video label corresponding to the same reference video, so as to obtain multiple learning examples corresponding to multiple reference videos; splicing the multiple learning examples to obtain learning information, wherein the multiple learning examples in the learning information are separated by the second text tag for isolating information of different videos; constructing a recognition example based on the first text tag, the first visual text tag sequence and the first text information; and generating the contextual learning information based on the learning information and the recognition example.
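
As a toy illustration of the two separator markers, the context could be rendered as plain text before tokenization; the marker strings `<sep>` and `<eov>` are invented placeholders, not tokens defined by this application.

```python
INTRA_SEP = " <sep> "     # first text marker: fields within one video
INTER_SEP = "\n<eov>\n"   # second text marker: boundary between videos

def render_example(visual_tokens_str, text_info, labels=None):
    fields = [visual_tokens_str, text_info]
    if labels is not None:              # reference videos carry known labels
        fields.append("labels: " + ", ".join(labels))
    return INTRA_SEP.join(fields)

def build_context_text(references, query_visual, query_text):
    examples = [render_example(v, t, y) for v, t, y in references]
    query = render_example(query_visual, query_text)  # labels left to the LLM
    return INTER_SEP.join(examples + [query])
```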

在本申请的一个实施例中,基于前述方案,获取模块进一步用于获取所述待识别视频的视频标题,以及所述待识别视频的音频信息转化得到的识别文本;根据所述待识别视频的视频场景生成所述待识别视频的情境描述;根据所述视频标题、所述识别文本和所述情境描述生成所述第一文本信息。In one embodiment of the present application, based on the aforementioned scheme, the acquisition module is further used to obtain the video title of the video to be identified, and the recognition text converted from the audio information of the video to be identified; generate a context description of the video to be identified according to the video scene of the video to be identified; and generate the first text information according to the video title, the recognition text and the context description.

在本申请的一个实施例中,基于前述方案,所述第一多模态特征包括目标视频特征和第一文本特征;所述检索模块用于获取预先建立的视频特征检索库和文本特征检索库,所述视频特征检索库包括各个候选视频和候选视频对应的视频特征的映射关系,所述文本特征检索库包括各个候选视频和候选视频对应的文本特征的映射关系;根据所述目标视频特征分别检索所述视频特征检索库和所述文本特征检索库,并根据所述第一文本特征分别检索所述视频特征检索库和文本特征检索库,得到与所述待识别视频相似的多个目标候选视频;根据所述多个目标候选视频和所述待识别视频之间的视频相似度,从所述多个目标候选视频中选择所述参考视频。In one embodiment of the present application, based on the aforementioned scheme, the first multimodal feature includes a target video feature and a first text feature; the retrieval module is used to obtain a pre-established video feature retrieval library and a text feature retrieval library, the video feature retrieval library includes a mapping relationship between each candidate video and the video feature corresponding to the candidate video, and the text feature retrieval library includes a mapping relationship between each candidate video and the text feature corresponding to the candidate video; the video feature retrieval library and the text feature retrieval library are searched separately according to the target video feature, and the video feature retrieval library and the text feature retrieval library are searched separately according to the first text feature to obtain multiple target candidate videos similar to the video to be identified; the reference video is selected from the multiple target candidate videos according to the video similarity between the multiple target candidate videos and the video to be identified.
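
A sketch of the two retrieval libraries using FAISS, under the assumption that video and text features are embedded in a shared space (so that querying the video library with a text feature, and vice versa, is meaningful):

```python
import faiss
import numpy as np

def build_library(feats: np.ndarray) -> faiss.Index:
    feats = np.ascontiguousarray(feats, dtype="float32")
    faiss.normalize_L2(feats)                  # cosine via inner product
    index = faiss.IndexFlatIP(feats.shape[1])
    index.add(feats)
    return index

def cross_retrieve(video_lib, text_lib, vid_feat, txt_feat, top_n=10):
    channels = {}                              # candidate id -> similarities
    for q in (vid_feat, txt_feat):             # query with each modality
        q = np.ascontiguousarray(q.reshape(1, -1), dtype="float32")
        faiss.normalize_L2(q)
        for lib in (video_lib, text_lib):      # against each library
            sims, ids = lib.search(q, top_n)
            for s, i in zip(sims[0], ids[0]):
                channels.setdefault(int(i), []).append(float(s))
    return channels
```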

在本申请的一个实施例中，基于前述方案，所述多个目标候选视频包括每种检索方式所检索得到的多个目标候选视频；检索模块进一步用于针对每一种检索方式所检索得到的所述多个目标候选视频，计算所述待识别视频与每个目标候选视频的相似度；针对每个目标候选视频，计算所述目标候选视频在各个检索方式中与所述待识别视频的平均相似度；根据所述平均相似度从所述多个目标候选视频中选择所述参考视频。In one embodiment of the present application, based on the aforementioned scheme, the multiple target candidate videos include the multiple target candidate videos retrieved by each retrieval method; the retrieval module is further used to calculate, for the multiple target candidate videos retrieved by each retrieval method, the similarity between the video to be identified and each target candidate video; for each target candidate video, calculate the average similarity between the target candidate video and the video to be identified across the retrieval methods; and select the reference video from the multiple target candidate videos according to the average similarity.
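
Continuing the retrieval sketch above, reference selection by average similarity over the retrieval channels might look like this (averaging only over the channels that actually recalled a candidate is an assumption):

```python
def select_references(channels: dict, n_refs: int = 5) -> list:
    # channels: candidate id -> list of similarities, one per retrieval
    # channel that recalled the candidate; rank by the mean and keep top-N.
    avg = {cid: sum(sims) / len(sims) for cid, sims in channels.items()}
    return sorted(avg, key=avg.get, reverse=True)[:n_refs]
```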

在本申请的一个实施例中,基于前述方案,所述获取模块进一步用于从所述待识别视频中提取多个视频帧,并将所述多个视频帧划分为多个片段;从所述多个片段中抽取目标视频帧,对所述目标视频帧进行视频特征提取得到第一视频帧特征;获取所述待识别视频对应的第一文本信息,并对所述第一文本信息进行文本特征提取得到第一文本特征;根据所述第一视频帧特征和所述第一文本特征得到所述第一多模态特征。In one embodiment of the present application, based on the aforementioned scheme, the acquisition module is further used to extract multiple video frames from the video to be identified, and divide the multiple video frames into multiple segments; extract a target video frame from the multiple segments, and perform video feature extraction on the target video frame to obtain a first video frame feature; obtain the first text information corresponding to the video to be identified, and perform text feature extraction on the first text information to obtain a first text feature; obtain the first multimodal feature based on the first video frame feature and the first text feature.
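
Frame sampling by segments can be pictured as below; taking the middle frame of each segment is an illustrative choice, since the selection rule within a segment is not fixed here.

```python
def sample_target_frames(frames: list, num_segments: int = 8) -> list:
    """Split the decoded frames into segments and take one frame per segment."""
    seg_len = max(1, len(frames) // num_segments)
    segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    return [seg[len(seg) // 2] for seg in segments[:num_segments]]
```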

在本申请的一个实施例中，基于前述方案，识别模块进一步用于对所述上下文学习信息进行序列特征转换，得到上下文学习序列，所述序列特征为标签生成模型所支持的输入特征；将所述上下文学习序列输入至所述标签生成模型，所述标签生成模型是通过对预设语言模型的原始模型参数保持冻结，并根据样本上下文学习序列对所述语言模型的新增模型参数进行调整得到的，所述新增模型参数与引入至所述语言模型的低秩自适应模块相关；获取所述标签生成模型输出的所述待识别视频的目标视频标签。In one embodiment of the present application, based on the aforementioned scheme, the identification module is further used to perform sequence feature conversion on the context learning information to obtain a context learning sequence, where the sequence features are input features supported by a label generation model; input the context learning sequence into the label generation model, where the label generation model is obtained by keeping the original model parameters of a preset language model frozen and adjusting newly added model parameters of the language model according to a sample context learning sequence, the newly added model parameters being related to a low-rank adaptive module introduced into the language model; and obtain the target video label of the video to be identified output by the label generation model.
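
The relationship between the frozen original parameters and the newly added low-rank parameters can be pictured with a minimal from-scratch LoRA layer; this is a generic sketch of the LoRA technique, not this application's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # original parameters stay frozen
            p.requires_grad = False
        # Newly added parameters: A is small-random, B starts at zero so the
        # layer initially behaves exactly like the frozen base.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```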

在本申请的一个实施例中,基于前述方案,所述装置还包括训练模块,用于获取第一样本视频对应的第一样本视频特征、第一样本文本信息和第一样本视频标签,以及第二样本视频对应的第二样本视频特征、第二样本文本信息和第二样本视频标签;根据预训练的初始特征对齐模块分别对所述第一样本视频特征和所述第二样本视频特征进行特征对齐处理,得到第一样本视觉标记序列和第二样本视觉标记序列,并根据所述第一样本视觉标记序列、第一样本文本信息和第一样本视频标签,以及第二样本视觉标记序列和第二样本文本信息构建样本上下文学习信息;将所述样本上下文学习信息进行序列特征转换,得到所述样本上下文学习序列;将所述低秩自适应模块引入所述语言模型,以通过所述低秩自适应模块在所述语言模型引入所述新增模型参数;对所述语言模型的原始模型参数保持冻结,并根据所述样本上下文学习序列和所述第二样本视频标签对所述语言模型的所述新增模型参数进行调整,得到所述标签生成模型。In one embodiment of the present application, based on the aforementioned scheme, the device also includes a training module for obtaining a first sample video feature, a first sample text information and a first sample video label corresponding to the first sample video, and a second sample video feature, a second sample text information and a second sample video label corresponding to the second sample video; performing feature alignment processing on the first sample video feature and the second sample video feature according to the pre-trained initial feature alignment module, respectively, to obtain a first sample visual marker sequence and a second sample visual marker sequence, and constructing sample context learning information according to the first sample visual marker sequence, the first sample text information and the first sample video label, and the second sample visual marker sequence and the second sample text information; performing sequence feature conversion on the sample context learning information to obtain the sample context learning sequence; introducing the low-rank adaptive module into the language model to introduce the newly added model parameters into the language model through the low-rank adaptive module; keeping the original model parameters of the language model frozen, and adjusting the newly added model parameters of the language model according to the sample context learning sequence and the second sample video label to obtain the label generation model.

在本申请的一个实施例中,基于前述方案,训练模块还用于获取样本图像和所述样本图像对应的样本描述文本;对所述样本图像进行特征提取得到样本视觉特征,并将所述样本视觉特征输入至待训练模块,以使所述待训练模块将所述样本视觉特征对齐至所述语言模型的输入文本特征空间;获取对齐描述指令对应的样本文本特征,并将所述样本文本特征和所述特征对齐模块输出的目标样本视觉文本特征输入至所述语言模型,其中,所述语言模型的模型参数保持冻结;获取所述语言模型输出的样本预测描述文本,根据所述样本描述文本和所述样本预测描述文本对所述待训练模块进行训练得到所述初始特征对齐模块。In one embodiment of the present application, based on the aforementioned scheme, the training module is also used to obtain a sample image and a sample description text corresponding to the sample image; perform feature extraction on the sample image to obtain a sample visual feature, and input the sample visual feature into the module to be trained so that the module to be trained aligns the sample visual feature to the input text feature space of the language model; obtain the sample text feature corresponding to the alignment description instruction, and input the sample text feature and the target sample visual text feature output by the feature alignment module into the language model, wherein the model parameters of the language model remain frozen; obtain the sample prediction description text output by the language model, and train the module to be trained according to the sample description text and the sample prediction description text to obtain the initial feature alignment module.

在本申请的一个实施例中,基于前述方案,所述训练模块进一步用于将所述样本上下文学习序列输入至引入所述低秩自适应模块的语言模型,并获取所述语言模型输出的预测样本标签;根据所述预测样本标签和所述第二样本视频标签的差异,对所述语言模型的所述新增模型参数进行调整,得到所述标签生成模型;所述训练模块还用于根据所述预测样本标签和所述第二样本视频标签的差异,对所述初始特征对齐模块的模块参数进行调整。In one embodiment of the present application, based on the aforementioned scheme, the training module is further used to input the sample context learning sequence into the language model introduced into the low-rank adaptive module, and obtain the predicted sample label output by the language model; according to the difference between the predicted sample label and the second sample video label, the newly added model parameters of the language model are adjusted to obtain the label generation model; the training module is also used to adjust the module parameters of the initial feature alignment module according to the difference between the predicted sample label and the second sample video label.

需要说明的是,上述实施例所提供的装置与上述实施例所提供的方法属于同一构思,其中各个模块和单元执行操作的具体方式已经在方法实施例中进行了详细描述,此处不再赘述。It should be noted that the device provided in the above embodiment and the method provided in the above embodiment belong to the same concept, wherein the specific manner in which each module and unit performs the operation has been described in detail in the method embodiment and will not be repeated here.

上述实施例所提供的装置可以设于终端内,也可以设于服务器内。The device provided in the above embodiment may be arranged in a terminal or in a server.

本申请的实施例还提供了一种电子设备,包括一个或多个处理器,以及存储装置,其中,存储装置,用于存储一个或多个计算机程序,当一个或多个计算机程序被一个或多个处理器执行时,使得电子设备实现如上的视频标签识别方法。An embodiment of the present application also provides an electronic device, comprising one or more processors and a storage device, wherein the storage device is used to store one or more computer programs, and when the one or more computer programs are executed by one or more processors, the electronic device implements the above-mentioned video tag recognition method.

图13示出了适于用来实现本申请实施例的电子设备的计算机系统的结构示意图。FIG. 13 shows a schematic diagram of the structure of a computer system of an electronic device suitable for implementing an embodiment of the present application.

需要说明的是,图13示出的电子设备的计算机系统1300仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。It should be noted that the computer system 1300 of the electronic device shown in FIG. 13 is merely an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.

如图13所示,计算机系统1300包括处理器(Central Processing Unit,CPU)1301,其可以根据存储在只读存储器(Read-Only Memory,ROM)1302中的程序或者从储存部分1308加载到随机访问存储器(Random Access Memory,RAM)1303中的程序而执行各种适当的动作和处理,例如执行上述实施例中的方法。在RAM 1303中,还存储有系统操作所需的各种程序和数据。CPU 1301、ROM 1302以及RAM 1303通过总线1304彼此相连。输入/输出(Input /Output,I/O)接口1305也连接至总线1304。As shown in FIG. 13 , a computer system 1300 includes a processor (Central Processing Unit, CPU) 1301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage part 1308 to a random access memory (RAM) 1303, such as executing the method in the above embodiment. Various programs and data required for system operation are also stored in RAM 1303. CPU 1301, ROM 1302 and RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.

在一些实施例中，以下部件连接至I/O接口1305：包括键盘、鼠标等的输入部分1306；包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分1307；包括硬盘等的储存部分1308；以及包括诸如LAN(Local Area Network,局域网)卡、调制解调器等的网络接口卡的通信部分1309。通信部分1309经由诸如因特网的网络执行通信处理。驱动器1310也根据需要连接至I/O接口1305。可拆卸介质1311，诸如磁盘、光盘、磁光盘、半导体存储器等等，根据需要安装在驱动器1310上，以便于从其上读出的计算机程序根据需要被安装入储存部分1308。In some embodiments, the following components are connected to the I/O interface 1305: an input section 1306 including a keyboard, a mouse, etc.; an output section 1307 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 1308 including a hard disk, etc.; and a communication section 1309 including a network interface card such as a LAN (Local Area Network) card, a modem, etc. The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1310 as needed so that a computer program read therefrom is installed into the storage section 1308 as needed.

特别地，根据本申请的实施例，上文参考流程图描述的过程可以被实现为计算机程序。例如，本申请的实施例包括一种计算机程序产品，其包括承载在计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中，该计算机程序可以通过通信部分1309从网络上被下载和安装，和/或从可拆卸介质1311被安装。在该计算机程序被处理器(CPU)1301执行时，执行本申请的系统中限定的各种功能。In particular, according to an embodiment of the present application, the process described above with reference to the flowchart can be implemented as a computer program. For example, an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication part 1309, and/or installed from a removable medium 1311. When the computer program is executed by the processor (CPU) 1301, various functions defined in the system of the present application are executed.

需要说明的是,本申请实施例所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory)、闪存、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的计算机程序。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的计算机程序可以用任何适当的介质传输,包括但不限于:无线、有线等等,或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the embodiment of the present application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two. The computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory), a flash memory, an optical fiber, a portable compact disk read-only memory (Compact Disc Read-Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier, which carries a computer-readable computer program. This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device. A computer program contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.

附图中的流程图和框图,图示了按照本申请各种实施例的装置、方法和计算机程序产品的可能实现的体系架构、功能和操作。其中,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机程序的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions and operations of the devices, methods and computer program products according to various embodiments of the present application. Among them, each box in the flowchart or block diagram can represent a module, a program segment, or a part of the code, and the above-mentioned module, program segment, or a part of the code contains one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the box can also occur in a different order from the order marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagram or flowchart, and the combination of boxes in the block diagram or flowchart, can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and a computer program.

描述于本申请实施例中所涉及到的单元或者模块可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元或者模块也可以设置在处理器中。其中,这些单元或者模块的名称在某种情况下并不构成对该单元或者模块本身的限定。The units or modules involved in the embodiments described in this application may be implemented by software or hardware, and the units or modules described may also be set in a processor. The names of these units or modules do not, in some cases, constitute limitations on the units or modules themselves.

本申请的另一方面还提供了一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如前所述的视频标签识别方法。该计算机可读存储介质可以是上述实施例中描述的电子设备中所包含的,也可以是单独存在,而未装配入该电子设备中。Another aspect of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the video tag recognition method as described above is implemented. The computer-readable storage medium can be included in the electronic device described in the above embodiment, or it can exist independently without being assembled into the electronic device.

本申请的另一方面还提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机程序,处理器执行该计算机程序,使得该电子设备执行上述各个实施例中提供如前所述的视频标签识别方法。Another aspect of the present application also provides a computer program product, which includes a computer program stored in a computer-readable storage medium. A processor of an electronic device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the electronic device performs the video tag recognition method as described above in the above embodiments.

应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that, although several modules or units of the equipment for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above can be embodied in one module or unit. On the contrary, the features and functions of one module or unit described above can be further divided into being embodied by multiple modules or units.

本领域技术人员在考虑说明书及实践这里公开的实施方式后，将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。Those skilled in the art will readily appreciate other embodiments of the present application after considering the specification and practicing the embodiments disclosed herein. The present application is intended to cover any variations, uses or adaptations of the present application, which follow the general principles of the present application and include common knowledge or customary technical means in the art that are not disclosed in the present application.

上述内容，仅为本申请的较佳示例性实施例，并非用于限制本申请的实施方案。本领域普通技术人员根据本申请的主要构思和精神，可以十分方便地进行相应的变通或修改，故本申请的保护范围应以权利要求书所要求的保护范围为准。The above content is only a preferred exemplary embodiment of the present application and is not intended to limit the implementation of the present application. A person of ordinary skill in the art can easily make corresponding changes or modifications based on the main concept and spirit of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection defined by the claims.

Claims (16)

1. A method for identifying a video tag, comprising:
Acquiring a video to be identified, and extracting multi-modal features of the video to be identified to obtain a first multi-modal feature, wherein the first multi-modal feature comprises a plurality of modal features respectively corresponding to a plurality of modes;
Searching a plurality of reference videos similar to the video to be identified according to the first multi-mode characteristics, wherein each reference video carries a video tag;
Building context learning information according to the second multi-mode features corresponding to the reference videos, the video tags carried by the reference videos and the first multi-mode features of the videos to be identified;
and identifying the video tag of the video to be identified according to the context learning information.
2. The method of claim 1, wherein the first multi-modal feature comprises a first video frame feature and the second multi-modal feature comprises a second video frame feature; the construction of the context learning information according to the second multi-modal feature corresponding to each reference video, the video tag carried by each reference video, and the first multi-modal feature of the video to be identified includes:
Performing feature conversion processing on the first video frame features to obtain a first visual text marking sequence, and performing feature conversion processing on the second video frame features corresponding to each reference video to obtain a second visual text marking sequence corresponding to each reference video;
Acquiring first text information corresponding to the video to be identified and second text information corresponding to each reference video;
And constructing the context learning information according to the second visual text marking sequence, the second text information and the video tag corresponding to each reference video, and the first visual text marking sequence and the first text information.
3. The method of claim 2, wherein the number of first video frame features comprises a plurality; performing feature conversion processing on the first video frame features to obtain a first visual text marking sequence, including:
Performing feature fusion processing on the plurality of first video frame features to obtain target video features;
And aligning the target video features to a preset text feature space through a pre-training feature alignment module to obtain the first visual text marking sequence.
4. The method of claim 2, wherein said constructing said contextual learning information from said second visual text marking sequence, said second text information, and said video tag corresponding to each of said reference videos, and said first visual text marking sequence and said first text information comprises:
According to a first text mark used for isolating different information under the same video, and the second visual text mark sequence, the second text information and the video tag corresponding to the same reference video, a learning example is built, so that a plurality of learning examples corresponding to a plurality of reference videos are obtained;
splicing the plurality of learning examples to obtain learning information, wherein the plurality of learning examples in the learning information are separated by a second text mark for isolating information of different videos;
Constructing an identification example according to the first text mark, the first visual text mark sequence and the first text information;
The contextual learning information is generated from the learning information and the recognition example.
5. The method according to claim 2, wherein the obtaining the first text information corresponding to the video to be identified and the second text information corresponding to each of the reference videos includes:
Acquiring a video title of the video to be identified and an identification text obtained by converting the audio information of the video to be identified;
Generating a situation description of the video to be identified according to the video scene of the video to be identified;
Generating the first text information according to the video title, the identification text and the context description.
6. The method of claim 1, wherein the first multimodal feature comprises a target video feature and a first text feature; retrieving a plurality of reference videos similar to the video to be identified according to the first multi-modal feature, including:
Acquiring a pre-established video feature retrieval library and a text feature retrieval library, wherein the video feature retrieval library comprises mapping relations of each candidate video and video features corresponding to the candidate videos, and the text feature retrieval library comprises mapping relations of each candidate video and text features corresponding to the candidate videos;
Searching the video feature searching library and the text feature searching library according to the target video features respectively, and searching the video feature searching library and the text feature searching library according to the first text features respectively to obtain a plurality of target candidate videos similar to the video to be identified;
and selecting the reference video from the target candidate videos according to the video similarity between the target candidate videos and the video to be identified.
7. The method of claim 6, wherein the plurality of target candidate videos includes a plurality of target candidate videos retrieved for each retrieval mode; the selecting the reference video from the target candidate videos according to the video similarity between the target candidate videos and the video to be identified comprises the following steps:
For the multiple target candidate videos retrieved by each retrieval mode, calculating the similarity between the video to be identified and each target candidate video;
For each target candidate video, calculating the average similarity between the target candidate video and the video to be identified in each retrieval mode;
and selecting the reference video from the target candidate videos according to the average similarity.
8. The method according to claim 1, wherein the performing multi-modal feature extraction on the video to be identified to obtain a first multi-modal feature includes:
extracting a plurality of video frames from the video to be identified, and dividing the plurality of video frames into a plurality of fragments;
Extracting target video frames from the plurality of fragments, and extracting video features of the target video frames to obtain first video frame features;
Acquiring first text information corresponding to the video to be identified, and extracting text features of the first text information to obtain first text features;
and obtaining the first multi-mode feature according to the first video frame feature and the first text feature.
9. The method according to any one of claims 1 to 8, wherein said identifying the video tag of the video to be identified based on the contextual learning information comprises:
Performing sequence feature conversion on the context learning information to obtain a context learning sequence, wherein the sequence feature is an input feature supported by a label generation model;
Inputting the context learning sequence into the label generation model, wherein the label generation model is obtained by keeping original model parameters of a preset language model frozen and adjusting newly-added model parameters of the language model according to a sample context learning sequence, and the newly-added model parameters are related to a low-rank adaptive module introduced into the language model;
and acquiring a target video tag of the video to be identified, which is output by the label generation model.
10. The method of claim 9, wherein the training step of the tag generation model comprises:
Acquiring a first sample video feature, first sample text information and a first sample video tag corresponding to a first sample video, and a second sample video feature, second sample text information and a second sample video tag corresponding to a second sample video;
Performing feature alignment processing on the first sample video feature and the second sample video feature according to a pre-trained initial feature alignment module to obtain a first sample visual marker sequence and a second sample visual marker sequence, and constructing sample context learning information according to the first sample visual marker sequence, the first sample text information and the first sample video tag, and the second sample visual marker sequence and the second sample text information;
performing sequence feature conversion on the sample context learning information to obtain a sample context learning sequence;
introducing the low-rank adaptive module into the language model to introduce the newly added model parameters into the language model through the low-rank adaptive module;
and keeping the original model parameters of the language model frozen, and adjusting the newly added model parameters of the language model according to the sample context learning sequence and the second sample video label to obtain the label generation model.
11. The method of claim 10, wherein prior to feature alignment processing of the video features according to the pre-trained initial feature alignment module, the method further comprises:
acquiring a sample image and a sample description text corresponding to the sample image;
Extracting features from the sample image to obtain sample visual features, and inputting the sample visual features to a module to be trained so that the module to be trained aligns the sample visual features to an input text feature space of the language model;
Acquiring sample text features corresponding to the alignment description instruction, and inputting the sample text features and target sample visual text features output by the feature alignment module into the language model, wherein model parameters of the language model are kept frozen;
And acquiring a sample prediction description text output by the language model, and training the module to be trained according to the sample description text and the sample prediction description text to obtain the initial feature alignment module.
12. The method of claim 10, wherein said adjusting the newly added model parameters of the language model based on the sample context learning sequence to obtain the label generation model comprises:
Inputting the sample context learning sequence into the language model into which the low-rank adaptive module has been introduced, and acquiring a prediction sample label output by the language model;
According to the difference between the prediction sample label and the second sample video label, the newly added model parameters of the language model are adjusted to obtain the label generation model;
the method further comprises the steps of:
And adjusting the module parameters of the initial feature alignment module according to the difference between the prediction sample label and the second sample video label.
13. A video processing apparatus, comprising:
The system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video to be identified, and carrying out multi-modal feature extraction on the video to be identified to obtain a first multi-modal feature, wherein the first multi-modal feature comprises a plurality of modal features corresponding to a plurality of modes respectively;
The retrieval module is used for retrieving a plurality of reference videos similar to the video to be identified according to the first multi-mode characteristics, and each reference video carries a video tag;
The construction module is used for constructing context learning information according to the second multi-mode characteristics corresponding to each reference video, the video tags carried by each reference video and the first multi-mode characteristics of the video to be identified;
and the identification module is used for identifying the video tag of the video to be identified according to the context learning information.
14. An electronic device, comprising:
One or more processors;
Storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-12.
15. A computer readable storage medium, having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 12.
16. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor of an electronic device reads and executes the computer program, causing the electronic device to perform the method of any one of claims 1 to 12.
CN202411181885.XA 2024-08-27 2024-08-27 Video tag identification method, device, equipment, medium and product Active CN118692014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411181885.XA CN118692014B (en) 2024-08-27 2024-08-27 Video tag identification method, device, equipment, medium and product

Publications (2)

Publication Number Publication Date
CN118692014A true CN118692014A (en) 2024-09-24
CN118692014B CN118692014B (en) 2024-12-13

Family

ID=92766509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411181885.XA Active CN118692014B (en) 2024-08-27 2024-08-27 Video tag identification method, device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN118692014B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119181001A (en) * 2024-11-04 2024-12-24 腾讯科技(深圳)有限公司 Video tag recognition model training method, video tag recognition method and device
CN119537647A (en) * 2025-01-23 2025-02-28 中国科学院自动化研究所 Label sequence generation method based on sequential prompts and retrieval enhanced generation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100573A (en) * 2022-07-15 2022-09-23 北京有竹居网络技术有限公司 A video recognition method, device, storage medium and device
WO2023273769A1 (en) * 2021-07-01 2023-01-05 北京百度网讯科技有限公司 Method for training video label recommendation model, and method for determining video label
CN115878842A (en) * 2022-12-23 2023-03-31 北京爱奇艺科技有限公司 Video tag determination method and device, electronic equipment and readable storage medium
WO2024001057A1 (en) * 2022-07-01 2024-01-04 深圳先进技术研究院 Video retrieval method based on attention segment prompt
CN118093936A (en) * 2024-04-26 2024-05-28 腾讯科技(深圳)有限公司 Video tag processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN118692014B (en) 2024-12-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant