WO2023173539A1 - Video content processing method and system, and terminal and storage medium - Google Patents

Video content processing method and system, and terminal and storage medium Download PDF

Info

Publication number
WO2023173539A1
WO2023173539A1 · PCT/CN2022/089559 · CN2022089559W
Authority
WO
WIPO (PCT)
Prior art keywords
video
text information
title
processed
neural network
Prior art date
Application number
PCT/CN2022/089559
Other languages
French (fr)
Chinese (zh)
Inventor
潘芸倩
叶静娴
奚悦
包小溪
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173539A1 publication Critical patent/WO2023173539A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

A video content processing method and system, and a terminal and a storage medium. The method comprises: extracting audio signals and text information from videos to be processed; extracting video images from said videos and performing image analysis on the video images so as to determine the video types of said videos, wherein the video types include a PPT video, a single-person video and a multi-person video; and, on the basis of the audio signals and the text information, extracting highlight clips from said videos of the different types by using a multi-modal video processing model, extracting the titles, summaries and tag information corresponding to the highlight clips by using a deep neural network model, and generating a short-video clipping result of said videos. The present application can generate a plurality of finely clipped short videos in one click, thereby greatly improving clipping efficiency and shortening the short-video production cycle.

Description

一种视频内容处理方法、系统、终端及存储介质A video content processing method, system, terminal and storage medium
本申请要求于2022年03月16日提交中国专利局、申请号为202210259504.X、发明名称为“一种视频内容处理方法、系统、终端及存储介质”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on March 16, 2022, with application number 202210259504.X and entitled "一种视频内容处理方法、系统、终端及存储介质" (A video content processing method, system, terminal and storage medium), the entire contents of which are incorporated herein by reference.
技术领域Technical field
本申请涉及人工智能之深度学习技术领域,特别是涉及一种视频内容处理方法、系统、终端及存储介质。This application relates to the field of deep learning technology of artificial intelligence, and in particular to a video content processing method, system, terminal and storage medium.
背景技术Background technique
目前以视频为代表的富媒体信息成为主流,其中短视频是中国消费者接触内容最多的形式,小屏幕、短视频、快节奏成为视频行业的发展趋势。At present, rich media information represented by video has become mainstream, among which short video is the most popular form of content for Chinese consumers. Small screen, short video, and fast pace have become the development trend of the video industry.
视频行业高速发展的同时也对视频处理效率和质量提出了更高的要求。发明人意识到，目前的视频内容处理主要依靠人工作业为主，内容处理工具操作门槛高、人才培育成本高，且视频的人工精剪耗时较长，在一定程度上阻碍了视频领域的发展。The rapid development of the video industry has also placed higher demands on video processing efficiency and quality. The inventors realized that current video content processing relies mainly on manual work: content processing tools have a high operating threshold, talent cultivation is costly, and manual fine editing of videos is time-consuming, which to a certain extent hinders the development of the video field.
技术问题technical problem
本申请提供了一种视频内容处理方法、系统、终端及存储介质，旨在解决现有的视频内容处理依靠人工作业存在的操作门槛高、人才培育成本高以及视频精剪耗时较长等技术问题。This application provides a video content processing method, system, terminal and storage medium, aiming to solve the technical problems of existing manual video content processing, namely the high operating threshold, the high cost of talent cultivation and the long time consumed by fine video editing.
技术解决方案Technical solutions
为解决上述技术问题,本申请采用的技术方案为:In order to solve the above technical problems, the technical solutions adopted in this application are:
一种视频内容处理方法,包括:A video content processing method, including:
提取待处理视频中的音频信号以及文本信息;Extract audio signals and text information from the video to be processed;
提取所述待处理视频中的视频图像,对所述视频图像进行图像分析,判断所述待处理视频的视频类型;所述视频类型包括PPT视频、单人视频以及多人视频;Extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos, and multi-person videos;
基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。Based on the audio signal and text information, use a multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use a deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the video to be processed.
本申请实施例采取的另一技术方案为:一种视频内容处理系统,包括:Another technical solution adopted by the embodiment of the present application is: a video content processing system, including:
多模态信息提取模块:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module: used to extract audio signals and text information from the video to be processed;
视频类型判断模块：用于提取所述待处理视频中的视频图像，对视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；Video type judgment module: used to extract video images from the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos and multi-person videos;
视频剪辑模块：用于基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
本申请实施例采取的又一技术方案为:一种终端,所述终端包括处理器、与所述处理器耦接的存储器,其中,Another technical solution adopted by the embodiment of the present application is: a terminal, the terminal includes a processor and a memory coupled to the processor, wherein,
所述存储器存储有用于实现上述的视频内容处理方法的程序指令;The memory stores program instructions for implementing the above video content processing method;
所述处理器用于执行所述存储器存储的所述程序指令以执行所述视频内容处理操作。The processor is configured to execute the program instructions stored in the memory to perform the video content processing operations.
本申请实施例采取的又一技术方案为:一种存储介质,存储有处理器可运行的程序指令,所述程序指令用于执行上述的视频内容处理方法。Another technical solution adopted by the embodiments of the present application is: a storage medium that stores program instructions executable by a processor, and the program instructions are used to execute the above video content processing method.
有益效果beneficial effects
本申请实施例的视频内容处理方法、系统、终端及存储介质采用多模态视频内容处理技术，通过提取待处理视频中的音频信号以及文本信息，基于音频信号以及文本信息，利用多模态视频处理模型以及深度学习神经网络模型获取待处理视频中的精华片段以及精华片段对应的标题、摘要以及标签信息。本申请采用全AI处理流程，可以一键生成多个精剪短视频，大大提升了剪辑效率，缩短视频制作周期。The video content processing method, system, terminal and storage medium of the embodiments of the present application adopt multi-modal video content processing technology: the audio signal and text information are extracted from the video to be processed, and, based on the audio signal and text information, a multi-modal video processing model and a deep learning neural network model are used to obtain the highlight clips in the video to be processed together with the titles, summaries and tag information corresponding to the highlight clips. The present application adopts a fully AI-driven processing pipeline that can generate multiple finely edited short videos with one click, greatly improving editing efficiency and shortening the video production cycle.
附图说明Description of the drawings
图1是本申请第一实施例的视频内容处理方法的流程示意图;Figure 1 is a schematic flow chart of a video content processing method according to the first embodiment of the present application;
图2是本申请第二实施例的视频内容处理方法的流程示意图;Figure 2 is a schematic flowchart of a video content processing method according to the second embodiment of the present application;
图3是本申请实施例视频内容处理系统的结构示意图;Figure 3 is a schematic structural diagram of a video content processing system according to an embodiment of the present application;
图4是本申请实施例的终端结构示意图;Figure 4 is a schematic structural diagram of a terminal according to an embodiment of the present application;
图5是本申请实施例的存储介质结构示意图。Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
本发明的实施方式Embodiments of the invention
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请的一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.
本申请中的术语“第一”、“第二”、“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”、“第三”的特征可以明示或者隐含地包括至少一个该特征。本申请的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。本申请实施例中所有方向性指示(诸如上、下、左、右、前、后……)仅用于解释在某一特定姿态(如附图所示)下各部件之间的相对位置关系、运动情况等,如果该特定姿态发生改变时,则该方向性指示也相应地随之改变。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。The terms “first”, “second” and “third” in this application are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Thus, features defined as "first", "second", and "third" may explicitly or implicitly include at least one of these features. In the description of this application, "plurality" means at least two, such as two, three, etc., unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship between the components in a specific posture (as shown in the drawings) , sports conditions, etc., if the specific posture changes, the directional indication will also change accordingly. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes Other steps or units inherent to such processes, methods, products or devices.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
请参阅图1,是本申请第一实施例的视频内容处理方法的流程示意图。本申请第一实施例的视频内容处理方法包括以下步骤:Please refer to FIG. 1 , which is a schematic flowchart of a video content processing method according to the first embodiment of the present application. The video content processing method in the first embodiment of the present application includes the following steps:
S10:提取待处理视频中的音频信号以及文本信息;S10: Extract audio signals and text information from the video to be processed;
S11:提取待处理视频中的视频图像,对视频图像进行图像分析,判断待处理视频的视频类型;视频类型包括PPT视频、单人视频以及多人视频;S11: Extract the video image in the video to be processed, perform image analysis on the video image, and determine the video type of the video to be processed; video types include PPT video, single-person video, and multi-person video;
S12：基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。S12: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use the deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed, and generate a short-video clipping result of the video to be processed.
请参阅图2,是本申请第二实施例的视频内容处理方法的流程示意图。本申请第二实施例的视频内容处理方法包括以下步骤:Please refer to FIG. 2 , which is a schematic flowchart of a video content processing method according to the second embodiment of the present application. The video content processing method in the second embodiment of the present application includes the following steps:
S20:提取待处理视频中的音频信号;S20: Extract the audio signal in the video to be processed;
本步骤中,音频信号提取过程具体为:将待处理视频输入开源框架(FFmpeg),通过开源框架输出待处理视频的音频信号。In this step, the audio signal extraction process is specifically: input the video to be processed into the open source framework (FFmpeg), and output the audio signal of the video to be processed through the open source framework.
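As an illustrative sketch only, the audio-extraction step could be driven from Python by calling the FFmpeg command-line tool; the file names, mono/16 kHz output settings and the use of subprocess are assumptions for the example and are not details taken from the original disclosure.

```python
# Minimal sketch: extract the audio track of a video with the FFmpeg CLI.
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract the audio stream of `video_path` into a 16 kHz mono WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,  # raise if FFmpeg fails
    )

extract_audio("input_video.mp4", "audio.wav")  # placeholder file names
```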
S21:对提取的音频信号进行语音转文字处理,生成待处理视频的文本信息;S21: Perform speech-to-text processing on the extracted audio signal to generate text information of the video to be processed;
本步骤中,文本信息即待处理视频的视频字幕。文本信息的生成方式具体为:对音频信号进行语音特征提取,将提取的语音特征输入训练好的声学模型,通过声学模型输出对应的概率得分。基于声学模型输出结果,根据一定的搜索和匹配策略从语言模型中搜索出与音频信号相匹配的文本,输出待处理视频的文本信息识别结果。In this step, the text information is the video subtitles of the video to be processed. The specific method of generating text information is: extract speech features from the audio signal, input the extracted speech features into the trained acoustic model, and output the corresponding probability score through the acoustic model. Based on the acoustic model output results, text matching the audio signal is searched from the language model according to a certain search and matching strategy, and the text information recognition results of the video to be processed are output.
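The following toy sketch is not the actual acoustic or language model described above; the vocabulary, the random scores and the greedy per-frame decoding are stand-ins meant only to illustrate how acoustic-model scores can be combined with a language-model prior during the search and matching step.

```python
# Toy sketch: combine acoustic scores with a language-model prior and decode greedily.
import numpy as np

VOCAB = ["精华", "片段", "视频", "剪辑"]            # toy vocabulary
acoustic_scores = np.random.rand(6, len(VOCAB))     # stand-in acoustic-model output, shape (frames, vocab)
lm_prior = np.array([0.4, 0.3, 0.2, 0.1])            # stand-in language-model probabilities

def decode(acoustic_scores, lm_prior, lm_weight=0.5):
    """Per frame, pick the word maximising log acoustic score + weighted log LM prior."""
    combined = np.log(acoustic_scores + 1e-9) + lm_weight * np.log(lm_prior + 1e-9)
    return [VOCAB[i] for i in combined.argmax(axis=1)]

print("".join(decode(acoustic_scores, lm_prior)))
```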
S22:提取待处理视频中的视频图像,对视频图像进行分析,获取待处理视频的视频类型;S22: Extract the video image in the video to be processed, analyze the video image, and obtain the video type of the video to be processed;
本步骤中,视频类型包括PPT视频、单人视频或多人视频。视频图像的分析过程具体为:将提取的视频图像输入开源框架,通过开源框架抽取视频图像的帧画面,对每一幅帧画面进行分类,得到待处理视频的视频类型。In this step, the video type includes PPT video, single-person video or multi-person video. The specific process of analyzing video images is as follows: input the extracted video images into the open source framework, extract the frames of the video images through the open source framework, classify each frame, and obtain the video type of the video to be processed.
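A minimal sketch of step S22, assuming OpenCV for frame extraction; `classify_frame` is a placeholder standing in for the frame classifier described above, and the sampling interval and majority vote are illustrative assumptions.

```python
# Minimal sketch: sample frames, classify each one, take the majority class as the video type.
from collections import Counter
import cv2  # OpenCV for frame extraction

def classify_frame(frame) -> str:
    """Placeholder classifier: returns 'ppt', 'single_person' or 'multi_person'."""
    return "ppt"  # stand-in; a real model would inspect the frame content

def detect_video_type(video_path: str, sample_every: int = 30) -> str:
    cap = cv2.VideoCapture(video_path)
    labels, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:            # sample one frame every `sample_every` frames
            labels.append(classify_frame(frame))
        index += 1
    cap.release()
    return Counter(labels).most_common(1)[0][0] if labels else "unknown"
```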
S23:判断待处理视频属于PPT视频、单人视频还是多人视频,如果属于PPT视频,执行S24;如果属于单人视频,执行S25;如果属于多人视频,执行S26;S23: Determine whether the video to be processed is a PPT video, a single-person video, or a multi-person video. If it is a PPT video, execute S24; if it is a single-person video, execute S25; if it is a multi-person video, execute S26;
S24：基于音频信号以及文本信息，利用多模态视频处理模型提取PPT视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成PPT视频的短视频剪辑结果；S24: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the PPT video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the PPT video;
本步骤中,PPT视频的处理过程具体为:In this step, the specific processing process of PPT video is as follows:
首先，提取PPT视频中的PPT文字信息；PPT文字信息提取过程包括：对PPT视频进行预处理后，识别PPT视频中的字符信息，并对识别的字符信息进行校正后，得到PPT视频的PPT文字信息。First, the PPT text information in the PPT video is extracted. The PPT text information extraction process includes: preprocessing the PPT video, recognizing the character information in the PPT video, and correcting the recognized characters to obtain the PPT text information of the PPT video.
其次，根据文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，根据设定的第一相似度阈值对PPT页面进行筛选，得到所有相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与全局关键词的相似度，根据设定的第二相似度阈值对PPT页面进行筛选，将所有相似度小于第二相似度阈值的PPT页面丢弃。最后，对筛选出的PPT页面进行拼接，得到PPT视频的精华片段以及精华片段在PPT视频中的时间戳。其中，通过计算文本信息和PPT文字信息中每个词语的权重以及每个词语与当前视频主题的相关性得到全局关键词；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。Secondly, based on the text information and the PPT text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent PPT pages, and the pages are filtered by a set first similarity threshold so that all pages whose similarity is greater than the first similarity threshold are kept; at the same time, the similarity between each PPT page and the global keywords is calculated, the pages are filtered by a set second similarity threshold, and all pages whose similarity is less than the second similarity threshold are discarded. Finally, the filtered PPT pages are spliced to obtain the highlight clips of the PPT video and the timestamps of the highlight clips in the PPT video. The global keywords are obtained by calculating the weight of each word in the text information and PPT text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
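A minimal sketch of the two-threshold page filtering, assuming each PPT page and the global keyword set have already been encoded as vectors; the vector size, threshold values and random demo vectors are placeholders for illustration.

```python
# Minimal sketch: keep pages similar to the previous page, drop pages far from the global keywords.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_ppt_pages(page_vecs, keyword_vec, thr_adjacent=0.6, thr_keyword=0.3):
    """Return indices of PPT pages kept after both similarity checks."""
    kept = []
    for i, vec in enumerate(page_vecs):
        if i > 0 and cosine(page_vecs[i - 1], vec) <= thr_adjacent:
            continue  # not similar enough to the previous page (first threshold)
        if cosine(vec, keyword_vec) < thr_keyword:
            continue  # too far from the global keywords (second threshold)
        kept.append(i)
    return kept

page_vecs = [np.random.rand(64) for _ in range(5)]  # placeholder page embeddings
keyword_vec = np.random.rand(64)                    # placeholder global-keyword embedding
print(filter_ppt_pages(page_vecs, keyword_vec))
```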
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clip.
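A minimal sketch of the word-by-word generation loop; `toy_model`, its canned output and the end token are stand-ins for the deep neural network model described above and are purely illustrative.

```python
# Minimal sketch: feed back generated words until the model emits the end-of-sequence token.
END_TOKEN = "<eos>"

def toy_model(transcript: str, generated: list) -> str:
    """Stand-in model: emits a fixed title and then the end token."""
    canned = ["精华", "片段", "标题", END_TOKEN]
    return canned[len(generated)] if len(generated) < len(canned) else END_TOKEN

def generate_text(transcript: str, model=toy_model, max_len: int = 32) -> str:
    generated = []
    while len(generated) < max_len:
        next_word = model(transcript, generated)  # condition on transcript + words so far
        if next_word == END_TOKEN:
            break
        generated.append(next_word)
    return "".join(generated)

print(generate_text("……视频字幕文本……"))
```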
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得到精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clip.
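A minimal sketch of ranking words by weight and topic relevance to pick tags; the frequency-based weight and the keyword-overlap relevance are simplifications standing in for the model-computed values described above.

```python
# Minimal sketch: score each word by weight * relevance and keep the top-k as tags.
from collections import Counter

def extract_tags(words, topic_words, top_k=5):
    counts = Counter(words)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        weight = count / total                            # stand-in for the learned word weight
        relevance = 1.0 if word in topic_words else 0.1   # stand-in for topic relevance
        scores[word] = weight * relevance
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(extract_tags("视频 剪辑 精华 片段 剪辑".split(), topic_words={"剪辑", "精华"}))
```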
S25：基于音频信号以及文本信息，利用多模态视频处理模型提取单人视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成单人视频的短视频剪辑结果；S25: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the single-person video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the single-person video;
本步骤中,单人视频的处理过程具体为:In this step, the specific processing process of single-person videos is as follows:
首先，根据文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并通过设定的第一相似度阈值对图像进行筛选，输出所有相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，根据设定的第二相似度阈值对所有图像进行筛选，将所有相似度小于第二相似度阈值的图像丢弃。最后，对筛选的图像进行拼接，得到单人视频的精华片段以及精华片段在单人视频中的时间戳。其中，全局关键词是通过计算文本信息中每个词语的权重以及每个词语与当前视频主题的相关性得到；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。First, based on the text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent image frames, the frames are filtered by a set first similarity threshold, and all frames whose similarity is greater than the first similarity threshold are output; at the same time, the similarity between each frame and the global keywords is calculated, the frames are filtered by a set second similarity threshold, and all frames whose similarity is less than the second similarity threshold are discarded. Finally, the filtered frames are spliced to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video. The global keywords are obtained by calculating the weight of each word in the text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出单人视频的精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clips of the single-person video.
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得到精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clip.
S26：基于音频信号以及文本信息，利用多模态视频处理模型提取多人视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成多人视频的短视频剪辑结果；S26: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the multi-person video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the multi-person video;
本步骤中,多人视频的处理过程具体为:In this step, the specific processing process of multi-person videos is as follows:
首先，对多人视频的音频信息进行声纹识别处理，得到多人视频中每个发声人的声纹识别匹配结果；其中，声纹识别过程为：利用噪声抑制算法提取音频信息中的有效语音，对提取的有效语音进行声纹特征提取，根据提取的声纹特征进行发声人声音建模，并输出每个发声人的声纹识别匹配结果。First, voiceprint recognition is performed on the audio information of the multi-person video to obtain a voiceprint matching result for each speaker in the multi-person video. The voiceprint recognition process is: use a noise suppression algorithm to extract the effective speech from the audio information, extract voiceprint features from the effective speech, model each speaker's voice based on the extracted voiceprint features, and output the voiceprint matching result for each speaker.
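A minimal sketch of the speaker-matching idea, assuming each speech segment has already been converted into a voiceprint embedding; the 128-dimensional random vectors and the enrolled speaker names are placeholders, not part of the original disclosure.

```python
# Minimal sketch: match a segment's voiceprint vector against enrolled speakers by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_speaker(segment_vec: np.ndarray, enrolled: dict) -> str:
    """Return the enrolled speaker whose voiceprint is closest to the segment."""
    return max(enrolled, key=lambda name: cosine(segment_vec, enrolled[name]))

enrolled = {"speaker_a": np.random.rand(128), "speaker_b": np.random.rand(128)}  # placeholder profiles
print(match_speaker(np.random.rand(128), enrolled))
```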
其次，根据视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并通过设定的第一相似度阈值对所有图像进行筛选，输出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，根据第二相似度阈值对所有图像进行筛选，将所有相似度小于第二相似度阈值的图像丢弃。最后，对筛选后的图像进行拼接，得到多人视频的精华片段以及精华片段在多人视频中的时间戳。其中，全局关键词是通过计算文本信息中每个词语的权重以及每个词语与当前视频主题的相关性得到；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。Secondly, based on the video images and the text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent image frames, all frames are filtered by a set first similarity threshold, and frames whose similarity is greater than the first similarity threshold are output; at the same time, the similarity between each frame and the global keywords is calculated, all frames are filtered by the second similarity threshold, and frames whose similarity is less than the second similarity threshold are discarded. Finally, the filtered frames are spliced to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video. The global keywords are obtained by calculating the weight of each word in the text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出多人视频的精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clips of the multi-person video.
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得出多人视频的精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clips of the multi-person video.
基于上述，本申请实施例的视频内容处理方法采用多模态视频内容处理技术，通过提取待处理视频中的音频信号以及文本信息，基于音频信号以及文本信息，利用多模态视频处理模型以及深度学习神经网络模型获取待处理视频中的精华片段以及精华片段对应的标题、摘要以及标签信息。本申请采用全AI(Artificial Intelligence,人工智能)处理流程，可以一键生成多个精剪短视频，大大提升了剪辑效率，缩短视频制作周期。本申请可支持智能关键词精准生成，保证画质清晰、剪辑节奏流畅、内容紧跟亮点等，具有高拓展性，应用范围广，可赋能社会生活中互联网泛娱乐、在线教育、协同办公等各个场景。Based on the above, the video content processing method of the embodiments of the present application adopts multi-modal video content processing technology: the audio signal and text information are extracted from the video to be processed, and, based on them, a multi-modal video processing model and a deep learning neural network model are used to obtain the highlight clips in the video to be processed together with the corresponding titles, summaries and tag information. The present application adopts a fully AI (Artificial Intelligence) driven processing pipeline that can generate multiple finely edited short videos with one click, greatly improving editing efficiency and shortening the video production cycle. The present application supports accurate intelligent keyword generation, ensures clear picture quality, smooth editing rhythm and content that closely follows the highlights, is highly scalable with a wide application range, and can empower scenarios in social life such as Internet pan-entertainment, online education and collaborative office work.
在一个可选的实施方式中,还可以:将所述的视频内容处理方法的结果上传至区块链中。In an optional implementation, the result of the video content processing method can also be uploaded to the blockchain.
具体地，基于所述的视频内容处理方法的结果得到对应的摘要信息，具体来说，摘要信息由所述的视频内容处理方法的结果进行散列处理得到，比如利用sha256s算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户可以从区块链中下载得该摘要信息，以便查证所述的视频内容处理方法的结果是否被篡改。本示例所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链（Blockchain），本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性（防伪）和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。Specifically, corresponding digest information is obtained based on the result of the video content processing method; that is, the digest information is obtained by hashing the result of the video content processing method, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user can download the digest information from the blockchain to verify whether the result of the video content processing method has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated and linked using cryptographic methods; each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer and the like.
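A minimal sketch of producing the digest with SHA-256 before it is uploaded; the result dictionary is a placeholder and the on-chain upload itself is outside the scope of this example.

```python
# Minimal sketch: hash the clipping result (serialised as JSON) with SHA-256.
import hashlib
import json

def summarize_result(result: dict) -> str:
    """Return the SHA-256 hex digest of the clipping result."""
    payload = json.dumps(result, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

digest = summarize_result({"title": "精华片段示例", "timestamp": [12.5, 86.0]})  # placeholder result
print(digest)
```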
请参阅图3,是本申请实施例视频内容处理系统的结构示意图。本申请实施例视频内容处理系统40包括:Please refer to Figure 3, which is a schematic structural diagram of a video content processing system according to an embodiment of the present application. The video content processing system 40 in this embodiment of the present application includes:
多模态信息提取模块41:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module 41: used to extract audio signals and text information in the video to be processed;
视频类型判断模块42:用于提取待处理视频中的视频图像,对视频图像进行图像分析,判断待处理视频的视频类型;视频类型包括PPT视频、单人视频以及多人视频;Video type judgment module 42: used to extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; video types include PPT videos, single-person videos, and multi-person videos;
视频剪辑模块43：用于基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module 43: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
请参阅图4,为本申请实施例的终端结构示意图。该终端50包括处理器51、与处理器51耦接的存储器52。Please refer to Figure 4, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
存储器52存储有用于实现上述视频内容处理方法的程序指令。The memory 52 stores program instructions for implementing the above video content processing method.
处理器51用于执行存储器52存储的程序指令以执行视频内容处理操作。The processor 51 is configured to execute program instructions stored in the memory 52 to perform video content processing operations.
其中,处理器51还可以称为CPU(Central Processing Unit,中央处理单元)。处理器51可能是一种集成电路芯片,具有信号的处理能力。处理器51还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现成可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 51 may also be called a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component . A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
请参阅图5,图5为本申请实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有方法的程序文件61,其中,该程序文件61可以以软件产品的形式存储在上述存储介质中,所述读存储介质可以是非易失性,也可以是易失性,其包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质,或者是计算机、服务器、手机、平板等终端设备。Please refer to FIG. 5 , which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium in the embodiment of the present application stores program files 61 that can implement all the above methods. The program files 61 can be stored in the above storage medium in the form of software products. The read storage medium can be non-volatile or non-volatile. It may be volatile and include a number of instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the various implementation methods of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Various media that can store program code, such as Memory), magnetic disks or optical disks, or terminal devices such as computers, servers, mobile phones, and tablets.
在本申请所提供的几个实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的系统实施例仅仅是示意性的，例如，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into units is only a logical functional division, and other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。以上仅为本申请的实施方式，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware or in the form of software functional units. The above are merely embodiments of the present application and do not limit the patent scope of the present application; any equivalent structural or process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1.一种视频内容处理方法,其中,包括:1. A video content processing method, which includes:
提取待处理视频中的音频信号以及文本信息;Extract audio signals and text information from the video to be processed;
提取所述待处理视频中的视频图像,对所述视频图像进行图像分析,判断所述待处理视频的视频类型;所述视频类型包括PPT视频、单人视频以及多人视频;Extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos, and multi-person videos;
基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。Based on the audio signal and text information, use a multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use a deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the video to be processed.
2.根据权利要求1所述的视频内容处理方法,其中,所述提取待处理视频中的音频信号以及文本信息包括:2. The video content processing method according to claim 1, wherein the extracting audio signals and text information in the video to be processed includes:
将所述待处理视频输入开源框架,通过所述开源框架输出待处理视频的音频信号;Input the video to be processed into an open source framework, and output the audio signal of the video to be processed through the open source framework;
对所述音频信号进行语音转文字处理,生成待处理视频的文本信息。Perform speech-to-text processing on the audio signal to generate text information of the video to be processed.
3.根据权利要求2所述的视频内容处理方法,其中,所述对音频信号进行语音转文字处理具体为:3. The video content processing method according to claim 2, wherein the speech-to-text processing of the audio signal is specifically:
对所述音频信号进行语音特征提取,将所述语音特征输入训练好的声学模型,通过所述声学模型输出对应的概率得分;Extract speech features from the audio signal, input the speech features into a trained acoustic model, and output the corresponding probability score through the acoustic model;
基于所述声学模型的输出结果,根据搜索和匹配策略从训练好的语言模型中搜索出与所述音频信号相匹配的文本,输出所述待处理视频的文本信息识别结果。Based on the output result of the acoustic model, the text matching the audio signal is searched from the trained language model according to the search and matching strategy, and the text information recognition result of the video to be processed is output.
4.根据权利要求1所述的视频内容处理方法,其中,所述判断所述待处理视频的视频类型具体为:4. The video content processing method according to claim 1, wherein the determining the video type of the video to be processed is specifically:
将提取的视频图像输入开源框架,通过所述开源框架抽取所述视频图像的帧画面;Enter the extracted video image into an open source framework, and extract frames of the video image through the open source framework;
对每一幅帧画面进行分类,得到所述待处理视频的视频类型。Classify each frame to obtain the video type of the video to be processed.
5.根据权利要求1至4任一项所述的视频内容处理方法，其中，当所述待处理视频为PPT视频时，所述基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：5. The video content processing method according to any one of claims 1 to 4, wherein, when the video to be processed is a PPT video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
提取所述PPT视频中的PPT文字信息;Extract PPT text information from the PPT video;
基于所述文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，并筛选出相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与设定的全局关键词的相似度，将相似度小于设定的第二相似度阈值的PPT页面丢弃；Based on the text information and the PPT text information, use the multi-modal video processing model to calculate the similarity between every two adjacent PPT pages and keep the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each PPT page and the set global keywords, and discard the PPT pages whose similarity is less than the set second similarity threshold;
对筛选后的PPT页面进行拼接,得到所述PPT视频的精华片段以及所述精华片段在PPT视频中的时间戳;Splice the filtered PPT pages to obtain the essence of the PPT video and the timestamp of the essence in the PPT video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成精华片段的标题以及摘要的第二个字;重复上述过程,得到所述精华片段对应的标题、摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment and The second word of the summary; repeat the above process to obtain the title and summary information corresponding to the essence fragment;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算所述每个词语与所述PPT视频主题的相关性,得到所述精华片段对应的标签信息。Enter the text information and title together into a deep neural network model, calculate the weight of each word in the text information and title through the deep neural network model, and calculate the correlation between each word and the PPT video theme, Obtain the label information corresponding to the essence fragment.
6.根据权利要求5所述的视频内容处理方法,其中,当所述待处理视频为单人视频时,所述基于音频信号以及文本信息,利用多模态视频处理模型提取不同类型的待处理视频中的精华片段,并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为:6. The video content processing method according to claim 5, wherein when the video to be processed is a single-person video, the multi-modal video processing model is used to extract different types of video to be processed based on audio signals and text information. Essence clips in the video, and a deep neural network model is used to extract the title, summary and label information corresponding to the highlights clips, specifically as follows:
根据所述文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；According to the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述单人视频的精华片段以及所述精华片段在单人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述单人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are input into the deep neural network model together to generate the title of the essence segment. and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlights of the single video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与所述单人视频主题的相关性,得到所述精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the single-person video topic is calculated, and we get Tag information of the essence fragment.
7.根据权利要求6所述的视频内容处理方法，其中，所述当所述待处理视频为多人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：7. The video content processing method according to claim 6, wherein, when the video to be processed is a multi-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
对所述多人视频的音频信息进行声纹识别处理,得到所述多人视频中每个发声人的声纹识别匹配结果;Perform voiceprint recognition processing on the audio information of the multi-person video to obtain the voiceprint recognition matching results of each speaker in the multi-person video;
基于所述视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；Based on the video images and the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述多人视频的精华片段以及精华片段在多人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video;
将所述文本信息输入深度神经网络模型，生成所述精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成所述精华片段的标题以及摘要的第二个字，重复上述过程，得到所述多人视频的精华片段对应的标题和摘要信息；Input the text information into the deep neural network model to generate the first word of the title and summary of the highlight clip, then input the generated first word together with the text information into the deep neural network model to generate the second word of the title and summary, and repeat this process to obtain the title and summary information corresponding to the highlight clips of the multi-person video;
将所述文本信息和标题一起输入深度神经网络模型，通过所述深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得出多人视频的精华片段的标签信息。Input the text information and the title together into the deep neural network model, calculate, through the deep neural network model, the weight of each word in the text information and title and the relevance of each word to the current video topic, and obtain the tag information of the highlight clips of the multi-person video.
8.一种视频内容处理系统,其中,包括:8. A video content processing system, including:
多模态信息提取模块:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module: used to extract audio signals and text information from the video to be processed;
视频类型判断模块：用于提取所述待处理视频中的视频图像，对视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；Video type judgment module: used to extract video images from the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos and multi-person videos;
视频剪辑模块：用于基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
9.一种终端，其中，所述终端包括处理器、与所述处理器耦接的存储器，所述存储器中存储有计算机可读指令，所述计算机可读指令被所述处理器执行时，使得所述处理器执行如下步骤：提取待处理视频中的音频信号以及文本信息；提取所述待处理视频中的视频图像，对所述视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。9. A terminal, wherein the terminal comprises a processor and a memory coupled to the processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: extracting an audio signal and text information from a video to be processed; extracting video images from the video to be processed, performing image analysis on the video images, and determining the video type of the video to be processed, the video types including a PPT video, a single-person video and a multi-person video; and, based on the audio signal and text information, extracting highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extracting the titles, summaries and tag information corresponding to the highlight clips by using a deep neural network model, and generating a short-video clipping result of the video to be processed.
10.根据权利要求9所述的终端,其中,所述提取待处理视频中的音频信号以及文本信息包括:10. The terminal according to claim 9, wherein the extracting audio signals and text information in the video to be processed includes:
将所述待处理视频输入开源框架,通过所述开源框架输出待处理视频的音频信号;Input the video to be processed into an open source framework, and output the audio signal of the video to be processed through the open source framework;
对所述音频信号进行语音转文字处理,生成待处理视频的文本信息。Perform speech-to-text processing on the audio signal to generate text information of the video to be processed.
11.根据权利要求10所述的终端,其中,所述对音频信号进行语音转文字处理具体为:11. The terminal according to claim 10, wherein the speech-to-text processing of the audio signal is specifically:
对所述音频信号进行语音特征提取,将所述语音特征输入训练好的声学模型,通过所述声学模型输出对应的概率得分;Extract speech features from the audio signal, input the speech features into a trained acoustic model, and output the corresponding probability score through the acoustic model;
基于所述声学模型的输出结果,根据搜索和匹配策略从训练好的语言模型中搜索出与所述音频信号相匹配的文本,输出所述待处理视频的文本信息识别结果。Based on the output result of the acoustic model, the text matching the audio signal is searched from the trained language model according to the search and matching strategy, and the text information recognition result of the video to be processed is output.
12.根据权利要求9至11任一项所述的终端，其中，当所述待处理视频为PPT视频时，所述基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：12. The terminal according to any one of claims 9 to 11, wherein, when the video to be processed is a PPT video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
提取所述PPT视频中的PPT文字信息;Extract PPT text information from the PPT video;
基于所述文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，并筛选出相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与设定的全局关键词的相似度，将相似度小于设定的第二相似度阈值的PPT页面丢弃；Based on the text information and the PPT text information, use the multi-modal video processing model to calculate the similarity between every two adjacent PPT pages and keep the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each PPT page and the set global keywords, and discard the PPT pages whose similarity is less than the set second similarity threshold;
对筛选后的PPT页面进行拼接,得到所述PPT视频的精华片段以及所述精华片段在PPT视频中的时间戳;Splice the filtered PPT pages to obtain the essence of the PPT video and the timestamp of the essence in the PPT video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成精华片段的标题以及摘要的第二个字;重复上述过程,得到所述精华片段对应的标题、摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment and The second word of the summary; repeat the above process to obtain the title and summary information corresponding to the essence fragment;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算所述每个词语与所述PPT视频主题的相关性,得到所述精华片段对应的标签信息。Enter the text information and title together into a deep neural network model, calculate the weight of each word in the text information and title through the deep neural network model, and calculate the correlation between each word and the PPT video theme, Obtain the label information corresponding to the essence fragment.
13.根据权利要求12所述的终端，其中，当所述待处理视频为单人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：13. The terminal according to claim 12, wherein, when the video to be processed is a single-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
根据所述文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；According to the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述单人视频的精华片段以及所述精华片段在单人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述单人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are input into the deep neural network model together to generate the title of the essence segment. and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlights of the single video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与所述单人视频主题的相关性,得到所述精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the single-person video topic is calculated, and we get Tag information of the essence fragment.
14.根据权利要求13所述的终端，其中，所述当所述待处理视频为多人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：14. The terminal according to claim 13, wherein, when the video to be processed is a multi-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
对所述多人视频的音频信息进行声纹识别处理,得到所述多人视频中每个发声人的声纹识别匹配结果;Perform voiceprint recognition processing on the audio information of the multi-person video to obtain the voiceprint recognition matching results of each speaker in the multi-person video;
基于所述视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；Based on the video images and the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述多人视频的精华片段以及精华片段在多人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述多人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment. For the title and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlight clips of the multi-person video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与当前视频主题的相关性,得到所述多人视频的精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the current video topic is calculated to obtain the multiple Label information of the highlight clips of people’s videos.
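As an illustration of the voiceprint matching step in claim 14 above, speaker segments could be matched against enrolled voiceprints by cosine similarity of speaker embeddings; the embedding extractor is left abstract because the claim does not name one, and the threshold is an assumption.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_speakers(segment_embeddings, enrolled, threshold=0.7):
    """segment_embeddings: list of (start, end, vector); enrolled: {name: vector}.
    Returns one (start, end, matched-name-or-'unknown', score) tuple per segment."""
    results = []
    for start, end, emb in segment_embeddings:
        name, score = max(((n, cosine(emb, v)) for n, v in enrolled.items()),
                          key=lambda x: x[1])
        results.append((start, end, name if score >= threshold else "unknown", score))
    return results
```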
15. A storage medium storing a program file capable of implementing the following steps: extracting an audio signal and text information from a video to be processed; extracting video images from the video to be processed, and performing image analysis on the video images to determine the video type of the video to be processed, the video types including PPT video, single-person video and multi-person video; and, based on the audio signal and the text information, extracting highlight segments from different types of videos to be processed using a multi-modal video processing model, extracting the title, summary and tag information corresponding to the highlight segments using a deep neural network model, and generating a short-video clipping result of the video to be processed.
16. The storage medium according to claim 15, wherein the extracting of the audio signal and the text information from the video to be processed comprises:
inputting the video to be processed into an open-source framework, and outputting the audio signal of the video to be processed through the open-source framework;
performing speech-to-text processing on the audio signal to generate the text information of the video to be processed.
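For illustration, the audio signal could be separated from the video with an open-source framework as recited in claim 16; the claim does not name a framework, so the ffmpeg command-line tool is used here purely as an example, and the 16 kHz mono PCM output is a common speech-recognition input format rather than a requirement of the claim.

```python
import subprocess

def extract_audio(video_path, wav_path, sample_rate=16000):
    # Assumes ffmpeg is installed on the host; writes mono 16-bit PCM audio.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ar", str(sample_rate), "-ac", "1", wav_path],
        check=True,
    )
```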
17. The storage medium according to claim 16, wherein the speech-to-text processing of the audio signal specifically comprises:
extracting speech features from the audio signal, inputting the speech features into a trained acoustic model, and outputting corresponding probability scores through the acoustic model;
based on the output of the acoustic model, searching a trained language model, according to a search-and-matching strategy, for the text that matches the audio signal, and outputting the text information recognition result of the video to be processed.
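To make the search-and-matching step of claim 17 concrete, a simple beam search that combines acoustic-model scores with language-model scores is sketched below; the claim does not fix a particular strategy, so both models are passed in as abstract scoring functions.

```python
def decode(acoustic_scores, lm_score, beam_width=3):
    """acoustic_scores: one dict {word: log-prob} per time step from the acoustic
    model; lm_score(prev_word, word) -> log-prob from the language model.
    Returns the highest-scoring word sequence under a toy beam search."""
    beams = [([], 0.0)]
    for step in acoustic_scores:
        candidates = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for word, am in step.items():
                candidates.append((words + [word], score + am + lm_score(prev, word)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]
```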
18. The storage medium according to any one of claims 15 to 17, wherein, when the video to be processed is a PPT video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
extracting the PPT text information from the PPT video;
calculating, based on the text information and the PPT text information, the similarity between every two adjacent PPT pages using the multi-modal video processing model, and retaining the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each PPT page and the set global keywords, and discarding PPT pages whose similarity is less than the set second similarity threshold;
splicing the retained PPT pages to obtain a highlight segment of the PPT video and the timestamp of the highlight segment within the PPT video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the PPT video, so as to obtain the tag information corresponding to the highlight segment.
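The word-by-word title and summary generation loop recited in claims 13, 14 and 18 can be illustrated by the greedy sketch below; the generation model is left abstract (its next_word interface is an assumption, not a real library call), since the claims describe the loop only functionally.

```python
def generate(model, text, max_len=30, eos="</s>"):
    """Greedy, word-by-word generation: the model is repeatedly fed the source
    text plus everything generated so far and returns the next word."""
    out = []
    for _ in range(max_len):
        word = model.next_word(text, out)   # assumed interface of a seq2seq model
        if word == eos:
            break
        out.append(word)
    return " ".join(out)
```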
19. The storage medium according to claim 18, wherein, when the video to be processed is a single-person video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
calculating, according to the text information, the similarity between every two adjacent frames using the multi-modal video processing model, and retaining the frames whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each frame and the global keywords, and discarding frames whose similarity is less than the second similarity threshold;
splicing the retained frames to obtain a highlight segment of the single-person video and the timestamp of the highlight segment within the single-person video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment of the single-person video;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the single-person video, so as to obtain the tag information of the highlight segment.
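As a toy stand-in for the tag step that closes claims 13, 18 and 19, the sketch below weights each word by frequency (title words counted extra) and scales the weight by a caller-supplied topic-relevance score in [0, 1]; in the claims both quantities come from the deep neural network model, so this is only an illustration of how the two scores combine.

```python
from collections import Counter

def extract_tags(text_words, title_words, topic_relevance, top_k=5):
    """Rank candidate tags by word weight multiplied by topic relevance."""
    weights = Counter(text_words)
    for w in title_words:
        weights[w] += 2            # assumed extra weight for words in the title
    scored = {w: c * topic_relevance(w) for w, c in weights.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda x: x[1], reverse=True)[:top_k]]
```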
20. The storage medium according to claim 19, wherein, when the video to be processed is a multi-person video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
performing voiceprint recognition on the audio information of the multi-person video to obtain a voiceprint matching result for each speaker in the multi-person video;
calculating, based on the video images and the text information, the similarity between every two adjacent frames using the multi-modal video processing model, and retaining the frames whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each frame and the global keywords, and discarding frames whose similarity is less than the second similarity threshold;
splicing the retained frames to obtain a highlight segment of the multi-person video and the timestamp of the highlight segment within the multi-person video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment of the multi-person video;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the current video, so as to obtain the tag information of the highlight segment of the multi-person video.
 
PCT/CN2022/089559 2022-03-16 2022-04-27 Video content processing method and system, and terminal and storage medium WO2023173539A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210259504.XA CN114598933B (en) 2022-03-16 2022-03-16 Video content processing method, system, terminal and storage medium
CN202210259504.X 2022-03-16

Publications (1)

Publication Number Publication Date
WO2023173539A1 true WO2023173539A1 (en) 2023-09-21

Family

ID=81808756

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089559 WO2023173539A1 (en) 2022-03-16 2022-04-27 Video content processing method and system, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114598933B (en)
WO (1) WO2023173539A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment
CN116453023B (en) * 2023-04-23 2024-01-26 上海帜讯信息技术股份有限公司 Video abstraction system, method, electronic equipment and medium for 5G rich media information

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11220689A (en) * 1998-01-31 1999-08-10 Media Link System:Kk Video software processor and medium for storing its program
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN112532897A (en) * 2020-11-25 2021-03-19 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN112464814A (en) * 2020-11-27 2021-03-09 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding

Also Published As

Publication number Publication date
CN114598933B (en) 2022-12-27
CN114598933A (en) 2022-06-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931566

Country of ref document: EP

Kind code of ref document: A1