WO2022002214A1 - Video editing method and apparatus, computer-readable storage medium, and camera - Google Patents

Video editing method and apparatus, computer-readable storage medium, and camera

Info

Publication number
WO2022002214A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
sound source
source target
sound
Prior art date
Application number
PCT/CN2021/104072
Other languages
English (en)
French (fr)
Inventor
符峥
蔡锦霖
姜文杰
Original Assignee
影石创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 影石创新科技股份有限公司
Publication of WO2022002214A1

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S11/00Systems for determining distance or velocity not using reflection or reradiation
    • G01S11/14Systems for determining distance or velocity not using reflection or reradiation using ultrasonic, sonic, or infrasonic waves
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Definitions

  • The present application belongs to the field of video processing, and in particular relates to a video editing method, apparatus, computer-readable storage medium, and camera.
  • With the continuous development of camera and sound-acquisition hardware, video conferencing systems have gradually become an important channel for communication in people's daily work.
  • In a video conference it is usually necessary to obtain the presenter's image and sound at the same time, play them simultaneously on a playback device and record them to a storage device, or convert the sound into text for easy organization. Since there may be multiple presenters in a conference, the video conferencing system usually needs to be able to obtain video images and sound from different angles.
  • A traditional video conferencing system collects video images with multiple cameras and is equipped with multiple microphones to capture sound.
  • On the playback device of the video conferencing system, in order to let participants watching the conference video focus on the presenter, the video needs to be converted into a plane video centered on the presenter, and the presenter's content is recorded; when the presenter changes from one person to another, the video viewpoint also switches to the new presenter. This process is generally achieved through sound source localization.
  • Existing solutions collect the sound of the conference site with a microphone array and localize the sound source to obtain its position, which is costly; in indoor environments, microphone placement, environmental noise, and reverberation cause localization errors that degrade the editing result and the user experience. Embodiments of the present application provide a video editing method, apparatus, computer-readable storage medium, computer device, and camera, aiming to solve one of these technical problems.
  • A specific solution disclosed by the present application is as follows:
  • In a first aspect, an embodiment of the present application provides a video editing method, the method including:
  • acquiring sound data and video data corresponding to the sound data, and generating a plane video frame corresponding to the sound data; performing target detection on the plane video frame corresponding to the sound data to obtain target information; determining a sound source target according to the sound data and the target information; and generating, according to the sound source target, a clipped plane video including the sound source target.
  • Further, the acquiring of the sound data and the video data corresponding to the sound data is specifically:
  • acquiring sound data and a plane video corresponding to the sound data; or acquiring sound data and a panoramic video corresponding to the sound data, and generating a plane video corresponding to the sound data according to the panoramic video.
  • the target information includes the plane video frame corresponding to the target and the position information of the target;
  • the determining of the sound source target according to the sound data and the target information is specifically:
  • the sound source target is determined according to the sound data and the plane video frame corresponding to the target.
  • obtaining the plane video frame corresponding to the target is specifically:
  • the position information of the target is obtained through a target detection algorithm, and a plane video frame including the target is intercepted from the plane video frame corresponding to the sound data according to the position information of the target and a preset image size.
  • the location information of the target is obtained in the following manner:
  • a target detection algorithm is used to detect all targets in the plane video frame, and each target is represented by a rectangular frame, and the position information of the target is determined by the coordinates of the rectangular frame.
  • the determining of the sound source target according to the sound data and the target information is specifically:
  • the sound data and the plane video frames corresponding to the one or more targets are respectively input into a pre-trained machine learning model, and the machine learning model outputs the sound source target corresponding to the sound data.
  • Further, the acquiring of the sound data and the video data corresponding to the sound data is specifically:
  • acquiring a continuous piece of sound data and the video data corresponding to the sound data; after the clipped plane video including the sound source target is generated, the method further includes:
  • acquiring the plane video frame at the current moment, and judging whether it has corresponding sound data; if so, returning to the target detection step; if not, generating a clipped plane video including the sound source target according to the sound source target determined at the previous moment.
  • Further, the generation of the clipped plane video including the sound source target according to the sound source target, or according to the sound source target determined at the previous moment, is specifically:
  • determining the plane video frame corresponding to the sound source target according to the sound source target, and clipping it as the video frames of the clipped video to generate a clipped plane video including the sound source target;
  • or, determining the position information of the sound source target according to the sound source target, and generating a clipped plane video including the sound source target according to that position information.
  • the clipping of the plane video frame corresponding to the sound source target as the video frame of the clipping video is specifically:
  • the plane video frames corresponding to the sound source target at each moment are spliced in sequence to generate a clip plane video.
  • Further, the splicing of the plane video frames corresponding to the sound source target at each moment in sequence to generate the clipped plane video is specifically:
  • the plane video frames corresponding to the sound source target at each moment are spliced in sequence; during editing, the frames are scaled so that all plane video frames corresponding to the sound source target are equal in size, and areas that the frames cannot cover are filled with black pixels, generating a clipped plane video.
  • the generation of the clipped planar video including the sound source target according to the position information of the sound source target is specifically:
  • according to the position information of the sound source target, projective transformation and clipping are performed on the plane video frame so that the sound source target is located at the center of the video picture, generating a clipped plane video.
  • the method further includes:
  • a target tracking algorithm is used to assign a unique identity mark to each target, monitor all targets, track the position change of each target, and record each target's unique identity mark and corresponding position information;
  • the determining of the location information of the sound source target according to the sound source target is specifically:
  • the position information of the sound source target is determined according to the recorded unique identification mark and corresponding position information of each target and the unique identification mark corresponding to the sound source target.
  • an embodiment of the present application provides a video editing device, and the device includes:
  • a generation module for acquiring sound data and video data corresponding to the sound data, and generating a plane video frame corresponding to the sound data;
  • a target detection module, configured to perform target detection on the plane video frame corresponding to the sound data to obtain target information;
  • a sound source target determination module, configured to determine a sound source target according to the sound data and the target information;
  • a clipping module, configured to generate, according to the sound source target, a clipped planar video including the sound source target.
  • The present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps of the video editing method described above are implemented.
  • An embodiment of the present application provides a computer device, including:
  • one or more processors, a memory, and one or more computer programs;
  • when the processor executes the computer program, the steps of the video editing method are implemented.
  • An embodiment of the present application provides a camera, including:
  • one or more processors, a memory, and one or more computer programs;
  • when the processor executes the computer program, the steps of the video editing method are implemented.
  • Target information is obtained by performing target detection on the plane video frame corresponding to the sound data; a sound source target is determined according to the sound data and the target information; and a clipped planar video including the sound source target is generated according to the sound source target. The application is therefore easy to implement; in noisy environments such as indoors, the influence of environmental noise and indoor reverberation on localization is reduced, localization accuracy is high, robustness is strong, and automatic editing can be performed based on the sound source target with a good editing result.
  • When locating the sound source target, the embodiments of the present application need only one microphone for accurate localization; the cost is low, greatly reducing the difficulty and cost of video editing.
  • FIG. 1 is a schematic diagram of an application scenario of a video editing method provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a video editing method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a video editing apparatus provided by an embodiment of the present application.
  • FIG. 4 is a specific structural block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 5 is a specific structural block diagram of a camera provided by an embodiment of the present application.
  • An application scenario of the video editing method provided by an embodiment of the present application may be a computer device or a camera.
  • the computer device or camera executes the video editing method provided by an embodiment of the present application to generate a clipped planar video including a sound source target.
  • An application scenario of the video editing method provided by an embodiment of the present application may also include a connected computer device 100 and a camera 200 (as shown in FIG. 1). At least one application program may run on the computer device 100 and the camera 200.
  • the computer device 100 may be a desktop computer, a mobile terminal, and the like, and the mobile terminal includes a mobile phone, a tablet computer, a notebook computer, a personal digital assistant, and the like.
  • the camera 200 may be an ordinary camera or a panoramic camera or the like.
  • a common camera refers to a photographing device for taking flat images and flat videos.
  • The computer device 100 or the camera 200 executes the video editing method provided by an embodiment of the present application to generate a clipped planar video including a sound source target.
  • FIG. 2 is a flowchart of a video editing method provided by an embodiment of the present application.
  • This embodiment mainly takes the application of the video editing method to a computer device or a camera as an example.
  • The video editing method provided by an embodiment of the present application includes the following steps:
  • Acquiring the sound data and the video data corresponding to the sound data may specifically be:
  • acquiring sound data and a plane video corresponding to the sound data; or acquiring sound data and a panoramic video corresponding to the sound data, and generating a plane video corresponding to the sound data according to the panoramic video.
  • the panoramic video is an original spherical panoramic video shot by a panoramic camera or generated by computer software.
  • the generating of the planar video corresponding to the sound data according to the panoramic video may specifically include: converting the original spherical panoramic video into a panoramic planar video corresponding to the sound data.
  • the converting the original spherical panoramic video into the panoramic plane video corresponding to the sound data may specifically be: projecting the original spherical panoramic video onto a plane to obtain a panoramic plane video corresponding to the sound data.
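The spherical-to-plane projection described above can be sketched as a pixel mapping. A minimal illustration, assuming an equirectangular layout (a common but here unstated choice); the function name and interface are hypothetical, and a real panoramic pipeline would also handle lens calibration and stitching:

```python
import numpy as np

def spherical_to_equirectangular(width, height):
    """Map each pixel of a width x height equirectangular plane back to
    spherical angles (longitude, latitude). Sampling the sphere at these
    angles yields the panoramic plane video frame."""
    # Longitude spans [-pi, pi) across the width,
    # latitude spans [-pi/2, pi/2) across the height.
    xs = (np.arange(width) + 0.5) / width    # normalized column centers
    ys = (np.arange(height) + 0.5) / height  # normalized row centers
    lon = (xs - 0.5) * 2.0 * np.pi           # shape (width,)
    lat = (ys - 0.5) * np.pi                 # shape (height,)
    return lon, lat
```

Each output pixel is then filled by sampling the spherical video at its (longitude, latitude) pair.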
  • The panoramic camera is a panoramic camera with a microphone, and one microphone is sufficient; the sound data is the original sound data captured by the panoramic camera's microphone.
  • Since only one microphone is needed, the cost is low, which greatly reduces the difficulty and cost of configuring a conference system.
  • the shooting scene can be set as a meeting scene, or of course, it can be any other scene.
  • the target is a person or object in the panoramic plane video frame; the target information includes the plane video frame corresponding to the target and the position information of the target.
  • the plane video frame corresponding to the target refers to all plane video frames containing people, and the position information of the target refers to the position information of all people.
  • Acquiring the plane video frame corresponding to the target may specifically be:
  • a target detection algorithm, such as the HOG algorithm (Histogram of Oriented Gradients) or a CNN (Convolutional Neural Network) algorithm, is used to obtain the position information of the target, and a plane video frame including the target is intercepted from the plane video frame corresponding to the sound data according to the position information of the target and the preset image size.
  • the preset image size can be a common image resolution, such as 640 x 480; 1024 x 768; 1600 x 1200; 2048 x 1536, etc.
  • The HOG algorithm describes the features of a local target region well and is a commonly used feature extraction method;
  • a CNN usually includes a data input layer, convolution layers, ReLU activation layers, pooling layers, and fully connected layers (INPUT-CONV-RELU-POOL-FC), and is a neural network in which convolution operations replace traditional matrix multiplication.
  • the location information of the target can be obtained in the following ways:
  • a target detection algorithm is used to detect all targets in the plane video frame, and each target is represented by a rectangular frame, and the position information of the target is determined by the coordinates of the rectangular frame.
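The interception step above — cutting a preset-size window around a detected box — can be sketched as follows. The detector itself is out of scope here (any HOG or CNN detector that returns `(x1, y1, x2, y2)` boxes would do); the function name and clamping behaviour are illustrative assumptions:

```python
import numpy as np

def crop_around_box(frame, box, out_w=640, out_h=480):
    """Cut a preset-size window centered on a detected target box.
    `box` is (x1, y1, x2, y2) from any detector; the window is clamped
    so it stays inside the frame."""
    h, w = frame.shape[:2]
    cx = (box[0] + box[2]) // 2
    cy = (box[1] + box[3]) // 2
    x1 = min(max(cx - out_w // 2, 0), max(w - out_w, 0))
    y1 = min(max(cy - out_h // 2, 0), max(h - out_h, 0))
    return frame[y1:y1 + out_h, x1:x1 + out_w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
patch = crop_around_box(frame, (900, 500, 1000, 700))
# patch has the preset 640 x 480 size regardless of box size
```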
  • S103 may specifically be:
  • the sound source target is determined according to the sound data and the plane video frame corresponding to the target.
  • S103 can also be specifically:
  • the sound data and one or more flat video frames corresponding to the targets are respectively input into a pre-trained machine learning model (eg, a CNN model), and the machine learning model outputs sound source targets corresponding to the sound data.
  • For example, in a conference scene, the sound source target is the presenter, and a clipped planar video including the presenter is generated according to the presenter.
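Determining the sound source target from the sound data and the candidate frames amounts to scoring each (audio, frame) pair and keeping the best match. A minimal sketch in which `score_fn` stands in for the pre-trained model's forward pass; the names and the toy scoring rule are illustrative assumptions, not the patent's actual model:

```python
def pick_sound_source(audio_feat, candidates, score_fn):
    """Score every (audio, candidate-frame) pair and return the identity
    whose frame the model judges most likely to match the sound."""
    best_id, best_score = None, float("-inf")
    for target_id, frame in candidates.items():
        s = score_fn(audio_feat, frame)
        if s > best_score:
            best_id, best_score = target_id, s
    return best_id

# toy stand-in: "model" prefers the frame feature closest to the audio feature
speaker = pick_sound_source(
    0.8,
    {"person1": 0.2, "person2": 0.7},
    score_fn=lambda a, f: -abs(a - f),
)
```

In the described system, `score_fn` would be a CNN taking the sound data and one plane video frame and emitting a match score.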
  • For application scenarios that require continuity of the video picture, such as a conference scene, there may be periods with no sound; to keep the picture continuous, the presenter can be locked during silence, i.e., video editing is performed with the sound source target determined at the previous moment. Therefore, the acquiring of the sound data and the video data corresponding to the sound data is specifically: acquiring a continuous piece of sound data and the video data corresponding to the sound data.
  • a continuous piece of sound data refers to sound data recorded by a sound acquisition device such as a microphone in a continuous period of time, such as sound data recorded by a microphone in a continuous period of time from 12:00 to 12:30.
  • After the clipped planar video including the sound source target is generated, the method may further include:
  • acquiring the plane video frame at the current moment, and judging whether it has corresponding sound data; if so, returning to the target detection step; if not, generating a clipped planar video including the sound source target according to the sound source target determined at the previous moment.
  • Because the clipped planar video including the sound source target is generated according to the sound source target determined at the previous moment when the planar video frame has no corresponding sound data, the continuity of the video picture can be maintained and the presenter stays locked.
  • Alternatively, the speaker may simply not be locked during periods without sound.
  • Alternatively, the panoramic video can be converted into a flat video during periods without sound and displayed at a preset rate.
  • The user can also preset a video editing scheme for a specific scene according to the conditions of the venue; this application does not specifically limit this.
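The lock-to-previous-speaker behaviour described above is a small state machine over the timeline: audible frames refresh the speaker decision, silent frames reuse the last one. A sketch with hypothetical names, where `detect_speaker` stands in for the detection plus sound-source-matching steps:

```python
def edit_timeline(frames, has_audio, detect_speaker):
    """Walk the timeline: frames with audio get a fresh speaker decision,
    silent frames reuse the last known speaker so the shot stays locked."""
    clip, current = [], None
    for frame, audible in zip(frames, has_audio):
        if audible:
            current = detect_speaker(frame)
        if current is not None:
            clip.append((frame, current))
    return clip

timeline = edit_timeline(
    ["f0", "f1", "f2"],
    [True, False, True],
    detect_speaker=lambda f: "A" if f != "f2" else "B",
)
# the silent frame "f1" stays with speaker "A" from the previous moment
```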
  • Step S104, or the generation of the clipped planar video including the sound source target according to the sound source target determined at the previous moment, may specifically be:
  • determining the plane video frame corresponding to the sound source target according to the sound source target, and clipping it as the video frames of the clipped video to generate a clipped plane video including the sound source target;
  • a clipped planar video including the sound source target is generated according to the position information of the sound source target.
  • the clipping of the plane video frame corresponding to the sound source target as the video frame of the clipping video may specifically be:
  • the plane video frames corresponding to the sound source target at each moment are spliced in sequence to generate a clip plane video.
  • The splicing of the plane video frames corresponding to the sound source target at each moment in sequence to generate the clipped plane video is specifically:
  • the plane video frames corresponding to the sound source target at each moment are spliced in sequence; during editing, the frames are scaled so that all plane video frames corresponding to the sound source target are equal in size, and areas that the frames cannot cover are filled with black pixels, generating a clipped plane video.
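The scale-then-pad-with-black step is the classic letterbox operation. A dependency-free sketch using nearest-neighbour resizing (an assumption for brevity; a real editor would use a proper interpolating resize):

```python
import numpy as np

def letterbox(frame, out_h, out_w):
    """Scale a frame to fit (out_h, out_w) preserving aspect ratio,
    filling the uncovered area with black pixels."""
    h, w = frame.shape[:2]
    scale = min(out_h / h, out_w / w)
    nh, nw = int(h * scale), int(w * scale)
    # nearest-neighbour resize via integer index maps
    rows = (np.arange(nh) / scale).astype(int)
    cols = (np.arange(nw) / scale).astype(int)
    resized = frame[rows][:, cols]
    canvas = np.zeros((out_h, out_w) + frame.shape[2:], dtype=frame.dtype)
    y0, x0 = (out_h - nh) // 2, (out_w - nw) // 2
    canvas[y0:y0 + nh, x0:x0 + nw] = resized
    return canvas

# frames of different sizes all become 480 x 640 before splicing
clip = [letterbox(f, 480, 640) for f in
        (np.ones((240, 320, 3), np.uint8), np.ones((100, 400, 3), np.uint8))]
```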
  • the generating the clipped planar video including the sound source target according to the position information of the sound source target may specifically be:
  • according to the position information of the sound source target, projective transformation and clipping are performed on the plane video frame so that the sound source target is located at the center of the video picture, generating a clipped plane video.
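As a simplified illustration of centering the target, the sketch below applies only a translation crop, which is the degenerate case of the projective transform mentioned above (a full implementation would warp with a homography). Names and the black-fill convention are assumptions:

```python
import numpy as np

def center_on_target(frame, target_xy, out_h, out_w):
    """Re-frame so the sound-source target sits at the picture centre.
    Areas that fall outside the source frame stay black."""
    h, w = frame.shape[:2]
    tx, ty = target_xy
    out = np.zeros((out_h, out_w) + frame.shape[2:], dtype=frame.dtype)
    # source window around the target, clipped to the frame bounds
    x1, y1 = tx - out_w // 2, ty - out_h // 2
    sx1, sy1 = max(x1, 0), max(y1, 0)
    sx2, sy2 = min(x1 + out_w, w), min(y1 + out_h, h)
    out[sy1 - y1:sy2 - y1, sx1 - x1:sx2 - x1] = frame[sy1:sy2, sx1:sx2]
    return out
```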
  • the location information of the sound source object is determined according to the sound source object.
  • the method may further include:
  • A target tracking algorithm, such as an MOT (Multiple Object Tracking) algorithm, is used to assign a unique identity mark to each target; the unique identity mark can be a symbol such as "Person 1" or "Person 2", or a Re-ID (Person Re-identification) algorithm can retrieve each person's real name, such as "Zhang San" or "Li Si", from a person database;
  • the target tracking algorithm is used to monitor all targets, track the position change of each target, and record each target's unique identification mark and corresponding position information.
  • the determining of the location information of the sound source target according to the sound source target may specifically be:
  • the position information of the sound source target is determined according to the recorded unique identification mark and corresponding position information of each target and the unique identification mark corresponding to the sound source target.
  • the method may further include the following steps:
  • the clipped planar video is combined with the corresponding sound data.
  • Sound data and video data can be combined in time order to achieve audio-video synchronization; the present application does not specifically limit the synchronization method.
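Combining in time order can be sketched as pairing each video frame with the audio chunk covering its timestamp. The record shapes and function name are illustrative assumptions; a real muxer also handles clock drift and gaps:

```python
def mux_by_timestamp(video_frames, audio_chunks):
    """Pair each (t, frame) with the audio chunk whose [start, end)
    interval covers t, producing a time-ordered synchronized stream."""
    out = []
    for t, frame in video_frames:
        samples = next((s for a, b, s in audio_chunks if a <= t < b), None)
        out.append((t, frame, samples))
    return out

synced = mux_by_timestamp(
    [(0.0, "f0"), (0.04, "f1")],
    [(0.0, 0.04, "a0"), (0.04, 0.08, "a1")],
)
```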
  • Target information is obtained by performing target detection on the plane video frame corresponding to the sound data; the sound source target is determined according to the sound data and the target information; and a clipped flat video including the sound source target is generated according to the sound source target. The application is therefore easy to implement; in noisy environments such as indoors, the influence of environmental noise and indoor reverberation on localization is reduced, localization accuracy is high, robustness is strong, and automatic editing can be performed based on the sound source target with a good editing result.
  • When locating the sound source target, the embodiments of the present application need only one microphone for accurate localization; the cost is low, greatly reducing the difficulty and cost of video editing.
  • Embodiment 2:
  • The video editing apparatus provided by an embodiment of the present application may be a computer program or a piece of program code running in a computer device or a panoramic camera; for example, the video editing apparatus is application software. The apparatus can execute the corresponding steps of the video editing method provided by the embodiments of the present application.
  • a video editing device provided by an embodiment of the present application includes:
  • a generation module 11, configured to acquire sound data and video data corresponding to the sound data, and generate a plane video frame corresponding to the sound data;
  • a target detection module 12, configured to perform target detection on the plane video frame corresponding to the sound data to obtain target information;
  • a sound source target determination module 13, configured to determine a sound source target according to the sound data and the target information;
  • a clipping module 14, configured to generate, according to the sound source target, a clipped planar video including the sound source target.
  • the video editing apparatus provided by an embodiment of the present application and the video editing method provided by an embodiment of the present application belong to the same concept, and the specific implementation process thereof is detailed in the full text of the specification, which will not be repeated here.
  • Embodiment 3:
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps of the video editing method provided by an embodiment of the present application are implemented.
  • Embodiment 4:
  • FIG. 4 shows a specific structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device may be the computer device shown in FIG. 1 .
  • A computer device 100 includes one or more processors 101, a memory 102, and one or more computer programs, wherein the processor 101 and the memory 102 are connected by a bus; the one or more computer programs are stored in the memory 102 and configured to be executed by the one or more processors 101, and when the processor 101 executes the computer program, the steps of the video editing method provided by an embodiment of the present application are implemented.
  • the computer equipment may be a desktop computer, a mobile terminal, etc.
  • the mobile terminal includes a mobile phone, a tablet computer, a notebook computer, a personal digital assistant, and the like.
  • Embodiment 5:
  • FIG. 5 shows a specific structural block diagram of a camera provided by an embodiment of the present application.
  • the camera may be the camera shown in FIG. 1 .
  • A camera 200 includes one or more processors 201, a memory 202, and one or more computer programs, wherein the processor 201 and the memory 202 are connected by a bus; the one or more computer programs are stored in the memory 202 and configured to be executed by the one or more processors 201.
  • When the processor 201 executes the computer program, the steps of the video editing method provided by an embodiment of the present application are implemented.
  • the camera 200 may be an ordinary camera or a panoramic camera or the like.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Studio Devices (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application is applicable to the field of video processing, and provides a video editing method and apparatus, a computer-readable storage medium, and a camera. The video editing method includes: acquiring sound data and video data corresponding to the sound data, and generating a plane video frame corresponding to the sound data; performing target detection on the plane video frame corresponding to the sound data to obtain target information; determining a sound source target according to the sound data and the target information; and generating, according to the sound source target, a clipped plane video including the sound source target. The application is easy to implement; in noisy environments such as indoors, it reduces the influence of environmental noise and indoor reverberation on localization, with high localization accuracy and strong robustness, and it supports automatic editing based on the sound source target with a good editing result. In addition, when locating the sound source target, the embodiments of the present application need only one microphone for accurate localization, which lowers cost and greatly reduces the difficulty and cost of video editing.

Description

Video editing method and apparatus, computer-readable storage medium, and camera

Technical field
The present application belongs to the field of video processing, and in particular relates to a video editing method and apparatus, a computer-readable storage medium, and a camera.
Background
With the continuous development of camera and sound-acquisition hardware, video conferencing systems have gradually become an important channel for communication in people's daily life and work. In a video conference it is usually necessary to obtain the presenter's image and sound at the same time, play them simultaneously on a playback device and record them to a storage device, or convert the sound into text for easy organization. Since a conference may have multiple presenters, a video conferencing system usually needs to obtain video images and sound from different angles.
A traditional video conferencing system collects video images with multiple cameras and is equipped with multiple microphones to capture sound. On the playback device, in order to let participants watching the conference video focus on the presenter, the video needs to be converted into a plane video centered on the presenter, and the presenter's content is recorded; when the presenter changes from one person to another, the video viewpoint also switches to the new presenter. This process is generally achieved through sound source localization.
Technical problem
Existing solutions collect the sound of the conference site with a microphone array and localize the sound source to obtain its position information. This is costly, and in indoor environments, factors such as microphone placement, environmental noise, and indoor reverberation cause localization errors, degrading the video editing result and the user experience. The embodiments of the present application provide a video editing method and apparatus, a computer-readable storage medium, a computer device, and a camera, aiming to solve one of the above technical problems.
Technical solution
A specific solution disclosed by the present application is as follows:
In a first aspect, an embodiment of the present application provides a video editing method, the method including:
acquiring sound data and video data corresponding to the sound data, and generating a plane video frame corresponding to the sound data;
performing target detection on the plane video frame corresponding to the sound data to obtain target information;
determining a sound source target according to the sound data and the target information;
generating, according to the sound source target, a clipped plane video including the sound source target.
Further, the acquiring of sound data and video data corresponding to the sound data is specifically:
acquiring sound data and a plane video corresponding to the sound data;
or, acquiring sound data and a panoramic video corresponding to the sound data;
and generating a plane video corresponding to the sound data according to the panoramic video.
Further, the target information includes the plane video frame corresponding to the target and the position information of the target;
the determining a sound source target according to the sound data and the target information is specifically:
determining the sound source target according to the sound data and the plane video frame corresponding to the target.
Further, acquiring the plane video frame corresponding to the target is specifically:
obtaining the position information of the target through a target detection algorithm, and intercepting a plane video frame including the target from the plane video frame corresponding to the sound data according to the position information of the target and a preset image size.
Further, the position information of the target is obtained as follows:
a target detection algorithm is used to detect all targets in the plane video frame, each target is represented by a rectangular box, and the position information of the target is determined by the coordinates of the rectangular box.
Further, the determining a sound source target according to the sound data and the target information is specifically:
inputting the sound data and the plane video frames corresponding to one or more of the targets into a pre-trained machine learning model, which outputs the sound source target corresponding to the sound data.
Further, the acquiring of sound data and video data corresponding to the sound data is specifically:
acquiring a continuous piece of sound data and video data corresponding to the sound data;
after the generating, according to the sound source target, of a clipped plane video including the sound source target, the method further includes:
acquiring the plane video frame at the current moment;
judging whether the plane video frame at the current moment has corresponding sound data; if so, returning to the step of performing target detection on the plane video frame corresponding to the sound data; if not, generating a clipped plane video including the sound source target according to the sound source target determined at the previous moment.
Further, the generating of a clipped plane video including the sound source target according to the sound source target, or according to the sound source target determined at the previous moment, is specifically:
determining the plane video frame corresponding to the sound source target according to the sound source target;
clipping the plane video frame corresponding to the sound source target as the video frames of the clipped video, and generating a clipped plane video including the sound source target;
or, determining the position information of the sound source target according to the sound source target;
and generating a clipped plane video including the sound source target according to the position information of the sound source target.
Further, the clipping of the plane video frame corresponding to the sound source target as the video frames of the clipped video is specifically:
splicing the plane video frames corresponding to the sound source target at each moment in sequence to generate a clipped plane video.
Further, the splicing of the plane video frames corresponding to the sound source target at each moment in sequence to generate a clipped plane video is specifically:
splicing the plane video frames corresponding to the sound source target at each moment in sequence; during editing, the plane video frames corresponding to the sound source target are scaled so that all of them are equal in size, and areas that cannot be covered by the plane video frame corresponding to the sound source target are filled with black pixels, generating a clipped plane video.
Further, the generating of a clipped plane video including the sound source target according to the position information of the sound source target is specifically:
performing projective transformation and clipping on the plane video frame according to the position information of the sound source target so that the sound source target is located at the center of the video picture, generating a clipped plane video.
Further, after the performing of target detection on the plane video frame corresponding to the sound data, the method further includes:
using a target tracking algorithm to assign a unique identity mark to each target;
using the target tracking algorithm to monitor all targets, track the position change of each target, and record the unique identity mark and corresponding position information of each target;
the determining of the position information of the sound source target according to the sound source target is specifically:
determining the position information of the sound source target according to the recorded unique identity mark and corresponding position information of each target and the unique identity mark corresponding to the sound source target.
In a second aspect, an embodiment of the present application provides a video editing apparatus, the apparatus including:
a generation module, configured to acquire sound data and video data corresponding to the sound data, and generate a plane video frame corresponding to the sound data;
a target detection module, configured to perform target detection on the plane video frame corresponding to the sound data to obtain target information;
a sound source target determination module, configured to determine a sound source target according to the sound data and the target information;
a clipping module, configured to generate, according to the sound source target, a clipped plane video including the sound source target.
The present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps of the video editing method described above are implemented.
In a third aspect, an embodiment of the present application provides a computer device, including:
one or more processors;
a memory; and
one or more computer programs, wherein the processor and the memory are connected by a bus, the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and when the processor executes the computer program, the steps of the video editing method described above are implemented.
In a fourth aspect, an embodiment of the present application provides a camera, including:
one or more processors;
a memory; and
one or more computer programs, wherein the processor and the memory are connected by a bus, the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and when the processor executes the computer program, the steps of the video editing method described above are implemented.
Beneficial effects
In the embodiments of the present application, target detection is performed on the plane video frame corresponding to the sound data to obtain target information; a sound source target is determined according to the sound data and the target information; and a clipped plane video including the sound source target is generated according to the sound source target. The application is therefore easy to implement; in noisy environments such as indoors, the influence of environmental noise and indoor reverberation on localization is reduced, localization accuracy is high, robustness is strong, and automatic editing can be performed based on the sound source target with a good editing result. In addition, when locating the sound source target, the embodiments of the present application need only one microphone for accurate localization, which lowers cost and greatly reduces the difficulty and cost of video editing.
Brief description of the drawings
FIG. 1 is a schematic diagram of an application scenario of the video editing method provided by an embodiment of the present application.
FIG. 2 is a flowchart of the video editing method provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of the video editing apparatus provided by an embodiment of the present application.
FIG. 4 is a specific structural block diagram of the computer device provided by an embodiment of the present application.
FIG. 5 is a specific structural block diagram of the camera provided by an embodiment of the present application.
本发明的实施方式
为了使本发明的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本发明进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。
以下结合具体实施例对本发明的具体实现进行详细描述:
实施例一:
本申请一实施例提供的视频剪辑方法的应用场景可以是计算机设备或相机,计算机设备或相机执行本申请一实施例提供的视频剪辑方法生成剪辑的包括声源目标的平面视频。本申请一实施例提供的视频剪辑方法的应用场景也可以包括相连接的计算机设备100和相机200(如图1所示)。计算机设备100和相机200中可运行至少一个的应用程序。计算机设备100可以是台式计算机、移动终端等,移动终端包括手机、平板电脑、笔记本电脑、个人数字助理等。相机200可以是普通的相机或者全景相机等。普通的相机是指用于拍摄平面图像和平面视频的拍摄装置。计算机设备100或者是相机200执行本申请一实施例提供的视频剪辑方法生成剪辑的包括声源目标的平面视频。
Referring to Fig. 2, a flowchart of the video editing method provided by an embodiment of the present application, this embodiment mainly takes the application of the video editing method to a computer device or a camera as an example. The video editing method provided by an embodiment of the present application includes the following steps:
S101: acquire sound data and video data corresponding to the sound data, and generate planar video frames corresponding to the sound data.
In an embodiment of the present application, the acquiring sound data and video data corresponding to the sound data may specifically be:
acquiring sound data and a planar video corresponding to the sound data;
or, acquiring sound data and a panoramic video corresponding to the sound data;
generating a planar video corresponding to the sound data according to the panoramic video.
In an embodiment of the present application, the panoramic video is an original spherical panoramic video shot by a panoramic camera or generated by computer software.
The generating a planar video corresponding to the sound data according to the panoramic video may specifically be: converting the original spherical panoramic video into a panoramic planar video corresponding to the sound data.
The converting the original spherical panoramic video into a panoramic planar video corresponding to the sound data may specifically be: projecting the original spherical panoramic video onto a plane to obtain the panoramic planar video corresponding to the sound data.
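By way of illustration only (this sketch and its parameter names are assumptions, not part of the claimed method), the projection of a spherical panorama onto a plane is commonly realized as an equirectangular mapping. The helper below maps a viewing direction on the sphere to pixel coordinates in a flat panoramic frame:

```python
def sphere_to_equirect(lon_deg, lat_deg, width, height):
    """Map a direction on the sphere (longitude/latitude in degrees)
    to pixel coordinates in an equirectangular (flat panoramic) frame
    of the given size. Longitude -180..180 spans the full width,
    latitude 90..-90 spans the full height."""
    x = (lon_deg + 180.0) / 360.0 * (width - 1)
    y = (90.0 - lat_deg) / 180.0 * (height - 1)
    return x, y
```

Sampling every output pixel through this mapping (or its inverse) yields the panoramic planar video frame corresponding to the sound data.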
The panoramic camera is a panoramic camera with a microphone, and the number of microphones may be one; the sound data is the raw sound data captured by the panoramic camera's microphone. This keeps the cost low and greatly reduces the difficulty and cost of configuring a conference system.
When the panoramic camera shoots the panoramic video, the shooting scene may be set as a conference scene, or of course any other scene.
S102: perform target detection on the planar video frames corresponding to the sound data to acquire target information.
In an embodiment of the present application, the target is a person or object in the panoramic planar video frame; the target information includes the planar video frame corresponding to the target and the position information of the target. For example, in a conference scene, the planar video frames corresponding to targets are all planar video frames containing people, and the position information of the targets is the position information of all the people.
Acquiring the planar video frame corresponding to the target may specifically be:
obtaining the position information of the target through a target detection algorithm (for example, the HOG (Histogram of Oriented Gradients) algorithm or a CNN (Convolutional Neural Network) algorithm), and cropping, from the planar video frame corresponding to the sound data, a planar video frame containing the target according to the position information of the target and a preset image size. The preset image size may be a common image resolution, such as 640 x 480, 1024 x 768, 1600 x 1200, or 2048 x 1536. The HOG algorithm describes the features of local target regions well and is a commonly used feature extraction method; a CNN typically consists of an input layer, convolutional layers, ReLU activation layers, pooling layers, and fully connected layers (INPUT-CONV-RELU-POOL-FC), and is a neural network in which convolution replaces conventional matrix multiplication.
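A minimal, illustrative sketch of the cropping step (the helper name and data shapes are assumptions): the frame is treated as a 2-D array of pixels, the detector returns an (x, y, w, h) rectangle, and a window of the preset size is centred on the detection, clamped to the frame borders:

```python
def crop_target(frame, box, out_w, out_h):
    """frame: 2-D list of pixel rows; box: (x, y, w, h) detector
    rectangle. Cut a window of the preset output size centred on the
    detection, shifted inward where it would leave the frame."""
    frame_h, frame_w = len(frame), len(frame[0])
    cx, cy = box[0] + box[2] // 2, box[1] + box[3] // 2
    x0 = min(max(cx - out_w // 2, 0), max(frame_w - out_w, 0))
    y0 = min(max(cy - out_h // 2, 0), max(frame_h - out_h, 0))
    return [row[x0:x0 + out_w] for row in frame[y0:y0 + out_h]]
```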
The position information of the target may be obtained in the following manner:
detecting all targets in the planar video frame using a target detection algorithm, representing each target with a rectangular box, and determining the position information of the target from the coordinates of the rectangular box.
S103: determine a sound source target according to the sound data and the target information.
In an embodiment of the present application, S103 may specifically be:
determining the sound source target according to the sound data and the planar video frames corresponding to the targets.
S103 may also specifically be:
inputting the sound data and the planar video frames corresponding to one or more targets into a pre-trained machine learning model (for example, a CNN model), and having the machine learning model output the sound source target corresponding to the sound data.
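An illustrative sketch of this step follows. The `score_fn` argument is a stand-in assumption for the pre-trained audio-visual model (e.g. a CNN forward pass): it scores how well one target's cropped frame matches the sound data, and the highest-scoring target is returned as the sound source target:

```python
def pick_sound_source(audio_feat, target_crops, score_fn):
    """target_crops: mapping of target id -> cropped frame.
    score_fn(audio_feat, crop) stands in for the trained model and
    returns a match score; the best-matching target id is returned."""
    best_id, best_score = None, float("-inf")
    for target_id, crop in target_crops.items():
        s = score_fn(audio_feat, crop)
        if s > best_score:
            best_id, best_score = target_id, s
    return best_id
```

In practice `score_fn` would wrap an inference call on the trained network over the audio features and the image crop.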
S104: generate, according to the sound source target, an edited planar video containing the sound source target.
For example, in a conference scene the sound source target is the presenter, and an edited planar video containing the presenter is generated according to the presenter.
For application scenarios that require the video picture to remain continuous, such as a conference scene, there may be periods during the conference in which the venue is silent. To keep the picture continuous during such periods, the presenter can be locked onto, i.e. the video is edited using the sound source target determined at the previous moment. The acquiring sound data and video data corresponding to the sound data is therefore specifically:
acquiring a continuous segment of sound data and video data corresponding to the sound data.
A continuous segment of sound data refers to sound data recorded by a sound acquisition device such as a microphone over a continuous period of time, e.g. the sound data recorded by a microphone during the continuous period from 12:00 to 12:30.
After the generating, according to the sound source target, an edited planar video containing the sound source target, the method may further include:
acquiring a planar video frame of the current moment;
determining whether the planar video frame of the current moment has corresponding sound data; if so, returning to the step of performing target detection on the planar video frame corresponding to the sound data; if not, generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment.
Alternatively, after the generating, according to the sound source target, an edited planar video containing the sound source target, the method may further include:
determining whether there is sound data at the current moment; if so, returning to the step of acquiring sound data and video data corresponding to the sound data; if not, acquiring the planar video frame of the current moment;
generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment.
Since, when a planar video frame has no corresponding sound data, the edited planar video containing the sound source target is generated according to the sound source target determined at the previous moment, the continuity of the video picture can be maintained and the presenter can be locked onto.
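The hold-the-previous-speaker behaviour described above can be sketched as a timeline walk (illustrative only, with assumed data shapes): each entry pairs a has-audio flag with the speaker the detection pipeline would report, and silent stretches keep the previously locked speaker:

```python
def speakers_over_time(frames):
    """frames: list of (has_audio, detected_speaker_or_None).
    Returns one speaker per frame, holding the previously determined
    speaker through silent stretches to keep the picture continuous."""
    out, current = [], None
    for has_audio, detected in frames:
        if has_audio:
            current = detected  # re-run detection result at this moment
        out.append(current)     # otherwise keep the locked speaker
    return out
```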
Of course, in specific applications the presenter need not be locked onto during silent periods. For example, when the video is a panoramic video, the panoramic video may be converted into a planar video during silent periods and the conference venue shown at a preset rate; users may also pre-configure video editing schemes for specific scenes according to their needs. The present application does not specifically limit this.
In an embodiment of the present application, S104, or the generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment, may specifically be:
determining, according to the sound source target, the planar video frame corresponding to the sound source target;
editing the planar video frames corresponding to the sound source target as the video frames of the edited video, to generate an edited planar video containing the sound source target;
or, determining the position information of the sound source target according to the sound source target;
generating an edited planar video containing the sound source target according to the position information of the sound source target.
The editing the planar video frames corresponding to the sound source target as the video frames of the edited video may specifically be:
splicing the planar video frames corresponding to the sound source target at each moment in sequence to generate the edited planar video.
The splicing the planar video frames corresponding to the sound source target at each moment in sequence to generate the edited planar video may specifically be:
splicing the planar video frames corresponding to the sound source target at each moment in sequence; during editing, scaling the planar video frames corresponding to the sound source target so that all of them are of equal size, and filling the areas they cannot cover with black pixels, to generate the edited planar video.
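An illustrative sketch of the scale-and-pad step (assumed data shapes; frames are 2-D arrays of pixels and 0 represents a black pixel): each crop is scaled to fit a common canvas (nearest neighbour here, purely for brevity) and the uncovered area is filled with black:

```python
def letterbox(frame, out_w, out_h, black=0):
    """Scale a crop to fit inside an out_w x out_h canvas (nearest
    neighbour, aspect ratio kept) and fill the uncovered area with
    black pixels, so every spliced frame has the same size."""
    h, w = len(frame), len(frame[0])
    scale = min(out_w / w, out_h / h)
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    scaled = [[frame[int(r / scale)][int(c / scale)] for c in range(new_w)]
              for r in range(new_h)]
    top, left = (out_h - new_h) // 2, (out_w - new_w) // 2
    canvas = [[black] * out_w for _ in range(out_h)]
    for r in range(new_h):
        for c in range(new_w):
            canvas[top + r][left + c] = scaled[r][c]
    return canvas
```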
The generating an edited planar video containing the sound source target according to the position information of the sound source target may specifically be:
performing projection transformation and editing on the planar video frames according to the position information of the sound source target, so that the sound source target is at the center of the video picture, to generate the edited planar video.
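For a panoramic planar frame, centring the sound source target can be sketched as a horizontal wrap-around shift followed by a crop. This illustrative helper (assumed names, and a simplified stand-in for the full projective transform) places the target column in the middle of the output window:

```python
def center_crop_on(frame, target_x, view_w):
    """Shift a panoramic frame horizontally, wrapping around the
    seam, so column target_x lands in the middle of a view_w-wide
    output window; rows are kept as-is."""
    w = len(frame[0])
    shift = (target_x - view_w // 2) % w
    return [[row[(shift + c) % w] for c in range(view_w)] for row in frame]
```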
In an embodiment of the present application, for a video containing one or more targets, to facilitate determining the position information of the sound source target from the sound source target, after the performing target detection on the planar video frames corresponding to the sound data, the method may further include:
providing a unique identity tag for the target using a target tracking algorithm, for example a MOT (Multiple Object Tracking) algorithm; the unique identity tag may be a symbol such as "Person 1" or "Person 2", or the real name of each person, such as "Zhang San" or "Li Si", obtained from a person database through a Re-ID (Person Re-identification) algorithm;
monitoring all targets using the target tracking algorithm, tracking the position change of each target, and recording the unique identity tag and corresponding position information of each target.
The determining the position information of the sound source target according to the sound source target may specifically be:
determining the position information of the sound source target according to the recorded unique identity tag and corresponding position information of each target and the unique identity tag corresponding to the sound source target.
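The record described above can be sketched as a small table keyed by the tracker's unique identity tags (illustrative only; a real MOT tracker would update this table every frame from its track associations):

```python
class TrackTable:
    """Minimal record of what a multi-object tracker would maintain:
    the latest position per identity tag, so the sound source
    target's position can be looked up by its tag."""

    def __init__(self):
        self.positions = {}

    def update(self, track_id, box):
        # Overwrite with the most recent position for this identity.
        self.positions[track_id] = box

    def locate(self, source_id):
        # Position of the sound source target, or None if unseen.
        return self.positions.get(source_id)
```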
In an embodiment of the present application, after S104 the method may further include the following step:
combining the planar video with the corresponding sound data.
Synchronizing the sound data with the video data lets users hear the sound while watching the video. For example, the sound data and the video data may be combined in chronological order to achieve audio-visual synchronization; the present application does not specifically limit the method of synchronizing the sound data with the video data.
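One way to sketch the chronological combination (illustrative, with assumed data shapes; a real implementation would mux encoded streams through a container library) is to pair each video frame with the audio chunk covering its timestamp:

```python
def mux(video_frames, audio_chunks):
    """video_frames / audio_chunks: time-sorted lists of
    (timestamp, payload). Pair each frame with the audio chunk whose
    start time most recently precedes it, keeping sound and picture
    in step when the result is written out."""
    out, ai = [], 0
    for ts, frame in video_frames:
        # Advance to the last audio chunk starting at or before ts.
        while ai + 1 < len(audio_chunks) and audio_chunks[ai + 1][0] <= ts:
            ai += 1
        out.append((ts, frame, audio_chunks[ai][1] if audio_chunks else None))
    return out
```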
In the present application, target detection is performed on the planar video frames corresponding to the sound data to acquire target information; a sound source target is determined according to the sound data and the target information; and an edited planar video containing the sound source target is generated according to the sound source target. The present application is therefore easy to implement; in noisy environments such as indoor settings, it reduces the influence of environmental noise and indoor reverberation on localization, achieves high localization accuracy and strong robustness, and can edit automatically based on the sound source target, yielding a good video editing result. In addition, when localizing the sound source target, the embodiments of the present application require only one microphone for accurate localization, which is low in cost and greatly reduces the difficulty and cost of video editing.
Embodiment 2:
Referring to Fig. 3, the video editing apparatus provided by an embodiment of the present application may be a computer program or a segment of program code running on a computer device or a panoramic camera, for example an application; the video editing apparatus may be used to execute the corresponding steps of the video editing method provided by the embodiments of the present application. The video editing apparatus provided by an embodiment of the present application includes:
a generation module 11, configured to acquire sound data and video data corresponding to the sound data, and generate planar video frames corresponding to the sound data;
a target detection module 12, configured to perform target detection on the planar video frames corresponding to the sound data to acquire target information;
a sound source target determination module 13, configured to determine a sound source target according to the sound data and the target information;
an editing module 14, configured to generate, according to the sound source target, an edited planar video containing the sound source target.
The video editing apparatus provided by an embodiment of the present application belongs to the same concept as the video editing method provided by an embodiment of the present application; its specific implementation process is detailed throughout the specification and is not repeated here.
Embodiment 3:
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the video editing method provided by an embodiment of the present application.
Embodiment 4:
Fig. 4 shows a specific structural block diagram of the computer device provided by an embodiment of the present application; the computer device may be the one shown in Fig. 1. A computer device 100 includes: one or more processors 101, a memory 102, and one or more computer programs, where the processor 101 and the memory 102 are connected by a bus, the one or more computer programs are stored in the memory 102 and configured to be executed by the one or more processors 101, and the processor 101, when executing the computer programs, implements the steps of the video editing method provided by an embodiment of the present application.
The computer device may be a desktop computer, a mobile terminal, or the like; mobile terminals include mobile phones, tablet computers, laptop computers, personal digital assistants, and so on.
Embodiment 5:
Fig. 5 shows a specific structural block diagram of the camera provided by an embodiment of the present application; the camera may be the one shown in Fig. 1. A camera 200 includes: one or more processors 201, a memory 202, and one or more computer programs, where the processor 201 and the memory 202 are connected by a bus, the one or more computer programs are stored in the memory 202 and configured to be executed by the one or more processors 201, and the processor 201, when executing the computer programs, implements the steps of the video editing method provided by an embodiment of the present application.
The camera 200 may be an ordinary camera, a panoramic camera, or the like.
Persons of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent application. It should be noted that persons of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (17)

  1. A video editing method, characterized in that the method comprises:
    acquiring sound data and video data corresponding to the sound data, and generating planar video frames corresponding to the sound data;
    performing target detection on the planar video frames corresponding to the sound data to acquire target information;
    determining a sound source target according to the sound data and the target information;
    generating, according to the sound source target, an edited planar video containing the sound source target.
  2. The video editing method according to claim 1, characterized in that the acquiring sound data and video data corresponding to the sound data is specifically:
    acquiring sound data and a planar video corresponding to the sound data;
    or,
    acquiring sound data and a panoramic video corresponding to the sound data;
    generating a planar video corresponding to the sound data according to the panoramic video.
  3. The video editing method according to claim 1, characterized in that the target information comprises the planar video frame corresponding to the target and the position information of the target;
    the determining a sound source target according to the sound data and the target information is specifically:
    determining the sound source target according to the sound data and the planar video frame corresponding to the target.
  4. The video editing method according to claim 3, characterized in that acquiring the planar video frame corresponding to the target is specifically:
    obtaining the position information of the target through a target detection algorithm, and cropping, from the planar video frame corresponding to the sound data, a planar video frame containing the target according to the position information of the target and a preset image size.
  5. The video editing method according to claim 3, characterized in that the position information of the target is obtained in the following manner:
    detecting all targets in the planar video frame using a target detection algorithm, representing each target with a rectangular box, and determining the position information of the target from the coordinates of the rectangular box.
  6. The video editing method according to claim 3, characterized in that the determining a sound source target according to the sound data and the target information is specifically:
    inputting the sound data and the planar video frames corresponding to one or more of the targets into a pre-trained machine learning model, and having the machine learning model output the sound source target corresponding to the sound data.
  7. The video editing method according to claim 1, characterized in that the acquiring sound data and video data corresponding to the sound data is specifically:
    acquiring a continuous segment of sound data and video data corresponding to the continuous segment of sound data;
    after the generating, according to the sound source target, an edited planar video containing the sound source target, the method further comprises:
    acquiring a planar video frame of the current moment;
    determining whether the planar video frame of the current moment has corresponding sound data; if so, returning to the step of performing target detection on the planar video frames corresponding to the sound data; if not, generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment;
    or,
    after the generating, according to the sound source target, an edited planar video containing the sound source target, the method further comprises:
    determining whether there is sound data at the current moment; if so, returning to the step of acquiring sound data and video data corresponding to the sound data; if not, acquiring the planar video frame of the current moment;
    generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment.
  8. The video editing method according to claim 7, characterized in that the generating, according to the sound source target, an edited planar video containing the sound source target, or the generating an edited planar video containing the sound source target according to the sound source target determined at the previous moment, is specifically:
    determining, according to the sound source target, the planar video frame corresponding to the sound source target;
    editing the planar video frames corresponding to the sound source target as the video frames of the edited video, to generate an edited planar video containing the sound source target;
    or,
    determining the position information of the sound source target according to the sound source target;
    generating an edited planar video containing the sound source target according to the position information of the sound source target.
  9. The video editing method according to claim 8, characterized in that the editing the planar video frames corresponding to the sound source target as the video frames of the edited video is specifically:
    splicing the planar video frames corresponding to the sound source target at each moment in sequence to generate the edited planar video.
  10. The video editing method according to claim 9, characterized in that the splicing the planar video frames corresponding to the sound source target at each moment in sequence to generate the edited planar video is specifically:
    splicing the planar video frames corresponding to the sound source target at each moment in sequence; during editing, scaling the planar video frames corresponding to the sound source target so that all of them are of equal size, and filling the areas they cannot cover with black pixels, to generate the edited planar video.
  11. The video editing method according to claim 8, characterized in that the generating an edited planar video containing the sound source target according to the position information of the sound source target is specifically:
    performing projection transformation and editing on the planar video frames according to the position information of the sound source target, so that the sound source target is at the center of the video picture, to generate the edited planar video.
  12. The video editing method according to claim 8, characterized in that after the performing target detection on the planar video frames corresponding to the sound data, the method further comprises:
    providing a unique identity tag for the target using a target tracking algorithm;
    monitoring all targets using the target tracking algorithm, tracking the position change of each target, and recording the unique identity tag and corresponding position information of each target;
    the determining the position information of the sound source target according to the sound source target is specifically:
    determining the position information of the sound source target according to the recorded unique identity tag and corresponding position information of each target and the unique identity tag corresponding to the sound source target.
  13. The video editing method according to any one of claims 1 to 12, characterized in that after the generating, according to the sound source target, an edited planar video containing the sound source target, the method further comprises:
    combining the planar video with the corresponding sound data.
  14. A video editing apparatus, characterized in that the apparatus comprises:
    a generation module, configured to acquire sound data and video data corresponding to the sound data, and generate planar video frames corresponding to the sound data;
    a target detection module, configured to perform target detection on the planar video frames corresponding to the sound data to acquire target information;
    a sound source target determination module, configured to determine a sound source target according to the sound data and the target information;
    an editing module, configured to generate, according to the sound source target, an edited planar video containing the sound source target.
  15. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the video editing method according to any one of claims 1 to 13.
  16. A computer device, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, the processor and the memory being connected by a bus, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, characterized in that the processor, when executing the computer programs, implements the steps of the video editing method according to any one of claims 1 to 13.
  17. A camera, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, the processor and the memory being connected by a bus, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, characterized in that the processor, when executing the computer programs, implements the steps of the video editing method according to any one of claims 1 to 13.
PCT/CN2021/104072 2020-07-02 2021-07-01 Video editing method and apparatus, computer-readable storage medium, and camera WO2022002214A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010628033.6 2020-07-02
CN202010628033.6A CN111918127B (zh) 2020-07-02 2020-07-02 Video editing method and apparatus, computer-readable storage medium, and camera

Publications (1)

Publication Number Publication Date
WO2022002214A1 true WO2022002214A1 (zh) 2022-01-06

Family

ID=73227260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/104072 WO2022002214A1 (zh) 2020-07-02 2021-07-01 Video editing method and apparatus, computer-readable storage medium, and camera

Country Status (2)

Country Link
CN (1) CN111918127B (zh)
WO (1) WO2022002214A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111918127B (zh) 2020-07-02 2023-04-07 影石创新科技股份有限公司 Video editing method and apparatus, computer-readable storage medium, and camera
CN112492380B (zh) 2020-11-18 2023-06-30 腾讯科技(深圳)有限公司 Sound effect adjustment method, apparatus, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105592268A (zh) * 2016-03-03 2016-05-18 苏州科达科技股份有限公司 Video conferencing system, processing apparatus, and video conferencing method
CN106161985A (zh) * 2016-07-05 2016-11-23 宁波菊风系统软件有限公司 Method for implementing an immersive video conference
CN108924469A (zh) * 2018-08-01 2018-11-30 广州视源电子科技股份有限公司 Display picture switching and transmission system, intelligent interactive tablet, and method
CN110740259A (zh) * 2019-10-21 2020-01-31 维沃移动通信有限公司 Video processing method and electronic device
CN111918127A (zh) * 2020-07-02 2020-11-10 影石创新科技股份有限公司 Video editing method and apparatus, computer-readable storage medium, and camera

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012138823A (ja) 2010-12-27 2012-07-19 Brother Ind Ltd Video conference apparatus, video conference method, and video conference program
CN102682273A (zh) 2011-03-18 2012-09-19 夏普株式会社 Lip movement detection device and method
CN103905780A (zh) 2014-03-18 2014-07-02 华为技术有限公司 Data processing method, device, and video conferencing system
JP6651989B2 (ja) 2015-08-03 2020-02-19 株式会社リコー Video processing apparatus, video processing method, and video processing system
CN109492506A (zh) 2017-09-13 2019-03-19 华为技术有限公司 Image processing method, apparatus, and system
CN108683874B (zh) 2018-05-16 2020-09-11 瑞芯微电子股份有限公司 Method for focusing attention in a video conference, and storage device
CN109257559A (zh) 2018-09-28 2019-01-22 苏州科达科技股份有限公司 Image display method and apparatus for panoramic video conferencing, and video conferencing system
CN110544491A (zh) 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and apparatus for associating a speaker with their speech recognition result in real time
CN111163281A (zh) 2020-01-09 2020-05-15 北京中电慧声科技有限公司 Panoramic video recording method and apparatus based on voice tracking

Also Published As

Publication number Publication date
CN111918127B (zh) 2023-04-07
CN111918127A (zh) 2020-11-10

Similar Documents

Publication Publication Date Title
CN109754811B Biometric-feature-based sound source tracking method, apparatus, device, and storage medium
JP6785908B2 Camera shooting control method, apparatus, intelligent device, and storage medium
CN108900787B Image display method, apparatus, system and device, and readable storage medium
WO2021139583A1 Panoramic video rendering method with automatic viewing-angle adjustment, storage medium, and computer device
CN111432115B Face tracking method based on sound-assisted localization, terminal, and storage device
US20150146078A1 Shift camera focus based on speaker position
US8749607B2 Face equalization in video conferencing
WO2022002214A1 Video editing method and apparatus, computer-readable storage medium, and camera
US10681308B2 Electronic apparatus and method for controlling thereof
KR20100028060A Display device detection technique
CN109035138B Conference recording method, apparatus, device, and storage medium
US10250803B2 Video generating system and method thereof
WO2020238324A1 Image processing method and apparatus based on video conferencing
WO2021120190A1 Data processing method and apparatus, electronic device, and storage medium
TW201933889A Distributed indoor positioning system and method
WO2019033955A1 Method and system for editing panoramic video files, and portable terminal
CN112839165B Method and apparatus for implementing face-tracking photography, computer device, and storage medium
CN110673811B Panoramic picture display method and apparatus based on sound information localization, and storage medium
WO2022028407A1 Panoramic video editing method, apparatus, storage medium, and device
WO2017096859A1 Photo processing method and apparatus
CN113794814B Method and apparatus for controlling video image output, and storage medium
US11120524B2 Video conferencing system and video conferencing method
WO2021217897A1 Positioning method, terminal device, and conference system
CN110730378A Information processing method and system
CN116668835A Camera control method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21834374

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25/08/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21834374

Country of ref document: EP

Kind code of ref document: A1