CN116761020A - Video processing method, device, equipment and medium

Video processing method, device, equipment and medium

Info

Publication number
CN116761020A
CN116761020A
Authority
CN
China
Prior art keywords
candidate key
key frame
feature vector
text
candidate
Prior art date
Legal status
Pending
Application number
CN202310620164.3A
Other languages
Chinese (zh)
Inventor
张超
姜文翼
曹海涛
石东升
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310620164.3A priority Critical patent/CN116761020A/en
Publication of CN116761020A publication Critical patent/CN116761020A/en

Classifications

    • H04N21/234381 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics

Abstract

The disclosure provides a video processing method, apparatus, device, and storage medium, relating to the technical field of artificial intelligence, and in particular to the technical fields of video processing, image processing, deep learning, and the like. The video processing method includes: acquiring a plurality of candidate key frames in a video; determining a similarity of a first candidate key frame and a second candidate key frame based on first association data and second association data; determining a redundant key frame among the first candidate key frame and the second candidate key frame based on the similarity; and removing the redundant key frames from the plurality of candidate key frames to obtain target key frames. The disclosed method and apparatus can improve the video processing effect.

Description

Video processing method, device, equipment and medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of video processing, image processing, deep learning, and the like, and specifically to a video processing method, apparatus, device, and medium.
Background
Video frame cutting extracts images from a video stream to form a series of still images for subsequent processing. When a video needs to be analyzed, for example in application fields such as video content review and video content understanding, the video must be cut into frames.
Existing video frame cutting techniques suffer from image redundancy and fail to effectively extract the video content.
Disclosure of Invention
The present disclosure provides a video processing method, apparatus, device, and medium.
According to an aspect of the present disclosure, there is provided a video processing method including: acquiring a plurality of candidate key frames in a video; determining a similarity of the first candidate key frame and the second candidate key frame based on the first association data and the second association data; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame; determining redundant key frames in the first candidate key frame and the second candidate key frame based on the similarity; and removing the redundant key frames from the candidate key frames to obtain target key frames.
According to another aspect of the present disclosure, there is provided a video processing apparatus including: the acquisition module is used for acquiring a plurality of candidate key frames in the video; the first determining module is used for determining the similarity of the first candidate key frame and the second candidate key frame according to the first association data and the second association data; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame; a second determining module, configured to determine a redundant key frame from the first candidate key frame and the second candidate key frame based on the similarity; and the removing module is used for removing the redundant key frames from the candidate key frames so as to obtain a target key frame.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical scheme, the video processing effect can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of the overall architecture of a video processing method provided according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
fig. 7 is a schematic diagram of an electronic device for implementing a video processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, video frame cutting can be performed at fixed time intervals, i.e., one image frame is extracted per set time interval. This approach, however, is too rigid: if the set interval is too small, too many redundant frames are produced; if it is too large, information is lost.
In order to extract images from video in a streamlined and efficient manner, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a video processing method, which includes:
101. A plurality of candidate key frames in the video are acquired.
102. Determining a similarity of the first candidate key frame and the second candidate key frame based on the first association data and the second association data; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame.
103. Redundant key frames are determined among the first candidate key frame and the second candidate key frame based on the similarity.
104. And removing the redundant key frames from the candidate key frames to obtain target key frames.
A key frame, also called an I-frame, is an important frame in inter-frame compression coding: a complete image can be reconstructed from the data of the I-frame alone during decoding, so an I-frame is generated without reference to other pictures.
Specifically, various related key frame extraction algorithms can be adopted to obtain key frames in the video.
For distinction, key frames obtained based on a key frame extraction algorithm may be referred to as candidate key frames.
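As a non-limiting illustration, the following Python sketch shows one possible candidate key frame extraction algorithm, based on inter-frame color-histogram differences; the disclosure does not mandate a particular algorithm, and the 0.5 threshold is an assumed example value.

    # Illustrative sketch only: one possible key frame extraction algorithm
    # (color-histogram difference). The 0.5 threshold is an assumed example.
    import cv2

    def extract_candidate_key_frames(video_path, diff_threshold=0.5):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        candidates = []  # list of (timestamp in seconds, frame image)
        prev_hist = None
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            cv2.normalize(hist, hist)
            # Keep a frame when it differs enough from the last kept frame.
            if prev_hist is None or cv2.compareHist(
                    prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
                candidates.append((index / fps, frame))
                prev_hist = hist
            index += 1
        cap.release()
        return candidates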
The candidate key frames are usually multiple, and there may be a problem of content duplication between different candidate key frames, that is, there may be redundant key frames in the multiple candidate key frames.
Redundant key frames may be determined based on a similarity between every two candidate key frames.
The two candidate key frames may be represented by a first candidate key frame and a second candidate key frame, the first candidate key frame and the second candidate key frame being any two candidate key frames of the plurality of candidate key frames.
The similarity between every two candidate key frames may be referred to as a frame similarity, which may be calculated based on corresponding association data (first association data and second association data).
The associated data includes candidate key frames and associated speech data. That is, the first associated data includes a first candidate key frame and its associated first speech data, and the second associated data includes a second candidate key frame and its associated second speech data.
Taking the first association data as an example, the first voice data may be voice data of a preset duration (for example, 5 seconds) before and after the first candidate key frame.
In order to reduce redundancy, similar candidate key frames can be determined based on the frame similarity, so that redundant key frames are obtained, and after the redundant key frames are removed from the candidate key frames, target key frames are obtained.
The target key frame is an image finally obtained for the video, and can be used for the subsequent analysis processing flow to carry out video content auditing, video content understanding and the like.
In this embodiment, redundant key frames are removed from the plurality of candidate key frames to obtain the target key frames, which reduces redundancy and yields a streamlined set of target key frames. Because the candidate key frames contain the important content information of the video, obtaining the target key frames based on the candidate key frames extracts the important content information of the video and avoids information loss. The frame similarity is determined based on the candidate key frames and the corresponding voice data, so the information of the voice data can be taken into account when determining the frame similarity, improving the accuracy of the frame similarity and, in turn, the accuracy of the target key frames. A set of target key frames that is streamlined, contains effective content information, and is accurate can therefore be obtained, improving the video processing effect.
In order to better understand the embodiments of the present disclosure, application scenarios provided by the embodiments of the present disclosure are described.
Fig. 2 is a schematic diagram of an application scenario corresponding to an embodiment of the present disclosure. As shown in fig. 2, the scenario may include a user terminal 201 and a server 202, where the user terminal 201 includes, for example: a personal computer, a notebook computer, a mobile device (e.g., a cell phone), and the like. The server 202 may be a cloud server or a local server.
The user may send the video to the server 202 through the user terminal 201, and the server 202 processes the video to obtain the target key frame. The target key frames can be analyzed later to complete video content auditing, video content understanding, and the like.
In this embodiment, the server is taken as an example for performing video processing, and it can be understood that if the user terminal has the corresponding capability, the video processing may be performed locally at the user terminal to obtain the target key frame, and further, the subsequent processing may be performed based on the target key frame.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.
For video processing, as shown in fig. 3, the overall architecture may include: a key frame extraction module 301, a frame similarity calculation module 302, and a redundancy removal module 303.
The key frame extraction module 301 is configured to perform key frame extraction processing on the video to obtain a plurality of candidate key frames.
The frame similarity calculation module 302 is configured to calculate, for every two candidate key frames of the plurality of candidate key frames, a similarity between the two candidate key frames, and determine redundant key frames among the plurality of candidate key frames based on the similarity.
The redundancy removing module 303 is configured to remove the redundant key frame from the plurality of candidate key frames to obtain the target key frame.
The frame similarity is calculated based on semantic feature vectors of image-text pairs. Taking the case where the two candidate key frames are a first candidate key frame and a second candidate key frame as an example, the image-text pair corresponding to the first candidate key frame includes: the first candidate key frame, and a first text obtained by performing speech recognition processing on the first voice data associated with the first candidate key frame. The image-text pair corresponding to the second candidate key frame includes: the second candidate key frame, and a second text obtained by performing speech recognition processing on the second voice data associated with the second candidate key frame.
Based on the image-text pair <first candidate key frame, first text>, a first semantic feature vector can be obtained; based on the image-text pair <second candidate key frame, second text>, a second semantic feature vector can be obtained. A vector similarity between the first semantic feature vector and the second semantic feature vector can then be calculated and used as the frame similarity between the first candidate key frame and the second candidate key frame.
If the frame similarity is greater than a preset threshold, the first candidate key frame and the second candidate key frame are determined to be similar, and one of them (e.g., a randomly chosen one) is used as a redundant key frame. The redundant key frames are removed from the plurality of candidate key frames to obtain the target key frames.
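For illustration only, the three modules could be composed as follows; this is a sketch under the assumption that the helper functions named here match the per-step sketches given in the detailed embodiments below:

    def process_video(video_path, audio, sample_rate, speech_to_text,
                      text_encoder, threshold=0.9):
        # Module 301: key frame extraction -> candidate key frames
        # (each candidate is a (timestamp, frame) pair).
        candidates = extract_candidate_key_frames(video_path)
        # Build the semantic feature vector of each image-text pair.
        vectors = [semantic_feature_vector(frame,
                       associated_speech(audio, sample_rate, t),
                       speech_to_text, text_encoder).numpy()
                   for t, frame in candidates]
        # Modules 302 + 303: pairwise frame similarity, then removal of
        # redundant key frames to obtain the target key frames.
        return remove_redundant([f for _, f in candidates], vectors, threshold)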
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 4 is a schematic diagram according to a second embodiment of the present disclosure. The embodiment provides a video processing method, which includes:
401. Key frame extraction processing is performed on the video to obtain a plurality of candidate key frames.
A key frame, also called an I-frame, is an important frame in inter-frame compression coding: a complete image can be reconstructed from the data of the I-frame alone during decoding, so an I-frame is generated without reference to other pictures.
Specifically, various related key frame extraction algorithms can be adopted to obtain key frames in the video.
For distinction, key frames obtained based on a key frame extraction algorithm may be referred to as candidate key frames.
402. For a first candidate key frame and a second candidate key frame, acquiring first association data of the first candidate key frame and acquiring second association data of the second candidate key frame.
Wherein the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames.
For each candidate key frame of the plurality of candidate key frames, each candidate key frame and associated speech data thereof may be used as associated data for the each candidate key frame.
For each candidate key frame, the extraction time of the candidate key frame can be used as a reference time, and voice data of a preset duration can be obtained based on the reference time as the voice data associated with that candidate key frame. For example, if the preset duration is 5 seconds and the reference time is denoted by t, the voice data in the time period [t-2.5, t+2.5] can be intercepted as the associated voice data.
Alternatively, the reference time may be taken as the start point or the end point of the voice data; for example, taking the reference time as the start point, the voice data in the time period [t, t+5] may be intercepted as the associated voice data.
In the foregoing examples the durations before and after the reference time are equal, but they may also differ; for example, the voice data in the time period [t-2, t+3] may be used as the associated voice data.
Assuming that every two candidate key frames are represented by a first candidate key frame and a second candidate key frame, the process of acquiring the voice data associated with the two candidate key frames may include:
aiming at the first candidate key frame, taking the extraction time of the first candidate key frame as a first reference time, and acquiring voice data with preset duration based on the first reference time as the first voice data;
and aiming at the second candidate key frame, taking the extraction time of the second candidate key frame as a second reference time, and acquiring the voice data with the preset duration based on the second reference time as the second voice data.
In this embodiment, the voice data with the preset duration is obtained based on the extraction time of the candidate key frame, and the voice data is used as the voice data associated with the candidate key frame, so that the association between the candidate key frame and the voice data can be improved, the accuracy of similarity calculation is further improved, and the accuracy of the target key frame is improved.
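A minimal sketch of this association step, assuming the audio track has been decoded into a NumPy array audio sampled at sample_rate Hz (all names here are illustrative):

    import numpy as np

    def associated_speech(audio: np.ndarray, sample_rate: int,
                          frame_time: float, duration: float = 5.0) -> np.ndarray:
        # Symmetric window [t - d/2, t + d/2] around the extraction time t,
        # clamped to the bounds of the audio track (matches the 5-second
        # example above; an asymmetric window works the same way).
        start = max(0, int((frame_time - duration / 2) * sample_rate))
        end = min(len(audio), int((frame_time + duration / 2) * sample_rate))
        return audio[start:end]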
403. And acquiring the similarity of the first candidate key frame and the second candidate key frame based on the first semantic feature vector of the first associated data and the second semantic feature vector of the second associated data.
Specifically, a first semantic feature vector of the first associated data may be acquired; acquiring a second semantic feature vector of the second associated data; and determining the similarity of the first semantic feature vector and the second semantic feature vector as the similarity of the first candidate key frame and the second candidate key frame.
In this embodiment, since the associated data includes image information and voice information, when calculating the similarity based on the semantic feature vector of the associated data, more dimensional information can be referred to, so as to improve the accuracy of the similarity, and further improve the accuracy of the target key frame.
The similarity between the first semantic feature vector and the second semantic feature vector may specifically be cosine similarity.
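For reference, a minimal sketch of the cosine similarity computation (a standard formulation, not specific to this disclosure):

    import numpy as np

    def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
        # sim(u, v) = (u . v) / (||u|| * ||v||), with values in [-1, 1].
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))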
For each set of associated data, the semantic feature vector is a semantic feature vector of an image-text pair, and can be obtained specifically based on the image feature vector and the text feature vector.
Specifically, the acquiring the first semantic feature vector of the first associated data includes:
extracting image features of the first candidate key frames to obtain first image feature vectors;
performing voice recognition processing on the first voice data to obtain a first text, and performing text feature extraction on the first text to obtain a first text feature vector;
and performing splicing processing on the first image feature vector and the first text feature vector to obtain the first semantic feature vector.
The obtaining the second semantic feature vector of the second associated data includes:
extracting image features of the second candidate key frames to obtain second image feature vectors;
performing voice recognition processing on the second voice data to obtain a second text, and performing text feature extraction on the second text to obtain a second text feature vector;
and performing splicing processing on the second image feature vector and the second text feature vector to obtain the second semantic feature vector.
Specifically, for each candidate key frame, voice data of a preset duration is intercepted to obtain the associated voice data, and voice recognition is performed on the voice data to obtain the corresponding text; each candidate key frame and its corresponding text can therefore be used as the image-text pair of that candidate key frame. For each image-text pair, image feature extraction can be performed on the image to obtain an image feature vector, and text feature extraction can be performed on the text to obtain a text feature vector; the image feature vector and the text feature vector are then spliced together to obtain the semantic feature vector of the image-text pair.
Image feature extraction may use an image feature extraction network, such as a Convolutional Neural Network (CNN); text feature extraction may use a text feature extraction network, such as a Transformer network.
It will be appreciated that, for images, when extracting image features based on a CNN, a flattening operation or the like may be employed to convert the two-dimensional feature maps into a one-dimensional vector.
Assuming that the image feature vector is [0,1,0] and the text feature vector is [1,1,0], the spliced semantic feature vector is [0,1,0,1,1,0].
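The following PyTorch sketch illustrates one way to build the semantic feature vector of an image-text pair. The choice of ResNet-18 as the CNN is an assumption for illustration, and speech_to_text and text_encoder are hypothetical placeholders for an ASR system and a Transformer text encoder, supplied by the caller:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pretrained CNN as the image feature extraction network; the final
    # classification layer is replaced so the output is a 512-d feature vector.
    cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    cnn.fc = torch.nn.Identity()
    cnn.eval()

    # Standard ImageNet preprocessing; assumes an RGB image array
    # (convert from BGR first if frames come from OpenCV).
    preprocess = T.Compose([
        T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def semantic_feature_vector(key_frame, speech, speech_to_text, text_encoder):
        with torch.no_grad():
            # Image feature vector (already flattened to one dimension).
            image_vec = cnn(preprocess(key_frame).unsqueeze(0)).squeeze(0)
            # Speech -> text -> text feature vector; both calls below are
            # placeholders for real ASR and Transformer components.
            text = speech_to_text(speech)
            text_vec = text_encoder(text)
        # Splice image and text features into the semantic feature vector.
        return torch.cat([image_vec, text_vec])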
In this embodiment, the semantic feature vector of the associated data is obtained based on the image feature vector and the text feature vector, the image feature vector is obtained after extracting the image feature of the candidate key frame, and the text feature vector is obtained after extracting the text feature corresponding to the voice data associated with the candidate key frame, so that the semantic feature vector combines the image information and the voice information, and further, the similarity calculated based on the semantic feature vector refers to the information of multiple dimensions, so that the accuracy of the similarity calculation is improved, and the accuracy of the target key frame is further improved.
404. And if the similarity is larger than a preset threshold value, randomly determining the first candidate key frame or the second candidate key frame as the redundant key frame.
If the similarity between the first candidate key frame and the second candidate key frame is greater than a preset threshold, it indicates that the two candidate key frames are similar, one of the two candidate key frames can be randomly used as a redundant key frame, and then the redundant key frame can be removed.
On the other hand, if the similarity between the first candidate key frame and the second candidate key frame is not greater than the preset threshold, it indicates that the two candidate key frames are not similar, and both the two candidate key frames may be reserved.
In this embodiment, when the similarity is greater than a preset threshold, the first candidate key frame or the second candidate key frame is randomly determined to be the redundant key frame, so that the redundant key frame can be simply, conveniently and efficiently obtained.
405. And removing the redundant key frames from the candidate key frames to obtain target key frames.
The target key frame is an image finally obtained for the video, and can be used for the subsequent analysis processing flow to carry out video content auditing, video content understanding and the like.
After the redundant key frames are determined, they can be removed from the plurality of candidate key frames to obtain the target key frames, which avoids redundant content among different target key frames, streamlines the video processing result, and reduces subsequent resource overhead.
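Putting steps 403 to 405 together, the following minimal sketch shows the redundancy removal loop; it assumes vectors[i] is the semantic feature vector of candidates[i] as a one-dimensional NumPy array and reuses the cosine_similarity helper sketched earlier, and the 0.9 threshold is an assumed example value:

    import random

    def remove_redundant(candidates, vectors, threshold=0.9):
        redundant = set()
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                if i in redundant or j in redundant:
                    continue
                if cosine_similarity(vectors[i], vectors[j]) > threshold:
                    # Similar pair: randomly mark one frame as redundant.
                    redundant.add(random.choice((i, j)))
        # Target key frames: candidates with the redundant frames removed.
        return [f for k, f in enumerate(candidates) if k not in redundant]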
Fig. 5 is a schematic diagram according to a third embodiment of the present disclosure. The present embodiment provides a video processing apparatus, as shown in fig. 5, the apparatus 500 includes: an acquisition module 501, a first determination module 502, a second determination module 503, and a removal module 504.
The obtaining module 501 is configured to obtain a plurality of candidate key frames in a video; the first determining module 502 is configured to determine, based on the first association data and the second association data, a similarity between the first candidate key frame and the second candidate key frame; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame; a second determining module 503 is configured to determine a redundant key frame from the first candidate key frame and the second candidate key frame based on the similarity; the removing module 504 is configured to remove the redundant keyframes from the plurality of candidate keyframes to obtain a target keyframe.
In this embodiment, redundant key frames are removed from the plurality of candidate key frames to obtain the target key frames, which reduces redundancy and yields a streamlined set of target key frames. Because the candidate key frames contain the important content information of the video, obtaining the target key frames based on the candidate key frames extracts the important content information of the video and avoids information loss. The frame similarity is determined based on the candidate key frames and the corresponding voice data, so the information of the voice data can be taken into account when determining the frame similarity, improving the accuracy of the frame similarity and, in turn, the accuracy of the target key frames. A set of target key frames that is streamlined, contains effective content information, and is accurate can therefore be obtained, improving the video processing effect.
In some embodiments, the first determining module 502 is further configured to:
acquiring a first semantic feature vector of the first associated data;
acquiring a second semantic feature vector of the second associated data;
and determining the similarity of the first semantic feature vector and the second semantic feature vector as the similarity of the first candidate key frame and the second candidate key frame.
In this embodiment, since the associated data includes image information and voice information, when calculating the similarity based on the semantic feature vector of the associated data, more dimensional information can be referred to, so as to improve the accuracy of the similarity, and further improve the accuracy of the target key frame.
In some embodiments, the first determining module 502 is further configured to:
extracting image features of the first candidate key frames to obtain first image feature vectors; performing voice recognition processing on the first voice data to obtain a first text, and performing text feature extraction on the first text to obtain a first text feature vector; performing stitching processing on the first image feature vector and the first text feature vector to obtain the first semantic feature vector; and/or extracting image features of the second candidate key frame to obtain a second image feature vector; performing voice recognition processing on the second voice data to obtain a second text, and performing text feature extraction on the second text to obtain a second text feature vector; and performing splicing processing on the second image feature vector and the second text feature vector to obtain the second semantic feature vector.
In this embodiment, the semantic feature vector of the associated data is obtained based on the image feature vector and the text feature vector, the image feature vector is obtained after extracting the image feature of the candidate key frame, and the text feature vector is obtained after extracting the text feature corresponding to the voice data associated with the candidate key frame, so that the semantic feature vector combines the image information and the voice information, and further, the similarity calculated based on the semantic feature vector refers to the information of multiple dimensions, so that the accuracy of the similarity calculation is improved, and the accuracy of the target key frame is further improved.
In some embodiments, the second determining module 503 is further configured to:
and if the similarity is larger than a preset threshold value, randomly determining the first candidate key frame or the second candidate key frame as the redundant key frame.
In this embodiment, when the similarity is greater than a preset threshold, the first candidate key frame or the second candidate key frame is randomly determined to be the redundant key frame, so that the redundant key frame can be simply, conveniently and efficiently obtained.
Fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. The present embodiment provides a video processing apparatus. As shown in fig. 6, the apparatus 600 includes: an obtaining module 601, a first determining module 602, a second determining module 603, and a removing module 604, and further includes: an association module 605.
The description of the acquisition module 601, the first determination module 602, the second determination module 603, and the removal module 604 may be found in the previous embodiment.
The association module 605 is configured to, for the first candidate key frame, use an extraction time of the first candidate key frame as a first reference time, and obtain, based on the first reference time, voice data of a preset duration as the first voice data; and aiming at the second candidate key frame, taking the extraction time of the second candidate key frame as a second reference time, and acquiring the voice data of the preset duration based on the second reference time as the second voice data.
In this embodiment, the voice data with the preset duration is obtained based on the extraction time of the candidate key frame, and the voice data is used as the voice data associated with the candidate key frame, so that the association between the candidate key frame and the voice data can be improved, the accuracy of similarity calculation is further improved, and the accuracy of the target key frame is improved.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. The electronic device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, a video processing method. For example, in some embodiments, the video processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the video processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server ("VPS") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (13)

1. A video processing method, comprising:
acquiring a plurality of candidate key frames in a video;
determining a similarity of the first candidate key frame and the second candidate key frame based on the first association data and the second association data; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame;
determining redundant key frames in the first candidate key frame and the second candidate key frame based on the similarity;
and removing the redundant key frames from the candidate key frames to obtain target key frames.
2. The method of claim 1, wherein the determining the similarity of the first candidate key frame and the second candidate key frame based on the first association data and the second association data comprises:
acquiring a first semantic feature vector of the first associated data;
acquiring a second semantic feature vector of the second associated data;
and determining the similarity of the first semantic feature vector and the second semantic feature vector as the similarity of the first candidate key frame and the second candidate key frame.
3. The method of claim 2, wherein,
the obtaining the first semantic feature vector of the first associated data includes:
extracting image features of the first candidate key frames to obtain first image feature vectors;
performing voice recognition processing on the first voice data to obtain a first text, and performing text feature extraction on the first text to obtain a first text feature vector;
performing stitching processing on the first image feature vector and the first text feature vector to obtain the first semantic feature vector;
and/or the number of the groups of groups,
the obtaining the second semantic feature vector of the second associated data includes:
extracting image features of the second candidate key frames to obtain second image feature vectors;
performing voice recognition processing on the second voice data to obtain a second text, and performing text feature extraction on the second text to obtain a second text feature vector;
and performing splicing processing on the second image feature vector and the second text feature vector to obtain the second semantic feature vector.
4. The method of claim 1, wherein the determining redundant key frames among the first candidate key frame and the second candidate key frame based on the similarity comprises:
and if the similarity is larger than a preset threshold value, randomly determining the first candidate key frame or the second candidate key frame as the redundant key frame.
5. The method of any of claims 1-4, after the obtaining the plurality of candidate key frames in the video and before the determining the similarity of the first candidate key frame and the second candidate key frame based on the first association data and the second association data, the method further comprising:
aiming at the first candidate key frame, taking the extraction time of the first candidate key frame as a first reference time, and acquiring voice data with preset duration based on the first reference time as the first voice data;
and aiming at the second candidate key frame, taking the extraction time of the second candidate key frame as a second reference time, and acquiring the voice data with the preset duration based on the second reference time as the second voice data.
6. A video processing apparatus comprising:
the acquisition module is used for acquiring a plurality of candidate key frames in the video;
the first determining module is used for determining the similarity of the first candidate key frame and the second candidate key frame according to the first association data and the second association data; the first candidate key frame and the second candidate key frame are any two candidate key frames of the plurality of candidate key frames; the first association data includes: the first candidate key frame and first speech data associated with the first candidate key frame; the second association data includes: the second candidate key frame and second speech data associated with the second candidate key frame;
a second determining module, configured to determine a redundant key frame from the first candidate key frame and the second candidate key frame based on the similarity;
and the removing module is used for removing the redundant key frames from the candidate key frames so as to obtain a target key frame.
7. The apparatus of claim 6, wherein the first determination module is further to:
acquiring a first semantic feature vector of the first associated data;
acquiring a second semantic feature vector of the second associated data;
and determining the similarity of the first semantic feature vector and the second semantic feature vector as the similarity of the first candidate key frame and the second candidate key frame.
8. The apparatus of claim 7, wherein the first determining module is further configured to:
extracting image features of the first candidate key frames to obtain first image feature vectors; performing voice recognition processing on the first voice data to obtain a first text, and performing text feature extraction on the first text to obtain a first text feature vector; performing stitching processing on the first image feature vector and the first text feature vector to obtain the first semantic feature vector;
and/or the number of the groups of groups,
extracting image features of the second candidate key frames to obtain second image feature vectors; performing voice recognition processing on the second voice data to obtain a second text, and performing text feature extraction on the second text to obtain a second text feature vector; and performing splicing processing on the second image feature vector and the second text feature vector to obtain the second semantic feature vector.
9. The apparatus of claim 6, wherein the second determination module is further to:
and if the similarity is larger than a preset threshold value, randomly determining the first candidate key frame or the second candidate key frame as the redundant key frame.
10. The apparatus of any of claims 6-9, further comprising:
the association module is used for regarding the first candidate key frames, taking the extraction time of the first candidate key frames as a first reference time, and acquiring voice data with preset duration based on the first reference time as the first voice data; and aiming at the second candidate key frame, taking the extraction time of the second candidate key frame as a second reference time, and acquiring the voice data of the preset duration based on the second reference time as the second voice data.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-5.
CN202310620164.3A 2023-05-29 2023-05-29 Video processing method, device, equipment and medium Pending CN116761020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310620164.3A CN116761020A (en) 2023-05-29 2023-05-29 Video processing method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN116761020A true CN116761020A (en) 2023-09-15

Family

ID=87948795


Country Status (1)

Country Link
CN (1) CN116761020A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117221494A (en) * 2023-10-07 2023-12-12 杭州讯意迪科技有限公司 Audio and video comprehensive management and control platform based on Internet of things and big data



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination