CN111901627A - Video processing method and device, storage medium and electronic equipment - Google Patents

Video processing method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN111901627A
Authority
CN
China
Prior art keywords
video
segment
faces
face
effective
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010468397.2A
Other languages
Chinese (zh)
Other versions
CN111901627B (en)
Inventor
程驰
谢文珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202010468397.2A priority Critical patent/CN111901627B/en
Publication of CN111901627A publication Critical patent/CN111901627A/en
Application granted granted Critical
Publication of CN111901627B publication Critical patent/CN111901627B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a video processing method and device, a storage medium and electronic equipment, and belongs to the field of computer technology. The method comprises the following steps: a server acquires original video data and obtains at least one effective voice segment based on the audio data in the original video data; groups the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment; determines effective pronunciation duration ratio information for the at least one long voice segment; determines at least one video segment corresponding to the at least one long voice segment; determines the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion in the at least one video segment; inputs an analysis result and the at least one video segment into a binary classification model for classification to obtain a classification result corresponding to the at least one video segment, the analysis result comprising at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion; and selects at least one video segment as a target video segment based on the classification result. In this way, more accurate highlight video segments can be selected, and a high-quality highlight video can be generated.

Description

Video processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of computer technology, video applications have become increasingly widespread. In these applications, a video sometimes needs to be compressed or have content extracted from it, for example by extracting highlight segments. However, the related art suffers from the defects that the selected highlight video segments are inaccurate and the resulting highlight video quality is poor.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, a storage medium and electronic equipment, which can solve the problems in the related art that the selected highlight video segments are inaccurate and the highlight video quality is poor.
The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a video processing method, where the method includes:
acquiring original video data, and acquiring at least one effective voice segment based on audio data in the original video data;
grouping the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment;
determining effective pronunciation duration ratio information in the at least one long voice segment;
determining at least one video segment corresponding to the at least one long voice segment;
determining a number of frontal faces, a number of smiling faces, a frontal face proportion and/or a smiling face proportion in the at least one video segment;
inputting an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment; the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion;
and selecting at least one video segment as a target video segment based on the classification result.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
the first processing module is used for acquiring original video data and obtaining at least one effective voice segment based on audio data in the original video data;
the grouping module is used for grouping the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment;
the first determination module is used for determining effective pronunciation duration ratio information in the at least one long voice segment;
the second determining module is used for determining at least one video segment corresponding to the at least one long voice segment;
the third determining module is used for determining a number of frontal faces, a number of smiling faces, a frontal face proportion and/or a smiling face proportion in the at least one video segment;
the second processing module is used for inputting the analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment; the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion;
and the selection module is used for selecting at least one video segment as a target video segment based on the classification result.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
when the scheme of the embodiment of the application is executed, the server acquires original video data, obtains at least one effective voice segment based on audio data in the original video data, groups the at least one effective voice segment based on preset interval duration to obtain at least one long voice segment, determines effective pronunciation duration ratio information in the at least one long voice segment, determines at least one video segment corresponding to the at least one long voice segment, determines the number of faces, the number of smiles, the ratio of faces and/or the ratio of smiles in the at least one video segment, inputs an analysis result and the at least one video segment into a binary model for classification processing to obtain a classification result corresponding to the at least one video segment, wherein the analysis result comprises at least one of effective pronunciation duration ratio information, the number of faces, the number of smiles, the ratio of faces and/or the ratio of smiles, and selecting at least one video segment as a target video segment based on the classification result, and selecting to obtain a more accurate highlight video segment in such a way, thereby generating a high-quality highlight video.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a video processing method provided in an embodiment of the present application;
FIG. 3 is another schematic flowchart of a video processing method provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a video processing apparatus provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another video processing apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which a video processing method or a video processing apparatus of an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as: a video recording application, a video playing application, a voice interaction application, a search application, an instant messaging tool, a mailbox client, social platform software, etc. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like. The network 104 may include various types of wired or wireless communication links; for example, the wired communication links include optical fiber, twisted pair or coaxial cable, and the wireless communication links include Bluetooth communication links, Wireless Fidelity (Wi-Fi) communication links, microwave communication links, etc. The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not particularly limited herein. When the terminal devices 101, 102, and 103 are hardware, they may further include a display device and a camera; the display device may be any of various devices capable of implementing a display function, and the camera is used to collect a video stream. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), a plasma display panel (PDP), or the like. The user can view displayed text, pictures, videos and other information using the display device on the terminal devices 101, 102, 103.
It should be noted that the video processing method provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the video processing apparatus is generally disposed in the server 105. The server 105 may be a server that provides various services, and the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as a plurality of software or software modules (for example, for providing distributed services), or may be implemented as a single software or software module, and is not limited in particular herein.
In the present application, the server 105 may serve the terminal devices by providing various services. For example, the server 105 may acquire original video data, obtain at least one effective voice segment based on the audio data in the original video data, group the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment, determine effective pronunciation duration ratio information for the at least one long voice segment, determine at least one video segment corresponding to the at least one long voice segment, determine the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion in the at least one video segment, input an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment, where the analysis result includes at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion, and select at least one video segment as the target video segment based on the classification result.
It should be noted that the video processing method provided in the embodiments of the present application may be executed by at least one of the terminal devices 101, 102, and 103, and/or the server 105, and accordingly, the video processing apparatus provided in the embodiments of the present application is generally disposed in the corresponding terminal device, and/or the server 105, but the present application is not limited thereto.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The video processing method provided by the embodiment of the present application will be described in detail below with reference to fig. 2 and 3. It should be noted that, for convenience of description, the embodiment is described by taking the online education industry as an example, but those skilled in the art will understand that the application of the present application is not limited to the online education industry, and the video processing method described in the present application can be effectively applied to various industries of the internet.
Referring to fig. 2, a flow chart of a video processing method according to an embodiment of the present application is schematically shown. As shown in fig. 2, the method of the embodiment of the present application may include the steps of:
s201, original video data are obtained, and at least one effective voice segment is obtained based on audio data in the original video data.
The original video data comprises a video part and an audio part, and an effective voice segment is a portion of effectively voiced audio cut out of the audio part of the original video data.
Generally, after a student and a teacher finish a lesson, the server automatically generates original video data that contains the audio and video of the student-teacher interaction during the lesson. Audio data can be obtained from the original video data, the audio data is framed based on a voice endpoint detection (VAD) algorithm to obtain multiple frames of audio data, and the multiple frames of audio data are classified based on a preset classification model and a speech recognition (ASR) algorithm to obtain at least one effective voice segment.
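As a concrete illustration of the framing and voicing detection described above, the following is a minimal sketch only: it assumes 16 kHz, mono, 16-bit PCM audio and uses the third-party webrtcvad package as the VAD; the patent does not prescribe a particular implementation, and the subsequent classification-model/ASR filtering is not shown.

```python
# Minimal sketch (assumptions: 16 kHz mono 16-bit PCM, webrtcvad as the VAD).
import webrtcvad

def effective_voice_segments(pcm: bytes, sample_rate: int = 16000,
                             frame_ms: int = 30, aggressiveness: int = 2):
    """Return (start_s, end_s) tuples for the stretches the VAD marks as speech."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per sample
    segments, seg_start = [], None
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = offset / (sample_rate * 2)                      # frame start, seconds
        if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
            if seg_start is None:
                seg_start = t
        elif seg_start is not None:
            segments.append((seg_start, t))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(pcm) / (sample_rate * 2)))
    return segments
```

In the pipeline above, such VAD segments would still be filtered by the preset classification model and the ASR step before being accepted as effective voice segments.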
S202, grouping at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment.
The preset interval duration refers to a preset threshold on the time interval between two effective voice segments; based on it, it can be determined whether multiple effective voice segments belong to the same long voice segment. A long voice segment is a complete effective voice conversation, whereas the individual effective voice segments each contain only discontinuous content information. Using the preset interval duration, the effective voice segments belonging to the same complete effective voice conversation can be screened out and classified into the same group, and based on each such group at least one complete effective voice conversation, i.e., at least one long voice segment, can be selected from the audio data of the original video data.
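The grouping by preset interval duration reduces to a simple gap-merging pass over the time-ordered effective voice segments; the sketch below is illustrative only, and the 2-second gap threshold is an assumption invented for the example, not taken from the patent.

```python
# Minimal sketch: merge effective voice segments whose gap is at most a preset
# interval duration into one "long voice segment".
def group_into_long_segments(segments, max_gap_s: float = 2.0):
    """segments: (start_s, end_s) tuples; returns (start, end, members) triples."""
    groups = []
    for start, end in sorted(segments):
        if groups and start - groups[-1][-1][1] <= max_gap_s:
            groups[-1].append((start, end))      # same conversation, same group
        else:
            groups.append([(start, end)])        # start a new long segment
    # a long segment spans from the first start to the last end of its group
    return [(g[0][0], g[-1][1], g) for g in groups]
```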
S203, determining effective pronunciation duration ratio information in at least one long voice segment.
The effective pronunciation duration ratio information is the proportion of the duration of the effectively voiced part to the total audio duration of the long voice segment.
Generally, each long voice segment is a complete effective voice conversation in the audio data of the original video data and consists of an effectively voiced part, a noise part and a silent part. By analyzing the at least one long voice segment, the total audio duration of the segment and the duration of the effectively voiced part (and/or the total duration of the noise and silent parts) can be obtained, and the effective pronunciation duration ratio information of the at least one long voice segment can then be calculated.
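A minimal illustration of this calculation, assuming the long voice segment is represented by its span and the list of effective voice segments it groups:

```python
# Minimal sketch: effective pronunciation duration ratio of one long voice
# segment, computed from the effective segments it contains.
def pronunciation_ratio(long_start: float, long_end: float, members) -> float:
    voiced = sum(end - start for start, end in members)
    total = long_end - long_start
    return voiced / total if total > 0 else 0.0
```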
S204, determining at least one video clip corresponding to at least one long voice clip.
Generally, the original video data includes both an audio part and a video part. After the audio part of the original video data has been analyzed, a long voice segment with complete conversation content information is obtained; by obtaining the time label of the long voice segment, the position of the long voice segment in the whole original video data is known, the video segment at the position corresponding to the time label can then be located in the original video data, and that video segment is cut out.
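For illustration only, the sketch below cuts out the video segment matching a long voice segment's time label with the ffmpeg command-line tool; the choice of ffmpeg and of stream copying is an assumption, not part of the patent.

```python
# Hedged sketch: cut [start_s, end_s] out of the original video without
# re-encoding (cuts land on keyframes). Assumes the ffmpeg CLI is installed.
import subprocess

def cut_video_segment(src: str, dst: str, start_s: float, end_s: float) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ss", f"{start_s:.3f}", "-to", f"{end_s:.3f}",
         "-c", "copy", dst],
        check=True)
```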
S205, determining the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion in the at least one video segment.
The number of frontal faces refers to the number of faces in a frontal pose in the video segment; multiple video images can be obtained by parsing the video segment, and the number of frontal faces can be obtained from those images. The number of smiling faces refers to the number of faces in a smiling state in the video segment, which can likewise be obtained from the video images. The frontal face proportion is the proportion of frontal face states among all face states of the video segment, and the smiling face proportion is the proportion of smiling face states among all face states of the video segment.
Generally, the at least one video segment corresponds to multiple video images, and the video images contain the face images of the student and the teacher. The total duration corresponding to the at least one video segment can be obtained, the at least one video segment is framed to obtain at least one video image, pose analysis is performed on the at least one video image to obtain the number of frontal faces, and expression analysis is performed on the at least one video image to obtain the number of smiling faces. After the number of frontal faces and the number of smiling faces corresponding to the at least one video image are obtained, the frontal face proportion and/or the smiling face proportion can further be calculated based on the number of frontal faces, the number of smiling faces and the total duration.
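A minimal sketch of turning per-image analysis results into the counts and proportions used later; is_frontal_face and is_smiling stand in for the pose and expression analysis (both hypothetical callables), and dividing by the number of sampled images is one plausible reading of "based on the numbers and the total duration" when frames are sampled at a fixed rate.

```python
# Minimal sketch: per-segment frontal-face / smiling-face statistics.
def face_statistics(frames, is_frontal_face, is_smiling):
    frontal = sum(1 for f in frames if is_frontal_face(f))
    smiling = sum(1 for f in frames if is_smiling(f))
    n = max(len(frames), 1)
    return {
        "frontal_faces": frontal,
        "smiling_faces": smiling,
        "frontal_face_ratio": frontal / n,
        "smiling_face_ratio": smiling / n,
    }
```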
And S206, inputting the analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment.
The analysis result refers to the analysis data obtained after the multiple analysis processes are performed on the at least one video segment, and includes at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion.
Generally, before the analysis result and the at least one video segment are input into the binary classification model for classification processing to obtain the classification result corresponding to the at least one video segment, the binary classification model needs to be trained. The training process may include: determining a training positive sample set and a training negative sample set, where the samples carry at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face proportion label and/or a smiling face proportion label; and training a classifier on the features of the multiple pieces of sample data until the binary classification model is obtained. The analysis result, i.e., at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion, together with the at least one video segment, is then fed into the trained binary classification model as input data, and the model classifies the input data to obtain the classification result corresponding to each video segment, such as a score or probability that the video segment belongs to a highlight video segment.
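Assuming the binary classification model is, for instance, a trained scikit-learn logistic-regression classifier over the analysis features (a sketch of the training step appears with S313-S314 below), scoring the segments could look like the following; the feature order is an assumption made for the example.

```python
# Hedged sketch: score each video segment with a trained binary classifier.
import numpy as np

def classify_segments(model, feature_rows):
    """feature_rows: one row per segment, e.g. [pronunciation_ratio,
    frontal_faces, smiling_faces, frontal_face_ratio, smiling_face_ratio]."""
    X = np.asarray(feature_rows, dtype=float)
    return model.predict_proba(X)[:, 1]   # probability of the highlight class
```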
And S207, selecting at least one video clip as a target video clip based on the classification result.
The target video clip refers to a video clip which is selected from at least one video clip and can be used as a highlight video, and the target video clip comprises an audio part and a video part.
Generally, after the classification result corresponding to the at least one video segment is obtained, at least one video segment meeting a preset condition can be selected from the at least one video segment as the target video segment based on the classification result, and preset editing processing is then performed on the target video segment to obtain a highlight video segment that can be presented directly to the user terminal. If a single video segment is selected as the target video segment, the highlight video segment is generated directly from it, or the video segment is beautified to generate the highlight video segment; if multiple video segments are selected as target video segments, they need to be clipped and spliced together, and the highlight video segment is generated after further beautification.
For example, suppose the classification score is out of 100 points and the preset condition is that a video segment's classification score is greater than or equal to 60 points. If 6 video segments are obtained after the original video data is analyzed and the binary classification model gives them classification scores of 40, 60, 70, 80, 20 and 50 points, then the video segments scoring 60, 70 and 80 points can be selected from the 6 video segments as target video segments based on the preset condition; these target video segments are clipped and spliced, and after further beautification the highlight video segment corresponding to the original video data is generated.
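The selection in this example reduces to thresholding the classification scores; a minimal sketch follows (the 60-point threshold mirrors the example above; in practice the model output might be a probability rather than a point score).

```python
# Minimal sketch: keep the segments whose classification score reaches the
# preset threshold, in playback order.
def select_target_segments(segments, scores, threshold: float = 60.0):
    picked = [seg for seg, s in zip(segments, scores) if s >= threshold]
    return sorted(picked, key=lambda seg: seg[0])   # (start_s, end_s) tuples
```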
When the scheme of the embodiment of the application is executed, the server acquires original video data and obtains at least one effective voice segment based on the audio data in the original video data; groups the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment; determines effective pronunciation duration ratio information for the at least one long voice segment; determines at least one video segment corresponding to the at least one long voice segment; determines the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion in the at least one video segment; inputs an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment, the analysis result comprising at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion; and selects at least one video segment as a target video segment based on the classification result. In this way, more accurate highlight video segments can be selected, and a high-quality highlight video can be generated.
Referring to fig. 3, a schematic flow chart of a video processing method according to an embodiment of the present application is provided, where the video processing method includes the following steps:
s301, original video data are obtained.
Generally, after a student and a teacher finish a lesson, the server automatically generates original video data that contains the audio and video of the student-teacher interaction during the lesson. Audio data can be obtained from the original video data, the audio data is framed based on a voice endpoint detection (VAD) algorithm to obtain multiple frames of audio data, and the multiple frames of audio data are classified based on a preset classification model and a speech recognition (ASR) algorithm to obtain at least one effective voice segment.
S302, framing processing is carried out on the audio data based on a voice endpoint detection VAD algorithm to obtain a plurality of framed audio data.
Generally, voice endpoint detection (VAD), also called voice activity detection or voice boundary detection, identifies and removes long silent periods from a voice signal stream, so that irrelevant audio is eliminated and effective audio is obtained without reducing service quality. The main process of VAD includes: framing the audio data to obtain multiple frames of audio data, training a classifier on a set of data frames from known speech and silence regions according to the audio features extracted from each frame, and classifying unknown frames of audio data with this classifier to obtain the effectively voiced audio data. Generally, the VAD algorithm divides the audio data into a voiced part, an unvoiced part and a silent part, and the non-silent parts constitute the effectively voiced audio data.
In the embodiment of the application, the audio data is subjected to framing processing, and a plurality of audio segments are intercepted from the audio data, wherein each audio segment is a frame. The frame length needs to satisfy two conditions:
first, the frame length must be short enough to ensure that the signal is stationary within the frame, and the length of a frame should be equal to the length of a phoneme. At normal speech speeds, the duration of a phoneme is about 50ms to 200ms, so the frame length is typically 50 ms. Secondly, the frame length must include enough vibration period to ensure that Fourier transform can be performed for multiple times to obtain analysis frequency, usually the fundamental frequency of male voice is about 100Hz, and the period is 10 ms; the fundamental frequency of female voice is about 200Hz, and the period is 5 ms; in order to include a plurality of periods in one frame, the frame length is at least 20 ms. The frame length is usually 20ms to 50ms, and 20ms, 25ms, 30ms, 40ms and 50ms are all the frame length values which are commonly used.
S303, classifying the plurality of frame audio data based on a preset classification model and a speech recognition ASR algorithm to obtain at least one effective speech segment.
The preset classification model is obtained by training based on preset input data and output data; whether an input audio segment is an effective voice segment can be judged by the preset classification model.
Generally, the classification model is based on a logistic regression method: the value of a discrete dependent variable (e.g., a binary value 0/1, yes/no, true/false) is predicted from known independent variables, i.e., the probability that an event occurs is predicted by fitting a logistic function. An ASR (Automatic Speech Recognition) algorithm takes audio as its processing object and converts the speech signal into corresponding text or commands through a process of recognition and understanding.
S304, grouping the at least one effective voice segment based on the preset interval duration to obtain at least one long voice segment.
The preset interval duration refers to a preset threshold on the time interval between two effective voice segments; based on it, it can be determined whether multiple effective voice segments belong to the same long voice segment. A long voice segment is a complete effective voice conversation, whereas the individual effective voice segments each contain only discontinuous content information. Using the preset interval duration, the effective voice segments belonging to the same complete effective voice conversation can be screened out and classified into the same group, and based on each such group at least one complete effective voice conversation, i.e., at least one long voice segment, can be selected from the audio data of the original video data.
S305, determining the duration and the interval duration of at least one long voice segment.
And S306, calculating effective pronunciation duration ratio information based on the duration and the interval duration.
The effective pronunciation duration ratio information is the proportion of the duration of the effectively voiced part to the total audio duration of the long voice segment.
Generally, each long voice segment is a complete effective voice conversation in the audio data of the original video data and consists of an effectively voiced part, a noise part and a silent part. By analyzing the at least one long voice segment, the total audio duration of the segment and the duration of the effectively voiced part (and/or the total duration of the noise and silent parts) can be obtained, and the effective pronunciation duration ratio information of the at least one long voice segment can then be calculated.
S307, time tag information corresponding to at least one long voice segment is determined.
The time tag information refers to a corresponding start time point and an end time point of the long voice segment, and the time tag information of the audio is consistent with the time tag information of the video.
S308, at least one video clip corresponding to at least one long voice clip is extracted from the original video data based on the time label information.
Generally, the original video data includes both an audio part and a video part. After the audio part of the original video data has been analyzed, a long voice segment with complete conversation content information is obtained; by obtaining the time label of the long voice segment, the position of the long voice segment in the whole original video data is known, the video segment at the position corresponding to the time label can then be located in the original video data, and that video segment is cut out.
S309, acquiring the duration corresponding to at least one video clip, and performing framing processing on at least one video clip to obtain at least one video image.
Generally, framing a video segment means splitting the video segment into multiple still video images. By obtaining the video images corresponding to the video segment, the number of frontal faces and the number of smiling faces can be further analyzed from those images; the faces in a video image may belong to the student and/or the teacher.
And S310, performing pose analysis on the at least one video image to obtain the number of frontal faces.
The number of frontal faces refers to the number of faces in a frontal pose in the video segment; multiple video images can be obtained by parsing the video segment, and the number of frontal faces can be obtained from those images.
Generally, current face pose analysis methods fall into two categories: model-based methods and appearance-based methods. A model-based method judges the pose of a face by reconstructing a three-dimensional model of the face, which gives a more accurate result but requires a large amount of computation. Appearance-based methods come in two forms: those based on robust feature representations and those based on facial feature point detection. A method based on robust feature representations relies on certain features being insensitive to pose change, so the accuracy of its result depends heavily on the quality of the feature representation; a method based on facial feature points marks several feature points of the face in the image and judges the pose direction from the geometric relationship among them, so the accuracy of its result depends on the accuracy of the feature point detection. The pose of each face can be obtained by performing face pose analysis on the faces contained in the video images, and after pose analysis has been completed on the at least one video image, the number of frontal faces is obtained, providing the basis for the subsequent calculation of the frontal face proportion.
For example: the method based on the human face characteristic points analyzes the human face gesture of the human face contained in the video image, wherein the method comprises the steps of detecting key characteristic points of the human face, such as the chin, the nose tip, the left eye corner, the right eye corner, the left mouth corner, the right mouth corner and the like, comparing and analyzing the geometric relationship between the characteristic points corresponding to the detected human face and standard human face parameters, obtaining the gesture corresponding to the detected human face, and further judging whether the human face in the video image is a front face or not.
And S311, performing expression analysis on at least one video image to obtain the number of smiling faces.
The number of smiling faces refers to the number of faces in a smiling state in the video segment; multiple video images can be obtained by parsing the video segment, and the number of smiling faces can be obtained from those images.
Generally, facial expression analysis is the process in which a terminal extracts features from a given expression image (video image), and by combining prior knowledge carries out learning, reasoning and judgement so as to understand human emotion. Facial expression analysis mainly consists of expression recognition and expression intensity estimation. Expression recognition classifies expression images into six basic expression categories (typically happiness, sadness, surprise, fear, anger and disgust); expression intensity estimation judges how strong the emotion is, which can be cast as a ranking problem: the order information of an expression sequence is used as a constraint to train a ranking model, from which the relative intensity of any two expressions in the sequence is estimated. By performing expression analysis on the faces contained in the video images, the expression of each face is obtained; after expression analysis has been completed on the at least one video image, the number of smiling faces is obtained, providing the basis for the subsequent calculation of the smiling face proportion.
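As one concrete, assumed realization of the smile-counting part, OpenCV's bundled Haar cascades can serve as a stand-in for the expression model; the patent does not prescribe a specific detector, and the detection parameters below are illustrative only.

```python
# Hedged sketch: count smiling faces in one video image with Haar cascades.
import cv2

FACE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
SMILE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

def count_smiling_faces(frame_bgr) -> int:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    smiling = 0
    for (x, y, w, h) in FACE.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
        roi = gray[y:y + h, x:x + w]                     # search smiles inside the face
        if len(SMILE.detectMultiScale(roi, scaleFactor=1.7, minNeighbors=20)) > 0:
            smiling += 1
    return smiling
```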
And S312, calculating the frontal face proportion and/or the smiling face proportion based on the number of frontal faces, the number of smiling faces and the duration.
The frontal face proportion refers to the proportion of frontal face states among all face states of the video segment; the smiling face proportion refers to the proportion of smiling face states among all face states of the video segment.
Generally, the at least one video segment corresponds to multiple video images, and the video images contain the face images of the student and the teacher. The total duration corresponding to the at least one video segment can be obtained, the at least one video segment is framed to obtain at least one video image, pose analysis is performed on the at least one video image to obtain the number of frontal faces, and expression analysis is performed on the at least one video image to obtain the number of smiling faces. After the number of frontal faces and the number of smiling faces corresponding to the at least one video image are obtained, the frontal face proportion and/or the smiling face proportion can further be calculated based on the number of frontal faces, the number of smiling faces and the total duration.
S313, a positive sample set and a negative sample set are determined.
The positive sample set and the negative sample set comprise at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face proportion label and a smiling face proportion label.
And S314, training to obtain a binary classification model based on the positive sample set and the negative sample set.
Generally, the features of the sample data in the positive sample set and the negative sample set are classified, and through repeated training a binary classification model capable of classifying multiple video segments is obtained. The binary classification model is based on a logistic regression method: it predicts the value of a discrete dependent variable (e.g., a binary value 0/1, yes/no, true/false) from known independent variables, i.e., the probability of an event occurring is predicted by fitting a logistic function.
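Under the assumption that the binary classification model is a logistic-regression classifier (consistent with the description above), training on the labelled positive and negative feature rows could look like the following sketch; scikit-learn is an assumed choice, not prescribed by the patent.

```python
# Hedged sketch of S313-S314: train a logistic-regression binary classifier on
# feature rows (pronunciation ratio, frontal-face count, smiling-face count,
# frontal-face ratio, smiling-face ratio), label 1 = highlight, 0 = not.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_model(positive_rows, negative_rows):
    X = np.vstack([positive_rows, negative_rows])
    y = np.concatenate([np.ones(len(positive_rows)),
                        np.zeros(len(negative_rows))])
    model = LogisticRegression(max_iter=1000)
    return model.fit(X, y)
```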
And S315, inputting the analysis result and the at least one video segment into the binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment.
The analysis result refers to the analysis data obtained after the multiple analysis processes are performed on the at least one video segment, and includes at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion. The classification result refers to the value of the discrete dependent variable obtained after the corresponding video segment is analyzed by the binary classification model, such as a binary value 0/1, yes/no or true/false.
Generally, before the analysis result and the at least one video segment are input into the binary classification model for classification processing to obtain the classification result corresponding to the at least one video segment, the binary classification model needs to be trained. The training process may include: determining a training positive sample set and a training negative sample set, where the samples carry at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face proportion label and a smiling face proportion label; and training a classifier on the features of the multiple pieces of sample data until the binary classification model is obtained. The analysis result, i.e., at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion, together with the at least one video segment, is then fed into the trained binary classification model as input data, and the model classifies the input data to obtain the classification result corresponding to each video segment, such as a score or probability that the video segment belongs to a highlight video segment.
And S316, selecting at least one video clip as a target video clip based on the classification result.
The target video clip refers to a video clip which is selected from at least one video clip and can be used as a highlight video, and the target video clip comprises an audio part and a video part.
Generally, after the classification result corresponding to the at least one video segment is obtained, at least one video segment meeting a preset condition can be selected from the at least one video segment as the target video segment based on the classification result, and preset editing processing is then performed on the target video segment to obtain a highlight video segment that can be presented directly to the user terminal. If a single video segment is selected as the target video segment, the highlight video segment is generated directly from it, or the video segment is beautified to generate the highlight video segment; if multiple video segments are selected as target video segments, they need to be clipped and spliced together, and the highlight video segment is generated after further beautification.
For example, suppose the classification score is out of 100 points and the preset condition is that a video segment's classification score is greater than or equal to 60 points. If 6 video segments are obtained after the original video data is analyzed and the binary classification model gives them classification scores of 40, 60, 70, 80, 20 and 50 points, then the video segments scoring 60, 70 and 80 points can be selected from the 6 video segments as target video segments based on the preset condition; these target video segments are clipped and spliced, and after further beautification the highlight video segment corresponding to the original video data is generated.
When the scheme of the embodiment of the application is executed, the server acquires original video data; frames the audio data in the original video data based on a voice endpoint detection (VAD) algorithm to obtain multiple frames of audio data; classifies the multiple frames of audio data based on a preset classification model and a speech recognition (ASR) algorithm to obtain at least one effective voice segment; groups the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment; determines the duration and the interval duration of the at least one long voice segment and calculates the effective pronunciation duration ratio information from them; determines the time label information corresponding to the at least one long voice segment and extracts the at least one video segment corresponding to the at least one long voice segment from the original video data based on the time label information; obtains the duration of the at least one video segment and frames it to obtain at least one video image; performs pose analysis on the at least one video image to obtain the number of frontal faces and expression analysis to obtain the number of smiling faces, and calculates the frontal face proportion and/or the smiling face proportion based on the number of frontal faces, the number of smiling faces and the duration; determines a positive sample set and a negative sample set and trains the binary classification model based on them; inputs the analysis result and the at least one video segment into the binary classification model for classification processing to obtain the classification result corresponding to the at least one video segment; and selects at least one video segment as the target video segment based on the classification result. In this way, accurate highlight video segments can be selected from the original video data and a high-quality highlight video can be generated.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 4, a schematic structural diagram of a video processing apparatus according to an exemplary embodiment of the present application is shown. Hereinafter referred to as device 4, the device 4 may be implemented as all or part of a terminal, by software, hardware or a combination of both. The apparatus 4 comprises a first processing module 401, a grouping module 402, a first determining module 403, a second determining module 404, a third determining module 405, a second processing module 406, a selecting module 407.
A first processing module 401, configured to obtain original video data, and obtain at least one valid voice segment based on audio data in the original video data;
a grouping module 402, configured to group the at least one valid voice segment based on a preset interval duration to obtain at least one long voice segment;
a first determining module 403, configured to determine effective pronunciation duration ratio information in the at least one long speech segment;
a second determining module 404, configured to determine at least one video segment corresponding to the at least one long voice segment;
a third determining module 405, configured to determine a number of frontal faces, a number of smiling faces, a frontal face proportion and/or a smiling face proportion in the at least one video segment;
a second processing module 406, configured to input the analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment; the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face proportion and/or the smiling face proportion;
a selecting module 407, configured to select at least one video segment as a target video segment based on the classification result.
Optionally, the third determining module 405 includes:
the first processing unit is used for acquiring the duration of the at least one video segment and performing framing processing on the at least one video segment to obtain at least one video image;
the acquisition unit is used for performing face recognition on the at least one video image to obtain the number of frontal faces and/or the number of smiling faces;
and the first calculating unit is used for calculating the frontal face proportion and the smiling face proportion based on the number of frontal faces, the number of smiling faces and the duration.
Optionally, the third determining module 405 further includes:
the first analysis unit is used for performing pose analysis on the at least one video image to obtain the number of frontal faces; and/or
the second analysis unit is used for performing expression analysis on the at least one video image to obtain the number of smiling faces.
Optionally, the first processing module 401 includes:
the second processing unit is used for performing framing processing on the audio data based on a voice endpoint detection VAD algorithm to obtain a plurality of framed audio data;
and the classification unit is used for classifying the plurality of frame audio data based on a preset classification model and a speech recognition ASR algorithm to obtain the at least one effective speech segment.
Optionally, the apparatus 4 further comprises:
a first determining unit for determining a positive sample set and a negative sample set; the positive sample set and the negative sample set comprise at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face proportion label and/or a smiling face proportion label;
and the training unit is used for training to obtain the two classification models based on the positive sample set and the negative sample set.
Optionally, the second determining module 404 includes:
a second determining unit, configured to determine time tag information corresponding to the at least one long speech segment;
an extracting unit, configured to extract the at least one video segment corresponding to the at least one long voice segment from the original video data based on the time tag information.
Optionally, the first determining module 403 includes:
a third determining unit, configured to determine a duration and an interval duration of the at least one long speech segment;
and the second calculating unit is used for calculating the effective pronunciation duration ratio information based on the duration and the interval duration.
It should be noted that, when the apparatus 4 provided in the foregoing embodiment executes the video processing method, the division into the above functional modules is merely an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the above embodiment and the embodiments of the video processing method belong to the same concept; details of its implementation process can be found in the method embodiments and are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps in the embodiments shown in fig. 2 to fig. 3, and a specific execution process may refer to specific descriptions of the embodiments shown in fig. 2 to fig. 3, which is not described herein again.
The present application further provides a computer program product storing at least one instruction, which is loaded and executed by the processor to implement the video processing method according to the above embodiments.
Fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application, hereinafter referred to as apparatus 5. The apparatus 5 may be integrated in the foregoing server or terminal device. As shown in fig. 5, the apparatus includes a memory 502, a processor 501, an input device 503, an output device 504, and a communication interface.
The memory 502 may be a separate physical unit, connected to the processor 501, the input device 503, and the output device 504 via a bus. Alternatively, the memory 502, the processor 501, the input device 503, and the output device 504 may be integrated and implemented in hardware.
The memory 502 stores a program that implements the above method embodiment, or the modules of the apparatus embodiment, and the processor 501 calls the program to perform the operations of the above method embodiment.
The input device 503 includes, but is not limited to, a keyboard, a mouse, a touch panel, a camera, and a microphone; the output device 504 includes, but is not limited to, a display screen.
The communication interface is used to send and receive various types of messages and includes, but is not limited to, a wireless interface or a wired interface.
Alternatively, when part or all of the video processing method of the above embodiments is implemented by software, the apparatus may include only a processor. In this case, the memory storing the program is located outside the apparatus, and the processor is connected to the memory through circuits/wires to read and execute the program stored in the memory.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory may include volatile memory (volatile memory), such as random-access memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); the memory may also comprise a combination of memories of the kind described above.
The processor 501 calls the program code in the memory 502 to perform the following steps:
acquiring original video data, and acquiring at least one effective voice segment based on audio data in the original video data;
grouping the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment;
determining effective pronunciation duration ratio information in the at least one long voice segment;
determining at least one video segment corresponding to the at least one long voice segment;
determining the number of frontal faces, the number of smiling faces, the frontal face ratio, and/or the smiling face ratio in the at least one video segment;
inputting an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment, where the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face ratio, and the smiling face ratio;
and selecting at least one video segment as a target video segment based on the classification result.
In one or more embodiments, processor 501 is further configured to:
acquiring the duration of the at least one video segment, and performing framing processing on the at least one video segment to obtain at least one video image;
performing face recognition on the at least one video image to obtain the number of frontal faces and/or the number of smiling faces;
and calculating the frontal face ratio and/or the smiling face ratio based on the number of frontal faces, the number of smiling faces, and the duration.
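A sketch of these three steps is given below: frames are sampled from the video segment with OpenCV, counted with the count_frontal_and_smiling_faces() helper sketched earlier, and the frontal face ratio and smiling face ratio are taken as counts per second of segment duration; the sampling interval, the per-second reading of the ratios, and the helper name are assumptions for this example rather than requirements of the embodiment.

import cv2

def analyse_video_segment(segment_path, sample_every_s=1.0):
    """Return (faces, smiles, face_ratio, smile_ratio) for one video segment."""
    cap = cv2.VideoCapture(segment_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    duration_s = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    step = max(int(fps * sample_every_s), 1)
    faces = smiles = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:  # framing: sample one video image per interval
            f, s = count_frontal_and_smiling_faces(frame)
            faces += f
            smiles += s
        index += 1
    cap.release()
    face_ratio = faces / duration_s if duration_s else 0.0
    smile_ratio = smiles / duration_s if duration_s else 0.0
    return faces, smiles, face_ratio, smile_ratio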
In one or more embodiments, processor 501 is further configured to:
performing pose analysis on the at least one video image to obtain the number of frontal faces; and/or
performing expression analysis on the at least one video image to obtain the number of smiling faces.
In one or more embodiments, processor 501 is further configured to:
framing the audio data based on a voice endpoint detection (VAD) algorithm to obtain a plurality of audio frames;
and classifying the plurality of audio frames based on a preset classification model and a speech recognition (ASR) algorithm to obtain the at least one effective voice segment.
In one or more embodiments, processor 501 is further configured to:
determining a positive sample set and a negative sample set, where the positive sample set and the negative sample set comprise at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face ratio label, and a smiling face ratio label;
and training the binary classification model based on the positive sample set and the negative sample set.
In one or more embodiments, processor 501 is further configured to:
determining time tag information corresponding to the at least one long voice segment;
extracting the at least one video segment corresponding to the at least one long voice segment from the original video data based on the time tag information.
In one or more embodiments, processor 501 is further configured to:
determining the duration and the interval duration of the at least one long voice segment;
and calculating the effective pronunciation duration ratio information based on the duration and the interval duration.
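Taken together, the processor steps above can be strung into the following end-to-end sketch. It relies on the illustrative helpers defined in the earlier sketches (effective_voice_segments, effective_pronunciation_ratio, extract_video_segment, analyse_video_segment, and a classifier trained as in the training sketch); every name, the preset interval duration of 2 seconds, and the feature layout are assumptions for this example, not a definitive implementation of the claimed method.

def group_into_long_segments(voice_segments, max_gap_s=2.0):
    """Group effective voice segments whose gaps stay below the preset
    interval duration into long voice segments (lists of segments)."""
    groups = []
    for seg in voice_segments:
        if groups and seg[0] - groups[-1][-1][1] <= max_gap_s:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups

def select_target_segments(raw_video_path, audio_samples, classifier, max_gap_s=2.0):
    """End-to-end flow: audio analysis, segment extraction, face analysis,
    binary classification, and target segment selection."""
    # audio_samples are assumed to have been extracted from the original
    # video beforehand (for example with ffmpeg).
    targets = []
    speech = effective_voice_segments(audio_samples)
    for group in group_into_long_segments(speech, max_gap_s):
        ratio = effective_pronunciation_ratio(group)
        start_s, end_s = group[0][0], group[-1][1]
        segment_path = "segment_%.1f_%.1f.mp4" % (start_s, end_s)
        extract_video_segment(raw_video_path, start_s, end_s, segment_path)
        faces, smiles, face_ratio, smile_ratio = analyse_video_segment(segment_path)
        features = [[ratio, faces, smiles, face_ratio, smile_ratio]]
        if classifier.predict(features)[0] == 1:  # binary classification result
            targets.append(segment_path)
    return targets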
An embodiment of the present application further provides a computer storage medium storing a computer program, where the computer program is used to execute the video processing method provided by the foregoing embodiments.
An embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the video processing method provided by the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method of video processing, the method comprising:
acquiring original video data, and acquiring at least one effective voice segment based on audio data in the original video data;
grouping the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment;
determining effective pronunciation duration ratio information in the at least one long voice segment;
determining at least one video segment corresponding to the at least one long voice segment;
determining a number of frontal faces, a number of smiling faces, a frontal face ratio, and/or a smiling face ratio in the at least one video segment;
inputting an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment, wherein the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face ratio, and the smiling face ratio;
and selecting at least one video segment as a target video segment based on the classification result.
2. The method of claim 1, wherein determining the number of frontal faces, the number of smiling faces, the frontal face ratio, and/or the smiling face ratio in the at least one video segment comprises:
acquiring the duration of the at least one video segment, and performing framing processing on the at least one video segment to obtain at least one video image;
performing face recognition on the at least one video image to obtain the number of frontal faces and/or the number of smiling faces;
and calculating the frontal face ratio and/or the smiling face ratio based on the number of frontal faces, the number of smiling faces, and the duration.
3. The method according to claim 2, wherein the performing face recognition on the at least one video image to obtain the number of frontal faces and/or the number of smiling faces comprises:
performing pose analysis on the at least one video image to obtain the number of frontal faces; and/or
performing expression analysis on the at least one video image to obtain the number of smiling faces.
4. The method of claim 1, wherein the acquiring at least one effective voice segment based on the audio data in the original video data comprises:
framing the audio data based on a voice endpoint detection (VAD) algorithm to obtain a plurality of audio frames;
and classifying the plurality of audio frames based on a preset classification model and a speech recognition (ASR) algorithm to obtain the at least one effective voice segment.
5. The method of claim 1, wherein the binary classification model is a pre-trained model, and a training process thereof comprises:
determining a positive sample set and a negative sample set, wherein the positive sample set and the negative sample set comprise at least one of an effective pronunciation duration ratio information label, a frontal face number label, a smiling face number label, a frontal face ratio label, and a smiling face ratio label;
and training the binary classification model based on the positive sample set and the negative sample set.
6. The method of claim 1, wherein the determining at least one video segment corresponding to the at least one long voice segment comprises:
determining time tag information corresponding to the at least one long voice segment;
extracting the at least one video segment corresponding to the at least one long voice segment from the original video data based on the time tag information.
7. The method according to claim 1, wherein the determining effective pronunciation duration ratio information in the at least one long voice segment comprises:
determining the duration and the interval duration of the at least one long voice segment;
and calculating the effective pronunciation duration ratio information based on the duration and the interval duration.
8. A video processing apparatus, characterized in that the apparatus comprises:
a first processing module, configured to acquire original video data and obtain at least one effective voice segment based on audio data in the original video data;
a grouping module, configured to group the at least one effective voice segment based on a preset interval duration to obtain at least one long voice segment;
a first determining module, configured to determine effective pronunciation duration ratio information in the at least one long voice segment;
a second determining module, configured to determine at least one video segment corresponding to the at least one long voice segment;
a third determining module, configured to determine a number of frontal faces, a number of smiling faces, a frontal face ratio, and/or a smiling face ratio in the at least one video segment;
a second processing module, configured to input an analysis result and the at least one video segment into a binary classification model for classification processing to obtain a classification result corresponding to the at least one video segment, wherein the analysis result comprises at least one of the effective pronunciation duration ratio information, the number of frontal faces, the number of smiling faces, the frontal face ratio, and the smiling face ratio;
and a selection module, configured to select at least one video segment as a target video segment based on the classification result.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
CN202010468397.2A 2020-05-28 2020-05-28 Video processing method and device, storage medium and electronic equipment Active CN111901627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010468397.2A CN111901627B (en) 2020-05-28 2020-05-28 Video processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010468397.2A CN111901627B (en) 2020-05-28 2020-05-28 Video processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111901627A true CN111901627A (en) 2020-11-06
CN111901627B CN111901627B (en) 2022-12-30

Family

ID=73207603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010468397.2A Active CN111901627B (en) 2020-05-28 2020-05-28 Video processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111901627B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1725837A (en) * 2004-07-22 2006-01-25 上海乐金广电电子有限公司 Generating device and method of PVR brilliant scene stream
US20080298767A1 (en) * 2007-05-30 2008-12-04 Samsung Electronics Co., Ltd. Method, medium and apparatus summarizing moving pictures of sports games
CN102073635A (en) * 2009-10-30 2011-05-25 索尼株式会社 Program endpoint time detection apparatus and method and program information retrieval system
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN104463139A (en) * 2014-12-23 2015-03-25 福州大学 Sports video wonderful event detection method based on audio emotion driving
CN105009599A (en) * 2012-12-31 2015-10-28 谷歌公司 Automatic identification of a notable moment
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN108764067A (en) * 2018-05-08 2018-11-06 北京大米科技有限公司 Video intercepting method, terminal, equipment and readable medium based on recognition of face
CN109218618A (en) * 2018-10-12 2019-01-15 上海思依暄机器人科技股份有限公司 Video image grasp shoot method and video image capture robot
CN109889920A (en) * 2019-04-16 2019-06-14 威比网络科技(上海)有限公司 Network courses video clipping method, system, equipment and storage medium
CN110267119A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 The evaluation method and relevant device of video highlight degree
WO2019223102A1 (en) * 2018-05-22 2019-11-28 平安科技(深圳)有限公司 Method and apparatus for checking validity of identity, terminal device and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669885A (en) * 2020-12-31 2021-04-16 咪咕文化科技有限公司 Audio editing method, electronic equipment and storage medium
US11816896B2 (en) * 2021-03-17 2023-11-14 Gopro, Inc. Media summary generation
CN115022732A (en) * 2022-05-25 2022-09-06 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium
CN115022732B (en) * 2022-05-25 2023-11-03 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium
CN114822512A (en) * 2022-06-29 2022-07-29 腾讯科技(深圳)有限公司 Audio data processing method and device, electronic equipment and storage medium
CN114822512B (en) * 2022-06-29 2022-09-02 腾讯科技(深圳)有限公司 Audio data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111901627B (en) 2022-12-30

Similar Documents

Publication Publication Date Title
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN110807388B (en) Interaction method, interaction device, terminal equipment and storage medium
CN110728997B (en) Multi-modal depression detection system based on context awareness
Wöllmer et al. Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive artificial listening
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN111564164A (en) Multi-mode emotion recognition method and device
US9972341B2 (en) Apparatus and method for emotion recognition
CN109063587B (en) Data processing method, storage medium and electronic device
US11080723B2 (en) Real time event audience sentiment analysis utilizing biometric data
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN110826441B (en) Interaction method, interaction device, terminal equipment and storage medium
CN112989822B (en) Method, device, electronic equipment and storage medium for recognizing sentence categories in conversation
CN110085229A (en) Intelligent virtual foreign teacher information interacting method and device
CN111653274B (en) Wake-up word recognition method, device and storage medium
US11205418B2 (en) Monotone speech detection
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
WO2021012495A1 (en) Method and device for verifying speech recognition result, computer apparatus, and medium
CN114072786A (en) Speech analysis device, speech analysis method, and program
CN113128284A (en) Multi-mode emotion recognition method and device
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN110782916B (en) Multi-mode complaint identification method, device and system
CN111508530A (en) Speech emotion recognition method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant