CN117579858A - Video data processing method, device, computer equipment and storage medium - Google Patents

Video data processing method, device, computer equipment and storage medium

Info

Publication number
CN117579858A
Authority
CN
China
Prior art keywords
time point
video
speech
emotion recognition
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311514849.6A
Other languages
Chinese (zh)
Inventor
冯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311514849.6A priority Critical patent/CN117579858A/en
Publication of CN117579858A publication Critical patent/CN117579858A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a video data processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: determining a plurality of time points of a video to be processed, and determining multi-modal data corresponding to each time point, wherein the multi-modal data at least comprises a video picture, speech audio and speech text; for each time point, performing emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result; determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result; screening candidate time points according to the suspense score of each time point; and performing integrity analysis according to the multi-modal data of each candidate time point, and determining a target time point from the candidate time points according to the integrity analysis result. With this method, the target time point used to determine the video end point can be located accurately, which improves the effect of video clipping and plot labeling.

Description

Video data processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a video data processing method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology, watching video has become a common form of daily entertainment. For a video of long duration, a user is not necessarily interested in all of its content and may wish to watch selectively. To meet the requirements of different users, the video can be clipped into shorter highlight video clips.
In the process of segmenting and clipping video, a great deal of manual labor is usually spent marking the time positions in the whole video whose content is suitable as a material end point, so that these positions can be selected and used by other downstream applications to segment the video.
However, manual marking consumes substantial manpower and is very slow, so the marking efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video data processing method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the determination efficiency of a target time point.
In a first aspect, the present application provides a video data processing method. The method comprises the following steps:
determining a plurality of time points of a video to be processed, and acquiring multi-modal data corresponding to each time point, wherein the multi-modal data at least comprises a video picture, speech audio and speech text;
for any time point, performing emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result;
determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point;
screening candidate time points from a plurality of time points according to the suspense score of each time point;
carrying out integrity analysis according to the multi-modal data of each candidate time point, and determining a target time point from the candidate time points according to an integrity analysis result; the target time point is used to determine the video end point.
In a second aspect, the present application also provides a video data processing apparatus. The device comprises:
the data determining module is used for determining a plurality of time points of the video to be processed, and acquiring multi-modal data corresponding to each time point, wherein the multi-modal data at least comprises a video picture, speech audio and speech text;
the recognition module is used for, for any time point, performing emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result;
the computing module is used for determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point;
the screening module is used for screening candidate time points from a plurality of time points according to the suspense score of each time point;
the positioning module is used for carrying out integrity analysis according to the multi-mode data of each candidate time point, and determining a target time point from the candidate time points according to an integrity analysis result; the target time point is used to determine the video end point.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the video data processing method described above when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the video data processing method described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the video data processing method described above.
In the video data processing method, apparatus, computer device, storage medium and computer program product, a plurality of time points of the video to be processed are determined, and multi-modal data corresponding to each time point are acquired, wherein the multi-modal data at least comprise a video picture, speech audio and speech text. Further, emotion recognition is performed according to the video picture and the speech audio of a time point to obtain an emotion recognition result, and the suspense score of the time point is then determined according to the video picture, the speech text and the emotion recognition result of the time point. That is, when the suspense score of a time point is calculated, data of multiple modalities such as the video picture, the speech text and the emotion recognition result are integrated, which ensures the accuracy of the suspense score and provides a certain robustness. Further, candidate time points with suspense are screened out according to the suspense score of each time point. Finally, integrity analysis is performed according to the multi-modal data of each candidate time point, and screening is performed again according to the integrity analysis result to determine a target time point, where the target time point is used to determine the video end point. In the present application, the suspense score is calculated from multi-modal data, candidate time points are obtained by screening according to the suspense score, integrity analysis is further performed on the candidate time points, and screening is performed again in combination with the integrity analysis result, so that the content at the video end point has a certain logical integrity and suspense, manual participation is completely eliminated, and clipping efficiency is greatly improved. Meanwhile, since the positioning of the target time point is a standardized operation, differences caused by different personnel are avoided, and the accuracy of video end point prediction is ensured.
Drawings
FIG. 1 is an application environment diagram of a video data processing method in one embodiment;
FIG. 2 is a flow chart of a video data processing method in one embodiment;
FIG. 3 is a flow diagram of facial emotion recognition in one embodiment;
FIG. 4 is a block diagram of the network architecture in one embodiment;
FIG. 5 is a schematic diagram showing the effect of face detection in one embodiment;
FIG. 6 is a phase block diagram of a network architecture in one embodiment;
FIG. 7 is a block diagram of a network architecture in another embodiment;
FIG. 8 is a block diagram of a network architecture in yet another embodiment;
FIG. 9 is a schematic diagram of a speech emotion recognition model in one embodiment;
FIG. 10 is a bullet screen data trend graph, according to one embodiment;
FIG. 11 is a block diagram of a feature weighting process in one embodiment;
FIG. 12 is an overall system architecture diagram in one embodiment;
FIG. 13 is a diagram illustrating determination of a target time point in one embodiment;
FIG. 14 is an overall system architecture diagram in another embodiment;
FIG. 15 is a block diagram showing the structure of a video data processing apparatus in one embodiment;
fig. 16 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The video data processing method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be provided separately, may be integrated on the server 104, or may be located on a cloud or other network server. The terminal 102 and the server 104 may be used alone to perform the video data processing method in the present application, and the terminal 102 and the server 104 may be used cooperatively to perform the video data processing method in the present application.
Taking the case where the terminal 102 and the server 104 cooperatively execute the method of the present application as an example, when video data processing is specifically performed, a user may, in light of their own video processing requirements such as video clipping requirements or video plot labeling requirements, send a video to be processed to the server 104 through the terminal 102. The server 104 may acquire the video to be processed sent by the terminal 102, determine a plurality of time points of the video to be processed, and acquire multi-modal data corresponding to each time point, where the multi-modal data includes at least a video picture, speech audio and speech text. For any time point, the server 104 performs emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result; the server 104 determines the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point; the server 104 screens candidate time points from the plurality of time points according to the suspense score of each time point; the server 104 performs integrity analysis according to the multi-modal data of each candidate time point, and determines a target time point from the candidate time points according to the integrity analysis result; the target time point is used to determine the video end point. The server 104 may feed back the target time point to the terminal 102 so that the user may perform related operations on the video, such as video clipping or video plot labeling, based on the terminal 102.
Applications such as a video application, a video clip application, a live broadcast application, a social application, an educational application, a shopping application, and the like, which have video data processing requirements, are running on the terminal 102. The terminal 102 may be, but is not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, portable wearable devices, intelligent voice interaction devices, smart appliances, vehicle terminals, aircraft, and the like. The internet of things equipment can be an intelligent sound box, an intelligent television, an intelligent air conditioner, intelligent vehicle-mounted equipment and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers. The server may also be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication.
It should be noted that the embodiments of the present invention may be applied to various scenarios, including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, driving assistance, and the like.
Artificial intelligence (AI), as referred to herein, is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning, among other directions.
The following describes the video data processing method of the present application in detail:
in one embodiment, as shown in fig. 2, a video data processing method is provided, which is described as an example of application to a computer device (the computer device may be specifically the terminal 102 or the server 104 in fig. 1), and the video data processing method includes the following steps:
step 202, determining a plurality of time points of the video to be processed, and obtaining multi-mode data corresponding to each time point, wherein the multi-mode data at least comprises a video picture, a speech sound and a speech text.
The video to be processed refers to a video on which processing needs to be performed. For example, the video to be processed may be a movie or television drama video, a variety show video, or a user-generated video.
A time point is a point in time at which the corresponding video content appears on the video progress bar. When the playing position of the video is described by a time point, the measurement unit of the time point may vary, for example hours (h), minutes (min) or seconds (s). The specific measurement unit can be set according to the length and content of the video.
The multi-modal data of the video to be processed refers to data of different forms capable of reflecting characteristics of the video to be processed, and may generally include video pictures, speech audio, speech text, and the like.
The video picture is data that reflects video characteristics in the form of video frames.
The speech audio is data that reflects video characteristics in the form of audio. In the present application, the speech audio may specifically be the sound data matching the video picture in the video to be processed.
The speech text is data that reflects video characteristics in the form of text. In the present application, the speech text may be a text paragraph matching the video picture and the speech audio in the video to be processed.
In some embodiments, for the acquired video to be processed, the measurement unit of the time point is seconds. The plurality of time points of the video to be processed determined by the computer device may include 1 s, 2 s, 3 s, and so on. The time point 1 s corresponds to its own video picture, speech audio and speech text; the time point 2 s corresponds to another video picture, speech audio and speech text, and so forth. It can be understood that the measurement unit of the time point may also be hours or minutes, and correspondingly, the multi-modal data acquired by the computer device is the multi-modal data corresponding to the time point of x hours or x minutes.
In some embodiments, when determining the multi-modal data corresponding to each time point, the computer device may take the time point as a central time point and extend a certain time period forward, backward, or in both directions to form a time interval, and then acquire the multi-modal data within that time interval.
In some embodiments, when the computer device extends a certain time period forward or backward from a time point to form a time interval, the time period may be measured in seconds or in minutes. For example, when measured in seconds, the time period may be 2 s, 3 s, and so on; when measured in minutes, the time period may be 1 min, 2 min, and so on. The specific time period extended forward or backward can be adaptively adjusted according to the actual video length, the video processing requirements, and the like.
In some embodiments, when multi-modal data is acquired per time interval, the computer device may take the line paragraph within the time interval as the speech text, the audio segment within the time interval as the speech audio, and the multiple video frames within the time interval as the video picture.
In other embodiments, when the number of video frames at a time point is too large, for example exceeds a set number threshold, frame extraction may be performed on the video frames to obtain a preset number of key frames, and the video picture is then determined based on these key frames. The number of video frames included in the video picture of each time point may be, for example, 2 to 5 frames.
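For illustration, the following is a minimal sketch of how multi-modal data of this kind might be gathered per time point. The 2 s window, the one-sample-per-second granularity, and the helper interfaces (sample_frames, slice, subtitle objects with time/text attributes) are assumptions for the sketch, not part of the method described here.

```python
# Minimal sketch (not the patented implementation). Assumed helpers:
# video.sample_frames(start, end, k), audio.slice(start, end), and subtitle
# entries with .time and .text attributes.
from dataclasses import dataclass
from typing import List

@dataclass
class MultiModalSample:
    time_point: float   # seconds
    frames: list        # a few key frames around the time point (the "video picture")
    audio_clip: object  # speech audio segment within the interval
    line_text: str      # speech text falling within the interval

def collect_samples(video, audio, subtitles, duration_s: int,
                    window_s: float = 2.0, max_frames: int = 5) -> List[MultiModalSample]:
    samples = []
    for t in range(1, duration_s + 1):                        # one time point per second
        start, end = max(0.0, t - window_s), t + window_s     # interval around the time point
        frames = video.sample_frames(start, end, max_frames)  # frame extraction / key frames
        clip = audio.slice(start, end)                        # speech audio in the interval
        text = " ".join(s.text for s in subtitles if start <= s.time <= end)
        samples.append(MultiModalSample(float(t), frames, clip, text))
    return samples
```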
In some embodiments, the video to be processed may be sent to the computer device by the terminal when the user has a video clipping requirement or a plot labeling requirement. After obtaining the video to be processed, the computer device can process the video data and provide the user with target time points for video clipping and plot labeling. After receiving, through the terminal, the target time points provided by the computer device, the user can carry out boundary marking for short-video clipping, highlight clips, plot labeling and the like based on the terminal.
Step 204, for any time point, performing emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result.
Here, emotion recognition comprises emotion recognition for the video picture and emotion recognition for the speech audio. Emotion recognition for the video picture is the process of recognizing the emotion of the video content in the video picture, and emotion recognition for the speech audio is the process of recognizing the emotion of the speech in the speech audio.
Specifically, for any time point, the computer device can perform emotion recognition on the video picture to obtain a corresponding first emotion recognition result, perform emotion recognition on the speech audio to obtain a corresponding second emotion recognition result, and combine the respective recognition results of the video picture and the speech audio to obtain the final emotion recognition result.
In some embodiments, the video content in the video to be processed is related to the type of the video to be processed, which may include person-type videos, animal-type videos, natural-landscape videos, and the like. When the video to be processed is a person-type video, the video content in the video picture can comprise a human face, and emotion recognition for the video picture can specifically be emotion recognition for the human face; when the video to be processed is an animal-type video, the video content in the video picture can also comprise a face, and emotion recognition for the video picture can specifically be emotion recognition for the animal face.
In other embodiments, when the type of the video to be processed is a person type, the result the computer device obtains from emotion recognition on the video picture can be the expression of a person; when the type of the video to be processed is an animal type, the result can be the emotion of an animal.
In some embodiments, the speech audio may be human voice, animal sound, and so on, depending on the type of the video to be processed. If the type of the video to be processed is a person-type video, the speech audio may be a human voice or a mix of human voice and background music; if the type of the video to be processed is an animal-type video, the speech audio may be animal sound or the like.
In some embodiments, the first emotion recognition result is the facial expression category obtained after emotion recognition of the face in the video picture, and the facial expression categories may include multiple categories such as an excited category and a flat category. For the facial expression categories, the excited category may include surprise, fear, laughing, crying, grimacing and anger; the flat category may include no expression, smiling and sadness.
In some embodiments, the second emotion recognition result is the speech emotion category obtained after emotion recognition of the speech audio, and the speech emotion categories may likewise include multiple categories such as an excited category and a flat category. For the speech emotion categories, the excited category may include anger, surprise, happiness, crying, shouting, sadness and fear; the flat category may include flatness and sadness.
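For illustration only, the sketch below maps a facial expression category (first result) and a speech emotion category (second result) onto the coarse excited/flat labels listed above. The fusion rule ("either modality excited means excited") is an assumption and not something specified in this description.

```python
# Illustrative only: coarse fusion of the first (facial) and second (speech)
# emotion recognition results into excited/flat, using the example category
# lists given above. The fusion rule itself is an assumption.
FACIAL_EXCITED = {"surprise", "fear", "laughing", "crying", "grimacing", "anger"}
SPEECH_EXCITED = {"anger", "surprise", "happiness", "crying", "shouting", "sadness", "fear"}

def fuse_emotion(facial_label: str, speech_label: str) -> str:
    excited = (facial_label in FACIAL_EXCITED) or (speech_label in SPEECH_EXCITED)
    return "excited" if excited else "flat"

print(fuse_emotion("no expression", "shouting"))  # -> "excited"
```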
Step 206, determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point.
Here, the suspense score is the calculated degree of suspense of a time point, and the degree of suspense can be used to determine whether the time point is a suspense point of the video to be processed. In general, whether a time point is a suspense point can be determined according to the magnitude of its suspense score.
A suspense point is a point in the video that can arouse the user's interest or that foreshadows a coming climax. In an actual movie or television drama video, a suspense point is usually a point of interest that keeps the user watching, and it is set to pave the way for the development of the subsequent plot.
Specifically, for each time point, the computer device may integrate the video frame, the speech text and the emotion recognition result of the time point, determine the suspense score of the time point, and determine whether the time point is the suspense point according to the suspense score of the time point, if so, perform the subsequent screening process, and if not, discard the time point, that is, the time point is no longer involved in the determination of the subsequent target time point.
In some embodiments, the suspense score may be any parameter that can provide a basis for screening, such as a calculated confidence level or a score value. Where the suspense score is a calculated confidence level, it can be understood that the higher the confidence level of a time point, the more likely that time point is a suspense point.
Step 208, screening candidate time points from the plurality of time points according to the suspense score of each time point.
The candidate time points are points of highlight content which can arouse the interest of a user in the video to be processed, and the candidate time points can be suspense points screened from a plurality of time points.
Specifically, the computer device screens the time points according to the suspense score of each time point, so as to screen candidate time points from a plurality of time points.
In some embodiments, the computer device may set a suspense score threshold and screen the candidate time points using this threshold. For example, the computer device may determine a time point whose suspense score is greater than the suspense score threshold as a candidate time point, and discard a time point whose suspense score is less than or equal to the suspense score threshold.
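A minimal sketch of this threshold-based screening is given below; the dictionary representation of the scores and the 0.5 threshold are illustrative assumptions.

```python
# Threshold-based screening sketch. suspense_scores maps each time point
# (in seconds) to its suspense score; the 0.5 threshold is illustrative.
def screen_candidates(suspense_scores: dict, threshold: float = 0.5) -> list:
    # keep time points whose score is strictly greater than the threshold
    return sorted(t for t, score in suspense_scores.items() if score > threshold)

print(screen_candidates({1: 0.2, 2: 0.7, 3: 0.9}, threshold=0.5))  # [2, 3]
```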
In other embodiments, the computer device may segment the video to be processed into a plurality of sub-videos according to the plot and select candidate time points on a per-sub-video basis; when doing so, it may set the number of candidate time points for each sub-video and select the candidate time points in combination with the suspense scores of the time points.
In some embodiments, screening candidate time points from a plurality of time points based on the suspense score for each time point includes: performing curve fitting according to the suspense scores of the time points to obtain a target fitting curve; and determining the time point position corresponding to the wave crest in the target fitting curve as a candidate time point position.
The target fitting curve is a curve fitted by the time points and the suspense score, and can be used for representing the association relation between the time points and the suspense score. By fitting a plurality of time points and their corresponding suspense scores, a peak-valley graph can be obtained.
A time point located at a peak has a high probability of corresponding to a suspenseful plot.
Specifically, the computer device performs fitting through a plurality of time points and suspense scores corresponding to the time points to obtain a target fitting curve. Further, the computer device may determine a slope of each point on the target fitting curve, determine a peak from each point with a slope equal to 0, and determine a time point corresponding to the peak as the candidate time point.
In some embodiments, the computer device may determine a plurality of peaks by fitting a curve to the target. The computer device may determine the time points corresponding to all the peaks as candidate time points. It can be understood that the computer device may also screen the wave crest again to obtain a screened wave crest, and determine the time point corresponding to the screened wave crest as the candidate time point, so that the obtained candidate time point is more simplified, the calculated amount is reduced on the basis of improving the determination accuracy of the target time point, and the processing efficiency of the video data is improved.
In some embodiments, the computer device may set a peak number threshold, and when the number of peaks is greater than the peak number threshold, further filtering is performed on the peaks to obtain the target peak. The setting of the threshold value of the peak number can be adaptively adjusted in combination with efficiency requirements, precision requirements and the like of video data processing.
In the above embodiment, the computer device performs fitting over a plurality of time points and their corresponding suspense scores to obtain a target fitting curve. Further, the computer device determines the peaks of the target fitting curve and determines the time points corresponding to the peaks as candidate time points. Because a peak indicates a high probability that the time point corresponds to a suspenseful plot, candidate time points can be accurately determined from the time points, which improves the accuracy of determining the target time point.
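The sketch below illustrates one possible realization of this peak-based selection. The use of a Savitzky-Golay filter as the fitted curve and scipy.signal.find_peaks for locating zero-slope maxima are choices made here for illustration only, not the specific fitting method of this application.

```python
# Sketch: smooth the suspense scores as a stand-in for the "target fitting
# curve", take local maxima as candidate time points, and optionally keep only
# the highest peaks when a peak-number threshold is exceeded.
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def candidate_time_points(time_points, scores, max_peaks=None):
    scores = np.asarray(scores, dtype=float)
    fitted = savgol_filter(scores, window_length=9, polyorder=3)  # assumes >= 9 points
    peak_idx, _ = find_peaks(fitted)          # local maxima (slope crosses zero)
    if max_peaks is not None and len(peak_idx) > max_peaks:
        # keep the peaks with the highest fitted scores (peak-number threshold)
        peak_idx = np.sort(peak_idx[np.argsort(fitted[peak_idx])[-max_peaks:]])
    return [time_points[i] for i in peak_idx]
```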
Step 210, performing integrity analysis according to the multi-modal data of each candidate time point, and determining a target time point from the candidate time points according to the integrity analysis result; the target time point is used to determine the video end point.
The integrity analysis is used for analyzing whether the scenario of the video segment obtained based on the candidate time point position segmentation is logically complete, namely whether the video segment has logical integrity.
In particular, the computer device may perform an integrity analysis in connection with video pictures, speech audio, and speech text of candidate points. The computer device may determine, based on the video frame, whether the candidate time point is a scenario ending point of the entire video or a scenario ending point of each unit in the video, and the computer device may reserve the candidate time point that belongs to the scenario ending point.
The computer device may also determine, based on the speech audio and the speech text, whether a candidate time point falls within a time paragraph in which speech audio and/or speech text occurs, and retain the candidate time points that do not fall within such paragraphs.
The computer device may integrate the results of these integrity analyses to determine the final target time point. By combining the three types of information, namely the video picture, the speech audio and the speech text, for integrity analysis, the integrity of the video content at the target time point can be ensured, the clipped video is prevented from breaking off abruptly in the middle of a complete plot at the target time point, and the overall quality of the subsequent video material and the user's interest in watching can be improved.
In some embodiments, when the computer device performs integrity analysis based on the video picture, the scenario ending point position in the video can be determined first, and the candidate time point position and the scenario ending point position are compared to obtain an integrity analysis result. For example, the computer device may perform difference calculation on the time point of the candidate time point and the time point of the scenario ending point, determine a time difference between the time point of the candidate time point and the time point of the scenario ending point, and when the time difference exceeds a set time difference threshold, indicate that the candidate time point is far from the scenario ending point, and determine that the scenario of the candidate time point does not have integrity. The time difference threshold may be set in units of minutes, such as 1min and 2min, or may be set in units of seconds, such as 10s and 48 s. In some embodiments, since the types of the videos to be processed are various, there may be only one or multiple scenario ending points determined by the computer device. Under the condition that the number of scenario ending points is multiple, the computer equipment can select scenario ending points close to the candidate time points for comparison, for example, select scenario ending points close in time for comparison, and obtain an integrity analysis result.
In some embodiments, when the computer device performs integrity analysis based on the speech audio and the speech text, it may determine the time paragraph in which speech audio occurs and the time paragraph in which speech text occurs around the candidate time point, and compare the candidate time point with these paragraphs to obtain the integrity analysis result. For example, the computer device may determine whether the candidate time point falls within a time paragraph in which speech audio appears, and if so, determine that the candidate time point does not have integrity. The computer device may also determine whether the candidate time point falls within a time paragraph in which speech text appears, and if so, determine that the candidate time point does not have integrity.
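The following is a minimal sketch that combines the two checks described above (closeness to a plot ending point, and not falling inside a speech or line paragraph). The 60 s gap threshold and the (start, end) interval representation are assumptions for illustration.

```python
# Illustrative integrity check; the 60 s gap threshold and the (start, end)
# interval representation of speech/line paragraphs are assumptions.
def is_complete(candidate_t, scenario_end_points, speech_intervals, line_intervals,
                max_gap_s: float = 60.0) -> bool:
    # rule 1: the candidate time point should lie close to some plot ending point
    near_ending = any(abs(candidate_t - e) <= max_gap_s for e in scenario_end_points)
    # rule 2: it should not fall inside a speech-audio or speech-text paragraph
    in_speech = any(s <= candidate_t <= e for s, e in speech_intervals)
    in_lines = any(s <= candidate_t <= e for s, e in line_intervals)
    return near_ending and not (in_speech or in_lines)

def select_target_time_points(candidates, scenario_end_points, speech_intervals, line_intervals):
    return [t for t in candidates
            if is_complete(t, scenario_end_points, speech_intervals, line_intervals)]
```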
In the video data processing method, a plurality of time points of the video to be processed are determined, and multi-modal data corresponding to each time point are acquired, wherein the multi-modal data at least comprise a video picture, speech audio and speech text. Further, for any time point, emotion recognition is performed according to the video picture and the speech audio of the time point to obtain an emotion recognition result, and the suspense score of the time point is then determined according to the video picture, the speech text and the emotion recognition result of the time point. That is, when the suspense score of a time point is calculated, data of multiple modalities such as the video picture, the speech text and the emotion recognition result are integrated, which ensures the accuracy of the suspense score and provides a certain robustness. Further, candidate time points with suspense are screened out according to the suspense score of each time point. Finally, integrity analysis is performed according to the multi-modal data of each candidate time point, and screening is performed again according to the integrity analysis result to determine a target time point, where the target time point is used to determine the video end point. In the present application, the suspense score is calculated from multi-modal data, candidate time points are obtained by screening according to the suspense score, integrity analysis is further performed on the candidate time points, and screening is performed again in combination with the integrity analysis result, so that the content at the video end point has a certain logical integrity and suspense, manual participation is completely eliminated, and clipping efficiency is greatly improved. Meanwhile, since the positioning of the target time point is a standardized operation, differences caused by different personnel are avoided, and the accuracy of video end point prediction is ensured.
In some embodiments, the emotion recognition results include a first emotion recognition result and a second emotion recognition result, and for any time point, performing emotion recognition according to the video picture and the speech audio of the time point to obtain the emotion recognition result includes: for any time point, performing facial emotion recognition according to the video picture of the time point to obtain a first emotion recognition result; and for any time point, performing speech emotion recognition according to the speech audio of the time point to obtain a second emotion recognition result.
The facial emotion recognition is a process of recognizing a face appearing in a video picture, such as a face of a person, a face of an animal, or the like, to determine a facial emotion. Facial emotion recognition may specifically include detecting a face in a video frame, and determining a facial emotion category of the detected face.
Speech emotion recognition is the process of recognizing speech that occurs in the speech audio, such as human speech or animal sounds, to determine the speech emotion. Speech emotion recognition may specifically include recognizing whether speech is present in the speech audio and, when speech is present, determining the speech emotion category of the recognized speech.
Specifically, for any time point, the computer device may rely on a pre-trained facial emotion recognition model when performing facial emotion recognition on the video picture, and on a pre-trained speech emotion recognition model when performing speech emotion recognition on the speech audio. Of course, the computer device may also adopt other algorithms that can be used for emotion recognition, which is not limited herein.
In some embodiments, the facial emotion recognition model and the speech emotion recognition model may be network models built on artificial intelligence algorithms. When constructing the facial emotion recognition model and the speech emotion recognition model, various algorithms may be used, such as supervised learning algorithms and unsupervised learning algorithms, which are not limited herein. For example, the facial emotion recognition model may be constructed with MTCNN (Multi-task Cascaded Convolutional Networks), PCA (principal component analysis) or LFA (local feature analysis); the speech emotion recognition model may be constructed with various algorithms such as a residual network (ResNet) or a feature pyramid network (FPN).
In some embodiments, the facial emotion recognition model may be composed of a plurality of models, for example, may include a neural network model capable of realizing facial positioning and a neural network model capable of realizing facial emotion recognition, and by adopting different neural network models to perform facial positioning and facial emotion recognition in a targeted manner, the accuracy of facial emotion recognition may be improved. Of course, the facial emotion recognition model may also be a neural network model including a face detection function and an expression recognition function, and facial emotion recognition is completed only by one neural network model, so that facial emotion recognition efficiency can be improved.
In some embodiments, the speech emotion recognition model may also be composed of multiple neural network models, for example, a neural network model that implements speech detection and a neural network model that implements speech emotion recognition; by using different neural network models for speech detection and speech emotion recognition respectively, the accuracy of speech emotion recognition can be improved. Of course, the speech emotion recognition model may also be a single neural network model that includes both a speech detection function and an emotion recognition function, and completing speech emotion recognition with only one neural network model can improve the efficiency of speech emotion recognition.
In the above embodiment, the computer device performs emotion recognition for the video picture and emotion recognition for the speech audio, obtaining corresponding emotion recognition results from the facial emotion and the speech emotion respectively. Because facial emotion and speech emotion are strongly associated with the plot, the finally located target time point has a certain robustness and is more accurate.
In some embodiments, for any time point, facial emotion recognition is performed according to the video picture of the time point, and a first emotion recognition result is obtained, including: for any time point, carrying out face detection according to the video picture of the time point, and intercepting a face image from the video picture based on a detection result; extracting features based on the face image to obtain a vector representation of the face image; and predicting the facial expression category to which the facial image belongs according to the vector representation of the facial image, and determining a first emotion recognition result according to the facial expression category.
Face detection is a process in which a face is located in a video picture. The vector representation may be used to reflect features of the facial image, i.e., semantic feature information of the facial image. The facial expression class is a facial emotion class of a facial image predicted based on semantic feature information of the facial image.
Specifically, for any time point, the computer device may use a neural network model for face detection trained in advance, perform face detection according to a video frame of the time point, locate a position of a face, and intercept a face existing in the video frame, such as a human face or an animal face, based on the located position, to obtain a face image. Further, the computer device may perform feature encoding on the face image by using a pre-trained network model for emotion recognition, obtain a vector representation of the face image, and finally perform facial emotion recognition on the face image based on the vector representation, to obtain a first emotion recognition result.
In some embodiments, the position of the face located by the computer device is given by the coordinates of a bounding box enclosing the face, specifically the four corner coordinates of the box, and the computer device can crop the face image based on these four coordinates.
In some embodiments, referring to fig. 3, a flow diagram for facial emotion recognition is shown:
The flow chart of fig. 3 includes a video frame, a face positioning module, a face coding module and an expression recognition module. The computer device can input multiple video frames into the face positioning module; the face positioning module locates the face image in each video frame and crops the located face image out of the video frame. Further, in the face coding module, the cropped face image can be encoded to obtain a vector representation of the face image; finally, in the expression recognition module, facial emotion recognition is performed based on the vector representation of the face image to obtain the first emotion recognition result.
When the video to be processed is a person-type video, the face positioning module may specifically refer to a face positioning module, and the face encoding module may specifically refer to a face encoding module.
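For orientation, the following is a minimal sketch of such a locate-crop-encode-classify pipeline. The detect_faces callable stands in for the MTCNN-style detector described below, and the ResNet-50 backbone (recent torchvision) with a two-class linear head is an illustrative assumption rather than the exact models used in this application.

```python
# Illustrative locate -> crop -> encode -> classify sketch. detect_faces() is a
# stand-in for the MTCNN-style detector described below; the ResNet-50 backbone
# and two-class linear head are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import models, transforms

backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()                 # keep the 2048-d face embedding
classifier = nn.Linear(2048, 2)             # e.g. excited / flat expression categories

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
])

@torch.no_grad()
def facial_emotion(frame_image, detect_faces):
    """frame_image: PIL image; detect_faces: callable returning [(x1, y1, x2, y2), ...]."""
    results = []
    for x1, y1, x2, y2 in detect_faces(frame_image):
        face = frame_image.crop((x1, y1, x2, y2))            # cut the face out of the frame
        embedding = backbone(preprocess(face).unsqueeze(0))  # vector representation, shape (1, 2048)
        results.append(int(classifier(embedding).argmax(dim=1)))
    return results                                           # one predicted category per face
```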
In some embodiments, face detection is carried out with a neural network model trained on the MTCNN architecture; referring to fig. 4, an example of the network structure of the whole MTCNN is shown:
The network structure of MTCNN can be divided into three modules, namely the candidate network (P-Net), the refinement network (R-Net) and the output network (O-Net). The following is the face detection procedure based on MTCNN:
(1) For the video picture of the time point, face detection can be performed on each video frame separately. The following takes face detection on one of the video frames as an example. First, the video frame can be resized at different scales to construct an image pyramid, so as to accommodate the detection of faces of different sizes. The image pyramid may be constructed as follows: the video frame is repeatedly resized by a set zoom factor (size_factor) until its size equals the size required by P-Net, which may be, for example, 12 × 12. In this way an image pyramid is obtained, consisting of the original video frame and its successively scaled versions. The value of size_factor is determined according to the actual distribution of face sizes in the video frame and may be set, for example, between 0.70 and 0.80. If size_factor is set too large, inference time tends to be prolonged; if it is set too small, small and medium-sized faces are easily missed. In the embodiment of the present application, size_factor may be set to 0.70, for example.
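A minimal sketch of computing the pyramid scales described in this step is given below; the 12-pixel P-Net input size and size_factor = 0.70 follow the text, while the function itself is only illustrative.

```python
# Sketch: compute the scales of the image pyramid. Each scale corresponds to one
# resized copy of the frame; scaling stops once the shorter side would fall
# below the 12-pixel size required by P-Net.
def pyramid_scales(width: int, height: int, size_factor: float = 0.70, min_size: int = 12):
    scales, scale = [], 1.0
    while min(width, height) * scale >= min_size:
        scales.append(scale)
        scale *= size_factor
    return scales

print(pyramid_scales(1280, 720))  # e.g. [1.0, 0.7, 0.49, ...] down to the 12-px limit
```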
(2) The obtained image pyramid is input into P-Net to obtain a large number of candidates. P-Net is the candidate network for face regions and comprises three convolution layers, whose sizes are 5×5×10, 3×3×16 and 1×1×12 respectively. Its output includes 3 parts: face classification, bounding box (bbox) regression and key point localization. The number of key points is not limited in the embodiments of the present application; taking 5 key points as an example, they may be the left eye, the right eye, the nose, the left mouth corner and the right mouth corner.
The first part of the output is used to determine whether the image contains a face; its output vector size is 1×1×2, that is, two values. The second part outputs the position of the bbox, which is commonly referred to as bounding box regression. The image input to P-Net may not be a perfect face frame: sometimes the face is not exactly square, and the face within the 12×12 patch may be offset to the left or right, so a relative offset toward a more accurate face frame position needs to be output. The offset has size 1×1×4, namely the relative offset of the abscissa of the upper-left corner of the frame, the relative offset of the ordinate of the upper-left corner, the error of the frame width and the error of the frame height. The third part outputs the positions of the key points; taking 5 key points as an example, they correspond to the positions of the left eye, the right eye, the nose, the left mouth corner and the right mouth corner. Each key point requires two dimensions, so the output vector size is 1×10.
All pyramid images are input into P-Net, which outputs a feature map, offsets and classification scores; the shape of the map is (m, n, 16), where m and n are the length and width of the map. According to the classification scores, most candidates can be filtered out; the remaining bboxes are then calibrated with the 4 offset values to obtain the upper-left and lower-right coordinates of each bbox. Non-maximum suppression (NMS) is then performed based on the intersection-over-union (IoU) values to screen out a large portion of the candidates. The specific process is as follows: sort by classification score from large to small to obtain a tensor of shape (num_left, 4), that is, the absolute upper-left and lower-right coordinates of num_left bboxes. Each round, compute the IoU between the bbox with the maximum score and the remaining bboxes, remove the bboxes whose IoU is greater than 0.6 (a preset threshold), and move the maximum-score bbox into the final result. Repeating this operation removes many heavily overlapping bboxes, finally yielding (num_left_after_nms, 16) candidates. These candidates are cropped from the original video frame according to their bbox coordinates, resized to 24×24 and input into R-Net.
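The NMS step just described can be sketched as follows; the box layout (x1, y1, x2, y2) and the NumPy implementation are illustrative, with the 0.6 IoU threshold taken from the text.

```python
# NMS sketch: repeatedly keep the highest-scoring box and drop boxes whose IoU
# with it exceeds the 0.6 threshold mentioned above. Boxes are (x1, y1, x2, y2).
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.6):
    order = scores.argsort()[::-1]          # sort by classification score, descending
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))              # move the highest-scoring bbox to the result
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```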
(3) The candidate images screened by P-Net are fine-tuned by R-Net. The R-Net structure differs from the P-Net structure in that it adds one more fully connected layer, which yields a better result. Before being input into R-Net, the images need to be resized to 24×24×3. The output of R-Net is the same as that of P-Net, and the purpose of R-Net is to remove a large number of non-face frames. According to the positions output by P-Net, partial images are cropped from the original video frame; when cropping, a square whose side equals the longer side of the bbox is cropped, so that resizing does not introduce deformation and more details around the face frame are preserved. R-Net still outputs the 2 values of the two-class one-hot classification result, the 4 values of the bbox coordinate offsets (that is, the face frame position) and the 10 values of the landmarks (that is, the key point positions). Most candidates that are not faces are removed according to the classification scores, the bbox offsets are applied to adjust the cropped boxes (that is, their upper-left and lower-right x and y coordinates), and NMS is repeated according to the IoU values to screen out most candidates. The final R-Net output is (num_left_after_R-Net, 16); partial images are again cropped from the original video frame according to the bbox positions and input into O-Net, likewise cropped as squares by the longer side to avoid deformation and preserve more detail.
(4) The images remaining after R-Net removes many candidates are input to O-Net, which outputs accurate bbox positions and landmark positions. In terms of network structure, O-Net has one more convolution layer than R-Net, so the processing result is finer. The input image size is 48×48×3, and the output includes the classification result, the position of the detected face frame, and the positions of the key points. The procedure is broadly the same as for P-Net, with the difference that, in addition to the location of the bbox (which may be represented by coordinates), the coordinates of the landmarks are also output. After classification screening and NMS on the adjusted boxes, the accurate position of the face frame and the positions of the landmarks are obtained.
In some embodiments, referring to fig. 5, which is an effect diagram of obtaining a face image by MTCNN-based face detection:

It can be seen from fig. 5 that, by inputting the video frame into the multitask convolutional neural network, that is, the MTCNN, face detection is carried out by the MTCNN, and the bbox coordinates of all faces present in the video frame can be obtained through the MTCNN. The bbox coordinates are the four values that locate the face frame. The computer device can crop the video frame using the bbox coordinates obtained by this localization, thereby obtaining the face image in the video frame.
In some embodiments, the computer device may encode the facial image based on an encoding network, an encoding algorithm, or the like, to obtain a vector representation of the facial image. The encoding network may be a Resnet, such as Resnet50 or Resnet101. Referring to fig. 6, a schematic structural diagram of a residual network is shown; the residual network may specifically be the Resnet50, which can be divided into 5 stages, namely Stage0, Stage1, Stage2, Stage3, and Stage4.
In some embodiments, the structure of each stage of the Resnet50 may be as shown in fig. 7. Stage0 is relatively simple in structure and may be regarded as preprocessing of the input, while the last 4 stages are all composed of Bottleneck convolution blocks (BTNK) and have relatively similar structures. The Stage0 input has a shape of (3,224,224), where 3 is the number of channels and the two 224 values are the height and width, respectively. The Stage0 structure includes a first layer and a second layer; the first layer comprises 3 sequential operations, namely a convolution (Conv), batch normalization (Batch Normalization, BN), and a linear rectification function (Rectified Linear Unit, ReLU). The Conv has a convolution kernel size of 7×7, the number of convolution kernels is 64, and the stride of the convolution kernel is 2, which can be expressed as /2.
The second layer is a max pooling layer (MAXPOOL) with a kernel size of 3×3 and a stride of 2. The shape of the Stage0 output is (64,56,56), where 64 equals the number of convolution kernels in the first layer of Stage0, and 56 equals 224/2/2 (each stride of 2 halves the input size).
Stage1 contains 3 bottlenecks (1 Bottleneck convolution 1 (BTNK1) and 2 Bottleneck convolutions 2 (BTNK2)), and the remaining 3 stages contain 4 bottlenecks (1 BTNK1 and 3 BTNK2), 6 bottlenecks (1 BTNK1 and 5 BTNK2), and 3 bottlenecks (1 BTNK1 and 2 BTNK2), respectively. The shape of the Stage1 output is (256,56,56), the shape of the Stage2 output is (512,28,28), the shape of the Stage3 output is (1024,14,14), and the shape of the Stage4 output is (2048,7,7). Thus, the facial image is transformed by the Resnet50 into a 2048-dimensional vector representation that can represent the semantic feature information of the facial image.
It should be noted that the structures of BTNK1 and BTNK2 may be as shown in fig. 8. BTNK2 has 2 variable parameters, namely C and W in the input shape (C, W), and consists of 3 convolution layers on its left side together with the associated BN and ReLU. Compared with BTNK2, BTNK1 has 1 more convolution layer on the right side, which serves to match the difference between the input and output dimensions.
In the above embodiment, for any time point, the computer device first locates the position of the face, and crops the video frame based on the located face position to obtain the face image. Further, the computer device performs feature encoding on the facial image to obtain a vector representation of the facial image, and finally performs facial expression recognition based on the vector representation, so that the first emotion recognition result can be obtained accurately.
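As an illustrative sketch of the feature encoding step, the cropped face image can be turned into a 2048-dimensional vector with a pretrained Resnet50; the use of torchvision and ImageNet weights here is an assumption, not something prescribed by this description.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load an ImageNet-pretrained ResNet50 (assumed weights) and drop its classification
# head, keeping Stage0-Stage4 plus global average pooling -> 2048-dim vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1])   # remove the final fc layer
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                               # match the (3,224,224) input shape
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_face(face_image: Image.Image) -> torch.Tensor:
    x = preprocess(face_image).unsqueeze(0)        # (1, 3, 224, 224)
    with torch.no_grad():
        feat = encoder(x)                          # (1, 2048, 1, 1)
    return feat.flatten(1).squeeze(0)              # 2048-dimensional vector representation
```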
In some embodiments, for any time point, performing speech emotion recognition according to the speech audio of the time point to obtain a second emotion recognition result includes: for the targeted time point, extracting an acoustic feature representation of the speech audio of the time point; encoding based on the acoustic feature representation to obtain an encoded feature output; performing speech recognition according to the encoded feature output to obtain a speech recognition result for the targeted time point; and, in the case that the speech recognition result indicates that speech is present, performing speech emotion recognition based on the encoded feature output to obtain the second emotion recognition result.
The acoustic feature representation is speech feature information that characterizes the speech audio. Speech recognition here is the process of recognizing whether speech is present in the speech audio, for example whether the audio contains a human voice or other speech that can be used for speech emotion recognition.
In particular, for any point in time, the computer device may extract an acoustic feature representation of that point in time and encode the acoustic feature representation using a pre-trained speech emotion recognition model or encoding algorithm. The acoustic feature representation can be encoded by adopting convolution kernels with different sizes or convolution kernels with different dimensions in the speech emotion recognition model, so that more detailed speech feature information can be obtained when the acoustic feature representation is encoded. Further, the computer equipment performs voice recognition based on the obtained coding feature output, determines whether voice exists in the speech of the time point, and performs voice emotion recognition based on the coding feature output under the condition that voice exists, so as to obtain a second emotion recognition result.
In some embodiments, the acoustic feature representation is speech feature information comprising time domain features and frequency domain features of the speech audio. The time domain features contain a substantial amount of time domain information, such as audio loudness and sample point amplitude, while the frequency domain features contain sample point frequency information, so extracting the acoustic feature representation can reflect the characteristics of the speech audio and facilitates speech recognition. Of course, the acoustic feature representation may also include only the time domain features of the speech audio in the time dimension, or only the frequency domain features in the frequency dimension; the speech feature information included in the acoustic feature representation may be adjusted adaptively in combination with the accuracy requirements of speech recognition and the like.
In the above embodiments, for any time point, the computer device may extract an acoustic feature representation of that time point and encode the acoustic feature representation. Further, the computer device performs speech recognition based on the obtained encoded feature output, determines whether speech exists in the speech audio of the time point, and finally, in the case that speech exists, performs speech emotion recognition based on the encoded feature output, so that the second emotion recognition result can be obtained accurately.
In some embodiments, encoding based on the acoustic feature representation results in an encoded feature output, comprising: performing first encoding based on the acoustic feature representation to obtain a first encoded output; the first coding is coding focusing on time domain perception; performing second encoding based on the acoustic feature representation to obtain a second encoded output; the second code is a multidimensional code; performing third coding based on the acoustic characteristic representation to obtain third coding output; the third coding is a coding focusing on frequency domain perception; an encoding characteristic output is determined from the first encoded output, the second encoded output, and the third encoded output.
Wherein the first encoding focuses on encoding the representation of the acoustic features from the time dimension, the multi-dimensional encoding focuses on encoding the representation of the acoustic features from both the time dimension and the frequency dimension, and the third encoding focuses on encoding the representation of the acoustic features from the frequency dimension.
In particular, when the computer device encodes the acoustic feature representation, different processing branches may be used to encode it in a targeted manner. For the first encoding, the computer device may use a first one-dimensional encoding processing branch, which encodes the acoustic feature representation along a single dimension and focuses on extracting information in the time dimension. For the second encoding, the computer device may use a two-dimensional encoding processing branch, which encodes the acoustic feature representation along two dimensions simultaneously, i.e., extracts information in both the time and frequency dimensions. For the third encoding, the computer device may use a second one-dimensional encoding processing branch, which focuses on extracting information in the frequency dimension.
In some embodiments, the encoding focusing on time domain perception may be only performing time dimension information extraction on the acoustic feature representation, and of course, may also be performing time dimension information extraction from the time dimension and then performing frequency dimension information extraction from the frequency dimension, so that frequency domain information may be obtained while focusing on time domain information acquisition, so as to keep complementation between the time domain information and the frequency domain information, and improve accuracy of subsequent speech emotion recognition.
In some embodiments, the encoding focusing on frequency domain sensing may be only performing frequency-dimensional information extraction on the acoustic feature representation, and of course, may also be performing frequency-dimensional information extraction from the frequency dimension first and then performing time-dimensional information extraction from the time dimension, so that time domain information may be acquired while focusing on frequency domain information acquisition, so as to keep complementarity between time domain information and frequency domain information, and improve accuracy of subsequent speech emotion recognition.
In some embodiments, the computer device may implement the first one-dimensional encoding processing branch, the second one-dimensional encoding processing branch, and the two-dimensional encoding processing branch by convolution when encoding the acoustic feature representation along different dimensions. If convolution layers of different dimensionality are adopted, a one-dimensional convolution layer encodes along a single dimension, so the first and second one-dimensional encoding processing branches can encode through one-dimensional convolution layers, and whether that single dimension is the time dimension or the frequency dimension can be set in combination with actual requirements; a two-dimensional convolution layer encodes along two dimensions simultaneously, so the two-dimensional encoding processing branch can encode through a two-dimensional convolution layer.
In some embodiments, after the first encoded output, the second encoded output, and the third encoded output are obtained, the computer device may process the three encoded outputs, such as a fusion process, an averaging process, etc., so that the time domain and the frequency domain may be kept complementary in information, while still obtaining more detailed characteristic information.
In the above embodiment, when the computer device encodes the acoustic feature representation, the first encoding, the second encoding and the third encoding are performed on the acoustic feature representation respectively, so as to obtain the encoded output with different emphasis dimensions, and finally, the encoded feature output is determined according to the encoded output with different emphasis dimensions, so that the computer device focuses on the perception learning in the time dimension and the perception learning in the frequency dimension, and can obtain more detailed information.
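A minimal sketch of such a three-branch encoder, written in PyTorch, is given below. The channel counts, depths, and activations are assumptions made for illustration; only the orientation of the kernels (time-biased, two-dimensional, frequency-biased) and the averaging of the branch outputs follow the description above.

```python
import torch
import torch.nn as nn

class ThreeBranchEncoder(nn.Module):
    """Sketch of the three encoding branches: time-biased, two-dimensional, frequency-biased.

    Input is a spectrogram-like acoustic feature of shape (batch, 1, T, F),
    where T is the number of time frames and F the number of frequency bins.
    """
    def __init__(self, ch=32):
        super().__init__()
        # Branch 1: first half biased to the time axis (3x1), second half to frequency (1x3).
        self.time_branch = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=(3, 1), padding="same"), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=(1, 3), padding="same"), nn.ReLU(),
        )
        # Branch 2: small two-dimensional kernels, time and frequency jointly.
        self.joint_branch = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=3, padding="same"), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding="same"), nn.ReLU(),
        )
        # Branch 3: first half biased to the frequency axis (1x2), second half to time (2x1).
        self.freq_branch = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=(1, 2), padding="same"), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=(2, 1), padding="same"), nn.ReLU(),
        )

    def forward(self, spec):
        outs = [self.time_branch(spec), self.joint_branch(spec), self.freq_branch(spec)]
        return torch.stack(outs, dim=0).mean(dim=0)   # combine the branches by averaging
```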
In some embodiments, performing the first encoding based on the acoustic feature representation to obtain a first encoded output includes: performing time-dimension feature extraction on the acoustic feature representation through a convolution kernel of a first size, and then performing frequency-dimension feature extraction, to obtain a first intermediate feature; performing feature extraction on the acoustic feature representation through a convolution kernel of a second size to obtain a second intermediate feature, where the second size is greater than the first size; and fusing the first intermediate feature and the second intermediate feature to obtain the first encoded output.
Wherein the first size convolution kernel is a small size convolution kernel and the second size convolution kernel is a larger convolution kernel than the first size convolution kernel.
In some embodiments, the small size may be smaller than a first preset size, and the large size may be larger than a second preset size, the second preset size being greater than or equal to the first preset size. The actual values of the first preset size and the second preset size can be set adaptively in combination with the specific video data processing scene, the video data processing precision, and the like.
In some embodiments, when feature extraction is performed on the acoustic features using convolution kernels of the first size, the first one-dimensional processing branch may be formed of a plurality of convolution kernels. For encoding that emphasizes time domain perception, the sizes of the earlier convolution kernels in the first one-dimensional processing branch may be set to be more biased towards the time domain, i.e., the convolution kernels are larger along the time axis; for example, the sizes of the convolution kernels in the first half are set to be more biased towards the time domain, and the sizes of the convolution kernels in the second half are set to be more biased towards the frequency domain. Of course, two-thirds of the convolution kernels in the first one-dimensional processing branch may instead be more biased to the time domain; how many convolution kernels are biased to the time domain can be set in combination with actual requirements, as long as the resulting first one-dimensional processing branch is, overall, more biased towards extracting time domain information.

In some embodiments, the input acoustic features may be represented by t×n, where t represents the time domain information and n represents the frequency domain information. In the first one-dimensional processing branch, the size of the first-half convolution kernels may be 3×1 and the size of the second-half convolution kernels may be 1×3. Here 3×1 means that the convolution kernel has size 3 in the time domain and size 1 in the frequency domain, so it is larger and more perceptive along the time axis and therefore extracts more time domain information during computation.
In some embodiments, when the computer device performs feature extraction on the acoustic feature representation through the convolution kernel of the first size, the convolution kernel of the first size is further followed by a pooling layer adapted to it, so that the first intermediate feature is obtained through the convolution kernel of the first size and the pooling layer.

In other embodiments, when the computer device performs feature extraction on the acoustic feature representation through the convolution kernel of the second size, the convolution kernel of the second size is likewise followed by a pooling layer adapted to it, so that the second intermediate feature is obtained through the convolution kernel of the second size and the pooling layer.
In the above embodiment, the computer device performs feature extraction focusing on the time dimension through the convolution kernel of the first size to obtain the first intermediate feature, then performs extraction with the convolution kernel of the second size, which is larger than the first size, to obtain the second intermediate feature, and obtains the first encoded output by fusing the first intermediate feature and the second intermediate feature. In this way, features of both the time domain and the frequency domain are obtained, and because convolution kernels of different sizes are used, the information obtained is more complete, which effectively improves the accuracy of video data processing.
In some embodiments, performing the third encoding based on the acoustic feature representation to obtain a third encoded output includes: performing frequency-dimension feature extraction on the acoustic feature representation through a convolution kernel of the first size, and then performing time-dimension feature extraction, to obtain a third intermediate feature; performing feature extraction on the acoustic feature representation through a convolution kernel of the second size to obtain a second intermediate feature, where the second size is greater than the first size; and fusing the third intermediate feature and the second intermediate feature to obtain the third encoded output.
The feature extraction focused on the frequency dimension in this embodiment is similar to the feature extraction focused on the time dimension, and the number of convolution kernels on the second one-dimensional processing branch may be set to obtain a third intermediate feature focused on the frequency dimension.
In some embodiments, the second one-dimensional processing branch may be formed of a plurality of convolution kernels when the acoustic features are feature extracted using a convolution kernel of a first size. For coding focusing on frequency domain perception, the size of the preceding convolution kernel in the second one-dimensional processing branch may be set to be more biased towards the frequency domain, i.e. the convolution kernel is larger in frequency domain size, e.g. the size of the convolution kernel in the first half is set to be more biased towards the frequency domain, and the size of the convolution kernel in the second half is set to be more biased towards the time domain. Of course, the size of two-thirds convolution kernels in the second one-dimensional processing branch may be more biased to the frequency domain, and specifically how many convolution kernels are more biased to the frequency domain may be set in combination with the actual requirement, so long as the set second one-dimensional processing branch is determined to be more biased to the extraction of the frequency domain information.
In some embodiments, the input acoustic features may be represented by t×n, where t represents the time domain information and n represents the frequency domain information. In the second one-dimensional processing branch, the size of the first-half convolution kernels is 1×2 and the size of the second-half convolution kernels is 2×1. Here 1×2 means that the convolution kernel has size 2 in the frequency domain and size 1 in the time domain, so it is larger and more perceptive along the frequency axis and therefore extracts more frequency domain information during computation.
In some embodiments, when the computer device performs feature extraction on the acoustic feature representation through the convolution kernel of the first size, the convolution kernel of the first size is further followed by a pooling layer adapted to it, so that the third intermediate feature is obtained through the convolution kernel of the first size and the pooling layer.
In the above embodiment, the computer device performs feature extraction focusing on the frequency dimension through the convolution kernel of the first size to obtain the third intermediate feature, then extracts the acoustic feature representation with the convolution kernel of the second size, which is larger than the first size, to obtain the second intermediate feature, and obtains the third encoded output by fusing the third intermediate feature and the second intermediate feature. In this way, features of both the time domain and the frequency domain are obtained, and because convolution kernels of different sizes are used, the information obtained is more complete, which effectively improves the accuracy of video data processing.
In some embodiments, referring to fig. 9, which is a schematic flow chart of performing speech emotion recognition on the speech audio to obtain the second emotion recognition result:
Fig. 9 is a schematic structural diagram of the speech emotion recognition model according to the present embodiment. The speech emotion recognition model first processes the speech audio to obtain an acoustic feature representation. The speech emotion recognition model may compute logarithmic mel (log-mel) spectrum features of the speech audio to obtain mel-frequency information; the mel frequency is a nonlinear frequency scale. It will be appreciated that the speech audio itself is a time domain signal and also contains a great deal of time domain information, such as audio loudness and sample point amplitude, which can reflect the characteristics of the speech and is helpful for speech recognition. Thus, the acoustic feature representation is a feature comprising both time domain information and frequency domain information.
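For illustration, a log-mel acoustic feature representation of this kind might be computed as follows with librosa; the sample rate, FFT size, hop length, and number of mel bins are assumptions for the sketch.

```python
import librosa
import numpy as np

def log_mel_features(wav_path, sr=16000, n_mels=64):
    """Compute a log-mel spectrogram as the acoustic feature representation.

    Returns an array of shape (t, n): t time frames along one axis,
    n mel-frequency bins along the other.
    """
    y, _ = librosa.load(wav_path, sr=sr)                       # time-domain speech signal
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)             # logarithmic mel (dB) spectrum
    return log_mel.T                                           # (t, n) layout as in the text
```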
After the acoustic feature representation is obtained, the speech emotion recognition model can perform speech recognition on the acoustic feature representation of the currently input speech sound to obtain a speech recognition result. And carrying out voice emotion recognition when the voice recognition result represents that voice exists, wherein the voice emotion recognition result can be used for weighting the subsequent second coded data.
The speech emotion recognition model is divided into three branches, namely a left branch, a middle branch, and a right branch. Among the three branches, the left and right branches are formed by one-dimensional small convolution kernels, namely convolution kernels of the first size, plus a residual connection, where the residual connection is obtained through convolution kernels of the second size. The middle branch is formed by small two-dimensional convolution kernels. The outputs of the three branches are combined by taking their mean, and finally a pooling layer and shape reconstruction (reshape) are applied to obtain the encoded feature output.
Further, whether to enable the speech emotion recognition branch is decided according to whether speech is determined, based on the encoded feature output, to be present in the speech audio. After the speech emotion recognition branch is enabled, the second emotion recognition result can be determined based on the pooling layer, the convolution layer, the shape reconstruction, the multi-layer perceptron, and the classification layer.
In fig. 9, the upper half of the left branch, i.e., the first, second, and third layers, consists of one-dimensional convolution and one-dimensional pooling in the time dimension, while the lower half, i.e., the fourth, fifth, and sixth layers, consists of one-dimensional convolution and one-dimensional pooling in the frequency dimension; the architecture of the right branch is the reverse of the left branch. The purpose of this architecture is to make half of the network layers focus on perception learning in the time domain and half on perception learning in the frequency domain, so that the network can acquire detailed information in the time domain and the frequency domain respectively. In addition, the left and right branches each add a residual connection, that is, a large convolution kernel and a large pooling layer whose result is spliced with the result of that branch, so that semantic information combining the lower and higher layers can be obtained.
The middle branch uses small two-dimensional convolution kernels and corresponding pooling layers. Finally, the features of the three branches are computed and averaged, and the final encoded feature output is obtained through a pooling layer on one channel. A classification layer (softmax) for two-class classification is arranged in the middle branch to judge whether the audio contains a human voice.
After this judgment, if the current speech audio is an audio segment containing a human voice, the branch for human-voice emotion recognition and classification is enabled, i.e., the lower-right part of the figure. The lower-right part consists of a pooling layer, a convolution layer, a shape reconstruction layer, a multi-layer perceptron, and a classification layer, and the second emotion recognition result can be computed after passing through the classification layer of this part.
A large number of small convolution kernels are used in the speech emotion recognition model to build the convolution layers, replacing large convolution kernels. To achieve the same scale of convolution computation with small kernels, a deeper network is required, so the overall network is deeper when small convolution kernels are used. Using small convolution kernels allows more detail to be noticed during feature computation, because their smaller size lets them attend to fine-grained variation throughout the convolution computation. This application uses the model for speech recognition and speech emotion classification of the speech audio, and since the differences between the voice characteristics of different people are very small, small convolution kernels make it easier to capture such details.
In some embodiments, the emotion recognition results include a first emotion recognition result and a second emotion recognition result; determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point, comprising: acquiring first coding data of a video picture of a time point, and acquiring video picture coding characteristics according to the first coding data and a first emotion recognition result; acquiring second coding data of the line text of the time point, and acquiring text coding features according to the second coding data and a second emotion recognition result; fusing the video picture coding features and the text coding features of the time points to obtain multi-mode fusion features of the time points; and determining the suspense score of the targeted time point according to the multi-modal fusion characteristics.
The first encoded data is obtained by encoding the video frames included in the video picture; the second encoded data is obtained by encoding the speech text.
Specifically, for each time point, the computer device may determine a first emotion recognition result and first encoded data of the time point, determine a target confidence level corresponding to the facial emotion through the first emotion recognition result, and weight or reduce the weight of the first encoded data based on the target confidence level, so as to obtain the video picture coding feature. The computer equipment can also determine a second emotion recognition result and second coding data of the time point, determine a target confidence coefficient corresponding to the voice emotion through the second emotion recognition result, and weight or reduce weight the second coding data based on the target confidence coefficient to obtain text coding characteristics.
In some embodiments, the computer device may employ various types of image coding networks to encode the entire video frames included in the video picture, such as the Resnet50 network mentioned above, or other coding networks such as VGG (Visual Geometry Group network); this is not limited here.
In some embodiments, the computer device may input the entire sequence of video frames for the time point into a Resnet50 network, encode the entire sequence of video frames into a sequence of embeddings based on the Resnet50 network, and take the embedding sequence of the video pictures as the first encoded data.
In some embodiments, the computer device may encode the speech text using various types of text encoding networks to obtain the second encoded data. For example, the text encoding network may be BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pre-trained Transformer), GPT-4, CLIP (Contrastive Language-Image Pre-training), and the like, which are not limited in the embodiments of the present application.
In some embodiments, the computer device encodes the speech text using a BERT model. The input to the BERT model is the speech-line text corresponding to the time point, and the model outputs an embedding for each line of text at that time point; this embedding sequence can be used as the second encoded data.
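A minimal sketch of this step using the transformers library is shown below; the checkpoint name and the use of the [CLS] hidden state as the per-line embedding are assumptions of the sketch, not requirements of this description.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
text_encoder = BertModel.from_pretrained("bert-base-chinese")
text_encoder.eval()

def encode_lines(line_texts):
    """line_texts: list of speech-line strings for one time point.
    Returns one embedding per line, forming the second encoded data sequence."""
    inputs = tokenizer(line_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    # Use the [CLS] token state as the embedding of each line of text.
    return outputs.last_hidden_state[:, 0, :]      # (num_lines, hidden_size)
```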
In some embodiments, when the computer device fuses the video picture coding feature and the text coding feature, various fusion modes such as early fusion and late fusion may be adopted for multi-modal fusion.
In some embodiments, for the targeted time point, when determining the target confidence corresponding to the facial emotion, the computer device may determine a maximum confidence in the confidence of the first target class corresponding to each frame of video frame of the time point, determine a first emotion weighted value based on the maximum confidence and the first target class corresponding to the maximum confidence, and use the first emotion weighted value as the target confidence of the facial emotion.
The first target category of each video frame is the category with the highest computed confidence for that video frame. For example, for one of the video frames, the confidence that its facial expression category belongs to the surprise category is computed as 0.3, the confidence of the fear category as 0.3, the confidence of the laugh category as 0.6, the confidence of the cry category as 0.2, the confidence of the shout category as 0.5, and the confidence of the anger category as 0.1; the confidence of the no-expression category is 0.4, the confidence of the smile category is 0.7, and the confidence of the sad category is 0.45. It can be seen that the first target category of this video frame is the smile category.
In some embodiments, when determining the first emotion weighted value based on the maximum confidence and the first target class corresponding to the maximum confidence, the computer device may directly use the maximum confidence as the first emotion weighted value if the first target class corresponding to the maximum confidence belongs to the excited class, so as to weight the first encoded data by the first emotion weighted value. If the first target class corresponding to the maximum confidence belongs to the flat class, the computer device may set the first emotion weighted value to 0, or assign it a smaller value, for example a value smaller than a set threshold, where the set threshold may be 0.2, 0.3, and so on, so as to down-weight the first encoded data through the first emotion weighted value.
In some embodiments, there is a time point a comprising three video frames: a first video frame, a second video frame, and a third video frame. The first target class of the first video frame is fear, and the confidence is 0.9; the first target class of the second video frame is laugh, and the confidence is 0.7; the first target class of the third video frame was crying with a confidence level of 0.6. The maximum confidence is 0.9, the corresponding first target category is fear, the fear belongs to the excited category, and the first emotion weighting value can be determined to be 0.9.
In some embodiments, there is a time point B comprising three video frames: a first video frame, a second video frame, and a third video frame. The first target class of the first video frame is sad, and the confidence is 0.8; the first target class of the second video frame is surprised with a confidence level of 0.4; the first target category of the third video frame is anger with a confidence level of 0.7. The maximum confidence is 0.8, the corresponding first target category is sadness, the sadness belongs to the flat category, and the first emotion weighting value can be determined to be 0.
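The weighting rule and the two examples above can be summarized in the following sketch; the excited/flat category lists are assumptions inferred from the examples in this description.

```python
# Assumed category split, inferred from the examples in this description.
EXCITED = {"surprise", "fear", "laugh", "cry", "shout", "anger"}
FLAT = {"no_expression", "smile", "sad"}

def first_emotion_weight(frame_results):
    """frame_results: list of (first_target_category, confidence) pairs,
    one per video frame of the time point. Returns the first emotion weighted value."""
    category, confidence = max(frame_results, key=lambda r: r[1])  # maximum confidence
    if category in EXCITED:
        return confidence      # excited category: weight the first encoded data up
    return 0.0                 # flat category: 0 (or a small preset value)

# Time point A: fear 0.9, laugh 0.7, cry 0.6 -> fear is excited -> weight 0.9
print(first_emotion_weight([("fear", 0.9), ("laugh", 0.7), ("cry", 0.6)]))
# Time point B: sad 0.8, surprise 0.4, anger 0.7 -> sad is flat -> weight 0.0
print(first_emotion_weight([("sad", 0.8), ("surprise", 0.4), ("anger", 0.7)]))
```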
In some embodiments, for the targeted time point, when determining the target confidence corresponding to the speech emotion, the computer device may determine the confidence of the second target class corresponding to the speech audio of the time point, determine a second emotion weighted value based on this confidence (the maximum confidence) and the second target class corresponding to it, and use the second emotion weighted value as the target confidence of the speech emotion.
The second target category of the speech audio is the category with the highest computed confidence for that speech audio. For example, for the speech audio, the confidence that its speech emotion category belongs to the surprise category is computed as 0.7, the confidence of the fear category as 0.2, the confidence of the happy category as 0.9, the confidence of the cry category as 0.1, the confidence of the shout category as 0.6, and the confidence of the plain category as 0.7; the confidence of the sad category is 0.4, and the confidence of the fear category is 0.5. It can be seen that the second target category of the speech audio is the happy category.
In some embodiments, when determining the second emotion weighted value based on the maximum confidence and the second target class corresponding to the maximum confidence, the computer device may directly use the maximum confidence as the second emotion weighted value if the second target class corresponding to the maximum confidence belongs to the excited class, so as to weight the second encoded data by the second emotion weighted value. If the second target class corresponding to the maximum confidence belongs to the flat class, the computer device may set the second emotion weighted value to 0, or assign it a smaller value, for example a value smaller than a set threshold, where the set threshold may be 0.1, 0.25, and so on, so as to down-weight the second encoded data through the second emotion weighted value.
In some embodiments, there is a time point C, where the second target category of the speech emotion category of the corresponding speech audio is anger, the confidence is 0.8, anger belongs to the excited category, and the second emotion weighting value is 0.8.
In some embodiments, there is a time point D, where the second target class of the speech emotion class of the corresponding speech audio is sad, the confidence is 0.9, the sad belongs to the flat class, and it may be determined that the second emotion weighting value is 0.
In the above embodiment, for each time point, the computer device may determine the first emotion recognition result and the first encoded data for the time point, and obtain the video picture encoding feature. The computer equipment can also determine a second emotion recognition result and second coding data of the time points to obtain text coding characteristics, so that a certain weight can be given to the corresponding coding data based on the emotion recognition result, the representation of the modes of each time point can be improved, the characteristic duty ratio in the calculation of the suspense score can be adjusted, and the accuracy of the calculation of the final suspense score can be improved. And finally, fusing the video picture coding features and the text coding features to obtain multi-mode fusion features, and further improving the accuracy of the calculation of the suspense score based on the multi-mode fusion features.
In some embodiments, the multi-modal data further includes interactive data, and the obtaining video picture coding features according to the first coding data and the first emotion recognition result includes: determining video picture coding characteristics according to the first coding data, the first emotion recognition result and the interactive data; obtaining text coding features according to the second coding data and the second emotion recognition result, wherein the text coding features comprise: and determining the text coding characteristic according to the second coding data, the second emotion recognition result and the interaction data.
The interactive data is data generated by the interaction of an object (such as a user) on the video playing platform based on the video to be processed, and can be, for example, barrage data, comment data and the like.
The bullet screen data are comment subtitles that pop up over the video to be processed. Bullet screens give users a feeling of real-time interaction; although different bullet screens are sent at different times, bullet screens sent at the same moment basically share the same topic. Meanwhile, because sending bullet screens is an autonomous behavior of users, the bullet screen data can reflect, to a certain extent, the popularity of the video to be processed.
Specifically, for each time point, the computer device may determine a first emotion recognition result, first encoding data and interaction data of the time point, determine a target confidence level corresponding to the facial emotion according to the first emotion recognition result, and weight or reduce weight of the first encoding data based on the target confidence level and the interaction data, so as to obtain video picture coding features. The computer equipment can also determine a second emotion recognition result and second coding data of the time point, determine a target confidence coefficient corresponding to the voice emotion through the second emotion recognition result, and weight or reduce weight the second coding data based on the target confidence coefficient and the interaction data to obtain text coding characteristics.
In some embodiments, the computer device may obtain the interactive data generated during historical playback of the video to be processed. For the case where the video to be processed is a newly released video that may not yet have any interactive data, the trend of the bullet screen data can be simulated by building a model that predicts the bullet screen data trend.
In some embodiments, as shown in fig. 10, when the interactive data is bullet screen data, the bullet screen data trend map obtained by the computer device is shown. At each time point, there is a corresponding bullet screen data volume. The computer device can normalize the barrage data amount of each time point, normalize the barrage data amount to a numerical value between 0 and 1, then use the numerical value to weight the first coded data by combining the first emotion recognition result, and weight the second coded data by combining the second emotion recognition result.
In the above embodiment, the computer device introduces the interactive data as a basis for judging the actual viewing response of users, so that the suspense calculation incorporates real interactive data, and the bullet screen data can, to a certain extent, correct the finally located suspense point. As a result, the content at the finally selected target time point better fits the suspense perception of real users, which improves the suspense accuracy of the target time point and gives users more interest in watching the feature film.
In some embodiments, the first emotion recognition result includes a first target category corresponding to each of the N frames of video frames for the point in time and a confidence level of each of the N first target categories; n is a natural number greater than 1; determining video picture coding features according to the first coding data, the first emotion recognition result and the interactive data, wherein the video picture coding features comprise: determining the maximum confidence level in the confidence levels of the N first target categories; determining a first emotion weighting value based on the maximum confidence level and a first target class corresponding to the maximum confidence level; determining a first interaction weighted value according to the interaction data; and determining the video picture coding characteristic according to the first coding data, the first emotion weighting value and the first interaction weighting value of the aimed time point.
Specifically, for any time point, the manner in which the computer device determines the first emotion weighted value of the time point may refer to the manner in which the first emotion weighted value is determined when the interactive data is not introduced in the above embodiment, which is not described herein. The computer device may perform normalization processing based on the determined interactive data, take a result of the normalization processing as a first interaction weighted value, and multiply the first encoded data of the time point location, the first emotion weighted value, and the first interaction weighted value to determine the video picture coding feature.
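For illustration, the interaction weighting and the multiplication described above might be sketched as follows; min-max normalization of the bullet screen counts is an assumed choice, since the description only states that the counts are normalized to values between 0 and 1.

```python
import numpy as np

def normalize_barrage(counts):
    """Normalize per-time-point bullet screen counts into [0, 1]; each normalized
    value serves as the interaction weighted value of its time point."""
    counts = np.asarray(counts, dtype=np.float32)
    span = counts.max() - counts.min()
    return (counts - counts.min()) / span if span > 0 else np.zeros_like(counts)

def video_picture_feature(first_encoded, emotion_weight, interaction_weight):
    """first_encoded: (num_frames, dim) embedding sequence of the video frames at one
    time point. Scales the whole sequence by the emotion and interaction weights."""
    return first_encoded * emotion_weight * interaction_weight
```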
In the above embodiment, for any time point, the computer device determines the video picture coding feature by multiplying the first coding data, the first emotion weighted value and the first interaction weighted value based on the time point, so that the relevance between different modes is fully considered, and the accuracy of the suspense score calculation is improved.
In some embodiments, determining the first emotion weight value based on the maximum confidence level and the first target class to which the maximum confidence level corresponds comprises: determining a first emotion weighting value based on the maximum confidence level under the condition that a first target class corresponding to the maximum confidence level belongs to an excited class; and taking the preset numerical value as a first emotion weighting value under the condition that the first target class corresponding to the maximum confidence coefficient belongs to the flat class.
Specifically, when determining the first emotion weighted value based on the maximum confidence coefficient and the first target class corresponding to the maximum confidence coefficient, if the first target class corresponding to the maximum confidence coefficient belongs to the excited class, the computer device may directly use the maximum confidence coefficient as the first emotion weighted value, so as to perform weighted processing on the first encoded data through the first emotion weighted value. If the first target class corresponding to the maximum confidence belongs to the flat class, the computer device may set the first emotion weighted value to 0, or assign a smaller value to the first emotion weighted value, for example, the first emotion weighted value is smaller than a set threshold, where the set threshold may be 0.2, 0.3, and so on, so as to perform weight reduction processing on the first encoded data through the first emotion weighted value.
In the above embodiment, the computer device sets different first emotion weighted values based on the first target category corresponding to the maximum confidence level, so as to adjust the feature duty ratio during calculation. Because the suspense is often generated in the excited expression, when the first target category belongs to the excited category, the first coded data is weighted through the first emotion weight value, and when the first target category belongs to the flat category, the first coded data is subjected to weight reduction through the first emotion weight value, so that the accuracy of final suspense calculation can be improved.
In some embodiments, the second emotion recognition result includes a second target category corresponding to the speech audio of the targeted point in time, and a confidence level of the second target category; determining text encoding features based on the second encoded data, the second emotion recognition result, and the interactive data, comprising: determining a second emotion weighting value based on the confidence level of the second target class; determining a second interaction weighted value according to the interaction data; and determining the text coding characteristic according to the second coding data, the second emotion weighted value and the second interaction weighted value of the aimed time point.
Specifically, for any time point, the manner in which the computer device determines the second emotion weighted value of the time point may refer to the manner in which the second emotion weighted value is determined when the interactive data is not introduced in the above embodiment, which is not described herein. The computer device may perform normalization processing based on the determined interactive data, take a result of the normalization processing as a second interaction weighted value, and multiply the second encoded data of the time point location, the second emotion weighted value, and the second interaction weighted value to determine the text encoding feature.
In some embodiments, when determining the second emotion weighted value based on the maximum confidence level and the second target class corresponding to the maximum confidence level, the computer device may directly use the maximum confidence level as the second emotion weighted value if the second target class corresponding to the maximum confidence level belongs to the excited class, so as to perform the weighting processing on the second encoded data through the second emotion weighted value. If the second target class corresponding to the maximum confidence level belongs to the flat class, the computer device may set the second emotion weighting level to 0, or assign a smaller value to the second emotion weighting value, for example, the second emotion weighting value is smaller than a set threshold value, where the set threshold value may be 0.1, 0.25, and so on, so as to perform the weight-reducing processing on the second encoded data through the second emotion weighting value.
In some embodiments, as shown in FIG. 11, a block diagram is provided for feature weighting:
The first emotion recognition result shown in fig. 11 includes expressions of the excited category and expressions of the flat category. The expressions of the excited category include: surprise, fear, laugh, cry, shout, and anger; the expressions of the flat category include: no expression, smile, and sadness. The second emotion recognition result likewise includes emotions of the excited category and emotions of the flat category. The emotions of the excited category include: anger, surprise, happiness, crying, shouting, sadness, and fear; the emotions of the flat category include: calm and sad.
Taking the first emotion weighted value of any time point as an example for explanation: when determining the first emotion weighting value based on the first emotion recognition result, a plurality of first target categories may be determined for the point in time, and the confidence level of each first target category. The computer device may calculate the maximum confidence level of the category belonging to the excited expression in each first target category, and calculate the maximum confidence level of the category belonging to the flat expression in each first target category, compare the two calculated confidence levels, and if the confidence level of the excited expression is higher, the final confidence level is equal to the confidence level of the excited expression, and if the confidence level of the flat expression is higher, the final confidence level is equal to zero.
Further, after calculating the first emotion weighted value of the time point, the computer device may multiply the first emotion weighted value of the time point with the first encoded data, i.e., the vector representation (embedding), and then multiply the result by the normalized bullet screen data, so as to obtain the weighted video picture coding feature. The same feature weighting flow is applied to the text coding feature; finally, the two weighted feature coding sequences are concatenated (spliced) together, and the suspense score is calculated from the concatenation result.
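A minimal sketch of fusing the two weighted feature sequences and computing a suspense score is given below; the pooling, dimensions, and scoring head are assumptions, since the description only specifies that the weighted sequences are concatenated before the suspense score is calculated.

```python
import torch
import torch.nn as nn

class SuspenseScorer(nn.Module):
    """Fuse the weighted video-picture and text coding features of one time point
    and regress a suspense score (illustrative dimensions)."""
    def __init__(self, video_dim=2048, text_dim=768, hidden=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(video_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),      # suspense score in (0, 1)
        )

    def forward(self, video_feat, text_feat):
        # video_feat: (num_frames, video_dim) weighted video picture coding features
        # text_feat:  (num_lines,  text_dim)  weighted text coding features
        fused = torch.cat([video_feat.mean(dim=0), text_feat.mean(dim=0)], dim=-1)
        return self.head(fused)                      # scalar suspense score for the time point
```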
In the above embodiment, for any time point, the computer device determines the text coding feature by multiplying the second coding data, the second emotion weighted value and the second interaction weighted value based on the time point, so that the relevance between different modes is fully considered, and the accuracy of calculating the suspense score is improved.
In some embodiments, referring to fig. 12, which is a system architecture diagram of suspense judgment in one embodiment of the present application, the video to be processed in the system architecture diagram is a film or television drama video. Fig. 12 involves data of four modalities, namely video pictures, speech texts, speech audio, and the bullet screen trend curve provided by the platform side, and the bullet screen data volume of each time point can be determined through the bullet screen trend curve.
The video picture can be input into a video coding network for feature coding to obtain the first encoded data. The video picture corresponds to facial emotion recognition, and the first emotion recognition result corresponding to the video picture is obtained through facial emotion recognition, so that the first encoded data can be weighted based on the first emotion recognition result. When the first emotion recognition results obtained in facial emotion recognition are all flat expressions, the computer device can remove that time point by screening the time periods, i.e., the time point does not participate in the weighting.
The speech text can be input into a text coding network for feature coding to obtain the second encoded data. The speech audio corresponds to speech emotion recognition, and the second emotion recognition result corresponding to the speech audio is obtained through speech emotion recognition, so that the second encoded data can be weighted based on the second emotion recognition result. When the second emotion recognition results obtained in speech emotion recognition are all of the flat category, the computer device can remove that time point by screening the time periods, i.e., the time point does not participate in the weighting.
The first emotion recognition result obtained by performing facial emotion recognition on the video picture can be combined with the barrage data amount determined by barrage curve trend, so that the first coded data can be weighted. And the second emotion recognition result of the speech emotion recognition based on the speech audio can be combined with the bullet screen data amount determined by the bullet screen curve trend to weight the second encoded data. And finally, combining the weighted results with the barrage trend curve to jointly calculate the suspense judgment result, so that the suspense score of each time point in the video can be calculated.
In some embodiments, performing integrity analysis according to the multi-modal data of each candidate time point, and determining the target time point from the candidate time points according to the integrity analysis result, includes: determining a plurality of sub-mirrors in the video to be processed according to the video pictures; determining, based on the plurality of sub-mirrors, the speech audio, and the speech text, the candidate time points that do not satisfy the integrity condition among the candidate time points; and eliminating the candidate time points that do not satisfy the integrity condition to obtain the target time point.
A sub-mirror (shot) is a minimal unit of the video. The video to be processed may include only one sub-mirror or a plurality of sub-mirrors; all the sub-mirrors together form the whole video, and the specific number of sub-mirrors is related to the actual scenes, the length, and the like of the video.
For the same sub-mirror, the included video frame contents are all of strong relevance, namely, the video frame contents have certain similarity; the video frame content of the different mirrors is quite different.
The integrity condition is a condition set for judging whether the content at a candidate time point is logically complete. When setting the integrity condition, the setting can be made in combination with the sub-mirror information of the video to be processed, the time at which the speech audio occurs in the video to be processed, and the time at which the speech text occurs in the video to be processed, respectively.
In particular, the computer device may determine a sequence of video frames of the video to be processed based on the video pictures, calculate the similarity between the video frames in the sequence, and determine the plurality of sub-mirrors in the video to be processed from the similarity. Further, based on the determined plurality of sub-mirrors, the speech audio, and the speech text, the computer device judges whether each candidate time point meets the integrity condition, retains the candidate time points that meet the integrity condition, and eliminates the candidate time points that do not, to obtain the target time point.
In some embodiments, when calculating sub-mirrors based on the video frame sequence, the sequence includes a plurality of video frames arranged in temporal order. For any two adjacent video frames in the sequence, the correlation degree between them is calculated. According to the correlation degree, it is judged whether the two adjacent video frames belong to the same sub-mirror; if not, a sub-mirror boundary between two sub-mirrors is obtained.
In some embodiments, if the correlation degree between two adjacent video frames is lower than a correlation threshold, it indicates that the two frames are weakly correlated and more likely belong to different sub-mirrors, so the time point corresponding to the two adjacent video frames is determined as a sub-mirror boundary between two adjacent sub-mirrors, and the video to be processed is divided into a plurality of sub-mirrors based on the sub-mirror boundaries.
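For illustration, the following Python sketch implements the adjacent-frame correlation test described above using a simple colour-histogram intersection; the histogram representation and the threshold value of 0.7 are assumptions, and any other correlation measure could be substituted.

```python
# Hedged sketch of sub-mirror (shot) splitting: cut where the correlation between
# adjacent frames' colour histograms drops below a threshold.
import numpy as np

def frame_histogram(frame, bins=32):
    """Per-channel histogram, normalised so frames of any size are comparable."""
    hist = np.concatenate([np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
                           for c in range(frame.shape[-1])]).astype(float)
    return hist / (hist.sum() + 1e-8)

def shot_boundaries(frames, threshold=0.7):
    """Return indices i where frames[i-1] and frames[i] likely belong to different shots."""
    hists = [frame_histogram(f) for f in frames]
    boundaries = []
    for i in range(1, len(hists)):
        # correlation in [0, 1]: histogram intersection of adjacent frames
        corr = np.minimum(hists[i - 1], hists[i]).sum()
        if corr < threshold:
            boundaries.append(i)
    return boundaries
```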
In some embodiments, when determining the sub-mirrors, the computer device may perform the sub-mirror calculation only on the video frames at the candidate time points, so that the determined sub-mirrors are related only to the candidate time points, which effectively reduces the amount of calculation. It can be understood that the computer device can also perform the sub-mirror calculation on the video frames of the whole video to be processed, so that the determined sub-mirrors are more complete, which improves the accuracy of the subsequent judgment.
In the above embodiment, the computer device performs the integrity analysis by combining the plurality of sub-mirrors, the speech audio and the speech text, and eliminates the candidate time points which do not meet the integrity condition to obtain the target time point. In this way, the integrity of the video content at the target time point can be ensured and an abrupt ending at the target time point can be avoided, which improves the overall quality of the subsequent video material and the viewing interest of the user.
In some embodiments, determining the candidate time points that do not satisfy the integrity condition based on the plurality of sub-mirrors, the speech audio and the speech text includes: for any sub-mirror, taking the candidate time points in the preset portion of the sub-mirror as candidate time points which do not meet the integrity condition; determining, based on the speech audio, a first time period in which a line of speech occurs in the video to be processed, and taking the candidate time points in the first time period as candidate time points which do not meet the integrity condition; and determining, based on the speech text, a second time period in which a line of speech occurs in the video to be processed, and taking the candidate time points in the second time period as candidate time points which do not meet the integrity condition.
Wherein, the preset portion of a sub-mirror is determined according to the plot of the sub-mirror. Typically, the first half of a sub-mirror indicates that its episode has just begun, while the end often indicates that the episode in this sub-mirror has substantially concluded. If the finally determined target time point falls in the front or middle portion of a sub-mirror, the plot is often cut off, making the ending of the final short video clip abrupt. Thus, the preset portion of a sub-mirror is its front or middle portion.
The first time period is a time interval formed by a passage in which spoken lines occur. For each candidate time point, the speech audio of the candidate time point corresponds to a first time period.
The second time period is a time interval formed by a passage in which line text occurs. For each candidate time point, the speech text of the candidate time point corresponds to a second time period.
Specifically, for any sub-mirror, the computer device may determine the candidate time points in the preset portion of the sub-mirror as time points that do not satisfy the integrity condition, take the candidate time points in the first time period as candidate time points that do not satisfy the integrity condition, and take the candidate time points in the second time period as candidate time points that do not satisfy the integrity condition, so as to ensure that a target time point will not fall within a time segment in which a line is being spoken or displayed.
In some embodiments, when the preset portion of a sub-mirror is set, each sub-mirror unit may be determined based on the sub-mirror boundaries; then, in combination with the time information of each sub-mirror, namely its start point and end point, the middle position of the sub-mirror is determined, and the part before the middle position is set as the preset portion of the sub-mirror. It will be appreciated that a one-third or one-quarter position of the sub-mirror can also be determined in combination with the time information of each sub-mirror, and the preceding one-third or one-quarter portion set as the preset portion. The specific preset portion can be adaptively adjusted in combination with the actual sub-mirror content, the application scenario, and the like.
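The screening logic above can be sketched as follows; the interval representations, the `preset_ratio` of 0.5 (the part before the middle position), and the function names are illustrative assumptions rather than fixed details of this embodiment.

```python
# Hedged sketch of the integrity screening: drop candidate time points that fall in
# the preset (front) portion of a shot or inside a period where a line of dialogue
# is being spoken (audio) or shown (text).
def violates_integrity(t, shots, speech_spans, subtitle_spans, preset_ratio=0.5):
    """shots: list of (start, end); *_spans: lists of (start, end) dialogue intervals."""
    for start, end in shots:
        if start <= t <= end and t < start + preset_ratio * (end - start):
            return True                      # inside the preset (front) portion of a shot
    for start, end in speech_spans + subtitle_spans:
        if start <= t <= end:
            return True                      # a line of dialogue is still in progress
    return False

def filter_candidates(candidates, shots, speech_spans, subtitle_spans):
    """Keep only candidate time points that satisfy the integrity condition."""
    return [t for t in candidates
            if not violates_integrity(t, shots, speech_spans, subtitle_spans)]
```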
In some embodiments, the computer device eliminates the candidate time points that do not satisfy the integrity condition by performing the integrity analysis based on the plurality of sub-mirrors, the speech audio and the speech text. In fig. 13, the arrows represent the candidate time points for which the suspense score determination was performed: the dotted arrows represent the candidate time points removed during screening, and the solid arrows represent the finally selected candidate time points that have both integrity and suspense.
In the above embodiment, the computer device compares the candidate time points with the preset portion of each sub-mirror, and compares them with the first time period and the second time period respectively, so as to reject the candidate time points which do not meet the integrity condition, obtain the target time points, and accurately determine target time points with both integrity and suspense.
In some embodiments, the video data processing method further comprises: determining the video end point as a clip candidate point, and clipping the video based on the clip candidate point; or, performing plot annotation according to the video end point; the plot annotation includes annotating the video end point in the playing progress bar of the video.
The obtained video end point can be used as a clip candidate point for a video clip, and can also be used for plot annotation. For example, on many short video creation platforms, users manually choose segments, that is, they manually watch a set of material and then pick suitable entry points and end points when editing a short video. In the present application, video end points suitable for short video clips are located automatically, quickly and accurately, and can be given to the user directly during creation. The user therefore does not need to watch the video again before selecting an end point, but can directly adopt the recommended video end point, which greatly improves creation efficiency and effectively enhances the short video creation platform.
As another example, on a long video online platform, the present application automatically, quickly and accurately locates the video end points suitable for short video clips, displays them in the progress bar of a film or TV drama on the platform, and uses them as dividing lines between different theme episodes, so that viewers can choose to skip certain episodes directly, which can greatly improve the viewing experience.
For example, in the post-production stage of a film or TV drama, related clips such as highlight clips can be produced according to the episode distribution of the feature episodes. In the present application, the video end points suitable for short video clips are automatically, quickly and accurately located and serve as candidate end points for highlight clips, which ensures the logical integrity of the clip while retaining suspense at its ending. This can greatly attract the viewing interest of users, draw viewers to watch the feature episodes on the platform, and serve as promotion and traffic diversion for the long video platform.
The present application also provides an application scenario in which the above video data processing method is applied. Specifically, the method is applied in this scenario as follows:
Referring to fig. 14, an overall system architecture diagram of the present application in one embodiment is shown. The overall system is divided into a left part and a right part, namely the suspense judgment and the integrity judgment. The left part is the suspense judgment for a film or TV drama video, that is, locating suspenseful content points in the video to be processed; it involves data of four modalities: the video pictures, the speech text, the speech audio, and the bullet-screen trend curve provided by the platform side.
The video picture corresponds to facial emotion recognition, and the video frame coding features are weighted after an expression with strong emotion is recognized.
The speech audio correspondingly undergoes speech emotion recognition. The first emotion recognition result and the second emotion recognition result are each combined with the bullet-screen data amount determined from the bullet-screen curve trend to weight the first coded data and the second coded data respectively. The weighted results are then combined with the bullet-screen trend curve to jointly compute the suspense judgment result, i.e., the suspense score of each time point in the video.
The integrity judgment on the right uses three modalities of information; that is, the integrity judgment is carried out on three complementary modalities: the speech audio, the video picture and the speech text.
After the end points located by the left module pass the integrity judgment on the right, target time points with both integrity and suspense can be screened out and used to determine the end points of short video material.
The embodiment of the application constructs a fully automatic method for locating video material end points: the whole process locates suitable end points fully automatically, and the content at an end point has a certain logical integrity and suspense. Manual participation is removed entirely, clipping efficiency is greatly improved, and the whole system can run as industrialized production. Meanwhile, the end points are predicted by the constructed neural network model, so the positioning calculation is completely standardized and differences caused by different personnel are avoided.
The embodiment of the application innovatively uses multi-modal data, including facial expressions, speech text, speech audio, interaction data and the like. Jointly combining multiple modalities of data makes it possible to handle different types of film and TV drama themes, so the located end points have a certain robustness; and locating the final suspense point by jointly combining multiple modalities of data makes the located suspense position more striking and accurate.
The embodiment of the application also innovatively applies the correlation among different modalities to the model, and adjusts the feature proportions in the final calculation through weighting among the modalities, thereby improving the accuracy of the final suspense judgment.
The embodiment of the application newly introduces bullet-screen data as a judgment basis reflecting the actual viewing reaction of users, so that the suspense positioning incorporates actual behavioural data and better fits the suspense perception of real audiences. This improves the suspense accuracy of the target time point and gives users more interest in watching the feature episodes.
In the embodiment of the application, besides locating suspense points in a film or TV drama, data of multiple modalities are combined to judge whether the content at a candidate time point is logically complete. Judging the logical integrity of the candidate time points ensures the integrity of the video content at the ending, avoids an abrupt video ending, improves the overall quality of the short video material, and increases the viewing interest of users.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the present application also provide a video data processing apparatus for implementing the above-mentioned video data processing method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the video data processing apparatus provided below may be referred to the limitation of the video data processing method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 15, there is provided a video data processing apparatus 1500 comprising: a data determination module 1502, an identification module 1504, a calculation module 1506, a screening module 1508, and a positioning module 1510, wherein:
the data determining module 1502 is configured to determine a plurality of time points of a video to be processed, and obtain multi-mode data corresponding to each time point, where the multi-mode data includes at least a video frame, a speech audio, and a speech text.
The recognition module 1504 is configured to, for any time point, perform emotion recognition according to the video picture and the speech audio of the time point, and obtain an emotion recognition result.
The calculating module 1506 is configured to determine the suspense score of the time point according to the video frame, the speech text and the emotion recognition result of the time point.
A screening module 1508 is configured to screen candidate time points from a plurality of time points according to the suspense score of each time point.
The positioning module 1510 is configured to perform integrity analysis according to the multi-modal data of each candidate time point, and determine a target time point from the candidate time points according to the integrity analysis result; the target time point is used to determine the video end point.

In some embodiments, the recognition module 1504 is further configured to, for any time point, perform facial emotion recognition according to the video picture of the time point to obtain a first emotion recognition result; and, for any time point, perform speech emotion recognition according to the speech audio of the time point to obtain a second emotion recognition result.
In some embodiments, the recognition module 1504 includes a first recognition module;
the first recognition module is used for carrying out face detection on any time point according to the video picture of the time point, and intercepting a face image from the video picture based on a detection result; extracting features based on the face image to obtain a vector representation of the face image; and predicting the facial expression category to which the facial image belongs according to the vector representation of the facial image, and determining a first emotion recognition result according to the facial expression category.
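As a non-limiting sketch, the flow of the first recognition module might look as follows in Python; `detect_faces`, `embed_face` and `classify_expression` are placeholder callables standing in for whichever face detector, feature extractor and expression classifier are actually used, and the frame/box formats are assumptions.

```python
# Hedged sketch of the first recognition module: detect faces in the video picture,
# crop each face, embed it, and classify its facial expression category.
def facial_emotion_result(video_frame, detect_faces, embed_face, classify_expression):
    """Return (category, confidence) pairs, one per detected face in the frame."""
    results = []
    for box in detect_faces(video_frame):               # face detection on the video picture
        x1, y1, x2, y2 = box
        face_img = video_frame[y1:y2, x1:x2]             # crop the face image from the frame
        vec = embed_face(face_img)                       # vector representation of the face image
        category, confidence = classify_expression(vec)  # predicted facial expression category
        results.append((category, confidence))
    return results                                       # first emotion recognition result
```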
In some embodiments, the recognition module 1504 further includes a second recognition module;
the second recognition module is used for extracting acoustic feature representation of the speech sounds of the time points aiming at any time point; encoding based on the acoustic feature representation to obtain an encoded feature output; performing voice recognition according to the coding feature output to obtain a voice recognition result of the targeted time point; and under the condition that the voice recognition result represents that the voice exists, carrying out voice emotion recognition based on the coding feature output to obtain a second emotion recognition result.
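A minimal sketch of this gating flow is given below; `extract_acoustic`, `encode`, `recognize_speech` and `classify_emotion` are placeholder callables, and returning `None` when no speech is detected is an assumption of the sketch.

```python
# Hedged sketch of the second recognition module: only run speech emotion
# recognition when speech is actually detected in the line audio.
def speech_emotion_result(audio, extract_acoustic, encode, recognize_speech, classify_emotion):
    feats = extract_acoustic(audio)      # acoustic feature representation (e.g. a spectrogram)
    encoded = encode(feats)              # encoding feature output
    text = recognize_speech(encoded)     # speech recognition on the encoded features
    if not text:                         # no speech present at this time point
        return None                      # no second emotion recognition result
    return classify_emotion(encoded)     # (category, confidence) for the spoken line
```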
In some embodiments, the second recognition module is further configured to perform a first encoding based on the acoustic feature representation, resulting in a first encoded output; the first coding is coding focusing on time domain perception; performing second encoding based on the acoustic feature representation to obtain a second encoded output; the second code is a multidimensional code; performing third coding based on the acoustic characteristic representation to obtain third coding output; the third coding is a coding focusing on frequency domain perception; an encoding characteristic output is determined from the first encoded output, the second encoded output, and the third encoded output.
In some embodiments, the second recognition module is further configured to perform, on the acoustic feature representation through a convolution kernel of a first size, feature extraction in the time dimension first and then feature extraction in the frequency dimension to obtain a first intermediate feature; perform feature extraction on the acoustic feature representation through a convolution kernel of a second size to obtain a second intermediate feature, the second size being greater than the first size; and fuse the first intermediate feature and the second intermediate feature to obtain a first coding output.
In some embodiments, the second recognition module is further configured to perform, on the acoustic feature representation through a convolution kernel of the first size, feature extraction in the frequency dimension first and then feature extraction in the time dimension to obtain a third intermediate feature; perform feature extraction on the acoustic feature representation through a convolution kernel of the second size to obtain a second intermediate feature, the second size being greater than the first size; and fuse the third intermediate feature and the second intermediate feature to obtain a third coding output.
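For illustration, the three coding branches could be arranged as in the following PyTorch sketch; the channel counts, kernel sizes, fusion by addition and final concatenation are assumptions chosen for the sketch, not values fixed by this embodiment.

```python
# Hedged sketch of the three-branch acoustic encoding: small oriented kernels for the
# time-/frequency-focused branches, and a larger square kernel as the multi-dimensional branch.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # branch for the first coding output: time dimension first, then frequency
        self.time_first = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU())
        # branch for the third coding output: frequency dimension first, then time
        self.freq_first = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU())
        # second (multi-dimensional) branch: larger kernel covering both dimensions
        self.multi_dim = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=5, padding=2), nn.ReLU())

    def forward(self, spec):                 # spec: (batch, 1, freq_bins, time_steps)
        second = self.multi_dim(spec)
        first = self.time_first(spec) + second   # fuse branch 1 with the multi-dimensional branch
        third = self.freq_first(spec) + second   # fuse branch 3 with the multi-dimensional branch
        return torch.cat([first, second, third], dim=1)  # combined encoding feature output
```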
In some embodiments, the emotion recognition results include a first emotion recognition result and a second emotion recognition result; the computing module 1506 is further configured to obtain first encoded data of the video picture at the targeted time point, and obtain a video picture encoding feature according to the first encoded data and the first emotion recognition result; acquiring second coding data of the line text of the time point, and acquiring text coding features according to the second coding data and a second emotion recognition result; fusing the video picture coding features and the text coding features of the time points to obtain multi-mode fusion features of the time points; and determining the suspense score of the targeted time point according to the multi-modal fusion characteristics.
In some embodiments, the computing module 1506 is further configured to determine the video picture coding feature according to the first coded data, the first emotion recognition result and the interaction data; and to determine the text coding feature according to the second coded data, the second emotion recognition result and the interaction data.

In some embodiments, the first emotion recognition result includes a first target category corresponding to each of N video frames at the targeted time point and a confidence level of each of the N first target categories, N being a natural number greater than 1. The computing module 1506 is further configured to determine the maximum confidence level among the confidence levels of the N first target categories; determine a first emotion weighting value based on the maximum confidence level and the first target category corresponding to it; determine a first interaction weighting value according to the interaction data; and determine the video picture coding feature according to the first coded data of the targeted time point, the first emotion weighting value and the first interaction weighting value.
In some embodiments, the computing module 1506 is further configured to determine the first emotion weighting value based on the maximum confidence level if the first target category corresponding to the maximum confidence level belongs to an excited category; and to take a preset value as the first emotion weighting value if the first target category corresponding to the maximum confidence level belongs to a flat category.
In some embodiments, the second emotion recognition result includes a second target category corresponding to the speech audio of the targeted time point, and a confidence level of the second target category; the computing module 1506 is further configured to determine a second emotion weighting value based on the confidence level of the second target category; determine a second interaction weighting value according to the interaction data; and determine the text coding feature according to the second coded data of the targeted time point, the second emotion weighting value and the second interaction weighting value.
In some embodiments, the screening module 1508 is further configured to perform curve fitting according to the suspense score of each time point to obtain a target fitting curve; and determining the time point position corresponding to the wave crest in the target fitting curve as a candidate time point position.
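A minimal sketch of this screening step is shown below; the moving-average smoothing used as the "target fitting curve" and the use of scipy's peak finder are illustrative choices for the sketch rather than the only way to fit the curve.

```python
# Hedged sketch: fit/smooth the per-time-point suspense scores into a curve and
# take its peaks as candidate time points.
import numpy as np
from scipy.signal import find_peaks

def candidate_time_points(time_points, suspense_scores, window=5):
    """time_points and suspense_scores are equal-length 1-D sequences."""
    scores = np.asarray(suspense_scores, dtype=float)
    kernel = np.ones(window) / window
    fitted = np.convolve(scores, kernel, mode="same")   # simple moving-average "target fitting curve"
    peak_idx, _ = find_peaks(fitted)                     # indices of the curve's peaks
    return [time_points[i] for i in peak_idx]
```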
In some embodiments, the positioning module 1510 is further configured to determine a plurality of sub-mirrors in the video to be processed according to the video picture; determine, based on the plurality of sub-mirrors, the speech audio and the speech text, the candidate time points which do not meet the integrity condition; and eliminate the candidate time points which do not meet the integrity condition to obtain the target time point.
In some embodiments, the positioning module 1510 is further configured to, for any sub-mirror, take the candidate time points in the preset portion of the sub-mirror as candidate time points that do not satisfy the integrity condition; determine, based on the speech audio, a first time period in which a line of speech occurs in the video to be processed, and take the candidate time points in the first time period as candidate time points which do not meet the integrity condition; and determine, based on the speech text, a second time period in which a line of speech occurs in the video to be processed, and take the candidate time points in the second time period as candidate time points which do not meet the integrity condition.
In some embodiments, the video data processing apparatus further comprises an application module;
the application module is used for determining the video end point as a clip candidate point and clipping the video based on the clip candidate point; and is also used for performing plot annotation according to the video end point; the plot annotation includes annotating the video end point in the playing progress bar of the video.
The respective modules in the above-described video data processing apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server or a terminal, and the internal structure of which may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing video data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video data processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (20)

1. A method of video data processing, the method comprising:
determining a plurality of time points of a video to be processed, and acquiring multi-mode data corresponding to each time point, wherein the multi-mode data at least comprises a video picture, speech audio and speech text;
aiming at any time point, carrying out emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result;
Determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point;
screening candidate time points from the plurality of time points according to the suspense score of each time point;
carrying out integrity analysis according to the multi-modal data of each candidate time point, and determining a target time point from the candidate time points according to an integrity analysis result; the target time point is used for determining a video end point.
2. The method according to claim 1, wherein the emotion recognition results include a first emotion recognition result and a second emotion recognition result, and the performing, for any time point, emotion recognition according to the video picture and the speech audio of the time point to obtain an emotion recognition result includes:
for any time point, carrying out facial emotion recognition according to the video picture of the time point to obtain a first emotion recognition result;
and aiming at any time point, carrying out voice emotion recognition according to the speech sound of the time point to obtain a second emotion recognition result.
3. The method according to claim 2, wherein the performing facial emotion recognition according to the video picture of the time point for any time point to obtain the first emotion recognition result includes:
For any time point, carrying out face detection according to a video picture of the time point, and intercepting a face image from the video picture based on a detection result;
extracting features based on the face image to obtain a vector representation of the face image;
and predicting the facial expression category to which the facial image belongs according to the vector representation of the facial image, and determining a first emotion recognition result according to the facial expression category.
4. The method according to claim 2, wherein for any time point, performing speech emotion recognition according to the speech audio of the time point to obtain a second emotion recognition result, including:
extracting acoustic feature representation of the speech sounds of the time points aiming at any time point;
encoding based on the acoustic feature representation to obtain an encoded feature output;
performing voice recognition according to the coding feature output to obtain a voice recognition result of the targeted time point;
and under the condition that the voice recognition result represents that voice exists, voice emotion recognition is carried out based on the coding feature output, and a second emotion recognition result is obtained.
5. The method of claim 4, wherein the encoding based on the acoustic feature representation results in an encoded feature output, comprising:
performing first coding based on the acoustic feature representation to obtain a first coding output; the first encoding is encoding focused on time domain perception;
performing second encoding based on the acoustic feature representation to obtain a second encoded output; the second code is a multi-dimensional code;
performing third coding based on the acoustic feature representation to obtain third coding output; the third coding is a coding focusing on frequency domain perception;
determining a code characteristic output from the first code output, the second code output, and the third code output.
6. The method of claim 5, wherein the first encoding based on the acoustic feature representation results in a first encoded output, comprising:
performing feature extraction in the time dimension on the acoustic feature representation through a convolution kernel of a first size, and then performing feature extraction in the frequency dimension to obtain a first intermediate feature;
performing feature extraction on the acoustic feature representation through a convolution kernel of a second size to obtain a second intermediate feature; the second size is greater than the first size;
And fusing the first intermediate feature and the second intermediate feature to obtain a first coding output.
7. The method of claim 5, wherein the third encoding based on the acoustic feature representation to obtain a third encoded output comprises:
performing feature extraction in the frequency dimension on the acoustic feature representation through a convolution kernel of a first size, and then performing feature extraction in the time dimension to obtain a third intermediate feature;
performing feature extraction on the acoustic feature representation through a convolution kernel of a second size to obtain a second intermediate feature; the second size is greater than the first size;
and fusing the third intermediate feature and the second intermediate feature to obtain a third coding output.
8. The method of claim 1, wherein the emotion recognition results include a first emotion recognition result and a second emotion recognition result; the determining the suspense score of the time point according to the video picture, the speech text and the emotion recognition result of the time point comprises the following steps:
acquiring first coding data of a video picture of a time point, and acquiring video picture coding characteristics according to the first coding data and the first emotion recognition result;
Acquiring second coding data of the line text of the time point, and acquiring text coding features according to the second coding data and the second emotion recognition result;
fusing the video picture coding features and the text coding features of the time points to obtain multi-mode fusion features of the time points;
and determining the suspense score of the targeted time point according to the multi-modal fusion characteristic.
9. The method of claim 8, wherein the multi-modal data further includes interactive data, the deriving video picture coding features based on the first coded data and the first emotion recognition result, comprising:
determining video picture coding characteristics according to the first coding data, the first emotion recognition result and the interaction data;
and obtaining text coding features according to the second coding data and the second emotion recognition result, wherein the text coding features comprise:
and determining text coding characteristics according to the second coding data, the second emotion recognition result and the interaction data.
10. The method of claim 9, wherein the first emotion recognition result includes a first target category for each of the N frames of video frames for the point in time and a confidence level for each of the N first target categories; the N is a natural number greater than 1; the determining the video picture coding feature according to the first coding data, the first emotion recognition result and the interactive data includes:
Determining the maximum confidence level in the confidence levels of the N first target categories;
determining a first emotion weighting value based on the maximum confidence level and a first target class corresponding to the maximum confidence level;
determining a first interaction weighted value according to the interaction data;
and determining video picture coding characteristics according to the first coding data of the aimed time point, the first emotion weighting value and the first interaction weighting value.
11. The method of claim 10, wherein determining a first emotion weighting value based on the maximum confidence level and a first target class to which the maximum confidence level corresponds comprises:
determining a first emotion weighting value based on the maximum confidence coefficient under the condition that a first target category corresponding to the maximum confidence coefficient belongs to an excited category;
and taking the preset numerical value as a first emotion weighting value under the condition that the first target category corresponding to the maximum confidence coefficient belongs to the flat category.
12. The method of claim 9, wherein the second emotion recognition result comprises a second target category corresponding to the speech audio for the point in time and a confidence level of the second target category; the determining text encoding features according to the second encoding data, the second emotion recognition result, and the interactive data includes:
Determining a second emotion weighting value based on the confidence level of the second target class;
determining a second interaction weighted value according to the interaction data;
and determining text coding characteristics according to the second coding data of the aimed time point, the second emotion weighting value and the second interaction weighting value.
13. The method of claim 1, wherein the screening candidate time points from the plurality of time points based on the suspense score for each time point comprises:
performing curve fitting according to the suspense score of each time point to obtain a target fitting curve;
and determining the time point position corresponding to the wave crest in the target fitting curve as a candidate time point position.
14. The method according to claim 1, wherein the performing the integrity analysis based on the multi-modal data of each candidate time-point, and determining the target time-point from the candidate time-points based on the integrity analysis result comprises:
determining a plurality of sub-mirrors in the video to be processed according to the video picture;
determining candidate time points which do not meet the integrity condition in the candidate time points based on the plurality of sub-mirrors, the speech audio and the speech text;
And eliminating the candidate time points which do not meet the integrity condition to obtain the target time point.
15. The method of claim 14, wherein the determining a candidate time point of the candidate time points that does not satisfy an integrity condition based on the plurality of sub-mirrors, the speech audio, and the speech text comprises:

for any sub-mirror, taking the candidate time points in the preset portion of the sub-mirror as candidate time points which do not meet the integrity condition;
determining a first time period in which the speech occurs in the video to be processed based on the speech audio, and taking the candidate time points in the first time period as candidate time points which do not meet the integrity condition;
and determining a second time period in which the speech is generated in the video to be processed based on the speech text, and taking the candidate time points in the second time period as candidate time points which do not meet the integrity condition.
16. The method according to any one of claims 1-15, further comprising:
determining the video end point as a clip candidate point, and clipping the video based on the clip candidate point; or,
Marking the plot according to the video ending point; the plot annotation comprises annotating the video end point in a play progress bar of the video.
17. A video data processing apparatus, the apparatus comprising:
the data determining module is used for determining a plurality of time points of the video to be processed and obtaining multi-mode data corresponding to each time point, wherein the multi-mode data at least comprises a video picture, a speech sound and a speech text;
the recognition module is used for carrying out emotion recognition on any time point according to the video picture and the speech audio of the time point to obtain an emotion recognition result;
the computing module is used for determining the suspense score of the aimed time point according to the video picture, the speech text and the emotion recognition result of the aimed time point;
the screening module is used for screening candidate time points from the plurality of time points according to the suspense score of each time point;
the positioning module is used for carrying out integrity analysis according to the multi-mode data of each candidate time point, and determining a target time point from the candidate time points according to an integrity analysis result; the target time point is used for determining a video end point.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.
CN202311514849.6A 2023-11-13 2023-11-13 Video data processing method, device, computer equipment and storage medium Pending CN117579858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311514849.6A CN117579858A (en) 2023-11-13 2023-11-13 Video data processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311514849.6A CN117579858A (en) 2023-11-13 2023-11-13 Video data processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117579858A true CN117579858A (en) 2024-02-20

Family

ID=89861737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311514849.6A Pending CN117579858A (en) 2023-11-13 2023-11-13 Video data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117579858A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination