CN110381336B - Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment

Info

Publication number
CN110381336B
Authority
CN
China
Prior art keywords
emotion
video
segment
judgment
training
Prior art date
Legal status
Active
Application number
CN201910672842.4A
Other languages
Chinese (zh)
Other versions
CN110381336A (en)
Inventor
何穆
何欢潮
何伟峰
林志杰
唐爱林
何图
杨永恩
Current Assignee
Guangzhou Feida Audio Co ltd
Original Assignee
Guangzhou Fidek Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Fidek Co ltd
Priority to CN201910672842.4A
Publication of CN110381336A
Application granted
Publication of CN110381336B
Active legal status
Anticipated expiration legal status

Classifications

    • H04N21/233: Processing of audio elementary streams (server-side content processing)
    • H04N21/2387: Stream processing in response to a playback request from an end-user, e.g. for trick-play
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (client-side processing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application relates to a video segment emotion judgment method, a video segment emotion judgment device, a computer device and a computer-readable storage medium based on 5.1 channels. The video segment emotion judgment method comprises the following steps: in the video playing process, acquiring the 5.1 channel signals of a segment to be judged having a set frame length of the video; determining the channel characteristic values of the 5.1 channel signals of the segment to be judged; inputting the channel characteristic values into a trained emotion judgment model, and determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length is determined by training the emotion judgment model. With this method, the emotion type of a video segment can be detected and judged accurately and in real time during video playback, which helps to apply sound-effect processing to different emotional scenes of a video, or to skip scenes unsuitable for young child viewers, such as violent ones, during the playback of TV series, movies, short online videos and the like.

Description

Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video segment emotion determining method based on a 5.1 channel, a video segment emotion determining apparatus based on a 5.1 channel, a computer device, and a computer-readable storage medium.
Background
At present, the way videos are classified and detected is limited: a video can only be roughly classified according to its overall style; for example, movies are generally classified as horror, crime or romance. However, a video usually does not contain only one type of scene; for example, the gunfight or car-chase shots in a crime film do not run through the entire video, and horror- or romance-related clips may also appear in between. Obviously, the conventional video type detection method cannot determine the emotion type of each scene segment in a video accurately and in real time, and therefore cannot support functions such as automatically skipping segments unsuitable for children during playback, and is not conducive to the retrieval and screening of films and other videos.
Disclosure of Invention
In view of the above, it is necessary to provide a video segment emotion determination method based on 5.1 channels, a video segment emotion determination apparatus based on 5.1 channels, a computer device and a computer readable storage medium.
In one aspect, an embodiment of the present invention provides a video segment emotion determination method based on a 5.1 channel, where the method includes:
in the video playing process, acquiring a 5.1 sound channel signal of a to-be-determined segment with a set frame length of a video;
calculating a channel characteristic value of the 5.1 channel signal of the segment to be judged;
inputting the sound channel characteristic value into a trained emotion judgment model, and determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length is determined by training the emotion judgment model.
In one embodiment, the step of acquiring a 5.1 channel signal of a to-be-determined segment of a set frame length of a video includes:
acquiring one or more of a center channel signal, a front left channel signal, a front right channel signal, a rear left surround channel signal, a rear right surround channel signal and a subwoofer channel signal of a segment to be determined with a set frame length of a video.
In one embodiment, the step of training the emotion judgment model includes:
constructing a sample set; the sample set comprises 5.1 sound channel signals of a plurality of video sample fragments, and each video sample fragment corresponds to an emotion label;
screening the sample set to obtain a plurality of common emotion fragments; the multiple common emotion fragments are multiple video sample fragments with the same emotion label in the sample set;
acquiring a plurality of common emotion frame signals; the multiple common emotion frame signals are obtained by framing the 5.1 sound channel signals of the multiple common emotion fragments according to a preset frame length;
constructing a feature training set; the feature training set is obtained by processing the channel features extracted from the multiple common emotion frame signals;
inputting the feature training set into an emotion judgment model to be trained for training, adjusting the preset frame length according to the training result, and re-acquiring a plurality of common emotion frame signals and re-constructing the feature training set according to the adjusted preset frame length until the training result of the emotion judgment model to be trained meets a preset condition, so as to obtain an emotion judgment sub-model corresponding to the same emotion label; the set frame length is determined according to the preset frame length obtained in the last adjustment;
and determining the emotion judgment model according to the emotion judgment sub-model corresponding to each emotion label.
In one embodiment, the step of detecting that the training result of the emotion judgment model to be trained satisfies a preset condition includes:
evaluating the complexity of the channel features extracted from the multiple common emotion frame signals;
determining the judgment accuracy rate of the emotion types of the common emotion fragments according to the training result of the emotion judgment model to be trained;
and if it is detected that the trade-off between the complexity and the judgment accuracy meets the requirement, determining that the training result of the emotion judgment model to be trained meets the preset condition.
In one embodiment, after the step of screening the sample set to obtain a plurality of co-emotion fragments, the step of training the emotion judgment model further includes:
acquiring the 5.1 sound channel attributes of the common emotion fragments; the 5.1 sound channel attributes of the common emotion fragments are determined by analyzing the sound signals of the different sound channels of each common emotion fragment;
constructing an attribute training set according to the 5.1 sound channel attributes of the common emotion fragments;
the step of inputting the characteristic training set into the emotion judgment model to be trained for training comprises the following steps:
and inputting the characteristic training set and the attribute training set into the emotion judgment model to be trained together for training.
In one embodiment, the step of determining the emotion type of the to-be-determined section according to the determination result of the emotion determination model includes:
determining the judgment accuracy of the judgment result of the emotion judgment model;
and if the judgment accuracy is greater than or equal to a set accuracy threshold, determining the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
In another aspect, an embodiment of the present invention provides an apparatus for determining emotion of a video segment based on 5.1 channels, where the apparatus includes:
the signal acquisition module is used for acquiring a 5.1 sound channel signal of a to-be-determined segment with a set frame length of a video in the video playing process;
the characteristic determining module is used for determining the channel characteristic value of the 5.1 channel signal of the segment to be judged;
the emotion judging module is used for inputting the sound channel characteristic value into a trained emotion judging model and determining the emotion type of the segment to be judged according to the judgment result of the emotion judging model; the set frame length is determined by training the emotion judgment model.
In one embodiment, the apparatus for determining emotion of video segment based on 5.1 channels further comprises:
the emotion judgment model training module is used for constructing a sample set; the sample set comprises 5.1 sound channel signals of a plurality of video sample fragments, and each video sample fragment corresponds to an emotion label; screening the sample set to obtain a plurality of common situation fragments; the multiple common emotion fragments are multiple video sample fragments with the same emotion label in the sample set; acquiring a plurality of common emotion frame signals; the multiple common emotion frame signals are obtained by framing the 5.1 sound channel signals of the multiple common emotion fragments according to a preset frame length; constructing a feature training set; the feature training set is obtained by processing the vocal tract features extracted from the multiple common emotion frame signals; inputting the feature training set into an emotion judgment model to be trained for training, adjusting the length of the preset frame according to a training result, and acquiring a plurality of common emotion frame signals and constructing the feature training set again according to the adjusted length of the preset frame until the training result of the emotion judgment model to be trained meets a preset condition, so as to obtain emotion judgment submodels corresponding to the same emotion label; the set frame length is determined according to the preset frame length obtained by the last adjustment; and determining the emotion judgment model according to the emotion judgment sub-model corresponding to each emotion label.
In another aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of a video segment emotion determination method or a video playing method based on 5.1 channels when executing the computer program.
In still another aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of a video segment emotion determination method or a video playing method based on 5.1 channels.
One of the above technical solutions has the following advantages or beneficial effects: in the video playing process, the 5.1 channel signals of a segment to be judged of the video are acquired, the channel characteristic values of the 5.1 channel signals of the segment to be judged are determined, and the channel characteristic values are input into a trained emotion judgment model, so that the emotion type of the segment to be judged can be determined according to the judgment result of the emotion judgment model. The method can detect and judge the emotion types of video segments in real time, with a certain guaranteed accuracy, during video playback; it helps to apply sound-effect processing to different emotional scenes of a video, or to skip scenes unsuitable for young child viewers, such as violent ones, during the playback of TV series, movies, short online videos and the like; it can also be applied to fields such as video retrieval, filtering and segment deletion, and therefore has a wide application range and high practical value.
Drawings
FIG. 1 is a schematic flow diagram of a method for emotion determination for a 5.1 channel based video segment in one embodiment;
FIG. 2 is a schematic flow chart diagram of a video playback method in one embodiment;
FIG. 3 is a schematic block diagram of an embodiment of a video segment emotion determination apparatus based on 5.1 channels;
FIG. 4 is a schematic block diagram of a video playback device in one embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In one embodiment, as shown in fig. 1, there is provided a method for determining emotion of a video segment based on 5.1 channels, comprising the following steps:
s202, in the video playing process, 5.1 sound channel signals of the to-be-determined section with the set frame length of the video are obtained.
The segment to be judged is a segment with a set frame length of a video in a playing state, and the segment is a video segment needing emotion judgment at present; the frame length may be set to a specific value, or may be a frame length range, which is not limited herein.
A sound channel is one of a set of mutually independent audio signals that are reproduced at different spatial positions during playback. Here, the channel signals of the segment to be judged can be understood as the 5.1 channel signals assigned to the corresponding loudspeakers when the segment to be judged is played. Compared with a single channel, a multi-channel sound source offers an omnidirectional, three-dimensional sense of the whole sound field and gives audiences and listeners a more vivid sense of presence, which is why multi-channel sound sources have spread and developed rapidly. Common multi-channel formats include stereo, quadraphonic surround, 5.1 channels, 7.1 channels and the like. Since 5.1 channels are the main format of current film sound systems and are also an internationally published standard, the following takes a film configured with 5.1 channels as an example to perform emotion judgment of the segment to be judged; that is, this step is: in the process of playing the movie, acquiring the 5.1 channel signals of a segment to be judged having a set frame length of the movie, so as to perform emotion judgment on the movie segment.
It should be noted that the 5.1 channels, i.e., the center channel, the front left and right channels, the rear left and right surround channels, and the so-called 0.1 channel subwoofer channel; in a specific embodiment, S202 may specifically include: acquiring one or more of a center channel signal, a front left channel signal, a front right channel signal, a rear left surround channel signal, a rear right surround channel signal and a subwoofer channel signal of a segment to be determined with a set frame length of a video.
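As an illustration of this acquisition step, the sketch below shows one possible way of obtaining the per-channel signals of a segment to be judged from decoded 5.1 audio; the channel order, the soundfile reader and the function name are assumptions made for the example and are not specified by the application.

    # A sketch (not the application's implementation) of acquiring the 5.1 channel signals
    # of a segment to be judged. It assumes the segment's audio is available as a 6-channel
    # file in the common order L, R, C, LFE, Ls, Rs; real decoders may use another order.
    import soundfile as sf  # assumed available; any multichannel audio reader would do

    CHANNEL_ORDER = ["front_left", "front_right", "center",
                     "subwoofer", "rear_left_surround", "rear_right_surround"]

    def get_segment_channels(path, start_s, frame_len_s):
        """Return a dict of mono signals, one per 5.1 channel, for one segment."""
        data, sr = sf.read(path)             # data shape: (num_samples, 6)
        begin = int(start_s * sr)
        end = begin + int(frame_len_s * sr)  # the "set frame length" of the segment
        segment = data[begin:end, :]
        return {name: segment[:, i] for i, name in enumerate(CHANNEL_ORDER)}, sr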
S204, determining the channel characteristic value of the 5.1 channel signal of the segment to be judged.
In an embodiment of the present invention, the channel feature values may include: the correlation between symmetrical channels, and the root mean square value, standard deviation, maximum value, minimum value and the like computed from different time-frequency characteristics; the time-frequency characteristics may specifically include the amplitude, fluctuation degree and rhythmicity of the waveform, the energy ratios of frequency bands in the frequency domain, MFCC parameters and the like, and may be selected and calculated according to actual needs, without being limited thereto.
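The following sketch illustrates, under assumed parameter choices, how channel feature values of the kinds listed above (root mean square value, standard deviation, extrema, correlation of symmetrical channels, MFCC statistics) might be computed for one channel signal; it is not the application's prescribed feature set.

    # A sketch, assuming each channel is a NumPy array sampled at rate sr, of the kinds of
    # channel feature values named above; the exact feature list and parameters are
    # illustrative, not the application's prescribed set.
    import numpy as np
    import librosa  # assumed available for the MFCC example

    def channel_features(ch, sr):
        feats = {
            "rms": float(np.sqrt(np.mean(ch ** 2))),   # root mean square value
            "std": float(np.std(ch)),                  # standard deviation
            "max": float(np.max(ch)),
            "min": float(np.min(ch)),
        }
        mfcc = librosa.feature.mfcc(y=ch.astype(np.float32), sr=sr, n_mfcc=13)
        feats.update({f"mfcc_{i}_mean": float(m) for i, m in enumerate(mfcc.mean(axis=1))})
        return feats

    def symmetric_correlation(left, right):
        # correlation between a symmetrical channel pair, e.g. front left / front right
        return float(np.corrcoef(left, right)[0, 1])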
S206, inputting the channel characteristic value into the trained emotion judgment model, and determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length is determined by training an emotion judgment model.
The emotion judgment model can be obtained by training an initial model based on a feature training set, and the type of the initial model is not limited; it can be a support vector machine, a neural network and the like. The feature training set may be obtained by processing the channel characteristic values extracted from the 5.1 channel signals of a plurality of film sample segments, where the channel characteristic values may be selected and calculated with reference to those in S204; each movie sample segment is provided with a corresponding emotion label, and the emotion labels may include calm, tension, fear, happiness, sadness, shock and the like, without being limited thereto.
Specifically, the emotion labels of the plurality of film sample segments can be obtained through subjective evaluation tests. More specifically, the subjective evaluation test may proceed as follows: first, segments of common film emotion scene types (calm scenes, tense scenes, fear scenes, shocking scenes and the like) are collected and sorted; then, the collected movie segments are played in a standard movie theater equipped with 5.1 sound reproduction, and after watching each segment the participants record the corresponding emotion scene type in a test table according to their personal feelings; since parts of a movie segment may correspond to more than one emotion, for example both shocking and horror scenes may carry a tense emotion, participants may choose two or more emotion types; finally, the evaluation results are sorted to establish a database (feature training set) containing a plurality of movie sample segments, each with a unique scene type labeling result, such as a horror scene, a calm scene, a shocking scene and the like.
It should be noted that the set frame length in S202 may be determined by training the emotion judgment model, rather than being set arbitrarily. Specifically, for example, movie sample segments may be cut at different frame lengths, such as 1 second, 2 seconds and 5 seconds, for feature extraction, so as to construct different feature training sets; the initial model is trained with the different feature training sets to obtain candidate emotion judgment models; and the set frame length of the segment to be judged is then chosen by weighing factors such as the training complexity and the result accuracy of each emotion judgment model. By optimizing the frame length of the segment to be judged, the accuracy of movie segment emotion judgment can be improved on the one hand, and on the other hand an excessively large or complex amount of calculation in the emotion judgment process can be avoided, which ensures the real-time performance of emotion judgment, lowers the performance requirements on the data processing device, and improves the feasibility and practicability of the emotion judgment method.
In the above embodiments of the present invention, the execution subject may be an audio processor or another server device, and may be a single device or a cluster formed by multiple devices; of course, it may be selected and changed according to the actual situation, and is not limited herein.
In the 5.1-channel-based video segment emotion judgment method of this embodiment, the 5.1 channel signals of a segment to be judged of the video are acquired during video playback, the channel characteristic values of the 5.1 channel signals of the segment to be judged are determined, and the channel characteristic values are input into a trained emotion judgment model, so that the emotion type of the segment to be judged can be determined according to the judgment result of the emotion judgment model. The method can detect and judge the emotion types of video segments accurately and in real time during video playback; it helps to apply sound-effect processing to different emotional scenes of a video, or to skip scenes unsuitable for young child viewers, such as violent ones, during the playback of TV series, movies, short online videos and the like; and it can also be applied to fields such as video retrieval, filtering and segment deletion, giving it a wide application range and high practical value. In addition, because the frame length of the segment to be judged is determined by training the emotion judgment model rather than being set arbitrarily, the accuracy of movie segment emotion judgment can be improved while an excessively large or complex amount of calculation in the judgment process is avoided, ensuring the real-time performance of emotion judgment.
In some embodiments, the step of training the emotion judgment model in S206 includes: constructing a sample set; the sample set comprises 5.1 sound channel signals of a plurality of video sample fragments, and each video sample fragment corresponds to an emotion label; screening the sample set to obtain a plurality of common emotion fragments; the multiple common emotion fragments are multiple video sample fragments with the same emotion label in the sample set; acquiring a plurality of common emotion frame signals; the multiple common emotion frame signals are obtained by framing the 5.1 sound channel signals of the multiple common emotion fragments according to a preset frame length; constructing a feature training set; the feature training set is obtained by processing the channel features extracted from the multiple common emotion frame signals; inputting the feature training set into an emotion judgment model to be trained for training, adjusting the preset frame length according to the training result, and re-acquiring a plurality of common emotion frame signals and re-constructing the feature training set according to the adjusted preset frame length until the training result of the emotion judgment model to be trained meets a preset condition, so as to obtain an emotion judgment sub-model corresponding to the same emotion label; the set frame length is determined according to the preset frame length obtained in the last adjustment; and determining the emotion judgment model according to the emotion judgment sub-model corresponding to each emotion label.
The sample set can be constructed through the subjective evaluation test, each video sample segment in the sample set can correspond to a 5.1 channel signal, and each channel signal of the same video sample segment carries the same emotion label, so that model training can be performed subsequently.
It should be noted that the common emotion fragment refers to a video sample fragment with the same emotion tag in the sample set, and 5.1 channel signal characteristics of each emotion type can be obtained by performing feature parameter calculation and processing on 5.1 channel signals of the video sample fragment with the same emotion tag, so as to construct a feature training set in the following.
The preset frame length of the common emotion frame signals is adjusted continuously, and the emotion judgment model to be trained is trained with feature training sets built at the different frame lengths, so as to obtain a suitable frame length for the common emotion fragments; the set frame length at which the segment to be judged is acquired is finally determined from this suitable frame length. The method for judging what is "suitable" includes, but is not limited to, comparing and weighing the complexity of the feature parameters against the judgment accuracy of the scene emotion type. In this process, SPSS Modeler and MATLAB software can be used to model and train on the feature parameter set, and feature parameters with a low contribution to the emotion type judgment model can be deleted.
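The sketch below outlines the frame-length search just described, with scikit-learn's support vector classifier standing in for the unspecified initial model; the helper functions frame_signals and extract_features are hypothetical placeholders.

    # A sketch of the frame-length search described above, not the actual training pipeline.
    # frame_signals() and extract_features() are hypothetical helpers; scikit-learn's SVC
    # stands in for the unspecified support vector machine / neural network initial model.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def train_for_frame_lengths(fragments, labels, candidate_lengths=(1.0, 2.0, 5.0)):
        results = {}
        for frame_len in candidate_lengths:
            frames, frame_labels = frame_signals(fragments, labels, frame_len)  # hypothetical
            X = np.array([extract_features(f) for f in frames])                 # hypothetical
            accuracy = cross_val_score(SVC(), X, frame_labels, cv=5).mean()
            complexity = X.shape[1]  # crude proxy: number of feature parameters
            results[frame_len] = {"accuracy": accuracy, "complexity": complexity}
        return results  # the set frame length is chosen by weighing accuracy vs. complexity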
In some embodiments, the step of detecting that the training result of the emotion judgment model to be trained meets the preset condition may specifically include: evaluating the complexity of the channel features extracted from the multiple common emotion frame signals; determining the judgment accuracy of the emotion types of the common emotion fragments according to the training result of the emotion judgment model to be trained; and if it is detected that the trade-off between the complexity and the judgment accuracy meets the requirement, determining that the training result of the emotion judgment model to be trained meets the preset condition.
In this step, the frame length of the common emotion frame signals may be chosen as 1 second, 2 seconds or 5 seconds, and the frame length selection and model adjustment can be completed by comparing the complexity of the feature parameters calculated under the different frame lengths with the scene type judgment accuracy, taking the load on the hardware DSP into account. The feature parameters include time-domain and frequency-domain parameters, with the frequency-domain ones being more complex; that is, when considering the complexity of the feature parameters, the DSP computation brought by each frame and each window function deserves particular attention. For example, with a frame length of 5 seconds and a window length of 10 milliseconds, the required feature parameters are of moderate complexity and the scene emotion type judgment accuracy is 80%; with a frame length of 1 second and a window length of 10 milliseconds, the required feature parameters are very complex and the judgment accuracy is 83%. By comparing and weighing the two results, the former frame length can be selected as the set frame length of the segment to be judged during emotion judgment.
Of course, other balance conditions can be comprehensively considered when the model training parameters are adjusted, and the balance conditions can be specifically set according to actual conditions.
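A small sketch of the kind of complexity/accuracy trade-off check described above follows; the scoring rule and thresholds are illustrative assumptions, since the application only requires that the trade-off meets the requirement.

    # A sketch, under assumed scoring rules, of the complexity/accuracy trade-off check;
    # the thresholds are illustrative only.
    def meets_preset_condition(accuracy, complexity, min_accuracy=0.80, max_complexity=2):
        # complexity as an ordinal rating: 1 = simple, 2 = moderate, 3 = very complex
        return accuracy >= min_accuracy and complexity <= max_complexity

    # Example from the text: 5 s frames (moderate complexity, 80% accuracy) are accepted,
    # while 1 s frames (very complex, 83% accuracy) are rejected despite higher accuracy.
    assert meets_preset_condition(0.80, 2)
    assert not meets_preset_condition(0.83, 3)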
In some embodiments, after the step of screening the sample set to obtain the plurality of common emotion fragments, the step of training the emotion judgment model may further include: acquiring the 5.1 sound channel attributes of the common emotion fragments; the 5.1 sound channel attributes of the common emotion fragments are determined by analyzing the sound signals of the different sound channels of each common emotion fragment; and constructing an attribute training set according to the 5.1 sound channel attributes of the common emotion fragments. Further, the step of inputting the feature training set into the emotion judgment model to be trained for training then includes: inputting the feature training set and the attribute training set together into the emotion judgment model to be trained for training.
The 5.1 sound channel attributes of a common emotion fragment can be understood as the characteristics of, and correlations between, the sounds in its different channels. Taking the 5.1 standard channels as an example, analyzing the sound signals of the different channels of each common emotion fragment means analyzing the characteristics of each of the 5.1 standard channels and collecting statistics on the parameters of each emotion scene type. Specifically, the content of the six 5.1-standard channels can be analyzed; typical findings are, for example, that the center channel mainly carries human voices and close-range action sounds (including breathing, footsteps and the like), and that the left and right main channels occasionally contain human voices but are dominated by background music and distant environmental sounds. In addition, the proportion of human voice appearing in different emotional scenes and the type of background music can be compared, and the sound characteristics of the bass channel and its correlation with the other five channels can be analyzed with particular emphasis.
After the attribute training set is also input into the emotion judgment model to be trained, the sound signal channels finally selected for computational analysis may differ between emotion types. For example, if 95% of bass-channel frames in calm scenes carry no signal, a highly accurate judgment can already be obtained by calculating the time-domain feature parameters of that channel alone; whereas in shocking scenes, where the bass channel carries a higher-energy signal and the main channels carry background music with a clear rhythm, additional sound signal channels need to be included in the feature parameter calculation.
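The following sketch shows, under assumed thresholds and attribute names, how simple per-channel attributes of the kind discussed above (for example how often the bass channel is effectively silent, or the relative energy of each channel) might be computed for a common emotion fragment.

    # A sketch, with assumed thresholds and attribute names, of simple per-channel
    # attributes of the kind discussed above.
    import numpy as np

    def channel_attributes(channels, silence_threshold=1e-4):
        """channels: dict mapping channel name -> mono signal (NumPy array)."""
        lfe = channels["subwoofer"]
        energies = {name: float(np.mean(sig ** 2)) for name, sig in channels.items()}
        total = sum(energies.values()) or 1.0
        return {
            # fraction of samples where the bass channel is effectively silent
            "lfe_silent_ratio": float(np.mean(np.abs(lfe) < silence_threshold)),
            # relative energy of each channel
            "energy_ratio": {name: e / total for name, e in energies.items()},
            # correlation between the center channel and the summed main channels
            "center_to_mains_corr": float(np.corrcoef(
                channels["center"],
                channels["front_left"] + channels["front_right"])[0, 1]),
        }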
In some embodiments, S206 may specifically include: determining the judgment accuracy of the judgment result of the emotion judgment model; and if the judgment accuracy is greater than or equal to the set accuracy threshold, determining the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
Segments to be judged with different set frame lengths, and different emotion scenes, may require different feature parameters and accuracy thresholds for judgment. For example, a calm scene with a frame length of 5 seconds may have three judgment conditions, namely that the signal amplitude of the bass channel is below a set threshold, that there is a human voice without background music, or that there is no human voice and the background music is flat; each case has a corresponding correct proportion, and when the three cases are considered together, the accuracy threshold for judging a calm scene may be set to 80%, with the features to be extracted including the channel features of the center channel, the main channels and the bass channel. For a calm scene with a frame length of 1 second, however, fewer judgment conditions may be needed, only the feature parameters of the bass channel need to be calculated, and the accuracy threshold may be set to 73%. The above is merely an example and should not be used to limit the present solution.
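A minimal sketch of this decision step follows; the per-emotion accuracy thresholds shown mirror the worked example above and are illustrative, not prescribed values.

    # A sketch of the decision step of S206: accept the model's judgment only when its
    # judgment accuracy reaches the per-emotion threshold. The threshold values mirror
    # the worked example above and are illustrative, not prescribed.
    ACCURACY_THRESHOLDS = {"calm_5s": 0.80, "calm_1s": 0.73}  # assumed per-scene table

    def decide_emotion(judged_emotion, judgment_accuracy, thresholds=ACCURACY_THRESHOLDS):
        """Return the emotion type if the accuracy meets the set threshold, else None."""
        if judgment_accuracy >= thresholds.get(judged_emotion, 1.0):
            return judged_emotion
        return None  # the judgment is not accepted for this segment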
In conclusion, the 5.1-channel-based video segment emotion judgment method makes full use of the characteristics of a video's 5.1 sound system. It specifically includes: statistically classifying the various video emotion scenes and labeling the emotion scenes according to a subjective judgment test; analyzing the characteristics of the sound content in the six 5.1-standard channels and calculating the time-domain and frequency-domain feature parameters of the audio signal of each channel, with particular emphasis on the feature analysis of the bass channel; and training the model by methods such as a support vector machine or a neural network, finally building an audio signal feature parameter library with a certain judgment accuracy for the various video emotion scenes and determining the corresponding accuracy thresholds. The process can be simulated and verified with MATLAB software.
In one embodiment, as shown in fig. 2, there is provided a video playing method, including the steps of:
s302, in the video playing process, a to-be-determined segment of the video is obtained.
S304, determining the emotion type of the segment to be judged according to any one of the video segment emotion judgment methods based on the 5.1 sound channel in the previous embodiments.
And S306, determining the current emotion type of the video according to the emotion type of the segment to be judged.
S308, according to the current emotion type, searching a mapping table to obtain a target playing mode; the mapping table stores the mapping relation between the emotion types and the play modes.
And S310, playing the video according to the target playing mode.
In the above embodiments of the present invention, the execution subject may be a terminal, a video playing device or another server device, and may be selected and changed according to the actual situation.
In the video playing method of the embodiment, by adopting any one of the video segment emotion judging methods based on the 5.1 sound channel in the embodiments to determine the emotion type of the segment to be judged, the real-time judgment of the current emotion of the video can be realized, and the video is played according to the preset playing mode; the method can be applied to occasions such as home theaters and movie theaters where multichannel sound systems are configured.
In some embodiments, S306 may specifically include: if the emotion types of the fragments to be judged are consistent with the emotion types of the judged fragments with the set number, taking the emotion types of the fragments to be judged as the current emotion types of the video; if the emotion types of the fragments to be judged are inconsistent with the emotion types of the judged fragments with the set number, and the number of the fragments inconsistent with the emotion types of the fragments to be judged is smaller than the set threshold value, taking the emotion types of the fragments to be judged as the current emotion types of the video; if the emotion types of the fragments to be judged are inconsistent with the emotion types of the judged fragments with the set number, and the number of the fragments inconsistent with the emotion types of the fragments to be judged is larger than or equal to the set threshold value, determining the current emotion type of the video according to the emotion types of the inconsistent fragments; the determined segments and the segments to be determined in the set number are continuous segments in the video.
In some embodiments, the play modes in the mapping table include one or more of normal-speed play, fast-forward play, slow-motion play, bass-enhanced play and no play.
Specifically, the mapping relationship between emotion types and play modes may include one or more of a first mapping relationship, a second mapping relationship, a third mapping relationship, a fourth mapping relationship and a fifth mapping relationship, wherein the first mapping relationship is that if the emotion type is a first emotion, the play mode is normal speed; the second mapping relationship is that if the emotion type is a second emotion, the play mode is fast-forward; the third mapping relationship is that if the emotion type is a third emotion, the play mode is slow motion; the fourth mapping relationship is that if the emotion type is a fourth emotion, the play mode is bass enhancement; and the fifth mapping relationship is that if the emotion type is a fifth emotion, the play mode is no play, and the video jumps to the next emotion scene.
The first emotion, the second emotion, the third emotion, the fourth emotion and the fifth emotion can be set according to the actual situation; for example, the first emotion may be calm; the second emotion may be tension and the like; the third emotion may be excitement, happiness and the like; the fourth emotion may be shock and the like; and the fifth emotion may be nausea, horror and the like. They can be selected and set according to the actual situation.
The bass enhancement can be understood as enhancing the emotion expressed in the video scene by using the influence of the bass sound effect on the human emotion perception, and the specific enhancing method is not specifically limited herein.
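The sketch below illustrates one possible form of the mapping table and the lookup of S308; the emotion assignments mirror the example above, and the mode names are assumptions.

    # A sketch of the emotion-to-play-mode mapping table and the lookup of S308/S310.
    # The assignments mirror the example above; the mode names are assumptions.
    PLAY_MODE_TABLE = {
        "calm": "normal_speed",       # first emotion  -> first mapping relationship
        "tension": "fast_forward",    # second emotion -> second mapping relationship
        "excitement": "slow_motion",  # third emotion  -> third mapping relationship
        "happiness": "slow_motion",
        "shock": "bass_enhancement",  # fourth emotion -> fourth mapping relationship
        "nausea": "skip",             # fifth emotion: do not play, jump to the next scene
        "horror": "skip",
    }

    def target_play_mode(current_emotion, table=PLAY_MODE_TABLE, default="normal_speed"):
        return table.get(current_emotion, default)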
The video playing method described above is explained with a specific example: assume that a video contains three consecutive segments, which are a calm scene, a fear-and-nausea scene and a sad scene respectively, each lasting 3 minutes. The scene detection and video playing process of the present application is as follows:
in the process of playing the video in real time, if the length of the selected set frame is 1 second, the scene emotion type is judged once every second, and a judgment value of the scene emotion type can be obtained every 1 second. And when the 5 continuous judgment results are the same scene type result, increasing the condition for judging other scenes, namely when more than 3 judgment results in the 5 continuous scenes in the next scene judgment results are other scenes, considering that the scene of the previous judgment is ended. Specifically, for example, when 5 consecutive scenes are determined to be a quiet scene, the current segment is determined to be a quiet scene, and if no more than 3 of the 5 consecutive scenes are determined to be a non-quiet scene, the current segment is continuously determined to be a quiet scene; when the fear nausea scene occurs, the calculation result can generate a continuous judgment result of the scene, and at the moment, the scene of the video clip is determined to be changed from a calm scene to the fear nausea scene.
For a fear-and-nausea video segment, if the current judgment indicates that scene, the video is fast-forwarded according to the statistics on that scene's duration, and judgment then continues into the next emotion scene according to the emotion judgment method described above.
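The following sketch illustrates, under the 1-second frame and 5-in-a-row / more-than-3-of-5 rules of this example, one way the current emotion of the video could be tracked from the per-second judgment results; the constants and state handling are illustrative assumptions rather than the application's prescribed logic.

    # A sketch, under the 1-second frame and 5-in-a-row / more-than-3-of-5 rules of this
    # example, of tracking the video's current emotion from per-second judgment results.
    from collections import deque

    class SceneTracker:
        def __init__(self, confirm_run=5, switch_window=5, switch_count=3):
            self.confirm_run = confirm_run    # identical results in a row confirm a scene
            self.switch_count = switch_count  # more than this many differing results end it
            self.recent = deque(maxlen=switch_window)
            self.current = None               # confirmed current emotion type
            self.run = 0
            self.last = None

        def update(self, judged):
            # confirmation: count consecutive identical per-second judgments
            self.run = self.run + 1 if judged == self.last else 1
            self.last = judged
            if self.current is None and self.run >= self.confirm_run:
                self.current = judged
            elif self.current is not None:
                # switching: the previous scene ends once most recent results differ from it
                self.recent.append(judged)
                differing = sum(1 for e in self.recent if e != self.current)
                if len(self.recent) == self.recent.maxlen and differing > self.switch_count:
                    self.current = judged
                    self.recent.clear()
            return self.current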
It should be understood that, for the foregoing method embodiments, although the steps in the flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict restriction on the order in which these steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the method embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same idea as the video segment emotion determination method based on 5.1 channels in the above embodiment, a video segment emotion determination device based on 5.1 channels is also provided herein.
In one embodiment, as shown in fig. 3, there is provided a video segment emotion judgment apparatus based on 5.1 channels, including: a signal acquisition module 401, a feature determination module 402 and an emotion decision module 403, wherein:
a signal obtaining module 401, configured to obtain, during a video playing process, a 5.1 channel signal of a segment to be determined, of a set frame length of a video;
a feature determining module 402, configured to determine a channel feature value of a 5.1-channel signal of a segment to be determined;
the emotion judging module 403 is configured to input the channel feature value into the trained emotion judging model, and determine an emotion type of a segment to be judged according to a judgment result of the emotion judging model; the set frame length is determined by training an emotion judgment model.
In some embodiments, the signal obtaining module 401 is specifically configured to: acquiring one or more of a center channel signal, a front left channel signal, a front right channel signal, a rear left surround channel signal, a rear right surround channel signal and a subwoofer channel signal of a segment to be determined with a set frame length of a video.
In some embodiments, the apparatus for determining the emotion of a video segment based on 5.1 channels further comprises: an emotion judgment model training module, used for constructing a sample set; the sample set comprises 5.1 sound channel signals of a plurality of video sample fragments, and each video sample fragment corresponds to an emotion label; screening the sample set to obtain a plurality of common emotion fragments; the multiple common emotion fragments are multiple video sample fragments with the same emotion label in the sample set; acquiring a plurality of common emotion frame signals; the multiple common emotion frame signals are obtained by framing the 5.1 sound channel signals of the multiple common emotion fragments according to a preset frame length; constructing a feature training set; the feature training set is obtained by processing the channel features extracted from the multiple common emotion frame signals; inputting the feature training set into an emotion judgment model to be trained for training, adjusting the preset frame length according to the training result, and re-acquiring a plurality of common emotion frame signals and re-constructing the feature training set according to the adjusted preset frame length until the training result of the emotion judgment model to be trained meets a preset condition, so as to obtain an emotion judgment sub-model corresponding to the same emotion label; the set frame length is determined according to the preset frame length obtained in the last adjustment; and determining the emotion judgment model according to the emotion judgment sub-model corresponding to each emotion label.
In some embodiments, the emotion judgment model training module is specifically configured to: evaluate the complexity of the channel features extracted from the multiple common emotion frame signals; determine the judgment accuracy of the emotion types of the common emotion fragments according to the training result of the emotion judgment model to be trained; and if it is detected that the trade-off between the complexity and the judgment accuracy meets the requirement, determine that the training result of the emotion judgment model to be trained meets the preset condition.
In some embodiments, the emotion judgment model training module is further configured to: acquire the 5.1 sound channel attributes of the common emotion fragments; the 5.1 sound channel attributes of the common emotion fragments are determined by analyzing the sound signals of the different sound channels of each common emotion fragment; construct an attribute training set according to the 5.1 sound channel attributes of the common emotion fragments; and input the feature training set and the attribute training set together into the emotion judgment model to be trained for training.
In some embodiments, emotion determining module 403 is specifically configured to: determining the judgment accuracy of the judgment result of the emotion judgment model; and if the judgment accuracy is greater than or equal to the set accuracy threshold, determining the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
Based on the same idea as the video playing method in the above embodiment, a video playing device is also provided herein.
In one embodiment, as shown in fig. 4, there is provided a video playing apparatus, including: a channel signal obtaining module 501, a segment emotion determining module 502, a current emotion determining module 503, a pattern searching module 504 and a playing module 505, wherein:
a sound channel signal obtaining module 501, configured to obtain a to-be-determined segment of a video in a video playing process;
a segment emotion determining module 502, configured to determine an emotion type of a segment to be determined according to any one of the foregoing video segment emotion determining methods based on a 5.1 channel;
a current emotion determining module 503, configured to determine a current emotion type of the video according to the emotion type of the segment to be determined;
a mode search module 504, configured to search a mapping table according to the current emotion type to obtain a target play mode; the mapping table stores the mapping relation between the emotion type and the play mode;
and a playing module 505, configured to play the video according to the target playing mode.
In some embodiments, the current emotion determining module 503 is specifically configured to: if the emotion types of the fragments to be judged are consistent with the emotion types of the judged fragments with the set number, taking the emotion types of the fragments to be judged as the current emotion types of the video; if the emotion types of the fragments to be judged are inconsistent with the emotion types of the judged fragments with the set number, and the number of the fragments inconsistent with the emotion types of the fragments to be judged is smaller than the set threshold value, taking the emotion types of the fragments to be judged as the current emotion types of the video; if the emotion types of the fragments to be judged are inconsistent with the emotion types of the judged fragments with the set number, and the number of the fragments inconsistent with the emotion types of the fragments to be judged is larger than or equal to the set threshold value, determining the current emotion type of the video according to the emotion types of the inconsistent fragments; the set number of the determined segments and the segments to be determined are continuous segments in the video.
In some embodiments, the play modes in the mapping table include one or more of normal-speed play, fast-forward play, slow-motion play, bass-enhanced play and no play.
For specific limitations of the 5.1-channel-based video segment emotion determination apparatus and the video playback apparatus, reference may be made to the above limitations of the 5.1-channel-based video segment emotion determination method and the video playback method, which are not repeated here. Each module in the 5.1-channel-based video segment emotion determination apparatus and the video playback apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In addition, in the above exemplary embodiment of the 5.1 channel based video segment emotion determining apparatus and the video playing apparatus, the logical division of each program module is only an example, and in practical applications, the above function allocation may be performed by different program modules according to needs, for example, due to the configuration requirements of corresponding hardware or the convenience of implementation of software, that is, the internal structure of the 5.1 channel based video segment emotion determining apparatus and the video playing apparatus is divided into different program modules to perform all or part of the above described functions.
In one embodiment, a computer device is provided, which may be a server device or an audio processing device, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing video segment emotion judgment data based on 5.1 channels. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video segment emotion determination method based on 5.1 channels.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: in the video playing process, acquiring a 5.1 sound channel signal of a to-be-determined segment with a set frame length of a video; extracting a sound channel characteristic value of a 5.1 sound channel signal of a segment to be judged; inputting the characteristic value of the sound channel into a trained emotion judgment model, and determining the emotion type of a segment to be judged according to the judgment result of the emotion judgment model; the set frame length is determined by training an emotion judgment model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring one or more of a center channel signal, a front left channel signal, a front right channel signal, a rear left surround channel signal, a rear right surround channel signal and a subwoofer channel signal of a segment to be determined with a set frame length of a video.
In one embodiment, the processor, when executing the computer program, further performs the steps of: constructing a sample set; the sample set comprises 5.1 sound channel signals of a plurality of video sample fragments, and each video sample fragment corresponds to an emotion label; screening the sample set to obtain a plurality of common emotion fragments; the multiple common emotion fragments are multiple video sample fragments with the same emotion label in the sample set; acquiring a plurality of common emotion frame signals; the multiple common emotion frame signals are obtained by framing the 5.1 sound channel signals of the multiple common emotion fragments according to a preset frame length; constructing a feature training set; the feature training set is obtained by processing the channel features extracted from the multiple common emotion frame signals; inputting the feature training set into an emotion judgment model to be trained for training, adjusting the preset frame length according to the training result, and re-acquiring a plurality of common emotion frame signals and re-constructing the feature training set according to the adjusted preset frame length until the training result of the emotion judgment model to be trained meets a preset condition, so as to obtain an emotion judgment sub-model corresponding to the same emotion label; the set frame length is determined according to the preset frame length obtained in the last adjustment; and determining the emotion judgment model according to the emotion judgment sub-model corresponding to each emotion label.
In one embodiment, the processor, when executing the computer program, further performs the steps of: evaluating the complexity of the channel features extracted from the multiple common emotion frame signals; determining the judgment accuracy of the emotion types of the common emotion fragments according to the training result of the emotion judgment model to be trained; and if it is detected that the trade-off between the complexity and the judgment accuracy meets the requirement, determining that the training result of the emotion judgment model to be trained meets the preset condition.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the 5.1 sound channel attributes of the common emotion fragments; the 5.1 sound channel attributes of the common emotion fragments are determined by analyzing the sound signals of the different sound channels of each common emotion fragment; constructing an attribute training set according to the 5.1 sound channel attributes of the common emotion fragments; and inputting the feature training set and the attribute training set together into the emotion judgment model to be trained for training.
In one embodiment, the processor, when executing the computer program, further implements the following steps: determining the judgment accuracy of the judgment result of the emotion judgment model; and, if the judgment accuracy is greater than or equal to a set accuracy threshold, taking the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
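A sketch of this acceptance step, assuming (hypothetically) that the model exposes some accuracy or confidence estimate alongside its label:

    def accept_judgment(predicted_emotion, judgment_accuracy, accuracy_threshold=0.8):
        """Take the model's result as the segment's emotion type only when the
        judgment accuracy reaches the set threshold; otherwise leave it undecided."""
        return predicted_emotion if judgment_accuracy >= accuracy_threshold else None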
In one embodiment, a computer device is provided, which may be a terminal device or a video/movie playing device, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. A display screen (not shown) may also be included; it may be a liquid crystal display or an electronic ink display. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data related to video playback. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of video playback.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures related to the present disclosure and does not limit the computer devices to which the present disclosure applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program. When executing the computer program, the processor implements the following steps: during video playback, acquiring a segment of the video to be judged; determining the emotion type of the segment to be judged according to the 5.1-channel-based video segment emotion judgment method described above; determining the current emotion type of the video according to the emotion type of the segment to be judged; looking up a mapping table according to the current emotion type to obtain a target play mode, the mapping table storing the mapping relation between emotion types and play modes; and playing the video according to the target play mode.
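The mapping table can be as simple as a dictionary from emotion type to play mode; the entries below are hypothetical configuration choices of this sketch, not values fixed by the embodiment.

    # Hypothetical emotion-type -> play-mode mapping table
    PLAY_MODE_TABLE = {
        "calm":     "normal-speed play",
        "boring":   "fast-forward play",
        "sad":      "slow-motion play",
        "exciting": "bass-enhanced play",
        "horror":   "no play",
    }

    def choose_play_mode(current_emotion, default="normal-speed play"):
        """Look up the target play mode for the video's current emotion type."""
        return PLAY_MODE_TABLE.get(current_emotion, default)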
In one embodiment, the processor, when executing the computer program, further implements the following steps: if the emotion type of the segment to be judged is consistent with the emotion types of a set number of already-judged segments, taking the emotion type of the segment to be judged as the current emotion type of the video; if the emotion type of the segment to be judged is inconsistent with the emotion types of the set number of already-judged segments, and the number of segments inconsistent with the segment to be judged is smaller than a set threshold, likewise taking the emotion type of the segment to be judged as the current emotion type of the video; and if the number of inconsistent segments is greater than or equal to the set threshold, determining the current emotion type of the video according to the emotion types of the inconsistent segments. The set number of already-judged segments and the segment to be judged are consecutive segments in the video.
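One reading of the above rule, sketched in Python; treating "determining the current emotion type according to the emotion types of the inconsistent segments" as a majority vote over those segments is an assumption of this sketch.

    def current_video_emotion(pending_emotion, recent_emotions, set_threshold):
        """pending_emotion: emotion type of the segment to be judged;
        recent_emotions: emotion types of the set number of already-judged,
        consecutive segments immediately preceding it."""
        inconsistent = [e for e in recent_emotions if e != pending_emotion]
        if len(inconsistent) < set_threshold:
            # All consistent, or too few inconsistent segments to matter.
            return pending_emotion
        # Otherwise decide from the inconsistent segments (here: their most common type).
        return max(set(inconsistent), key=inconsistent.count)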
In one embodiment, the play modes in the mapping table include one or more of normal-speed play, fast-forward play, slow-motion play, bass-enhanced play and no play.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps: during video playback, acquiring the 5.1-channel signal of a segment of the video to be judged, the segment having a set frame length; determining channel feature values of the 5.1-channel signal of the segment to be judged; inputting the channel feature values into a trained emotion judgment model, and determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length being determined when training the emotion judgment model.
In one embodiment, the computer program, when executed by the processor, further implements the following step: acquiring one or more of the center channel signal, front left channel signal, front right channel signal, rear left surround channel signal, rear right surround channel signal and subwoofer channel signal of the segment to be judged having the set frame length.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: constructing a sample set, the sample set comprising the 5.1-channel signals of a plurality of video sample segments, each video sample segment corresponding to an emotion label; screening the sample set to obtain a plurality of same-emotion segments, namely video sample segments in the sample set that carry the same emotion label; acquiring a plurality of same-emotion frame signals, obtained by framing the 5.1-channel signals of the same-emotion segments according to a preset frame length; constructing a feature training set, obtained by processing the channel features extracted from the same-emotion frame signals; inputting the feature training set into the emotion judgment model to be trained, adjusting the preset frame length according to the training result, then re-acquiring the same-emotion frame signals and reconstructing the feature training set with the adjusted preset frame length, until the training result of the emotion judgment model to be trained satisfies a preset condition, thereby obtaining the emotion judgment sub-model corresponding to that emotion label, the set frame length being determined by the preset frame length obtained in the last adjustment; and determining the emotion judgment model from the emotion judgment sub-models corresponding to the respective emotion labels.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: evaluating the complexity of the channel features extracted from the plurality of same-emotion frame signals; determining, according to the training result of the emotion judgment model to be trained, the judgment accuracy for the emotion type of the same-emotion segments; and, if the balance between the complexity and the judgment accuracy is detected to meet the requirement, determining that the training result of the emotion judgment model to be trained satisfies the preset condition.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring 5.1-channel attributes of the same-emotion segments, determined by analyzing the sound signals of the different channels of each same-emotion segment; constructing an attribute training set from these 5.1-channel attributes; and inputting the feature training set and the attribute training set together into the emotion judgment model to be trained.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: determining the judgment accuracy of the judgment result of the emotion judgment model; and, if the judgment accuracy is greater than or equal to the set accuracy threshold, taking the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps: during video playback, acquiring a segment of the video to be judged; determining the emotion type of the segment to be judged according to the 5.1-channel-based video segment emotion judgment method described above; determining the current emotion type of the video according to the emotion type of the segment to be judged; looking up a mapping table according to the current emotion type to obtain a target play mode, the mapping table storing the mapping relation between emotion types and play modes; and playing the video according to the target play mode.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: if the emotion type of the segment to be judged is consistent with the emotion types of a set number of already-judged segments, taking the emotion type of the segment to be judged as the current emotion type of the video; if the emotion type of the segment to be judged is inconsistent with the emotion types of the set number of already-judged segments, and the number of segments inconsistent with the segment to be judged is smaller than a set threshold, likewise taking the emotion type of the segment to be judged as the current emotion type of the video; and if the number of inconsistent segments is greater than or equal to the set threshold, determining the current emotion type of the video according to the emotion types of the inconsistent segments. The set number of already-judged segments and the segment to be judged are consecutive segments in the video.
In one embodiment, the play modes in the mapping table include one or more of normal-speed play, fast-forward play, slow-motion play, bass-enhanced play and no play.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered to fall within the scope of this specification.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The terms "comprises" and "comprising," as well as any variations thereof, of the embodiments herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
References to "first \ second" herein are merely to distinguish between similar objects and do not denote a particular ordering with respect to the objects, it being understood that "first \ second" may, where permissible, be interchanged with a particular order or sequence. It should be understood that "first \ second" distinct objects may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced in sequences other than those illustrated or described herein.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for judging the emotion of a video segment based on 5.1 channels, the method comprising:
during video playback, acquiring the 5.1-channel signal of a segment of the video to be judged, the segment having a set frame length;
determining channel feature values of the 5.1-channel signal of the segment to be judged; the channel feature values comprising the correlation of symmetric channels and the root mean square value, standard deviation, maximum value and minimum value obtained from different time-frequency features, the time-frequency features comprising amplitude, degree of fluctuation, rhythmicity of the waveform curve, frequency energy ratio in the frequency domain and MFCC parameters;
inputting the channel feature values into a trained emotion judgment model, and determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length being determined when training the emotion judgment model.
2. The method according to claim 1, wherein the step of acquiring the 5.1-channel signal of the segment to be judged having the set frame length comprises:
acquiring one or more of the center channel signal, front left channel signal, front right channel signal, rear left surround channel signal, rear right surround channel signal and subwoofer channel signal of the segment to be judged having the set frame length of the video.
3. The method according to claim 1, wherein the step of training the emotion judgment model comprises:
constructing a sample set; the sample set comprising the 5.1-channel signals of a plurality of video sample segments, each video sample segment corresponding to an emotion label;
screening the sample set to obtain a plurality of same-emotion segments; the same-emotion segments being video sample segments in the sample set that carry the same emotion label;
acquiring a plurality of same-emotion frame signals; the same-emotion frame signals being obtained by framing the 5.1-channel signals of the same-emotion segments according to a preset frame length;
constructing a feature training set; the feature training set being obtained by processing the channel features extracted from the same-emotion frame signals;
inputting the feature training set into an emotion judgment model to be trained, adjusting the preset frame length according to the training result, and re-acquiring the same-emotion frame signals and reconstructing the feature training set according to the adjusted preset frame length, until the training result of the emotion judgment model to be trained satisfies a preset condition, thereby obtaining the emotion judgment sub-model corresponding to the same emotion label; the set frame length being determined by the preset frame length obtained in the last adjustment;
and determining the emotion judgment model from the emotion judgment sub-models corresponding to the respective emotion labels.
4. The method according to claim 3, wherein the step of detecting that the training result of the emotion judgment model to be trained satisfies the preset condition comprises:
evaluating the complexity of the channel features extracted from the plurality of same-emotion frame signals;
determining, according to the training result of the emotion judgment model to be trained, the judgment accuracy for the emotion type of the same-emotion segments;
and, if the balance between the complexity and the judgment accuracy is detected to meet the requirement, determining that the training result of the emotion judgment model to be trained satisfies the preset condition.
5. The method according to claim 3, wherein, after the step of screening the sample set to obtain a plurality of same-emotion segments, the step of training the emotion judgment model further comprises:
acquiring 5.1-channel attributes of the same-emotion segments; the 5.1-channel attributes being determined by analyzing the sound signals of the different channels of each same-emotion segment;
constructing an attribute training set according to the 5.1-channel attributes of the same-emotion segments;
and the step of inputting the feature training set into the emotion judgment model to be trained comprises:
inputting the feature training set and the attribute training set together into the emotion judgment model to be trained.
6. The method according to any one of claims 1 to 5, wherein the step of determining the emotion type of the segment to be judged according to the judgment result of the emotion judgment model comprises:
determining the judgment accuracy of the judgment result of the emotion judgment model;
and, if the judgment accuracy is greater than or equal to a set accuracy threshold, taking the judgment result of the emotion judgment model as the emotion type of the segment to be judged.
7. A 5.1-channel-based video segment emotion judgment apparatus, comprising:
a signal acquisition module, configured to acquire, during video playback, the 5.1-channel signal of a segment of the video to be judged, the segment having a set frame length;
a feature determination module, configured to determine channel feature values of the 5.1-channel signal of the segment to be judged; the channel feature values comprising the correlation of symmetric channels and the root mean square value, standard deviation, maximum value and minimum value obtained from different time-frequency features, the time-frequency features comprising amplitude, degree of fluctuation, rhythmicity of the waveform curve, frequency energy ratio in the frequency domain and MFCC parameters; and
an emotion judgment module, configured to input the channel feature values into a trained emotion judgment model and determine the emotion type of the segment to be judged according to the judgment result of the emotion judgment model; the set frame length being determined when training the emotion judgment model.
8. The apparatus of claim 7, further comprising:
an emotion judgment model training module, configured to construct a sample set, the sample set comprising the 5.1-channel signals of a plurality of video sample segments, each video sample segment corresponding to an emotion label;
screen the sample set to obtain a plurality of same-emotion segments, namely video sample segments in the sample set that carry the same emotion label;
acquire a plurality of same-emotion frame signals, obtained by framing the 5.1-channel signals of the same-emotion segments according to a preset frame length;
construct a feature training set, obtained by processing the channel features extracted from the same-emotion frame signals;
input the feature training set into an emotion judgment model to be trained, adjust the preset frame length according to the training result, and re-acquire the same-emotion frame signals and reconstruct the feature training set according to the adjusted preset frame length, until the training result of the emotion judgment model to be trained satisfies a preset condition, thereby obtaining the emotion judgment sub-model corresponding to the same emotion label, the set frame length being determined by the preset frame length obtained in the last adjustment;
and determine the emotion judgment model from the emotion judgment sub-models corresponding to the respective emotion labels.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201910672842.4A 2019-07-24 2019-07-24 Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment Active CN110381336B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910672842.4A CN110381336B (en) 2019-07-24 2019-07-24 Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910672842.4A CN110381336B (en) 2019-07-24 2019-07-24 Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment

Publications (2)

Publication Number Publication Date
CN110381336A CN110381336A (en) 2019-10-25
CN110381336B true CN110381336B (en) 2021-07-16

Family

ID=68255667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910672842.4A Active CN110381336B (en) 2019-07-24 2019-07-24 Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment

Country Status (1)

Country Link
CN (1) CN110381336B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113395538B (en) * 2020-03-13 2022-12-06 北京字节跳动网络技术有限公司 Sound effect rendering method and device, computer readable medium and electronic equipment
CN113420556B (en) * 2021-07-23 2023-06-20 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN113891136A (en) * 2021-09-18 2022-01-04 深圳Tcl新技术有限公司 Video playing method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010095149A1 (en) * 2009-02-20 2010-08-26 Indian Institute Of Technology, Bombay A device and method for automatically recreating a content preserving and compression efficient lecture video
US9338420B2 (en) * 2013-02-15 2016-05-10 Qualcomm Incorporated Video analysis assisted generation of multi-channel audio data
CN104333727B (en) * 2013-07-22 2019-04-12 腾讯科技(深圳)有限公司 Audio video transmission channel regulates and controls methods, devices and systems

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011071928A3 (en) * 2009-12-07 2012-03-29 Pixel Instruments Corporation Dialogue detector and correction
CN104269169A (en) * 2014-09-09 2015-01-07 山东师范大学 Classifying method for aliasing audio events
CN104463139A (en) * 2014-12-23 2015-03-25 福州大学 Sports video wonderful event detection method based on audio emotion driving
CN104469487A (en) * 2014-12-31 2015-03-25 合一网络技术(北京)有限公司 Detection method and device for scene switching points
CN105847860A (en) * 2016-03-29 2016-08-10 乐视控股(北京)有限公司 Method and device for detecting violent content in video
CN105872855A (en) * 2016-05-26 2016-08-17 广州酷狗计算机科技有限公司 Labeling method and device for video files
CN109120992A (en) * 2018-09-13 2019-01-01 北京金山安全软件有限公司 Video generation method and device, electronic equipment and storage medium
CN109800720A (en) * 2019-01-23 2019-05-24 平安科技(深圳)有限公司 Emotion identification model training method, Emotion identification method, apparatus, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于内容的视频情感提取算法研究";谈文婷;《中国优秀硕士学位论文全文库》;20110515;全文 *
"基于情感特征融合的视频恐怖场景检测算法";王亚迪;《扬州大学学报(自然科学版)》;20180831;全文 *

Also Published As

Publication number Publication date
CN110381336A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110381336B (en) Video segment emotion judgment method and device based on 5.1 sound channel and computer equipment
US9477304B2 (en) Information processing apparatus, information processing method, and program
EP1081960A1 (en) Signal processing method and video/voice processing device
US10820131B1 (en) Method and system for creating binaural immersive audio for an audiovisual content
US10942563B2 (en) Prediction of the attention of an audience during a presentation
US20220182722A1 (en) System and method for automatic detection of periods of heightened audience interest in broadcast electronic media
CN112839195B (en) Conference record consulting method and device, computer equipment and storage medium
US20230290382A1 (en) Method and apparatus for matching music with video, computer device, and storage medium
CN107221341A (en) A kind of tone testing method and device
Khosravan et al. On Attention Modules for Audio-Visual Synchronization.
CN110580914A (en) Audio processing method and equipment and device with storage function
EP3504708B1 (en) A device and method for classifying an acoustic environment
US8712211B2 (en) Image reproduction system and image reproduction processing program
KR102294817B1 (en) Apparatus and method for analyzing video
CN112995530A (en) Video generation method, device and equipment
CN110739006A (en) Audio processing method and device, storage medium and electronic equipment
US7015947B1 (en) Measurement of performance of communications systems
EP3909046B1 (en) Determining a light effect based on a degree of speech in media content
US20160163354A1 (en) Programme Control
US10701459B2 (en) Audio-video content control
Nishimura et al. Sound quality indicating system using EEG and GMDH-type neural network
CN115359409B (en) Video splitting method and device, computer equipment and storage medium
CN112163467B (en) Emotion analysis method, emotion analysis device, electronic equipment and machine-readable storage medium
US11361777B2 (en) Sound prioritisation system and method
US20230343369A1 (en) Post-capture multi-camera editor from audio waveforms and camera layout

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 510800 Tanbu Feida Industrial Park, Huadu District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou Feida Audio Co.,Ltd.

Address before: 510800 Tanbu Feida Industrial Park, Huadu District, Guangzhou City, Guangdong Province

Patentee before: FIRST AUDIO MANUFACTURING Co.,Ltd.

CP01 Change in the name or title of a patent holder