CN109829067B - Audio data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN109829067B
CN109829067B, CN201910165235.9A, CN201910165235A
Authority
CN
China
Prior art keywords
audio
segment
feature
emotion
climax
Prior art date
Legal status
Active
Application number
CN201910165235.9A
Other languages
Chinese (zh)
Other versions
CN109829067A (en)
Inventor
张文文
李岩
姜涛
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910165235.9A
Publication of CN109829067A
Application granted
Publication of CN109829067B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to an audio data processing method, an audio data processing apparatus, an electronic device, and a storage medium, in the technical field of multimedia. The method comprises: performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file; calling an emotion recognition model, inputting the first features of the plurality of first audio segments into the emotion recognition model, and outputting emotion degree values of the plurality of first audio segments; and, according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment in the audio file whose total length is the target length and whose sum of emotion degree values is the largest as a climax segment of the audio file. Rather than simply detecting repeated parts of the audio file, the embodiments of the disclosure analyze the emotion of each part of the audio file, so that the part with relatively intense emotional expression is taken as the climax segment, and the accuracy of the audio data processing method is therefore high.

Description

Audio data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia technologies, and in particular, to an audio data processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of multimedia technology, people often use audio playing applications to play audio files. For example, a song may be played using audio playback software. Each song typically includes a climax segment, that is, the segment of the song in which the theme is most concentrated and the emotion is most intense and richest.
In the related art, an audio data processing method generally detects the audio file and takes the portion that is repeated the most times as the climax section of the audio file. However, the most-repeated portion of a song is not necessarily its climax segment; for example, in some songs the verse and the refrain are repeated the same number of times, while only the refrain is the climax segment. With such a method, the climax segment of the song cannot be acquired, so the accuracy of the audio data processing method is low.
Disclosure of Invention
The present disclosure provides an audio data processing method, apparatus, electronic device, and storage medium, which can overcome the above problem of low accuracy.
According to a first aspect of the embodiments of the present disclosure, there is provided an audio data processing method, including:
performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
calling an emotion recognition model, inputting first features of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments;
and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
In one possible implementation manner, the step of regarding at least one continuous second audio segment of the audio file, which has a total length of a target length and a maximum sum of emotional degree values, as a climax segment of the audio file according to the emotional degree values of the plurality of first audio segments, includes:
obtaining a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length;
acquiring the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the first audio segments;
and taking the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
In one possible implementation manner, the obtaining a sum of the emotional degree values of the second audio segments included in each candidate climax segment according to the emotional degree values of the plurality of first audio segments includes:
determining the emotional degree value of each second audio clip in the audio file according to the emotional degree values of the plurality of first audio clips; acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment; or,
for each candidate climax fragment, acquiring the emotional degree value of each second audio fragment included in the candidate climax fragment according to the emotional degree values of the plurality of first audio fragments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
In one possible implementation manner, the step of regarding at least one continuous second audio segment of the audio file, which has a total length of a target length and a maximum sum of emotional degree values, as a climax segment of the audio file according to the emotional degree values of the plurality of first audio segments, includes:
outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or,
intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or,
and splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
In one possible implementation manner, the performing feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file includes:
segmenting the audio file to obtain a plurality of first audio segments of the audio file;
resampling each first audio segment to obtain a first feature of each first audio segment.
In one possible implementation, the resampling each first audio segment to obtain the first feature of each first audio segment includes:
performing audio processing on each first audio clip according to a target sampling rate and a target window function to obtain a first feature of a Mel scale of each first audio clip;
and processing the first feature of the Mel scale of each first audio clip based on a first objective function to obtain a first feature of a logarithmic scale.
In one possible implementation, the inputting the first features of the first audio segments into the emotion recognition model and outputting the emotion degree values of the first audio segments includes:
inputting the first features of the plurality of first audio segments into the emotion recognition model, and performing feature extraction on the first feature of each first audio segment by the emotion recognition model to obtain a second feature of each first audio segment;
obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment;
outputting the emotion degree values of the plurality of first audio segments.
In one possible implementation manner, the performing, by the emotion recognition model, feature extraction on the first feature of each first audio segment to obtain the second feature of each first audio segment includes:
and calculating the first characteristic of each first audio segment by a convolution layer in the emotion recognition model to obtain the second characteristic of each first audio segment.
In one possible implementation manner, the obtaining the emotion degree value of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature includes:
acquiring position information of each bit of feature value in the second feature;
obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature;
and calculating the third characteristic of each first audio segment based on a full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
In one possible implementation manner, the calculating the third feature of each first audio segment based on the full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment includes:
calculating the third feature of each first audio segment based on a full-link layer in the emotion recognition model to obtain a calculation result corresponding to the third feature;
and calculating a calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotional degree value of each first audio segment.
In one possible implementation, the training process of the emotion recognition model includes:
acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file;
extracting the characteristics of each sample audio file to obtain first characteristics of a plurality of first audio fragments of each sample audio file;
calling an initial model, inputting first characteristics of a plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining second characteristics of the first audio segments by the initial model based on the first characteristics of the first audio segments for each first audio segment;
classifying the first audio segment by the initial model based on a second feature of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment;
obtaining, by the initial model, an emotional degree value of the first audio segment based on the second feature;
for each sample audio file, outputting, by the initial model, a classification result of the sample audio file based on the classification results and emotion degree values of the plurality of first audio segments of the sample audio file, the classification result of the sample audio file being used to represent the emotional tendency of the sample audio file;
and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion labels carried by each sample audio file until target conditions are met to obtain an emotion recognition model, wherein the model parameters comprise parameters required for acquiring the emotion degree value of the first audio fragment based on the second characteristics.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio data processing apparatus comprising:
the device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is configured to perform feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
the emotion recognition module is configured to execute calling an emotion recognition model, input first features of the first audio segments into the emotion recognition model and output emotion degree values of the first audio segments;
and the segment acquisition module is configured to execute at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotional degree values in the audio file as a climax segment of the audio file according to the emotional degree values of the plurality of first audio segments, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
In one possible implementation, the segment obtaining module is configured to perform:
obtaining a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length;
acquiring the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the first audio segments;
and taking the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
In one possible implementation, the segment obtaining module is configured to perform:
determining the emotional degree value of each second audio clip in the audio file according to the emotional degree values of the plurality of first audio clips; acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment; or,
for each candidate climax fragment, acquiring the emotional degree value of each second audio fragment included in the candidate climax fragment according to the emotional degree values of the plurality of first audio fragments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
In one possible implementation, the segment obtaining module is configured to perform:
outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or,
intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or,
and splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
In one possible implementation, the feature extraction module is configured to perform:
segmenting the audio file to obtain a plurality of first audio segments of the audio file;
resampling each first audio segment to obtain a first feature of each first audio segment.
In one possible implementation, the feature extraction module is configured to perform:
performing audio processing on each first audio clip according to a target sampling rate and a target window function to obtain a first feature of a Mel scale of each first audio clip;
and processing the first feature of the Mel scale of each first audio clip based on a first objective function to obtain a first feature of a logarithmic scale.
In one possible implementation, the emotion recognition module is configured to perform:
inputting the first features of the plurality of first audio segments into the emotion recognition model, and performing feature extraction on the first feature of each first audio segment by the emotion recognition model to obtain a second feature of each first audio segment;
obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment;
outputting the emotion degree values of the plurality of first audio segments.
In one possible implementation, the emotion recognition module is configured to perform calculation of the first feature of each first audio segment by a convolution layer in the emotion recognition model, and obtain the second feature of each first audio segment.
In one possible implementation, the emotion recognition module is configured to perform:
acquiring position information of each bit of feature value in the second feature;
obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature;
and calculating the third characteristic of each first audio segment based on a full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
In one possible implementation, the emotion recognition module is configured to perform:
calculating the third feature of each first audio segment based on a full-link layer in the emotion recognition model to obtain a calculation result corresponding to the third feature;
and calculating a calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotional degree value of each first audio segment.
In one possible implementation, the apparatus further includes a model training module configured to perform:
acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file;
extracting the characteristics of each sample audio file to obtain first characteristics of a plurality of first audio fragments of each sample audio file;
calling an initial model, inputting first characteristics of a plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining second characteristics of the first audio segments by the initial model based on the first characteristics of the first audio segments for each first audio segment;
classifying the first audio segment by the initial model based on a second feature of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment;
obtaining, by the initial model, an emotional degree value of the first audio segment based on the second feature;
for each sample audio file, outputting, by the initial model, a classification result of the sample audio file based on the classification results and emotion degree values of the plurality of first audio segments of the sample audio file, the classification result of the sample audio file being used to represent the emotional tendency of the sample audio file;
and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion labels carried by each sample audio file until target conditions are met to obtain an emotion recognition model, wherein the model parameters comprise parameters required for acquiring the emotion degree value of the first audio fragment based on the second characteristics.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
calling an emotion recognition model, inputting first features of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments;
and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of an electronic device, enable the electronic device to perform a method of audio data processing, the method comprising:
performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
calling an emotion recognition model, inputting first features of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments;
and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
According to a fifth aspect of embodiments of the present disclosure, there is provided an application program comprising one or more instructions which, when executed by a processor of an electronic device, enable the electronic device to perform a method of audio data processing, the method comprising:
performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
calling an emotion recognition model, inputting first features of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments;
and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
The technical scheme provided by the embodiments of the present disclosure can have the following beneficial effects. The audio file is segmented, emotion analysis is performed on each first audio segment, and the emotion degree value of each first audio segment is determined, so that, given the length of the climax segment, the portion of that length with the largest sum of emotion degree values is taken as the climax segment of the audio file. Instead of simply detecting repeated parts of the audio file, the emotion of every part of the audio file is analyzed, and the part with relatively intense emotional expression is taken as the climax segment; therefore, the obtained climax segment is accurate and the accuracy of the audio data processing method is high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart illustrating an audio data processing method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of audio data processing according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating a process for training an emotion recognition model in accordance with an exemplary embodiment.
FIG. 4 is a network diagram illustrating an emotion recognition model according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a structure of an audio data processing apparatus according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a configuration of a server according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio data processing method according to an exemplary embodiment, as shown in fig. 1, including the following steps.
In step S11, the electronic device performs feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file.
In step S12, the electronic device invokes an emotion recognition model, inputs the first features of the first audio segments into the emotion recognition model, and outputs emotion degree values of the first audio segments.
In step S13, the electronic device uses at least one continuous second audio clip of the audio file having a total length equal to the target length and a maximum sum of emotional degree values as a climax clip of the audio file according to the emotional degree values of the plurality of first audio clips, where the second audio clip is the same as the first audio clip or different from the first audio clip.
According to the method provided by the embodiments of the present disclosure, the audio file is segmented, emotion analysis is performed on each first audio segment, and the emotion degree value of each first audio segment is determined, so that, given the length of the climax segment, the portion of that length with the largest sum of emotion degree values is taken as the climax segment of the audio file. Instead of simply detecting repeated parts of the audio file, the emotion of every part of the audio file is analyzed, and the part with relatively intense emotional expression is taken as the climax segment; therefore, the obtained climax segment is accurate and the accuracy of the audio data processing method is high.
In one possible implementation manner, the step of taking at least one continuous second audio segment of the audio file, which has a total length of a target length and a maximum sum of emotional degree values, as a climax segment of the audio file according to the emotional degree values of the plurality of first audio segments comprises: obtaining a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length; acquiring the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the plurality of first audio segments; and taking the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
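The candidate-segment search described above amounts to a maximum-sum sliding window over the per-segment emotion degree values. The following is a minimal sketch of that selection, not taken from the patent; the segment lengths, variable names, and the use of numpy are assumptions for illustration:

```python
# Sketch of the max-sum window selection described above, assuming the emotion degree
# value of every second audio segment has already been collected into a list.
import numpy as np

def pick_climax(second_segment_scores, seg_len_s, target_len_s):
    """Return (start_index, end_index) of the contiguous run of second audio segments
    whose total length equals the target length and whose summed emotion degree
    values are the largest."""
    scores = np.asarray(second_segment_scores, dtype=float)
    win = max(1, int(round(target_len_s / seg_len_s)))   # segments per candidate
    if win >= len(scores):
        return 0, len(scores)
    # Sliding-window sums over all candidate climax segments.
    window_sums = np.convolve(scores, np.ones(win), mode="valid")
    start = int(np.argmax(window_sums))                  # candidate with the max sum
    return start, start + win

# Hypothetical usage: 1-second second audio segments, 15-second climax segment.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.2] * 30
print(pick_climax(scores, seg_len_s=1.0, target_len_s=15.0))
```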
In one possible implementation manner, the obtaining a sum of the emotional degree values of the second audio segments included in each candidate climax segment according to the emotional degree values of the plurality of first audio segments includes: determining the emotional degree value of each second audio clip in the audio file according to the emotional degree values of the plurality of first audio clips; acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment; or, for each candidate climax segment, acquiring the emotional degree value of each second audio segment included in the candidate climax segment according to the emotional degree values of the plurality of first audio segments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
In one possible implementation manner, the step of taking at least one continuous second audio segment of the audio file, which has a total length of a target length and a maximum sum of emotional degree values, as a climax segment of the audio file according to the emotional degree values of the plurality of first audio segments comprises: outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or, intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
In one possible implementation manner, the performing feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file includes: segmenting the audio file to obtain a plurality of first audio segments of the audio file; resampling each first audio segment to obtain a first feature of each first audio segment.
In one possible implementation, the resampling each first audio segment to obtain the first feature of each first audio segment includes: performing audio processing on each first audio clip according to a target sampling rate and a target window function to obtain a first characteristic of a Mel scale of each first audio clip; and processing the first feature of the Mel scale of each first audio clip based on a first objective function to obtain a first feature of a logarithmic scale.
In one possible implementation manner, the inputting the first features of the plurality of first audio segments into the emotion recognition model, and outputting the emotion degree values of the plurality of first audio segments includes: inputting the first characteristics of the plurality of first audio segments into the emotion recognition model, and performing characteristic extraction on the first characteristics of each first audio segment by the emotion recognition model to obtain the second characteristics of each first audio segment; obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment; and outputting the emotion degree values of the plurality of first audio segments.
In one possible implementation manner, the performing, by the emotion recognition model, feature extraction on the first feature of each first audio segment to obtain the second feature of each first audio segment includes: and calculating the first characteristic of each first audio segment by a convolution layer in the emotion recognition model to obtain the second characteristic of each first audio segment.
In one possible implementation, the obtaining the emotion degree value of each first audio segment based on the second feature of each first audio segment and the position information of each bit feature value in the second feature includes: acquiring position information of each bit of feature value in the second feature; obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature; and calculating the third characteristic of each first audio segment based on the full-link layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
In a possible implementation manner, the calculating the third feature of each first audio segment based on the full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment includes:
calculating the third feature of each first audio segment based on the full-link layer in the emotion recognition model to obtain a calculation result corresponding to the third feature; and calculating a calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotional degree value of each first audio segment.
In one possible implementation, the training process of the emotion recognition model includes:
acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file;
extracting the characteristics of each sample audio file to obtain first characteristics of a plurality of first audio fragments of each sample audio file;
calling an initial model, inputting first characteristics of a plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining second characteristics of the first audio segments by the initial model based on the first characteristics of the first audio segments for each first audio segment;
classifying the first audio segment by the initial model based on the second characteristics of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment;
obtaining, by the initial model, an emotional degree value of the first audio segment based on the second feature;
for each sample audio file, outputting, by the initial model, a classification result of the sample audio file based on the classification results and emotion degree values of the plurality of first audio segments of the sample audio file, wherein the classification result of the sample audio file is used for representing the emotional tendency of the sample audio file;
and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion labels carried by each sample audio file until target conditions are met to obtain an emotion recognition model, wherein the model parameters comprise parameters required for acquiring the emotion degree value of the first audio fragment based on the second characteristic.
Fig. 2 is a flowchart illustrating an audio data processing method according to an exemplary embodiment, which may include the steps of, as shown in fig. 2:
in step S21, the electronic device acquires an audio file.
In the embodiment of the present disclosure, the electronic device may provide an audio data processing function, and for an audio file, the electronic device may determine a climax segment of the audio file by performing analysis processing on the audio file. The electronic device may be provided as a terminal, or may be provided as a server, which is not limited in this disclosure. Accordingly, the manner and execution timing of the electronic device acquiring the audio file in step S21 may be different, and specifically include the following two cases:
in the first case, when the electronic device is a terminal, the execution time of step S21 may be: when a user wants to acquire a climax section of an audio file, the user can perform a climax section acquisition operation on the electronic device, and when the electronic device acquires a climax section acquisition instruction triggered by the climax section acquisition operation, the user can acquire the audio file so as to analyze and process the audio file subsequently.
Accordingly, in step S21, the manner in which the electronic device obtains the audio file may include the following. In one possible implementation, the audio file is already stored on the electronic device, and the electronic device may obtain the audio file from its currently stored files. In another possible implementation, the audio file is stored on a server and is not among the files currently stored on the electronic device; the electronic device may send an audio file acquisition request to the server, the server acquires and sends the audio file to the electronic device based on the audio file acquisition request, and the electronic device receives the audio file, thereby acquiring it.
In the second case, when the electronic device is a server, the execution time of the step S21 may be: when a user wants to acquire a climax section of a certain audio file, a climax section acquisition operation can be carried out on the terminal. When the terminal acquires the climax section acquisition instruction triggered by the climax section acquisition operation, the terminal can send a climax section acquisition request to the electronic equipment. After receiving the climax section obtaining request, the electronic equipment can obtain the audio file indicated by the climax section obtaining request.
Specifically, the climax fragment acquiring request may carry identification information of an audio file, and accordingly, in step S21, the electronic device may acquire, from a storage file, an audio file corresponding to the identification information based on the identification information of the audio file. For example, the audio file may be stored in a multimedia database, and the electronic device may obtain the audio file corresponding to the identification information from the multimedia database based on the identification information.
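As an illustration of this lookup, a server-side handler might resolve the identification information carried by the climax-section acquisition request to a stored audio file. The sketch below is purely hypothetical; the storage path, file naming, and function names are assumptions and not part of the disclosure:

```python
# Hypothetical lookup of an audio file in a multimedia store keyed by identification info.
from pathlib import Path

AUDIO_DB_DIR = Path("/data/multimedia_db")  # assumed storage location

def fetch_audio_file(identification_info: str) -> bytes:
    """Return the raw bytes of the audio file referenced by the acquisition request."""
    audio_path = AUDIO_DB_DIR / f"{identification_info}.mp3"
    if not audio_path.exists():
        raise FileNotFoundError(f"no audio file for id {identification_info}")
    return audio_path.read_bytes()
```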
In step S22, the electronic device segments the audio file to obtain a plurality of first audio segments of the audio file.
To acquire the climax section, the electronic device can divide the audio file into a plurality of first audio segments and analyze the emotion degree of each first audio segment. Understandably, the emotion degree of the climax section is generally the highest in the whole audio file; by analyzing the audio file segment by segment, the distribution of emotion degrees over the audio file can be obtained, a part with a higher emotion degree can be found, and the part with the highest emotion degree can be taken as the climax section.
Specifically, the electronic device may segment the audio file according to a preset length to obtain a plurality of first audio segments of the audio file, where the length of each first audio segment is the preset length. The electronic device may divide the audio file into a plurality of first audio segments that do not overlap with each other, or two adjacent first audio segments may partially overlap, with the overlapping portion being relatively small. The preset length may be preset by a related technician; for example, if the preset length is 3 seconds, through step S22 the audio file may be divided into a plurality of first audio segments each with a duration of 3 seconds.
In one possible implementation, the process of segmenting the audio file by the electronic device may be implemented by using a window function. That is, the electronic device may obtain the plurality of first audio segments of the audio file by performing windowing on the audio file. The window length may be the preset length, and of course, each parameter of the window function may be set by a relevant technician as required, which is not limited in the embodiment of the present disclosure.
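The segmentation of step S22 can be sketched as follows. This is an illustrative example only, assuming the audio has already been decoded into a 1-D sample array and using a fixed 3-second preset length with an optional small overlap:

```python
# Minimal chunking sketch (not the patent's implementation): split a waveform into
# first audio segments of a preset length, optionally with a small overlap between
# adjacent segments.
import numpy as np

def split_into_segments(samples, sample_rate, segment_s=3.0, overlap_s=0.0):
    """Return a list of equal-length first audio segments (overlap_s < segment_s)."""
    seg_len = int(segment_s * sample_rate)
    hop = seg_len - int(overlap_s * sample_rate)   # hop equals seg_len when no overlap
    segments = []
    for start in range(0, len(samples) - seg_len + 1, hop):
        segments.append(np.asarray(samples[start:start + seg_len]))
    return segments

# Hypothetical usage on 60 s of silence sampled at 22050 Hz.
audio = np.zeros(60 * 22050)
chunks = split_into_segments(audio, 22050, segment_s=3.0, overlap_s=0.5)
print(len(chunks), len(chunks[0]))
```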
In step S23, the electronic device resamples each first audio segment to obtain the first feature of each first audio segment.
After the electronic device segments the audio file, the features of each first audio segment may be extracted first, so that the first audio segments may be analyzed subsequently based on the extracted features. Specifically, the feature extraction step may be implemented by resampling.
In a possible implementation manner, the resampling to obtain the first feature in step S23 may be implemented by using the following steps one and two:
step one, the electronic equipment carries out audio processing on each first audio frequency fragment according to a target sampling rate and a target window function to obtain a first characteristic of a Mel scale of each first audio frequency fragment.
In step one, for each first audio segment, the electronic device may resample the audio and extract the mel-frequency spectrum feature of the first audio segment. The target sampling rate and the target window function may be set by a related technician according to requirements; for example, the target sampling rate may be 22050 hertz (Hz), the target window function may be a Hamming window function, the window length may be 2048, and the frame shift may be 512, so that 129 128-dimensional mel-frequency spectrum features may be extracted through the resampling process. The above provides only one specific example, and the target sampling rate and the target window function are not limited by the embodiments of the present invention.
And step two, the electronic equipment can process the first feature of the Mel scale of each first audio clip based on the first objective function to obtain the first feature of the logarithmic scale.
It should be noted that, after the electronic device performs sampling and windowing on the first audio segment, the first feature obtained may be on the mel scale. In the second step, the electronic device may convert the mel-scale first feature into a logarithmic-scale first feature, thereby facilitating subsequent calculation on the first feature. The first objective function may be set by a skilled person in advance; for example, the first objective function may be g(x) = log(1 + 10000x), and the mel-scale first feature may be converted into the logarithmic-scale first feature by the first objective function, where x is the mel-scale first feature, log() is the logarithmic function, and g(x) is the logarithmic-scale first feature.
Through the first step and the second step, the first feature of each of the plurality of first audio segments of the audio file can be obtained. For example, a first audio segment may be referred to as a chunk, and the first feature obtained for each chunk may be X_t, t ∈ {1, …, T}, where T is a positive integer and t is the identifier of the first audio segment (chunk). X_t may be 129×128, i.e., 129 first features of 128 dimensions.
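A possible realization of steps one and two (resampling to 22050 Hz, a Hamming window of length 2048 with frame shift 512, 128 mel bands, then the log-scale mapping g(x) = log(1 + 10000x)) is sketched below. The patent does not name a library; librosa is used here only for illustration:

```python
# Sketch of the resampling/feature step; library choice and exact frame count are
# assumptions, not requirements of the disclosure.
import librosa
import numpy as np

def first_feature(segment, orig_sr, target_sr=22050):
    y = librosa.resample(segment, orig_sr=orig_sr, target_sr=target_sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=target_sr, n_fft=2048, hop_length=512,
        window="hamming", n_mels=128)
    # First objective function: convert the mel-scale feature to a logarithmic scale.
    return np.log1p(10000.0 * mel).T   # shape ≈ (frames, 128), roughly 129 x 128 for 3 s

# Hypothetical usage on one 3-second first audio segment sampled at 44100 Hz.
seg = np.random.randn(3 * 44100).astype(np.float32)
X_t = first_feature(seg, orig_sr=44100)
print(X_t.shape)
```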
The steps S22 to S23 are a process of extracting features of the audio file to obtain first features of a plurality of first audio segments of the audio file, and the features of each first audio segment in the audio file can be extracted and obtained through the feature extraction step, so that the emotional degree of the first audio segment can be analyzed based on the features, and further, the climax segment of the audio file can be determined according to the emotional degree distribution of the audio file embodied by the emotional degrees of the plurality of first audio segments.
In step S24, the electronic device invokes an emotion recognition model, inputs the first features of the multiple first audio segments into the emotion recognition model, and performs feature extraction on the first feature of each first audio segment by using the emotion recognition model to obtain the second feature of each first audio segment.
After obtaining the first features of the first audio segments, the electronic device may analyze the emotion degree of each first audio segment through an emotion recognition model, and further determine the climax segments based on the analysis result. The electronic device may invoke an emotion recognition model, input the first features of the multiple first audio segments obtained in step S23 into the emotion recognition model, and analyze each first audio segment by the emotion recognition model to obtain an emotion degree value of each first audio segment, where the emotion degree value is used to represent an emotion degree of the first audio segment.
After the electronic device inputs the first features of the plurality of first audio segments into the emotion recognition model, the emotion recognition model can perform multi-step processing on the first features. Specifically, for each first audio segment, the emotion recognition model may further perform feature extraction on the first feature of the first audio segment to obtain a second feature that can more accurately express the first audio segment, and then analyze the emotional degree of the first audio segment based on the second feature.
Specifically, the process of acquiring the second feature by the emotion recognition model may be: and calculating the first characteristic of each first audio segment by a convolution layer in the emotion recognition model to obtain the second characteristic of each first audio segment.
For example, the emotion recognition model may be a Convolutional Neural Network (CNN) model in which the first feature of the first audio segment is calculated by 3 convolutional layers: the first feature of the first audio segment may be input into the first convolutional layer, the first convolutional layer computes on the first feature, the computation result is input into the next convolutional layer, the next convolutional layer further computes on the input result, and so on, until the last of the 3 convolutional layers outputs the second feature of each first audio segment.
In one specific example, the first features of each audio file input to the emotion recognition model may be X_t, t ∈ {1, …, T}, where T is a positive integer and t is the identifier of the first audio segment (chunk). The emotion recognition model may use the 3 convolutional layers for feature extraction, and then process the extracted features through a pooling layer to obtain a second feature h_t. The pooling layer may obtain the second feature in a max-pooling-over-time manner. Max pooling over time is a down-sampling mode in CNN models; the electronic device may obtain the second feature through the pooling layer, the second feature may be understood as an intermediate feature, and further processing may then be performed based on this intermediate feature to obtain the desired emotional degree, emotional tendency, and the like. For example, max pooling over time may be implemented by the following formula:
h_t = TimeMaxPool(Conv(X_t))
where h_t is the second feature, TimeMaxPool() is the max pooling function, Conv() is the convolution function, X_t is the first feature, and t is the identifier of the first audio segment.
By using the formula, the electronic device may only take the value with the largest score among the features extracted by the convolutional layer as the second feature, and discard other feature values.
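The 3-convolutional-layer feature extraction followed by max pooling over time, i.e. h_t = TimeMaxPool(Conv(X_t)), can be sketched as follows. The framework (PyTorch), channel sizes, and kernel sizes are assumptions, since the disclosure only specifies a CNN with 3 convolutional layers and time-wise max pooling:

```python
# Illustrative sketch of the segment encoder; all layer hyperparameters are assumed.
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    def __init__(self, n_mels=128, channels=64):
        super().__init__()
        self.conv = nn.Sequential(                     # 3 convolutional layers over time
            nn.Conv1d(n_mels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x_t):                            # x_t: (batch, frames, n_mels)
        feats = self.conv(x_t.transpose(1, 2))         # (batch, channels, frames)
        h_t, _ = feats.max(dim=2)                      # max pooling over the time axis
        return h_t                                     # second feature, (batch, channels)

# Hypothetical usage for one chunk of 129 frames x 128 mel dimensions.
x = torch.randn(1, 129, 128)
print(SegmentEncoder()(x).shape)                       # torch.Size([1, 64])
```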
It should be noted that the emotion recognition model is stored in the electronic device, and the electronic device may call the local emotion recognition model. The emotion recognition model can also be stored in other electronic equipment, and when the electronic equipment needs emotion recognition, the emotion recognition model can be called from other electronic equipment. Of course, the emotion recognition model can be obtained by training on the electronic device, and can also be obtained by training on other electronic devices, so that the emotion recognition model is packaged into a configuration file by the other electronic devices and is sent to the electronic device, and the electronic device can obtain the emotion recognition model.
In step S25, the emotion recognition model in the electronic device obtains an emotion degree value of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature.
The position information of each bit of feature value in the second feature is used to indicate the corresponding position of that feature value in the first audio segment. It can be understood that audio has time-series correlation, and the closer two parts of the audio are in time, the stronger the correlation; therefore, when acquiring the emotion degree value of each first audio segment, the electronic device may take into account the positions, within the first audio segment, of the feature values corresponding to the respective time points, so that the acquired emotion degree value is more accurate by virtue of this correlation.
Specifically, the process of obtaining the emotion degree value by the emotion recognition model in step S25 can be implemented by the following steps one to three:
step one, the emotion recognition model acquires position information of each position of characteristic value in the second characteristic.
For a first audio segment, before emotion recognition is performed on a second feature of the first audio segment, position information may be added to the second feature, and when adding the position information, position information of each feature value in the second feature may be acquired first.
In one possible implementation, the position information may be denoted by p_t. The dimension of p_t and the dimension of the second feature h_t may both be M, where p_{t,j} represents the j-th dimension of p_t. The calculation of p_t can be realized by the following two formulas (the formulas themselves are rendered as images in the original publication):
[formula for p_{t,2z-1}, a sine term]
[formula for p_{t,2z}, a cosine term]
where p_{t,2z-1} is the (2z-1)-th dimension of p_t, p_{t,2z} is the 2z-th dimension of p_t, M is the dimension of p_t, t is the identifier of the first audio segment, sin() is a sine function, and cos() is a cosine function.
And step two, the emotion recognition model obtains a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature.
After the position information of each feature value in the second feature is obtained through calculation, the emotion recognition model can embed the position information into the second feature to obtain a third feature. The third feature is the feature obtained by integrating the second feature with the position information of each feature value in the second feature.
In one possible implementation, the third feature may be denoted by h̃_t (the symbol is rendered as an image in the original publication). That is, each feature value in the third feature may be the sum of the corresponding feature value of the second feature and the value of the corresponding bit in the position information. Specifically, this can be achieved by the following formula:
h̃_t = h_t + p_t
where h̃_t is the third feature, h_t is the second feature, p_t is the position information, and t is the identifier of the first audio segment.
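The two p_t formulas above are rendered as images in the original publication; the surrounding definitions (sine terms for odd dimensions, cosine terms for even dimensions, dimension M equal to that of h_t) match a standard sinusoidal positional encoding, so the sketch below assumes the common Transformer-style form. The base constant and exact indexing are assumptions, not the patent's formulas:

```python
# Assumed sinusoidal position information p_t and its embedding into the second feature.
import numpy as np

def position_encoding(t, M, base=10000.0):
    """Return p_t, the M-dimensional position information for segment identifier t.
    Assumes M is even."""
    p = np.zeros(M)
    for z in range(1, M // 2 + 1):
        angle = t / base ** (2 * z / M)
        p[2 * z - 2] = np.sin(angle)   # p_{t,2z-1}: odd (1-based) dimensions use sine
        p[2 * z - 1] = np.cos(angle)   # p_{t,2z}:  even (1-based) dimensions use cosine
    return p

# Embedding into the second feature to obtain the third feature (h~_t = h_t + p_t).
h_t = np.random.randn(64)
h_tilde = h_t + position_encoding(t=5, M=64)
print(h_tilde.shape)
```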
And step three, the emotion recognition model calculates the third feature of each first audio segment based on the fully connected layers in the emotion recognition model to obtain the emotion degree value of each first audio segment.
After the electronic device acquires the third feature, the emotional degree of the first audio segment can be analyzed based on the third feature. In a possible implementation manner, the electronic device may calculate the third feature of each first audio segment based on a full connection layer in the emotion recognition model, and obtain a calculation result corresponding to the third feature. Furthermore, the electronic device may calculate a calculation result corresponding to the third feature of each first audio segment through the second objective function, so as to obtain an emotional degree value of each first audio segment.
For example, the number of the full-connected layers used for calculating the third feature in the emotion recognition model may include 4, the third feature is calculated through the 4 full-connected layers to obtain a calculation result, and then the emotion degree value of the first audio segment may be obtained through further calculation by using the second objective function. Specifically, the calculation process can be implemented by the following two formulas:
Figure BDA0001986074930000164
wherein,
Figure BDA0001986074930000165
as a third feature, ftFor the calculation result of the full-connected layer, FC () is the convolution process performed by the full-connected layer, where FC means full-connected (full connected) and t is the identifier of the first audio piece.
α_t = W_2 tanh(W_1 f_t + σ_1) + σ_2
where α_t is the emotion degree value, f_t is the calculation result of the fully connected layers, tanh() is the hyperbolic tangent function, W_1, W_2, σ_1 and σ_2 are parameters, and t is the identifier of the first audio segment.
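A sketch of step three under stated assumptions (the example uses 4 fully connected layers for FC(); the layer sizes and the PyTorch framework are assumptions): f_t = FC(h̃_t) is computed by the fully connected layers, and the second objective function α_t = W_2·tanh(W_1·f_t + σ_1) + σ_2 then yields the emotion degree value:

```python
# Illustrative emotion-degree head; dimensions are assumed. nn.Linear bias terms play
# the roles of sigma_1 and sigma_2.
import torch
import torch.nn as nn

class EmotionDegreeHead(nn.Module):
    def __init__(self, in_dim=64, hidden=64, attn_dim=32):
        super().__init__()
        self.fc = nn.Sequential(                 # 4 fully connected layers for FC()
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.w1 = nn.Linear(hidden, attn_dim)    # W_1 f_t + sigma_1
        self.w2 = nn.Linear(attn_dim, 1)         # W_2 tanh(.) + sigma_2

    def forward(self, h_tilde):                  # h_tilde: (batch, in_dim)
        f_t = self.fc(h_tilde)                   # calculation result of the FC layers
        alpha_t = self.w2(torch.tanh(self.w1(f_t)))
        return alpha_t.squeeze(-1)               # emotion degree value per segment

# Hypothetical usage on a batch of three third features.
print(EmotionDegreeHead()(torch.randn(3, 64)).shape)   # torch.Size([3])
```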
In one possible implementation, the parameter values in the second objective function are obtained through the training process of the emotion recognition model. Corresponding to the formula example above, W_1, W_2, σ_1 and σ_2 are the parameter values of the second objective function and can be obtained by training during the emotion recognition model training process.
In step S26, the emotion recognition model in the electronic device outputs emotion degree values of the first audio segments.
After obtaining the emotion degree value of each first audio segment, the emotion recognition model in the electronic equipment can output the emotion degree value, so that the electronic equipment can determine the climax segment of the audio file based on the output emotion degree value of each first audio segment.
The steps S24 to S26 are processes of invoking an emotion recognition model, inputting the first features of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments. The emotion recognition model can analyze and obtain the emotion degree distribution of the audio file based on the features of the first audio segments. In the following, the training process of the emotion recognition model is described by taking training performed on the electronic device as an example; the training process may include the following steps:
the method comprises the steps that firstly, electronic equipment obtains a plurality of sample audio files, wherein each sample audio file carries an emotion label, and the emotion labels are used for expressing emotion tendencies of the sample audio files.
When model training is needed, the electronic device can obtain a plurality of sample audio files and train the model parameters of the initial model with the sample audio files, so as to improve the emotion recognition accuracy of the initial model. The plurality of sample audio files may be stored in a multimedia database, and the electronic device may obtain the plurality of sample audio files from the multimedia database. Of course, the electronic device may also obtain the plurality of sample audio files in other ways, for example, by crawling audio files from a website as the sample audio files.
In a possible implementation manner, the electronic device may obtain identification information of a plurality of sample audio files and corresponding emotion tags from a first database, and obtain the corresponding sample audio files from a second database based on the identification information.
The emotion tag may be obtained by analyzing the sample audio file in advance by a related technician, and the content of the emotion tag is the emotional tendency of the sample audio file; the emotional tendency may include sad, happy, aggressive, peaceful, and the like. The above are merely examples, and the emotional tendencies are not limited in the embodiments of the present disclosure. Likewise, the number of types of emotion tags is not limited in the embodiments of the present disclosure; for example, the emotion tags may include 190 types.
And secondly, the electronic equipment extracts the characteristics of each sample audio file to obtain the first characteristics of a plurality of first audio clips of each sample audio file.
The process of extracting the features of each sample audio file by the electronic device in step two is the same as the feature extraction steps in step S22 and step S23, and the embodiment of the present disclosure is not repeated herein.
And step three, the electronic equipment calls an initial model, first characteristics of a plurality of first audio clips of the plurality of sample audio files are input into the initial model, and for each first audio clip, second characteristics of the first audio clip are obtained through the initial model based on the first characteristics of the first audio clip.
In the third step, the process of performing feature extraction on each first audio segment of each sample audio file by the electronic device is the same as the feature extraction step in step S24, and is not repeated here in the embodiments of the present disclosure. After obtaining the second feature, the initial model may output the second feature to two branches: one branch performs classification based on the second feature, and the other branch obtains an emotion degree value based on the second feature, as described in step four and step five below.
And fourthly, classifying the first audio segment by the initial model based on the second characteristic of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for expressing the emotional tendency of the first audio segment.
In one branch, after obtaining the second feature of each first audio segment, the initial model may classify the first audio segment according to the second feature of the first audio segment to determine the emotional tendency of the first audio segment, for example, to determine whether the emotion expressed by the first audio segment is sad or happy, and the like. In this branch, segment-level prediction of the audio file is completed, and the audio file can then be predicted as a whole based on the segment-level prediction results.
The classification result obtained by the classification can be the probability that the first audio clip is of each emotional tendency. For example, if the emotional tendencies include 190 types, the classification result may be a 190-dimensional array or matrix, and the feature value of each bit is the probability that the emotional tendency of the first audio piece is the emotional tendency corresponding to the bit.
In a specific example, the classification step may be implemented based on 2 fully connected layers, and the classification process may be implemented based on the following formula:

ŷt = softmax(FC(ht))

where ŷt is the classification result of the first audio segment, t is the identifier of the first audio segment, softmax() is the normalized exponential function, which can be used here to classify the output of the fully connected layers, ht is the second feature, and FC() denotes the processing performed by the fully connected (FC, fully connected) layers.
And step five, acquiring the emotion degree value of the first audio clip by the initial model based on the second characteristic.
In another branch, the electronic device may further calculate the second feature to obtain an emotion degree value of the first audio segment, so as to use the emotion degree value as a weight of the classification result obtained in the fourth step, thereby performing the following sixth step of predicting the emotional tendency of the whole audio file.
The process of obtaining the emotion degree value of the first audio segment in the fifth step is similar to that shown in the step S25, and details of the embodiment of the present disclosure are not repeated herein.
And step six, for each sample audio file, the initial model outputs the classification result of the sample audio file based on the classification results and the emotion degree values of the plurality of first audio segments of the sample audio file, wherein the classification result of the sample audio file is used for representing the emotional tendency of the sample audio file.
After the initial model obtains the classification result and the emotion degree value of each first audio segment, the emotion degree value can be used as the weight of the classification result of the first audio segment, so that the sample audio file is wholly predicted. Specifically, for a sample audio file, the initial model may perform weighted summation on the classification results and emotion degree values of a plurality of first audio segments of the sample audio file to obtain the classification result of the sample audio file.
In a possible implementation manner, the process of obtaining the classification result of the sample audio file in step six can be implemented by the following formula:

ŷ = Σt αt·ŷt

where ŷ is the classification result of the sample audio file, αt is the emotion degree value of the first audio segment, t is the identifier of the first audio segment, ŷt is the classification result of the first audio segment, and Σ is the summation symbol.
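As an illustrative sketch (assuming NumPy arrays), the formula above amounts to an attention-weighted sum of the segment-level classification results:

import numpy as np

def song_level_classification(segment_probs, alphas):
    # segment_probs: (T, C) classification results of the T first audio segments
    # alphas:        (T,)  emotion degree values (weights) of the segments
    # Returns the (C,) classification result of the sample audio file.
    return (alphas[:, None] * np.asarray(segment_probs)).sum(axis=0)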
And seventhly, the electronic equipment adjusts the model parameters of the initial model based on the classification results of the sample audio files and the emotion labels carried by the sample audio files until the target conditions are met, and the emotion recognition model is obtained.
The initial model may obtain an overall prediction result of the sample audio file, that is, the classification result of the sample audio file, and output the classification result. The electronic device may determine the classification accuracy of the model based on the classification result of the sample audio file and the emotional tendency of the sample audio file indicated by the emotion tag carried by the sample audio file. If the classification accuracy has not converged, or the current iteration number is less than a number threshold, the model parameters of the initial model are adjusted, and training continues until the target condition is met.
Wherein, the emotional tendency of the sample audio file indicated by the emotional tag is the correct classification result of the sample audio file. The target condition may be that the classification accuracy converges, or that the current iteration number reaches a number threshold.
In a possible implementation manner, the model training process may be performed based on a gradient descent method, and in a specific possible embodiment, the training may use a plurality of sample audio files as one batch for gradient descent. Of course, other methods, such as k-fold cross-validation, may also be used, and the embodiments of the present disclosure are not limited thereto.
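For illustration, a minimal training-loop sketch under these assumptions (PyTorch, batches of sample audio files, song-level emotion labels as class indices, and a model that returns the emotion degree values and song-level class probabilities as in the architecture sketch given further below) might look as follows; it is not the exact training procedure of the embodiments:

import torch
import torch.nn.functional as F

def train(model, loader, num_epochs=10, lr=1e-3):
    # loader yields (features, labels): the first features of the first audio segments
    # of a batch of sample audio files, and the emotion-tag class index of each file.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    for _ in range(num_epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            _, song_probs = model(features)                 # song-level classification results
            loss = F.nll_loss(torch.log(song_probs + 1e-8), labels)
            loss.backward()
            optimizer.step()                                # adjust model parameters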
For example, in a specific example, the electronic device may acquire 30000 songs, and take 6000 songs of the 30000 songs as a verification set and 18000 songs as a training set, to perform the above model training process. As shown in fig. 3, after performing feature extraction on the audio file, the electronic device may input the extracted first feature of each first audio clip into the convolutional layer, perform feature extraction on the first feature by the convolutional layer, and perform feature extraction by the pooling layer to obtain the second feature of each first audio clip. After the second feature is obtained, the second feature may be divided into two branches, where in one branch, the second feature is input into the full-link layer for calculation, and a classification result of each first audio segment is obtained through processing of a normalization index (softmax) function, so as to implement prediction of a segment level. In the other branch, position information is added to the second feature, and an Attention (Attention) mechanism is used to obtain an emotional degree value of each first audio segment, that is, a weight of the classification result of each first audio segment. The model may then integrate the classification results and weights for each first audio piece to obtain a classification result for the entire audio file.
For another example, the network structure of the emotion recognition model may be as shown in fig. 4. Assuming that one batch contains 16 sample audio files and each sample audio file is divided into 8 first audio segments, the first features input for this batch have a dimension of 16 × 8 × 129 × 128. The emotion recognition model may perform feature extraction on the first features based on 3 convolutional layers and obtain the second features through a pooling layer, where the convolutional layers are activated by a linear rectification function (ReLU). The emotion recognition model may add position information to the second features, calculate the second features with added position information based on 4 fully connected layers in the attention mechanism, and obtain the emotion degree value of each first audio segment through a normalization function, so that the data obtained at this point has a dimension of 16 × 8. For the second features, the emotion recognition model may further perform segment-level prediction through 2 fully connected layers and a normalization function to obtain the classification result of each first audio segment; taking 190 types of emotion tags as an example, the classification results of the first audio segments of the 16 audio files have a dimension of 16 × 8 × 190. Finally, the emotion recognition model can combine the classification results and the emotion degree values to perform song-level prediction, so that the output results of the 16 audio files have a dimension of 16 × 190.
It can be understood that, in the embodiment of the present disclosure, when the trained emotion recognition model is used, after the second feature is obtained in the pooling layer, the second feature is not input into the full link layer for classification, but position information is directly added, and the emotion degree value of each first audio segment is obtained through an Attention mechanism, that is, the emotion degree value can be output.
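The following PyTorch sketch illustrates the two-branch structure described above: convolution and pooling layers producing the second feature, an attention branch producing the emotion degree values, and a classification branch producing the segment-level results, which are combined into a song-level result during training and skipped during inference. All layer sizes, channel counts and the learnable position embedding are assumptions for illustration, not the exact configuration of the embodiments:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionRecognitionSketch(nn.Module):
    def __init__(self, num_classes=190, hidden=128, max_segments=64):
        super().__init__()
        # Feature extraction: convolution layers + pooling -> second feature h_t.
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, hidden)
        # Learnable position information p_t added to the second feature.
        self.pos = nn.Embedding(max_segments, hidden)
        # Attention branch: fully connected layers -> emotion degree value per segment.
        self.attn = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        # Classification branch: fully connected layers -> segment-level prediction.
        self.cls = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, x, classify=True):
        # x: (B, T, freq_bins, frames), first features of T first audio segments.
        B, T = x.shape[:2]
        h = self.convs(x.reshape(B * T, 1, *x.shape[2:])).flatten(1)
        h = self.proj(h).reshape(B, T, -1)                    # second feature
        h = h + self.pos(torch.arange(T, device=x.device))    # third feature
        alpha = F.softmax(self.attn(h).squeeze(-1), dim=1)    # emotion degree values (B, T)
        if not classify:
            return alpha            # at inference time only the degree values are needed
        seg_probs = F.softmax(self.cls(h), dim=-1)            # segment-level results (B, T, C)
        song_probs = (alpha.unsqueeze(-1) * seg_probs).sum(1) # song-level result (B, C)
        return alpha, song_probs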
In step S27, the electronic device uses at least one continuous second audio segment of the audio file having a total length equal to the target length and a maximum sum of emotional degree values as the climax segment of the audio file according to the emotional degree values of the plurality of first audio segments.
After obtaining the emotion degree values of the plurality of first audio segments, the electronic device may further confirm the climax segments of the audio file. The length of the climax fragment, i.e. the target length, may be set in the electronic device. The target length may be preset by a related technician, or may be set by a user according to a use requirement of the user, for example, the target length may be 24 seconds, and a specific value of the target length is not limited in the embodiment of the present disclosure.
Of course, another possibility is that the target length is a first target length preset by a person skilled in the art, and the user changes the first target length into a second target length according to the use requirement during use. For example, the target length defaults to 24 seconds, and the user changes the target length to 30 seconds; the target length may also be changed to other values, such as 10 seconds.
It can be understood that the emotion degree values of the climax segment of an audio file are relatively large. When acquiring the climax segment, the electronic device can obtain the emotion degree distribution of the audio file according to the emotion degree values of the first audio segments obtained in step S26, so that, according to the length of the climax segment, the electronic device can compare the sums of the emotion degree values of segments of that length in the audio file; the segment whose sum is the largest is the climax segment of the audio file.
In determining the climax segment, the electronic device can make a determination based on the second audio segment. Wherein the second audio segment is the same as the first audio segment, or the second audio segment is different from the first audio segment.
In a possible case, the second audio segment may be the same as the first audio segment, and in step S27, after obtaining the emotion level value of each first audio segment, the electronic device may use at least one consecutive first audio segment having a total length as the target length and a maximum sum of the emotion level values as the climax segment of the audio file. For example, the length of the first audio segment may be 3 seconds, and the target length is 30 seconds, the electronic device may regard, as the climax segment, 10 consecutive first audio segments having the largest sum of the emotion degree values.
In another case, the second audio segment is different from the first audio segment. For example, the length of the second audio segment may be smaller than the length of the first audio segments, such as the length of the second audio segment is 1 second, and then three second audio segments are included in each first audio segment. The length of the first audio segment may be 3 seconds, the length of the second audio segment may be 1 second, and the target length is 30 seconds, then the electronic device may regard the 30 consecutive second audio segments with the largest sum of the emotional degree values as the climax segments. The emotional degree value of each second audio segment can be determined based on the emotional degree value of the first audio segment in which the second audio segment is located. Specifically, the emotion degree value of the second audio segment may be the same as the emotion degree value of the first audio segment where the second audio segment is located, or may be a ratio of the emotion degree value of the first audio segment where the second audio segment is located to the number of the second audio segments included in the first audio segment, which is not limited in this disclosure.
As another example, the second audio segment may be longer than the first audio segment. The emotional degree value of each second audio segment can be determined based on the emotional degree values of the first audio segments included in the second audio segment. Specifically, the emotion degree value of the second audio segment may be a sum of emotion degree values of the first audio segments included in the second audio segment, or may be an average of emotion degree values of the first audio segments included in the second audio segment, which is not limited in this disclosure. For another example, the length of the second audio segment is the same as the length of the first audio segment, but the second audio segment is divided from the first audio segment in a different manner, for example, the playing time corresponding to a certain first audio segment is 4 seconds to 6 seconds, and the playing time corresponding to a certain second audio segment is 3 seconds to 5 seconds.
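As an illustrative sketch (NumPy, using the 3-second/1-second example above), the emotion degree values of the second audio segments could be derived from those of the first audio segments as follows; whether each second audio segment inherits the full value or an even share of it corresponds to the two options just described:

import numpy as np

def second_segment_values(first_values, ratio=3, split_evenly=False):
    # first_values: (T,) emotion degree values of the first audio segments (e.g. 3 s each)
    # ratio:        number of second audio segments (e.g. 1 s each) per first audio segment
    values = np.repeat(np.asarray(first_values, dtype=float), ratio)
    return values / ratio if split_evenly else values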
The following describes a specific process in which the electronic device acquires the climax segment in step S27. Specifically, step S27 can be implemented by the following steps one to three.
Step one, the electronic equipment acquires a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length.
The electronic device may first obtain at least one continuous second audio segment as a candidate climax segment according to the target length, so as to compare the emotional degree of the plurality of candidate climax segments.
And step two, the electronic equipment acquires the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the plurality of first audio segments.
After obtaining the plurality of candidate climax segments, the electronic device may obtain a sum of the emotion degree values of the second audio segments included in each candidate climax segment to compare the emotion degrees of the plurality of candidate climax segments. Specifically, the obtaining process of the sum of the emotion degree values of the second audio segment included in each candidate climax segment is realized by any one of the following methods:
in the first mode, the electronic equipment determines the emotion degree value of each second audio clip in the audio file according to the emotion degree values of the multiple first audio clips; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment.
In the second mode, for each candidate climax segment, the electronic equipment acquires the emotion degree value of each second audio segment included in the candidate climax segment according to the emotion degree values of the plurality of first audio segments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
In the first mode, when the emotion degree values of the plurality of first audio segments are acquired, the electronic device may determine the emotion degree value of each second audio segment of the audio file, and subsequently acquire the emotion degree value of a second audio segment whenever it needs to be used. In the second mode, when the emotion degree values of the plurality of first audio segments are acquired, the electronic device may not determine the emotion degree value of each second audio segment at first; when the emotion degree of a candidate climax segment needs to be determined, the electronic device may determine the emotion degree values of the second audio segments included in the candidate climax segment based on the emotion degree values of the first audio segments, and then perform the summation.
And step three, the electronic equipment takes the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
It is understood that the candidate climax segment having the larger sum is more likely to be a climax segment, and thus the electronic device may regard the candidate climax segment having the largest sum as the climax segment.
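A minimal sketch of steps one to three, assuming the emotion degree value of every second audio segment is already known and the target length is expressed as a number of consecutive second audio segments:

import numpy as np

def find_climax(second_values, target_len):
    # second_values: (N,) emotion degree values of the consecutive second audio segments
    # target_len:    number of consecutive second audio segments in the climax segment
    values = np.asarray(second_values, dtype=float)
    # Sum of emotion degree values of every candidate climax segment (sliding window).
    window_sums = np.convolve(values, np.ones(target_len), mode="valid")
    start = int(np.argmax(window_sums))        # candidate with the maximum sum
    return start, start + target_len           # play start point and play end point

# Example: 1-second second audio segments and a 30-second target length.
# start, end = find_climax(per_second_values, target_len=30)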
For example, the electronic device may obtain an emotion degree curve of the audio file based on emotion degree values of a plurality of first audio segments of the audio file, an abscissa of the emotion degree curve may be play time, and an ordinate of the emotion degree curve may be emotion degree values, so that the climax segments of the audio file are determined by processing the emotion degree curve.
It should be noted that the manner in which the electronic device acquires the climax segment may include multiple embodiments. For example, the electronic device may only need to indicate which part of the audio file corresponds to the climax segment, and may output a play start point and a play end point. The electronic device may also intercept the climax segment from the audio file as a new audio file.
Specifically, in step S27, the step of acquiring the climax fragment by the electronic device may be: the electronic equipment outputs a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or, the electronic equipment intercepts a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values from the audio file as a climax segment of the audio file; or the electronic equipment splices at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
Accordingly, in the third step, the electronic device may output the play start point and the play end point of the candidate climax segment with the largest sum; or, the electronic equipment intercepts the candidate climax segment with the maximum sum value from the audio file as the climax segment of the audio file; or the electronic equipment splices at least one continuous second audio segment corresponding to the candidate climax segment with the maximum value to be used as the climax segment.
In one possible implementation, an audio file may also include multiple climax segments; for example, a song may include two identical climax segments. Then, in step S27, if the audio file includes multiple groups of at least one continuous second audio segment whose total length is the target length and whose sum of emotion degree values is the maximum, all of the groups may be acquired as climax segments of the audio file.
According to the method and apparatus for processing audio data provided by the embodiments of the present disclosure, the audio file is segmented, emotion analysis is performed on each first audio segment, and the emotion degree value of each first audio segment is determined, so that, according to the length of the climax segment, the part of that length with the largest sum of emotion degree values is used as the climax segment of the audio file. Repeated parts in the audio file are not simply detected; instead, the emotion of each part of the audio file is analyzed, and the part with relatively intense emotional expression is used as the climax segment, so that the accuracy of the obtained climax segment is high and the accuracy of the audio data processing method is high.
Fig. 5 is a schematic diagram illustrating a structure of an audio data processing apparatus according to an exemplary embodiment. Referring to fig. 5, the apparatus includes:
a feature extraction module 501, configured to perform feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
the emotion recognition module 502 is configured to execute calling an emotion recognition model, input the first features of the plurality of first audio segments into the emotion recognition model, and output emotion degree values of the plurality of first audio segments;
the segment obtaining module 503 is configured to execute at least one continuous second audio segment having a total length as a target length and a maximum sum of emotional level values in the audio file as a climax segment of the audio file according to the emotional level values of the plurality of first audio segments, where the second audio segment is the same as the first audio segment or different from the first audio segment.
In one possible implementation, the fragment acquisition module 503 is configured to perform: obtaining a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length; acquiring the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the plurality of first audio segments; and taking the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
In one possible implementation, the fragment acquisition module 503 is configured to perform: determining the emotional degree value of each second audio clip in the audio file according to the emotional degree values of the plurality of first audio clips; acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment; or, for each candidate climax segment, acquiring the emotional degree value of each second audio segment included in the candidate climax segment according to the emotional degree values of the plurality of first audio segments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
In one possible implementation, the fragment acquisition module 503 is configured to perform: outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or, intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
In one possible implementation, the feature extraction module 501 is configured to perform: segmenting the audio file to obtain a plurality of first audio segments of the audio file; resampling each first audio segment to obtain a first feature of each first audio segment.
In one possible implementation, the feature extraction module 501 is configured to perform: performing audio processing on each first audio clip according to a target sampling rate and a target window function to obtain a first characteristic of a Mel scale of each first audio clip; and processing the first feature of the Mel scale of each first audio clip based on a first objective function to obtain a first feature of a logarithmic scale.
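As a rough illustration only, and assuming the librosa library is used (the actual target sampling rate, target window function and first objective function of the embodiments may differ), the mel-scale and logarithmic-scale first feature of a first audio segment could be obtained as follows:

import numpy as np
import librosa

def first_feature(segment, sr=16000, n_mels=128):
    # segment: 1-D waveform of one first audio segment, resampled to the target rate sr.
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=1024, hop_length=512,
        window="hann", n_mels=n_mels)          # mel-scale first feature
    return np.log(mel + 1e-6)                  # logarithmic-scale first feature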
In one possible implementation, the emotion recognition module 502 is configured to perform: inputting the first characteristics of the plurality of first audio segments into the emotion recognition model, and performing characteristic extraction on the first characteristics of each first audio segment by the emotion recognition model to obtain the second characteristics of each first audio segment; obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment; and outputting the emotion degree values of the plurality of first audio segments.
In one possible implementation, the emotion recognition module 502 is configured to perform calculation of the first feature of each first audio segment by the convolution layer in the emotion recognition model to obtain the second feature of each first audio segment.
In one possible implementation, the emotion recognition module 502 is configured to perform: acquiring position information of each bit of feature value in the second feature; obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature; and calculating the third characteristic of each first audio segment based on the full-link layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
In one possible implementation, the emotion recognition module 502 is configured to perform: calculating the third feature of each first audio segment based on the full-link layer in the emotion recognition model to obtain a calculation result corresponding to the third feature; and calculating a calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotional degree value of each first audio segment.
In one possible implementation, the apparatus further includes a model training module configured to perform: acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file; extracting the characteristics of each sample audio file to obtain first characteristics of a plurality of first audio fragments of each sample audio file; calling an initial model, inputting first characteristics of a plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining second characteristics of the first audio segments by the initial model based on the first characteristics of the first audio segments for each first audio segment; classifying the first audio segment by the initial model based on the second characteristics of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment; obtaining, by the initial model, an emotional degree value of the first audio segment based on the second feature; for each sample audio file, outputting a classification result of the sample audio file based on the classification result and the emotion degree value of a plurality of first audio segments of the sample audio file by the initial model, wherein the classification result of the sample audio file is used for representing the emotional tendency of the sample audio file; and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion labels carried by each sample audio file until target conditions are met to obtain an emotion recognition model, wherein the model parameters comprise parameters required for acquiring the emotion degree value of the first audio fragment based on the second characteristic.
According to the apparatus provided by the embodiments of the present disclosure, the audio file is segmented, emotion analysis is performed on each first audio segment, and the emotion degree value of each first audio segment is determined, so that, according to the length of the climax segment, the part of that length with the largest sum of emotion degree values is used as the climax segment of the audio file. Repeated parts in the audio file are not simply detected; instead, the emotion of each part of the audio file is analyzed, and the part with relatively intense emotional expression is used as the climax segment, so that the accuracy of the obtained climax segment is high and the accuracy of the audio data processing method is high.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The electronic device may be provided as a terminal shown in fig. 6 described below, and may also be provided as a server shown in fig. 7 described below, which is not limited by the embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment. The terminal 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 602 is used to store at least one instruction for execution by the processor 601 to implement the audio data processing method provided by the method embodiments in the present disclosure.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, providing the front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used for positioning the current geographic Location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 can be a positioning component based on the United States' GPS (Global Positioning System), the Chinese BeiDou system, the Russian GLONASS system, or the European Union's Galileo system.
Power supply 609 is used to provide power to the various components in terminal 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 613 may be disposed on the side bezel of terminal 600 and/or underneath display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
A proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually decreases, the processor 601 controls the display 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually increases, the processor 601 controls the display 605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 7 is a schematic structural diagram of a server 700 according to an exemplary embodiment. The server 700 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 701 to implement the audio data processing method provided by the above-mentioned method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of audio data processing, the method comprising: performing feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file; calling an emotion recognition model, inputting first characteristics of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments; and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
In an exemplary embodiment, there is also provided an application program comprising one or more instructions executable by a processor of an electronic device to perform method steps of the audio data processing method provided in the above embodiments, which method steps may include: performing feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file; calling an emotion recognition model, inputting first characteristics of the first audio segments into the emotion recognition model, and outputting emotion degree values of the first audio segments; and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A method of audio data processing, comprising:
performing feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
calling an emotion recognition model, inputting the first features of the plurality of first audio segments into the emotion recognition model, and performing feature extraction on the first feature of each first audio segment by the emotion recognition model to obtain a second feature of each first audio segment;
obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment; outputting the emotion degree values of the plurality of first audio segments;
and according to the emotion degree values of the plurality of first audio segments, taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file, wherein the second audio segment is the same as the first audio segment or different from the first audio segment.
2. The audio data processing method of claim 1, wherein the step of regarding at least one continuous second audio segment of the audio file with a total length of a target length and a maximum sum of emotional degree values as the climax segment of the audio file according to the emotional degree values of the plurality of first audio segments comprises:
obtaining a plurality of candidate climax fragments of the audio file, wherein each candidate climax fragment is at least one continuous second audio fragment with the total length being the target length;
acquiring the sum of the emotion degree values of the second audio segments included by each candidate climax segment according to the emotion degree values of the first audio segments;
and taking the candidate climax fragment with the maximum sum value as the climax fragment of the audio file.
3. The audio data processing method of claim 2, wherein the obtaining a sum of emotional degree values of the second audio segments included in each candidate climax segment according to the emotional degree values of the plurality of first audio segments comprises:
determining the emotional degree value of each second audio clip in the audio file according to the emotional degree values of the plurality of first audio clips; acquiring the sum of the emotional degree values of at least one continuous second audio segment included in each candidate climax segment; or,
for each candidate climax fragment, acquiring the emotional degree value of each second audio fragment included in the candidate climax fragment according to the emotional degree values of the plurality of first audio fragments; and acquiring the sum of the emotional degree values of at least one continuous second audio segment included in the candidate climax segment.
4. The audio data processing method of claim 1, wherein the step of regarding at least one continuous second audio segment of the audio file with a total length of a target length and a maximum sum of emotional degree values as the climax segment of the audio file according to the emotional degree values of the plurality of first audio segments comprises:
outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or,
intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or,
and splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
5. The audio data processing method of claim 1, wherein the performing feature extraction on the audio file to obtain first features of a plurality of first audio segments of the audio file comprises:
segmenting the audio file to obtain a plurality of first audio segments of the audio file;
resampling each first audio segment to obtain a first feature of each first audio segment.
6. The audio data processing method of claim 5, wherein the resampling each first audio segment to obtain the first feature of each first audio segment comprises:
performing audio processing on each first audio clip according to a target sampling rate and a target window function to obtain a first feature of a Mel scale of each first audio clip;
and processing the first feature of the Mel scale of each first audio clip based on a first objective function to obtain a first feature of a logarithmic scale.
7. The audio data processing method of claim 1, wherein the performing feature extraction on the first feature of each first audio segment by the emotion recognition model to obtain the second feature of each first audio segment comprises:
and calculating the first characteristic of each first audio segment by a convolution layer in the emotion recognition model to obtain the second characteristic of each first audio segment.
8. The audio data processing method of claim 1, wherein the obtaining the emotion degree value of each first audio segment based on the second feature of each first audio segment and the position information of each bit of feature value in the second feature comprises:
acquiring position information of each bit of feature value in the second feature;
obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature;
and calculating the third characteristic of each first audio segment based on a full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
9. The audio data processing method of claim 8, wherein the calculating a third feature of each first audio segment based on a full connection layer in the emotion recognition model to obtain an emotion degree value of each first audio segment comprises:
calculating the third feature of each first audio segment based on the full connection layer in the emotion recognition model to obtain a calculation result corresponding to the third feature;
and calculating the calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotion degree value of each first audio segment.
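Again purely as an illustration (not part of the claims), claims 8 and 9 could be sketched as follows: a normalized position is attached to every feature value of the second feature as its position information, the resulting third feature passes through a fully connected layer, and a sigmoid (taken here as the "second objective function") maps the result to an emotion degree value in (0, 1). All sizes and the choice of sigmoid are assumptions.

import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 1)  # full connection layer over the third feature

    def forward(self, second_feature: torch.Tensor) -> torch.Tensor:
        # second_feature: (batch, feat_dim)
        batch, dim = second_feature.shape
        # Position information: normalized position of each feature value.
        position = torch.linspace(0.0, 1.0, dim, device=second_feature.device).expand(batch, dim)
        third_feature = torch.cat([second_feature, position], dim=1)
        score = self.fc(third_feature)           # calculation result of the FC layer
        return torch.sigmoid(score).squeeze(-1)  # emotion degree value per segment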
10. The audio data processing method of claim 1, wherein the training process of the emotion recognition model comprises:
acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file;
extracting features of each sample audio file to obtain first features of a plurality of first audio segments of each sample audio file;
calling an initial model, inputting the first features of the plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining, for each first audio segment, a second feature of the first audio segment by the initial model based on the first feature of the first audio segment;
classifying the first audio segment by the initial model based on a second feature of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment;
obtaining, by the initial model, an emotion degree value of the first audio segment based on the second feature;
for each sample audio file, outputting, by the initial model, a classification result of the sample audio file based on the classification results and emotion degree values of the plurality of first audio segments of the sample audio file, the classification result of the sample audio file being used to represent the emotional tendency of the sample audio file;
and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion label carried by each sample audio file until a target condition is met, to obtain the emotion recognition model, wherein the model parameters comprise parameters required for obtaining the emotion degree value of the first audio segment based on the second feature.
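To make the weakly supervised training of claim 10 concrete, here is an illustrative sketch only (not part of the claims) that reuses the SegmentEncoder and EmotionHead sketches above: each sample audio file carries a single file-level emotion label, per-segment classification results are pooled with the segment emotion degree values as weights, and the cross-entropy against the label updates all parameters, including those of the emotion-degree branch. The class count, optimizer and learning rate are assumptions.

import torch
import torch.nn as nn

N_CLASSES = 2                                     # e.g. "intense" vs. "calm" (assumed)
encoder, head = SegmentEncoder(), EmotionHead()   # sketches above
classifier = nn.Linear(64, N_CLASSES)             # per-segment classification result
params = list(encoder.parameters()) + list(head.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(segment_batches, labels):
    """segment_batches: one (n_segments, 1, n_mels, frames) tensor per sample audio file;
    labels: LongTensor of file-level emotion labels."""
    optimizer.zero_grad()
    file_logits = []
    for segments in segment_batches:
        second = encoder(segments)                # second features, (n_segments, 64)
        degree = head(second)                     # emotion degree value per segment
        seg_logits = classifier(second)           # per-segment classification results
        weights = degree / (degree.sum() + 1e-8)  # pool segments by emotion degree
        file_logits.append((weights.unsqueeze(1) * seg_logits).sum(dim=0))
    loss = loss_fn(torch.stack(file_logits), labels)
    loss.backward()
    optimizer.step()                              # adjust the model parameters
    return loss.item()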
11. An audio data processing apparatus, comprising:
the device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is configured to perform feature extraction on an audio file to obtain first features of a plurality of first audio segments of the audio file;
the emotion recognition module is configured to input the first features of the plurality of first audio segments into the emotion recognition model, and perform feature extraction on the first feature of each first audio segment by the emotion recognition model to obtain a second feature of each first audio segment; obtaining, by the emotion recognition model, an emotion degree value of each first audio segment based on a second feature of each first audio segment and position information of each bit of feature value in the second feature, where the position information of each bit of feature value in the second feature is used to indicate a corresponding position of the bit of feature value in the first audio segment; outputting the emotion degree values of the plurality of first audio segments;
and the segment acquisition module is configured to execute taking at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values in the audio file as a climax segment of the audio file according to the emotion degree values of the plurality of first audio segments, wherein the second audio segments are the same as or different from the first audio segments.
12. The audio data processing apparatus of claim 11, wherein the segment acquisition module is configured to perform:
obtaining a plurality of candidate climax segments of the audio file, wherein each candidate climax segment is at least one continuous first audio segment with the total length being the target length;
acquiring the sum of the emotion degree values of the second audio segments included in each candidate climax segment according to the emotion degree values of the first audio segments;
and taking the candidate climax segment with the maximum sum as the climax segment of the audio file.
13. The audio data processing apparatus of claim 11, wherein the segment acquisition module is configured to perform:
determining the emotion degree value of each second audio segment in the audio file according to the emotion degree values of the plurality of first audio segments; and acquiring the sum of the emotion degree values of the at least one continuous second audio segment included in each candidate climax segment; or,
for each candidate climax segment, acquiring the emotion degree value of each second audio segment included in the candidate climax segment according to the emotion degree values of the plurality of first audio segments; and acquiring the sum of the emotion degree values of the at least one continuous second audio segment included in the candidate climax segment.
14. The audio data processing apparatus of claim 11, wherein the segment acquisition module is configured to perform:
outputting a playing starting point and a playing ending point of at least one continuous second audio segment with the total length as a target length and the maximum sum of the emotion degree values; or,
intercepting a segment corresponding to at least one continuous second audio segment with the total length as a target length and the maximum sum of emotion degree values from the audio file as a climax segment of the audio file; or,
and splicing at least one continuous second audio segment with the total length as the target length and the maximum sum of the emotion degree values to obtain the climax segment of the audio file.
15. The audio data processing apparatus of claim 11, wherein the feature extraction module is configured to perform:
segmenting the audio file to obtain a plurality of first audio segments of the audio file;
resampling each first audio segment to obtain a first feature of each first audio segment.
16. The audio data processing device of claim 15, wherein the feature extraction module is configured to perform:
performing audio processing on each first audio segment according to a target sampling rate and a target window function to obtain a first feature of a Mel scale of each first audio segment;
and processing the first feature of the Mel scale of each first audio segment based on a first objective function to obtain a first feature of a logarithmic scale.
17. The audio data processing apparatus of claim 11, wherein the emotion recognition module is configured to perform the calculation of the first feature of each first audio segment by the convolutional layer in the emotion recognition model, resulting in the second feature of each first audio segment.
18. The audio data processing apparatus of claim 11, wherein the emotion recognition module is configured to perform:
acquiring position information of each bit of feature value in the second feature;
obtaining a third feature of each first audio segment based on the second feature of each first audio segment and the position information of each feature value in the second feature;
and calculating the third characteristic of each first audio segment based on a full connection layer in the emotion recognition model to obtain the emotion degree value of each first audio segment.
19. The audio data processing apparatus of claim 18, wherein the emotion recognition module is configured to perform:
calculating the third feature of each first audio segment based on the full connection layer in the emotion recognition model to obtain a calculation result corresponding to the third feature;
and calculating the calculation result corresponding to the third feature of each first audio segment through a second objective function to obtain the emotion degree value of each first audio segment.
20. The audio data processing apparatus of claim 11, characterized in that the apparatus further comprises a model training module configured to perform:
acquiring a plurality of sample audio files, wherein each sample audio file carries an emotion label which is used for expressing the emotion tendency of the sample audio file;
extracting features of each sample audio file to obtain first features of a plurality of first audio segments of each sample audio file;
calling an initial model, inputting the first features of the plurality of first audio segments of the plurality of sample audio files into the initial model, and obtaining, for each first audio segment, a second feature of the first audio segment by the initial model based on the first feature of the first audio segment;
classifying each first audio segment by the initial model based on the second feature of the first audio segment to obtain a classification result of the first audio segment, wherein the classification result is used for representing the emotional tendency of the first audio segment;
obtaining, by the initial model, an emotion degree value of the first audio segment based on the second feature;
for each sample audio file, outputting, by the initial model, a classification result of the sample audio file based on the classification results and emotion degree values of the plurality of first audio segments of the sample audio file, the classification result of the sample audio file being used to represent the emotional tendency of the sample audio file;
and adjusting model parameters of the initial model based on the classification results of the plurality of sample audio files and the emotion label carried by each sample audio file until a target condition is met, to obtain the emotion recognition model, wherein the model parameters comprise parameters required for obtaining the emotion degree value of the first audio segment based on the second feature.
21. An electronic device, comprising:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to execute the instructions to implement the audio data processing method of any of claims 1 to 10.
22. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio data processing method of any one of claims 1 to 10.
CN201910165235.9A 2019-03-05 2019-03-05 Audio data processing method and device, electronic equipment and storage medium Active CN109829067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910165235.9A CN109829067B (en) 2019-03-05 2019-03-05 Audio data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910165235.9A CN109829067B (en) 2019-03-05 2019-03-05 Audio data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109829067A CN109829067A (en) 2019-05-31
CN109829067B (en) 2020-12-29

Family

ID=66865401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910165235.9A Active CN109829067B (en) 2019-03-05 2019-03-05 Audio data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109829067B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782863B (en) * 2020-06-30 2024-06-14 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method, device, storage medium and electronic equipment
CN113035160B (en) * 2021-02-26 2022-08-02 成都潜在人工智能科技有限公司 Music automatic editing implementation method and device based on similarity matrix and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1258752C (en) * 2004-09-10 2006-06-07 清华大学 Popular song key segment pick-up method for music listening
CN107333071A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN108648767B (en) * 2018-04-08 2021-11-05 中国传媒大学 Popular song emotion synthesis and classification method
CN109166593B (en) * 2018-08-17 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Audio data processing method, device and storage medium
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN109829067A (en) 2019-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant