CN112466337A - Audio data emotion detection method and device, electronic equipment and storage medium - Google Patents

Audio data emotion detection method and device, electronic equipment and storage medium

Info

Publication number
CN112466337A
CN112466337A
Authority
CN
China
Prior art keywords
data
audio data
frequency domain
emotion
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011482460.4A
Other languages
Chinese (zh)
Inventor
张舒婷
赖众程
王亮
吴鹏召
李林毅
李兴辉
李会璟
李骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011482460.4A priority Critical patent/CN112466337A/en
Publication of CN112466337A publication Critical patent/CN112466337A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to voice semantic technology and discloses an audio data emotion detection method, which comprises the following steps: acquiring original audio data, and performing sound channel separation and data cutting on the original audio data to obtain standard audio data; performing silence detection on the standard audio data; when the standard audio data are detected to be non-silent data, performing frequency domain conversion on the standard audio data to obtain frequency domain data; converting the frequency domain data into Mel frequency domain data; detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an initial emotion score; and calculating the initial emotion score by using a preset weighting calculation method to obtain a final emotion score. In addition, the invention also relates to blockchain technology, and the final emotion score can be stored in a node of the blockchain. The invention also provides an audio data emotion detection device, electronic equipment and a computer-readable storage medium. The method can solve the problem of low emotion detection accuracy of audio data.

Description

Audio data emotion detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voice semantics, in particular to an audio data emotion detection method and device, electronic equipment and a computer readable storage medium.
Background
With the rapid development of the social economy, organizations in various fields pay more and more attention to customer service, and customer satisfaction and retention are increasingly emphasized. Detecting the user's emotion by analyzing audio and video data of interactions between customers and customer service staff is therefore an important index for controlling service quality. For example, a bank may have a department dedicated to handling customer consultation and complaint services, and one of its daily tasks is to listen to recordings of incoming calls to agents and check whether any complaint work orders were missed, so as to monitor the service quality of the agents. In the prior art: 1. The emotion of the user in audio data is detected by manual review; however, with tens of thousands of incoming calls from users every day, quality inspection cannot be completed by manually listening to all recordings, and only the traditional quality inspection mode of sampling and listening to a subset of recordings can be adopted, which results in low coverage, low detection efficiency and inaccurate detection. 2. Audio data is combined with image or text data, and the user's emotion is recognized by a deep learning method; however, analyzing audio data combined with image or text data occupies too many computing resources in practical use, resulting in low detection efficiency, and there are few application scenarios in which audio data is accompanied by image or text data.
Disclosure of Invention
The invention provides an audio data emotion detection method and device, electronic equipment and a computer-readable storage medium, and mainly aims to solve the problem of low emotion detection accuracy of audio data.
In order to achieve the above object, the present invention provides a method for detecting emotion of audio data, including:
acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
performing silence detection on the standard audio data;
when the standard audio data is detected to be mute data, marking the mute data;
when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
converting the frequency domain data into Mel frequency domain data, and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score.
Optionally, the performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data includes:
performing sound channel judgment on the original audio data, and extracting right sound channel audio data in the original audio data;
and performing data cutting on the right channel audio data according to a preset time length to obtain the standard audio data.
Optionally, the performing silence detection on the standard audio data includes:
reading each frame of audio data in the standard audio data frame by frame, and calculating a speech energy value and a background noise energy value of each frame of audio data;
calculating the difference value between the voice energy value and the background noise energy value, and comparing the difference value with a preset mute threshold value;
when the difference value is smaller than the mute threshold value, judging the standard audio data as mute data;
and when the difference value is larger than or equal to the mute threshold value, judging that the standard audio data is non-mute data.
Optionally, the frequency domain converting the standard audio data to obtain frequency domain data includes:
and performing frequency domain conversion on the standard audio data by using the following functions to obtain frequency domain data F (omega):
F(ω) = ∫_{-∞}^{+∞} f(t) e^(-jωt) dt
wherein f(t) is the standard audio data, and e^(-jωt) is the Fourier transform kernel.
Optionally, the converting the frequency domain data into mel frequency domain data comprises:
converting the frequency domain data by using a preset Mel frequency domain conversion formula;
and carrying out logarithm operation on the converted frequency domain data, and outputting the frequency domain data in a Mel frequency domain with a preset shape.
Optionally, before the detecting the mel frequency domain data by using the pre-trained audio detection model, the method further includes:
acquiring an original training set, and performing data enhancement on the original training set by using a preset data enhancement method to obtain a standard training set;
training a pre-constructed first network model by using the standard training set to obtain an original model;
and taking the parameters of the original model as initialization parameters of a pre-constructed second network model, and training the second network model by using the standard training set to obtain the audio detection model.
Optionally, the calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score comprises:
weighting the emotion initial score by using a preset weighting calculation method to obtain a weighted score;
and carrying out quantile value taking on the weighted scores to obtain final scores of the emotions.
In order to solve the above problem, the present invention further provides an audio data emotion detecting apparatus, including:
the audio data processing module is used for acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
the silence detection module is used for carrying out silence detection on the standard audio data; when the standard audio data is detected to be mute data, marking the mute data; when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
the frequency domain data conversion module is used for converting the frequency domain data into Mel frequency domain data and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and the emotion score calculation module is used for calculating the emotion initial score by using a preset weighting calculation method to obtain an emotion final score.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the emotion detection method of the audio data.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, where the at least one instruction is executed by a processor in an electronic device to implement the above-mentioned emotion detection method for audio data.
The invention performs sound channel separation and data cutting on the original audio data to obtain standard audio data, so that the data processing amount is reduced, and the audio data detection efficiency is improved. And by carrying out silence detection on the standard audio data, the influence of the silence data on the audio detection can be reduced, so that the audio data detection accuracy is higher. Meanwhile, the frequency domain data are converted into Mel frequency domain data, and the Mel frequency domain data are detected by the audio detection model, so that the detection accuracy is higher as the Mel frequency domain is more in line with the auditory characteristics of human ears. Therefore, the method and the device for detecting the emotion of the audio data, the electronic equipment and the computer-readable storage medium can solve the problem of low emotion detection accuracy of the audio data.
Drawings
Fig. 1 is a schematic flowchart of an emotion detection method for audio data according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing another step of FIG. 1;
FIG. 4 is a schematic flow chart showing another step of FIG. 1;
FIG. 5 is a schematic diagram of an original training set;
FIG. 6 is a schematic flow chart showing another step of FIG. 1;
fig. 7 is a functional block diagram of an emotion detection apparatus for audio data according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing the emotion detection method for audio data according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an audio data emotion detection method. The execution subject of the audio data emotion detection method includes, but is not limited to, at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiment of the present application. In other words, the audio data emotion detection method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a block chain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Referring to fig. 1, a schematic flow chart of an audio data emotion detection method according to an embodiment of the present invention is shown. In this embodiment, the method for emotion detection of audio data includes:
and S1, acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data.
In the embodiment of the present invention, the original audio data may be communication recording data of a client and a customer service worker in each field, including: consult recording data, complaint recording data, and the like. For example, in the banking field, a department is specially responsible for customer consultation complaint services, and quality detection is performed on banking services by checking consultation complaint records of users.
Preferably, referring to fig. 2, the performing the channel separation and the data segmentation on the original audio data to obtain standard audio data includes:
s10, performing sound channel judgment on the original audio data, and extracting right-channel audio data in the original audio data;
and S11, performing data cutting on the right channel audio data according to a preset time length to obtain the standard audio data.
In the embodiment of the present invention, since the original audio data is a telephone recording involving only a two-party call, the audio is two-channel audio in which the left channel is the agent audio and the right channel is the customer audio, so it is only necessary to extract the right channel audio data directly. The preset time length may be 1 s, and the right channel audio data is cut into a plurality of segments in units of 1 s; if the last segment is shorter than 1 s, it is padded with the average value of that segment until its length reaches 1 s.
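By way of illustration only, the channel separation and one-second segmentation described above might be sketched in Python as follows; the use of the soundfile library and the function name are assumptions not stated in the patent, while the right-channel extraction and mean-value padding of the last short segment follow the text.

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any reader returning a (samples, channels) array works


def split_right_channel(path, segment_seconds=1.0):
    """Extract the right (customer) channel and cut it into 1 s segments (sketch, not the patent's code)."""
    audio, sr = sf.read(path)                            # stereo call recording: (samples, 2)
    right = audio[:, 1] if audio.ndim == 2 else audio    # right channel = customer audio
    seg_len = int(segment_seconds * sr)
    segments = []
    for start in range(0, len(right), seg_len):
        piece = right[start:start + seg_len]
        if len(piece) < seg_len:                         # last piece shorter than 1 s:
            pad = np.full(seg_len - len(piece), piece.mean())  # fill with the piece's average value
            piece = np.concatenate([piece, pad])
        segments.append(piece)
    return segments, sr
```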
Furthermore, the invention reduces the data processing amount and improves the detection efficiency of the audio data by carrying out sound channel separation and data cutting on the original audio data.
And S2, carrying out silence detection on the standard audio data.
Preferably, referring to fig. 3, the S2 includes:
s20, reading each frame of audio data in the standard audio data frame by frame, and calculating the speech energy value and the background noise energy value of each frame of audio data;
s21, calculating the difference value between the voice energy value and the background noise energy value, and comparing the difference value with a preset mute threshold value;
when the difference is smaller than the mute threshold, executing S22, and judging the standard audio data to be mute data;
and when the difference is greater than or equal to the mute threshold, executing S23 to determine that the standard audio data is non-mute data.
In the embodiment of the invention, the voice energy value of each frame of audio data is calculated by the following formula:
E_n = Σ_m [x(m) · w(m)]²
wherein E_n is the speech energy value of the n-th frame of audio data, equal to the sum of the squares of all speech signals in the frame, x(m) is the audio data of the frame, and w(m) is the window function of the frame.
In the embodiment of the present invention, the speech energy value of the first 10 frames can be selected as the background noise energy value. The preset mute threshold may be 0.5.
Furthermore, in the embodiment of the present invention, the silence detection is performed on the standard audio data by calculating the voice energy value of the standard audio data, so that the accuracy of the silence detection is higher.
When detecting that the standard audio data is mute data, S3 is executed to mark the mute data.
In the embodiment of the present invention, if the difference value is smaller than the mute threshold, the data is mute data, it is marked as -2 and none of the following steps are performed; if the difference value is greater than or equal to the mute threshold, the data is non-mute data.
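A hedged sketch of this frame-energy silence check is given below; the use of the first 10 frames as the noise floor and the 0.5 threshold follow the text, while the 400-sample frame length, the Hamming window and the comparison of mean energies are assumptions.

```python
import numpy as np


def is_silent(segment, frame_len=400, mute_threshold=0.5, noise_frames=10):
    """Return True if a standard audio segment should be marked as mute data (-2)."""
    window = np.hamming(frame_len)                       # assumed window function w(m)
    n_frames = len(segment) // frame_len
    energies = np.array([
        np.sum((segment[i * frame_len:(i + 1) * frame_len] * window) ** 2)  # E_n per frame
        for i in range(n_frames)
    ])
    noise_energy = energies[:noise_frames].mean()        # speech energy of the first 10 frames as noise
    diff = energies.mean() - noise_energy                # speech energy minus background noise energy
    return diff < mute_threshold                         # below the preset threshold -> mute data
```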
When detecting that the standard audio data is non-mute data, S4 is executed to perform frequency domain conversion on the standard audio data to obtain frequency domain data.
In detail, the frequency domain converting the standard audio data to obtain frequency domain data includes:
and performing frequency domain conversion on the standard audio data by using the following functions to obtain frequency domain data F (omega):
F(ω) = ∫_{-∞}^{+∞} f(t) e^(-jωt) dt
wherein f(t) is the standard audio data, and e^(-jωt) is the Fourier transform kernel.
Further, in the embodiment of the present invention, the standard audio data divided by time is time domain data, which expresses how the audio changes over time. Although time domain data can visually display the audio signal, it cannot describe the audio with a limited number of parameters and cannot be learned by the model directly; after the time domain data is converted into the frequency domain, the complex time domain signal can be decomposed into a superposition of signals at different frequencies, which facilitates analysis of the audio data.
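For illustration, the frequency domain conversion of a one-second segment can be approximated with a discrete Fourier transform; the NumPy call below is a practical stand-in for the continuous transform above, and the 16 kHz sample rate is an assumption.

```python
import numpy as np


def to_frequency_domain(segment, sr=16000):
    """Decompose a time-domain segment into frequency components (discrete approximation of F(ω))."""
    spectrum = np.fft.rfft(segment)                      # complex frequency domain data
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)    # corresponding frequencies in Hz
    return freqs, np.abs(spectrum)                       # magnitude spectrum for the later Mel mapping
```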
And S5, converting the frequency domain data into Mel frequency domain data, and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score.
Preferably, referring to fig. 4, the converting the frequency domain data into mel frequency domain data includes:
s50, converting the frequency domain data by using a preset Mel frequency domain conversion formula;
and S51, carrying out logarithm operation on the converted frequency domain data, and outputting Mel frequency domain data in a preset shape.
In an embodiment of the present invention, the preset mel frequency domain conversion formula may be:
f_mel = 2595 · log₁₀(1 + f / 700)
wherein f_mel is the converted Mel frequency domain data and f is the frequency domain data. The preset shape may be a 64 × 64 matrix.
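The Mel conversion and logarithm operation might look like the librosa-based sketch below; the 16 kHz sample rate, FFT size and hop length are assumptions chosen only so that one second of audio yields roughly the 64 × 64 log-Mel matrix mentioned above, and librosa's built-in Mel filter bank stands in for the conversion formula.

```python
import numpy as np
import librosa


def to_log_mel(segment, sr=16000, n_mels=64, n_frames=64):
    """Convert a 1 s segment into a 64 x 64 log-Mel matrix (shape per the patent, parameters assumed)."""
    mel = librosa.feature.melspectrogram(
        y=segment.astype(np.float32), sr=sr,
        n_fft=1024, hop_length=sr // n_frames, n_mels=n_mels)  # Mel-scale spectrogram
    log_mel = librosa.power_to_db(mel)           # logarithm operation on the converted data
    return log_mel[:, :n_frames]                 # keep the first 64 frames -> preset 64 x 64 shape
```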
In detail, before the detecting the mel frequency domain data by using the pre-trained audio detection model, the method further includes:
acquiring an original training set, and performing data enhancement on the original training set by using a preset data enhancement method to obtain a standard training set;
training a pre-constructed first network model by using the standard training set to obtain an original model;
and taking the parameters of the original model as initialization parameters of a pre-constructed second network model, and training the second network model by using the standard training set to obtain the audio detection model.
The original training set may be historical labeled data, labeled with an emotional arousal value per second. The value range of the emotional arousal value is [0, 1], where values closer to 0 represent more negative emotion and values closer to 1 represent more aroused emotion; a special value of -2 represents silence. The original training set may be as shown in fig. 5, and it includes the audio id and the per-second emotion value of each piece of audio data.
Because the emotional arousal of most people falls in the calm interval [0.5, 0.7] most of the time, the amount of data in the high-arousal and low-arousal parts needs to be enhanced. In the embodiment of the invention, a Mixup data enhancement method can be used to perform data enhancement on the original training set to obtain an enhanced data set, and the original training set and the enhanced data set are combined to obtain the standard training set. In the embodiment of the invention, the standard training set is used as the training set, and the original training set is used as the validation set.
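As an illustration of the Mixup enhancement mentioned above, one possible per-pair mixing step is sketched below; the Beta-distribution parameter and the choice to mix whole log-Mel patches together with their per-second arousal labels are assumptions.

```python
import numpy as np


def mixup(x1, y1, x2, y2, beta_alpha=0.4):
    """Blend two training examples (log-Mel patch, arousal label) into a new synthetic example."""
    lam = np.random.beta(beta_alpha, beta_alpha)   # mixing coefficient sampled in (0, 1)
    x = lam * x1 + (1 - lam) * x2                  # mixed 64 x 64 feature
    y = lam * y1 + (1 - lam) * y2                  # mixed emotional arousal value
    return x, y
```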
In the embodiment of the present invention, the first network model may be a ResNet50 network, and the second network model may be an improved ResNet50 network, where the improvement is as follows: the first layer and the last layer of the ResNet50 network are removed, a batch normalization (BatchNorm) layer, a convolutional layer (with a relu activation function) and an average pooling layer are added before the ResNet50 network, and a fully connected layer (with a relu activation function), a batch normalization (BatchNorm) layer and a final fully connected layer are added after the ResNet50 network. Meanwhile, since the ResNet50 network easily overfits during training, which lowers the detection accuracy of the trained model, in the embodiment of the invention an early stopping method can be used to stop training before the model starts to overfit. The early stopping method stops training if the validation-set loss does not decrease along with the training-set loss within a preset range of training rounds. In the embodiment of the present invention, the MSE loss function may be used to calculate the loss.
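The description of the improved ResNet50 could be realised, for example, as the PyTorch sketch below; the kernel size, the 256-unit hidden layer, the single-channel 64 × 64 input and the use of torchvision's resnet50 are assumptions layered on top of the wording above (drop the first and last ResNet50 layers, prepend BatchNorm/convolution/average pooling, append fully connected/BatchNorm/fully connected).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class EmotionNet(nn.Module):
    """Sketch of the second network model; layer sizes are assumptions, the structure follows the text."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)                 # torchvision >= 0.13; older versions use pretrained=False
        trunk = list(backbone.children())[1:-1]           # drop the first conv layer and the last FC layer
        self.head = nn.Sequential(                        # layers added before the ResNet50 trunk
            nn.BatchNorm2d(1),
            nn.Conv2d(1, 64, kernel_size=3, padding=1),   # 64 output channels to match ResNet50's bn1
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2),
        )
        self.trunk = nn.Sequential(*trunk)
        self.tail = nn.Sequential(                        # layers added after the ResNet50 trunk
            nn.Linear(2048, 256),                         # 256 is an assumed hidden size
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 1),                            # per-second emotion (arousal) score
        )

    def forward(self, x):                                 # x: (batch, 1, 64, 64) log-Mel patches
        x = self.head(x)
        x = self.trunk(x)
        x = torch.flatten(x, 1)
        return self.tail(x)
```

Training would then minimise the MSE loss between the predicted and labelled per-second arousal values, with early stopping on the validation loss as described above.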
Furthermore, the frequency domain data are converted into Mel frequency domain data, and the Mel frequency domain data are detected by the audio detection model, so that the detection accuracy is higher as the Mel frequency domain is more in line with the auditory characteristics of human ears.
And S6, calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score.
In the embodiment of the invention, the non-silent data is weighted according to the position of the current moment within the total duration; meanwhile, since the customer's emotion near the end of the call should be emphasized more, positions closer to the end receive higher weights.
In detail, referring to fig. 6, the S6 includes:
s60, weighting the emotion initial score by using a preset weighting calculation method to obtain a weighted score;
and S61, carrying out quantile value taking on the weighted scores to obtain final scores of the emotions.
In an embodiment of the present invention, the preset weighting calculation method may be:
coefficient_value = original_value × (α + (1 - α) × sec / (voice_len × 0.9))
wherein coefficient_value is the weighted score, original_value is the initial score, sec is the current second (the position of the current moment within the audio), and voice_len is the total duration of the whole audio in seconds; in the embodiment of the present invention, α may be 0.7.
In the embodiment of the present invention, since later values of the weighted scores are more important, the 90th percentile of the weighted scores may be taken as the final score of the entire audio; it is set to 0 if the final score is less than 0 and set to 1 if the final score is greater than 1. The 90th percentile is the value below which 90% of the weighted scores fall. Meanwhile, if there is no non-mute data, the final score of the whole audio is directly output as -2, which indicates that the whole audio is silent.
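Combining the weighting formula with the 90th-percentile rule, a hedged end-to-end sketch is shown below; the handling of the -2 silence marker, α = 0.7 and the clipping to [0, 1] follow the text, while the function name and the NumPy percentile call are illustrative choices.

```python
import numpy as np


def final_emotion_score(per_second_scores, alpha=0.7):
    """Combine per-second initial scores into the final emotion score for one audio file."""
    scores = np.asarray(per_second_scores, dtype=float)
    voice_len = scores.size                            # total duration of the whole audio, in seconds
    secs = np.arange(1, voice_len + 1)                 # 1-based index of the current second
    keep = scores != -2                                # drop seconds marked as silence
    if not keep.any():
        return -2.0                                    # the whole audio is silent
    weights = alpha + (1 - alpha) * secs / (voice_len * 0.9)
    weighted = scores[keep] * weights[keep]            # seconds closer to the end get larger weights
    final = np.percentile(weighted, 90)                # 90th percentile of the weighted scores
    return float(np.clip(final, 0.0, 1.0))             # clamp the final score to [0, 1]
```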
Furthermore, the initial score is weighted by using a preset score calculation method, and the accuracy of audio data detection is improved by increasing the weight of important part data.
The invention performs sound channel separation and data cutting on the original audio data to obtain standard audio data, so that the data processing amount is reduced, and the audio data detection efficiency is improved. And by carrying out silence detection on the standard audio data, the influence of the silence data on the audio detection can be reduced, so that the audio data detection accuracy is higher. Meanwhile, the frequency domain data are converted into Mel frequency domain data, and the Mel frequency domain data are detected by the audio detection model, so that the detection accuracy is higher as the Mel frequency domain is more in line with the auditory characteristics of human ears. Therefore, the method and the device can solve the problem of low emotion detection accuracy of the audio data.
Fig. 7 is a functional block diagram of an emotion detection apparatus for audio data according to an embodiment of the present invention.
The emotion detection apparatus 100 for audio data according to the present invention may be installed in an electronic device. According to the implemented functions, the audio data emotion detection apparatus 100 may include an audio data processing module 101, a silence detection module 102, a frequency domain data conversion module 103, and an emotion score calculation module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the audio data processing module 101 is configured to acquire original audio data, perform channel separation and data cutting processing on the original audio data, and obtain standard audio data.
In the embodiment of the present invention, the original audio data may be communication recording data of a client and a customer service worker in each field, including: consult recording data, complaint recording data, and the like. For example, in the banking field, a department is specially responsible for customer consultation complaint services, and quality detection is performed on banking services by checking consultation complaint records of users.
Preferably, the audio data processing module 101 obtains standard audio data by:
performing sound channel judgment on the original audio data, and extracting right sound channel audio data in the original audio data;
and performing data cutting on the right channel audio data according to a preset time length to obtain the standard audio data.
In the embodiment of the present invention, since the original audio data is a telephone recording involving only a two-party call, the audio is two-channel audio in which the left channel is the agent audio and the right channel is the customer audio, so it is only necessary to extract the right channel audio data directly. The preset time length may be 1 s, and the right channel audio data is cut into a plurality of segments in units of 1 s; if the last segment is shorter than 1 s, it is padded with the average value of that segment until its length reaches 1 s.
Furthermore, the invention reduces the data processing amount and improves the detection efficiency of the audio data by carrying out sound channel separation and data cutting on the original audio data.
The silence detection module 102 is configured to perform silence detection on the standard audio data; when the standard audio data is detected to be mute data, marking the mute data; and when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data.
Preferably, the silence detection module 102 performs silence detection on the standard audio data by:
reading each frame of audio data in the standard audio data frame by frame, and calculating a speech energy value and a background noise energy value of each frame of audio data;
calculating the difference value between the voice energy value and the background noise energy value, and comparing the difference value with a preset mute threshold value;
when the difference value is smaller than the mute threshold value, judging the standard audio data as mute data;
and when the difference value is larger than or equal to the mute threshold value, judging that the standard audio data is non-mute data.
In the embodiment of the invention, the voice energy value of each frame of audio data is calculated by the following formula:
E_n = Σ_m [x(m) · w(m)]²
wherein E_n is the speech energy value of the n-th frame of audio data, equal to the sum of the squares of all speech signals in the frame, x(m) is the audio data of the frame, and w(m) is the window function of the frame.
In the embodiment of the present invention, the speech energy value of the first 10 frames can be selected as the background noise energy value. The preset mute threshold may be 0.5; if the difference value is smaller than the mute threshold, the data is mute data, it is marked as -2 and none of the following steps are performed, and if the difference value is greater than or equal to the mute threshold, the data is non-mute data.
Furthermore, in the embodiment of the present invention, the silence detection is performed on the standard audio data by calculating the voice energy value of the standard audio data, so that the accuracy of the silence detection is higher.
And when the standard audio data is detected to be mute data, marking the mute data.
In the embodiment of the present invention, if the difference value is smaller than the mute threshold, the data is mute data, it is marked as -2 and none of the following steps are performed; if the difference value is greater than or equal to the mute threshold, the data is non-mute data.
And when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data.
In detail, the silence detection module 102 obtains frequency domain data by:
and performing frequency domain conversion on the standard audio data by using the following functions to obtain frequency domain data F (omega):
F(ω) = ∫_{-∞}^{+∞} f(t) e^(-jωt) dt
wherein f(t) is the standard audio data, and e^(-jωt) is the Fourier transform kernel.
Further, in the embodiment of the present invention, the standard audio data divided by time is time domain data, which expresses how the audio changes over time. Although time domain data can visually display the audio signal, it cannot describe the audio with a limited number of parameters and cannot be learned by the model directly; after the time domain data is converted into the frequency domain, the complex time domain signal can be decomposed into a superposition of signals at different frequencies, which facilitates analysis of the audio data.
The frequency domain data conversion module 103 is configured to convert the frequency domain data into mel frequency domain data, and detect the mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score.
Preferably, the frequency domain data conversion module 103 converts the frequency domain data into mel frequency domain data by:
converting the frequency domain data by using a preset Mel frequency domain conversion formula;
and carrying out logarithm operation on the converted frequency domain data, and outputting the frequency domain data in a Mel frequency domain with a preset shape.
In an embodiment of the present invention, the preset mel frequency domain conversion formula may be:
f_mel = 2595 · log₁₀(1 + f / 700)
wherein f_mel is the converted Mel frequency domain data and f is the frequency domain data. The preset shape may be a 64 × 64 matrix.
In detail, the frequency domain data conversion module 103 further includes:
acquiring an original training set, and performing data enhancement on the original training set by using a preset data enhancement method to obtain a standard training set;
training a pre-constructed first network model by using the standard training set to obtain an original model;
and taking the parameters of the original model as initialization parameters of a pre-constructed second network model, and training the second network model by using the standard training set to obtain the audio detection model.
The original training set may be historical labeled data, labeled with an emotional arousal value per second. The value range of the emotional arousal value is [0, 1], where values closer to 0 represent more negative emotion and values closer to 1 represent more aroused emotion; a special value of -2 represents silence.
Because the emotional arousal of most people falls in the calm interval [0.5, 0.7] most of the time, the amount of data in the high-arousal and low-arousal parts needs to be enhanced. In the embodiment of the invention, a Mixup data enhancement method can be used to perform data enhancement on the original training set to obtain an enhanced data set, and the original training set and the enhanced data set are combined to obtain the standard training set. In the embodiment of the invention, the standard training set is used as the training set, and the original training set is used as the validation set.
In the embodiment of the present invention, the first network model may be a ResNet50 network, and the second network model may be an improved ResNet50 network, where the improvement is as follows: the first layer and the last layer of the ResNet50 network are removed, a batch normalization (BatchNorm) layer, a convolutional layer (with a relu activation function) and an average pooling layer are added before the ResNet50 network, and a fully connected layer (with a relu activation function), a batch normalization (BatchNorm) layer and a final fully connected layer are added after the ResNet50 network. Meanwhile, since the ResNet50 network easily overfits during training, which lowers the detection accuracy of the trained model, in the embodiment of the invention an early stopping method can be used to stop training before the model starts to overfit. The early stopping method stops training if the validation-set loss does not decrease along with the training-set loss within a preset range of training rounds. In the embodiment of the present invention, the MSE loss function may be used to calculate the loss.
Furthermore, the frequency domain data are converted into Mel frequency domain data, and the Mel frequency domain data are detected by the audio detection model, so that the detection accuracy is higher as the Mel frequency domain is more in line with the auditory characteristics of human ears.
The emotion score calculation module 104 is configured to calculate the initial emotion score by using a preset weighting calculation method to obtain a final emotion score.
In the embodiment of the invention, the non-silent data is weighted according to the position of the current moment within the total duration; meanwhile, since the customer's emotion near the end of the call should be emphasized more, positions closer to the end receive higher weights.
In detail, the emotion score calculation module 104 obtains an emotion final score by:
weighting the emotion initial score by using a preset weighting calculation method to obtain a weighted score;
and carrying out quantile value taking on the weighted scores to obtain final scores of the emotions.
In an embodiment of the present invention, the preset weighting calculation method may be:
coefficient_value = original_value × (α + (1 - α) × sec / (voice_len × 0.9))
wherein coefficient_value is the weighted score, original_value is the initial score, sec is the current second (the position of the current moment within the audio), and voice_len is the total duration of the whole audio in seconds; in the embodiment of the present invention, α may be 0.7.
In the embodiment of the present invention, since later values of the weighted scores are more important, the 90th percentile of the weighted scores may be taken as the final score of the entire audio; it is set to 0 if the final score is less than 0 and set to 1 if the final score is greater than 1. The 90th percentile is the value below which 90% of the weighted scores fall. Meanwhile, if there is no non-mute data, the final score of the whole audio is directly output as -2, which indicates that the whole audio is silent.
Furthermore, the initial score is weighted by using a preset score calculation method, and the accuracy of audio data detection is improved by increasing the weight of important part data.
Fig. 8 is a schematic structural diagram of an electronic device implementing an emotion detection method for audio data according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as an audio data emotion detection program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the emotion detection program 12 for audio data, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., an audio data emotion detection program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 8 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 8 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores an audio data emotion detection program 12 which is a combination of instructions that, when executed in the processor 10, may implement:
acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
performing silence detection on the standard audio data;
when the standard audio data is detected to be mute data, marking the mute data;
when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
converting the frequency domain data into Mel frequency domain data, and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 6, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
performing silence detection on the standard audio data;
when the standard audio data is detected to be mute data, marking the mute data;
when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
converting the frequency domain data into Mel frequency domain data, and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for emotion detection of audio data, the method comprising:
acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
performing silence detection on the standard audio data;
when the standard audio data is detected to be mute data, marking the mute data;
when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
converting the frequency domain data into Mel frequency domain data, and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and calculating the emotion initial score by using a preset weighted calculation method to obtain an emotion final score.
2. The method for emotion detection of audio data according to claim 1, wherein said performing channel separation and data segmentation on the original audio data to obtain standard audio data comprises:
performing sound channel judgment on the original audio data, and extracting right sound channel audio data in the original audio data;
and performing data cutting on the right channel audio data according to a preset time length to obtain the standard audio data.
3. The method for emotion detection of audio data as recited in claim 1, wherein said silence detecting said standard audio data includes:
reading each frame of audio data in the standard audio data frame by frame, and calculating a speech energy value and a background noise energy value of each frame of audio data;
calculating the difference value between the voice energy value and the background noise energy value, and comparing the difference value with a preset mute threshold value;
when the difference value is smaller than the mute threshold value, judging the standard audio data as mute data;
and when the difference value is larger than or equal to the mute threshold value, judging that the standard audio data is non-mute data.
4. The method for emotion detection of audio data as recited in claim 1, wherein the frequency-domain converting the standard audio data to obtain frequency-domain data comprises:
and performing frequency domain conversion on the standard audio data by using the following functions to obtain frequency domain data F (omega):
F(ω) = ∫_{-∞}^{+∞} f(t) e^(-jωt) dt
wherein f(t) is the standard audio data, and e^(-jωt) is the Fourier transform kernel.
5. The method for emotion detection of audio data as recited in claim 1, wherein said converting the frequency domain data into mel frequency domain data comprises:
converting the frequency domain data by using a preset Mel frequency domain conversion formula;
and carrying out logarithm operation on the converted frequency domain data, and outputting Mel frequency domain data in a preset shape.
6. The method for emotion detection of audio data as recited in any of claims 1 to 5, wherein, before the detection of the Mel frequency domain data by using the pre-trained audio detection model, further comprising:
acquiring an original training set, and performing data enhancement on the original training set by using a preset data enhancement method to obtain a standard training set;
training a pre-constructed first network model by using the standard training set to obtain an original model;
and taking the parameters of the original model as initialization parameters of a pre-constructed second network model, and training the second network model by using the standard training set to obtain the audio detection model.
7. The audio data emotion detection method of any one of claims 1 through 5, wherein the calculating the emotion initial score by using a preset weighting calculation method to obtain an emotion final score comprises:
weighting the emotion initial score by using a preset weighting calculation method to obtain a weighted score;
and carrying out quantile value taking on the weighted scores to obtain final scores of the emotions.
8. An audio data emotion detection apparatus, characterized in that the apparatus comprises:
the audio data processing module is used for acquiring original audio data, and performing sound channel separation and data cutting processing on the original audio data to obtain standard audio data;
the silence detection module is used for carrying out silence detection on the standard audio data; when the standard audio data is detected to be mute data, marking the mute data; when the standard audio data is detected to be non-mute data, performing frequency domain conversion on the standard audio data to obtain frequency domain data;
the frequency domain data conversion module is used for converting the frequency domain data into Mel frequency domain data and detecting the Mel frequency domain data by using a pre-trained audio detection model to obtain an emotion initial score;
and the emotion score calculation module is used for calculating the emotion initial score by using a preset weighting calculation method to obtain an emotion final score.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of emotion detection of audio data as claimed in any of claims 1 to 7.
10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the audio data emotion detection method as claimed in any one of claims 1 to 7.
CN202011482460.4A 2020-12-15 2020-12-15 Audio data emotion detection method and device, electronic equipment and storage medium Pending CN112466337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011482460.4A CN112466337A (en) 2020-12-15 2020-12-15 Audio data emotion detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112466337A (en) 2021-03-09

Family

ID=74802962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011482460.4A Pending CN112466337A (en) 2020-12-15 2020-12-15 Audio data emotion detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112466337A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1773605A (en) * 2004-11-12 2006-05-17 中国科学院声学研究所 Sound end detecting method for sound identifying system
WO2007017853A1 (en) * 2005-08-08 2007-02-15 Nice Systems Ltd. Apparatus and methods for the detection of emotions in audio interactions
US20180082679A1 (en) * 2016-09-18 2018-03-22 Newvoicemedia, Ltd. Optimal human-machine conversations using emotion-enhanced natural speech using hierarchical neural networks and reinforcement learning
CN108346436A (en) * 2017-08-22 2018-07-31 腾讯科技(深圳)有限公司 Speech emotional detection method, device, computer equipment and storage medium
CN111696559A (en) * 2019-03-15 2020-09-22 微软技术许可有限责任公司 Providing emotion management assistance
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN112017632A (en) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 Automatic conference record generation method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345468A (en) * 2021-05-25 2021-09-03 平安银行股份有限公司 Voice quality inspection method, device, equipment and storage medium
CN113345468B (en) * 2021-05-25 2024-06-28 平安银行股份有限公司 Voice quality inspection method, device, equipment and storage medium
CN113808577A (en) * 2021-09-18 2021-12-17 平安银行股份有限公司 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110619568A (en) Risk assessment report generation method, device, equipment and storage medium
US20190034703A1 (en) Attack sample generating method and apparatus, device and storage medium
CN110874716A (en) Interview evaluation method and device, electronic equipment and storage medium
CN112560453A (en) Voice information verification method and device, electronic equipment and medium
CN112527994A (en) Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
CN112992187B (en) Context-based voice emotion detection method, device, equipment and storage medium
CN113903363B (en) Violation behavior detection method, device, equipment and medium based on artificial intelligence
CN112466337A (en) Audio data emotion detection method and device, electronic equipment and storage medium
CN111835926A (en) Intelligent voice outbound method, device, equipment and medium based on voice interaction
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN111222837A (en) Intelligent interviewing method, system, equipment and computer storage medium
CN113807103A (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113707173A (en) Voice separation method, device and equipment based on audio segmentation and storage medium
CN113077821A (en) Audio quality detection method and device, electronic equipment and storage medium
WO2021208700A1 (en) Method and apparatus for speech data selection, electronic device, and storage medium
CN113808616A (en) Voice compliance detection method, device, equipment and storage medium
CN111552832A (en) Risk user identification method and device based on voiceprint features and associated map data
CN116450797A (en) Emotion classification method, device, equipment and medium based on multi-modal dialogue
CN113241095B (en) Conversation emotion real-time recognition method and device, computer equipment and storage medium
CN114186028A (en) Consult complaint work order processing method, device, equipment and storage medium
CN114661942A (en) Method and device for processing streaming tone data, electronic equipment and computer readable medium
CN113704430A (en) Intelligent auxiliary receiving method and device, electronic equipment and storage medium
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN113221990A (en) Information input method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210309
