CN112634938A - Audio-based personnel positivity analysis method, device, equipment and storage medium - Google Patents
- Publication number
- CN112634938A (application CN202011508395.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- segment
- audio data
- target
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention relates to voice processing technology, and discloses an audio-based personnel enthusiasm analysis method, which comprises the following steps: performing audio enhancement processing and voice segment cutting on acquired audio data to obtain a voice segment set; extracting the segment features of each voice segment in the voice segment set; acquiring standard voice features of a target person, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features according to the result of the matching analysis; performing voice analysis on the target voice segments to obtain the voice duration, voice volume and voice speed of the target voice segments; and calculating the enthusiasm of the target person from the analysis results. In addition, the invention also relates to blockchain technology, and the standard voice features may be stored in nodes of a blockchain. The invention further provides an audio-based personnel enthusiasm analysis apparatus, device and medium. The invention can solve the problem of low accuracy when judging personnel enthusiasm by audio analysis.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to an audio-based personnel enthusiasm analysis method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Many companies hold a morning meeting every morning, at which participants discuss work targets, difficulties encountered at work and the like, so as to improve team cohesion. In order to judge the enthusiasm of participants in the meeting, companies collect sound data by installing microphones in the conference room and analyze the sound data to judge the enthusiasm of the participants.
Most existing methods for judging the enthusiasm of participants from audio data rely on a preset speaking duration: the speaking duration of a participant is compared with the preset duration, and the participant's enthusiasm is judged from the speaking duration alone. However, such methods consider only the speaking duration and ignore other factors such as the speech speed and volume of the participant while speaking, so the accuracy of judging personnel enthusiasm by audio analysis is not high.
Disclosure of Invention
The invention provides an audio-based personnel enthusiasm analysis method and apparatus and a computer-readable storage medium, and mainly aims to solve the problem of low accuracy when judging personnel enthusiasm by audio analysis.
In order to achieve the above object, the present invention provides a method for analyzing human motivation based on audio, comprising:
acquiring audio data, and performing audio enhancement processing on the audio data to obtain enhanced audio data;
carrying out voice section cutting on the enhanced audio data to obtain a voice section set;
carrying out feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set;
acquiring standard voice features of target personnel, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis;
carrying out voice analysis on the target voice section to obtain the voice time length, the voice volume and the voice speed of the target voice section;
and calculating the enthusiasm of the target person according to the voice time length, the voice volume and the voice speed.
Optionally, the performing audio enhancement processing on the audio data to obtain enhanced audio data includes:
carrying out noise reduction processing on the audio data to obtain noise reduction audio data;
and carrying out audio emphasis processing on the noise reduction audio data to obtain enhanced audio data.
Optionally, the performing speech segment segmentation on the enhanced audio data to obtain a speech segment set includes:
continuously detecting the voice intensity of the enhanced audio data;
when the voice intensity is smaller than a preset decibel threshold value, determining that the corresponding portion of the enhanced audio data is a non-vocal segment;
when the voice intensity is larger than or equal to the decibel threshold value, determining that the corresponding portion of the enhanced audio data is a vocal segment;
and deleting the non-vocal segments from the enhanced audio data to obtain a voice segment set.
Optionally, the performing feature extraction on the speech segment set to obtain segment features of each speech segment in the speech segment set includes:
carrying out convolution processing on each voice section in the voice section set to obtain a convolution voice section set;
performing global maximum pooling processing on the convolution voice segment set to obtain a pooled voice segment set;
performing full-connection processing on the pooled speech segment set by using a first full-connection layer to obtain a full-connection speech segment set;
and utilizing a second full-connection layer to perform full-connection processing on the full-connection voice section set to obtain the segment characteristics of each voice section in the voice section set.
Optionally, the performing matching analysis on the standard speech feature and the segment feature, and selecting a target speech segment corresponding to the standard speech feature in the speech segment set according to a result of the matching analysis includes:
performing vector transformation on the standard voice features of the target person to obtain a first voice feature vector;
performing vector conversion on the segment features to obtain second voice feature vectors corresponding to the segment features of each voice segment;
calculating distance values of the first voice feature vector and a second voice feature vector corresponding to the segment feature of each voice segment;
screening the segment characteristics corresponding to the second voice characteristic vectors with the distance values smaller than the preset threshold value, and determining the voice sections corresponding to the screened segment characteristics as the target voice sections.
Optionally, the vector conversion of the standard speech feature of the target person to obtain a first speech feature vector includes:
obtaining a byte vector set corresponding to each word in the standard voice characteristics, wherein the byte vector set comprises byte vectors of each byte in the standard voice characteristics;
respectively encoding each byte in the standard voice feature according to the byte vector in the byte vector set to obtain an encoded byte set;
and splicing the coding bytes in the coding byte set to obtain the first voice characteristic vector.
Optionally, the performing voice analysis on the target voice segment to obtain the voice duration, the voice volume and the voice speed of the target voice segment includes:
detecting the voice time length of the target voice section;
continuously detecting the voice intensity of the target voice section, and calculating the voice volume of the target person according to the voice duration and the voice intensity;
carrying out voice recognition on the target voice section, and counting the number of voice words of the user in a voice recognition result;
and calculating the voice speed of the target person according to the voice time length and the voice word number.
In order to solve the above problems, the present invention also provides an audio-based human motivation analysis apparatus, including:
the audio enhancement module is used for acquiring audio data and performing audio enhancement processing on the audio data to obtain enhanced audio data;
the voice cutting module is used for cutting voice sections of the enhanced audio data to obtain a voice section set;
the characteristic extraction module is used for extracting the characteristics of the voice segment set to obtain the segment characteristics of each voice segment in the voice segment set;
the matching analysis module is used for acquiring the standard voice characteristics of the target personnel, performing matching analysis on the standard voice characteristics and the segment characteristics, and selecting the target voice segments corresponding to the standard voice characteristics in the voice segment set according to the result of the matching analysis;
the voice analysis module is used for carrying out voice analysis on the target voice segment to obtain the voice time length, the voice volume and the voice speed of the target voice segment;
and the enthusiasm calculation module is used for calculating the enthusiasm of the target person according to the voice time length, the voice volume and the voice speed.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to implement the audio-based personnel enthusiasm analysis method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium having at least one instruction stored therein, where the at least one instruction is executed by a processor in an electronic device to implement the audio-based human motivation analysis method described above.
According to the embodiment of the invention, audio enhancement processing is performed on the acquired audio data to obtain enhanced audio data, which reduces noise interference in the audio data, enhances the human voice part of the audio data and improves the accuracy of subsequent analysis of the audio data; voice segment cutting is performed on the enhanced audio data to obtain a voice segment set, so that segments without human voice can be deleted from the enhanced audio data, which avoids analyzing such segments later and improves the efficiency of subsequent voice analysis; feature extraction is performed on the voice segment set to obtain the segment features of each voice segment, which improves the accuracy of subsequently analyzing personnel enthusiasm based on the segment features of each voice segment; the standard voice features of a target person are acquired, matching analysis is performed on the standard voice features and the segment features, and the target voice segments corresponding to the standard voice features are selected from the voice segment set according to the result of the matching analysis, so that the target voice segments belonging to the target person are screened out of the voice segment set according to the segment features, which improves the accuracy of analyzing the enthusiasm of the target person using the screened target voice segments; and voice analysis is performed on the target voice segments, and the enthusiasm of the target person is calculated from the voice duration, voice volume and voice speed obtained by the analysis, so that personnel enthusiasm is derived from multiple factors in the audio. Therefore, the audio-based personnel enthusiasm analysis method, apparatus, electronic device and computer-readable storage medium of the invention can solve the problem of low accuracy when judging personnel enthusiasm by audio analysis.
Drawings
Fig. 1 is a schematic flow chart of a method for analyzing human motivation based on audio according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of an audio-based human motivation analysis apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing the audio-based human motivation analysis method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an audio-based personnel enthusiasm analysis method. The execution subject of the audio-based personnel enthusiasm analysis method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the audio-based personnel enthusiasm analysis method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to fig. 1, a flowchart of a method for analyzing human motivation based on audio according to an embodiment of the present invention is shown. In this embodiment, the audio-based human motivation analysis method includes:
and S1, acquiring audio data, and performing audio enhancement processing on the audio data to obtain enhanced audio data.
In the embodiment of the present invention, the audio data includes an audio recording that contains human voice, for example, a recording of participants speaking in a conference room.
In detail, the embodiment of the present invention may acquire the audio data through a pre-installed device with a recording function; for example, a microphone or a computer with a recording function installed in the conference room monitors sound in the conference room to acquire the audio data.
In this embodiment of the present invention, the performing audio enhancement processing on the audio data to obtain enhanced audio data includes:
carrying out noise reduction processing on the audio data to obtain noise reduction audio data;
and carrying out audio emphasis processing on the noise reduction audio data to obtain enhanced audio data.
In detail, in the embodiment of the present invention, in order to remove noise from the audio data, a preset noise reduction filter is used to perform noise filtering on the audio data to obtain the noise reduction audio data, where the filter includes, but is not limited to, a Butterworth filter, a Chebyshev filter, a Bessel filter, and the like.
Further, in order to highlight the human voice in the noise reduction audio data, the embodiment of the present invention performs audio emphasis processing on the noise reduction audio data, and increases the human voice portion in the noise reduction audio data to obtain the enhanced audio data.
Specifically, in the embodiment of the present invention, the pre-emphasis operation may be performed by the function y(t) = x(t) − μ·x(t−1), where x(t) is the noise reduction audio data, t is time, y(t) is the enhanced audio data, and μ is a preset adjustment value of the audio emphasis operation; in the embodiment of the present invention, the value range of μ is [0.9, 1.0].
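As an illustration of the pre-emphasis operation above, the following Python sketch applies y(t) = x(t) − μ·x(t−1) to an array of noise-reduced samples; the function name, the use of NumPy and the default value μ = 0.95 are assumptions for illustration rather than part of the disclosed embodiment.

```python
import numpy as np

def pre_emphasize(samples: np.ndarray, mu: float = 0.95) -> np.ndarray:
    """Apply the pre-emphasis y(t) = x(t) - mu * x(t-1) to noise-reduced audio samples."""
    emphasized = samples.astype(np.float64)
    # For t >= 1, subtract mu times the previous sample; the first sample is left unchanged.
    emphasized[1:] = samples[1:] - mu * samples[:-1]
    return emphasized
```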
According to the embodiment of the invention, the audio data is subjected to audio enhancement processing, so that the noise interference in the audio data can be reduced, the human voice part in the audio data is enhanced, and the accuracy of subsequent analysis on the audio data is improved.
And S2, carrying out voice section cutting on the enhanced audio data to obtain a voice section set.
In this embodiment of the present invention, the voice segment cutting is to delete a segment that does not include human voice in the enhanced audio data.
In detail, the performing speech segment segmentation on the enhanced audio data to obtain a speech segment set includes:
continuously detecting the voice intensity of the enhanced audio data;
when the voice intensity is smaller than a preset decibel threshold value, determining that the corresponding portion of the enhanced audio data is a non-vocal segment;
when the voice intensity is larger than or equal to the decibel threshold value, determining that the corresponding portion of the enhanced audio data is a vocal segment;
and deleting the non-vocal segments from the enhanced audio data to obtain a voice segment set.
Specifically, the embodiment of the present invention uses an audio intensity detection tool to continuously detect the voice intensity of the enhanced audio data, where the audio intensity detection tool includes a PocketRTA decibel tester, a SIA SmaartLive decibel test tool, and the like.
For example, given enhanced audio data with a duration of 20s, the audio intensity at each time between 0s and 20s is continuously measured by using an audio intensity detection tool: the audio intensity from 0s to 5s is 20, from 5s to 10s is 80, from 10s to 15s is 30, and from 15s to 20s is 60. When the decibel threshold is 50, the portions from 5s to 10s and from 15s to 20s of the enhanced audio data are determined to be vocal segments, the portions from 0s to 5s and from 10s to 15s are deleted, and the vocal segments from 5s to 10s and from 15s to 20s are collected into a voice segment set.
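The following Python sketch illustrates one possible realization of this cutting step, measuring a frame-wise intensity and keeping only frames at or above the decibel threshold; the frame length, the RMS-based intensity estimate and the calibration reference are assumptions for illustration, since the embodiment relies on external decibel detection tools such as PocketRTA or SIA SmaartLive.

```python
import numpy as np

def cut_voice_segments(samples, sample_rate, db_threshold=50.0, frame_seconds=1.0, reference=1.0):
    """Return (start_s, end_s) pairs of vocal segments whose intensity reaches the decibel threshold."""
    frame_len = int(sample_rate * frame_seconds)
    vocal_segments = []
    for start in range(0, len(samples), frame_len):
        frame = np.asarray(samples[start:start + frame_len], dtype=np.float64)
        if frame.size == 0:
            continue
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12        # avoid log(0) on silent frames
        intensity_db = 20.0 * np.log10(rms / reference)   # calibration reference is assumed
        if intensity_db >= db_threshold:
            end = min(len(samples), start + frame_len)
            vocal_segments.append((start / sample_rate, end / sample_rate))
    return vocal_segments
```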
According to the embodiment of the invention, the voice section of the enhanced audio data is cut, so that the voice section without voice in the enhanced audio data can be deleted, the voice section without voice is prevented from being analyzed in the subsequent analysis, and the efficiency of the subsequent voice analysis is improved.
And S3, extracting the characteristics of the voice segment set to obtain the segment characteristics of each voice segment in the voice segment set.
In the embodiment of the invention, a DenseNet201 network comprising two fully-connected layers is used to extract the features of each voice segment in the voice segment set. The DenseNet201 network is a densely connected convolutional neural network comprising a plurality of convolutional layers, and the input of each target convolutional layer in the network is the output of all network layers before that target convolutional layer, which reduces the number of parameters that need to be set and improves the processing efficiency of the network.
In detail, the extracting features of the speech segment set to obtain the segment features of each speech segment in the speech segment set includes:
carrying out convolution processing on each voice section in the voice section set to obtain a convolution voice section set;
performing global maximum pooling processing on the convolution voice segment set to obtain a pooled voice segment set;
performing full-connection processing on the pooled speech segment set by using a first full-connection layer to obtain a full-connection speech segment set;
and utilizing a second full-connection layer to perform full-connection processing on the full-connection voice section set to obtain the segment characteristics of each voice section in the voice section set.
Because each voice segment in the voice segment set contains a large amount of voice information, directly analyzing each voice segment in the voice segment set would occupy a large amount of computing resources and the analysis efficiency would be low; the convolution processing reduces this burden, but the segment features in the convolved voice segment set may still be high-dimensional. The embodiment of the invention therefore uses global maximum pooling to further reduce the dimensionality of the segment features obtained by convolution, which reduces the computing resources occupied when the pooled voice segment set is subsequently analyzed and improves the analysis efficiency.
The embodiment of the invention uses two cascaded fully-connected layers to perform full-connection processing on the pooled voice segment set, which can increase the network complexity and thereby improve the accuracy of the segment features of each voice segment in the voice segment set, and this in turn helps improve the accuracy of subsequently analyzing personnel enthusiasm based on the segment features of each voice segment.
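A minimal sketch of the described segment feature extraction (convolution, global maximum pooling, then two fully-connected layers) is given below in Python using PyTorch; the layer sizes and the use of a small stand-in network rather than the full DenseNet201 backbone are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SegmentFeatureExtractor(nn.Module):
    """Convolution -> global max pooling -> two fully-connected layers, mirroring the described pipeline."""

    def __init__(self, in_channels: int = 1, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(1)      # global maximum pooling
        self.fc1 = nn.Linear(128, 256)           # first fully-connected layer
        self.fc2 = nn.Linear(256, feature_dim)   # second fully-connected layer

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, channels, samples) for one voice segment
        x = self.conv(waveform)
        x = self.pool(x).squeeze(-1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                       # segment feature vector
```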
S4, acquiring the standard voice characteristics of the target personnel, performing matching analysis on the standard voice characteristics and the segment characteristics, and selecting the target voice segments corresponding to the standard voice characteristics in the voice segment set according to the matching analysis result.
In the embodiment of the invention, the standard voice features pre-stored for the target person can be obtained from a pre-constructed blockchain node by using a Python statement with a data-fetching function, and the high data throughput of the blockchain can improve the efficiency of obtaining the standard voice features.
In detail, the performing matching analysis on the standard speech feature and the segment feature, and selecting a target speech segment corresponding to the standard speech feature in the speech segment set according to a result of the matching analysis includes:
performing vector transformation on the standard voice features of the target person to obtain a first voice feature vector;
performing vector conversion on the segment features to obtain second voice feature vectors corresponding to the segment features of each voice segment;
calculating distance values of the first voice feature vector and a second voice feature vector corresponding to the segment feature of each voice segment;
screening the segment characteristics corresponding to the second voice characteristic vectors with the distance values smaller than the preset threshold value, and determining the voice sections corresponding to the screened segment characteristics as the target voice sections.
Specifically, the vector conversion of the standard voice feature of the target person to obtain a first voice feature vector includes:
obtaining a byte vector set corresponding to each word in the standard voice characteristics, wherein the byte vector set comprises byte vectors of each byte in the standard voice characteristics;
respectively encoding each byte in the standard voice feature according to the byte vector in the byte vector set to obtain an encoded byte set;
and splicing the coding bytes in the coding byte set to obtain the first voice characteristic vector.
The embodiment of the invention can utilize a one-hot coding technology to code a plurality of byte vectors in the byte vector set, so that the plurality of byte vectors are converted into the first voice characteristic vector.
The specific method for encoding processing by the one-hot encoding technique is to use an N-bit status register to encode N states in the standard speech feature, each state is represented by its own independent register bit, and at any time, only one bit is valid, i.e., only one bit is 1, and the rest are zero values.
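The following sketch shows the N-bit one-hot scheme described above, assuming the bytes have already been mapped to indices of a fixed vocabulary of N states; concatenating (splicing) the per-byte codes yields the feature vector.

```python
import numpy as np

def one_hot_encode(byte_indices, num_states):
    """Encode each byte as an N-bit vector with exactly one bit set, then splice the codes together."""
    codes = np.zeros((len(byte_indices), num_states), dtype=np.float32)
    codes[np.arange(len(byte_indices)), byte_indices] = 1.0   # only one bit is 1 per state
    return codes.reshape(-1)

# Example: three bytes drawn from a vocabulary of 256 possible states.
feature_vector = one_hot_encode([7, 42, 200], num_states=256)
```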
In detail, the step of performing vector transformation on the segment features to obtain the second speech feature vectors corresponding to the segment features of each speech segment is consistent with the step of performing vector transformation on the standard speech features of the target person to obtain the first speech feature vectors, which is not repeated herein.
In the embodiment of the invention, the standard voice characteristics and the segment characteristics of the target personnel are subjected to vector conversion, so that the vectorization of the voice information can be realized, and the efficiency of matching and analyzing the standard voice characteristics and the segment characteristics is improved.
Further, the calculating a distance value between the first speech feature vector and a second speech feature vector corresponding to the segment feature of each speech segment includes:
the distance value is calculated using the following distance algorithm:
wherein L(X, Y) is the distance value, X is the first speech feature vector, and Y_i is the i-th second speech feature vector corresponding to the segment features of each speech segment.
In detail, in the embodiment of the present invention, the segment feature corresponding to the second speech feature vector whose distance value is smaller than the preset threshold value is screened, and the speech segment corresponding to the screened segment feature is determined to be the target speech segment matched with the standard speech feature.
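The distance formula itself is given in the drawings of the original application and is not reproduced here; the sketch below therefore assumes a Euclidean distance purely for illustration, and screens the segment features whose distance to the standard feature vector is below the preset threshold.

```python
import numpy as np

def select_target_segments(standard_vector, segment_vectors, threshold):
    """Return indices of voice segments whose feature vectors lie within the distance threshold."""
    selected = []
    for i, segment_vector in enumerate(segment_vectors):
        # Euclidean distance between the first and the i-th second speech feature vector (assumed metric).
        distance = np.linalg.norm(np.asarray(standard_vector) - np.asarray(segment_vector))
        if distance < threshold:
            selected.append(i)
    return selected
```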
In the embodiment of the invention, the standard voice features of the target person are obtained and matched against the segment features, so that the target voice segments belonging to the target person are screened out of all the voice segments in the voice segment set according to the segment features, which helps improve the accuracy of analyzing the enthusiasm of the target person using the screened target voice segments.
S5, carrying out voice analysis on the target voice section to obtain the voice time length, the voice volume and the voice speed of the target voice section.
In the embodiment of the present invention, the performing voice analysis on the target voice segment to obtain the voice duration, the voice volume and the voice speed of the target voice segment includes:
detecting the voice time length of the target voice section;
continuously detecting the voice intensity of the target voice section, and calculating the voice volume of the target person according to the voice duration and the voice intensity;
carrying out voice recognition on the target voice section, and counting the number of voice words of the user in a voice recognition result;
and calculating the voice speed of the target person according to the voice time length and the voice word number.
In detail, the step of continuously detecting the speech intensity of the target speech segment is consistent with the step of continuously detecting the speech intensity of the enhanced audio data in step S2, and is not repeated herein.
Specifically, calculating the voice volume of the user according to the voice duration and the voice intensity means calculating the average volume of the target person over the voice duration; in the embodiment of the present invention the average volume is calculated by using the following averaging algorithm:
wherein L is the average volume, n is the voice duration, and P_t is the speech intensity of the target speech segment at time t.
Further, in the embodiment of the present invention, ASR (Automatic Speech Recognition) technology is used to perform text conversion on the target speech segment to obtain a speech recognition result, and the number of spoken words of the user in the speech recognition result is counted.
In detail, calculating the speech speed of the target person according to the voice duration and the number of speech words means calculating the speaking speed of the target person over the voice duration of the target speech segment by a rate algorithm:
wherein V is the speech speed, n is the voice duration, and N is the number of speech words.
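The two calculations above reduce to a simple average and a simple ratio; the sketch below assumes one intensity reading per second of the voice duration, which is an illustrative choice rather than part of the embodiment.

```python
def average_volume(intensities):
    """Average volume L over the voice duration, given one speech-intensity reading P_t per time unit."""
    return sum(intensities) / len(intensities)

def speech_speed(word_count, duration_seconds):
    """Speech speed V: the number of recognized words N divided by the voice duration n."""
    return word_count / duration_seconds

# Example: 30 seconds of speech at roughly 80 dB containing 75 recognized words.
L = average_volume([80.0] * 30)   # -> 80.0
V = speech_speed(75, 30.0)        # -> 2.5 words per second
```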
And S6, calculating the enthusiasm of the target person according to the voice time length, the voice volume and the voice speed.
In the embodiment of the present invention, the calculating the enthusiasm of the target person according to the voice duration, the voice volume and the voice pace includes:
calculating the enthusiasm of the target person from the voice duration, the voice volume and the voice pace by using the following enthusiasm algorithm:
J=α*W+β*L+γ*V
wherein J is the enthusiasm, W is the voice duration, L is the average volume, V is the voice pace, and α, β and γ are preset weight coefficients.
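A direct rendering of the weighted formula J = α·W + β·L + γ·V is shown below; the weight values chosen here are illustrative assumptions, since the embodiment only states that α, β and γ are preset coefficients.

```python
def enthusiasm_score(duration_w, volume_l, pace_v, alpha=0.3, beta=0.3, gamma=0.4):
    """Enthusiasm J = alpha*W + beta*L + gamma*V with preset (here assumed) weight coefficients."""
    return alpha * duration_w + beta * volume_l + gamma * pace_v

# Example with the values from the previous sketch: 30 s of speech, average volume 80, pace 2.5.
J = enthusiasm_score(30.0, 80.0, 2.5)
```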
According to the embodiment of the invention, audio enhancement processing is performed on the acquired audio data to obtain enhanced audio data, which reduces noise interference in the audio data, enhances the human voice part of the audio data and improves the accuracy of subsequent analysis of the audio data; voice segment cutting is performed on the enhanced audio data to obtain a voice segment set, so that segments without human voice can be deleted from the enhanced audio data, which avoids analyzing such segments later and improves the efficiency of subsequent voice analysis; feature extraction is performed on the voice segment set to obtain the segment features of each voice segment, which improves the accuracy of subsequently analyzing personnel enthusiasm based on the segment features of each voice segment; the standard voice features of a target person are acquired, matching analysis is performed on the standard voice features and the segment features, and the target voice segments corresponding to the standard voice features are selected from the voice segment set according to the result of the matching analysis, so that the target voice segments belonging to the target person are screened out of the voice segment set according to the segment features, which improves the accuracy of analyzing the enthusiasm of the target person using the screened target voice segments; and voice analysis is performed on the target voice segments, and the enthusiasm of the target person is calculated from the voice duration, voice volume and voice speed obtained by the analysis, so that personnel enthusiasm is derived from multiple factors in the audio. Therefore, the audio-based personnel enthusiasm analysis method of the invention can solve the problem of low accuracy when judging personnel enthusiasm by audio analysis.
Fig. 2 is a functional block diagram of an apparatus for analyzing human motivation based on audio according to an embodiment of the present invention.
The audio-based personnel enthusiasm analysis apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the audio-based personnel enthusiasm analysis apparatus 100 may include an audio enhancement module 101, a voice cutting module 102, a feature extraction module 103, a matching analysis module 104, a voice analysis module 105 and an enthusiasm calculation module 106. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of the electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the audio enhancement module 101 is configured to obtain audio data, and perform audio enhancement processing on the audio data to obtain enhanced audio data.
In the embodiment of the present invention, the audio data includes an audio recording that contains human voice, for example, a recording of participants speaking in a conference room.
In detail, the embodiment of the present invention may acquire the audio data through a pre-installed device with a recording function; for example, a microphone or a computer with a recording function installed in the conference room monitors sound in the conference room to acquire the audio data.
In this embodiment of the present invention, the audio enhancement module 101 is specifically configured to:
acquiring audio data;
carrying out noise reduction processing on the audio data to obtain noise reduction audio data;
and carrying out audio emphasis processing on the noise reduction audio data to obtain enhanced audio data.
In detail, in the embodiment of the present invention, in order to remove noise from the audio data, a preset noise reduction filter is used to perform noise filtering on the audio data to obtain the noise reduction audio data, where the filter includes, but is not limited to, a Butterworth filter, a Chebyshev filter, a Bessel filter, and the like.
Further, in order to highlight the human voice in the noise reduction audio data, the embodiment of the present invention performs audio emphasis processing on the noise reduction audio data, and increases the human voice portion in the noise reduction audio data to obtain the enhanced audio data.
Specifically, in the embodiment of the present invention, the pre-emphasis operation may be performed by the function y(t) = x(t) − μ·x(t−1), where x(t) is the noise reduction audio data, t is time, y(t) is the enhanced audio data, and μ is a preset adjustment value of the audio emphasis operation; in the embodiment of the present invention, the value range of μ is [0.9, 1.0].
According to the embodiment of the invention, the audio data is subjected to audio enhancement processing, so that the noise interference in the audio data can be reduced, the human voice part in the audio data is enhanced, and the accuracy of subsequent analysis on the audio data is improved.
The voice cutting module 102 is configured to perform voice segment cutting on the enhanced audio data to obtain a voice segment set.
In this embodiment of the present invention, the voice segment cutting is to delete a segment that does not include human voice in the enhanced audio data.
In detail, the voice cutting module 102 is specifically configured to:
continuously detecting the voice intensity of the enhanced audio data;
when the voice intensity is smaller than a preset decibel threshold value, determining that the corresponding portion of the enhanced audio data is a non-vocal segment;
when the voice intensity is larger than or equal to the decibel threshold value, determining that the corresponding portion of the enhanced audio data is a vocal segment;
and deleting the non-vocal segments from the enhanced audio data to obtain a voice segment set.
Specifically, the embodiment of the present invention uses an audio intensity detection tool to continuously detect the voice intensity of the enhanced audio data, where the audio intensity detection tool includes a PocketRTA decibel tester, a SIA SmaartLive decibel test tool, and the like.
For example, given enhanced audio data with a duration of 20s, the audio intensity at each time between 0s and 20s is continuously measured by using an audio intensity detection tool: the audio intensity from 0s to 5s is 20, from 5s to 10s is 80, from 10s to 15s is 30, and from 15s to 20s is 60. When the decibel threshold is 50, the portions from 5s to 10s and from 15s to 20s of the enhanced audio data are determined to be vocal segments, the portions from 0s to 5s and from 10s to 15s are deleted, and the vocal segments from 5s to 10s and from 15s to 20s are collected into a voice segment set.
According to the embodiment of the invention, the voice section of the enhanced audio data is cut, so that the voice section without voice in the enhanced audio data can be deleted, the voice section without voice is prevented from being analyzed in the subsequent analysis, and the efficiency of the subsequent voice analysis is improved.
The feature extraction module 103 is configured to perform feature extraction on the speech segment set to obtain segment features of each speech segment in the speech segment set.
In the embodiment of the invention, a DenseNet201 network comprising two fully-connected layers is used to extract the features of each voice segment in the voice segment set. The DenseNet201 network is a densely connected convolutional neural network comprising a plurality of convolutional layers, and the input of each target convolutional layer in the network is the output of all network layers before that target convolutional layer, which reduces the number of parameters that need to be set and improves the processing efficiency of the network.
In detail, the feature extraction module 103 is specifically configured to:
carrying out convolution processing on each voice section in the voice section set to obtain a convolution voice section set;
performing global maximum pooling processing on the convolution voice segment set to obtain a pooled voice segment set;
performing full-connection processing on the pooled speech segment set by using a first full-connection layer to obtain a full-connection speech segment set;
and utilizing a second full-connection layer to perform full-connection processing on the full-connection voice section set to obtain the segment characteristics of each voice section in the voice section set.
Because each voice segment in the voice segment set contains a large amount of voice information, directly analyzing each voice segment in the voice segment set would occupy a large amount of computing resources and the analysis efficiency would be low; the convolution processing reduces this burden, but the segment features in the convolved voice segment set may still be high-dimensional. The embodiment of the invention therefore uses global maximum pooling to further reduce the dimensionality of the segment features obtained by convolution, which reduces the computing resources occupied when the pooled voice segment set is subsequently analyzed and improves the analysis efficiency.
The embodiment of the invention uses two cascaded fully-connected layers to perform full-connection processing on the pooled voice segment set, which can increase the network complexity and thereby improve the accuracy of the segment features of each voice segment in the voice segment set, and this in turn helps improve the accuracy of subsequently analyzing personnel enthusiasm based on the segment features of each voice segment.
The matching analysis module 104 is configured to obtain a standard voice feature of a target person, perform matching analysis on the standard voice feature and the segment feature, and select a target voice segment corresponding to the standard voice feature in the voice segment set according to a result of the matching analysis.
In the embodiment of the invention, the standard voice features pre-stored for the target person can be obtained from a pre-constructed blockchain node by using a Python statement with a data-fetching function, and the high data throughput of the blockchain can improve the efficiency of obtaining the standard voice features.
In detail, the matching analysis module 104 is specifically configured to:
acquiring standard voice characteristics of a target person;
performing vector transformation on the standard voice features of the target person to obtain a first voice feature vector;
performing vector conversion on the segment features to obtain second voice feature vectors corresponding to the segment features of each voice segment;
calculating distance values of the first voice feature vector and a second voice feature vector corresponding to the segment feature of each voice segment;
screening the segment characteristics corresponding to the second voice characteristic vectors with the distance values smaller than the preset threshold value, and determining the voice sections corresponding to the screened segment characteristics as the target voice sections.
Specifically, the vector conversion of the standard voice feature of the target person to obtain a first voice feature vector includes:
obtaining a byte vector set corresponding to each word in the standard voice characteristics, wherein the byte vector set comprises byte vectors of each byte in the standard voice characteristics;
respectively encoding each byte in the standard voice feature according to the byte vector in the byte vector set to obtain an encoded byte set;
and splicing the coding bytes in the coding byte set to obtain the first voice characteristic vector.
The embodiment of the invention can utilize a one-hot coding technology to code a plurality of byte vectors in the byte vector set, so that the plurality of byte vectors are converted into the first voice characteristic vector.
The specific method for encoding processing by the one-hot encoding technique is to use an N-bit status register to encode N states in the standard speech feature, each state is represented by its own independent register bit, and at any time, only one bit is valid, i.e., only one bit is 1, and the rest are zero values.
In detail, the step of performing vector transformation on the segment features to obtain the second speech feature vectors corresponding to the segment features of each speech segment is consistent with the step of performing vector transformation on the standard speech features of the target person to obtain the first speech feature vectors, which is not repeated herein.
In the embodiment of the invention, the standard voice characteristics and the segment characteristics of the target personnel are subjected to vector conversion, so that the vectorization of the voice information can be realized, and the efficiency of matching and analyzing the standard voice characteristics and the segment characteristics is improved.
Further, the calculating a distance value between the first speech feature vector and a second speech feature vector corresponding to the segment feature of each speech segment includes:
the distance value is calculated using the following distance algorithm:
wherein L(X, Y) is the distance value, X is the first speech feature vector, and Y_i is the i-th second speech feature vector corresponding to the segment features of each speech segment.
In detail, in the embodiment of the present invention, the segment feature corresponding to the second speech feature vector whose distance value is smaller than the preset threshold value is screened, and the speech segment corresponding to the screened segment feature is determined to be the target speech segment matched with the standard speech feature.
In the embodiment of the invention, the standard voice features of the target person are obtained and matched against the segment features, so that the target voice segments belonging to the target person are screened out of all the voice segments in the voice segment set according to the segment features, which helps improve the accuracy of analyzing the enthusiasm of the target person using the screened target voice segments.
The voice analysis module 105 is configured to perform voice analysis on the target voice segment to obtain a voice duration, a voice volume, and a voice pace of the target voice segment.
In an embodiment of the present invention, the voice analysis module 105 is specifically configured to:
detecting the voice time length of the target voice section;
continuously detecting the voice intensity of the target voice section, and calculating the voice volume of the target person according to the voice duration and the voice intensity;
carrying out voice recognition on the target voice section, and counting the number of voice words of the user in a voice recognition result;
and calculating the voice speed of the target person according to the voice time length and the voice word number.
In detail, the step of continuously detecting the speech intensity of the target speech segment is consistent with the step of continuously detecting the speech intensity of the enhanced audio data in step S2, and is not repeated herein.
Specifically, calculating the voice volume of the user according to the voice duration and the voice intensity means calculating the average volume of the target person over the voice duration; in the embodiment of the present invention the average volume is calculated by using the following averaging algorithm:
wherein L is the average volume, n is the voice duration, and P_t is the speech intensity of the target speech segment at time t.
Further, in the embodiment of the present invention, ASR (Automatic Speech Recognition) technology is used to perform text conversion on the target speech segment to obtain a speech recognition result, and the number of spoken words of the user in the speech recognition result is counted.
In detail, calculating the speech speed of the target person according to the voice duration and the number of speech words means calculating the speaking speed of the target person over the voice duration of the target speech segment by a rate algorithm:
wherein V is the speech speed, n is the voice duration, and N is the number of speech words.
And the enthusiasm calculation module 106 is configured to calculate the enthusiasm of the target person according to the voice duration, the voice volume and the voice pace.
In the embodiment of the present invention, the enthusiasm calculation module 106 is specifically configured to:
calculate the enthusiasm of the target person from the voice duration, the voice volume and the voice pace by using the following enthusiasm algorithm:
J=α*W+β*L+γ*V
wherein J is the enthusiasm, W is the voice duration, L is the average volume, V is the voice pace, and α, β and γ are preset weight coefficients.
According to the embodiment of the invention, audio enhancement processing is performed on the acquired audio data to obtain enhanced audio data, which reduces noise interference in the audio data, enhances the human voice part of the audio data and improves the accuracy of subsequent analysis of the audio data; voice segment cutting is performed on the enhanced audio data to obtain a voice segment set, so that segments without human voice can be deleted from the enhanced audio data, which avoids analyzing such segments later and improves the efficiency of subsequent voice analysis; feature extraction is performed on the voice segment set to obtain the segment features of each voice segment, which improves the accuracy of subsequently analyzing personnel enthusiasm based on the segment features of each voice segment; the standard voice features of a target person are acquired, matching analysis is performed on the standard voice features and the segment features, and the target voice segments corresponding to the standard voice features are selected from the voice segment set according to the result of the matching analysis, so that the target voice segments belonging to the target person are screened out of the voice segment set according to the segment features, which improves the accuracy of analyzing the enthusiasm of the target person using the screened target voice segments; and voice analysis is performed on the target voice segments, and the enthusiasm of the target person is calculated from the voice duration, voice volume and voice speed obtained by the analysis, so that personnel enthusiasm is derived from multiple factors in the audio. Therefore, the audio-based personnel enthusiasm analysis apparatus of the invention can solve the problem of low accuracy when judging personnel enthusiasm by audio analysis.
Fig. 3 is a schematic structural diagram of an electronic device implementing an audio-based human motivation analysis method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as an audio-based human motivation analysis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the audio-based human motivation analysis program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., audio-based human activity analysis programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores an audio-based human motivation analysis program 12, which is a set of instructions that, when executed by the processor 10, can implement the following:
acquiring audio data, and performing audio enhancement processing on the audio data to obtain enhanced audio data;
performing voice segment cutting on the enhanced audio data to obtain a voice segment set;
performing feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set;
acquiring standard voice features of a target person, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis;
performing voice analysis on the target voice segments to obtain the voice duration, the voice volume and the voice speed of the target voice segments;
and calculating the positivity of the target person according to the voice duration, the voice volume and the voice speed.
For the specific implementation of the above instructions by the processor 10, reference may be made to the description of the corresponding steps in the embodiment of fig. 1, which is not repeated here.
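As a concrete illustration of the instruction sequence listed above, the following minimal Python sketch shows how the six steps might be wired together. All function names (enhance_audio, cut_voice_segments, extract_segment_features, matches_target, analyze_speech, compute_positivity) are hypothetical and are not taken from the patent; possible implementations of the individual steps are sketched after the corresponding claims further below.

```python
# Hypothetical end-to-end sketch of the pipeline described above; every
# helper used here is a placeholder whose possible implementation is
# sketched after the corresponding claim further below.

def analyze_positivity(audio, sample_rate, standard_feature):
    # Step 1: audio enhancement (noise reduction plus emphasis).
    enhanced = enhance_audio(audio, sample_rate)

    # Step 2: cut the enhanced audio into voice segments, dropping silence.
    segments = cut_voice_segments(enhanced, sample_rate)

    # Step 3: extract a feature vector for every voice segment.
    features = [extract_segment_features(seg) for seg in segments]

    # Step 4: keep only the segments whose features match the target person.
    target_segments = [seg for seg, feat in zip(segments, features)
                       if matches_target(standard_feature, feat)]

    # Step 5: measure duration, volume and speed on the selected segments.
    duration, volume, speed = analyze_speech(target_segments, sample_rate)

    # Step 6: combine the three measurements into a positivity score.
    return compute_positivity(duration, volume, speed)
```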
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring audio data, and performing audio enhancement processing on the audio data to obtain enhanced audio data;
performing voice segment cutting on the enhanced audio data to obtain a voice segment set;
performing feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set;
acquiring standard voice features of a target person, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis;
performing voice analysis on the target voice segments to obtain the voice duration, the voice volume and the voice speed of the target voice segments;
and calculating the positivity of the target person according to the voice duration, the voice volume and the voice speed.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by a single unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A method for audio-based human motivation analysis, the method comprising:
acquiring audio data, and performing audio enhancement processing on the audio data to obtain enhanced audio data;
performing voice segment cutting on the enhanced audio data to obtain a voice segment set;
performing feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set;
acquiring standard voice features of a target person, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis;
performing voice analysis on the target voice segments to obtain the voice duration, the voice volume and the voice speed of the target voice segments;
and calculating the positivity of the target person according to the voice duration, the voice volume and the voice speed.
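The claim leaves open how the three measured quantities are combined into a single positivity value. The sketch below is one plausible reading, offered only as an illustrative assumption: each quantity is normalized against a reference value and the normalized values are combined by a weighted sum. The reference values and weights are not specified by the patent.

```python
def compute_positivity(duration_s, volume_db, speed_wpm,
                       references=(60.0, 60.0, 120.0),
                       weights=(0.4, 0.3, 0.3)):
    """Illustrative only: the patent does not specify this formula.

    Each measurement is normalized against a reference value (capped at 1.0)
    and the normalized values are combined by a weighted sum in [0, 1].
    """
    measurements = (duration_s, volume_db, speed_wpm)
    normalized = [min(m / r, 1.0) for m, r in zip(measurements, references)]
    return sum(w * n for w, n in zip(weights, normalized))

# Example: 45 s of speech at roughly 55 dB and 110 words per minute.
score = compute_positivity(45.0, 55.0, 110.0)  # about 0.85
```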
2. The audio-based human motivation analysis method of claim 1, wherein the performing audio enhancement processing on the audio data to obtain enhanced audio data comprises:
performing noise reduction processing on the audio data to obtain noise-reduced audio data;
and performing audio emphasis processing on the noise-reduced audio data to obtain the enhanced audio data.
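The claim does not name particular noise reduction or emphasis algorithms. As one illustrative reading, the sketch below combines a crude single-frame spectral subtraction with a first-order pre-emphasis filter; a real system would typically perform the subtraction frame by frame over an STFT.

```python
import numpy as np

def enhance_audio(audio, sample_rate, noise_seconds=0.5, alpha=0.97):
    """Hedged sketch: spectral subtraction plus pre-emphasis, not specific
    algorithms mandated by the claim (it mandates none)."""
    audio = np.asarray(audio, dtype=float)

    # Noise reduction: estimate a noise magnitude spectrum from the first
    # half second (assumed to be background noise) and subtract it.
    noise = audio[: int(noise_seconds * sample_rate)]
    noise_mag = np.abs(np.fft.rfft(noise, n=len(audio)))
    spec = np.fft.rfft(audio)
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    denoised = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)),
                            n=len(audio))

    # Audio emphasis: first-order pre-emphasis to boost high frequencies.
    emphasized = np.append(denoised[0], denoised[1:] - alpha * denoised[:-1])
    return emphasized
```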
3. The audio-based human motivation analysis method according to claim 1, wherein the performing voice segment cutting on the enhanced audio data to obtain a voice segment set comprises:
continuously detecting the voice intensity of the enhanced audio data;
when the voice intensity is lower than a preset decibel threshold, determining the corresponding portion of the enhanced audio data to be a non-voice segment;
when the voice intensity is greater than or equal to the decibel threshold, determining the corresponding portion of the enhanced audio data to be a voiced segment;
and deleting the non-voice segments from the enhanced audio data to obtain the voice segment set.
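A minimal sketch of the decibel-threshold cutting described in this claim, assuming per-frame RMS level (in dBFS) as the measure of voice intensity; the frame length and threshold are illustrative values, not requirements of the claim.

```python
import numpy as np

def cut_voice_segments(audio, sample_rate, frame_ms=25, threshold_db=-35.0):
    """Split audio into voiced segments by comparing each frame's RMS level
    (in dBFS) with a preset decibel threshold; quieter frames are dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, current = [], []

    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        level_db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-10)

        if level_db >= threshold_db:      # voiced frame: keep accumulating
            current.append(frame)
        elif current:                     # silence after a voiced run
            segments.append(np.concatenate(current))
            current = []

    if current:                           # flush a trailing voiced run
        segments.append(np.concatenate(current))
    return segments
```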
4. The audio-based human motivation analysis method according to claim 1, wherein the performing feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set comprises:
performing convolution processing on each voice segment in the voice segment set to obtain a convolved voice segment set;
performing global maximum pooling on the convolved voice segment set to obtain a pooled voice segment set;
performing full-connection processing on the pooled voice segment set by using a first fully connected layer to obtain a fully connected voice segment set;
and performing full-connection processing on the fully connected voice segment set by using a second fully connected layer to obtain the segment features of each voice segment in the voice segment set.
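The claim describes a small convolutional network: a convolution, global maximum pooling, and two fully connected layers. The PyTorch sketch below follows that structure; the input representation (a 40-band spectrogram per segment), kernel size and layer widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SegmentFeatureExtractor(nn.Module):
    """Sketch of the convolution / global-max-pooling / two fully connected
    layers named in the claim; all dimensions are illustrative."""

    def __init__(self, in_channels=40, conv_channels=128,
                 hidden_dim=256, feature_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels,
                              kernel_size=5, padding=2)    # convolution
        self.pool = nn.AdaptiveMaxPool1d(1)                # global max pooling
        self.fc1 = nn.Linear(conv_channels, hidden_dim)    # first FC layer
        self.fc2 = nn.Linear(hidden_dim, feature_dim)      # second FC layer

    def forward(self, x):
        # x: (batch, in_channels, time), e.g. a 40-band spectrogram per segment
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)       # (batch, conv_channels)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                 # segment feature vectors

# Example: feature vectors for a batch of 3 segments of 200 frames each.
model = SegmentFeatureExtractor()
features = model(torch.randn(3, 40, 200))  # shape (3, 128)
```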
5. The audio-based human motivation analysis method according to any one of claims 1-4, wherein the performing matching analysis on the standard voice features and the segment features and selecting the target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis comprises:
performing vector transformation on the standard voice features of the target person to obtain a first voice feature vector;
performing vector transformation on the segment features to obtain a second voice feature vector corresponding to the segment features of each voice segment;
calculating a distance value between the first voice feature vector and the second voice feature vector corresponding to the segment features of each voice segment;
and screening out the segment features whose corresponding second voice feature vectors have distance values smaller than a preset distance threshold, and determining the voice segments corresponding to the screened segment features as the target voice segments.
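An illustrative sketch of the matching analysis in this claim, assuming Euclidean distance between the first (standard) feature vector and each segment's second feature vector; the distance metric and threshold value are assumptions, since the claim only requires some distance value and a preset threshold.

```python
import numpy as np

def select_target_segments(standard_vector, segment_vectors, segments,
                           distance_threshold=1.0):
    """Keep the segments whose second feature vector lies within a preset
    distance of the target person's first (standard) feature vector."""
    standard_vector = np.asarray(standard_vector, dtype=float)
    target_segments = []
    for segment, vector in zip(segments, segment_vectors):
        distance = np.linalg.norm(standard_vector - np.asarray(vector, dtype=float))
        if distance < distance_threshold:
            target_segments.append(segment)
    return target_segments
```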
6. The audio-based human motivation analysis method of claim 5, wherein the performing vector transformation on the standard voice features of the target person to obtain a first voice feature vector comprises:
obtaining a byte vector set corresponding to each word in the standard voice features, wherein the byte vector set comprises a byte vector for each byte in the standard voice features;
encoding each byte in the standard voice features according to the byte vectors in the byte vector set to obtain a set of encoded bytes;
and splicing the encoded bytes in the set of encoded bytes to obtain the first voice feature vector.
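Read literally, this claim encodes the standard voice features byte by byte using a byte vector table and splices the per-byte vectors into one feature vector. The sketch below uses a randomly initialized lookup table purely as a stand-in; how the byte vectors are actually obtained is not detailed by the claim.

```python
import numpy as np

def encode_standard_feature(standard_feature: bytes, dim=8, seed=0):
    """Illustrative byte-level encoding: look up a vector for every byte value
    and splice (concatenate) the per-byte vectors into one feature vector."""
    rng = np.random.default_rng(seed)
    byte_vector_table = rng.normal(size=(256, dim))   # stand-in byte vectors

    encoded_bytes = [byte_vector_table[b] for b in standard_feature]
    return np.concatenate(encoded_bytes)              # first voice feature vector

# Example: an 8-byte serialized feature becomes a 64-dimensional vector.
first_vector = encode_standard_feature(b"\x01\x02abcdef")
```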
7. The audio-based human motivation analysis method according to claim 1, wherein the performing voice analysis on the target voice segments to obtain the voice duration, the voice volume and the voice speed of the target voice segments comprises:
detecting the voice duration of the target voice segments;
continuously detecting the voice intensity of the target voice segments, and calculating the voice volume of the target person according to the voice duration and the voice intensity;
performing voice recognition on the target voice segments, and counting the number of words spoken in the voice recognition result;
and calculating the voice speed of the target person according to the voice duration and the counted number of words.
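An illustrative sketch of the three measurements in this claim: duration from the number of voiced samples, volume as an average level over the voiced audio, and speed as recognized words per minute. The speech recognizer is represented by an assumed stand-in helper (recognize_words), since the claim does not name one, and the dB reference level is illustrative.

```python
import numpy as np

def recognize_words(audio, sample_rate):
    # Stand-in for a real speech recognizer; assumed to return a list of
    # recognized words. Replace with an actual ASR system in practice.
    return []

def analyze_speech(target_segments, sample_rate):
    """Illustrative computation of voice duration, volume and speed."""
    audio = np.concatenate(target_segments)

    # Voice duration: number of voiced samples divided by the sample rate.
    duration_s = len(audio) / sample_rate

    # Voice volume: average RMS level in dB relative to an illustrative
    # reference amplitude of 1e-4 (typical speech lands around 50-60 dB).
    rms = np.sqrt(np.mean(audio ** 2))
    volume_db = 20.0 * np.log10(max(rms, 1e-10) / 1e-4)

    # Voice speed: recognized words per minute of voiced speech.
    word_count = len(recognize_words(audio, sample_rate))
    speed_wpm = word_count / (duration_s / 60.0)

    return duration_s, volume_db, speed_wpm
```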
8. An audio-based human motivation analysis apparatus, the apparatus comprising:
the audio enhancement module is used for acquiring audio data and performing audio enhancement processing on the audio data to obtain enhanced audio data;
the voice cutting module is used for performing voice segment cutting on the enhanced audio data to obtain a voice segment set;
the feature extraction module is used for performing feature extraction on the voice segment set to obtain segment features of each voice segment in the voice segment set;
the matching analysis module is used for acquiring standard voice features of a target person, performing matching analysis on the standard voice features and the segment features, and selecting target voice segments corresponding to the standard voice features in the voice segment set according to the result of the matching analysis;
the voice analysis module is used for performing voice analysis on the target voice segments to obtain the voice duration, the voice volume and the voice speed of the target voice segments;
and the positivity calculation module is used for calculating the positivity of the target person according to the voice duration, the voice volume and the voice speed.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio-based human motivation analysis method of any one of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the audio-based human motivation analysis method according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011508395.8A (published as CN112634938A) | 2020-12-18 | 2020-12-18 | Audio-based personnel positivity analysis method, device, equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN112634938A | 2021-04-09 |
Family
ID=75317432
Family Applications (1)

| Application Number | Status | Priority Date | Filing Date |
|---|---|---|---|
| CN202011508395.8A (CN112634938A) | Pending | 2020-12-18 | 2020-12-18 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |