WO2021135547A1 - Human voice detection method, apparatus, device, and storage medium - Google Patents

Human voice detection method, apparatus, device, and storage medium

Info

Publication number
WO2021135547A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2020/123198
Other languages
French (fr)
Chinese (zh)
Inventor
付姝华
汪斌
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135547A1 publication Critical patent/WO2021135547A1/en

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/21: the extracted parameters being power information
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786: Adaptive threshold

Definitions

  • This application relates to the field of audio processing technology, and also to the field of artificial intelligence, and in particular to a human voice detection method, apparatus, device, and storage medium.
  • VAD (Voice Activity Detection) is widely used in voice coding. Its purpose is to identify and remove long silent periods from the voice signal stream so as to save voice channel resources without reducing quality of service, and it is an important component of IP telephony applications. For example, not sending packets during silence saves valuable bandwidth resources and helps reduce the end-to-end delay perceived by users.
  • However, current VAD technology generally can only distinguish silence from non-silence. If human voices and non-human voices could be further identified, voice coding could further improve bandwidth utilization.
  • Distinguishing human voice from non-human voice also plays a key role in noise suppression. Noise suppression is a typical application of audio pre- and post-processing and underpins the performance of a call product; treating non-human voice as noise to be tracked and suppressed can greatly improve noise suppression performance.
  • The inventor realizes that human voice detection in prior-art noise suppression adapts part of the VAD technology to track noise. This kind of technique suppresses stationary noise well, but suppresses non-stationary noise poorly.
  • the purpose of this application is to provide a human voice detection method, device and storage medium to solve the technical problem of poor suppression of non-stationary noise caused by the inability to accurately distinguish between human voice and non-stationary noise in the prior art.
  • a human voice detection method including:
  • the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample includes:
  • a human voice detection device including:
  • the time domain feature extraction module is used to obtain the time domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
  • a time-domain feature calculation module configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information
  • the frequency domain feature extraction module is used to obtain the frequency domain signal corresponding to the current frame audio signal, and obtain the energy of each subband of the current frame audio signal according to the frequency domain signal;
  • a frequency domain feature calculation module configured to obtain the subband energy information value of the audio signal of the current frame according to the energy of each subband
  • the gate threshold determination module is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively;
  • a time-domain vocal detection module configured to obtain the first vocal probability value of the current frame of audio signal according to the time-domain envelope information value and the time-domain envelope information gate threshold;
  • a frequency domain vocal detection module configured to obtain the second vocal probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
  • the human voice probability calculation module is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • an electronic device includes a processor, and a memory coupled to the processor, the memory stores program instructions that can be executed by the processor;
  • the processor executes the program instructions stored in the memory, the following steps are implemented:
  • a storage medium is provided, and program instructions are stored in the storage medium, and the following steps are implemented when the program instructions are executed by a processor:
  • The human voice detection method, apparatus, device, and storage medium of the present application obtain time-domain envelope information from the audio signal of the current frame and the audio signals of the previous multiple frames, and obtain the energy of each sub-band from the audio signal of the current frame. Time-domain data analysis is then performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies, and from the two analysis results the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame of audio signal are calculated. Finally, the human voice probability value of the current frame is computed from the two detection probability values.
  • In this way, the accuracy of human voice detection is increased, and human voice can be accurately distinguished from non-stationary noise, which effectively avoids damage to the human voice and at the same time improves the suppression of non-stationary noise.
  • In addition, updating the gate thresholds adapts the detector to changes in the call scene, so effective human voice can be tracked quickly.
  • FIG. 1 is a schematic flowchart of the human voice detection method according to the first embodiment of the application
  • FIG. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of this application.
  • FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a human voice detection device according to a fourth embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of this application.
  • The terms “first”, “second”, and “third” in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined with “first”, “second”, and “third” may explicitly or implicitly include at least one such feature.
  • In the description of this application, “a plurality of” means at least two, such as two or three, unless otherwise specifically defined. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement, and so on between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • In the embodiments of this application, each frame of audio signal is the original digital audio signal within a unit of time, and a frame of audio signal may be any one of a silent frame, a human voice frame, or an environmental noise frame.
  • A silent frame is an original digital audio signal frame without energy; human voice frames and environmental noise frames are both original digital audio signal frames with energy. Environmental noise frames and silent frames are non-human-voice frames.
  • In a human voice frame, the dominant sound is the sound made when a person speaks, i.e., the human voice accounts for a large proportion of the original digital audio signal; in an environmental noise frame, the dominant sound is not the sound of a person speaking, i.e., the human voice accounts for only a small proportion of the signal.
  • In this embodiment, human voice detection is performed on each frame of audio signal to determine whether the current frame is a human voice frame. Since silent frames are easily distinguished from human voice frames, human voice detection mainly determines whether a frame is an environmental noise frame or a human voice frame.
  • In this embodiment, time-domain envelope information is obtained from the audio signal of the current frame and the audio signals of the previous multiple frames, and the energy of each sub-band is obtained from the audio signal of the current frame.
  • Time-domain data analysis is performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies; from the two analysis results, the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame of audio signal are calculated.
  • The two human voice detection probability values are then combined to determine whether the current frame is a human voice frame.
  • Fig. 1 is a schematic flowchart of a human voice detection method according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human voice detection method includes steps:
  • S101 Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • In step S101, the time-domain envelope information of the most recent multiple frames of audio signal is acquired. The first piece of envelope information is the maximum value vmax of each frame of audio signal, and the second is the mean of these maximum values (the average envelope value envelopeAve).
  • Specifically, when human voice detection needs to be performed on an audio sample to be detected, the audio sample is first divided into frames, where each frame of the audio signal includes multiple sampling points and each sampling point has an amplitude.
  • The maximum value of each frame of audio signal is the maximum amplitude over the sampling points of that frame: if the t-th frame includes n sampling points Xt(1), Xt(2), ..., Xt(n), where Xt(n) denotes the n-th sampling point of the t-th frame, then vmax = max(Xt(1), Xt(2), ..., Xt(n)).
  • In step S101, the maximum value vmax of each frame of audio signal is recorded, and the maxima of the most recent M frames (vmax(1), vmax(2), ..., vmax(M)) are used to calculate the average envelope value envelopeAve.
  • The most recent M frames of audio signal include the current frame (the M-th frame) and the M-1 frames before it (the 1st, 2nd, ..., (M-1)-th frames).
  • The maxima of the M-1 previous frames and of the current frame are accumulated to obtain an accumulated value, and the accumulated value is divided by M to calculate the average envelope value envelopeAve.
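  • As a minimal, hedged sketch of step S101 (not the patent's reference implementation), the per-frame maximum vmax and the average envelope value envelopeAve could be computed as follows in Python; the framing helper, the use of the absolute amplitude, and the variable names are illustrative assumptions:

      import numpy as np

      def frame_signal(samples, frame_len):
          # Split a 1-D array of samples into non-overlapping frames (illustrative framing).
          n_frames = len(samples) // frame_len
          return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

      def frame_max(frame):
          # vmax: maximum amplitude over the sampling points of one frame
          # (taking the absolute value is an assumption; the patent speaks of the amplitude maximum).
          return float(np.max(np.abs(frame)))

      def average_envelope(vmax_history, M):
          # envelopeAve: mean of the maxima of the most recent M frames,
          # i.e. the current frame plus the M-1 frames before it.
          recent = vmax_history[-M:]
          return sum(recent) / len(recent)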
  • S102 Acquire a time domain envelope information value of the audio signal of the current frame according to the time domain envelope information.
  • In step S102, time-domain data analysis is performed on the time-domain envelope information obtained in step S101, and the time-domain envelope information is quantized to obtain the time-domain envelope information value (the quantized value of the time-domain envelope information).
  • In this embodiment, the time-domain envelope information is quantized as follows: first, for each frame in the most recent multiple frames of audio signal, obtain the difference between that frame's maximum value and the average envelope value; then take the logarithm of each frame's difference; finally, accumulate the logarithmic values of all frames to obtain the time-domain envelope information value of the audio signal of the current frame.
  • Because the time-domain envelope information is obtained from the most recent multiple frames of audio signal, the time-domain envelope of a human voice can be regarded as a smooth curve, which differs from the characteristics of environmental noise. The time-domain envelope information value therefore reflects changes in the sound well and can be used to accurately detect whether a human voice is present.
  • The time-domain envelope information value envlopEng is calculated by accumulating, over the most recent M frames, the logarithm of the difference between each frame's maximum value and the average envelope value, where vMax(i) is the maximum value of the i-th frame in the most recent M frames of audio signal, i is 1, 2, ..., M, and envelopeAve is the average envelope value.
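  • A hedged sketch of the time-domain envelope information value described above (difference from the average envelope, logarithm, accumulation); the absolute value and the small epsilon are assumptions added only to keep the logarithm well defined and are not taken from the patent:

      import math

      def envelope_info_value(vmax_recent, envelope_ave, eps=1e-12):
          # envlopEng: accumulate the log of the difference between each recent
          # frame's maximum vMax(i) and the average envelope value envelopeAve.
          return sum(math.log(abs(v - envelope_ave) + eps) for v in vmax_recent)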
  • S103 Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • In this embodiment, the audio signal of the current frame is a time-domain signal.
  • The audio signal of the current frame is transformed from the time domain to the frequency domain by a Fourier transform to generate the frequency-domain signal corresponding to the audio signal of the current frame; the frequency-domain signal is then divided into sub-bands, and the energy of each sub-band is calculated.
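  • A minimal sketch of step S103, assuming an FFT-based transform and an even split of the positive-frequency bins into N sub-bands; the patent does not specify the transform size or the sub-band boundaries, so these are illustrative choices:

      import numpy as np

      def subband_energies(frame, n_subbands):
          # Transform one time-domain frame to the frequency domain and return
          # the energy of each sub-band, subEng(1..N).
          spectrum = np.fft.rfft(frame)              # frequency-domain signal
          power = np.abs(spectrum) ** 2              # per-bin energy
          bands = np.array_split(power, n_subbands)  # illustrative even sub-band split
          return [float(np.sum(b)) for b in bands]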
  • S104 Acquire a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • In step S104, first, the average energy value of the sub-band energies is calculated from the energy of each sub-band: the energy values subEng(k) of the sub-bands are accumulated, and the accumulated value is divided by N to obtain the average energy value aveSubEng. Then the difference between each sub-band energy subEng(k) and the average energy value aveSubEng is obtained; the logarithm of each sub-band's difference is taken; and finally the logarithmic values of all sub-bands are accumulated to obtain the sub-band energy information value of the audio signal of the current frame.
  • In this embodiment, the sub-band energy information value is calculated from the sub-band energies of the different sub-bands and the average energy value of the sub-band energies. Since the human voice covers a corresponding preset frequency band, the sub-band energy information value reflects the distinctive sub-band energy distribution of the human voice, and can therefore distinguish human voice from environmental noise well.
  • The sub-band energy information value entroEng is calculated by accumulating, over the N sub-bands, the logarithm of the difference between each sub-band's energy and the average energy value, where subEng(k) is the energy of the k-th sub-band, k is 1, 2, ..., N, and aveSubEng is the average energy value of the sub-band energies.
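  • Following the same pattern as the time-domain feature, a sketch of the sub-band energy information value described above; the absolute value and epsilon are again assumptions added to keep the logarithm defined:

      import math

      def subband_info_value(sub_eng, eps=1e-12):
          # entroEng: accumulate the log of the difference between each sub-band
          # energy subEng(k) and the average sub-band energy aveSubEng.
          ave_sub_eng = sum(sub_eng) / len(sub_eng)
          return sum(math.log(abs(e - ave_sub_eng) + eps) for e in sub_eng)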
  • S105 Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively.
  • In this embodiment, the time-domain envelope information gate threshold envlopEngThrd of the audio signal of the current frame may be updated according to the minimum of the time-domain envelope information value envlopEng within a first preset time range before the current time, and the sub-band energy information gate threshold of the audio signal of the current frame may be updated according to the minimum of the sub-band energy information value entroEng within that same first preset time range. That is, the time-domain envelope information gate threshold and the sub-band energy information gate threshold are adjusted as the call scene changes.
  • If the environmental noise is loud within the first preset time range before the current time, the time-domain envelope information gate threshold and the sub-band energy information gate threshold are each raised to different degrees; if the environment is quieter within that time range, the two gate thresholds are correspondingly lowered to different degrees.
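  • One way to realize the threshold update described above is minimum tracking over a sliding window: the gate threshold follows the minimum of the feature value over the first preset time range, plus an optional margin. The window length and margin below are illustrative assumptions, not values from the patent:

      from collections import deque

      class MinTrackingThreshold:
          # Gate threshold that follows the minimum feature value seen within a
          # sliding window of recent frames (window size and margin are assumptions).
          def __init__(self, window_frames=200, margin=1.0):
              self.history = deque(maxlen=window_frames)
              self.margin = margin

          def update(self, feature_value):
              self.history.append(feature_value)
              return min(self.history) + self.margin  # current gate threshold

  • With such a scheme, a noisier recent history raises the tracked minimum and hence the threshold, while a quieter history lowers it, which matches the behavior described above.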
  • S106 Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • In this embodiment, a feature-based speech probability function maps each frame of audio signal to a probability value.
  • For the time-domain feature, first obtain the difference between the time-domain envelope information value and the time-domain envelope information gate threshold; then normalize this difference to obtain the first human voice probability value.
  • In this embodiment, the first human voice probability value SpeechProb1 is calculated according to the following formula:
  • SpeechProb1 = sigmoid(envlopEng - envlopEngThrd), where envlopEng is the time-domain envelope information value, and envlopEngThrd is the time-domain envelope information gate threshold.
  • S107 Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • In this embodiment, the feature-based speech probability function likewise maps each frame of audio signal to a probability value.
  • For the frequency-domain feature, first obtain the difference between the sub-band energy information value and the sub-band energy information gate threshold; then normalize this difference to obtain the second human voice probability value.
  • In this embodiment, the second human voice probability value SpeechProb2 is calculated according to the following formula:
  • SpeechProb2 = sigmoid(entroEng - entroEngThrd), where entroEng is the sub-band energy information value, and entroEngThrd is the sub-band energy information gate threshold.
  • S108 Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • In step S108, the human voice probability value of the audio signal of the current frame is calculated as the product of the first human voice probability value and the second human voice probability value.
  • In this embodiment, the human voice probability value SpeechProb is calculated by the following formula:
  • SpeechProb = SpeechProb1 * SpeechProb2, where SpeechProb1 is the first human voice probability value, and SpeechProb2 is the second human voice probability value.
  • In this embodiment, the human voice probability value of the current frame of audio signal is synthesized from the first human voice probability value calculated from the time-domain feature and the second human voice probability value calculated from the frequency-domain feature.
  • In other embodiments, weight values may be set for the time-domain and frequency-domain dimensions respectively, and the final human voice probability value is calculated from the first human voice probability value and the time-domain weight value together with the second human voice probability value and the frequency-domain weight value.
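  • A sketch of steps S106 to S108 under these formulas, mapping each feature-minus-threshold difference through a sigmoid and combining the two probabilities; the optional weighted combination mentioned above is included, with the weights as illustrative assumptions:

      import math

      def sigmoid(x):
          return 1.0 / (1.0 + math.exp(-x))

      def voice_probability(envlop_eng, envlop_thrd, entro_eng, entro_thrd,
                            w_time=None, w_freq=None):
          # SpeechProb from the time-domain and frequency-domain features.
          # By default the product of the two probabilities is used; if weights
          # are supplied, a weighted combination is used instead.
          p1 = sigmoid(envlop_eng - envlop_thrd)  # SpeechProb1 (time domain)
          p2 = sigmoid(entro_eng - entro_thrd)    # SpeechProb2 (frequency domain)
          if w_time is None or w_freq is None:
              return p1 * p2                      # SpeechProb = SpeechProb1 * SpeechProb2
          return w_time * p1 + w_freq * p2        # weighted alternative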
  • Fig. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human voice detection method includes the steps:
  • S200 Perform preprocessing on the audio signal in the audio sample, where the preprocessing method includes at least one of resampling processing, noise reduction processing, howling suppression processing, and echo cancellation processing.
  • The resampling process includes at least one of upsampling and downsampling: in the upsampling process the audio signal is subjected to interpolation processing, and in the downsampling process the audio signal is subjected to extraction (decimation) processing.
  • Noise reduction processing refers to removing the noise component of the audio signal. Howling suppression processing refers to eliminating howling in the audio signal, for example by frequency equalization, which adjusts the frequency response of the system toward an approximately flat line so that the gain at each frequency is essentially the same, or by other howling-suppression approaches. Echo cancellation processing can be achieved through echo cancellation (EC) technology; echo is divided into acoustic echo and line echo, and the corresponding techniques are acoustic echo cancellation (AEC) and line echo cancellation (LEC).
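  • A minimal sketch of the resampling part of the preprocessing in S200; for simplicity it uses linear interpolation for both upsampling and downsampling and omits the anti-aliasing filtering, noise reduction, howling suppression, and echo cancellation that a real pipeline would add:

      import numpy as np

      def resample_linear(samples, src_rate, dst_rate):
          # Resample a 1-D signal to a new rate via linear interpolation
          # (a simplification; production code would low-pass filter before downsampling).
          if dst_rate == src_rate:
              return np.asarray(samples, dtype=float)
          duration = len(samples) / src_rate
          n_out = int(round(duration * dst_rate))
          t_src = np.arange(len(samples)) / src_rate
          t_dst = np.arange(n_out) / dst_rate
          return np.interp(t_dst, t_src, samples)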
  • S201 Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • S202 Acquire a time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
  • S203 Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • S204 Obtain a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • S205 Determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
  • S206 Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • S207 Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • S208 Obtain summary information based on the first human voice probability value and the second human voice probability value, and upload the summary information to a blockchain.
  • S209 Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • For steps S201 to S207 and step S209, refer to the description of the first embodiment; details are not repeated here.
  • In step S208, corresponding summary information is obtained based on the first human voice probability value and the second human voice probability value. Specifically, the summary information is obtained by hash processing of the first human voice probability value and the second human voice probability value, for example using the SHA-256 algorithm.
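  • A hedged illustration of how the summary information in step S208 might be produced with SHA-256; the string serialization of the two probability values is an assumption, since the patent does not specify the exact input format:

      import hashlib

      def summary_digest(speech_prob1, speech_prob2):
          # Hash the two human voice probability values into summary information
          # (the serialization below is an illustrative assumption).
          payload = f"{speech_prob1:.6f}|{speech_prob2:.6f}".encode("utf-8")
          return hashlib.sha256(payload).hexdigest()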
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain to verify whether the first human voice probability value and the second human voice probability value have been tampered with.
  • The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer
  • In step S210, if the human voice probability value of the audio signal of the current frame is greater than or equal to the first required probability, the audio signal of the current frame is determined to be a human voice frame;
  • the audio signal of the current frame is then encoded to obtain a first encoded audio stream, and the first encoded audio stream is sent.
  • In step S210, if the human voice probability value of the audio signal of the current frame is less than the first required probability, the audio signal of the current frame is determined to be a non-human-voice frame; the audio signal of the current frame is encoded to obtain a second encoded audio stream, and the second encoded audio stream is sent.
  • In this embodiment, a non-human-voice frame may be normalized to a silent frame by modifying its digital signal values. If the current frame of audio signal is determined to be a non-human-voice frame (an environmental noise frame or a silent frame), the transmission of non-human voice can be reduced in a call application, effectively reducing bandwidth occupation, improving bandwidth utilization, reducing transmission delay, and enhancing the customer call experience.
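  • A sketch of the decision in step S210 and of the optional normalization of a non-human-voice frame to a silent frame; the value of the required probability and the zeroing of the samples are illustrative assumptions:

      import numpy as np

      def classify_and_prepare(frame, speech_prob, required_prob=0.5):
          # Decide whether the current frame is a human voice frame; if not,
          # normalize it to a silent frame by zeroing its samples (an assumption).
          if speech_prob >= required_prob:
              return "voice", frame                 # encode and send as the first stream
          return "non-voice", np.zeros_like(frame)  # encode and send as the second stream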
  • FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application.
  • The device 30 includes a time-domain feature extraction module 31, a time-domain feature calculation module 32, a frequency-domain feature extraction module 33, a frequency-domain feature calculation module 34, a gate threshold determination module 35, a time-domain human voice detection module 36, a frequency-domain human voice detection module 37, and a human voice probability calculation module 38.
  • The time-domain feature extraction module 31 is used to obtain the time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • the time-domain feature calculation module 32 is configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
  • the frequency domain feature extraction module 33 is configured to obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • the frequency domain feature calculation module 34 is configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • the gate threshold determination module 35 is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
  • the time domain human voice detection module 36 is configured to obtain the first human voice probability value of the current frame audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • the frequency domain human voice detection module 37 is configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • the human voice probability calculation module 38 is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
  • the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
  • the memory 42 stores program instructions for implementing the human voice detection method of any of the above embodiments.
  • the processor 41 is configured to execute program instructions stored in the memory 42 to perform human voice detection.
  • The processor 41 may also be referred to as a CPU (Central Processing Unit).
  • The processor 41 may be an integrated circuit chip with signal processing capabilities.
  • The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of the application.
  • The storage medium of the embodiment of the present application stores program instructions 51 capable of implementing all of the above-mentioned human voice detection methods.
  • the storage medium may be non-volatile or volatile.
  • The program instructions 51 may be stored in the above-mentioned storage medium in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage devices include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks, as well as terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit can be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of this application and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided are a human voice detection method, apparatus (30), electronic device (40), and storage medium, relating to the technical field of artificial intelligence, said method comprising: obtaining time-domain envelope information by means of an audio signal of a current frame and an audio signal of a previous plurality of frames (S101, S201); obtaining the energy of each sub-band by means of the audio signal of the current frame (S103, S203); performing time-domain data analysis on time-domain envelope information, and performing frequency-domain data analysis on the energy of each sub-band; according to the analysis results, calculating a first human-voice detection probability value in the time-domain dimension and a second human-voice detection probability value in the frequency-domain dimension, respectively, of the audio signal of the current frame (S106, S107, S206, S207); according to the comprehensive calculation of the two human-voice detection probability values, obtaining a human-voice probability value of the current frame (S108, S209). By the described means, the accuracy of human voice detection is increased, accurate differentiation is made between human voice and non-stationary noise, effectively preventing damage to the human voice; at the same time, the non-stationary noise suppression effect is improved, and changes in a call scenario are adapted to by means of updating the gate threshold value, enabling rapid tracking of a valid human voice.

Description

Human voice detection method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 24, 2020 with application number 202010723751.1 and entitled "Human Voice Detection Method, Apparatus, Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
[Technical Field]
This application relates to the field of audio processing technology and to the field of artificial intelligence, and in particular to a human voice detection method, apparatus, device, and storage medium.
[Background Art]
VAD (Voice Activity Detection) is widely used in voice coding. Its purpose is to identify and remove long silent periods from the voice signal stream so as to save voice channel resources without reducing quality of service, and it is an important component of IP telephony applications. For example, not sending packets during silence saves valuable bandwidth resources and helps reduce the end-to-end delay perceived by users. However, current VAD technology generally can only distinguish silence from non-silence; if human voices and non-human voices could be further identified, voice coding could further improve bandwidth utilization.
At the same time, distinguishing human voice from non-human voice plays a key role in noise suppression technology. Noise suppression is a typical application of audio pre- and post-processing and underpins the performance of a call product; treating non-human voice as noise to be tracked and suppressed can greatly improve noise suppression performance.
The inventor realizes that human voice detection in prior-art noise suppression adapts part of the VAD technology to track noise. Such techniques suppress stationary noise well but suppress non-stationary noise poorly.
Therefore, it is necessary to provide a new human voice detection method.
[Summary of the Invention]
The purpose of this application is to provide a human voice detection method, apparatus, and storage medium to solve the technical problem in the prior art that non-stationary noise is poorly suppressed because human voice and non-stationary noise cannot be accurately distinguished.
The technical solution of the present application is as follows: a human voice detection method is provided, including:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Preferably, the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample includes:
obtaining the maximum value of each frame of audio signal in the audio sample;
calculating the mean of the maximum values of the most recent multiple frames of audio signal in the audio sample and taking the mean as the average envelope value, where the most recent multiple frames of audio signal include the current frame of audio signal and the multiple frames of audio signal before the current frame, and taking the maximum values of the most recent multiple frames of audio signal together with the average envelope value as the time-domain envelope information.
Another technical solution of the present application is as follows: a human voice detection device is provided, including:
a time-domain feature extraction module, used to obtain the time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
a time-domain feature calculation module, configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
a frequency-domain feature extraction module, used to obtain the frequency-domain signal corresponding to the audio signal of the current frame and to obtain the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
a frequency-domain feature calculation module, configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
a gate threshold determination module, used to determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
a time-domain human voice detection module, configured to obtain the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
a frequency-domain human voice detection module, configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
a human voice probability calculation module, configured to obtain the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Another technical solution of the present application is as follows: an electronic device is provided, the device including a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Another technical solution of the present application is as follows: a storage medium is provided, the storage medium storing program instructions, and the following steps are implemented when the program instructions are executed by a processor:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
The beneficial effects of the present application are as follows: the human voice detection method, apparatus, device, and storage medium of the present application obtain time-domain envelope information from the audio signal of the current frame and the audio signals of the previous multiple frames, and obtain the energy of each sub-band from the audio signal of the current frame; time-domain data analysis is then performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies, and from the two analysis results the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame are calculated; finally, the human voice probability value of the current frame is computed from the two detection probability values. In this way, the accuracy of human voice detection is increased, human voice can be accurately distinguished from non-stationary noise, damage to the human voice is effectively avoided, and the suppression of non-stationary noise is improved. In addition, updating the gate thresholds adapts the detector to changes in the call scene, enabling fast tracking of effective human voice.
[Description of the Drawings]
FIG. 1 is a schematic flowchart of the human voice detection method according to the first embodiment of this application;
FIG. 2 is a schematic flowchart of the human voice detection method according to the second embodiment of this application;
FIG. 3 is a schematic structural diagram of the human voice detection device according to the third embodiment of this application;
FIG. 4 is a schematic structural diagram of the human voice detection device according to the fourth embodiment of this application;
FIG. 5 is a schematic structural diagram of the storage medium according to the fifth embodiment of this application.
[Detailed Description of the Embodiments]
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
The terms “first”, “second”, and “third” in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined with “first”, “second”, and “third” may explicitly or implicitly include at least one such feature. In the description of this application, “a plurality of” means at least two, such as two or three, unless otherwise specifically defined. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement, and so on between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly. In addition, the terms “including” and “having” and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to that process, method, product, or device.
Reference to an “embodiment” herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
在本申请实施例中,每一帧音频信号是单位时间内的音频原始数字信号,该帧音频信号可以是静音帧、人声帧或环境噪声帧中的任意一种。其中,静音帧是指没有能量的原始音频数字信号帧;人声帧和环境噪声帧均为有能量的原始音频数字信号帧,环境噪声帧和静音帧为非人声帧;人声帧中的主要声音是人说话时发出的声音,人声帧为音频原始数字信号中人声占比较大的音频信号;环境噪声帧中的主要声音不是人说话时发出的声音,环境噪声帧为音频原始数字信号中人声占比较小的音频信号。在本实施例中,对每一帧音频信号进行人声检测,确定当前帧音频信号是否为人声帧,由于静音帧与人声帧容易区别,人声检测时主要是区分该帧音频信号为环境噪声帧还是人声帧。In the embodiment of the present application, each frame of audio signal is an audio original digital signal within a unit time, and the frame of audio signal may be any one of a silent frame, a human voice frame, or an environmental noise frame. Among them, the silent frame refers to the original audio digital signal frame without energy; the human voice frame and the environmental noise frame are both the original audio digital signal frame with energy, and the environmental noise frame and the silent frame are non-human voice frames; The main sound is the sound made when a person speaks. The human voice frame is the audio signal in which the human voice accounts for a larger proportion of the original audio digital signal; the main sound in the environmental noise frame is not the sound made by the person talking, and the environmental noise frame is the original audio digital signal The human voice accounts for a relatively small audio signal in the signal. In this embodiment, human voice detection is performed on each frame of audio signal to determine whether the audio signal of the current frame is a human voice frame. Since the silent frame is easily distinguished from the human voice frame, the human voice detection is mainly to distinguish the audio signal of the frame as the environment The noise frame is still the human voice frame.
在本申请本实施例中,通过当前帧音频信号和前多帧音频信号获取时域包络信息,通过当前帧音频信号获取各子带能量,再对时域包络信息进行时域数据分析,对各子带能量进行频域数据分析,根据两个分析结果分别计算当前帧音频信号的时域维度的第一人声检测概率值和频域维度的第二人声检测概率值,最后根据两个人声检测概率值综合计算得出当前帧是否为人声帧。In this embodiment of the present application, the time domain envelope information is obtained from the audio signal of the current frame and the audio signals of the previous multiple frames, the energy of each subband is obtained from the audio signal of the current frame, and time domain data analysis is performed on the time domain envelope information. Perform frequency domain data analysis on the energy of each subband, and calculate the first human voice detection probability value in the time domain dimension and the second human voice detection probability value in the frequency domain dimension of the audio signal of the current frame according to the two analysis results. Finally, according to the two analysis results, The personal voice detection probability value is comprehensively calculated to determine whether the current frame is a human voice frame.
图1是本申请第一实施例的人声检测方法的流程示意图。需注意的是，若有实质上相同的结果，本申请的方法并不以图1所示的流程顺序为限。如图1所示，该人声检测方法包括步骤：Fig. 1 is a schematic flowchart of a human voice detection method according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human voice detection method includes the following steps:
S101,根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。S101: Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
在步骤S101中，获取的为最近多帧音频信号的时域包络信息，第一个包络信息为每帧音频信号的最大值vmax，第二个包络信息为最大值的均值（平均包络值envelopeAve）。具体地，当需要对待检测的音频样本进行人声检测时，先对音频样本进行分帧，其中，每一帧音频信号包括多个采样点，每个采样点具有幅度。每一帧音频信号的最大值为该音频信号的各个采样点幅度的最大值，设第t帧音频信号包括n个采样点，n个采样点分别为Xt(1)，Xt(2)，……，Xt(n)，其中，Xt(n)表示第t帧音频信号中第n个采样点，于是，第t帧音频信号的最大值vmax=max(Xt(1)，Xt(2)，……，Xt(n))。In step S101, the time-domain envelope information of the most recent multiple frames of the audio signal is acquired. The first piece of envelope information is the maximum value vmax of each frame of the audio signal, and the second is the mean of those maxima (the average envelope value envelopeAve). Specifically, when human voice detection needs to be performed on the audio sample to be detected, the audio sample is first divided into frames, where each frame of the audio signal includes a plurality of sampling points and each sampling point has an amplitude. The maximum value of each frame is the maximum amplitude over its sampling points. Suppose the t-th frame includes n sampling points Xt(1), Xt(2), ..., Xt(n), where Xt(n) denotes the n-th sampling point of the t-th frame; then the maximum value of the t-th frame is vmax = max(Xt(1), Xt(2), ..., Xt(n)).
在步骤S101中，记录每帧音频信号的最大值vmax，再利用最近M帧音频信号的最大值（vmax(1)，vmax(2)，……，vmax(M)）计算平均包络值envelopeAve。最近M帧音频信号包括当前帧音频信号（第M帧）以及位于当前帧音频信号之前的M-1帧音频信号（第1帧，第2帧，……，第M-1帧），将该M-1帧音频信号以及当前帧音频信号的最大值进行累加得到累加值，再将累加值除以M计算平均包络值，即envelopeAve = (vmax(1) + vmax(2) + … + vmax(M)) / M。In step S101, the maximum value vmax of each frame of the audio signal is recorded, and the average envelope value envelopeAve is then calculated from the maxima of the most recent M frames (vmax(1), vmax(2), ..., vmax(M)). The most recent M frames include the current frame (the M-th frame) and the M-1 frames preceding it (the 1st frame, the 2nd frame, ..., the (M-1)-th frame). The maxima of these M-1 frames and of the current frame are accumulated, and the accumulated sum is divided by M to obtain the average envelope value, i.e. envelopeAve = (vmax(1) + vmax(2) + … + vmax(M)) / M.
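As a minimal sketch of the framing and envelope computation described above, the step could look like the following in Python/NumPy; the frame length, the value of M, and the use of absolute amplitudes are assumptions for illustration rather than values fixed by the patent.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames; the trailing remainder is dropped."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def envelope_info(frames, M):
    """Return all per-frame maxima and the average envelope value for the current (last) frame."""
    vmax = np.max(np.abs(frames), axis=1)   # vmax(t) = max over the sampling points of frame t
    recent = vmax[-M:]                      # current frame plus the M-1 preceding frames
    envelope_ave = float(recent.mean())     # envelopeAve = (1/M) * sum of vmax(i)
    return vmax, envelope_ave
```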
S102,根据所述时域包络信息获取当前帧音频信号的时域包络信息值。S102: Acquire a time domain envelope information value of the audio signal of the current frame according to the time domain envelope information.
在步骤S102中，根据步骤S101获取的音频信号的时域包络信息进行时域数据分析，对时域包络信息进行量化，得到时域包络信息值（时域包络信息的量化值）。在本实施例中，对于当前帧音频信号，时域包络信息通过以下方式进行量化计算：首先，获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值；然后，将每帧音频信号的差值进行对数运算，得到所述差值对应的对数值；最后，将每帧音频信号的对数值进行累加，得到当前帧音频信号的时域包络信息值。在本实施例中，由于时域包络信息是根据最近多帧音频信号获取的，人声的时域包络可以看成平滑的曲线，与环境噪声表现出的特征不同，因此，时域包络信息值能够很好地反映出声音的变化，利用时域包络信息值能够准确检测出是否有人声出现。In step S102, time-domain data analysis is performed on the time-domain envelope information obtained in step S101, and the envelope information is quantized to obtain the time-domain envelope information value (the quantized value of the time-domain envelope information). In this embodiment, for the audio signal of the current frame, the time-domain envelope information is quantized as follows: first, the difference between the maximum value of each of the most recent frames and the average envelope value is obtained; then, a logarithm is taken of each frame's difference to obtain the corresponding logarithmic value; finally, the logarithmic values of all frames are accumulated to obtain the time-domain envelope information value of the current frame. Because the time-domain envelope information is obtained from the most recent multiple frames, the time-domain envelope of a human voice can be regarded as a smooth curve, which differs from the characteristics exhibited by environmental noise. The time-domain envelope information value therefore reflects changes in the sound well and can be used to accurately detect whether a human voice is present.
具体地，按照如下公式计算时域包络信息值envlopEng：Specifically, the time-domain envelope information value envlopEng is calculated according to the following formula:
envlopEng = log(vMax(1) - envelopeAve) + log(vMax(2) - envelopeAve) + … + log(vMax(M) - envelopeAve)
其中，vMax(i)为最近M帧音频信号中第i帧音频信号的最大值，i为1，2，……，M，envelopeAve为平均包络值。Among them, vMax(i) is the maximum value of the i-th frame among the most recent M frames of audio signal, i is 1, 2, ..., M, and envelopeAve is the average envelope value.
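Following the formula above, a hedged sketch of the quantization in step S102 might look as follows; the epsilon that keeps the logarithm defined when a frame maximum falls below the average envelope is an added assumption, not part of the patent text.

```python
import numpy as np

def envelope_info_value(vmax_recent, envelope_ave, eps=1e-12):
    """envlopEng = sum over i of log(vMax(i) - envelopeAve), over the most recent M frames."""
    diff = np.maximum(np.asarray(vmax_recent) - envelope_ave, eps)  # clamp so log() stays defined
    return float(np.sum(np.log(diff)))
```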
S103,获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。S103: Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
在步骤S103中，当前帧音频信号为时域信号，要对该信号提取频域特征，首先，通过傅里叶变换将当前帧音频信号从时域变换到频域，生成当前帧音频信号对应的频域信号；对该频域信号进行子带划分处理，计算各个子带的能量。具体地，将当前帧音频信号对应的频域信号C划分为N个子带，并设置子带的结束位置为b(1)、b(2)、……、b(k)、……b(N)，且b(0)=1，则各子带能量为subEng(k)。In step S103, the audio signal of the current frame is a time-domain signal, and frequency-domain features need to be extracted from it. First, the current frame is transformed from the time domain to the frequency domain through a Fourier transform to generate the corresponding frequency-domain signal; the frequency-domain signal is then divided into sub-bands, and the energy of each sub-band is calculated. Specifically, the frequency-domain signal C corresponding to the current frame is divided into N sub-bands, the end positions of the sub-bands are set to b(1), b(2), ..., b(k), ..., b(N), with b(0)=1, and the energy of each sub-band is subEng(k).
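A possible NumPy sketch of step S103 is shown below; the sub-band boundary indices b(k) are placeholders, since the patent does not fix concrete boundary values in this passage.

```python
import numpy as np

def subband_energies(frame, band_edges):
    """band_edges = [b(0), b(1), ..., b(N)] given as FFT-bin indices."""
    spectrum = np.fft.rfft(frame)          # Fourier transform: time domain -> frequency domain
    power = np.abs(spectrum) ** 2
    return np.array([power[band_edges[k]:band_edges[k + 1]].sum()   # subEng(k)
                     for k in range(len(band_edges) - 1)])
```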
S104,根据所述各子带能量获取当前帧音频信号的子带能量信息值。S104: Acquire a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
在步骤S104中，首先，根据所述各子带能量计算各子带能量的平均能量值，即将各子带能量值subEng(k)进行累加，再将累加值除以N得到平均能量值aveSubEng，即aveSubEng = (subEng(1) + subEng(2) + … + subEng(N)) / N；然后，获取每个子带的子带能量subEng(k)与平均能量值aveSubEng的差值；然后，将每个子带的差值进行对数运算，得到所述差值对应的对数值；最后，将每个子带的对数值进行累加，得到当前帧音频信号的子带能量信息值。在本实施例中，根据不同子带的子带能量与各子带能量的平均能量值计算子带能量信息值，由于人声具有对应覆盖的预设频带，该子带能量信息值能够反映出人声独特的子带能量分布特征，因此，该子带能量信息值能够很好地将人声与环境噪声进行区分。In step S104, the average energy value of the sub-band energies is first calculated from the individual sub-band energies: the sub-band energy values subEng(k) are accumulated and the accumulated sum is divided by N to obtain the average energy value aveSubEng, i.e. aveSubEng = (subEng(1) + subEng(2) + … + subEng(N)) / N. Then, the difference between the sub-band energy subEng(k) of each sub-band and the average energy value aveSubEng is obtained; a logarithm is taken of each sub-band's difference to obtain the corresponding logarithmic value; finally, the logarithmic values of all sub-bands are accumulated to obtain the sub-band energy information value of the current frame. In this embodiment, the sub-band energy information value is calculated from the sub-band energies of the different sub-bands and their average energy value. Since the human voice covers a corresponding preset frequency band, the sub-band energy information value reflects the distinctive sub-band energy distribution of the human voice and can therefore distinguish the human voice from environmental noise well.
具体地，按照如下公式计算子带能量信息值entroEng：Specifically, the sub-band energy information value entroEng is calculated according to the following formula:
entroEng = log(subEng(1) - aveSubEng) + log(subEng(2) - aveSubEng) + … + log(subEng(N) - aveSubEng)
其中，subEng(k)为第k个子带的子带能量，k为1，2，……，N，aveSubEng为各子带能量的平均能量值。Among them, subEng(k) is the sub-band energy of the k-th sub-band, k is 1, 2, ..., N, and aveSubEng is the average energy value of the sub-band energies.
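Mirroring the formula above, step S104 could be sketched as follows; as before, the epsilon guarding the logarithm is an assumption added for numerical safety.

```python
import numpy as np

def subband_energy_info_value(sub_eng, eps=1e-12):
    ave_sub_eng = float(np.mean(sub_eng))                     # aveSubEng = (1/N) * sum of subEng(k)
    diff = np.maximum(np.asarray(sub_eng) - ave_sub_eng, eps) # clamp so log() stays defined
    return float(np.sum(np.log(diff)))                        # entroEng
```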
S105,分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。S105: Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively.
在一个可选的实施方式中，当前帧音频信号的时域包络信息门阀值envlopEngThrd可以根据当前时间之前的第一预设时间范围内时域包络信息值envlopEng的最小值进行更新；当前帧音频信号的子带能量信息门阀值可以根据当前时间之前的第一预设时间范围内子带能量信息值entroEng的最小值进行更新。也就是说，时域包络信息门阀值和子带能量信息门阀值均根据通话场景的变化进行调整，若当前时间之前的第一预设时间范围内环境噪声较大时，时域包络信息门阀值和子带能量信息门阀值分别相对不同程度增大；若当前时间之前的第一预设时间范围内环境较安静时，时域包络信息门阀值和子带能量信息门阀值分别相对不同程度减小。In an optional implementation, the time-domain envelope information gate threshold envlopEngThrd of the current frame can be updated according to the minimum of the time-domain envelope information value envlopEng within a first preset time range before the current time; the sub-band energy information gate threshold of the current frame can be updated according to the minimum of the sub-band energy information value entroEng within the same first preset time range. In other words, both gate thresholds are adjusted as the call scene changes: if the environmental noise within the first preset time range before the current time is high, the time-domain envelope information gate threshold and the sub-band energy information gate threshold each increase to different degrees; if the environment within that range is relatively quiet, the two gate thresholds each decrease to different degrees.
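One way to realize the minimum-tracking threshold update of step S105 is sketched below; the window length standing in for the "first preset time range" (expressed here in frames) is an assumption, as is taking the raw minimum without any offset.

```python
from collections import deque

class MinTrackingThreshold:
    """Tracks the minimum of a feature (envlopEng or entroEng) over a recent window of frames."""
    def __init__(self, window_frames=200):
        self.history = deque(maxlen=window_frames)

    def update(self, feature_value):
        self.history.append(feature_value)
        return min(self.history)   # refreshed gate threshold for the current frame
```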
S106,根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。S106: Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
在步骤S106中，基于特征的语音概率函数将每帧音频信号映射到一个概率值得出概率值，对于时域特征，首先，获取所述时域包络信息值与所述时域包络信息门阀值的差值；然后，将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值。In step S106, a feature-based speech probability function maps each frame of the audio signal to a probability value. For the time-domain feature, the difference between the time-domain envelope information value and the time-domain envelope information gate threshold is first obtained; this difference is then normalized to obtain the first human voice probability value.
具体地,按照如下公式计算第一人声概率值SpeechProb1:Specifically, the first vocal probability value SpeechProb1 is calculated according to the following formula:
SpeechProb1=sigmoid(envlopEng-envlopEngThrd),其中,envlopEng为时域包络信息值,envlopEngThrd为时域包络信息门阀值。SpeechProb1=sigmoid(envlopEng-envlopEngThrd), where envlopEng is the time-domain envelope information value, and envlopEngThrd is the time-domain envelope information gate threshold.
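The normalization in step S106 can be read as the standard logistic sigmoid; a short sketch under that assumption is shown below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_voice_probability(envlop_eng, envlop_eng_thrd):
    return sigmoid(envlop_eng - envlop_eng_thrd)   # SpeechProb1
```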
S107,根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。S107: Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
在步骤S107中,基于特征的语音概率函数将每帧音频信号映射到一个概率值得出概率值,对于频域特征,首先,获取所述子带能量信息值与所述子带能量信息门阀值的差值;然后,将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值。In step S107, each frame of audio signal is mapped to a probability value to obtain a probability value based on the feature-based speech probability function. For the frequency domain feature, first, obtain the difference between the sub-band energy information value and the sub-band energy information gate threshold. Then, the difference between the sub-band energy information value and the sub-band energy information gate threshold value is normalized to obtain a second human voice probability value.
具体地,按照如下公式计算第二人声概率值SpeechProb2:Specifically, the second vocal probability value SpeechProb2 is calculated according to the following formula:
SpeechProb2=sigmoid(entroEng-entroEngThrd)，其中，entroEng为子带能量信息值，entroEngThrd为子带能量信息门阀值。SpeechProb2 = sigmoid(entroEng - entroEngThrd), where entroEng is the sub-band energy information value, and entroEngThrd is the sub-band energy information gate threshold.
S108,根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。S108: Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
在步骤S108中,根据第一人声概率值和第二人声概率值的乘积计算当前帧音频信号的人声概率值。具体地,人声概率值SpeechProb通过如下公式计算:In step S108, the human voice probability value of the audio signal of the current frame is calculated according to the product of the first human voice probability value and the second human voice probability value. Specifically, the speech probability value SpeechProb is calculated by the following formula:
SpeechProb=SpeechProb1*SpeechProb2,其中,SpeechProb1为第一人声概率值,SpeechProb2为第二人声概率值。SpeechProb=SpeechProb1*SpeechProb2, where SpeechProb1 is the first vocal probability value, and SpeechProb2 is the second vocal probability value.
在步骤S108中，基于人声的特征，从基于时域特征计算的第一人声概率值和基于频域特征计算的第二人声概率值综合得到当前帧音频信号的人声概率值，同时考虑时域和频域两个维度，避免只考虑单个维度，造成对人声的误判。当然，本领域技术人员可以理解，除上述人声概率值的计算方式外，在其他实施例中，可以为时域和频域两个维度分别设置不同的权重值，根据第一人声概率值和时域权重值以及第二人声概率值和频域权重值计算最终的人声概率值。In step S108, based on the characteristics of the human voice, the human voice probability value of the current frame is synthesized from the first probability value calculated from the time-domain feature and the second probability value calculated from the frequency-domain feature. Considering both the time-domain and frequency-domain dimensions avoids the misjudgment that can result from relying on a single dimension. Of course, those skilled in the art will understand that, besides the above calculation, in other embodiments different weight values may be set for the time-domain and frequency-domain dimensions, and the final human voice probability value may be calculated from the first probability value together with the time-domain weight and the second probability value together with the frequency-domain weight.
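A sketch of the combination in step S108, including the weighted alternative mentioned above; treating the weights as exponents is only one possible reading of "different weight values" and is an assumption, not the patent's stated formula.

```python
def combined_voice_probability(p1, p2, w1=None, w2=None):
    """Combine the time-domain and frequency-domain probabilities into SpeechProb."""
    if w1 is None or w2 is None:
        return p1 * p2                 # SpeechProb = SpeechProb1 * SpeechProb2
    return (p1 ** w1) * (p2 ** w2)     # hypothetical weighted variant
```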
图2是本申请第二实施例的人声检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图2所示的流程顺序为限。如图2所示,该人声检测方法包括步骤:Fig. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human voice detection method includes the steps:
S200,对音频样本中的音频信号进行预处理,所述预处理的处理方式包括重采样处理、降噪处理、啸叫抑制处理、回声消除处理中的至少一种。S200: Perform preprocessing on the audio signal in the audio sample, where the preprocessing method includes at least one of resampling processing, noise reduction processing, howling suppression processing, and echo cancellation processing.
在步骤S200中，重采样处理包括向上重采样处理和向下重采样处理中的至少一种，在向上重采样处理时，对该音频信号进行差值处理，在向下重采样处理时，对该音频信号进行抽取处理；降噪处理是指对音频信号中的噪声部分进行消除的处理方式；啸叫抑制处理是指对音频信号中出现的啸叫情况进行消除，可以采用如频率均衡法，通过将系统的频率响应调成近似的直线，使各频率的增益基本一致消除啸叫等方式进行啸叫抑制；回声消除处理可以通过回声消除（Echo Cancellation，EC）技术实现，回声分为声学回音（Acoustic Echo）和线路回音（Line Echo），相应的回声消除技术对应有声学回声消除（Acoustic Echo Cancellation，AEC）和线路回声消除（Line Echo Cancellation，LEC）。In step S200, the resampling processing includes at least one of up-sampling and down-sampling: during up-sampling the audio signal is interpolated, and during down-sampling it is decimated. Noise reduction refers to removing the noise component of the audio signal. Howling suppression refers to eliminating howling that appears in the audio signal, for example by frequency equalization, in which the frequency response of the system is adjusted toward an approximately flat line so that the gain at each frequency is essentially uniform and the howling is eliminated. Echo cancellation can be implemented with echo cancellation (EC) technology; echoes are divided into acoustic echo and line echo, with the corresponding techniques being acoustic echo cancellation (AEC) and line echo cancellation (LEC).
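Of the preprocessing options in step S200, only the resampling branch is easy to show compactly; the sketch below uses SciPy's polyphase resampler, and the 48 kHz to 16 kHz rates are assumptions chosen for illustration. Noise reduction, howling suppression, and echo cancellation would plug in as further stages.

```python
from math import gcd
from scipy.signal import resample_poly

def resample(x, orig_rate=48000, target_rate=16000):
    """Up-sampling interpolates, down-sampling decimates, as described above."""
    if orig_rate == target_rate:
        return x
    g = gcd(orig_rate, target_rate)
    return resample_poly(x, up=target_rate // g, down=orig_rate // g)
```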
S201,根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。S201: Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
S202,根据所述时域包络信息获取当前帧音频信号的时域包络信息值。S202: Acquire a time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
S203,获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。S203: Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
S204,根据所述各子带能量获取当前帧音频信号的子带能量信息值。S204: Obtain a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
S205,分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。S205: Determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
S206,根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。S206: Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
S207,根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。S207: Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
S208,将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。S208. Upload the first vocal probability value and the second vocal probability value to a blockchain, so that the blockchain can compare the first vocal probability value and the second vocal probability value to the The probability value is encrypted and stored.
S209,根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。S209: Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
S210,根据所述人声概率值确认当前帧音频信号是否为人声帧。S210: Confirm whether the audio signal of the current frame is a human voice frame according to the human voice probability value.
步骤S201至步骤S207以及步骤S209具体参见第一实施例的描述,在此不进行一一赘述。For details of steps S201 to S207 and step S209, refer to the description of the first embodiment, which will not be repeated here.
在步骤S208中，具体地，基于所述第一人声概率值和所述第二人声概率值得到对应的摘要信息，具体来说，摘要信息由所述第一人声概率值或所述第二人声概率值进行散列处理得到，比如利用sha256算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户设备可以从区块链中下载得该摘要信息，以便查证所述第一人声概率值和所述第二人声概率值是否被篡改。本示例所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链（Blockchain），本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性（防伪）和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。In step S208, specifically, corresponding digest information is obtained based on the first human voice probability value and the second human voice probability value. In particular, the digest is obtained by hashing the first or the second human voice probability value, for example with the sha256 algorithm. Uploading the digest to the blockchain ensures its security and its fairness and transparency toward users. A user device can download the digest from the blockchain to verify whether the first and second human voice probability values have been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
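The digest step of S208 can be sketched as below; the JSON serialization and the field names are placeholders, since the patent only specifies that a hash (for example SHA-256) of the probability values is computed before upload.

```python
import hashlib
import json

def probability_digest(speech_prob1, speech_prob2):
    """Build a SHA-256 digest over the two probability values prior to upload."""
    payload = json.dumps({"p1": speech_prob1, "p2": speech_prob2}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```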
在步骤S210中，若所述当前帧音频信号的人声概率值大于或等于第一要求概率，则判断所述当前帧音频信号为人声帧；按照所述人声帧对应的编码方式对所述当前帧音频信号进行编码，得到第一音频编码流；对所述第一音频编码流进行发送。In step S210, if the human voice probability value of the current frame is greater than or equal to a first required probability, the current frame is determined to be a human voice frame; the current frame is encoded according to the encoding mode corresponding to human voice frames to obtain a first encoded audio stream, and the first encoded audio stream is sent.
在步骤S210中，若所述当前帧音频信号的人声概率值小于第一要求概率，则判断所述当前帧音频信号为非人声帧；按照所述非人声帧对应的编码方式对所述当前帧音频信号进行编码，得到第二音频编码流；对所述第二音频编码流进行发送。具体地，对于非人声帧，可以通过对数字信号值的修改，将所述非人声帧归一化为静音帧。若确定当前帧音频信号为非人声帧（环境噪声帧或静音帧），则可在通话应用里，减少非人声的传输，有效减少对带宽的占用，提升带宽利用率，减少传输延时，提升客户通话体验。In step S210, if the human voice probability value of the current frame is less than the first required probability, the current frame is determined to be a non-human-voice frame; the current frame is encoded according to the encoding mode corresponding to non-human-voice frames to obtain a second encoded audio stream, and the second encoded audio stream is sent. Specifically, a non-human-voice frame can be normalized to a silent frame by modifying its digital signal values. If the current frame is determined to be a non-human-voice frame (an environmental noise frame or a silent frame), the transmission of non-voice content can be reduced in the call application, which effectively reduces bandwidth occupation, improves bandwidth utilization, reduces transmission delay, and improves the user's call experience.
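A minimal sketch of the decision logic in step S210 follows; the required-probability value and the zeroing of non-voice frames are assumptions, and the actual encoder is left unspecified here, as in the patent.

```python
import numpy as np

def classify_and_prepare(frame, speech_prob, required_prob=0.5):
    """Return the frame to encode and a flag indicating whether it is a human voice frame."""
    if speech_prob >= required_prob:
        return frame, True                     # human voice frame: encode and send as-is
    return np.zeros_like(frame), False         # non-voice frame: normalized to a silent frame
```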
图3为本申请第三实施例的人声检测装置的结构示意图。如图3所示,该装置30包括时域特征提取模块31、时域特征计算模块32、频域特征提取模块33、频域特征计算模块34、门阀值确定模块35、时域人声检测模块36、频域人声检测模块37和人声概率计算模块38,其中,时域特征提取模块31用于根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。时域特征计算模块32用于根据所述时域包络信息获取当前帧音频信号的时域包络信息值。频域特征提取模块33用于获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。频域特征计算模块34用于根据所述各子带能量获取当前帧音频信号的子带能量信息值。门阀值确定模块35用于分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。时域人声检测模块36用于根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。频域人声检测模块37用于根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。人声概率计算模块38用于根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application. As shown in FIG. 3, the device 30 includes a time domain feature extraction module 31, a time domain feature calculation module 32, a frequency domain feature extraction module 33, a frequency domain feature calculation module 34, a gate threshold determination module 35, and a time domain voice detection module 36. The frequency domain human voice detection module 37 and the human voice probability calculation module 38, wherein the time domain feature extraction module 31 is used to obtain the time domain envelope information according to the current frame audio signal and the previous multiple frames of audio signals in the audio sample. The time-domain feature calculation module 32 is configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information. The frequency domain feature extraction module 33 is configured to obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal. The frequency domain feature calculation module 34 is configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band. The gate threshold determination module 35 is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively. The time domain human voice detection module 36 is configured to obtain the first human voice probability value of the current frame audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value. The frequency domain human voice detection module 37 is configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value. The human voice probability calculation module 38 is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
图4是本申请第四实施例的电子设备的结构示意图。如图4所示,该电子设备40包括处理器41及和处理器41耦接的存储器42。Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in FIG. 4, the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
存储器42存储有用于实现上述任一实施例的人声检测方法的程序指令。The memory 42 stores program instructions for implementing the human voice detection method of any of the above embodiments.
处理器41用于执行存储器42存储的程序指令以进行人声检测。The processor 41 is configured to execute program instructions stored in the memory 42 to perform human voice detection.
其中,处理器41还可以称为CPU(Central Processing Unit,中央处理单元)。处理器41可能是一种集成电路芯片,具有信号的处理能力。处理器41还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 41 may also be referred to as a CPU (Central Processing Unit, central processing unit). The processor 41 may be an integrated circuit chip with signal processing capabilities. The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component . The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
参阅图5，图5为本申请第五实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有人声检测方法的程序指令51，所述存储介质可以是非易失性，也可以是易失性。其中，该程序指令51可以以软件产品的形式存储在上述存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）或处理器（processor）执行本申请各个实施方式所述方法的全部或部分步骤。而前述的存储装置包括：U盘、移动硬盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、磁碟或者光盘等各种可以存储程序代码的介质，或者是计算机、服务器、手机、平板等终端设备。Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of the application. The storage medium of this embodiment stores program instructions 51 capable of implementing all of the human voice detection methods described above; the storage medium may be non-volatile or volatile. The program instructions 51 may be stored in the storage medium in the form of a software product and include a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage devices include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative, for example, the division of units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or integrated. To another system, or some features can be ignored, or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。以上仅为本申请的实施方式，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围。In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of the present application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.
以上所述的仅是本申请的实施方式，在此应当指出，对于本领域的普通技术人员来说，在不脱离本申请创造构思的前提下，还可以做出改进，但这些均属于本申请的保护范围。The above are only implementation manners of the present application. It should be noted that those of ordinary skill in the art can make improvements without departing from the creative concept of the present application, and all such improvements fall within the scope of protection of the present application.

Claims (20)

  1. 一种人声检测方法,其中,包括:A human voice detection method, which includes:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  2. 根据权利要求1所述的人声检测方法,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The human voice detection method according to claim 1, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  3. 根据权利要求2所述的人声检测方法,其中,所述根据所述时域包络信息获取当前帧音频信号的时域包络信息值,包括:The human voice detection method according to claim 2, wherein the obtaining the time domain envelope information value of the current frame of audio signal according to the time domain envelope information comprises:
    获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值;Obtaining the difference between the maximum value of each frame of the audio signal in the most recent multiple frames of audio signal and the average envelope value;
    将每帧音频信号最大值与所述平均包络值的差值进行对数运算,得到所述差值对应的对数值;Performing a logarithmic operation on the difference between the maximum value of the audio signal of each frame and the average envelope value to obtain the logarithmic value corresponding to the difference;
    将每帧音频信号的所述对数值进行累加,得到当前帧音频信号的时域包络信息值。The logarithmic value of each frame of audio signal is accumulated to obtain the time-domain envelope information value of the audio signal of the current frame.
  4. 根据权利要求1所述的人声检测方法,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The human voice detection method according to claim 1, wherein said obtaining the frequency domain signal corresponding to the audio signal of the current frame, and obtaining the energy of each subband of the audio signal of the current frame according to the frequency domain signal comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  5. 根据权利要求1所述的人声检测方法,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The human voice detection method according to claim 1, wherein the obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  6. 根据权利要求1所述的人声检测方法,其中,确定当前帧音频信号的时域包络信息门阀值,包括:The human voice detection method according to claim 1, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
  7. 根据权利要求1所述的人声检测方法,其中,所述根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值,包括:The human voice detection method according to claim 1, wherein the obtaining the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value comprises:
    获取所述时域包络信息值与所述时域包络信息门阀值的差值;Acquiring the difference between the time domain envelope information value and the time domain envelope information gate threshold;
    将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值;Normalizing the difference between the time domain envelope information value and the time domain envelope information gate threshold value to obtain the first vocal probability value;
    所述根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值,包括:The obtaining the second human voice probability value of the current frame audio signal according to the sub-band energy information value and the sub-band energy information gate threshold value includes:
    获取所述子带能量信息值与所述子带能量信息门阀值的差值;Acquiring the difference between the sub-band energy information value and the sub-band energy information gate threshold;
    将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值;Normalizing the difference between the sub-band energy information value and the sub-band energy information gate threshold value to obtain a second vocal probability value;
    所述根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值之前,还包括:Before obtaining the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value, the method further includes:
    将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。Upload the first vocal probability value and the second vocal probability value to the blockchain, so that the blockchain compares the first vocal probability value and the second vocal probability value Encrypted storage.
  8. 一种人声检测装置,其中,所述装置包括:A human voice detection device, wherein the device includes:
    时域特征提取模块,用于根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;The time domain feature extraction module is used to obtain the time domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    时域特征计算模块,用于根据所述时域包络信息获取当前帧音频信号的时域包络信息值;A time-domain feature calculation module, configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
    频域特征提取模块,用于获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;The frequency domain feature extraction module is used to obtain the frequency domain signal corresponding to the current frame audio signal, and obtain the energy of each subband of the current frame audio signal according to the frequency domain signal;
    频域特征计算模块,用于根据所述各子带能量获取当前帧音频信号的子带能量信息值;A frequency domain feature calculation module, configured to obtain the subband energy information value of the audio signal of the current frame according to the energy of each subband;
    门阀值确定模块，用于分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值；The gate threshold determination module is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively;
    时域人声检测模块,用于根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;A time-domain vocal detection module, configured to obtain the first vocal probability value of the current frame of audio signal according to the time-domain envelope information value and the time-domain envelope information gate threshold;
    频域人声检测模块,用于根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;A frequency domain vocal detection module, configured to obtain the second vocal probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    人声概率计算模块,用于根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。The human voice probability calculation module is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  9. 一种电子设备,其中,所述设备包括处理器、以及与所述处理器耦接的存储器,所述存储器存储有可被所述处理器执行的程序指令;所述处理器执行所述存储器存储的所述程序指令时实现以下步骤:An electronic device, wherein the device includes a processor and a memory coupled to the processor, and the memory stores program instructions executable by the processor; the processor executes the memory storage The following steps are implemented when the program instructions are:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  10. 根据权利要求9所述的电子设备,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The electronic device according to claim 9, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  11. 根据权利要求10所述的电子设备,其中,所述根据所述时域包络信息获取当前帧音频信号的时域包络信息值,包括:The electronic device according to claim 10, wherein said obtaining the time domain envelope information value of the current frame of audio signal according to the time domain envelope information comprises:
    获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值;Obtaining the difference between the maximum value of each frame of the audio signal in the most recent multiple frames of audio signal and the average envelope value;
    将每帧音频信号最大值与所述平均包络值的差值进行对数运算,得到所述差值对应的对数值;Performing a logarithmic operation on the difference between the maximum value of the audio signal of each frame and the average envelope value to obtain the logarithmic value corresponding to the difference;
    将每帧音频信号的所述对数值进行累加,得到当前帧音频信号的时域包络信息值。The logarithmic value of each frame of audio signal is accumulated to obtain the time-domain envelope information value of the audio signal of the current frame.
  12. 根据权利要求9所述的电子设备,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The electronic device according to claim 9, wherein said obtaining the frequency domain signal corresponding to the current frame of audio signal, and obtaining each subband energy of the current frame of audio signal according to the frequency domain signal, comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  13. 根据权利要求9所述的电子设备,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The electronic device according to claim 9, wherein the obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  14. 根据权利要求9所述的电子设备,其中,确定当前帧音频信号的时域包络信息门阀值,包括:The electronic device according to claim 9, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
  15. 根据权利要求9所述的电子设备,其中,所述根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值,包括:9. The electronic device according to claim 9, wherein the obtaining the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value comprises:
    获取所述时域包络信息值与所述时域包络信息门阀值的差值;Acquiring the difference between the time domain envelope information value and the time domain envelope information gate threshold;
    将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值;Normalizing the difference between the time domain envelope information value and the time domain envelope information gate threshold value to obtain the first vocal probability value;
    所述根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值,包括:The obtaining the second human voice probability value of the current frame audio signal according to the sub-band energy information value and the sub-band energy information gate threshold value includes:
    获取所述子带能量信息值与所述子带能量信息门阀值的差值;Acquiring the difference between the sub-band energy information value and the sub-band energy information gate threshold;
    将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值;Normalizing the difference between the sub-band energy information value and the sub-band energy information gate threshold value to obtain a second vocal probability value;
    所述根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值之前,还包括:Before obtaining the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value, the method further includes:
    将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。Upload the first vocal probability value and the second vocal probability value to the blockchain, so that the blockchain compares the first vocal probability value and the second vocal probability value Encrypted storage.
  16. 一种存储介质，其中，所述存储介质内存储有程序指令，所述程序指令被处理器执行时实现以下步骤：A storage medium, wherein program instructions are stored in the storage medium, and the following steps are implemented when the program instructions are executed by a processor:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  17. 根据权利要求16所述的存储介质,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The storage medium according to claim 16, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  18. 根据权利要求16所述的存储介质,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The storage medium according to claim 16, wherein said obtaining the frequency domain signal corresponding to the current frame audio signal, and obtaining each subband energy of the current frame audio signal according to the frequency domain signal comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  19. 根据权利要求16所述的存储介质,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The storage medium according to claim 16, wherein the obtaining the subband energy information value of the audio signal of the current frame according to the energy of each subband comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  20. 根据权利要求16所述的存储介质，其中，确定当前帧音频信号的时域包络信息门阀值，包括：The storage medium according to claim 16, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
PCT/CN2020/123198 2020-07-24 2020-10-23 Human voice detection method, apparatus, device, and storage medium WO2021135547A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010723751.1A CN111883182B (en) 2020-07-24 2020-07-24 Human voice detection method, device, equipment and storage medium
CN202010723751.1 2020-07-24

Publications (1)

Publication Number Publication Date
WO2021135547A1 true WO2021135547A1 (en) 2021-07-08

Family

ID=73200498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123198 WO2021135547A1 (en) 2020-07-24 2020-10-23 Human voice detection method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111883182B (en)
WO (1) WO2021135547A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112669878B (en) * 2020-12-23 2024-04-19 北京声智科技有限公司 Sound gain value calculation method and device and electronic equipment
CN112967738A (en) * 2021-02-01 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 Human voice detection method and device, electronic equipment and computer readable storage medium
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN113936694B (en) * 2021-12-17 2022-03-18 珠海普林芯驰科技有限公司 Real-time human voice detection method, computer device and computer readable storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7567900B2 (en) * 2003-06-11 2009-07-28 Panasonic Corporation Harmonic structure based acoustic speech interval detection method and device
CN110111811B (en) * 2019-04-18 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110782907B (en) * 2019-11-06 2023-11-28 腾讯科技(深圳)有限公司 Voice signal transmitting method, device, equipment and readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103636A1 (en) * 2001-01-26 2002-08-01 Tucker Luke A. Frequency-domain post-filtering voice-activity detector
CN101763856A (en) * 2008-12-23 2010-06-30 华为技术有限公司 Signal classifying method, classifying device and coding system
CN102044242A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method, device and electronic equipment for voice activity detection
CN102324229A (en) * 2011-09-08 2012-01-18 中国科学院自动化研究所 Method and system for detecting abnormal use of voice input equipment
CN103646649A (en) * 2013-12-30 2014-03-19 中国科学院自动化研究所 High-efficiency voice detecting method
CN106098076A (en) * 2016-06-06 2016-11-09 成都启英泰伦科技有限公司 A kind of based on dynamic noise estimation time-frequency domain adaptive voice detection method

Also Published As

Publication number Publication date
CN111883182B (en) 2024-03-19
CN111883182A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2021135547A1 (en) Human voice detection method, apparatus, device, and storage medium
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
KR100636317B1 (en) Distributed Speech Recognition System and method
US20190172480A1 (en) Voice activity detection systems and methods
US10552114B2 (en) Auto-mute redundant devices in a conference room
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
CN113766073B (en) Howling detection in conference systems
WO2020037555A1 (en) Method, device, apparatus, and system for evaluating microphone array consistency
CN110177317B (en) Echo cancellation method, echo cancellation device, computer-readable storage medium and computer equipment
WO2014114049A1 (en) Voice recognition method and device
JP6058824B2 (en) Personalized bandwidth extension
CN112004177B (en) Howling detection method, microphone volume adjustment method and storage medium
EP3757993B1 (en) Pre-processing for automatic speech recognition
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
Zhang et al. An efficient perceptual hashing based on improved spectral entropy for speech authentication
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
WO2021143249A1 (en) Transient noise suppression-based audio processing method, apparatus, device, and medium
CN114627899A (en) Sound signal detection method and device, computer readable storage medium and terminal
WO2023124556A1 (en) Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium
CN112133324A (en) Call state detection method, device, computer system and medium
Nahma et al. An adaptive a priori SNR estimator for perceptual speech enhancement
CN114678038A (en) Audio noise detection method, computer device and computer program product
Jahanirad et al. Blind source computer device identification from recorded VoIP calls for forensic investigation
CN112382296A (en) Method and device for voiceprint remote control of wireless audio equipment
Nemade et al. Performance comparison of single channel Speech enhancement techniques for personal Communication

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20909036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20909036

Country of ref document: EP

Kind code of ref document: A1