WO2021135547A1 - Human voice detection method, apparatus, device, and storage medium - Google Patents

Human voice detection method, apparatus, device, and storage medium

Info

Publication number
WO2021135547A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2020/123198
Other languages
French (fr)
Chinese (zh)
Inventor
付姝华
汪斌
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135547A1 publication Critical patent/WO2021135547A1/en

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/21: the extracted parameters being power information
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786: Adaptive threshold

Definitions

  • This application relates to the field of audio processing technology, and also to the field of artificial intelligence, and in particular to a human voice detection method, apparatus, device, and storage medium.
  • VAD (Voice Activity Detection) is widely used in voice coding. Its purpose is to identify and remove long silent periods from the voice signal stream so as to save voice channel resources without reducing quality of service, and it is an important component of IP telephony applications. For example, not sending packets during silence saves valuable bandwidth resources and helps reduce the end-to-end delay perceived by users.
  • However, current VAD technology generally can only distinguish silence from non-silence. If human voices and non-human voices could be further identified, voice coding could further improve bandwidth utilization.
  • Distinguishing human voice from non-human voice also plays a key role in noise suppression. Noise suppression is a typical application of audio pre- and post-processing and underpins the performance of a call product; treating non-human voice as noise to be tracked and suppressed can greatly improve noise suppression performance.
  • The inventor realizes that human voice detection in prior-art noise suppression adapts part of the VAD technology to track noise. This kind of technique suppresses stationary noise well, but suppresses non-stationary noise poorly.
  • the purpose of this application is to provide a human voice detection method, device and storage medium to solve the technical problem of poor suppression of non-stationary noise caused by the inability to accurately distinguish between human voice and non-stationary noise in the prior art.
  • a human voice detection method including:
  • the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample includes:
  • a human voice detection device including:
  • the time domain feature extraction module is used to obtain the time domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
  • a time-domain feature calculation module configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information
  • the frequency domain feature extraction module is used to obtain the frequency domain signal corresponding to the current frame audio signal, and obtain the energy of each subband of the current frame audio signal according to the frequency domain signal;
  • a frequency domain feature calculation module configured to obtain the subband energy information value of the audio signal of the current frame according to the energy of each subband
  • the gate threshold determination module is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively;
  • a time-domain vocal detection module configured to obtain the first vocal probability value of the current frame of audio signal according to the time-domain envelope information value and the time-domain envelope information gate threshold;
  • a frequency domain vocal detection module configured to obtain the second vocal probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
  • the human voice probability calculation module is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • an electronic device includes a processor, and a memory coupled to the processor, the memory stores program instructions that can be executed by the processor;
  • the processor executes the program instructions stored in the memory, the following steps are implemented:
  • a storage medium is provided, and program instructions are stored in the storage medium, and the following steps are implemented when the program instructions are executed by a processor:
  • The human voice detection method, apparatus, device, and storage medium of the present application obtain time-domain envelope information from the audio signal of the current frame and the audio signals of the previous multiple frames, and obtain the energy of each sub-band from the audio signal of the current frame. Time-domain data analysis is then performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies, and from the two analysis results the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame of audio signal are calculated. Finally, the human voice probability value of the current frame is computed from the two detection probability values.
  • In this way, the accuracy of human voice detection is increased, and human voice can be accurately distinguished from non-stationary noise, which effectively avoids damage to the human voice and at the same time improves the suppression of non-stationary noise.
  • In addition, updating the gate thresholds adapts the detector to changes in the call scene, so effective human voice can be tracked quickly.
  • FIG. 1 is a schematic flowchart of the human voice detection method according to the first embodiment of the application
  • FIG. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of this application.
  • FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a human voice detection device according to a fourth embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of this application.
  • The terms “first”, “second”, and “third” in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined with “first”, “second”, and “third” may explicitly or implicitly include at least one such feature.
  • In the description of this application, “a plurality of” means at least two, such as two or three, unless otherwise specifically defined. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement, and so on between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • In the embodiments of this application, each frame of audio signal is the original digital audio signal within a unit of time, and a frame of audio signal may be any one of a silent frame, a human voice frame, or an environmental noise frame.
  • A silent frame is an original digital audio signal frame without energy; human voice frames and environmental noise frames are both original digital audio signal frames with energy. Environmental noise frames and silent frames are non-human-voice frames.
  • In a human voice frame, the dominant sound is the sound made when a person speaks, i.e., the human voice accounts for a large proportion of the original digital audio signal; in an environmental noise frame, the dominant sound is not the sound of a person speaking, i.e., the human voice accounts for only a small proportion of the signal.
  • In this embodiment, human voice detection is performed on each frame of audio signal to determine whether the current frame is a human voice frame. Since silent frames are easily distinguished from human voice frames, human voice detection mainly determines whether a frame is an environmental noise frame or a human voice frame.
  • In this embodiment, time-domain envelope information is obtained from the audio signal of the current frame and the audio signals of the previous multiple frames, and the energy of each sub-band is obtained from the audio signal of the current frame.
  • Time-domain data analysis is performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies; from the two analysis results, the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame of audio signal are calculated.
  • The two human voice detection probability values are then combined to determine whether the current frame is a human voice frame.
  • Fig. 1 is a schematic flowchart of a human voice detection method according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human voice detection method includes steps:
  • S101 Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • In step S101, the time-domain envelope information of the most recent multiple frames of audio signal is acquired. The first piece of envelope information is the maximum value vmax of each frame of audio signal, and the second is the mean of these maximum values (the average envelope value envelopeAve).
  • Specifically, when human voice detection needs to be performed on an audio sample to be detected, the audio sample is first divided into frames, where each frame of the audio signal includes multiple sampling points and each sampling point has an amplitude.
  • The maximum value of each frame of audio signal is the maximum amplitude over the sampling points of that frame: if the t-th frame includes n sampling points Xt(1), Xt(2), ..., Xt(n), where Xt(n) denotes the n-th sampling point of the t-th frame, then vmax = max(Xt(1), Xt(2), ..., Xt(n)).
  • In step S101, the maximum value vmax of each frame of audio signal is recorded, and the maxima of the most recent M frames (vmax(1), vmax(2), ..., vmax(M)) are used to calculate the average envelope value envelopeAve.
  • The most recent M frames of audio signal include the current frame (the M-th frame) and the M-1 frames before it (the 1st, 2nd, ..., (M-1)-th frames).
  • The maxima of the M-1 previous frames and of the current frame are accumulated to obtain an accumulated value, and the accumulated value is divided by M to calculate the average envelope value envelopeAve.
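  • As a minimal, hedged sketch of step S101 (not the patent's reference implementation), the per-frame maximum vmax and the average envelope value envelopeAve could be computed as follows in Python; the framing helper, the use of the absolute amplitude, and the variable names are illustrative assumptions:

      import numpy as np

      def frame_signal(samples, frame_len):
          # Split a 1-D array of samples into non-overlapping frames (illustrative framing).
          n_frames = len(samples) // frame_len
          return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

      def frame_max(frame):
          # vmax: maximum amplitude over the sampling points of one frame
          # (taking the absolute value is an assumption; the patent speaks of the amplitude maximum).
          return float(np.max(np.abs(frame)))

      def average_envelope(vmax_history, M):
          # envelopeAve: mean of the maxima of the most recent M frames,
          # i.e. the current frame plus the M-1 frames before it.
          recent = vmax_history[-M:]
          return sum(recent) / len(recent)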
  • S102 Acquire a time domain envelope information value of the audio signal of the current frame according to the time domain envelope information.
  • In step S102, time-domain data analysis is performed on the time-domain envelope information obtained in step S101, and the time-domain envelope information is quantized to obtain the time-domain envelope information value (the quantized value of the time-domain envelope information).
  • In this embodiment, the time-domain envelope information is quantized as follows: first, for each frame in the most recent multiple frames of audio signal, obtain the difference between that frame's maximum value and the average envelope value; then take the logarithm of each frame's difference; finally, accumulate the logarithmic values of all frames to obtain the time-domain envelope information value of the audio signal of the current frame.
  • Because the time-domain envelope information is obtained from the most recent multiple frames of audio signal, the time-domain envelope of a human voice can be regarded as a smooth curve, which differs from the characteristics of environmental noise. The time-domain envelope information value therefore reflects changes in the sound well and can be used to accurately detect whether a human voice is present.
  • The time-domain envelope information value envlopEng is calculated by accumulating, over the most recent M frames, the logarithm of the difference between each frame's maximum value and the average envelope value, where vMax(i) is the maximum value of the i-th frame in the most recent M frames of audio signal, i is 1, 2, ..., M, and envelopeAve is the average envelope value.
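  • A hedged sketch of the time-domain envelope information value described above (difference from the average envelope, logarithm, accumulation); the absolute value and the small epsilon are assumptions added only to keep the logarithm well defined and are not taken from the patent:

      import math

      def envelope_info_value(vmax_recent, envelope_ave, eps=1e-12):
          # envlopEng: accumulate the log of the difference between each recent
          # frame's maximum vMax(i) and the average envelope value envelopeAve.
          return sum(math.log(abs(v - envelope_ave) + eps) for v in vmax_recent)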
  • S103 Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • In this embodiment, the audio signal of the current frame is a time-domain signal.
  • The audio signal of the current frame is transformed from the time domain to the frequency domain by a Fourier transform to generate the frequency-domain signal corresponding to the audio signal of the current frame; the frequency-domain signal is then divided into sub-bands, and the energy of each sub-band is calculated.
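  • A minimal sketch of step S103, assuming an FFT-based transform and an even split of the positive-frequency bins into N sub-bands; the patent does not specify the transform size or the sub-band boundaries, so these are illustrative choices:

      import numpy as np

      def subband_energies(frame, n_subbands):
          # Transform one time-domain frame to the frequency domain and return
          # the energy of each sub-band, subEng(1..N).
          spectrum = np.fft.rfft(frame)              # frequency-domain signal
          power = np.abs(spectrum) ** 2              # per-bin energy
          bands = np.array_split(power, n_subbands)  # illustrative even sub-band split
          return [float(np.sum(b)) for b in bands]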
  • S104 Acquire a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • In step S104, first, the average energy value of the sub-band energies is calculated from the energy of each sub-band: the energy values subEng(k) of the sub-bands are accumulated, and the accumulated value is divided by N to obtain the average energy value aveSubEng. Then the difference between each sub-band energy subEng(k) and the average energy value aveSubEng is obtained; the logarithm of each sub-band's difference is taken; and finally the logarithmic values of all sub-bands are accumulated to obtain the sub-band energy information value of the audio signal of the current frame.
  • In this embodiment, the sub-band energy information value is calculated from the sub-band energies of the different sub-bands and the average energy value of the sub-band energies. Since the human voice covers a corresponding preset frequency band, the sub-band energy information value reflects the distinctive sub-band energy distribution of the human voice, and can therefore distinguish human voice from environmental noise well.
  • The sub-band energy information value entroEng is calculated by accumulating, over the N sub-bands, the logarithm of the difference between each sub-band's energy and the average energy value, where subEng(k) is the energy of the k-th sub-band, k is 1, 2, ..., N, and aveSubEng is the average energy value of the sub-band energies.
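  • Following the same pattern as the time-domain feature, a sketch of the sub-band energy information value described above; the absolute value and epsilon are again assumptions added to keep the logarithm defined:

      import math

      def subband_info_value(sub_eng, eps=1e-12):
          # entroEng: accumulate the log of the difference between each sub-band
          # energy subEng(k) and the average sub-band energy aveSubEng.
          ave_sub_eng = sum(sub_eng) / len(sub_eng)
          return sum(math.log(abs(e - ave_sub_eng) + eps) for e in sub_eng)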
  • S105 Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively.
  • In this embodiment, the time-domain envelope information gate threshold envlopEngThrd of the audio signal of the current frame may be updated according to the minimum of the time-domain envelope information value envlopEng within a first preset time range before the current time, and the sub-band energy information gate threshold of the audio signal of the current frame may be updated according to the minimum of the sub-band energy information value entroEng within that same first preset time range. That is, the time-domain envelope information gate threshold and the sub-band energy information gate threshold are adjusted as the call scene changes.
  • If the environmental noise is loud within the first preset time range before the current time, the time-domain envelope information gate threshold and the sub-band energy information gate threshold are each raised to different degrees; if the environment is quieter within that time range, the two gate thresholds are correspondingly lowered to different degrees.
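  • One way to realize the threshold update described above is minimum tracking over a sliding window: the gate threshold follows the minimum of the feature value over the first preset time range, plus an optional margin. The window length and margin below are illustrative assumptions, not values from the patent:

      from collections import deque

      class MinTrackingThreshold:
          # Gate threshold that follows the minimum feature value seen within a
          # sliding window of recent frames (window size and margin are assumptions).
          def __init__(self, window_frames=200, margin=1.0):
              self.history = deque(maxlen=window_frames)
              self.margin = margin

          def update(self, feature_value):
              self.history.append(feature_value)
              return min(self.history) + self.margin  # current gate threshold

  • With such a scheme, a noisier recent history raises the tracked minimum and hence the threshold, while a quieter history lowers it, which matches the behavior described above.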
  • S106 Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • In this embodiment, a feature-based speech probability function maps each frame of audio signal to a probability value.
  • For the time-domain feature, first obtain the difference between the time-domain envelope information value and the time-domain envelope information gate threshold; then normalize this difference to obtain the first human voice probability value.
  • In this embodiment, the first human voice probability value SpeechProb1 is calculated according to the following formula:
  • SpeechProb1 = sigmoid(envlopEng - envlopEngThrd), where envlopEng is the time-domain envelope information value, and envlopEngThrd is the time-domain envelope information gate threshold.
  • S107 Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • In this embodiment, the feature-based speech probability function likewise maps each frame of audio signal to a probability value.
  • For the frequency-domain feature, first obtain the difference between the sub-band energy information value and the sub-band energy information gate threshold; then normalize this difference to obtain the second human voice probability value.
  • In this embodiment, the second human voice probability value SpeechProb2 is calculated according to the following formula:
  • SpeechProb2 = sigmoid(entroEng - entroEngThrd), where entroEng is the sub-band energy information value, and entroEngThrd is the sub-band energy information gate threshold.
  • S108 Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • In step S108, the human voice probability value of the audio signal of the current frame is calculated as the product of the first human voice probability value and the second human voice probability value.
  • In this embodiment, the human voice probability value SpeechProb is calculated by the following formula:
  • SpeechProb = SpeechProb1 * SpeechProb2, where SpeechProb1 is the first human voice probability value, and SpeechProb2 is the second human voice probability value.
  • In this embodiment, the human voice probability value of the current frame of audio signal is synthesized from the first human voice probability value calculated from the time-domain feature and the second human voice probability value calculated from the frequency-domain feature.
  • In other embodiments, weight values may be set for the time-domain and frequency-domain dimensions respectively, and the final human voice probability value is calculated from the first human voice probability value and the time-domain weight value together with the second human voice probability value and the frequency-domain weight value.
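  • A sketch of steps S106 to S108 under these formulas, mapping each feature-minus-threshold difference through a sigmoid and combining the two probabilities; the optional weighted combination mentioned above is included, with the weights as illustrative assumptions:

      import math

      def sigmoid(x):
          return 1.0 / (1.0 + math.exp(-x))

      def voice_probability(envlop_eng, envlop_thrd, entro_eng, entro_thrd,
                            w_time=None, w_freq=None):
          # SpeechProb from the time-domain and frequency-domain features.
          # By default the product of the two probabilities is used; if weights
          # are supplied, a weighted combination is used instead.
          p1 = sigmoid(envlop_eng - envlop_thrd)  # SpeechProb1 (time domain)
          p2 = sigmoid(entro_eng - entro_thrd)    # SpeechProb2 (frequency domain)
          if w_time is None or w_freq is None:
              return p1 * p2                      # SpeechProb = SpeechProb1 * SpeechProb2
          return w_time * p1 + w_freq * p2        # weighted alternative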
  • Fig. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human voice detection method includes the steps:
  • S200 Perform preprocessing on the audio signal in the audio sample, where the preprocessing method includes at least one of resampling processing, noise reduction processing, howling suppression processing, and echo cancellation processing.
  • The resampling process includes at least one of upsampling and downsampling: in the upsampling process the audio signal is subjected to interpolation processing, and in the downsampling process the audio signal is subjected to extraction (decimation) processing.
  • Noise reduction processing refers to removing the noise component of the audio signal. Howling suppression processing refers to eliminating howling in the audio signal, for example by frequency equalization, which adjusts the frequency response of the system toward an approximately flat line so that the gain at each frequency is essentially the same, or by other howling-suppression approaches. Echo cancellation processing can be achieved through echo cancellation (EC) technology; echo is divided into acoustic echo and line echo, and the corresponding techniques are acoustic echo cancellation (AEC) and line echo cancellation (LEC).
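  • A minimal sketch of the resampling part of the preprocessing in S200; for simplicity it uses linear interpolation for both upsampling and downsampling and omits the anti-aliasing filtering, noise reduction, howling suppression, and echo cancellation that a real pipeline would add:

      import numpy as np

      def resample_linear(samples, src_rate, dst_rate):
          # Resample a 1-D signal to a new rate via linear interpolation
          # (a simplification; production code would low-pass filter before downsampling).
          if dst_rate == src_rate:
              return np.asarray(samples, dtype=float)
          duration = len(samples) / src_rate
          n_out = int(round(duration * dst_rate))
          t_src = np.arange(len(samples)) / src_rate
          t_dst = np.arange(n_out) / dst_rate
          return np.interp(t_dst, t_src, samples)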
  • S201 Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • S202 Acquire a time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
  • S203 Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • S204 Obtain a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • S205 Determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
  • S206 Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • S207 Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • S208 Obtain summary information based on the first human voice probability value and the second human voice probability value, and upload the summary information to a blockchain.
  • S209 Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • For steps S201 to S207 and step S209, refer to the description of the first embodiment; details are not repeated here.
  • In step S208, corresponding summary information is obtained based on the first human voice probability value and the second human voice probability value. Specifically, the summary information is obtained by hash processing of the first human voice probability value and the second human voice probability value, for example using the SHA-256 algorithm.
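  • A hedged illustration of how the summary information in step S208 might be produced with SHA-256; the string serialization of the two probability values is an assumption, since the patent does not specify the exact input format:

      import hashlib

      def summary_digest(speech_prob1, speech_prob2):
          # Hash the two human voice probability values into summary information
          # (the serialization below is an illustrative assumption).
          payload = f"{speech_prob1:.6f}|{speech_prob2:.6f}".encode("utf-8")
          return hashlib.sha256(payload).hexdigest()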
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain to verify whether the first human voice probability value and the second human voice probability value have been tampered with.
  • The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer
  • In step S210, if the human voice probability value of the audio signal of the current frame is greater than or equal to the first required probability, the audio signal of the current frame is determined to be a human voice frame;
  • the audio signal of the current frame is then encoded to obtain a first encoded audio stream, and the first encoded audio stream is sent.
  • In step S210, if the human voice probability value of the audio signal of the current frame is less than the first required probability, the audio signal of the current frame is determined to be a non-human-voice frame; the audio signal of the current frame is encoded to obtain a second encoded audio stream, and the second encoded audio stream is sent.
  • In this embodiment, a non-human-voice frame may be normalized to a silent frame by modifying its digital signal values. If the current frame of audio signal is determined to be a non-human-voice frame (an environmental noise frame or a silent frame), the transmission of non-human voice can be reduced in a call application, effectively reducing bandwidth occupation, improving bandwidth utilization, reducing transmission delay, and enhancing the customer call experience.
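  • A sketch of the decision in step S210 and of the optional normalization of a non-human-voice frame to a silent frame; the value of the required probability and the zeroing of the samples are illustrative assumptions:

      import numpy as np

      def classify_and_prepare(frame, speech_prob, required_prob=0.5):
          # Decide whether the current frame is a human voice frame; if not,
          # normalize it to a silent frame by zeroing its samples (an assumption).
          if speech_prob >= required_prob:
              return "voice", frame                 # encode and send as the first stream
          return "non-voice", np.zeros_like(frame)  # encode and send as the second stream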
  • FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application.
  • The device 30 includes a time-domain feature extraction module 31, a time-domain feature calculation module 32, a frequency-domain feature extraction module 33, a frequency-domain feature calculation module 34, a gate threshold determination module 35, a time-domain human voice detection module 36, a frequency-domain human voice detection module 37, and a human voice probability calculation module 38.
  • The time-domain feature extraction module 31 is used to obtain the time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
  • the time-domain feature calculation module 32 is configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
  • the frequency domain feature extraction module 33 is configured to obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
  • the frequency domain feature calculation module 34 is configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
  • the gate threshold determination module 35 is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
  • the time domain human voice detection module 36 is configured to obtain the first human voice probability value of the current frame audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value.
  • the frequency domain human voice detection module 37 is configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
  • the human voice probability calculation module 38 is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  • Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
  • the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
  • the memory 42 stores program instructions for implementing the human voice detection method of any of the above embodiments.
  • the processor 41 is configured to execute program instructions stored in the memory 42 to perform human voice detection.
  • The processor 41 may also be referred to as a CPU (Central Processing Unit).
  • The processor 41 may be an integrated circuit chip with signal processing capabilities.
  • The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of the application.
  • The storage medium of the embodiment of the present application stores program instructions 51 capable of implementing all of the above-mentioned human voice detection methods.
  • the storage medium may be non-volatile or volatile.
  • The program instructions 51 may be stored in the above-mentioned storage medium in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage devices include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks, as well as terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit can be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of this application and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided are a human voice detection method, apparatus (30), electronic device (40), and storage medium, relating to the technical field of artificial intelligence, said method comprising: obtaining time-domain envelope information by means of an audio signal of a current frame and an audio signal of a previous plurality of frames (S101, S201); obtaining the energy of each sub-band by means of the audio signal of the current frame (S103, S203); performing time-domain data analysis on time-domain envelope information, and performing frequency-domain data analysis on the energy of each sub-band; according to the analysis results, calculating a first human-voice detection probability value in the time-domain dimension and a second human-voice detection probability value in the frequency-domain dimension, respectively, of the audio signal of the current frame (S106, S107, S206, S207); according to the comprehensive calculation of the two human-voice detection probability values, obtaining a human-voice probability value of the current frame (S108, S209). By the described means, the accuracy of human voice detection is increased, accurate differentiation is made between human voice and non-stationary noise, effectively preventing damage to the human voice; at the same time, the non-stationary noise suppression effect is improved, and changes in a call scenario are adapted to by means of updating the gate threshold value, enabling rapid tracking of a valid human voice.

Description

Human voice detection method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 24, 2020 with application number 202010723751.1 and entitled "Human Voice Detection Method, Apparatus, Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
[Technical Field]
This application relates to the field of audio processing technology and to the field of artificial intelligence, and in particular to a human voice detection method, apparatus, device, and storage medium.
[Background Art]
VAD (Voice Activity Detection) is widely used in voice coding. Its purpose is to identify and remove long silent periods from the voice signal stream so as to save voice channel resources without reducing quality of service, and it is an important component of IP telephony applications. For example, not sending packets during silence saves valuable bandwidth resources and helps reduce the end-to-end delay perceived by users. However, current VAD technology generally can only distinguish silence from non-silence; if human voices and non-human voices could be further identified, voice coding could further improve bandwidth utilization.
At the same time, distinguishing human voice from non-human voice plays a key role in noise suppression technology. Noise suppression is a typical application of audio pre- and post-processing and underpins the performance of a call product; treating non-human voice as noise to be tracked and suppressed can greatly improve noise suppression performance.
The inventor realizes that human voice detection in prior-art noise suppression adapts part of the VAD technology to track noise. Such techniques suppress stationary noise well but suppress non-stationary noise poorly.
Therefore, it is necessary to provide a new human voice detection method.
[Summary of the Invention]
The purpose of this application is to provide a human voice detection method, apparatus, and storage medium to solve the technical problem in the prior art that non-stationary noise is poorly suppressed because human voice and non-stationary noise cannot be accurately distinguished.
The technical solution of the present application is as follows: a human voice detection method is provided, including:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Preferably, the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample includes:
obtaining the maximum value of each frame of audio signal in the audio sample;
calculating the mean of the maximum values of the most recent multiple frames of audio signal in the audio sample and taking the mean as the average envelope value, where the most recent multiple frames of audio signal include the current frame of audio signal and the multiple frames of audio signal before the current frame, and taking the maximum values of the most recent multiple frames of audio signal together with the average envelope value as the time-domain envelope information.
Another technical solution of the present application is as follows: a human voice detection device is provided, including:
a time-domain feature extraction module, used to obtain the time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
a time-domain feature calculation module, configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
a frequency-domain feature extraction module, used to obtain the frequency-domain signal corresponding to the audio signal of the current frame and to obtain the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
a frequency-domain feature calculation module, configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
a gate threshold determination module, used to determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
a time-domain human voice detection module, configured to obtain the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
a frequency-domain human voice detection module, configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
a human voice probability calculation module, configured to obtain the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Another technical solution of the present application is as follows: an electronic device is provided, the device including a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when the processor executes the program instructions stored in the memory, the following steps are implemented:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
Another technical solution of the present application is as follows: a storage medium is provided, the storage medium storing program instructions, and the following steps are implemented when the program instructions are executed by a processor:
acquiring time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample;
acquiring the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
acquiring the frequency-domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each sub-band of the audio signal of the current frame according to the frequency-domain signal;
acquiring the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
determining the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
acquiring the first human voice probability value of the audio signal of the current frame according to the time-domain envelope information value and the time-domain envelope information gate threshold;
acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
acquiring the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
The beneficial effects of the present application are as follows: the human voice detection method, apparatus, device, and storage medium of the present application obtain time-domain envelope information from the audio signal of the current frame and the audio signals of the previous multiple frames, and obtain the energy of each sub-band from the audio signal of the current frame; time-domain data analysis is then performed on the time-domain envelope information and frequency-domain data analysis on the sub-band energies, and from the two analysis results the first human voice detection probability value in the time-domain dimension and the second human voice detection probability value in the frequency-domain dimension of the current frame are calculated; finally, the human voice probability value of the current frame is computed from the two detection probability values. In this way, the accuracy of human voice detection is increased, human voice can be accurately distinguished from non-stationary noise, damage to the human voice is effectively avoided, and the suppression of non-stationary noise is improved. In addition, updating the gate thresholds adapts the detector to changes in the call scene, enabling fast tracking of effective human voice.
[Description of the Drawings]
FIG. 1 is a schematic flowchart of the human voice detection method according to the first embodiment of this application;
FIG. 2 is a schematic flowchart of the human voice detection method according to the second embodiment of this application;
FIG. 3 is a schematic structural diagram of the human voice detection device according to the third embodiment of this application;
FIG. 4 is a schematic structural diagram of the human voice detection device according to the fourth embodiment of this application;
FIG. 5 is a schematic structural diagram of the storage medium according to the fifth embodiment of this application.
[Detailed Description of the Embodiments]
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
The terms “first”, “second”, and “third” in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined with “first”, “second”, and “third” may explicitly or implicitly include at least one such feature. In the description of this application, “a plurality of” means at least two, such as two or three, unless otherwise specifically defined. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement, and so on between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly. In addition, the terms “including” and “having” and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to that process, method, product, or device.
Reference to an “embodiment” herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
在本申请实施例中,每一帧音频信号是单位时间内的音频原始数字信号,该帧音频信号可以是静音帧、人声帧或环境噪声帧中的任意一种。其中,静音帧是指没有能量的原始音频数字信号帧;人声帧和环境噪声帧均为有能量的原始音频数字信号帧,环境噪声帧和静音帧为非人声帧;人声帧中的主要声音是人说话时发出的声音,人声帧为音频原始数字信号中人声占比较大的音频信号;环境噪声帧中的主要声音不是人说话时发出的声音,环境噪声帧为音频原始数字信号中人声占比较小的音频信号。在本实施例中,对每一帧音频信号进行人声检测,确定当前帧音频信号是否为人声帧,由于静音帧与人声帧容易区别,人声检测时主要是区分该帧音频信号为环境噪声帧还是人声帧。In the embodiment of the present application, each frame of audio signal is an audio original digital signal within a unit time, and the frame of audio signal may be any one of a silent frame, a human voice frame, or an environmental noise frame. Among them, the silent frame refers to the original audio digital signal frame without energy; the human voice frame and the environmental noise frame are both the original audio digital signal frame with energy, and the environmental noise frame and the silent frame are non-human voice frames; The main sound is the sound made when a person speaks. The human voice frame is the audio signal in which the human voice accounts for a larger proportion of the original audio digital signal; the main sound in the environmental noise frame is not the sound made by the person talking, and the environmental noise frame is the original audio digital signal The human voice accounts for a relatively small audio signal in the signal. In this embodiment, human voice detection is performed on each frame of audio signal to determine whether the audio signal of the current frame is a human voice frame. Since the silent frame is easily distinguished from the human voice frame, the human voice detection is mainly to distinguish the audio signal of the frame as the environment The noise frame is still the human voice frame.
在本申请本实施例中,通过当前帧音频信号和前多帧音频信号获取时域包络信息,通过当前帧音频信号获取各子带能量,再对时域包络信息进行时域数据分析,对各子带能量进行频域数据分析,根据两个分析结果分别计算当前帧音频信号的时域维度的第一人声检测概率值和频域维度的第二人声检测概率值,最后根据两个人声检测概率值综合计算得出当前帧是否为人声帧。In this embodiment of the present application, the time domain envelope information is obtained from the audio signal of the current frame and the audio signals of the previous multiple frames, the energy of each subband is obtained from the audio signal of the current frame, and time domain data analysis is performed on the time domain envelope information. Perform frequency domain data analysis on the energy of each subband, and calculate the first human voice detection probability value in the time domain dimension and the second human voice detection probability value in the frequency domain dimension of the audio signal of the current frame according to the two analysis results. Finally, according to the two analysis results, The personal voice detection probability value is comprehensively calculated to determine whether the current frame is a human voice frame.
图1是本申请第一实施例的人声检测方法的流程示意图。需注意的是，若有实质上相同的结果，本申请的方法并不以图1所示的流程顺序为限。如图1所示，该人声检测方法包括步骤：Fig. 1 is a schematic flowchart of a human voice detection method according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the human voice detection method includes the following steps:
S101,根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。S101: Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
在步骤S101中，获取的为最近多帧音频信号的时域包络信息，第一个包络信息为每帧音频信号的最大值vmax，第二个包络信息为最大值的均值（平均包络值envelopeAve）。具体地，当需要对待检测的音频样本进行人声检测时，先对音频样本进行分帧，其中，每一帧音频信号包括多个采样点，每个采样点具有幅度。每一帧音频信号的最大值为该音频信号的各个采样点幅度的最大值，设第t帧音频信号包括n个采样点，n个采样点分别为Xt(1)，Xt(2)，……，Xt(n)，其中，Xt(n)表示第t帧音频信号中第n个采样点，于是，第t帧音频信号的最大值vmax=max(Xt(1)，Xt(2)，……，Xt(n))。In step S101, the time-domain envelope information of the most recent multiple frames of the audio signal is acquired. The first piece of envelope information is the maximum value vmax of each frame of the audio signal, and the second is the mean of those maxima (the average envelope value envelopeAve). Specifically, when human voice detection needs to be performed on the audio sample to be detected, the audio sample is first divided into frames, where each frame of the audio signal includes a plurality of sampling points and each sampling point has an amplitude. The maximum value of each frame is the maximum amplitude over its sampling points. Suppose the t-th frame includes n sampling points Xt(1), Xt(2), ..., Xt(n), where Xt(n) denotes the n-th sampling point of the t-th frame; then the maximum value of the t-th frame is vmax = max(Xt(1), Xt(2), ..., Xt(n)).
在步骤S101中，记录每帧音频信号的最大值vmax，再利用最近M帧音频信号的最大值（vmax(1)，vmax(2)，……，vmax(M)）计算平均包络值envelopeAve。最近M帧音频信号包括当前帧音频信号（第M帧）以及位于当前帧音频信号之前的M-1帧音频信号（第1帧，第2帧，……，第M-1帧），将该M-1帧音频信号以及当前帧音频信号的最大值进行累加得到累加值，再将累加值除以M计算平均包络值，即envelopeAve = (vmax(1) + vmax(2) + … + vmax(M)) / M。In step S101, the maximum value vmax of each frame of the audio signal is recorded, and the average envelope value envelopeAve is then calculated from the maxima of the most recent M frames (vmax(1), vmax(2), ..., vmax(M)). The most recent M frames include the current frame (the M-th frame) and the M-1 frames preceding it (the 1st frame, the 2nd frame, ..., the (M-1)-th frame). The maxima of these M-1 frames and of the current frame are accumulated, and the accumulated sum is divided by M to obtain the average envelope value, i.e. envelopeAve = (vmax(1) + vmax(2) + … + vmax(M)) / M.
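As a minimal sketch of the framing and envelope computation described above, the step could look like the following in Python/NumPy; the frame length, the value of M, and the use of absolute amplitudes are assumptions for illustration rather than values fixed by the patent.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames; the trailing remainder is dropped."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def envelope_info(frames, M):
    """Return all per-frame maxima and the average envelope value for the current (last) frame."""
    vmax = np.max(np.abs(frames), axis=1)   # vmax(t) = max over the sampling points of frame t
    recent = vmax[-M:]                      # current frame plus the M-1 preceding frames
    envelope_ave = float(recent.mean())     # envelopeAve = (1/M) * sum of vmax(i)
    return vmax, envelope_ave
```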
S102,根据所述时域包络信息获取当前帧音频信号的时域包络信息值。S102: Acquire a time domain envelope information value of the audio signal of the current frame according to the time domain envelope information.
在步骤S102中，根据步骤S101获取的音频信号的时域包络信息进行时域数据分析，对时域包络信息进行量化，得到时域包络信息值（时域包络信息的量化值）。在本实施例中，对于当前帧音频信号，时域包络信息通过以下方式进行量化计算：首先，获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值；然后，将每帧音频信号的差值进行对数运算，得到所述差值对应的对数值；最后，将每帧音频信号的对数值进行累加，得到当前帧音频信号的时域包络信息值。在本实施例中，由于时域包络信息是根据最近多帧音频信号获取的，人声的时域包络可以看成平滑的曲线，与环境噪声表现出的特征不同，因此，时域包络信息值能够很好地反映出声音的变化，利用时域包络信息值能够准确检测出是否有人声出现。In step S102, time-domain data analysis is performed on the time-domain envelope information obtained in step S101, and the envelope information is quantized to obtain the time-domain envelope information value (the quantized value of the time-domain envelope information). In this embodiment, for the audio signal of the current frame, the time-domain envelope information is quantized as follows: first, the difference between the maximum value of each of the most recent frames and the average envelope value is obtained; then, a logarithm is taken of each frame's difference to obtain the corresponding logarithmic value; finally, the logarithmic values of all frames are accumulated to obtain the time-domain envelope information value of the current frame. Because the time-domain envelope information is obtained from the most recent multiple frames, the time-domain envelope of a human voice can be regarded as a smooth curve, which differs from the characteristics exhibited by environmental noise. The time-domain envelope information value therefore reflects changes in the sound well and can be used to accurately detect whether a human voice is present.
具体地，按照如下公式计算时域包络信息值envlopEng：Specifically, the time-domain envelope information value envlopEng is calculated according to the following formula:
envlopEng = log(vMax(1) - envelopeAve) + log(vMax(2) - envelopeAve) + … + log(vMax(M) - envelopeAve)
其中，vMax(i)为最近M帧音频信号中第i帧音频信号的最大值，i为1，2，……，M，envelopeAve为平均包络值。Among them, vMax(i) is the maximum value of the i-th frame among the most recent M frames of audio signal, i is 1, 2, ..., M, and envelopeAve is the average envelope value.
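Following the formula above, a hedged sketch of the quantization in step S102 might look as follows; the epsilon that keeps the logarithm defined when a frame maximum falls below the average envelope is an added assumption, not part of the patent text.

```python
import numpy as np

def envelope_info_value(vmax_recent, envelope_ave, eps=1e-12):
    """envlopEng = sum over i of log(vMax(i) - envelopeAve), over the most recent M frames."""
    diff = np.maximum(np.asarray(vmax_recent) - envelope_ave, eps)  # clamp so log() stays defined
    return float(np.sum(np.log(diff)))
```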
S103,获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。S103: Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
在步骤S103中，当前帧音频信号为时域信号，要对该信号提取频域特征，首先，通过傅里叶变换将当前帧音频信号从时域变换到频域，生成当前帧音频信号对应的频域信号；对该频域信号进行子带划分处理，计算各个子带的能量。具体地，将当前帧音频信号对应的频域信号C划分为N个子带，并设置子带的结束位置为b(1)、b(2)、……、b(k)、……b(N)，且b(0)=1，则各子带能量为subEng(k)。In step S103, the audio signal of the current frame is a time-domain signal, and frequency-domain features need to be extracted from it. First, the current frame is transformed from the time domain to the frequency domain through a Fourier transform to generate the corresponding frequency-domain signal; the frequency-domain signal is then divided into sub-bands, and the energy of each sub-band is calculated. Specifically, the frequency-domain signal C corresponding to the current frame is divided into N sub-bands, the end positions of the sub-bands are set to b(1), b(2), ..., b(k), ..., b(N), with b(0)=1, and the energy of each sub-band is subEng(k).
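A possible NumPy sketch of step S103 is shown below; the sub-band boundary indices b(k) are placeholders, since the patent does not fix concrete boundary values in this passage.

```python
import numpy as np

def subband_energies(frame, band_edges):
    """band_edges = [b(0), b(1), ..., b(N)] given as FFT-bin indices."""
    spectrum = np.fft.rfft(frame)          # Fourier transform: time domain -> frequency domain
    power = np.abs(spectrum) ** 2
    return np.array([power[band_edges[k]:band_edges[k + 1]].sum()   # subEng(k)
                     for k in range(len(band_edges) - 1)])
```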
S104,根据所述各子带能量获取当前帧音频信号的子带能量信息值。S104: Acquire a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
在步骤S104中，首先，根据所述各子带能量计算各子带能量的平均能量值，即将各子带能量值subEng(k)进行累加，再将累加值除以N得到平均能量值aveSubEng，即aveSubEng = (subEng(1) + subEng(2) + … + subEng(N)) / N；然后，获取每个子带的子带能量subEng(k)与平均能量值aveSubEng的差值；然后，将每个子带的差值进行对数运算，得到所述差值对应的对数值；最后，将每个子带的对数值进行累加，得到当前帧音频信号的子带能量信息值。在本实施例中，根据不同子带的子带能量与各子带能量的平均能量值计算子带能量信息值，由于人声具有对应覆盖的预设频带，该子带能量信息值能够反映出人声独特的子带能量分布特征，因此，该子带能量信息值能够很好地将人声与环境噪声进行区分。In step S104, the average energy value of the sub-band energies is first calculated from the individual sub-band energies: the sub-band energy values subEng(k) are accumulated and the accumulated sum is divided by N to obtain the average energy value aveSubEng, i.e. aveSubEng = (subEng(1) + subEng(2) + … + subEng(N)) / N. Then, the difference between the sub-band energy subEng(k) of each sub-band and the average energy value aveSubEng is obtained; a logarithm is taken of each sub-band's difference to obtain the corresponding logarithmic value; finally, the logarithmic values of all sub-bands are accumulated to obtain the sub-band energy information value of the current frame. In this embodiment, the sub-band energy information value is calculated from the sub-band energies of the different sub-bands and their average energy value. Since the human voice covers a corresponding preset frequency band, the sub-band energy information value reflects the distinctive sub-band energy distribution of the human voice and can therefore distinguish the human voice from environmental noise well.
具体地，按照如下公式计算子带能量信息值entroEng：Specifically, the sub-band energy information value entroEng is calculated according to the following formula:
entroEng = log(subEng(1) - aveSubEng) + log(subEng(2) - aveSubEng) + … + log(subEng(N) - aveSubEng)
其中，subEng(k)为第k个子带的子带能量，k为1，2，……，N，aveSubEng为各子带能量的平均能量值。Among them, subEng(k) is the sub-band energy of the k-th sub-band, k is 1, 2, ..., N, and aveSubEng is the average energy value of the sub-band energies.
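Mirroring the formula above, step S104 could be sketched as follows; as before, the epsilon guarding the logarithm is an assumption added for numerical safety.

```python
import numpy as np

def subband_energy_info_value(sub_eng, eps=1e-12):
    ave_sub_eng = float(np.mean(sub_eng))                     # aveSubEng = (1/N) * sum of subEng(k)
    diff = np.maximum(np.asarray(sub_eng) - ave_sub_eng, eps) # clamp so log() stays defined
    return float(np.sum(np.log(diff)))                        # entroEng
```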
S105,分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。S105: Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively.
在一个可选的实施方式中，当前帧音频信号的时域包络信息门阀值envlopEngThrd可以根据当前时间之前的第一预设时间范围内时域包络信息值envlopEng的最小值进行更新；当前帧音频信号的子带能量信息门阀值可以根据当前时间之前的第一预设时间范围内子带能量信息值entroEng的最小值进行更新。也就是说，时域包络信息门阀值和子带能量信息门阀值均根据通话场景的变化进行调整，若当前时间之前的第一预设时间范围内环境噪声较大时，时域包络信息门阀值和子带能量信息门阀值分别相对不同程度增大；若当前时间之前的第一预设时间范围内环境较安静时，时域包络信息门阀值和子带能量信息门阀值分别相对不同程度减小。In an optional implementation, the time-domain envelope information gate threshold envlopEngThrd of the current frame can be updated according to the minimum of the time-domain envelope information value envlopEng within a first preset time range before the current time; the sub-band energy information gate threshold of the current frame can be updated according to the minimum of the sub-band energy information value entroEng within the same first preset time range. In other words, both gate thresholds are adjusted as the call scene changes: if the environmental noise within the first preset time range before the current time is high, the time-domain envelope information gate threshold and the sub-band energy information gate threshold each increase to different degrees; if the environment within that range is relatively quiet, the two gate thresholds each decrease to different degrees.
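One way to realize the minimum-tracking threshold update of step S105 is sketched below; the window length standing in for the "first preset time range" (expressed here in frames) is an assumption, as is taking the raw minimum without any offset.

```python
from collections import deque

class MinTrackingThreshold:
    """Tracks the minimum of a feature (envlopEng or entroEng) over a recent window of frames."""
    def __init__(self, window_frames=200):
        self.history = deque(maxlen=window_frames)

    def update(self, feature_value):
        self.history.append(feature_value)
        return min(self.history)   # refreshed gate threshold for the current frame
```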
S106,根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。S106: Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
在步骤S106中，基于特征的语音概率函数将每帧音频信号映射到一个概率值得出概率值，对于时域特征，首先，获取所述时域包络信息值与所述时域包络信息门阀值的差值；然后，将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值。In step S106, a feature-based speech probability function maps each frame of the audio signal to a probability value. For the time-domain feature, the difference between the time-domain envelope information value and the time-domain envelope information gate threshold is first obtained; this difference is then normalized to obtain the first human voice probability value.
具体地,按照如下公式计算第一人声概率值SpeechProb1:Specifically, the first vocal probability value SpeechProb1 is calculated according to the following formula:
SpeechProb1=sigmoid(envlopEng-envlopEngThrd),其中,envlopEng为时域包络信息值,envlopEngThrd为时域包络信息门阀值。SpeechProb1=sigmoid(envlopEng-envlopEngThrd), where envlopEng is the time-domain envelope information value, and envlopEngThrd is the time-domain envelope information gate threshold.
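The normalization in step S106 can be read as the standard logistic sigmoid; a short sketch under that assumption is shown below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_voice_probability(envlop_eng, envlop_eng_thrd):
    return sigmoid(envlop_eng - envlop_eng_thrd)   # SpeechProb1
```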
S107,根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。S107: Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
在步骤S107中,基于特征的语音概率函数将每帧音频信号映射到一个概率值得出概率值,对于频域特征,首先,获取所述子带能量信息值与所述子带能量信息门阀值的差值;然后,将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值。In step S107, each frame of audio signal is mapped to a probability value to obtain a probability value based on the feature-based speech probability function. For the frequency domain feature, first, obtain the difference between the sub-band energy information value and the sub-band energy information gate threshold. Then, the difference between the sub-band energy information value and the sub-band energy information gate threshold value is normalized to obtain a second human voice probability value.
具体地,按照如下公式计算第二人声概率值SpeechProb2:Specifically, the second vocal probability value SpeechProb2 is calculated according to the following formula:
SpeechProb2=sigmoid(entroEng-entroEngThrd)，其中，entroEng为子带能量信息值，entroEngThrd为子带能量信息门阀值。SpeechProb2 = sigmoid(entroEng - entroEngThrd), where entroEng is the sub-band energy information value, and entroEngThrd is the sub-band energy information gate threshold.
S108,根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。S108: Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
在步骤S108中,根据第一人声概率值和第二人声概率值的乘积计算当前帧音频信号的人声概率值。具体地,人声概率值SpeechProb通过如下公式计算:In step S108, the human voice probability value of the audio signal of the current frame is calculated according to the product of the first human voice probability value and the second human voice probability value. Specifically, the speech probability value SpeechProb is calculated by the following formula:
SpeechProb=SpeechProb1*SpeechProb2,其中,SpeechProb1为第一人声概率值,SpeechProb2为第二人声概率值。SpeechProb=SpeechProb1*SpeechProb2, where SpeechProb1 is the first vocal probability value, and SpeechProb2 is the second vocal probability value.
在步骤S108中，基于人声的特征，从基于时域特征计算的第一人声概率值和基于频域特征计算的第二人声概率值综合得到当前帧音频信号的人声概率值，同时考虑时域和频域两个维度，避免只考虑单个维度，造成对人声的误判。当然，本领域技术人员可以理解，除上述人声概率值的计算方式外，在其他实施例中，可以为时域和频域两个维度分别设置不同的权重值，根据第一人声概率值和时域权重值以及第二人声概率值和频域权重值计算最终的人声概率值。In step S108, based on the characteristics of the human voice, the human voice probability value of the current frame is synthesized from the first probability value calculated from the time-domain feature and the second probability value calculated from the frequency-domain feature. Considering both the time-domain and frequency-domain dimensions avoids the misjudgment that can result from relying on a single dimension. Of course, those skilled in the art will understand that, besides the above calculation, in other embodiments different weight values may be set for the time-domain and frequency-domain dimensions, and the final human voice probability value may be calculated from the first probability value together with the time-domain weight and the second probability value together with the frequency-domain weight.
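A sketch of the combination in step S108, including the weighted alternative mentioned above; treating the weights as exponents is only one possible reading of "different weight values" and is an assumption, not the patent's stated formula.

```python
def combined_voice_probability(p1, p2, w1=None, w2=None):
    """Combine the time-domain and frequency-domain probabilities into SpeechProb."""
    if w1 is None or w2 is None:
        return p1 * p2                 # SpeechProb = SpeechProb1 * SpeechProb2
    return (p1 ** w1) * (p2 ** w2)     # hypothetical weighted variant
```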
图2是本申请第二实施例的人声检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图2所示的流程顺序为限。如图2所示,该人声检测方法包括步骤:Fig. 2 is a schematic flowchart of a human voice detection method according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the human voice detection method includes the steps:
S200,对音频样本中的音频信号进行预处理,所述预处理的处理方式包括重采样处理、降噪处理、啸叫抑制处理、回声消除处理中的至少一种。S200: Perform preprocessing on the audio signal in the audio sample, where the preprocessing method includes at least one of resampling processing, noise reduction processing, howling suppression processing, and echo cancellation processing.
在步骤S200中，重采样处理包括向上重采样处理和向下重采样处理中的至少一种，在向上重采样处理时，对该音频信号进行差值处理，在向下重采样处理时，对该音频信号进行抽取处理；降噪处理是指对音频信号中的噪声部分进行消除的处理方式；啸叫抑制处理是指对音频信号中出现的啸叫情况进行消除，可以采用如频率均衡法，通过将系统的频率响应调成近似的直线，使各频率的增益基本一致消除啸叫等方式进行啸叫抑制；回声消除处理可以通过回声消除（Echo Cancellation，EC）技术实现，回声分为声学回音（Acoustic Echo）和线路回音（Line Echo），相应的回声消除技术对应有声学回声消除（Acoustic Echo Cancellation，AEC）和线路回声消除（Line Echo Cancellation，LEC）。In step S200, the resampling processing includes at least one of up-sampling and down-sampling: during up-sampling the audio signal is interpolated, and during down-sampling it is decimated. Noise reduction refers to removing the noise component of the audio signal. Howling suppression refers to eliminating howling that appears in the audio signal, for example by frequency equalization, in which the frequency response of the system is adjusted toward an approximately flat line so that the gain at each frequency is essentially uniform and the howling is eliminated. Echo cancellation can be implemented with echo cancellation (EC) technology; echoes are divided into acoustic echo and line echo, with the corresponding techniques being acoustic echo cancellation (AEC) and line echo cancellation (LEC).
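Of the preprocessing options in step S200, only the resampling branch is easy to show compactly; the sketch below uses SciPy's polyphase resampler, and the 48 kHz to 16 kHz rates are assumptions chosen for illustration. Noise reduction, howling suppression, and echo cancellation would plug in as further stages.

```python
from math import gcd
from scipy.signal import resample_poly

def resample(x, orig_rate=48000, target_rate=16000):
    """Up-sampling interpolates, down-sampling decimates, as described above."""
    if orig_rate == target_rate:
        return x
    g = gcd(orig_rate, target_rate)
    return resample_poly(x, up=target_rate // g, down=orig_rate // g)
```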
S201,根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。S201: Acquire time-domain envelope information according to the audio signal of the current frame and the audio signals of the previous multiple frames in the audio sample.
S202,根据所述时域包络信息获取当前帧音频信号的时域包络信息值。S202: Acquire a time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information.
S203,获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。S203: Obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal.
S204,根据所述各子带能量获取当前帧音频信号的子带能量信息值。S204: Obtain a sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band.
S205,分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。S205: Determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively.
S206,根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。S206: Acquire a first human voice probability value of the audio signal of the current frame according to the time domain envelope information value and the time domain envelope information gate threshold value.
S207,根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。S207: Acquire a second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value.
S208,将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。S208. Upload the first vocal probability value and the second vocal probability value to a blockchain, so that the blockchain can compare the first vocal probability value and the second vocal probability value to the The probability value is encrypted and stored.
S209,根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。S209: Acquire the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
S210,根据所述人声概率值确认当前帧音频信号是否为人声帧。S210: Confirm whether the audio signal of the current frame is a human voice frame according to the human voice probability value.
步骤S201至步骤S207以及步骤S209具体参见第一实施例的描述,在此不进行一一赘述。For details of steps S201 to S207 and step S209, refer to the description of the first embodiment, which will not be repeated here.
在步骤S208中，具体地，基于所述第一人声概率值和所述第二人声概率值得到对应的摘要信息，具体来说，摘要信息由所述第一人声概率值或所述第二人声概率值进行散列处理得到，比如利用sha256算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户设备可以从区块链中下载得该摘要信息，以便查证所述第一人声概率值和所述第二人声概率值是否被篡改。本示例所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链（Blockchain），本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性（防伪）和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。In step S208, specifically, corresponding digest information is obtained based on the first human voice probability value and the second human voice probability value. In particular, the digest is obtained by hashing the first or the second human voice probability value, for example with the sha256 algorithm. Uploading the digest to the blockchain ensures its security and its fairness and transparency toward users. A user device can download the digest from the blockchain to verify whether the first and second human voice probability values have been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
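The digest step of S208 can be sketched as below; the JSON serialization and the field names are placeholders, since the patent only specifies that a hash (for example SHA-256) of the probability values is computed before upload.

```python
import hashlib
import json

def probability_digest(speech_prob1, speech_prob2):
    """Build a SHA-256 digest over the two probability values prior to upload."""
    payload = json.dumps({"p1": speech_prob1, "p2": speech_prob2}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```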
在步骤S210中，若所述当前帧音频信号的人声概率值大于或等于第一要求概率，则判断所述当前帧音频信号为人声帧；按照所述人声帧对应的编码方式对所述当前帧音频信号进行编码，得到第一音频编码流；对所述第一音频编码流进行发送。In step S210, if the human voice probability value of the current frame is greater than or equal to a first required probability, the current frame is determined to be a human voice frame; the current frame is encoded according to the encoding mode corresponding to human voice frames to obtain a first encoded audio stream, and the first encoded audio stream is sent.
在步骤S210中，若所述当前帧音频信号的人声概率值小于第一要求概率，则判断所述当前帧音频信号为非人声帧；按照所述非人声帧对应的编码方式对所述当前帧音频信号进行编码，得到第二音频编码流；对所述第二音频编码流进行发送。具体地，对于非人声帧，可以通过对数字信号值的修改，将所述非人声帧归一化为静音帧。若确定当前帧音频信号为非人声帧（环境噪声帧或静音帧），则可在通话应用里，减少非人声的传输，有效减少对带宽的占用，提升带宽利用率，减少传输延时，提升客户通话体验。In step S210, if the human voice probability value of the current frame is less than the first required probability, the current frame is determined to be a non-human-voice frame; the current frame is encoded according to the encoding mode corresponding to non-human-voice frames to obtain a second encoded audio stream, and the second encoded audio stream is sent. Specifically, a non-human-voice frame can be normalized to a silent frame by modifying its digital signal values. If the current frame is determined to be a non-human-voice frame (an environmental noise frame or a silent frame), the transmission of non-voice content can be reduced in the call application, which effectively reduces bandwidth occupation, improves bandwidth utilization, reduces transmission delay, and improves the user's call experience.
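A minimal sketch of the decision logic in step S210 follows; the required-probability value and the zeroing of non-voice frames are assumptions, and the actual encoder is left unspecified here, as in the patent.

```python
import numpy as np

def classify_and_prepare(frame, speech_prob, required_prob=0.5):
    """Return the frame to encode and a flag indicating whether it is a human voice frame."""
    if speech_prob >= required_prob:
        return frame, True                     # human voice frame: encode and send as-is
    return np.zeros_like(frame), False         # non-voice frame: normalized to a silent frame
```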
图3为本申请第三实施例的人声检测装置的结构示意图。如图3所示,该装置30包括时域特征提取模块31、时域特征计算模块32、频域特征提取模块33、频域特征计算模块34、门阀值确定模块35、时域人声检测模块36、频域人声检测模块37和人声概率计算模块38,其中,时域特征提取模块31用于根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息。时域特征计算模块32用于根据所述时域包络信息获取当前帧音频信号的时域包络信息值。频域特征提取模块33用于获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量。频域特征计算模块34用于根据所述各子带能量获取当前帧音频信号的子带能量信息值。门阀值确定模块35用于分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值。时域人声检测模块36用于根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值。频域人声检测模块37用于根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值。人声概率计算模块38用于根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。FIG. 3 is a schematic structural diagram of a human voice detection device according to a third embodiment of the application. As shown in FIG. 3, the device 30 includes a time domain feature extraction module 31, a time domain feature calculation module 32, a frequency domain feature extraction module 33, a frequency domain feature calculation module 34, a gate threshold determination module 35, and a time domain voice detection module 36. The frequency domain human voice detection module 37 and the human voice probability calculation module 38, wherein the time domain feature extraction module 31 is used to obtain the time domain envelope information according to the current frame audio signal and the previous multiple frames of audio signals in the audio sample. The time-domain feature calculation module 32 is configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information. The frequency domain feature extraction module 33 is configured to obtain a frequency domain signal corresponding to the audio signal of the current frame, and obtain the energy of each subband of the audio signal of the current frame according to the frequency domain signal. The frequency domain feature calculation module 34 is configured to obtain the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band. The gate threshold determination module 35 is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively. The time domain human voice detection module 36 is configured to obtain the first human voice probability value of the current frame audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value. The frequency domain human voice detection module 37 is configured to obtain the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold value. The human voice probability calculation module 38 is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
图4是本申请第四实施例的电子设备的结构示意图。如图4所示,该电子设备40包括处理器41及和处理器41耦接的存储器42。Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in FIG. 4, the electronic device 40 includes a processor 41 and a memory 42 coupled to the processor 41.
存储器42存储有用于实现上述任一实施例的人声检测方法的程序指令。The memory 42 stores program instructions for implementing the human voice detection method of any of the above embodiments.
处理器41用于执行存储器42存储的程序指令以进行人声检测。The processor 41 is configured to execute program instructions stored in the memory 42 to perform human voice detection.
其中,处理器41还可以称为CPU(Central Processing Unit,中央处理单元)。处理器41可能是一种集成电路芯片,具有信号的处理能力。处理器41还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 41 may also be referred to as a CPU (Central Processing Unit, central processing unit). The processor 41 may be an integrated circuit chip with signal processing capabilities. The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component . The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
参阅图5，图5为本申请第五实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有人声检测方法的程序指令51，所述存储介质可以是非易失性，也可以是易失性。其中，该程序指令51可以以软件产品的形式存储在上述存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）或处理器（processor）执行本申请各个实施方式所述方法的全部或部分步骤。而前述的存储装置包括：U盘、移动硬盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、磁碟或者光盘等各种可以存储程序代码的介质，或者是计算机、服务器、手机、平板等终端设备。Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a storage medium according to a fifth embodiment of the application. The storage medium of this embodiment stores program instructions 51 capable of implementing all of the human voice detection methods described above; the storage medium may be non-volatile or volatile. The program instructions 51 may be stored in the storage medium in the form of a software product and include a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage devices include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative, for example, the division of units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or integrated. To another system, or some features can be ignored, or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。以上仅为本申请的实施方式，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围。In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. The above are only implementations of the present application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.
以上所述的仅是本申请的实施方式，在此应当指出，对于本领域的普通技术人员来说，在不脱离本申请创造构思的前提下，还可以做出改进，但这些均属于本申请的保护范围。The above are only implementation manners of the present application. It should be noted that those of ordinary skill in the art can make improvements without departing from the creative concept of the present application, and all such improvements fall within the scope of protection of the present application.

Claims (20)

  1. 一种人声检测方法,其中,包括:A human voice detection method, which includes:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  2. 根据权利要求1所述的人声检测方法,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The human voice detection method according to claim 1, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  3. 根据权利要求2所述的人声检测方法,其中,所述根据所述时域包络信息获取当前帧音频信号的时域包络信息值,包括:The human voice detection method according to claim 2, wherein the obtaining the time domain envelope information value of the current frame of audio signal according to the time domain envelope information comprises:
    获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值;Obtaining the difference between the maximum value of each frame of the audio signal in the most recent multiple frames of audio signal and the average envelope value;
    将每帧音频信号最大值与所述平均包络值的差值进行对数运算,得到所述差值对应的对数值;Performing a logarithmic operation on the difference between the maximum value of the audio signal of each frame and the average envelope value to obtain the logarithmic value corresponding to the difference;
    将每帧音频信号的所述对数值进行累加,得到当前帧音频信号的时域包络信息值。The logarithmic value of each frame of audio signal is accumulated to obtain the time-domain envelope information value of the audio signal of the current frame.
  4. 根据权利要求1所述的人声检测方法,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The human voice detection method according to claim 1, wherein said obtaining the frequency domain signal corresponding to the audio signal of the current frame, and obtaining the energy of each subband of the audio signal of the current frame according to the frequency domain signal comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  5. 根据权利要求1所述的人声检测方法,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The human voice detection method according to claim 1, wherein the obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  6. 根据权利要求1所述的人声检测方法,其中,确定当前帧音频信号的时域包络信息门阀值,包括:The human voice detection method according to claim 1, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
  7. 根据权利要求1所述的人声检测方法,其中,所述根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值,包括:The human voice detection method according to claim 1, wherein the obtaining the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value comprises:
    获取所述时域包络信息值与所述时域包络信息门阀值的差值;Acquiring the difference between the time domain envelope information value and the time domain envelope information gate threshold;
    将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值;Normalizing the difference between the time domain envelope information value and the time domain envelope information gate threshold value to obtain the first vocal probability value;
    所述根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值,包括:The obtaining the second human voice probability value of the current frame audio signal according to the sub-band energy information value and the sub-band energy information gate threshold value includes:
    获取所述子带能量信息值与所述子带能量信息门阀值的差值;Acquiring the difference between the sub-band energy information value and the sub-band energy information gate threshold;
    将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值;Normalizing the difference between the sub-band energy information value and the sub-band energy information gate threshold value to obtain a second vocal probability value;
    所述根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值之前,还包括:Before obtaining the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value, the method further includes:
    将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。Upload the first vocal probability value and the second vocal probability value to the blockchain, so that the blockchain compares the first vocal probability value and the second vocal probability value Encrypted storage.
  8. 一种人声检测装置,其中,所述装置包括:A human voice detection device, wherein the device includes:
    时域特征提取模块,用于根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;The time domain feature extraction module is used to obtain the time domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    时域特征计算模块,用于根据所述时域包络信息获取当前帧音频信号的时域包络信息值;A time-domain feature calculation module, configured to obtain the time-domain envelope information value of the audio signal of the current frame according to the time-domain envelope information;
    频域特征提取模块,用于获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;The frequency domain feature extraction module is used to obtain the frequency domain signal corresponding to the current frame audio signal, and obtain the energy of each subband of the current frame audio signal according to the frequency domain signal;
    频域特征计算模块,用于根据所述各子带能量获取当前帧音频信号的子带能量信息值;A frequency domain feature calculation module, configured to obtain the subband energy information value of the audio signal of the current frame according to the energy of each subband;
    门阀值确定模块，用于分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值；The gate threshold determination module is used to determine the time domain envelope information gate threshold and the subband energy information gate threshold of the audio signal of the current frame respectively;
    时域人声检测模块,用于根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;A time-domain vocal detection module, configured to obtain the first vocal probability value of the current frame of audio signal according to the time-domain envelope information value and the time-domain envelope information gate threshold;
    频域人声检测模块,用于根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;A frequency domain vocal detection module, configured to obtain the second vocal probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    人声概率计算模块,用于根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。The human voice probability calculation module is configured to obtain the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value.
  9. 一种电子设备,其中,所述设备包括处理器、以及与所述处理器耦接的存储器,所述存储器存储有可被所述处理器执行的程序指令;所述处理器执行所述存储器存储的所述程序指令时实现以下步骤:An electronic device, wherein the device includes a processor and a memory coupled to the processor, and the memory stores program instructions executable by the processor; the processor executes the memory storage The following steps are implemented when the program instructions are:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  10. 根据权利要求9所述的电子设备,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The electronic device according to claim 9, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  11. 根据权利要求10所述的电子设备,其中,所述根据所述时域包络信息获取当前帧音频信号的时域包络信息值,包括:The electronic device according to claim 10, wherein said obtaining the time domain envelope information value of the current frame of audio signal according to the time domain envelope information comprises:
    获取最近多帧音频信号中每帧音频信号的最大值与所述平均包络值的差值;Obtaining the difference between the maximum value of each frame of the audio signal in the most recent multiple frames of audio signal and the average envelope value;
    将每帧音频信号最大值与所述平均包络值的差值进行对数运算,得到所述差值对应的对数值;Performing a logarithmic operation on the difference between the maximum value of the audio signal of each frame and the average envelope value to obtain the logarithmic value corresponding to the difference;
    将每帧音频信号的所述对数值进行累加,得到当前帧音频信号的时域包络信息值。The logarithmic value of each frame of audio signal is accumulated to obtain the time-domain envelope information value of the audio signal of the current frame.
  12. 根据权利要求9所述的电子设备,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The electronic device according to claim 9, wherein said obtaining the frequency domain signal corresponding to the current frame of audio signal, and obtaining each subband energy of the current frame of audio signal according to the frequency domain signal, comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  13. 根据权利要求9所述的电子设备,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The electronic device according to claim 9, wherein the obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  14. 根据权利要求9所述的电子设备,其中,确定当前帧音频信号的时域包络信息门阀值,包括:The electronic device according to claim 9, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
  15. 根据权利要求9所述的电子设备,其中,所述根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值,包括:9. The electronic device according to claim 9, wherein the obtaining the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold value comprises:
    获取所述时域包络信息值与所述时域包络信息门阀值的差值;Acquiring the difference between the time domain envelope information value and the time domain envelope information gate threshold;
    将所述时域包络信息值与所述时域包络信息门阀值的差值进行归一化处理得到第一人声概率值;Normalizing the difference between the time domain envelope information value and the time domain envelope information gate threshold value to obtain the first vocal probability value;
    所述根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值,包括:The obtaining the second human voice probability value of the current frame audio signal according to the sub-band energy information value and the sub-band energy information gate threshold value includes:
    获取所述子带能量信息值与所述子带能量信息门阀值的差值;Acquiring the difference between the sub-band energy information value and the sub-band energy information gate threshold;
    将所述子带能量信息值与所述子带能量信息门阀值的差值进行归一化处理得到第二人声概率值;Normalizing the difference between the sub-band energy information value and the sub-band energy information gate threshold value to obtain a second vocal probability value;
    所述根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值之前,还包括:Before obtaining the human voice probability value of the current frame of audio signal according to the first human voice probability value and the second human voice probability value, the method further includes:
    将所述第一人声概率值和所述第二人声概率值上传至区块链中,以使得所述区块链对所述第一人声概率值和所述第二人声概率值进行加密存储。Upload the first vocal probability value and the second vocal probability value to the blockchain, so that the blockchain compares the first vocal probability value and the second vocal probability value Encrypted storage.
  16. 一种存储介质，其中，所述存储介质内存储有程序指令，所述程序指令被处理器执行时实现以下步骤：A storage medium, wherein program instructions are stored in the storage medium, and the following steps are implemented when the program instructions are executed by a processor:
    根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息;Acquire time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample;
    根据所述时域包络信息获取当前帧音频信号的时域包络信息值;Acquiring the time domain envelope information value of the audio signal of the current frame according to the time domain envelope information;
    获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量;Acquiring a frequency domain signal corresponding to the audio signal of the current frame, and acquiring the energy of each subband of the audio signal of the current frame according to the frequency domain signal;
    根据所述各子带能量获取当前帧音频信号的子带能量信息值;Obtaining the sub-band energy information value of the audio signal of the current frame according to the energy of each sub-band;
    分别确定当前帧音频信号的时域包络信息门阀值和子带能量信息门阀值;Determine the time-domain envelope information gate threshold and the sub-band energy information gate threshold of the audio signal of the current frame respectively;
    根据所述时域包络信息值和所述时域包络信息门阀值获取当前帧音频信号的第一人声概率值;Acquiring the first human voice probability value of the current frame of audio signal according to the time domain envelope information value and the time domain envelope information gate threshold;
    根据所述子带能量信息值和所述子带能量信息门阀值获取当前帧音频信号的第二人声概率值;Acquiring the second human voice probability value of the audio signal of the current frame according to the sub-band energy information value and the sub-band energy information gate threshold;
    根据所述第一人声概率值和所述第二人声概率值获取当前帧音频信号的人声概率值。Acquire the human voice probability value of the audio signal of the current frame according to the first human voice probability value and the second human voice probability value.
  17. 根据权利要求16所述的存储介质,其中,所述根据音频样本中当前帧音频信号和前多帧音频信号获取时域包络信息,包括:The storage medium according to claim 16, wherein the acquiring time-domain envelope information according to the audio signal of the current frame and the audio signal of the previous multiple frames in the audio sample comprises:
    获取音频样本中各帧音频信号的最大值;Obtain the maximum value of each frame of audio signal in the audio sample;
    计算所述音频样本中最近多帧音频信号最大值的均值并将所述均值作为平均包络值,所述最近多帧音频信号包括当前帧音频信号和当前帧音频信号之前的多帧音频信号,将所述最近多帧音频信号的最大值以及所述平均包络值作为所述时域包络信息。Calculating the average value of the maximum value of the most recent multi-frame audio signal in the audio sample and using the average value as the average envelope value, the most recent multi-frame audio signal including the current frame audio signal and the multi-frame audio signal before the current frame audio signal, The maximum value of the most recent multiple frames of audio signals and the average envelope value are used as the time-domain envelope information.
  18. 根据权利要求16所述的存储介质,其中,所述获取当前帧音频信号对应的频域信号,根据所述频域信号获取当前帧音频信号的各子带能量,包括:The storage medium according to claim 16, wherein said obtaining the frequency domain signal corresponding to the current frame audio signal, and obtaining each subband energy of the current frame audio signal according to the frequency domain signal comprises:
    通过傅里叶变换将当前帧音频信号从时域变换到频域,生成当前帧音频信号对应的频域信号;Transform the audio signal of the current frame from the time domain to the frequency domain through Fourier transform to generate the frequency domain signal corresponding to the audio signal of the current frame;
    对所述频域信号进行子带划分处理,计算各个子带的子带能量。Perform subband division processing on the frequency domain signal, and calculate the subband energy of each subband.
  19. 根据权利要求16所述的存储介质,其中,所述根据所述各子带能量获取当前帧音频信号的子带能量信息值,包括:The storage medium according to claim 16, wherein the obtaining the subband energy information value of the audio signal of the current frame according to the energy of each subband comprises:
    根据所述各子带能量计算各子带能量的平均能量值;Calculating the average energy value of each sub-band energy according to the energy of each sub-band;
    获取每个子带的子带能量与平均能量值的差值;Obtain the difference between the sub-band energy of each sub-band and the average energy value;
    将每个子带的差值进行对数运算,得到所述差值对应的对数值;Perform a logarithmic operation on the difference of each subband to obtain the logarithmic value corresponding to the difference;
    将每个子带的对数值进行累加,得到当前帧音频信号的子带能量信息值。The logarithmic value of each subband is accumulated to obtain the subband energy information value of the audio signal of the current frame.
  20. 根据权利要求16所述的存储介质，其中，确定当前帧音频信号的时域包络信息门阀值，包括：The storage medium according to claim 16, wherein determining the time domain envelope information gate threshold of the audio signal of the current frame comprises:
    根据当前时间之前的第一预设时间范围内时域包络信息值的最小值对所述时域包络信息门阀值进行更新;Updating the threshold value of the time domain envelope information gate according to the minimum value of the time domain envelope information value within the first preset time range before the current time;
    确定当前帧音频信号的子带能量信息门阀值，包括：Determine the sub-band energy information gate threshold of the audio signal of the current frame, including:
    根据当前时间之前的第一预设时间范围内子带能量信息值的最小值对所述子带能量信息门阀值进行更新。The sub-band energy information gate threshold value is updated according to the minimum value of the sub-band energy information value in the first preset time range before the current time.
PCT/CN2020/123198 2020-07-24 2020-10-23 Human voice detection method, apparatus, device, and storage medium WO2021135547A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010723751.1A CN111883182B (en) 2020-07-24 2020-07-24 Human voice detection method, device, equipment and storage medium
CN202010723751.1 2020-07-24

Publications (1)

Publication Number Publication Date
WO2021135547A1 true WO2021135547A1 (en) 2021-07-08

Family

ID=73200498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123198 WO2021135547A1 (en) 2020-07-24 2020-10-23 Human voice detection method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111883182B (en)
WO (1) WO2021135547A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112669878B (en) * 2020-12-23 2024-04-19 北京声智科技有限公司 Sound gain value calculation method and device and electronic equipment
CN112967738A (en) * 2021-02-01 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 Human voice detection method and device, electronic equipment and computer readable storage medium
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN113936694B (en) * 2021-12-17 2022-03-18 珠海普林芯驰科技有限公司 Real-time human voice detection method, computer device and computer readable storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7567900B2 (en) * 2003-06-11 2009-07-28 Panasonic Corporation Harmonic structure based acoustic speech interval detection method and device
CN110111811B (en) * 2019-04-18 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110782907B (en) * 2019-11-06 2023-11-28 腾讯科技(深圳)有限公司 Voice signal transmitting method, device, equipment and readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103636A1 (en) * 2001-01-26 2002-08-01 Tucker Luke A. Frequency-domain post-filtering voice-activity detector
CN101763856A (en) * 2008-12-23 2010-06-30 华为技术有限公司 Signal classifying method, classifying device and coding system
CN102044242A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method, device and electronic equipment for voice activity detection
CN102324229A (en) * 2011-09-08 2012-01-18 中国科学院自动化研究所 Method and system for detecting abnormal use of voice input equipment
CN103646649A (en) * 2013-12-30 2014-03-19 中国科学院自动化研究所 High-efficiency voice detecting method
CN106098076A (en) * 2016-06-06 2016-11-09 成都启英泰伦科技有限公司 A kind of based on dynamic noise estimation time-frequency domain adaptive voice detection method

Also Published As

Publication number Publication date
CN111883182B (en) 2024-03-19
CN111883182A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2021135547A1 (en) Human voice detection method, apparatus, device, and storage medium
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
KR100636317B1 (en) Distributed Speech Recognition System and method
US20190172480A1 (en) Voice activity detection systems and methods
US10552114B2 (en) Auto-mute redundant devices in a conference room
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
CN113766073B (en) Howling detection in conference systems
WO2020037555A1 (en) Method, device, apparatus, and system for evaluating microphone array consistency
CN110177317B (en) Echo cancellation method, echo cancellation device, computer-readable storage medium and computer equipment
WO2014114049A1 (en) Voice recognition method and device
JP6058824B2 (en) Personalized bandwidth extension
CN112004177B (en) Howling detection method, microphone volume adjustment method and storage medium
EP3757993B1 (en) Pre-processing for automatic speech recognition
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
Zhang et al. An efficient perceptual hashing based on improved spectral entropy for speech authentication
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
WO2021143249A1 (en) Transient noise suppression-based audio processing method, apparatus, device, and medium
CN114627899A (en) Sound signal detection method and device, computer readable storage medium and terminal
WO2023124556A1 (en) Method and apparatus for recognizing mixed key sounds of multiple keyboards, device, and storage medium
CN112133324A (en) Call state detection method, device, computer system and medium
Nahma et al. An adaptive a priori SNR estimator for perceptual speech enhancement
CN114678038A (en) Audio noise detection method, computer device and computer program product
Jahanirad et al. Blind source computer device identification from recorded VoIP calls for forensic investigation
CN112382296A (en) Method and device for voiceprint remote control of wireless audio equipment
Nemade et al. Performance comparison of single channel Speech enhancement techniques for personal Communication

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20909036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20909036

Country of ref document: EP

Kind code of ref document: A1