WO2019183747A1

WO2019183747A1 - Voice detection method and apparatus

Info

Publication number: WO2019183747A1
Application number: PCT/CN2018/080447
Authority: WO
Inventors: 郭红敬; 李国梁; 王鑫山; 杨柯; 朱虎
Original assignee: 深圳市汇顶科技股份有限公司
Priority date: 2018-03-26
Filing date: 2018-03-26
Publication date: 2019-10-03
Also published as: CN110537223A; CN110537223B

Abstract

A voice detection method and apparatus. The method comprises: determining the energy of each group from among N groups of a first data block in data to be processed, wherein N is a positive integer (S110); according to the energy of the N groups, determining an initial candidate noise set and an initial candidate voice set, wherein the maximum energy of groups in the initial candidate noise set is less than the minimum energy of groups in the initial candidate voice set (S120); according to the energy of each group in the initial candidate noise set, determining an initial noise threshold (S130); and according to the initial candidate voice set and the initial noise threshold, determining a candidate noise set iteratively processed for the first time and a candidate voice set iteratively processed for the first time, wherein the energy of groups in the candidate noise set iteratively processed for the first time is less than or equal to the initial noise threshold; and the energy of groups in the candidate voice set iteratively processed for the first time is greater than the initial noise threshold (S140).

Description

Method and device for voice detection

Technical field

The present application relates to the field of speech detection and, more particularly, to a method and apparatus for speech detection.

Background art

With the rapid development of the mobile Internet of Things, human-computer interaction, artificial intelligence and related technologies, smart audio devices, smart wearable devices and voice assistant products keep emerging, and users' expectations of voice quality and product experience keep rising. This places high demands on speech recognition, speech enhancement and voice interaction.

Voice Activity Detection (VAD), also known as speech endpoint detection, generally detects the start point and the end point of an actual speech segment in a continuous audio signal against a complex noise background, based on the characteristics of speech and noise, so as to extract the valid speech segment and eliminate interference from non-speech signals such as noise.

Existing voice activity detection algorithms fall into three types. The first type is a decision method based on the statistical characteristics of speech and noise, mostly using the maximum likelihood criterion; its amount of computation is relatively small, but its detection performance is mediocre. The second type is based on statistical models and pattern classification; it has high computational complexity and good performance. The third type is based on neural networks and deep learning; it performs better, but its amount of computation is relatively large and it requires a large amount of training data.

Therefore, a speech detection algorithm is needed that ensures good detection performance with low complexity and a small amount of computation.

Summary of the invention

The embodiments of the present application provide a method and an apparatus for voice detection that can ensure good detection performance with low complexity and a small amount of computation.

In a first aspect, a method for voice detection is provided, comprising:

determining the energy of each of N packets of a first data block in data to be processed, where N is a positive integer;

determining an initial candidate noise set and an initial candidate speech set according to the energy of the N packets, where the maximum energy of the packets in the initial candidate noise set is less than the minimum energy of the packets in the initial candidate speech set;

determining an initial noise threshold according to the energy of each packet in the initial candidate noise set; and

determining, according to the initial candidate speech set and the initial noise threshold, a candidate noise set of the first iteration and a candidate speech set of the first iteration, where the energy of the packets in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of the packets in the candidate speech set of the first iteration is greater than the initial noise threshold.

Therefore, the method for voice detection in the embodiments of the present application uses the energy of a packet as the feature parameter, which smooths the noise and reduces the false alarm probability. Compared with existing voice detection based on single-frame energy, this helps improve detection accuracy; compared with voice detection based on other parameters, it helps reduce computational complexity.

In a possible implementation manner, the method further includes:

determining a noise threshold of the kth iteration according to the energy of each packet in the candidate noise set of the kth iteration, where k = 1, 2, ...; and

determining a candidate noise set of the (k+1)th iteration and a candidate speech set of the (k+1)th iteration according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration.

In a possible implementation manner, the method further includes:

when the number of iterations k reaches the upper limit on the number of iterations, determining that the candidate speech set of the kth iteration is the target speech set and that the candidate noise set of the kth iteration is the target noise set.

In a possible implementation manner, the method further includes:

if the energy of every packet in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, determining that the candidate speech set of the kth iteration is the target speech set and that the candidate noise set of the kth iteration is the target noise set.

In a possible implementation manner, the method further includes:

arranging the packets in the target speech set in chronological order; and

determining an updated target speech set according to the time interval between adjacent packets in the target speech set.

In a possible implementation manner, determining the updated target speech set according to the time interval between adjacent packets in the target speech set includes:

if the time interval between two adjacent packets in the target speech set is less than a preset threshold, determining that the other packets located between the two adjacent packets are also speech signals, and adding those packets to the target speech set to obtain the updated target speech set.

In a possible implementation manner, determining the initial noise threshold according to the energy of each packet in the initial candidate noise set includes:

determining an initial noise power according to the energy of each packet in the initial candidate noise set; and

determining the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.

In a possible implementation manner, the first data block is the first data block in time order in the data to be processed, and determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:

determining the average of the energy of the packets in the initial candidate noise set as the initial noise power.

In a possible implementation manner, the first data block is not the first data block in time order in the data to be processed, the data block preceding the first data block is a second data block, and determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:

determining the initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of the packets in the target noise set of the second data block.

Therefore, the noise threshold used for voice detection in the embodiments of the present application is determined from a threshold factor and a noise power. On the one hand, the noise power is updated iteratively within each data block, which makes voice detection within each data block robust. On the other hand, the noise power is smoothed between data blocks, so that changes in the ambient noise are tracked adaptively and the noise threshold adapts well from one data block to the next. As a result, detection is robust for every data block of the data to be processed.

In a possible implementation manner, determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block includes determining the initial noise power of the first data block by the following formula:

$P_1 = \alpha P_1' + (1-\alpha) P_2''$

where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and $0 < \alpha < 1$.

In a possible implementation manner, determining the initial candidate noise set and the initial candidate speech set according to the energy of the N packets includes:

determining a certain proportion of the N packets with the lowest energy as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set; or

determining a certain number of the N packets with the lowest energy as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set.

In a second aspect, an apparatus for voice detection is provided, comprising a determining module for performing the method of the first aspect or of any possible implementation of the first aspect.

In a third aspect, a computer readable medium is provided, storing program code for execution by an electronic device, the program code comprising instructions for performing the method of the first aspect.

In a fourth aspect, a computer program product is provided, the computer program product comprising computer program code which, when executed by a processor of an electronic device, causes the electronic device to perform the method of the first aspect or of any possible implementation of the first aspect.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application.

FIG. 2 is an overall flowchart of a method for voice detection according to an embodiment of the present application.

FIG. 3 is a schematic block diagram of an apparatus for voice detection according to an embodiment of the present application.

Detailed description

The technical solutions of the embodiments of the present application are described below with reference to the accompanying drawings of the embodiments of the present application.

FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application. In the following, the method is described with an apparatus for voice detection as the executing entity.

Optionally, an audio signal may be sampled at a certain sampling frequency (for example, 8 kHz, 16 kHz or 32 kHz) to obtain the data to be processed. The data to be processed may include a noise signal and/or a speech signal, and the apparatus for voice detection processes the sampled data to obtain the speech signal in it. In the embodiments of the present application, the apparatus may divide the data to be processed into a plurality of data blocks and determine the speech signal and the noise signal in each data block. The method is described below taking a first data block in the data to be processed as an example.

As shown in FIG. 1, the method 100 includes:

S110. Determine the energy of each of N packets of the first data block in the data to be processed, where N is a positive integer.

S120. Determine, according to the energy of the N packets, an initial candidate noise set and an initial candidate speech set, where the maximum energy of the packets in the initial candidate noise set is less than the minimum energy of the packets in the initial candidate speech set.

S130. Determine an initial noise threshold according to the energy of each packet in the initial candidate noise set.

S140. Determine, according to the initial candidate speech set and the initial noise threshold, a candidate noise set of the first iteration and a candidate speech set of the first iteration, where the energy of the packets in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of the packets in the candidate speech set of the first iteration is greater than the initial noise threshold.

Specifically, the apparatus for voice detection may divide the first data block into N packets and calculate energy packet by packet, which reduces the amount of computation for voice detection. Because the energy of a packet is determined from the multiple frames of sampled data in the packet, rather than detection being performed on the energy of each single frame of sampled data, the noise is smoothed and the accuracy of voice detection improves.

Optionally, in some embodiments, the average of the power of the frames of sampled data in a packet may be used as the energy of the packet, or the sum of the power of the frames of sampled data in the packet may be used as the energy of the packet, or the power of the sampled data in the packet may be smoothed to obtain the energy of the packet. The embodiments of the present application do not specifically limit how the energy of a packet is calculated.
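The packet-energy step can be illustrated with a minimal Python sketch. It uses the averaging option from the paragraph above and assumes, purely for illustration, that the power of a sample is its squared amplitude and that trailing samples that do not fill a whole packet are dropped; the function name `packet_energies` is hypothetical and does not appear in the application.

```python
from typing import List, Sequence


def packet_energies(block: Sequence[float], n_packets: int) -> List[float]:
    """Split one data block into n_packets packets and return each packet's energy.

    Assumptions (not from the application): sample power is the squared
    amplitude, packet energy is the mean power, leftover samples are dropped.
    """
    if n_packets <= 0:
        raise ValueError("n_packets must be a positive integer")
    m = len(block) // n_packets  # data length of each packet
    if m == 0:
        raise ValueError("block is too short for the requested number of packets")
    energies = []
    for j in range(n_packets):
        packet = block[j * m:(j + 1) * m]
        energies.append(sum(x * x for x in packet) / m)
    return energies


block = [0.0, 1.0, -1.0, 2.0, 3.0, -3.0, 0.5, 0.5]
print(packet_energies(block, 4))  # [0.5, 2.5, 9.0, 0.25]
```

Replacing the division by `m` with nothing would give the sum option instead of the average; either choice is consistent with the text.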

After the energy of each of the N packets is obtained, an initial candidate noise set and an initial candidate speech set may be determined according to the energy of the N packets. The packets in the initial candidate noise set may provisionally be regarded as noise signals, and the packets in the initial candidate speech set as speech signals.

For example, a certain proportion of the N packets with the lowest energy may be determined as the initial candidate noise set and the remaining packets as the initial candidate speech set; alternatively, a certain number of the N packets with the lowest energy may be determined as the initial candidate noise set and the remaining packets as the initial candidate speech set. The embodiments of the present application do not specifically limit how the initial candidate speech set and the initial candidate noise set are divided.

In a specific implementation, the N packets may be sorted in ascending order of energy, so that the packets with the lowest energy come first. A certain proportion (for example, 20%) or a certain number (for example, 20) of the packets at the front of the ordering then form the initial candidate noise set, and the remaining packets form the initial candidate speech set. Thus the energy of the packets in the initial candidate noise set is less than the energy of the packets in the initial candidate speech set.

By way of example and not limitation, if N = 100, the 20 packets with the lowest energy among the 100 packets may be selected to form the initial candidate noise set, that is, the initial candidate noise set includes packet 1 to packet 20 in the sorted order, and the 80 packets with higher energy form the initial candidate speech set, that is, the initial candidate speech set includes packet 21 to packet 100.
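The sort-and-split step described above can be sketched as follows. This is a minimal illustration rather than the application's implementation; the function name `initial_split` and the parameter `noise_fraction` are hypothetical, and the sets are represented by lists of packet indices.

```python
from typing import List, Sequence, Tuple


def initial_split(energies: Sequence[float],
                  noise_fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Return (noise_indices, speech_indices), both in ascending order of energy.

    The lowest-energy fraction of the packets forms the initial candidate
    noise set; at least one packet is always placed in it (an assumption).
    """
    order = sorted(range(len(energies)), key=lambda j: energies[j])
    n_noise = max(1, int(len(energies) * noise_fraction))
    return order[:n_noise], order[n_noise:]


energies = [5.0, 0.2, 7.0, 0.1, 6.0, 0.3, 9.0, 8.0, 0.4, 4.0]
noise, speech = initial_split(energies, noise_fraction=0.2)
print(noise)   # [3, 1]
print(speech)  # [5, 8, 9, 0, 4, 2, 7, 6]
```

If several packets have exactly the same energy at the split point, the maximum noise energy can equal the minimum speech energy; the sketch does not treat that case specially.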

Further, the apparatus for voice detection may determine an initial noise threshold according to the energy of each packet in the initial candidate noise set. The initial noise threshold is used to determine whether the initial candidate speech set still contains noise signals: a packet whose energy is less than the initial noise threshold may be regarded as a noise signal.

Optionally, in some embodiments, determining the initial noise threshold according to the energy of each packet in the initial candidate noise set includes:

determining an initial noise power according to the energy of each packet in the initial candidate noise set; and

determining the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.

By way of example and not limitation, the threshold factor $T$ may be determined according to the data length $M$ of a packet and the target false alarm probability $P_{fa}$, where the target false alarm probability is the maximum false alarm probability allowed by the system, that is, the maximum allowed probability that a noise signal is misjudged as a speech signal. For example, the threshold factor $T$ may be determined according to the following formula:

$T = F^{-1}(1 - P_{fa})$    Formula (1)

Optionally, in the embodiments of the present application, the initial noise power may be the average of the power of the packets in the initial candidate noise set, or the sum of the power of the packets in the initial candidate noise set, and so on; the embodiments of the present application do not limit this.
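The text above does not define the function F in Formula (1). The following Python sketch is therefore only an illustration under an assumption that is not taken from the application: the noise is white and Gaussian, the packet energy is the mean of M squared samples, and the packet energy normalized by the noise power is approximated by a normal distribution with mean 1 and variance 2/M. Under that assumption F is the corresponding normal CDF. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean
from typing import Sequence


def threshold_factor(m: int, p_fa: float) -> float:
    """T = F^{-1}(1 - P_fa), with F ASSUMED to be a normal CDF.

    ASSUMPTION (not from the application): normalized packet energy of pure
    noise is approximately N(1, 2/m), a central-limit approximation for the
    mean of m squared Gaussian samples.
    """
    return NormalDist(mu=1.0, sigma=sqrt(2.0 / m)).inv_cdf(1.0 - p_fa)


def initial_noise_threshold(noise_energies: Sequence[float], t: float) -> float:
    """Initial noise threshold = threshold factor * initial noise power (mean option)."""
    return t * fmean(noise_energies)


t = threshold_factor(m=160, p_fa=0.01)
print(t > 1.0)  # True: the threshold sits above the average noise energy
print(initial_noise_threshold([1.0, 2.0, 3.0], 2.0))  # 4.0
```

A smaller target false alarm probability gives a larger threshold factor, and a longer packet (larger M) pulls the factor towards 1 because the packet energy of noise fluctuates less.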

After the initial noise threshold of the first data block is determined, the energy of each packet in the initial candidate speech set may be compared in turn with the initial noise threshold. If the energy of a packet is less than the initial noise threshold, the packet may be regarded as a noise signal. After the comparison is completed, the packets of the initial candidate speech set whose energy is less than the initial noise threshold are added to the initial candidate noise set to obtain the candidate noise set of the first iteration, and the packets among the N packets that are not in the candidate noise set of the first iteration are determined as the candidate speech set of the first iteration. In other words, the candidate noise set of the first iteration is the candidate noise set after the first update, and the candidate speech set of the first iteration is the candidate speech set after the first update.

Continuing the above example, if the initial candidate noise set includes packet 1 to packet 20 and the initial candidate speech set includes packet 21 to packet 100, the initial noise threshold is compared with the energy of each of packet 21 to packet 100. If the energy of packet 21 to packet 40 is less than the initial noise threshold, packet 21 to packet 40 are added to the initial candidate noise set to obtain the candidate noise set of the first iteration, which includes packet 1 to packet 40; the candidate speech set of the first iteration then includes packet 41 to packet 100.
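The worked example with 100 packets can be reproduced with a short Python sketch. The energy values and the threshold are invented for illustration, the function name `first_iteration` is hypothetical, and packets whose energy equals the threshold are treated as noise, following the "less than or equal to" wording of S140.

```python
from typing import List, Sequence, Tuple


def first_iteration(energies: Sequence[float], noise: Sequence[int],
                    speech: Sequence[int], threshold: float) -> Tuple[List[int], List[int]]:
    """Move speech candidates at or below the threshold into the noise set."""
    moved = [j for j in speech if energies[j] <= threshold]
    new_noise = list(noise) + moved
    new_speech = [j for j in speech if energies[j] > threshold]
    return new_noise, new_speech


# Invented energies, already sorted: indices 0..19 play the role of packets 1..20.
energies = [1.0] * 20 + [1.5] * 20 + [10.0] * 60
noise, speech = first_iteration(energies, range(20), range(20, 100), threshold=2.0)
print(len(noise), len(speech))  # 40 60
print(speech[0])                # 40, i.e. packet 41 in the 1-based numbering of the text
```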

Optionally, in the embodiments of the present application, the method 100 may further include:

determining the noise threshold of the kth iteration according to the energy of each packet in the candidate noise set of the kth iteration, where k = 1, 2, ...; and

determining the candidate noise set of the (k+1)th iteration and the candidate speech set of the (k+1)th iteration according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration.

In the embodiments of the present application, after the candidate noise set of the first iteration is determined, the noise threshold of the first iteration may be determined according to the energy of each packet in that set, and this threshold is used to determine whether the candidate speech set of the first iteration still contains noise signals. If the energy of every packet in the candidate speech set of the first iteration is greater than the noise threshold of the first iteration, the candidate speech set of the first iteration contains no noise signal and may be taken as the target speech set, and the candidate noise set of the first iteration may be taken as the target noise set. Otherwise, the iteration continues until the energy of every packet in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, or until the number of iterations k reaches the upper limit. At that point, the packets in the candidate speech set of the kth iteration are all regarded as speech signals and the packets in the candidate noise set of the kth iteration are all regarded as noise signals; that is, the candidate speech set of the kth iteration is the target speech set and the candidate noise set of the kth iteration is the target noise set. This yields a decision for each of the N packets, that is, which of the N packets of the first data block are speech signals and which are noise signals.

Similar to the foregoing manner of determining the initial noise threshold, determining the noise threshold of the kth iteration according to the energy of each packet in the candidate noise set of the kth iteration may include:

determining the noise power of the kth iteration according to the energy of each packet in the candidate noise set of the kth iteration; and

determining the product of the noise power of the kth iteration and the threshold factor as the noise threshold of the kth iteration.

Therefore, in the embodiments of the present application, the noise threshold used for voice detection is determined from a threshold factor and a noise power. The iterative process above updates the noise power, and with it the noise threshold, within each data block, which improves the robustness of voice detection within each data block.
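The iteration within one data block can be sketched in Python as follows. This is a minimal illustration, assuming the noise power is the mean packet energy of the current candidate noise set and that the threshold factor `t` is already known; the function name `iterate_sets` and the default iteration limit are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def iterate_sets(energies: Sequence[float], noise: Sequence[int], speech: Sequence[int],
                 t: float, max_iter: int = 10) -> Tuple[List[int], List[int]]:
    """Refine the candidate sets until no speech candidate falls at or below
    the current noise threshold, or until max_iter iterations have run.

    The initial noise set must be non-empty.
    """
    noise, speech = list(noise), list(speech)
    for _ in range(max_iter):
        threshold = t * fmean([energies[j] for j in noise])  # noise threshold of this iteration
        moved = [j for j in speech if energies[j] <= threshold]
        if not moved:  # every remaining speech candidate exceeds the threshold
            break
        noise.extend(moved)
        speech = [j for j in speech if energies[j] > threshold]
    return noise, speech


energies = [1.0, 1.2, 2.0, 2.6, 9.0, 12.0]
target_noise, target_speech = iterate_sets(energies, [0, 1], [2, 3, 4, 5], t=2.0)
print(target_noise)   # [0, 1, 2, 3]
print(target_speech)  # [4, 5]
```

In this toy run the threshold rises from 2.2 to 2.8 to 3.4 as packets join the noise set, and the loop stops at the third pass because no speech candidate is left below the threshold.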

Optionally, in the embodiments of the present application, the apparatus for voice detection may further smooth the noise power of the current data block according to the noise power of the adjacent data block. Specifically, there are the following two cases.

Case 1: If the first data block is the first data block in time order in the data to be processed, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:

determining the average of the energy of the packets in the initial candidate noise set as the initial noise power.

That is, when the first data block is the first data block of the data to be processed, in other words when no other data block precedes it, the average of the power of the packets in the initial candidate noise set of the first data block may be determined directly as the initial noise power of the first data block; alternatively, the sum of the power of the packets in the initial candidate noise set of the first data block may be determined as the initial noise power of the first data block, and so on. For how the initial candidate noise set of the first data block is determined, refer to the related description of the foregoing embodiment; details are not repeated here.

Case 2: If the first data block is not the first data block in time order in the data to be processed, and the data block preceding the first data block is a second data block, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:

determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block.

That is, when the first data block is not the first data block of the data to be processed, in other words when another data block precedes it, the initial noise power of the first data block may be determined according to the estimated noise power of the first data block and the target noise power of the preceding data block, that is, the second data block.

Optionally, the estimated noise power of the first data block may be determined according to the energy of each packet in the initial candidate noise set of the first data block. For example, the estimated noise power of the first data block may be the average of the power of the packets in the initial candidate noise set of the first data block, or it may be the sum of the power of the packets in the initial candidate noise set of the first data block. For how the initial candidate noise set of the first data block is determined, refer to the related description in the foregoing embodiment; details are not repeated here.

Optionally, the target noise power of the second data block may be determined according to the energy of each packet in the target noise set of the second data block. For example, the target noise power of the second data block may be the average of the power of the packets in the target noise set of the second data block, or it may be the sum of the power of the packets in the target noise set of the second data block. The target noise set of the second data block may be determined in the same way as the target noise set of the first data block; details are not repeated here.

In a specific implementation, the initial noise power of the first data block may be determined according to the following formula (4):

$P_1 = \alpha P_1' + (1-\alpha) P_2''$    Formula (4)

where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, and $P_2''$ is the target noise power of the second data block.

That is, the estimate of the noise power of the first data block (the estimated noise power) and the stable noise power (the target noise power) of the preceding data block, that is, the second data block, are smoothed to obtain the initial noise power of the first data block. The initial noise threshold of the first data block may then be determined according to the initial noise power of the first data block and the threshold factor.
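Formula (4) is a first-order weighted average between the current estimate and the result carried over from the previous data block. A minimal Python sketch, with a hypothetical function name and invented numbers, and with the weight restricted to 0 < α < 1 as stated in the summary:

```python
def smoothed_initial_noise_power(estimated: float, prev_target: float, alpha: float) -> float:
    """P1 = alpha * P1' + (1 - alpha) * P2'' with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return alpha * estimated + (1.0 - alpha) * prev_target


# Invented values: the current block looks noisier than the previous one.
print(smoothed_initial_noise_power(estimated=4.0, prev_target=2.0, alpha=0.25))  # 2.5
```

A small α makes the initial noise power follow the previous block closely, so a sudden burst in the current block changes the threshold only gradually; a large α tracks the current block more quickly.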

In summary, the method for voice detection in the embodiments of the present application can adaptively adjust the noise threshold according to the ambient noise while keeping computational complexity low, and it remains robust while ensuring detection performance.

In the embodiments of the present application, after the decision result for the N packets in the first data block is determined, the method 100 may further include:

arranging the packets in the target speech set in chronological order; and

determining an updated target speech set according to the time interval between adjacent packets in the target speech set.

Generally, a speech signal is continuous over a short time, so the decision result should also be continuous over a short time. However, abrupt changes in the energy of the original speech or the influence of noise may cause the decision to switch between speech and noise relatively frequently. On this basis, the decision result can be corrected.

In the embodiments of the present application, after the target speech set and the target noise set in the first data block are determined, the signal type of each of the N packets is known, that is, whether it is a speech signal or a noise signal. The N packets can then be arranged in order of sampling time, that is, restored to their original order.

In this case, the decision result of the voice detection may be corrected according to the time interval between two adjacent packets that belong to the speech signal, so that an updated target speech set can be determined. For example, if the time interval between two adjacent packets belonging to the speech signal is less than a preset threshold, the other packets between the two packets can be determined to be speech signals as well, and those packets are added to the target speech set to obtain the updated (corrected) target speech set.

Optionally, in the embodiments of the present application, a corrected target noise set may also be determined in a similar manner; for brevity, details are not described here.

For example, suppose packet 21 and packet 30 are two adjacent packets belonging to the speech signal. If the time interval between packet 21 and packet 30 is 10 ms, which is less than the preset threshold, the other packets between packet 21 and packet 30 can be determined to be speech signals as well, that is, packet 22 to packet 29 are also determined as speech signals, and the updated target speech set is obtained.
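The correction step can be sketched in Python as follows. The application does not say how the time interval between two speech packets is measured; this sketch assumes, for illustration only, that it is the total duration of the packets lying between them, with all packets having the same duration. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def fill_short_gaps(is_speech: Sequence[bool], packet_duration_ms: float,
                    max_gap_ms: float) -> List[bool]:
    """Mark packets between two nearby speech packets as speech.

    is_speech holds the per-packet decisions in sampling-time order.
    ASSUMPTION: the interval between two speech packets is the duration of
    the packets that lie between them.
    """
    result = list(is_speech)
    speech_idx = [j for j, flag in enumerate(is_speech) if flag]
    for a, b in zip(speech_idx, speech_idx[1:]):
        gap_ms = (b - a - 1) * packet_duration_ms
        if b - a > 1 and gap_ms < max_gap_ms:
            for j in range(a + 1, b):
                result[j] = True
    return result


flags = [False, True, False, False, True, False, False, False, False, False, True]
print(fill_short_gaps(flags, packet_duration_ms=10.0, max_gap_ms=30.0))
# [False, True, True, True, True, False, False, False, False, False, True]
```

Here the 20 ms gap between the first two speech packets is filled, while the 50 ms gap before the last one is left as noise.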

Therefore, the method for voice detection in the embodiments of the present application can also correct the decision result of the voice detection according to the fact that a speech signal does not change abruptly, thereby improving the accuracy of the voice detection.

Hereinafter, the method for voice detection according to an embodiment of the present application is described in detail in conjunction with the overall flowchart shown in FIG. 2. As shown in FIG. 2, the method may include the following content:

In the embodiments of the present application, the data to be processed may be divided into multiple data blocks for processing. Optionally, the length of a data block may be determined according to the application scenario or the processing capability, and each data block includes L sampling points. The data length of a packet may be determined according to the processing capability and the detection precision: the L sampling points are divided into N packets, and the data length of each packet is M=[L/N].

S201. Determine a threshold factor according to a preset false alarm probability.

For a specific implementation process of the S201, refer to the related description of the foregoing embodiment, and details are not described herein again.

In S202, the energy of each packet in the data block is determined.

For example, the energy of the packets of the i-th data block of the data to be processed may be written as $P_i = [p_{i1}, p_{i2}, \ldots, p_{iN}]$, where $p_{ij}$ is the energy of the j-th packet of the i-th data block. The energy of a packet may be the sum of the power of the sampling points in the packet, or the average of the power of the sampling points in the packet; the embodiments of the present application do not limit this.

Further, the N packets may be sorted by the size of the energy, for example, in ascending order according to the size of the energy.

Thereafter, S203 is executed to determine whether the data block is the first data block in the to-be-processed data, and if yes, execute S204; otherwise, execute S205.

In S204, an initial noise power of the data block is determined.

The implementation process of the S204 may correspond to the implementation process of the case 1 in the foregoing embodiment. For brevity, details are not described herein again.

In S205, an initial noise power of the data block is determined according to the estimated noise power of the data block and the target noise power of the previous data block of the data block.

The implementation process of the S205 may correspond to the implementation process of the case 2 in the foregoing embodiment. For brevity, details are not described herein again.

Further, S206 may be performed to determine a noise threshold according to the noise power determined in S204 or S205 in combination with the threshold factor determined in S201.

For example, the product of the noise power and the threshold factor can be determined as the noise threshold.

Then, S207 is executed to re-determine the noise set and the voice set in the data block according to the noise threshold.

For example, the packets in the data block whose energy is greater than the noise threshold may be determined as the speech set, and the packets in the data block whose energy is less than or equal to the noise threshold may be determined as the noise set.

For example, if the initial noise power of the data block is determined according to the energy of packet 1 to packet k₁ in the data block, it can be considered that packet 1 to packet k₁ constitute the initial candidate noise set, and packet k₁+1 to packet N constitute the initial candidate speech set. In S207, it may be determined according to the noise threshold whether any of the packets k₁+1 to N belongs to the noise signal, and a packet whose energy is less than or equal to the noise threshold may be determined to belong to the noise signal.
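As a non-limiting illustration, the thresholding of S206 and S207 may be sketched as follows. The function name `classify_by_threshold` is hypothetical, and the sketch identifies packets by their index in the energy list.

```python
def classify_by_threshold(energies, noise_power, threshold_factor):
    """Split packet indices into a noise set and a speech set.

    The noise threshold is the product of the noise power and the
    threshold factor. Packets at or below it are noise; packets
    above it are speech.
    """
    threshold = noise_power * threshold_factor
    noise = [j for j, e in enumerate(energies) if e <= threshold]
    speech = [j for j, e in enumerate(energies) if e > threshold]
    return threshold, noise, speech
```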

In S208, it is determined whether any new packet has joined the noise set of the data block; if so, S209 is performed; otherwise, S210 is performed.

In S209, the updated noise power is determined according to the re-determined noise set, and the process then jumps back to S206, where the updated noise threshold is determined according to the updated noise power. Further, S207 may be performed again to re-determine the noise set and the voice set in the data block according to the updated noise threshold, until the preset number of iterations is reached or the noise power, and therefore the noise threshold, stabilizes through iteration. At that point, a packet whose energy is greater than the noise threshold can be determined as a speech signal, and a packet whose energy is less than or equal to the noise threshold can be determined as a noise signal.

In S210, the decision result of each packet of the data block is output.
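As a non-limiting illustration, the loop formed by S206 to S210 may be sketched as follows for the first data block. The function name `detect_speech` and its parameters are hypothetical. The sketch assumes that the initial candidate noise set is a fixed number of lowest-energy packets and that the noise power is the average energy of the current noise set; the smoothing used for non-first data blocks (S205) is not covered.

```python
def detect_speech(energies, initial_noise_count, threshold_factor,
                  max_iterations=10):
    """Return one 0/1 flag per packet (1 = speech) in original order.

    Assumptions: the initial_noise_count lowest-energy packets form the
    initial candidate noise set, and the noise power is the mean energy
    of the current noise set.
    """
    if not 1 <= initial_noise_count <= len(energies):
        raise ValueError("initial_noise_count must be in 1..len(energies)")
    # Sort packet indices by ascending energy.
    order = sorted(range(len(energies)), key=lambda j: energies[j])
    noise = order[:initial_noise_count]
    speech = order[initial_noise_count:]
    for _ in range(max_iterations):
        noise_power = sum(energies[j] for j in noise) / len(noise)
        threshold = noise_power * threshold_factor          # S206
        moved = [j for j in speech if energies[j] <= threshold]  # S207
        if not moved:                                       # S208: no new noise
            break
        noise = noise + moved                               # S209
        speech = [j for j in speech if energies[j] > threshold]
    flags = [0] * len(energies)                             # S210
    for j in speech:
        flags[j] = 1
    return flags
```

The loop stops either when no packet moves from the speech set to the noise set, which means the noise power and threshold have stabilized, or when the preset number of iterations is reached.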

As described above, in the embodiment of the present application, the decision result of the voice detection may also be corrected. In an implementation manner, an identifier may be set for each packet; for example, the identifier of a packet belonging to the voice signal may be set to 1, and the identifier of a packet belonging to the noise signal may be set to 0. The packets are then sorted in order of sampling time, that is, restored to their original order, and the decision result is corrected according to the time interval between adjacent packets belonging to the voice signal.

For example, let the identifier vector of the decision result be V = [v_1, v_2, ..., v_N], v_i ∈ {0, 1}, where the identifier 1 indicates that the packet at the corresponding position is a voice signal, and the identifier 0 indicates that the packet at the corresponding position is a noise signal. According to the identifier vector, the position vector of the voice signal in the data block can be determined as W = (w_1, w_2, ..., w_k), k < L, 1 ≤ w_i ≤ N, where w_i can be used to identify the time information of the i-th packet belonging to the voice signal. Taking the difference between adjacent elements of the position vector gives Δ = (Δ_1, Δ_2, ..., Δ_{k-1}), where Δ_{k-1} represents the time difference between w_{k-1} and w_k. Since the interval between adjacent speech signals is not too large, if Δ_1 is less than the preset threshold, the packets between w_1 and w_2 are also regarded as a speech signal. In this way, the identifier vector V' of the updated decision result can be obtained, and the final speech detection result of the data block is V'.
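As a non-limiting illustration, the correction of the decision result may be sketched as follows. The function name `fill_short_gaps` and the parameter `max_gap` are hypothetical. The sketch uses 0-based packet positions instead of the 1-based positions used above, and applies the comparison to every pair of adjacent speech positions.

```python
def fill_short_gaps(flags, max_gap):
    """Return a corrected copy of the 0/1 identifier vector.

    For each pair of adjacent speech positions whose distance is less
    than max_gap, the packets lying between them are marked as speech.
    """
    positions = [j for j, v in enumerate(flags) if v == 1]  # vector W
    corrected = list(flags)
    for a, b in zip(positions, positions[1:]):
        if b - a < max_gap:                                 # element of Δ
            for j in range(a + 1, b):
                corrected[j] = 1
    return corrected
```

For example, with `flags = [0, 1, 0, 1, 0, 0, 0, 1]` and `max_gap = 3`, the single noise packet between the first two speech packets is relabeled as speech, while the longer gap before the last speech packet is left unchanged.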

The method embodiments of the present application are described in detail above with reference to FIG. 1 to FIG. 2. Hereinafter, the device embodiment of the present application is described in detail with reference to FIG. 3. It should be understood that the device embodiment and the method embodiments correspond to each other, and for similar descriptions, reference may be made to the method embodiments.

FIG. 3 is a schematic structural diagram of an apparatus for voice detection according to an embodiment of the present application. As shown in FIG. 3, the apparatus 300 includes a determination module 310. The determining module 310 is configured to:

Determining an initial candidate noise set and an initial candidate speech set according to the energy of the N packets, wherein the maximum energy of the packets in the initial candidate noise set is smaller than the minimum energy of the packets in the initial candidate speech set;

Optionally, in some embodiments, the determining module 310 is further configured to:

Arranging the packets in the target speech set in chronological order;

Optionally, in some embodiments, the determining module 310 is specifically configured to:

Optionally, in some embodiments, the first data block is the first data block in the to-be-processed data, and the determining module 310 is specifically configured to:

Optionally, in some embodiments, the first data block is a non-first data block in the to-be-processed data, and a previous data block of the first data block is a second data block, where the determining module 310 is specifically configured to:

Determining an initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of the packets in the target noise set of the second data block.

The initial noise power of the first data block is determined according to the following formula:

P₁ = αP₁′ + (1−α)P₂″
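As a non-limiting illustration, the above formula may be evaluated as in the following sketch. The function and parameter names are hypothetical. The sketch assumes that P₁′ is the estimated noise power of the first data block, that P₂″ is the target noise power of the second data block, and that α is a weighting factor between 0 and 1; these symbol meanings are inferred from the preceding paragraph rather than stated alongside the formula.

```python
def smoothed_initial_noise_power(estimated_power, previous_target_power, alpha):
    """Blend the current block's estimate with the previous block's result.

    Assumptions: estimated_power plays the role of P1', previous_target_power
    the role of P2'', and alpha lies in the range [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * estimated_power + (1.0 - alpha) * previous_target_power
```

Under these assumptions, a smaller α gives more weight to the noise power carried over from the previous data block, which makes the noise estimate change more slowly from block to block.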

Optionally, the determining module 310 may be a processor with specific processing capability. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like, which is not limited in this embodiment. The general-purpose processor may be a microprocessor or any conventional processor.

Optionally, the apparatus 300 for voice detection may further include a memory, which may include a read only memory and a random access memory, and provides instructions and data to the processor. A portion of the memory may also include a non-volatile random access memory. For example, the memory can also store information of the device type.

Optionally, in the embodiment of the present application, the memory may also be used to store the collected audio data.

The embodiment of the present application further provides a computer readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a portable electronic device that includes a plurality of applications, cause the portable electronic device to perform the method of the embodiments shown in Figures 1 to 2.

The embodiment of the present application also proposes a computer program comprising instructions which, when executed by a computer, cause the computer to perform the corresponding flow of the method of the embodiment shown in Figures 1 to 2.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods to implement the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present application.

A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be in an electrical, mechanical, or other form.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

If the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The foregoing is only a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of the claims.

Claims

A method for voice detection, comprising:

Determining an energy of each of the N packets of the first data block in the data to be processed, wherein the N is a positive integer;

Determining an initial candidate noise set and an initial candidate speech set according to the energy of the N packets, wherein the maximum energy of the packets in the initial candidate noise set is smaller than the minimum energy of the packets in the initial candidate speech set;

Determining an initial noise threshold according to the energy of each packet in the initial candidate noise set;

Determining, according to the initial candidate speech set and the initial noise threshold, a candidate noise set processed by the first iteration and a candidate speech set processed by the first iteration, wherein the energy of the packets in the candidate noise set processed by the first iteration is less than or equal to the initial noise threshold, and the energy of the packets in the candidate speech set processed by the first iteration is greater than the initial noise threshold.
The method of claim 1 further comprising:

Determining a noise threshold of the kth iteration process according to the energy of each packet in the candidate noise set processed by the kth iteration, wherein k is 1, 2, ...;

Determining the candidate noise set of the k+1th iteration process and the candidate speech set of the k+1th iteration process according to the candidate speech set processed by the kth iteration and the noise threshold of the kth iteration process.
The method of claim 2, wherein the method further comprises:

If the energy of each packet in the candidate speech set processed by the kth iteration is greater than the noise threshold of the kth iteration process, determining that the candidate speech set processed by the kth iteration is the target speech set and that the candidate noise set processed by the kth iteration is the target noise set.
The method of claim 2, wherein the method further comprises:

When the number of iterations k reaches the upper limit of the iteration, it is determined that the candidate speech set processed by the kth iteration is the target speech set, and the candidate noise set processed by the kth iteration is the target noise set.
The method according to claim 3 or 4, wherein the method further comprises:

Arranging the packets in the target speech set in chronological order;

The updated target speech set is determined based on a time interval between adjacent packets in the target speech set.
The method according to claim 5, wherein the determining the updated target voice set according to a time interval between adjacent packets in the target voice set comprises:

If the time interval between two adjacent packets in the target voice set is less than a preset threshold, determining that the other packets between the two adjacent packets are also voice signals, and adding the other packets between the two adjacent packets to the target voice set to obtain the updated target voice set.
The method according to any one of claims 1 to 6, wherein the determining an initial noise threshold according to the energy of each packet in the initial candidate noise set comprises:

Determining an initial noise power according to the energy of each packet in the initial candidate noise set;

A result obtained by multiplying the initial noise power by a threshold factor is determined as the initial noise threshold, wherein the threshold factor is determined according to a target false alarm probability.
The method according to claim 7, wherein the first data block is the first data block in the data to be processed, and the determining the initial noise power according to the energy of each packet in the initial candidate noise set comprises:

Determining the average of the energy of the packets in the initial candidate noise set as the initial noise power.
The method according to claim 7, wherein the first data block is a non-first data block in the to-be-processed data, a previous data block of the first data block is a second data block, and the determining the initial noise power according to the energy of each packet in the initial candidate noise set comprises:

Determining an initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, wherein the estimated noise power of the first data block is the average of the energy of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of the packets in the target noise set of the second data block.
The method according to any one of claims 1 to 9, wherein the determining the initial candidate noise set and the initial candidate speech set according to the energy of the N packets comprises:

Determining a certain proportion of the packets with lower energy among the N packets as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set; or

A certain number of packets of lesser energy among the N packets are determined as the initial candidate noise set, and other ones of the N packets are determined as the initial candidate speech set.
A device for voice detection, comprising: a determining module, configured to:

Determining an energy of each of the N packets of the first data block in the data to be processed, wherein the N is a positive integer;

Determining an initial candidate noise set and an initial candidate speech set according to the energy of the N packets, wherein the maximum energy of the packets in the initial candidate noise set is smaller than the minimum energy of the packets in the initial candidate speech set;

Determining an initial noise threshold according to the energy of each packet in the initial candidate noise set;

Determining, according to the initial candidate speech set and the initial noise threshold, a candidate noise set processed by the first iteration and a candidate speech set processed by the first iteration, wherein the energy of the packets in the candidate noise set processed by the first iteration is less than or equal to the initial noise threshold, and the energy of the packets in the candidate speech set processed by the first iteration is greater than the initial noise threshold.
The apparatus according to claim 11, wherein the determining module is further configured to:

Determining a noise threshold of the kth iteration process according to the energy of each packet in the candidate noise set processed by the kth iteration, wherein k is 1, 2, ...;

Determining the candidate noise set of the k+1th iteration process and the candidate speech set of the k+1th iteration process according to the candidate speech set processed by the kth iteration and the noise threshold of the kth iteration process.
The device according to claim 12, wherein the determining module is further configured to:

When the number of iterations k reaches the upper limit of the iteration, it is determined that the candidate speech set processed by the kth iteration is the target speech set, and the candidate noise set processed by the kth iteration is the target noise set.
The device according to claim 12, wherein the determining module is further configured to:

If the energy of each packet in the candidate speech set processed by the kth iteration is greater than the noise threshold of the kth iteration process, determining that the candidate speech set processed by the kth iteration is the target speech set and that the candidate noise set processed by the kth iteration is the target noise set.
The apparatus according to claim 13 or 14, wherein the determining module is further configured to:

Arranging the packets in the target speech set in chronological order;

And determining the updated target voice set according to a time interval between adjacent packets in the target voice set.
The device according to claim 15, wherein the determining module is specifically configured to:

If the time interval between two adjacent packets in the target voice set is less than a preset threshold, determining that the other packets between the two adjacent packets are also voice signals, and adding the other packets between the two adjacent packets to the target voice set to obtain the updated target voice set.
The device according to any one of claims 11 to 16, wherein the determining module is specifically configured to:

Determining an initial noise power according to the energy of each packet in the initial candidate noise set;

A result obtained by multiplying the initial noise power by a threshold factor is determined as the initial noise threshold, wherein the threshold factor is determined according to a target false alarm probability.
The device according to claim 17, wherein the first data block is the first data block in the to-be-processed data, and the determining module is specifically configured to:

Determining the average of the energy of the packets in the initial candidate noise set as the initial noise power.
The apparatus according to claim 17, wherein the first data block is a non-first data block in the to-be-processed data, a previous data block of the first data block is a second data block, and the determining module is specifically configured to:

Determining an initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, wherein the estimated noise power of the first data block is the average of the energy of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of the packets in the target noise set of the second data block.
The device according to any one of claims 11 to 19, wherein the determining module is further configured to:

Determining a certain proportion of the packets with lower energy among the N packets as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set; or

A certain number of packets of lesser energy among the N packets are determined as the initial candidate noise set, and other ones of the N packets are determined as the initial candidate speech set.